Commit Graph

11 Commits

Author SHA1 Message Date
111f40ddf0 p1-6: kb-store-sqlite test suite (8 categories)
All 8 test categories from the task plan, plus a JobRepo subset:

  migration   — tests/migration.rs: fresh DB after run_migrations
                exposes every required §5 table + index.
  unit (copy) — tests/asset_writer.rs: copy mode writes file with
                mode 0o644 + correct bytes.
  unit (ref)  — tests/asset_writer.rs: reference mode does not write
                file; row records source path.
  unit (cs)   — tests/asset_writer.rs: tampered checksum returns a
                Conflict-flavoured anyhow error.
  unit (idem) — tests/idempotency.rs: same put_document twice → 1 row,
                doc_version 1→2; tags re-derived.
  unit (rb)   — tests/idempotency.rs: put_blocks with FK violation
                rolls back; pre-existing rows unchanged.
  contract    — tests/contract_roundtrip.rs: drives kb-parse-md +
                kb-normalize + kb-chunk on
                fixtures/markdown/code-and-table.md, persists, then
                reloads via DocumentStore::get_document /
                get_chunk and asserts byte-equal round-trip.
  snapshot    — tests/ingest_report_snapshot.rs +
                snapshots/ingest_report.snapshot.json: pin the wire
                JSON form of kb_core::IngestReport for an inline
                fixture run.
  jobs        — tests/jobs.rs: create → progress → finish flow;
                error message round-trip; list filters on status/kind.

Drops the unused `serde` direct dep from Cargo.toml; serde_json brings
its own. Dev-deps confirmed via `cargo tree -p kb-store-sqlite --depth 1`
to live only in the dev tree.
2026-04-30 17:13:03 +00:00
a3390d5171 p1-6: scaffold kb-store-sqlite crate + V001 full §5 DDL
New workspace member crate `kb-store-sqlite` (allowed deps only:
kb-core, kb-config, rusqlite[bundled], refinery, serde, serde_json,
time, blake3, tracing, anyhow, thiserror; dev-deps add kb-parse-md /
kb-normalize / kb-chunk for the contract round-trip test).

Migration V001 replaces the P0-1 stub with the full §5 DDL (assets,
documents, document_tags, blocks, chunks with policy_hash,
embedding_records, jobs, ingest_runs, answers, eval_runs,
eval_query_results) plus the §5 indexes. FTS5 virtual table + triggers
remain deferred to V002 (P2-1).

Public surface per task spec:
  SqliteStore::open / run_migrations / put_asset_with_bytes
  impl DocumentStore for SqliteStore (7 trait methods)
  impl JobRepo for SqliteStore (4 trait methods)
  StoreError { Sqlx, Migration, Conflict }

Behavior:
- Pragmas at open: foreign_keys=ON, journal_mode=WAL,
  synchronous=NORMAL, temp_store=MEMORY.
- Asset writer: byte_len ≤ copy_threshold_mb * 1MiB → copy to
  data_dir/assets/<aa>/<asset_id> (mode 0o644 on Unix), else
  reference. blake3(bytes) verified against asset.checksum; mismatch →
  Conflict.
- Idempotency: put_document UPSERTs and bumps doc_version + 1 on
  conflict; put_blocks / put_chunks DELETE-then-INSERT; document_tags
  re-derived per put_document.
- get_document rehydrates blocks via payload_json ordered by stream
  ordinal.
- list_documents builds dynamic WHERE from DocFilter (lang / trust_min
  / path_glob via GLOB / tags_any via document_tags subquery).
- JobRepo: jobs.kind/status are stored as lowercase enum tags; create
  mints a 32-hex JobId via blake3(kind || payload || nanos).

Tests follow in subsequent commits.
2026-04-30 17:08:36 +00:00
58f7b8573d p1-5: add long-section fixture + Vec<Chunk> snapshot test
Bakes the chunker output for fixtures/markdown/long-section.md (3 H1s,
nested H2 under Alpha, a 50-line code block, a 3-col x 4-row table,
and a multi-paragraph Gamma section) into the JSON snapshot baseline.
Confirms the priority rules end-to-end:

  - Heading boundaries hold across H1 → H2 → H1 transitions
  - The code block emits one chunk at 427 tokens > target=200
  - The table stays single-chunk
  - Gamma's paragraph stream splits with one block of overlap seed

A second test runs the full parse → normalize → chunk pipeline 5
times and asserts identical chunk_ids each pass.

Drops the unused `kb-config` and `serde` from regular dependencies —
they were declared but no source path imports them; `serde` flows in
transitively via `kb-core` as a public API requirement, and
`ChunkingCfg` lives in `kb-config` but the chunker takes
`ChunkPolicy` directly. Production deps are now exactly the allowed
set actually used: anyhow, blake3, kb-core, serde_json_canonicalizer,
tracing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:33:29 +00:00
8142449eb7 p1-5: scaffold kb-chunk crate with MdHeadingV1Chunker skeleton
Adds the new workspace member with the bare Chunker impl shape:
chunker_version() returns "md-heading-v1"; policy_hash() blake3-hashes
canonical JSON of ChunkPolicy and truncates to 16 hex chars; chunk()
is an empty stub the next commits fill in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:27:42 +00:00
e0df42984e p1-4: address review I1-I3 + minors (extract attribution, audio-ref skip, NFC heading_path)
I1: warning_agent maps ExtractFailed → "kb-parse-md" (the panic-recovery
emitter in kb-parse-md/src/blocks.rs). Lift-stage warnings from
build_canonical_document are tracked separately and attributed to
"kb-normalize", so the I1 mapping change does not lie about
kb-normalize-originated drops.

I2: ParsedPayload::AudioRef no longer synthesizes Block::AudioRef with
an invalid empty AssetId (would violate AssetId::from_str's 32-hex
invariant). Block is dropped, Warning surfaces in Provenance with src
mention, attributed to kb-normalize (lift-stage decision). TODO(P8)
comment marks this as a placeholder until the audio extractor lands.

I3: NFC-normalize each heading_path string in lift_block before feeding
into id_for_block AND into CommonBlock.heading_path. pulldown-cmark does
not NFC heading text and serde_json_canonicalizer v0.3 does not either,
so canonically-equivalent NFD/NFC inputs would produce different
block_ids without this normalization. Mirrors the existing doc_id NFC
handling via to_posix.

Minors:
- M4: trim Cargo.toml — drop kb-config, serde_json_canonicalizer,
  blake3 (unused); keep tracing (now wired) + unicode-normalization
  (now used by I3).
- M5: determinism_1000_iterations_under_1s now uses the same 5-block
  fixture as block_ordinals_scoped_per_heading_and_kind (extracted into
  fixture_blocks_five helper) so the determinism property is exercised
  on a real lift_block path, not just an empty Vec. Still < 1s.
- M6: snapshot integration test now passes BodyHints { first_h1:
  Some("Code And Table"), .. } and asserts doc.title == "Code And Table"
  end-to-end. Baseline JSON updated.
- M7: title/lang edge-case unit tests pin policy: empty string lifts to
  empty string; non-stringy values silently drop. Rustdoc updated.
- M10: provenance_contains_stage_events_in_order asserts events[1].at
  == events[2].at to pin the shared-now_utc invariant.

New tests (unit, kb-normalize):
- provenance_with_extract_failed_warning_attributes_to_kb_parse_md (I1)
- audio_ref_block_skipped_with_warning (I2)
- nfc_nfd_korean_heading_path_same_block_id (I3)
- title_empty_string_in_user_map_falls_back_to_default (M7)
- title_non_string_in_user_map_silently_drops (M7)
- lang_invalid_shape_silently_drops (M7)

kb-normalize unit tests: 9 → 14. Integration snapshot: 1 (unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:41:50 +00:00
c0096ce44b p1-4: scaffold kb-normalize crate
Add the workspace member, `Cargo.toml` with the §8-allowed dep set
(kb-core, kb-parse-types, kb-config, serde, serde_json_canonicalizer,
blake3, unicode-normalization, time, anyhow, tracing) and a stubbed
`build_canonical_document` that pins the public signature plus
`doc_id` derivation. `kb-parse-md` is permitted only as a *dev*-dep so
the integration snapshot test (added later in this series) can drive
a fixture through the real parser without violating the production
boundary — `cargo tree -p kb-normalize --depth 1 --edges normal`
confirms no parser implementation appears in the regular dep tree.

`id_for_doc` and `id_for_block` are re-exported from kb-core (which
holds the canonical recipe per §4.2); kb-normalize is the canonical
*entry point* per design §8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:16:53 +00:00
4e7e9cad87 p1-3: add parse_blocks (pulldown-cmark walker) submodule
Implements `kb_parse_md::parse_blocks(body, body_offset_lines)` returning a
flat `Vec<ParsedBlock>` plus warnings. Walks pulldown-cmark events through a
small frame-based state machine that tracks heading paths, accumulates
inline buffers (Text/Code/Link/Strong/Emph only — design §3.4), and
reports SourceSpan::Line spans in 1-indexed file-line coordinates.

Covers headings, paragraphs, code blocks (lang from info string), GFM
tables (with malformed fallback to paragraph + MalformedTable warning),
lists (nested sub-lists flattened into parent item), and block-level image
references. Inline images are dropped silently per the inline filter.
Adversarial inputs are caught with `catch_unwind` and degrade to an empty
output + ExtractFailed warning.

15 unit tests cover heading-path correctness, code lang, table parsing,
malformed-table fallback (driven via synthetic events since pulldown-cmark
auto-normalizes table widths), LF/CRLF line-range parity, image refs,
nested-list flattening, inline filter, and 100-iteration random-bytes plus
hand-crafted adversarial-input no-panic guards.
2026-04-30 14:14:34 +00:00
a86b463fc4 p1-2: scaffold kb-parse-md crate
Add the workspace member with the dep allow-list pinned by design §0 Q9
and the task spec. P1-2 will land the frontmatter submodule in the next
commit; P1-3 will add the block parser as a sibling.

Notable choice: serde_yaml (dtolnay) was archived as unmaintained in 2024
so we use serde_yaml_ng, the maintained fork. lingua's per-language
features are explicitly enabled (default-features=false) to keep build
time + binary size sane — only the languages we need at parse time.
2026-04-30 12:55:20 +00:00
7c75e10b2c p1-1: scaffold kb-source-fs crate (FsSourceConnector)
Walk config.workspace.root, apply gitignore-style filters
(config.workspace.exclude ∪ .kbignore ∪ baked-in defaults for
.DS_Store / ._*), stream BLAKE3 over each file, and emit a
deterministic Vec<RawAsset> sorted by workspace_path.

Modules:
  - hash:      streaming blake3::Hasher + 64 KiB read buffer (no whole-file
               loads); pinned digests for empty input and "hello world".
  - media:     extension → MediaType (markdown/pdf/image/audio/other).
  - walker:    ignore::OverrideBuilder for filter union; walkdir with
               manual visited-set cycle protection on top of follow_links.
  - connector: public FsSourceConnector::new(&Config) +
               SourceConnector::scan(&SourceScope) impl. Uses
               kb_core::to_posix for WorkspacePath construction (carries
               P0-1 # rejection through unchanged) and kb_core::id_for_asset
               for AssetId derivation. Storage variant signals intent only;
               actual byte copy is P1-6's responsibility.

Per design §3.3, §6.2, §6.6, §7.1, §7.2, §8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 12:27:34 +00:00
d2c8728095 p0-1: address review (drop unused thiserror dep, document kb-core reserve)
- Cargo.toml: remove `thiserror` from kb-config, kb-parse-types, kb-app
  (unused — none of those crates' src trees reference thiserror; CoreError
  in kb-core is the only consumer).
- kb-config keeps the `kb-core` dep with a one-line comment marking
  CoreError reserved for P1-* config-error wiring per the review thread.
- ids.rs: switch `validate_hex32` from a hand-rolled `matches!` byte range
  to `is_ascii_hexdigit()` so the hex check is the canonical idiom (and
  satisfies `clippy::manual_is_ascii_check` under `-D warnings`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:55:39 +00:00
f86df99fe9 p0-1: workspace + kb-core domain types, traits, and ID recipe
Stand up the Cargo workspace (Rust 2024 / resolver=3) with the kb-core
crate per the frozen design (§3, §4, §7, §10). kb-core has zero
deps on other kb-* crates and exposes:

- Newtype IDs (AssetId / DocumentId / BlockId / ChunkId / EmbeddingId /
  IndexId) with Display + FromStr that reject anything but 32 lower-hex.
- id_from + id_for_{asset,doc,block,chunk,embedding,index} per §4.2;
  pinned hex test values computed via an independent JCS+blake3 tool.
- CanonicalDocument, Block (8 variants), SourceSpan, Inline (§3.4).
- Citation (5 variants) with W3C Media Fragments to_uri / parse;
  round-trip property holds for every variant.
- Metadata + Provenance (§3.6); SearchQuery / SearchHit / RetrievalDetail
  (§3.7); DocFilter / DocSummary mirrors of wire §2.5.
- Answer / AnswerCitation / RefusalReason / ModelRef (§3.8).
- IngestReport, JobRepo support types, VectorRecord / VectorHit.
- Component traits (SourceConnector / Extractor / Chunker / Embedder /
  Retriever / LanguageModel / DocumentStore / VectorStore / JobRepo)
  plus their input helpers (SourceScope / ExtractContext / ChunkPolicy
  / EmbeddingInput / GenerateRequest / TokenChunk / FinishReason).
- CoreError (§10).
- nfc + to_posix helpers (§4.1, §6.6).

20 unit tests cover ID determinism (1000-run regression), key-order
invariance, two pinned hex values, newtype rejection of bad input,
Citation round-trip for all 5 variants, and to_posix collapsing +
Korean NFC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:16:37 +00:00