First real Embedder implementation. Wraps fastembed-rs (ONNX runtime)
with the e5 prefix convention, batching, and {data_dir}/${XDG_DATA_HOME}
template expansion so model files land under config.storage.model_dir/
fastembed/ without polluting kb-config's public API.
Public surface:
- pub struct FastembedEmbedder
- pub fn new(config: &Config) -> Result<Self>
- impl kb_core::Embedder (via kb-embed re-export)
Behavior:
- Default model multilingual-e5-small (384 dims). model_id and
model_version come from config.models.embedding.{model,version}.
- Pre-load dim check via TextEmbedding::get_model_info: dim mismatch
bails before paying the ~470MB ONNX init cost.
- e5 prefix applied BEFORE tokenization: "passage: " for
EmbeddingKind::Document, "query: " for EmbeddingKind::Query. Pinned
by prefix_input unit tests.
- Batches inputs into chunks of config.models.embedding.batch_size,
concatenates results in input order.
- L2 normalization is performed by fastembed 4.9's default transformer
pipeline (verified at fastembed/src/text_embedding/output.rs:43);
we skip re-normalization. Integration test pins ‖v‖ ≈ 1.0 ± 1e-3 so
a future fastembed bump that drops this invariant fails loudly.
- Synchronous (no async runtime). Mutex serializes calls into the
underlying ONNX session — conservative; ORT Session is Send+Sync but
callers (kb-app indexer) batch sequentially anyway. Revisit if
profiling shows contention.
- First-run model download surfaces via tracing::info before/after
TextEmbedding::try_new — users no longer stare at a silent 30-60s
pause during the 470MB pull.
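A minimal sketch of the prefix and batching behavior listed above, not the crate's actual code. `EmbeddingKind` is redefined locally as a stand-in for the kb-embed re-export, and `embed_in_batches` with its `embed_batch` closure are hypothetical names; the real adapter calls into fastembed at that point.

```rust
/// Hypothetical local stand-in for the kind enum re-exported by kb-embed.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum EmbeddingKind {
    Document,
    Query,
}

/// Prepend the e5 prefix so it is part of the text the tokenizer sees.
pub fn prefix_input(kind: EmbeddingKind, text: &str) -> String {
    let prefix = match kind {
        EmbeddingKind::Document => "passage: ",
        EmbeddingKind::Query => "query: ",
    };
    format!("{prefix}{text}")
}

/// Embed `inputs` in chunks of `batch_size` and concatenate the results
/// in input order. `embed_batch` stands in for the model call (assumption).
pub fn embed_in_batches<F>(
    inputs: &[String],
    batch_size: usize,
    mut embed_batch: F,
) -> Result<Vec<Vec<f32>>, String>
where
    F: FnMut(&[String]) -> Result<Vec<Vec<f32>>, String>,
{
    let mut out = Vec::with_capacity(inputs.len());
    // `chunks` panics on a size of 0, so clamp to at least one per batch.
    for batch in inputs.chunks(batch_size.max(1)) {
        out.extend(embed_batch(batch)?);
    }
    Ok(out)
}
```

With `batch_size = 2` and five inputs, the closure runs three times and the five vectors come back in input order.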
Tests:
- 11 default-lane tests covering: check_dim match/mismatch (no model
load), prefix_input Document/Query/empty, resolve_model
known/unknown, expand_path substitution + no-op + XDG_DATA_HOME set
+ XDG_DATA_HOME unset (falls back to ~/.local/share with recursive
~ expansion). XDG tests serialize on a Mutex + RAII guard since
edition 2024 makes set_var/remove_var unsafe.
- 7 #[ignore] integration tests covering: full construction with
default config, dim-mismatch belt-and-braces, Document vs Query
cosine differential, L2 unit norm, byte-equal determinism, batch-64
performance under 5s, snapshot-hash stability over a 5-sentence
multilingual fixture.
- Snapshot test fails LOUDLY when SNAPSHOT_HASH_BASELINE is 0 — prints
the captured hash and panics with paste-back instructions, so first
--ignored run forces the maintainer to pin the baseline rather than
silently passing.
- Workspace: 222 tests pass (default lane); clippy clean.
Allowed deps respected: kb-config, kb-embed (re-exports kb-core
trait surface), fastembed = "4.9", tracing, anyhow. tokenizers and
ort enter transitively through fastembed; reqwest/hyper/hf-hub also
transitive (model download is fastembed's responsibility per spec
carve-out). No direct kb-core dep needed — re-exports cover it.
Pinned to fastembed 4.x rather than the recent 5.x to limit blast
radius; consider bump when p3-3 (lancedb-store) consumes the embedder
output shape.
Out of scope: reranker (P+), Ollama embedding endpoint, candle
adapter, image embeddings (P6).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Establishes the kb-embed trait crate so concrete embedding adapters
(p3-2 fastembed, future ollama-embed/candle) target a stable surface.
Pure re-export of kb_core::{Embedder, EmbeddingInput, EmbeddingKind,
EmbeddingModelId, EmbeddingVersion} plus a feature-gated deterministic
mock for downstream tests.
MockEmbedder (cfg(feature = "mock"), default OFF):
- Per-component hash recipe: blake3(seed_le8 || kind_byte ||
text_len_le8 || text || i_le8). Length-prefixed text avoids the
domain-separation ambiguity where two (text, i) pairs could shift
bytes between text tail and the i field.
- Document = 0u8, Query = 1u8 — same text different kind yields
different vectors (mirrors e5 prefix behaviour).
- Per component: blake3 first 8 bytes → u64 → reinterpret as i64 →
f64/i64::MAX → f32. i64::MIN gives -1.0000000000000002 which f32
rounds to -1.0; range [-1, 1] holds.
- L2 unit-normalised. Norm sums in f64 (avoid catastrophic precision
loss) before f32 cast. Zero-norm guard skips the divide.
- with_seed(...) constructor lets two embedders share identity but
produce different vectors — useful for downstream parametric tests.
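The component mapping and normalisation steps above can be sketched with the standard library alone. This is an illustration under assumptions: the blake3 digest is taken as given (the first 8 bytes are passed in), little-endian byte order is assumed, and the function names are hypothetical.

```rust
/// Map the first 8 digest bytes to a component in [-1, 1].
/// Little-endian decoding is an assumption of this sketch.
pub fn component_from_digest(first8: [u8; 8]) -> f32 {
    // Reinterpret the u64 bit pattern as a signed integer.
    let signed = u64::from_le_bytes(first8) as i64;
    (signed as f64 / i64::MAX as f64) as f32
}

/// L2-normalise in place. The squared norm accumulates in f64 and the
/// division is skipped when the norm is zero.
pub fn l2_normalize(v: &mut [f32]) {
    let sum_sq: f64 = v.iter().map(|&x| f64::from(x) * f64::from(x)).sum();
    let norm = sum_sq.sqrt();
    if norm == 0.0 {
        return; // zero-norm guard
    }
    for x in v.iter_mut() {
        *x = (f64::from(*x) / norm) as f32;
    }
}
```

Feeding the bit patterns of `i64::MIN` and `i64::MAX` through `component_from_digest` is a quick way to check the range endpoints.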
Helpers:
- assert_vector_shape(vecs, dims) — len + finite check.
- assert_unit_norm(vecs, tolerance) — caller-supplied tolerance;
5e-4 documented as safe for dims=384 under f32 epsilon × √dims.
Tests:
- cargo test -p kb-embed (no features): 2 reexport/dyn-dispatch tests.
- cargo test -p kb-embed --features mock: 7 tests including 100-case
proptest asserting len == dims, all finite, ‖v‖ ≈ 1.0 within
tolerance, Doc(text) byte-equal Doc(text), Doc(text) ≠ Query(text),
Doc(text1) ≠ Doc(text2).
- All 220 workspace tests pass; clippy clean for both default and
mock-on feature configurations.
Symbol gating: nm on the release rlib confirms zero MockEmbedder
symbols under default features; three trait impl symbols under
--features mock. Spec invariant "release builds MUST NOT include
MockEmbedder" verified at the symbol level.
Allowed deps respected: kb-core, kb-config, serde, thiserror, tracing,
plus anyhow (forced by trait return type) and blake3 (justified by
the determinism contract; already in workspace lockfile via kb-core).
No fastembed/ort/tokenizers anywhere.
Out of scope: real adapter (p3-2), reranker traits (P+).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the first concrete kb_core::Retriever, exercising chunks_fts (P2-1)
to answer SearchMode::Lexical queries. Returns Vec<SearchHit> with
bm25-derived ranking, snippet() previews, and W3C-fragment-style
Citation built from the chunk's first source_spans entry.
New crate kb-search:
- LexicalRetriever::new(Arc<SqliteStore>, IndexVersion).
- search() builds an FTS5 MATCH expression by escaping every whitespace
token into a quoted literal (inner " doubled); single-quote-wrapped
text passes through verbatim as raw FTS5 syntax. Empty query
short-circuits to Ok(vec![]).
- bm25 normalization: score = -bm25 / (1 + |bm25|), bounded (0, 1] for
any FTS5-returned negative bm25.
- Snippet via snippet(chunks_fts, 3, '', '', '…', word_budget) where
word_budget = snippet_chars / 4 clamped to [1, 64]; trim_snippet
enforces the char cap on the way out (chars per design §6.4 — accepts
the combining-mark trade-off).
- Citation from chunks.source_spans_json first span: Line / Page /
Region / Time forwarded; Byte / empty array fall back to Line{1,1}
with a tracing::warn so forward-compat regressions surface.
- Filters: tags_any (subquery on document_tags), lang (= column),
trust_min (CASE-rank in SQL) all applied at SQL level. path_glob
uses globset with literal_separator(true) — guarantees '*' does not
cross '/' per spec Risks/notes — applied as Rust post-filter with
+128 row over-fetch when set, then rank reassigned 1..k contiguously.
- Determinism: ORDER BY score, f.chunk_id (lexicographic blake3 hex
tiebreaker on identical bm25). Tested explicitly with two chunks of
identical text content.
- RetrievalDetail: method=Lexical, both lexical_score and fusion_score
set, vector_* None.
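A rough sketch of three of the pure helpers described above: token escaping, bm25 normalization, and the snippet word budget. Function names are hypothetical, the raw single-quote passthrough is omitted, and joining tokens with a space (FTS5's implicit AND) is an assumption rather than something this commit states.

```rust
/// Quote every whitespace-separated token as an FTS5 string literal,
/// doubling any inner double quote. An empty or blank query yields "".
pub fn escape_fts5_query(query: &str) -> String {
    query
        .split_whitespace()
        .map(|tok| format!("\"{}\"", tok.replace('"', "\"\"")))
        .collect::<Vec<_>>()
        .join(" ")
}

/// FTS5 bm25() is negative for matches (more negative means a better
/// match); this maps it to a positive score that grows with match quality.
pub fn normalize_bm25(bm25: f64) -> f64 {
    -bm25 / (1.0 + bm25.abs())
}

/// Token budget handed to snippet(): snippet_chars / 4, clamped to [1, 64].
pub fn snippet_word_budget(snippet_chars: usize) -> usize {
    (snippet_chars / 4).clamp(1, 64)
}
```

For example, `escape_fts5_query("foo bar")` produces `"foo" "bar"`, and a raw bm25 of -1.0 normalizes to 0.5.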
kb-store-sqlite:
- Adds pub fn read_conn(&self) -> MutexGuard<'_, Connection>.
Read-only contract is doc-only — kb-search needs MutexGuard for
prepare_cached + iter, which a closure-scoped wrapper would awkwardly
constrain. Closure variant left as a P3 follow-up.
Tests (26 new): empty corpus, empty query, single hit + citation
round-trip, snippet length cap, tags_any exclusion, lang+trust
composition, path_glob with '*' not crossing '/', citation line round-
trip, bm25 top-1 ∈ (0, 1], determinism (varied scores AND identical-
score tiebreaker), index_version passthrough, snapshot
(crates/kb-search/tests/fixtures/search/lexical/run-1.json — stable
under bundled SQLite; KB_UPDATE_SNAPSHOTS=1 to regenerate). Workspace:
211 tests pass, cargo clippy --workspace --all-targets -D warnings
clean.
Allowed deps respected: kb-core, kb-config, kb-store-sqlite, rusqlite,
tracing, thiserror, anyhow (forced by trait return type), serde_json
(parses *_json TEXT columns), globset (path_glob '*' boundary).
Out of scope (deferred): vector retriever (p3-3), hybrid fusion (p3-4),
reranker (P+), Korean morphological tokenizer (P+).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds FTS5 lexical index for chunks per design §5.5: chunks_fts virtual
table (unicode61 remove_diacritics 2 tokenizer, contentless w/ UNINDEXED
chunk_id+doc_id) plus chunks_ai/chunks_ad/chunks_au triggers that mirror
every chunks mutation into chunks_fts inside the host transaction.
V002 ships the verbatim §5.5 SQL block plus a one-shot backfill INSERT
so existing P1 databases gain searchability without re-ingest. Refinery
bookkeeping makes double-apply naturally idempotent.
Adds rebuild_chunks_fts(&Connection) escape hatch for kb index
--rebuild-fts (CLI wiring deferred to a later task). Uses SAVEPOINT
instead of Transaction so callers can invoke from inside an outer
transaction; WAL serializes writers so no DELETE/INSERT race vs.
concurrent chunks mutators is possible.
Tests (10): V001-only → V002 cold-upgrade backfill (literal path),
chunks_ai/ad/au trigger sync, MATCH-token verification, rebuild
idempotency, drift recovery, double-run no-op, V002 ↔ design §5.5
verbatim diff guard (anchored extraction from both files), WAL/SHM
release on store drop. All 185 workspace tests pass.
Allowed deps respected (kb-core, kb-config, rusqlite, refinery — no
new deps). FTS query implementation deferred to p2-2 (lexical-retriever).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
M9: add a `TODO(P2/P3)` comment near the NULL persistence at
documents.rs (put_chunks). The `section_label` column exists in the §5.5
DDL but neither the in-memory Chunk struct nor the §2.6 wire schema
carries the field, so NULL is the correct canonical value today —
flag the future-bump intent in-line rather than leaving it implicit.
M10: add a one-line invariant comment near the i64 -> u32 narrowing for
`doc_version` in `get_document`. The invariant is documented at the
write site (UPSERT bumps by 1 per re-ingest) — restate it at the read
site so the cast is not silently load-bearing.
No behaviour change. No tests touched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
M1: `Sqlx` is a misleading leftover — this crate uses `rusqlite`, not
sqlx. Rename the variant (and the doc reference to it) to `Sqlite`. No
external pattern matches; the variant is reached only via `#[from]`.
M11: `assets_root` was an `#[allow(dead_code)]` helper introduced for a
test that never landed. Delete it so the dead-code allow goes with it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three Important review fixes on the kb-store-sqlite write path:
I1. Atomic asset write. put_asset_with_bytes now stages bytes to
`<final>.tmp.<pid>.<n>`, fsyncs, UPSERTs the row, then `rename`s into
place (atomic on POSIX same-fs). On any failure between staging and
rename we best-effort `remove_file` the temp so the previous orphan
risk on UPSERT failure is gone. Reference mode is unchanged (no I/O,
no orphan risk).
I2. Poison-tolerant mutex. New `lock_conn` helper does
`.lock().unwrap_or_else(|p| p.into_inner())`, so a single panic mid-
transaction no longer poisons every subsequent store call. The
rusqlite Transaction Drop already rolls back on panic, leaving the
Connection state safe to reuse. All 13 prior `.expect("sqlite mutex
poisoned")` sites in store.rs / documents.rs / jobs.rs now route
through `lock_conn`.
I3. AssetId shape validation. `kb_core::AssetId(pub String)` lets a
hand-constructed value bypass the `FromStr` 32-hex invariant. Added
`validate_asset_id` (32 ASCII hex chars) at every store entry that
turns an AssetId into a path: `put_asset_with_bytes` and
`DocumentStore::put_asset`. This shuts a potential path-traversal via
`assets_path_for`'s `&id[..2]` shard slice.
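The I2 and I3 fixes can be sketched with the standard library alone. This is a hedged illustration: the generic signature of `lock_conn` and the `String` error type are assumptions made so the example is self-contained; the store presumably locks a `rusqlite::Connection` and returns its own error type.

```rust
use std::sync::{Mutex, MutexGuard};

/// Take the lock, recovering the guard if a previous holder panicked.
/// Generic over T here; the real helper is assumed to guard a Connection.
pub fn lock_conn<T>(m: &Mutex<T>) -> MutexGuard<'_, T> {
    m.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}

/// Accept only ids made of exactly 32 ASCII hex characters, so that
/// slicing the id into a shard directory cannot produce a path component
/// such as "..".
pub fn validate_asset_id(id: &str) -> Result<(), String> {
    if id.len() == 32 && id.bytes().all(|b| b.is_ascii_hexdigit()) {
        Ok(())
    } else {
        Err(format!("invalid asset id: {id:?}"))
    }
}
```

A thread that panics while holding the guard poisons the mutex; `lock_conn` still hands back a usable guard afterwards, while a plain `.lock().expect(...)` would panic on every later call.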
Tests:
- `put_asset_with_bytes_orphan_cleanup_on_upsert_failure` — pre-seeds a
row that takes the same `workspace_path` (UNIQUE), so the UPSERT trips
a constraint not covered by `ON CONFLICT(asset_id)`. Asserts no final
file and no leaked `*.tmp.*`.
- `put_asset_with_bytes_rejects_invalid_asset_id` — passes
`AssetId("../etc/passwd_padded_to_xx_xxxxx")` (32 chars, contains `/`).
Asserts error and zero filesystem artifacts under `data_dir/assets/`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All 8 test categories from the task plan, plus a JobRepo subset:
migration — tests/migration.rs: fresh DB after run_migrations
exposes every required §5 table + index.
unit (copy) — tests/asset_writer.rs: copy mode writes file with
mode 0o644 + correct bytes.
unit (ref) — tests/asset_writer.rs: reference mode does not write
file; row records source path.
unit (cs) — tests/asset_writer.rs: tampered checksum returns a
Conflict-flavoured anyhow error.
unit (idem) — tests/idempotency.rs: same put_document twice → 1 row,
doc_version 1→2; tags re-derived.
unit (rb) — tests/idempotency.rs: put_blocks with FK violation
rolls back; pre-existing rows unchanged.
contract — tests/contract_roundtrip.rs: drives kb-parse-md +
kb-normalize + kb-chunk on
fixtures/markdown/code-and-table.md, persists, then
reloads via DocumentStore::get_document /
get_chunk and asserts byte-equal round-trip.
snapshot — tests/ingest_report_snapshot.rs +
snapshots/ingest_report.snapshot.json: pin the wire
JSON form of kb_core::IngestReport for an inline
fixture run.
jobs — tests/jobs.rs: create → progress → finish flow;
error message round-trip; list filters on status/kind.
Drops the unused `serde` direct dep from Cargo.toml; serde_json brings
its own. Dev-deps confirmed via `cargo tree -p kb-store-sqlite --depth 1`
to live only in the dev tree.
The snapshot now includes the policy_hash field on every Chunk per the
preceding kb-core schema change. chunk_ids are unchanged because the
chunk_id recipe (§4.2) already incorporated policy_hash via the chunker —
the field is simply now visible in the wire form.
Regenerated via:
UPDATE_SNAPSHOTS=1 cargo test -p kb-chunk long_section_chunks_snapshot
Set the new policy_hash field on every emitted Chunk to the same hex
already computed for the chunk_id recipe (§4.2). No recipe / chunk_id
change — only the field on the struct is now populated.
Pairs with the kb-core hotfix (preceding commit) and unblocks P1-6's
DocumentStore::put_chunks to read chunk.policy_hash directly per §5.5.
Add policy_hash: String to kb_core::Chunk to align with the §5.5 SQLite
schema (chunks.policy_hash NOT NULL), so kb-store-sqlite persistence is a
straight field copy rather than a recompute.
This is a §9 schema migration:
- §5.5 (the persistence schema) is authoritative.
- §3.5 (the domain model) must accommodate.
The chunker already computed policy_hash for the chunk_id recipe (§4.2);
P1-5 stored it implicitly. We now hold it explicitly on the Chunk so any
DocumentStore::put_chunks impl can read it directly.
Follow-up commits update kb-chunk to populate the field and refresh the
P1-5 snapshot baseline accordingly.
Doc-only follow-ups for review minors I1, M3, M4. No behavior change.
* I1: rustdoc on `MdHeadingV1Chunker` now records that
`policy.respect_markdown_headings` flows into `policy_hash` only;
the `md-heading-v1` variant unconditionally treats headings as
boundaries by name. To disable heading awareness, ship a different
`chunker_version` (none in P1-5).
* M3: `# Panics` rustdoc on `policy_hash` documents the
unreachable-in-practice failure mode of
`serde_json_canonicalizer::to_vec` and explains why the `expect` is
retained as future-proofing.
* M4: Inline comment at the `would_exceed` decision noting that
`acc.text_tokens` already includes the prior chunk's overlap seed,
and that the I3 clamp guarantees a flush here never produces a
chunk smaller than the seed budget.
* Heading-path bullet in the behavior contract updated to reflect the
I2 fix wording.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A pathological `ChunkPolicy { overlap_tokens >= target_tokens }`
caused `md-heading-v1` to degenerate into 1-block-per-chunk: the
seeded `acc.text_tokens` already exceeded `target_tokens` before any
fresh content landed, so the next paragraph immediately tripped the
`would_exceed` flush.
Clamp the seed budget in `collect_overlap_seed` to
`min(overlap_tokens, target_tokens / 2)`. This guarantees that after
seeding, the chunk has at least `target/2` worth of room for new
content before flushing, restoring the intended paragraph-overlap
behavior on every reasonable and unreasonable policy.
Adds a regression test pinning a target=50 / overlap=200 policy (overlap = 4× target)
and asserting every emitted chunk holds ≥2 blocks.
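The clamp itself is a one-liner; the sketch below shows it together with one plausible way to pick seed blocks from the tail of the previous chunk. Both function names are hypothetical, and the tail-walk in `seed_block_count` is an assumption about how seeding selects blocks, not the chunker's confirmed logic.

```rust
/// Seed budget after the clamp: never more than half the target.
pub fn clamped_seed_budget(overlap_tokens: usize, target_tokens: usize) -> usize {
    overlap_tokens.min(target_tokens / 2)
}

/// Count how many trailing blocks of the previous chunk fit in the
/// clamped budget, walking backwards from the last block (assumption).
pub fn seed_block_count(
    block_tokens: &[usize],
    overlap_tokens: usize,
    target_tokens: usize,
) -> usize {
    let budget = clamped_seed_budget(overlap_tokens, target_tokens);
    let mut used = 0;
    let mut count = 0;
    for &tokens in block_tokens.iter().rev() {
        if used + tokens > budget {
            break;
        }
        used += tokens;
        count += 1;
    }
    count
}
```

With `target_tokens = 50` and `overlap_tokens = 200`, the budget comes out at 25, so at least half the target stays free for new content.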
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a Heading block opens a chunk and is followed by another Heading
or an atomic block (Code, Table, ImageRef, AudioRef) with no
intervening prose, the prior fallback used `common.heading_path` from
the heading itself — which per kb-normalize convention does NOT
include the heading's own text. Result: heading-only and heading-led
chunks for `# Alpha\n## Beta\n...` patterns landed with
`heading_path = []`, losing citation context.
Synthesize the leading heading into the chunk's heading_path when
blocks[0] is a Heading: parent path + heading.text. The first
non-Heading branch (existing logic for normal mid-section chunks) is
unchanged.
`chunk_id` recipe is `(doc_id, chunker_version, block_ids,
policy_hash)` — `heading_path` is not in the recipe, so this fix does
NOT shift chunk_ids. Snapshot baseline `long-section.chunks.snapshot.json`
also unchanged because every heading in that fixture is followed by a
paragraph (the bug only triggers on direct heading→heading or
heading→atomic adjacency).
Adds `heading_with_parents` test helper and a regression test pinning
the `# Alpha\n## Beta\n[code]` pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bakes the chunker output for fixtures/markdown/long-section.md (3 H1s,
nested H2 under Alpha, a 50-line code block, a 3-col x 4-row table,
and a multi-paragraph Gamma section) into the JSON snapshot baseline.
Confirms the priority rules end-to-end:
- Heading boundaries hold across H1 → H2 → H1 transitions
- The code block emits one chunk at 427 tokens > target=200
- The table stays single-chunk
- Gamma's paragraph stream splits with one block of overlap seed
A second test runs the full parse → normalize → chunk pipeline 5
times and asserts identical chunk_ids each pass.
Drops the unused `kb-config` and `serde` from regular dependencies —
they were declared but no source path imports them; `serde` flows in
transitively via `kb-core` as a public API requirement, and
`ChunkingCfg` lives in `kb-config` but the chunker takes
`ChunkPolicy` directly. Production deps are now exactly the allowed
set actually used: anyhow, blake3, kb-core, serde_json_canonicalizer,
tracing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fills in MdHeadingV1Chunker::chunk with the priority-ordered ruleset
from the design (§0 / §14):
1. Heading is a hard boundary; the heading itself starts and is
included in its chunk so heading text is retrievable.
2. Code blocks never split, regardless of size.
3. Tables stay single-chunk (row-split deferred per task spec).
4. Long sections split at target_tokens with paragraph-level
overlap_tokens worth of seeded tail blocks.
5. ImageRef / AudioRef each become their own chunk with
token_estimate = 0.
6. heading_path lifts from the first contributing non-Heading
block; source_spans concatenate in document order.
7. chunk_id derives from id_for_chunk(doc_id, chunker_version,
block_ids, policy_hash) per §4.2.
Covers the unit + determinism rows of the P1-5 test plan: heading
boundary respected, 800-token code block stays single, small table
stays single, long paragraph chain splits with overlap, ImageRef
chunk has token_estimate=0, and 1000-iter chunk_id determinism.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the new workspace member with the bare Chunker impl shape:
chunker_version() returns "md-heading-v1"; policy_hash() blake3-hashes
canonical JSON of ChunkPolicy and truncates to 16 hex chars; chunk()
is an empty stub the next commits fill in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
M8: kb-parse-md frontmatter doc-comment claimed filename fallback was
P1-4's job; P1-4 spec did not include it. Reconcile: defer to a later
phase (P1-7 / kb-app integration) where the workspace_path filename is
known to the caller. Updated comment in build_metadata().
M9: kb-parse-md tests use the #[ignore] regenerator pattern, while
kb-normalize's integration test uses an UPDATE_SNAPSHOTS=1 env-var.
Migrating kb-parse-md is out of scope; one-line note added to
blocks_snapshots.rs mod doc-comment to flag the intentional split.
M11, M12: doc-only comments in lift_block (already added in the
previous commit) — list-item shared block_id rationale and the
intentional camel-case Debug-format for WarningKind in Provenance
notes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
I1: warning_agent maps ExtractFailed → "kb-parse-md" (the panic-recovery
emitter in kb-parse-md/src/blocks.rs). Lift-stage warnings from
build_canonical_document are tracked separately and attributed to
"kb-normalize", so the I1 mapping change does not lie about
kb-normalize-originated drops.
I2: ParsedPayload::AudioRef no longer synthesizes Block::AudioRef with
an invalid empty AssetId (would violate AssetId::from_str's 32-hex
invariant). Block is dropped, Warning surfaces in Provenance with src
mention, attributed to kb-normalize (lift-stage decision). TODO(P8)
comment marks this as a placeholder until the audio extractor lands.
I3: NFC-normalize each heading_path string in lift_block before feeding
into id_for_block AND into CommonBlock.heading_path. pulldown-cmark does
not NFC heading text and serde_json_canonicalizer v0.3 does not either,
so canonically-equivalent NFD/NFC inputs would produce different
block_ids without this normalization. Mirrors the existing doc_id NFC
handling via to_posix.
Minors:
- M4: trim Cargo.toml — drop kb-config, serde_json_canonicalizer,
blake3 (unused); keep tracing (now wired) + unicode-normalization
(now used by I3).
- M5: determinism_1000_iterations_under_1s now uses the same 5-block
fixture as block_ordinals_scoped_per_heading_and_kind (extracted into
fixture_blocks_five helper) so the determinism property is exercised
on a real lift_block path, not just an empty Vec. Still < 1s.
- M6: snapshot integration test now passes BodyHints { first_h1:
Some("Code And Table"), .. } and asserts doc.title == "Code And Table"
end-to-end. Baseline JSON updated.
- M7: title/lang edge-case unit tests pin policy: empty string lifts to
empty string; non-stringy values silently drop. Rustdoc updated.
- M10: provenance_contains_stage_events_in_order asserts events[1].at
== events[2].at to pin the shared-now_utc invariant.
New tests (unit, kb-normalize):
- provenance_with_extract_failed_warning_attributes_to_kb_parse_md (I1)
- audio_ref_block_skipped_with_warning (I2)
- nfc_nfd_korean_heading_path_same_block_id (I3)
- title_empty_string_in_user_map_falls_back_to_default (M7)
- title_non_string_in_user_map_silently_drops (M7)
- lang_invalid_shape_silently_drops (M7)
kb-normalize unit tests: 9 → 14. Integration snapshot: 1 (unchanged).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add the integration snapshot test pinning the full `CanonicalDocument`
JSON for `fixtures/markdown/code-and-table.md` (run through the real
`kb-parse-md::parse_frontmatter` + `parse_blocks`, dev-dep only).
Non-deterministic `provenance.events[*].at` for the Parsed and
Normalized events is stripped before comparison; the Discovered
event's `at` is pinned by constructing the test `RawAsset` with a
fixed `discovered_at`. Run with `UPDATE_SNAPSHOTS=1` to regenerate.
Add the 1000-iteration determinism property: same inputs ⇒ byte-
identical JSON (modulo the same stripped timestamps), in under one
second of wall-clock time. A regression in canonical JSON, BLAKE3
hashing, ordinal counting, or any other deterministic field would
surface here immediately.
The integration test depends on `kb-parse-md` only as a dev-dep, so
`cargo tree -p kb-normalize --depth 1 --edges normal` confirms no
parser implementation appears in the production dep tree per design
§8.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Build a `Provenance` with one event per pipeline stage (`Discovered`
sourced from `RawAsset.discovered_at`, then `Parsed` and `Normalized`
stamped with one shared `now_utc()` reading), plus one `Warning` event
per upstream warning. Sharing `now` between Parsed and Normalized
bounds intra-call timestamp jitter — event ordering is preserved by
`Vec` position regardless. Warning agents are routed back to the
upstream component (`kb-parse-md` for parse warnings,
`kb-normalize` for `ExtractFailed`).
Lift `metadata.user["title"]` and `metadata.user["lang"]` (where P1-2
stashes them since the `Metadata` struct itself does not carry those
fields) into `CanonicalDocument.title` / `CanonicalDocument.lang`.
Both keys are removed from the user map after lifting so the wire
form does not duplicate the data; missing keys default to empty
string / empty `Lang`. Other user-map keys survive.
Tests pin the event ordering, the warning routing, and the lift
behavior (including non-duplication in `metadata.user`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement the §4.3 ordinal rule and §3.4 block lift. Each `ParsedBlock`
maps to a `kb_core::Block` variant carrying a `CommonBlock` whose
`block_id = id_for_block(doc_id, payload_kind, heading_path, ordinal,
source_span)`. Ordinals are scoped to `(heading_path, payload_kind)`,
0-based, in document order — three paragraphs under one H1 get 0/1/2,
a code block under the same H1 starts fresh at 0, a paragraph under a
different H1 also starts at 0.
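The ordinal scoping can be illustrated with a counter keyed on `(heading_path, payload_kind)`. This is a sketch only: `assign_ordinals` and its tuple input are hypothetical, and the real lift works on `ParsedBlock` values rather than plain tuples.

```rust
use std::collections::HashMap;

/// Assign 0-based ordinals in document order, with one counter per
/// (heading_path, payload_kind) pair.
pub fn assign_ordinals(blocks: &[(Vec<String>, &str)]) -> Vec<u32> {
    let mut counters: HashMap<(Vec<String>, String), u32> = HashMap::new();
    blocks
        .iter()
        .map(|(path, kind)| {
            let next = counters
                .entry((path.clone(), kind.to_string()))
                .or_insert(0);
            let ordinal = *next;
            *next += 1;
            ordinal
        })
        .collect()
}
```

Three paragraphs and a code block under one heading, followed by a paragraph under another, yield ordinals 0, 1, 2, 0, 0.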
`payload_kind` is the lowercase-no-spaces convention from §4.2:
"heading", "paragraph", "list", "code", "table", "quote", "imageref",
"audioref".
`ListBlock.items` re-uses the parent list's `CommonBlock` per §3.4 (no
per-item BlockId is allocated). `AudioRefBlock` placeholder fields
(`asset_id`, `duration_ms`) are filled in by P8 — for now we synthesize
the minimal record so the document is well-typed.
Tests pin the four §4.4 ID properties (1000-iteration determinism, NFC
≡ NFD Korean path, `./a/b.md` ≡ `a/b.md`, ordinal grouping). Provenance
and title/lang lift land in the next commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add the workspace member, `Cargo.toml` with the §8-allowed dep set
(kb-core, kb-parse-types, kb-config, serde, serde_json_canonicalizer,
blake3, unicode-normalization, time, anyhow, tracing) and a stubbed
`build_canonical_document` that pins the public signature plus
`doc_id` derivation. `kb-parse-md` is permitted only as a *dev*-dep so
the integration snapshot test (added later in this series) can drive
a fixture through the real parser without violating the production
boundary — `cargo tree -p kb-normalize --depth 1 --edges normal`
confirms no parser implementation appears in the regular dep tree.
`id_for_doc` and `id_for_block` are re-exported from kb-core (which
holds the canonical recipe per §4.2); kb-normalize is the canonical
*entry point* per design §8.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mechanical sweep over `Inline::Text(_)` / `Code(_)` / `Strong(_)` / `Emph(_)`
construction and match sites under the new struct-variant shape introduced
in the previous commit. `Inline::Link { text, href }` is unchanged.
The snapshot test in `tests/blocks_snapshots.rs` previously projected
`ParsedBlock` into a `BlockView`/`PayloadView` shim because the old
`Inline` could not serialize. With the schema fix in place we now
serialize `ParsedBlock` directly through serde — the shim and its
`flatten_inline` helper are removed. Inlines surface as structured
objects (`{"kind":"text","text":"…"}` etc.). Regenerated
`nested-headings.blocks.snapshot.json` to reflect the new shape via
the existing `--ignored` emitter; `code-and-table.blocks.snapshot.json`
has no inlines and is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`#[serde(tag = "kind")]` rejects newtype variants whose payload is not a
struct, so 4 of 5 `Inline` variants (`Text(String)`, `Code(String)`,
`Strong(Vec<…>)`, `Emph(Vec<…>)`) failed to serialize at runtime — only
`Link { text, href }` worked. Convert every variant to struct form so the
internally-tagged shape is well-formed and round-trips through JSON.
Add `inline_serde_round_trip` covering all five variants. Per design §9,
this is a wire-schema migration; no `docs/wire-schema/v1/*.json` change
required since `Inline` is not directly referenced there. Callers in
kb-parse-md follow in the next commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The quality reviewer named three specific input probes for the C1/C2/
C3 fixes. Encode each as a verbatim test so future regressions on
those exact inputs surface immediately:
- probe_overflow: parse_blocks(b"# h\nbody\n", u32::MAX) → empty +
Warning::ExtractFailed.
- probe_list_escape: list with embedded code block → single List
block, two items.
- probe_empty_heading: `# \n# Real\nbody\n` → body's heading_path is
`["Real"]`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`ParsedPayload::Quote { text, inlines }` cannot represent block-level
children (lists, code, tables, images), so the BlockQuote end handler
silently drops them when assembling the Quote payload. This matches
§3.4 for now but is non-obvious and easy to regress without an
explicit pin.
Add a TODO(P1-future) comment near the Quote emission code and a
regression test (`quote_with_list_inside_drops_list`) that fixes the
current shape: a `> - item` blockquote produces a Quote with empty
text and empty inlines.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`[multi\nline](http://x)` produced `Inline::Link.text = "multiline"`
because the SoftBreak/HardBreak handler called `push_text(" ")` —
which updates `paragraph.text` and the inline buffer, but NOT the
open link frame's flattened text accumulator. Text events flowed
through `push_link_text`; line breaks didn't.
Add `push_link_text(" ")` alongside the existing `push_text(" ")` in
the break handler so a line break inside `[ ... ](href)` collapses
to a visible space rather than disappearing.
New tests:
- link_with_soft_break_preserves_space_in_text
- link_with_hard_break_preserves_space_in_text
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous block-level image detector scanned paragraph source bytes
for the literal `![alt](src)` shape. That was fragile in three ways:
- `![alt](src "title")` leaked the title into `src` (`src "title"`)
- `![alt](<src>)` kept the angle brackets verbatim
- `![]()` had undefined behavior
Replace the byte-scan with state on `Frame::Paragraph` that observes
the actual `Tag::Image` events from pulldown-cmark:
- `image_count` increments on each `Start(Tag::Image)` and `image_src`
captures `dest_url` (which already strips angle brackets and excludes
the title).
- Text events seen while `image_depth > 0` are routed into `image_alt`
and suppressed from the inline buffer.
- Strong/Emph/Link starts and any non-image text outside the image
flag `non_image_text_seen`.
At `End(Paragraph)`, the paragraph is lifted to `ImageRef` iff
`image_count == 1 && !non_image_text_seen`. The byte-scanner
`match_block_image` is removed.
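A simplified model of the paragraph state described above. The `Ev` enum is a hypothetical stand-in for the pulldown-cmark events involved, and the struct is a sketch of the fields named in this message, not the actual `Frame::Paragraph` definition.

```rust
/// Hypothetical, reduced event set standing in for pulldown-cmark events.
pub enum Ev<'a> {
    ImageStart(&'a str), // carries the destination URL
    ImageEnd,
    Text(&'a str),
    OtherInlineStart, // Strong / Emph / Link start
}

#[derive(Default, Debug)]
pub struct ParagraphState {
    image_count: u32,
    image_depth: u32,
    image_src: String,
    image_alt: String,
    non_image_text_seen: bool,
}

impl ParagraphState {
    pub fn observe(&mut self, ev: &Ev) {
        match ev {
            Ev::ImageStart(src) => {
                self.image_count += 1;
                self.image_depth += 1;
                self.image_src = src.to_string();
            }
            Ev::ImageEnd => self.image_depth = self.image_depth.saturating_sub(1),
            // Text inside an image is alt text, kept out of the inlines.
            Ev::Text(t) if self.image_depth > 0 => self.image_alt.push_str(t),
            Ev::Text(_) | Ev::OtherInlineStart => self.non_image_text_seen = true,
        }
    }

    /// The paragraph lifts to an image reference only when it holds
    /// exactly one image and nothing else. Returns (src, alt).
    pub fn image_ref(&self) -> Option<(&str, &str)> {
        if self.image_count == 1 && !self.non_image_text_seen {
            Some((self.image_src.as_str(), self.image_alt.as_str()))
        } else {
            None
        }
    }
}
```

A paragraph with a lone image lifts to `(src, alt)`; the same image preceded by prose does not.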
New tests:
- image_with_title_attribute (title dropped, no leak into src)
- image_with_angle_bracketed_url (brackets stripped)
- empty_image_alt_and_src (`![]()` pins to empty/empty)
Existing image tests (`image_ref_block_captures_src_and_alt`,
`inline_image_inside_paragraph_is_dropped`) continue to pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`# ` (a heading with no following text) used to seed the heading
stack with `Some("")`, which then propagated into every child block's
`heading_path` as a `""` segment — visibly polluting the path that
downstream consumers index by.
Filter empty entries from both `heading_path()` and the in-line
ancestor collection at heading-end. We deliberately keep `Some("")`
in the stack rather than skipping the assignment so the slot remains
occupied and a subsequent deeper heading is still positioned
correctly relative to its level — only the visible path drops the
empty.
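A minimal sketch of the filtering, assuming the heading stack is a fixed array with one slot per level. `HeadingStack` and `set` are hypothetical names, and the real stack may be shaped differently; only `heading_path` is taken from the description above.

```rust
/// Hypothetical heading stack: slot `i` holds the text of the current
/// level-(i+1) heading, or `None` if no heading is open at that level.
#[derive(Default)]
pub struct HeadingStack {
    slots: [Option<String>; 6],
}

impl HeadingStack {
    /// Records a heading at `level` (1..=6) and clears all deeper slots.
    /// An empty heading still occupies its slot as `Some("")`.
    pub fn set(&mut self, level: usize, text: &str) {
        assert!((1..=6).contains(&level));
        self.slots[level - 1] = Some(text.to_string());
        for slot in &mut self.slots[level..] {
            *slot = None;
        }
    }

    /// Visible path: occupied slots in level order, with empty entries dropped.
    pub fn heading_path(&self) -> Vec<String> {
        self.slots
            .iter()
            .flatten()
            .filter(|s| !s.is_empty())
            .cloned()
            .collect()
    }
}
```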
New tests:
- empty_heading_does_not_pollute_path
- empty_h1_then_h2_does_not_break_stack
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`emit_block` previously walked the frame stack looking only for a Quote
container, falling back to top-level on miss. That caused any block
emitted inside a list item — code blocks, images, tables, headings —
to escape the list and appear at the top of `blocks`, after the
entire list and out of source order. `ParsedPayload::List { items:
Vec<Vec<Inline>> }` cannot represent a child block structurally, so
the choice is between dropping content and flattening.
Extend the reverse-walk to also recognize `Frame::ListItem` and route
the block into a textual rendering appended to the item's inline
buffer (`flatten_block_into_item`):
- Code → fenced text approximation, preserving lang hint + body
- Image → `![alt](src)` text
- Audio → `[audio](src)` text
- Heading → leading hashes + text
- Quote → `> text`
- Nested List → same rendering as `nested_in_item` flatten
- Table → pipe-table approximation
Document order is preserved because flattening happens inside the
item's frame, before the item closes.
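The textual renderings can be sketched roughly as follows. `Block` is a hypothetical, reduced model rather than the real payload types in kb-core, `flatten_block` is an illustrative name, the image rendering assumes the standard markdown image shape, and the nested-list and table cases are omitted for brevity.

```rust
/// Hypothetical, reduced block model used only for this sketch.
pub enum Block {
    Code { lang: Option<String>, body: String },
    Image { alt: String, src: String },
    Audio { src: String },
    Heading { level: u8, text: String },
    Quote { text: String },
}

/// Renders a child block as text that can be appended to the parent
/// list item's inline buffer.
pub fn flatten_block(block: &Block) -> String {
    match block {
        Block::Code { lang, body } => {
            // Build the fence at runtime; assumes `body` has no trailing newline.
            let fence = "`".repeat(3);
            let lang = lang.as_deref().unwrap_or("");
            format!("{fence}{lang}\n{body}\n{fence}")
        }
        Block::Image { alt, src } => format!("![{alt}]({src})"),
        Block::Audio { src } => format!("[audio]({src})"),
        Block::Heading { level, text } => {
            format!("{} {}", "#".repeat(*level as usize), text)
        }
        Block::Quote { text } => format!("> {text}"),
    }
}
```

Because the rendered string is appended while the item's frame is still open, it lands between the inlines that precede and follow the child block in the source.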
New tests:
- code_block_inside_list_item_flattens_into_parent
- image_inside_list_item_flattens_into_parent
- block_content_in_list_preserves_document_order
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`span_for` previously used `u32 + u32` directly, so a caller passing a
large `body_offset_lines` could trigger an overflow panic in debug
builds (masked by `catch_unwind`, which then discarded the entire body)
or, in release builds, a wrapped sum and an inverted span with
`start > end`.
Switch to `checked_add`; on overflow flag the walk state and at the
end of `parse_blocks_inner` discard accumulated blocks and surface a
single `Warning::ExtractFailed` carrying the offending body line.
This degrades cleanly without panicking and without emitting a
silently-broken span.
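A minimal sketch of the overflow-safe arithmetic. `LineSpan` and the three-argument signature are assumptions for illustration; the real `span_for` flags the walk state rather than returning an `Option`.

```rust
/// Hypothetical span type in 1-indexed file-line coordinates.
#[derive(Debug, PartialEq, Eq)]
pub struct LineSpan {
    pub start: u32,
    pub end: u32,
}

/// Returns `None` on `u32` overflow instead of panicking (debug builds)
/// or wrapping (release builds).
pub fn span_for(body_offset_lines: u32, start_line: u32, end_line: u32) -> Option<LineSpan> {
    let start = body_offset_lines.checked_add(start_line)?;
    let end = body_offset_lines.checked_add(end_line)?;
    Some(LineSpan { start, end })
}
```

A caller that receives `None` can discard the accumulated blocks and emit a single warning, so no inverted span ever reaches downstream consumers.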
Also extend `random_bytes_do_not_panic` to mix u32::MAX-style offsets
across the fuzz iterations so the overflow path is exercised by the
randomized corpus.
New tests:
- body_offset_lines_max_returns_extract_failed
- body_offset_lines_zero_at_max_minus_one_no_overflow
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes the leftover `let _ = level_u8;` and `let _ = header_count;`
discards. The Heading frame already carries the canonical level (we
populated it from `Tag::Heading` at Start), so we destructure that
directly and ignore the redundant `TagEnd::Heading(level)` payload.
The header_count helper was dead — Frame::Table tracks `cols`
internally and we never consumed `header_count`.
Adds two snapshot tests (`nested-headings.md`, `code-and-table.md`) under
crates/kb-parse-md/tests/blocks_snapshots.rs, with matching baseline JSON
next to each fixture. The snapshot view projects `kb_core::Inline` to
flat strings — `Inline` carries `serde(tag = "kind")` which is
incompatible with newtype variants holding a primitive (`Text(String)`),
so direct serialization of `ParsedBlock` would fail today. The view
preserves the contract that matters for P1-3 (heading paths, source
spans, payload kinds, payload text/code/table content) and will keep
working once kb-core fixes the Inline schema in a later task.
Also tightens `level_to_use >= 1 && level_to_use <= 6` into
`(1..=6).contains(&level_to_use)` to satisfy
`clippy::manual_range_contains`.
Implements `kb_parse_md::parse_blocks(body, body_offset_lines)` returning a
flat `Vec<ParsedBlock>` plus warnings. Walks pulldown-cmark events through a
small frame-based state machine that tracks heading paths, accumulates
inline buffers (Text/Code/Link/Strong/Emph only — design §3.4), and
reports SourceSpan::Line spans in 1-indexed file-line coordinates.
Covers headings, paragraphs, code blocks (lang from info string), GFM
tables (with malformed fallback to paragraph + MalformedTable warning),
lists (nested sub-lists flattened into parent item), and block-level image
references. Inline images are dropped silently per the inline filter.
Adversarial inputs are caught with `catch_unwind` and degrade to an empty
output + ExtractFailed warning.
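The panic guard can be sketched as below. `guarded` and the reduced `Warning` enum are hypothetical, and blocks are modeled as plain strings; only `std::panic::catch_unwind` and `AssertUnwindSafe` are real standard-library items.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Hypothetical, reduced warning type for this sketch.
#[derive(Debug, PartialEq)]
pub enum Warning {
    ExtractFailed,
}

/// Runs `parse` and converts a panic into empty output plus one warning.
/// Assumes the binary is built with unwinding panics (the default),
/// since `catch_unwind` cannot intercept `panic = "abort"`.
pub fn guarded<F>(parse: F) -> (Vec<String>, Vec<Warning>)
where
    F: FnOnce() -> Vec<String>,
{
    match catch_unwind(AssertUnwindSafe(parse)) {
        Ok(blocks) => (blocks, Vec::new()),
        Err(_) => (Vec::new(), vec![Warning::ExtractFailed]),
    }
}
```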
15 unit tests cover heading-path correctness, code lang, table parsing,
malformed-table fallback (driven via synthetic events since pulldown-cmark
auto-normalizes table widths), LF/CRLF line-range parity, image refs,
nested-list flattening, inline filter, and 100-iteration random-bytes plus
hand-crafted adversarial-input no-panic guards.
M1: Reword the FrontmatterSpan doc-comment from "technically meant to be
crate-internal" to a forward-looking note about P1-3 / P1-4 callers using
bytes[span.end..] for body slicing.
M3: Add an explicit `# Errors` section to parse_frontmatter's rustdoc.
The current implementation never returns Err — all recoverable problems
are downgraded to warnings — but the Result is kept on the signature so
future hard-fail conditions can be added without breaking callers.
M4: Mention serde_yml in the library-choice rationale alongside
serde_yaml_ng, with a one-line note on why _ng was preferred (stricter
adherence to original serde_yaml semantics around null / tagged enums).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
C1: detect_delimiters now returns (DelimKind, FrontmatterSpan, Range<usize>)
where the inner range is the YAML/TOML payload byte range — derived in one
place rather than recomputed by the parser via fixed-width opening_len /
closing_len constants that wrongly assumed LF endings. CRLF input now parses
correctly end-to-end; the originally-failing reviewer probe
"---\r\ntitle: Doc\r\n---\r\nbody\r\n" now yields title="Doc" with no
warnings.
I1: Trailing horizontal whitespace (spaces / tabs) on either delimiter
line is now accepted, matching Hugo / Jekyll. Editors that auto-trim
trailing whitespace no longer silently break otherwise-valid frontmatter.
I2: A leading UTF-8 BOM (EF BB BF, byte 0 only) is tolerated and skipped
before delimiter scanning. The returned span.start accounts for the BOM
(=3) so callers using bytes[span.end..] for body slicing still get the
correct range without further bookkeeping. Mid-input BOMs are not stripped.
M2: Drop the now-dead DelimKind::opening_len / closing_len constants —
the inner range is encoded once at detection time.
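The detection rules above (BOM at byte 0, CRLF endings, trailing spaces or tabs on delimiter lines) can be sketched for the YAML case as follows. `yaml_payload_range`, `next_line`, and `is_delim` are hypothetical helpers; the real `detect_delimiters` also handles TOML and returns the `DelimKind` and `FrontmatterSpan`.

```rust
use std::ops::Range;

/// Returns the byte range of the payload between `---` delimiter lines,
/// or `None` if there is no terminated frontmatter block at the start.
pub fn yaml_payload_range(bytes: &[u8]) -> Option<Range<usize>> {
    const BOM: &[u8] = &[0xEF, 0xBB, 0xBF];
    // A BOM is skipped only at byte 0.
    let start = if bytes.starts_with(BOM) { BOM.len() } else { 0 };

    let (open_line, payload_start) = next_line(bytes, start)?;
    if !is_delim(open_line) {
        return None;
    }

    let mut pos = payload_start;
    while pos < bytes.len() {
        let (line, next) = next_line(bytes, pos)?;
        if is_delim(line) {
            // The payload ends where the closing delimiter line begins.
            return Some(payload_start..pos);
        }
        pos = next;
    }
    None // unterminated: treated as no frontmatter
}

/// Returns the line starting at `from` (without its `\n`) and the offset
/// of the following line.
fn next_line(bytes: &[u8], from: usize) -> Option<(&[u8], usize)> {
    if from >= bytes.len() {
        return None;
    }
    let rest = &bytes[from..];
    match rest.iter().position(|&b| b == b'\n') {
        Some(i) => Some((&rest[..i], from + i + 1)),
        None => Some((rest, bytes.len())),
    }
}

/// A delimiter line is `---` followed only by spaces, tabs, or a CR.
fn is_delim(line: &[u8]) -> bool {
    let mut end = line.len();
    while end > 0 && matches!(line[end - 1], b' ' | b'\t' | b'\r') {
        end -= 1;
    }
    line[..end] == b"---"[..]
}
```

Deriving the payload range from the actual line boundaries is what makes the line-ending width irrelevant: no constant for the delimiter length is needed.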
12 new tests covering CRLF (YAML / TOML / mixed-EOL / end-to-end),
trailing whitespace on opener / closer / tabs, leading BOM (detection +
full pipeline), and mid-input BOM non-stripping.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec §"Behavior contract" line 74 says `id:` is captured into
`metadata.user_id_alias` only. Remove the redundant `user.insert`
that was also writing it into the user map, and update the snapshot
baseline accordingly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two markdown fixtures with hand-authored JSON baselines that pin the
§0 Q9 derive output across runs:
- frontmatter-only.md exercises the YAML happy path with most fields,
unknown keys, an `id:` field, and a non-UTC created_at (so the
baseline shows original_timestamps preservation).
- mixed-lang.md is body-only with no `lang:` field; baseline pins the
lingua autodetect result for our enabled language set.
A separate `emit_snapshots` test (marked `#[ignore]`) regenerates the
baselines from the current parser output. A determinism test parses
the fixture twice and asserts equality so any non-determinism (e.g.
key ordering, lingua nondeterminism) fails fast.
Implement the frontmatter submodule:
- detect_delimiters scans for a leading YAML (---) or TOML (+++) block at
byte 0. Strict per §0 Q9: no leading whitespace / BOM, no chars on the
delimiter line. Closing must be its own line. Unterminated → no FM.
- parse_raw deserializes into RawFrontmatter, a serde-flatten struct that
catches unknown keys into a serde_json::Map for verbatim preservation
in metadata.user.
- derive_metadata implements the §0 Q9 fallback chain:
    title        → frontmatter | BodyHints.first_h1 | (filename: caller)
    aliases/tags → frontmatter | []
    lang         → frontmatter | lingua autodetect on first 4 KB | hints | "und"
    created_at   → frontmatter (RFC 3339, normalized to UTC) | fs_ctime
    updated_at   → frontmatter | fs_mtime
    source_type  → frontmatter | "markdown"
    trust_level  → frontmatter | "primary"
    id           → user_id_alias only; never a doc_id factor (§4.2)
- Non-UTC offsets are normalized to UTC; the original string is preserved
in user.original_timestamps[field] per §0 Q9.
- Warnings are emitted for: malformed YAML/TOML, unknown enum values,
malformed timestamps. Unknown keys are silent.
- lingua detector is cached in a OnceLock — first build is heavy.
- 15 unit tests cover every row of the derive table + delimiter edge
cases + an explicit pin that `id:` does not feed id_for_doc.
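Two rows of the fallback chain can be sketched with `Option::or`. `derive_title` and `derive_lang` are hypothetical helpers taking already-extracted values; the real `derive_metadata` works on `RawFrontmatter` and `BodyHints`, and the filename fallback for the title is left to the caller as in the table.

```rust
/// title → frontmatter | first H1; `None` means the caller falls back
/// to the filename.
pub fn derive_title(frontmatter: Option<&str>, first_h1: Option<&str>) -> Option<String> {
    frontmatter.or(first_h1).map(str::to_owned)
}

/// lang → frontmatter | autodetected | hint | "und"
pub fn derive_lang(
    frontmatter: Option<&str>,
    detected: Option<&str>,
    hint: Option<&str>,
) -> String {
    frontmatter
        .or(detected)
        .or(hint)
        .unwrap_or("und")
        .to_string()
}
```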