feat(p1-5): kb-chunk md-heading-v1 chunker #10

Merged
altair823 merged 6 commits from feat/p1-5-chunk into main 2026-04-30 16:58:35 +00:00

6 Commits

Author SHA1 Message Date
ceeac9f974 p1-5: doc rationale for respect_markdown_headings, policy_hash panic, overlap accounting
Doc-only follow-ups for review minors I1, M3, M4. No behavior change.

* I1: rustdoc on `MdHeadingV1Chunker` now records that
  `policy.respect_markdown_headings` flows into `policy_hash` only;
  the `md-heading-v1` variant unconditionally treats headings as
  boundaries by name. To disable heading awareness, ship a different
  `chunker_version` (none in P1-5).

* M3: `# Panics` rustdoc on `policy_hash` documents the
  unreachable-in-practice failure mode of
  `serde_json_canonicalizer::to_vec` and explains why the `expect` is
  retained as future-proofing.

* M4: Inline comment at the `would_exceed` decision noting that
  `acc.text_tokens` already includes the prior chunk's overlap seed,
  and that the I3 clamp guarantees a flush here never produces a
  chunk smaller than the seed budget.

* Heading-path bullet in the behavior contract updated to reflect the
  I2 fix wording.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:52:18 +00:00
a81460f9d0 p1-5: clamp overlap seed budget to target_tokens / 2
A pathological `ChunkPolicy { overlap_tokens >= target_tokens }`
caused `md-heading-v1` to degenerate into 1-block-per-chunk: the
seeded `acc.text_tokens` already exceeded `target_tokens` before any
fresh content landed, so the next paragraph immediately tripped the
`would_exceed` flush.

Clamp the seed budget in `collect_overlap_seed` to
`min(overlap_tokens, target_tokens / 2)`. This guarantees that after
seeding, the chunk has at least `target/2` worth of room for new
content before flushing, restoring the intended paragraph-overlap
behavior on every reasonable and unreasonable policy.

Adds a regression test pinning a 50/200 (overlap = 4× target) policy
and asserting every emitted chunk holds ≥2 blocks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:51:00 +00:00
f780c71ce0 p1-5: heading-only chunk carries self in heading_path
When a Heading block opens a chunk and is followed by another Heading
or an atomic block (Code, Table, ImageRef, AudioRef) with no
intervening prose, the prior fallback used `common.heading_path` from
the heading itself — which per kb-normalize convention does NOT
include the heading's own text. Result: heading-only and heading-led
chunks for `# Alpha\n## Beta\n...` patterns landed with
`heading_path = []`, losing citation context.

Synthesize the leading heading into the chunk's heading_path when
blocks[0] is a Heading: parent path + heading.text. The first
non-Heading branch (existing logic for normal mid-section chunks) is
unchanged.

`chunk_id` recipe is `(doc_id, chunker_version, block_ids,
policy_hash)` — `heading_path` is not in the recipe, so this fix does
NOT shift chunk_ids. Snapshot baseline `long-section.chunks.snapshot.json`
also unchanged because every heading in that fixture is followed by a
paragraph (the bug only triggers on direct heading→heading or
heading→atomic adjacency).

Adds `heading_with_parents` test helper and a regression test pinning
the `# Alpha\n## Beta\n[code]` pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:50:12 +00:00
58f7b8573d p1-5: add long-section fixture + Vec<Chunk> snapshot test
Bakes the chunker output for fixtures/markdown/long-section.md (3 H1s,
nested H2 under Alpha, a 50-line code block, a 3-col x 4-row table,
and a multi-paragraph Gamma section) into the JSON snapshot baseline.
Confirms the priority rules end-to-end:

  - Heading boundaries hold across H1 → H2 → H1 transitions
  - The code block emits one chunk at 427 tokens > target=200
  - The table stays single-chunk
  - Gamma's paragraph stream splits with one block of overlap seed

A second test runs the full parse → normalize → chunk pipeline 5
times and asserts identical chunk_ids each pass.

Drops the unused `kb-config` and `serde` from regular dependencies —
they were declared but no source path imports them; `serde` flows in
transitively via `kb-core` as a public API requirement, and
`ChunkingCfg` lives in `kb-config` but the chunker takes
`ChunkPolicy` directly. Production deps are now exactly the allowed
set actually used: anyhow, blake3, kb-core, serde_json_canonicalizer,
tracing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:33:29 +00:00
0237022d0e p1-5: implement md-heading-v1 chunking rules
Fills in MdHeadingV1Chunker::chunk with the priority-ordered ruleset
from the design (§0 / §14):

  1. Heading is a hard boundary; the heading itself starts and is
     included in its chunk so heading text is retrievable.
  2. Code blocks never split, regardless of size.
  3. Tables stay single-chunk (row-split deferred per task spec).
  4. Long sections split at target_tokens with paragraph-level
     overlap_tokens worth of seeded tail blocks.
  5. ImageRef / AudioRef each become their own chunk with
     token_estimate = 0.
  6. heading_path lifts from the first contributing non-Heading
     block; source_spans concatenate in document order.
  7. chunk_id derives from id_for_chunk(doc_id, chunker_version,
     block_ids, policy_hash) per §4.2.

Covers the unit + determinism rows of the P1-5 test plan: heading
boundary respected, 800-token code block stays single, small table
stays single, long paragraph chain splits with overlap, ImageRef
chunk has token_estimate=0, and 1000-iter chunk_id determinism.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:30:26 +00:00
8142449eb7 p1-5: scaffold kb-chunk crate with MdHeadingV1Chunker skeleton
Adds the new workspace member with the bare Chunker impl shape:
chunker_version() returns "md-heading-v1"; policy_hash() blake3-hashes
canonical JSON of ChunkPolicy and truncates to 16 hex chars; chunk()
is an empty stub the next commits fill in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:27:42 +00:00