Doc-only follow-ups for review minors I1, M3, M4. No behavior change.
* I1: rustdoc on `MdHeadingV1Chunker` now records that
`policy.respect_markdown_headings` flows into `policy_hash` only;
the `md-heading-v1` variant unconditionally treats headings as
boundaries by name. To disable heading awareness, ship a different
`chunker_version` (none in P1-5).
* M3: `# Panics` rustdoc on `policy_hash` documents the
unreachable-in-practice failure mode of
`serde_json_canonicalizer::to_vec` and explains why the `expect` is
retained as future-proofing.
* M4: Inline comment at the `would_exceed` decision noting that
`acc.text_tokens` already includes the prior chunk's overlap seed,
and that the I3 clamp guarantees a flush here never produces a
chunk smaller than the seed budget.
* Heading-path bullet in the behavior contract updated to reflect the
I2 fix wording.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A pathological `ChunkPolicy { overlap_tokens >= target_tokens }`
caused `md-heading-v1` to degenerate into 1-block-per-chunk: the
seeded `acc.text_tokens` already exceeded `target_tokens` before any
fresh content landed, so the next paragraph immediately tripped the
`would_exceed` flush.
Clamp the seed budget in `collect_overlap_seed` to
`min(overlap_tokens, target_tokens / 2)`. This guarantees that after
seeding, the chunk has at least `target/2` worth of room for new
content before flushing, restoring the intended paragraph-overlap
behavior on every reasonable and unreasonable policy.
Adds a regression test pinning a 50/200 (overlap = 4× target) policy
and asserting every emitted chunk holds ≥2 blocks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a Heading block opens a chunk and is followed by another Heading
or an atomic block (Code, Table, ImageRef, AudioRef) with no
intervening prose, the prior fallback used `common.heading_path` from
the heading itself — which per kb-normalize convention does NOT
include the heading's own text. Result: heading-only and heading-led
chunks for `# Alpha\n## Beta\n...` patterns landed with
`heading_path = []`, losing citation context.
Synthesize the leading heading into the chunk's heading_path when
blocks[0] is a Heading: parent path + heading.text. The first
non-Heading branch (existing logic for normal mid-section chunks) is
unchanged.
`chunk_id` recipe is `(doc_id, chunker_version, block_ids,
policy_hash)` — `heading_path` is not in the recipe, so this fix does
NOT shift chunk_ids. Snapshot baseline `long-section.chunks.snapshot.json`
also unchanged because every heading in that fixture is followed by a
paragraph (the bug only triggers on direct heading→heading or
heading→atomic adjacency).
Adds `heading_with_parents` test helper and a regression test pinning
the `# Alpha\n## Beta\n[code]` pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bakes the chunker output for fixtures/markdown/long-section.md (3 H1s,
nested H2 under Alpha, a 50-line code block, a 3-col x 4-row table,
and a multi-paragraph Gamma section) into the JSON snapshot baseline.
Confirms the priority rules end-to-end:
- Heading boundaries hold across H1 → H2 → H1 transitions
- The code block emits one chunk at 427 tokens > target=200
- The table stays single-chunk
- Gamma's paragraph stream splits with one block of overlap seed
A second test runs the full parse → normalize → chunk pipeline 5
times and asserts identical chunk_ids each pass.
Drops the unused `kb-config` and `serde` from regular dependencies —
they were declared but no source path imports them; `serde` flows in
transitively via `kb-core` as a public API requirement, and
`ChunkingCfg` lives in `kb-config` but the chunker takes
`ChunkPolicy` directly. Production deps are now exactly the allowed
set actually used: anyhow, blake3, kb-core, serde_json_canonicalizer,
tracing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fills in MdHeadingV1Chunker::chunk with the priority-ordered ruleset
from the design (§0 / §14):
1. Heading is a hard boundary; the heading itself starts and is
included in its chunk so heading text is retrievable.
2. Code blocks never split, regardless of size.
3. Tables stay single-chunk (row-split deferred per task spec).
4. Long sections split at target_tokens with paragraph-level
overlap_tokens worth of seeded tail blocks.
5. ImageRef / AudioRef each become their own chunk with
token_estimate = 0.
6. heading_path lifts from the first contributing non-Heading
block; source_spans concatenate in document order.
7. chunk_id derives from id_for_chunk(doc_id, chunker_version,
block_ids, policy_hash) per §4.2.
Covers the unit + determinism rows of the P1-5 test plan: heading
boundary respected, 800-token code block stays single, small table
stays single, long paragraph chain splits with overlap, ImageRef
chunk has token_estimate=0, and 1000-iter chunk_id determinism.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the new workspace member with the bare Chunker impl shape:
chunker_version() returns "md-heading-v1"; policy_hash() blake3-hashes
canonical JSON of ChunkPolicy and truncates to 16 hex chars; chunk()
is an empty stub the next commits fill in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>