p1-5: doc rationale for respect_markdown_headings, policy_hash panic, overlap accounting

Doc-only follow-ups for review minors I1, M3, M4. No behavior change.

* I1: rustdoc on `MdHeadingV1Chunker` now records that
  `policy.respect_markdown_headings` flows into `policy_hash` only;
  the `md-heading-v1` variant unconditionally treats headings as
  boundaries by name. To disable heading awareness, ship a different
  `chunker_version` (none in P1-5).

* M3: `# Panics` rustdoc on `policy_hash` documents the
  unreachable-in-practice failure mode of
  `serde_json_canonicalizer::to_vec` and explains why the `expect` is
  retained as future-proofing.

* M4: Inline comment at the `would_exceed` decision noting that
  `acc.text_tokens` already includes the prior chunk's overlap seed,
  and that the I3 clamp guarantees a flush here never produces a
  chunk smaller than the seed budget.

* Heading-path bullet in the behavior contract updated to reflect the
  I2 fix wording.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-30 16:52:18 +00:00
parent a81460f9d0
commit ceeac9f974

View File

@@ -43,8 +43,10 @@ const POLICY_HASH_HEX_LEN: usize = 16;
/// with the previous chunk's tail blocks contributing roughly
/// `overlap_tokens` of content (paragraph-level overlap).
/// 5. **`heading_path` propagates.** Each chunk's `heading_path` is the
/// `heading_path` of its first contributing non-Heading block, or the
/// Heading block itself if that is the first.
/// `heading_path` of its first contributing non-Heading block, or
/// when the chunk leads with (or contains only) a Heading — the
/// parent path **plus the heading's own text** so heading-only or
/// heading-led chunks never lose their citation context.
/// 6. **`source_spans` merge.** A chunk lists every contributing block's
/// `source_span` in document order.
/// 7. **Version + policy hash recorded.** Each chunk records
@@ -62,6 +64,13 @@ const POLICY_HASH_HEX_LEN: usize = 16;
/// count — chunks sized against this proxy are guaranteed to fit in any
/// real BPE tokenizer's budget for English (~4 bytes/token) or Korean
/// (~3 bytes/token under E5/M-BERT). See [`BYTES_PER_TOKEN`] for rationale.
///
/// **`policy.respect_markdown_headings`.** This field flows into
/// `policy_hash` (so flipping it yields fresh chunk IDs), but the
/// chunker variant `md-heading-v1` unconditionally treats headings as
/// boundaries by design — the `md-heading-v1` name is the contract. To
/// disable heading awareness, ship a different `chunker_version`; none
/// is shipped in P1-5.
#[derive(Clone, Copy, Debug, Default)]
pub struct MdHeadingV1Chunker;
@@ -70,6 +79,19 @@ impl Chunker for MdHeadingV1Chunker {
ChunkerVersion(VERSION_LABEL.to_string())
}
/// Compute the policy hash folded into `chunk_id` per design §4.2.
///
/// # Panics
///
/// Panics if canonical JSON serialization of `ChunkPolicy` fails.
/// This is unreachable in practice — `ChunkPolicy` is composed of
/// owned primitives (`usize`, `bool`, owned `String`) and
/// `serde_json_canonicalizer::to_vec` only fails on
/// non-serializable values such as non-finite floats or maps with
/// non-string keys, neither of which can be constructed via
/// `ChunkPolicy`'s public surface. The `expect` is preserved as a
/// future-proofing guard against drift if `ChunkPolicy` ever gains
/// a field with such a property.
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
@@ -125,6 +147,11 @@ impl Chunker for MdHeadingV1Chunker {
// the current chunk and seed the next one with
// overlap from the prior tail.
let next_tokens = estimate_block_tokens(block);
// Note: `acc.text_tokens` already includes the prior
// chunk's overlap seed. The clamp in
// `collect_overlap_seed` keeps seed ≤ target/2, so
// a flush here never produces a chunk smaller than
// the seed budget.
let would_exceed = acc.text_tokens + next_tokens
> policy.target_tokens
&& acc.has_non_heading_content();