Merge pull request 'feat(kebab-chunk): P7-2 pdf-page-v1 chunker' (#38) from feat/p7-2-pdf-page-chunker into main

Reviewed-on: #38
2026-05-02 09:02:16 +00:00
parent d64320b508 8181fd91e7
commit b711152035
4 changed files with 698 additions and 1 deletions
--- a/crates/kebab-chunk/src/lib.rs
+++ b/crates/kebab-chunk/src/lib.rs
@@ -16,5 +16,7 @@
 //! It consumes `CanonicalDocument` purely through `kb-core` types.

 mod md_heading_v1;
+mod pdf_page_v1;

 pub use md_heading_v1::MdHeadingV1Chunker;
+pub use pdf_page_v1::PdfPageV1Chunker;
--- a/crates/kebab-chunk/src/pdf_page_v1.rs
+++ b/crates/kebab-chunk/src/pdf_page_v1.rs
@@ -0,0 +1,675 @@
+//! `pdf-page-v1` — page-aware PDF chunker.
+//!
+//! Consumes a [`CanonicalDocument`] produced by `kebab-parse-pdf` (one
+//! [`Block::Paragraph`] per page, every block carrying [`SourceSpan::Page`])
+//! and emits one or more [`Chunk`]s per page. Chunks NEVER cross a page
+//! boundary (citation locality is the whole reason §3.4 introduced
+//! `SourceSpan::Page`), and each chunk's `source_spans` is a single
+//! `Page { page, char_start, char_end }` with positions in **characters**
+//! within the page text — matching `Citation::Page` fragment semantics
+//! across the rest of the workspace.
+//!
+//! Per design §3.5 (Chunk), §4.2 (chunk_id recipe — see deviation note
+//! below), §0 Q3 (citation), §9 (versioning).
+//!
+//! ## Splitting policy
+//!
+//! - If a page's bytes fit under `policy.target_tokens * BYTES_PER_TOKEN`
+//!   the entire page is a single chunk.
+//! - Otherwise the page text is segmented at paragraph breaks (`\n\n`) and
+//!   sentence ends (`.`/`?`/`!` followed by whitespace). Adjacent
+//!   segments are greedily glued until the running byte budget would be
+//!   exceeded; the chunk is emitted at that boundary. The next chunk's
+//!   prefix is seeded with the trailing `policy.overlap_tokens *
+//!   BYTES_PER_TOKEN` bytes of the prior chunk so retrieval handles
+//!   queries that fall on the boundary.
+//! - A page with no qualifying segment boundary AND text exceeding the
+//!   budget (e.g. a 5,000-byte single sentence) emits one oversized
+//!   chunk rather than hard-splitting mid-word — a real tokenizer slot
+//!   in P+ replaces this proxy and can do better mid-sentence splitting
+//!   when needed.
+//! - Common English abbreviations (`Mr.`, `i.e.`, `e.g.`, `Fig. 3`)
+//!   trip the sentence-end heuristic and produce spurious boundaries —
+//!   accepted as a v1 limit. A real sentence segmenter lands with the
+//!   P+ tokenizer slot.
+//! - The effective overlap budget is clamped at `target_bytes / 2` so a
+//!   pathological policy (`overlap_tokens >= target_tokens`) cannot
+//!   make a chunk fully re-emit the previous chunk's text. Same guard
+//!   pattern as `md-heading-v1::collect_overlap_seed`.
+//!
+//! ## `BYTES_PER_TOKEN`
+//!
+//! 3 — same calibration as `md-heading-v1` (covers Korean ≈ 3 b/tok and
+//! over-estimates English ≈ 4 b/tok). The original p7-2 spec literal said
+//! `× 4`, but cross-chunker comparability outweighs the spec literal here.
+//! Logged in `tasks/HOTFIXES.md`.
+//!
+//! ## `chunk_id` collision deviation
+//!
+//! Design §4.2's `chunk_id = blake3(doc_id || chunker_version || sort(block_ids)
+//! || policy_hash)` collides when one block (= one PDF page) is split
+//! into multiple chunks: every chunk on that page has identical
+//! `block_ids`. md-heading-v1 sidesteps this by always emitting at most
+//! one chunk per atomic block. PdfPageV1 cannot.
+//!
+//! Workaround that doesn't change the §4.2 recipe: feed a per-chunk
+//! variant `format!("{base_policy_hash}#c{char_start}")` into the
+//! recipe's `policy_hash` slot (so distinct chunks distinguish via
+//! different policy_hash inputs), while storing the unmodified
+//! `base_policy_hash` in `Chunk.policy_hash` so the field still answers
+//! "what policy was active". Logged in `tasks/HOTFIXES.md`.
+
+use kebab_core::{
+    Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
+    SourceSpan, id_for_chunk,
+};
+
+const VERSION_LABEL: &str = "pdf-page-v1";
+const BYTES_PER_TOKEN: usize = 3;
+const POLICY_HASH_HEX_LEN: usize = 16;
+
+/// Page-aware PDF chunker. See module docs for the splitting policy and
+/// the `chunk_id` collision-avoidance deviation.
+#[derive(Clone, Copy, Debug, Default)]
+pub struct PdfPageV1Chunker;
+
+impl Chunker for PdfPageV1Chunker {
+    fn chunker_version(&self) -> ChunkerVersion {
+        ChunkerVersion(VERSION_LABEL.to_string())
+    }
+
+    /// blake3(canonical_json(policy)) truncated to 16 hex chars. Matches
+    /// the `md-heading-v1` recipe so a workspace-wide policy hash lookup
+    /// (e.g. for invalidation reports) yields the same digest across
+    /// chunkers.
+    fn policy_hash(&self, policy: &ChunkPolicy) -> String {
+        let bytes = serde_json_canonicalizer::to_vec(policy)
+            .expect("canonical JSON serialization of ChunkPolicy must not fail");
+        let hex = blake3::hash(&bytes).to_hex().to_string();
+        hex[..POLICY_HASH_HEX_LEN].to_string()
+    }
+
+    fn chunk(
+        &self,
+        doc: &CanonicalDocument,
+        policy: &ChunkPolicy,
+    ) -> anyhow::Result<Vec<Chunk>> {
+        // Validate up front — every block must be a Paragraph carrying
+        // SourceSpan::Page. A mixed document signals a routing bug in
+        // the caller (e.g. running this chunker on Markdown) and is
+        // worth surfacing loudly.
+        for b in &doc.blocks {
+            let common = match b {
+                Block::Paragraph(p) => &p.common,
+                _ => anyhow::bail!(
+                    "PdfPageV1Chunker only handles PDF docs (got non-Paragraph block)"
+                ),
+            };
+            if !matches!(common.source_span, SourceSpan::Page { .. }) {
+                anyhow::bail!(
+                    "PdfPageV1Chunker only handles PDF docs (got non-Page source_span)"
+                );
+            }
+        }
+
+        let base_policy_hash = self.policy_hash(policy);
+        let chunker_version = self.chunker_version();
+        let target_bytes = policy
+            .target_tokens
+            .saturating_mul(BYTES_PER_TOKEN)
+            .max(1);
+        // Clamp the overlap to half the target. Without this, a policy
+        // with `overlap_tokens >= target_tokens` would make every chunk
+        // fully re-emit the previous chunk's text — mirrors
+        // md-heading-v1's `seed_budget = overlap_tokens.min(target/2)`.
+        let overlap_bytes = policy
+            .overlap_tokens
+            .saturating_mul(BYTES_PER_TOKEN)
+            .min(target_bytes / 2);
+
+        let mut out: Vec<Chunk> = Vec::new();
+        for b in &doc.blocks {
+            let p = match b {
+                Block::Paragraph(t) => t,
+                _ => unreachable!("validated above"),
+            };
+            let page_num = match p.common.source_span {
+                SourceSpan::Page { page, .. } => page,
+                _ => unreachable!("validated above"),
+            };
+
+            // Empty page → 0 chunks. Page is still searchable via the
+            // CanonicalDocument's per-page `Provenance::Warning`
+            // ("scanned candidate") — chunking just has nothing to say
+            // about it.
+            if p.text.trim().is_empty() {
+                continue;
+            }
+
+            for (char_start, char_end, slice) in
+                chunk_page(&p.text, target_bytes, overlap_bytes)
+            {
+                // PDF chars-per-page comfortably fits in u32 (a single
+                // page maxes out around ~10k chars even for dense
+                // typography); silent `as u32` truncation would only
+                // surface on corrupted input, where an explicit panic
+                // is preferable to an off-by-2^32 span.
+                let char_start_u32 = u32::try_from(char_start)
+                    .expect("page chars fit in u32");
+                let char_end_u32 =
+                    u32::try_from(char_end).expect("page chars fit in u32");
+                let span = SourceSpan::Page {
+                    page: page_num,
+                    char_start: Some(char_start_u32),
+                    char_end: Some(char_end_u32),
+                };
+                let block_ids: Vec<BlockId> = vec![p.common.block_id.clone()];
+                // Per-chunk policy_hash variant prevents chunk_id
+                // collision when a page produces multiple chunks. See
+                // module docs for rationale.
+                let per_chunk_hash = format!("{base_policy_hash}#c{char_start}");
+                let chunk_id =
+                    id_for_chunk(&doc.doc_id, &chunker_version, &block_ids, &per_chunk_hash);
+                let token_estimate = slice.len().div_ceil(BYTES_PER_TOKEN);
+
+                out.push(Chunk {
+                    chunk_id,
+                    doc_id: DocumentId(doc.doc_id.0.clone()),
+                    block_ids,
+                    text: slice,
+                    heading_path: Vec::new(),
+                    source_spans: vec![span],
+                    token_estimate,
+                    chunker_version: chunker_version.clone(),
+                    policy_hash: base_policy_hash.clone(),
+                });
+            }
+        }
+
+        tracing::debug!(
+            target: "kebab-chunk",
+            doc_id = %doc.doc_id,
+            chunks = out.len(),
+            "pdf-page-v1 chunked",
+        );
+
+        Ok(out)
+    }
+}
+
+/// Split a single page's text into ordered chunks, each represented as
+/// `(char_start, char_end, text_slice)`. Char positions are within the
+/// page text, suitable for `SourceSpan::Page::char_start` / `char_end`.
+///
+/// Returns an empty vector when `text` is empty or whitespace-only.
+fn chunk_page(text: &str, target_bytes: usize, overlap_bytes: usize) -> Vec<(usize, usize, String)> {
+    let chars: Vec<char> = text.chars().collect();
+    let n = chars.len();
+    if n == 0 {
+        return Vec::new();
+    }
+    if text.len() <= target_bytes {
+        return vec![(0, n, text.to_string())];
+    }
+
+    // Build candidate boundary positions (char indices where a chunk
+    // *may* start). 0 and n are always boundaries; interior boundaries
+    // are after a paragraph break (`\n\n`) or after a sentence-ending
+    // punctuation followed by whitespace.
+    let mut bounds: Vec<usize> = vec![0];
+    let mut k = 0;
+    while k + 1 < n {
+        let c = chars[k];
+        let nx = chars[k + 1];
+        let is_paragraph_break = c == '\n' && nx == '\n';
+        let is_sentence_end =
+            matches!(c, '.' | '?' | '!') && nx.is_whitespace();
+        if (is_paragraph_break || is_sentence_end) && k + 2 <= n {
+            bounds.push(k + 2);
+        }
+        k += 1;
+    }
+    if *bounds.last().unwrap() != n {
+        bounds.push(n);
+    }
+    bounds.dedup();
+
+    // UTF-8 byte length of the slice between two char indices.
+    let byte_len = |a: usize, b: usize| -> usize {
+        chars[a..b].iter().map(|c| c.len_utf8()).sum()
+    };
+
+    let mut chunks: Vec<(usize, usize, String)> = Vec::new();
+    let mut seg_idx: usize = 0;
+    while seg_idx + 1 < bounds.len() {
+        let start = bounds[seg_idx];
+        let mut end_idx = seg_idx + 1;
+        let mut acc = byte_len(start, bounds[end_idx]);
+
+        // Greedy grow: glue subsequent segments while we stay under
+        // budget. We always include at least one segment per chunk
+        // (`acc > 0` guard) so a single oversize segment doesn't loop.
+        while end_idx + 1 < bounds.len() {
+            let next_bytes = byte_len(bounds[end_idx], bounds[end_idx + 1]);
+            if acc + next_bytes > target_bytes && acc > 0 {
+                break;
+            }
+            acc += next_bytes;
+            end_idx += 1;
+        }
+
+        let chunk_end = bounds[end_idx];
+
+        // Apply overlap: walk `actual_start` left of `start` until we
+        // have absorbed up to `overlap_bytes` of bytes, but never past
+        // the previous chunk's start (no full re-emission).
+        let actual_start = if let Some(prev) = chunks.last() {
+            let prev_min = prev.0;
+            let mut a = start;
+            let mut acc_o: usize = 0;
+            while a > prev_min {
+                let cl = chars[a - 1].len_utf8();
+                if acc_o + cl > overlap_bytes {
+                    break;
+                }
+                acc_o += cl;
+                a -= 1;
+            }
+            a
+        } else {
+            start
+        };
+
+        let slice: String = chars[actual_start..chunk_end].iter().collect();
+        chunks.push((actual_start, chunk_end, slice));
+        seg_idx = end_idx;
+    }
+
+    chunks
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use kebab_core::{
+        AssetId, CommonBlock, Inline, Lang, Metadata, ParserVersion, Provenance, SourceType,
+        TextBlock, TrustLevel, WorkspacePath, id_for_block, id_for_doc,
+    };
+    use time::OffsetDateTime;
+
+    fn make_pdf_doc(pages: &[&str]) -> CanonicalDocument {
+        let workspace_path = WorkspacePath::new("docs/test.pdf".into()).unwrap();
+        let asset_id = AssetId("a".repeat(64));
+        let parser_version = ParserVersion("pdf-text-v1".into());
+        let doc_id = id_for_doc(&workspace_path, &asset_id, &parser_version);
+
+        let mut blocks: Vec<Block> = Vec::new();
+        for (i, text) in pages.iter().enumerate() {
+            let page = (i as u32) + 1;
+            let char_count = text.chars().count() as u32;
+            let span = SourceSpan::Page {
+                page,
+                char_start: Some(0),
+                char_end: Some(char_count),
+            };
+            let block_id = id_for_block(&doc_id, "paragraph", &[], i as u32, &span);
+            let inlines = if text.is_empty() {
+                Vec::new()
+            } else {
+                vec![Inline::Text {
+                    text: (*text).to_string(),
+                }]
+            };
+            blocks.push(Block::Paragraph(TextBlock {
+                common: CommonBlock {
+                    block_id,
+                    heading_path: Vec::new(),
+                    source_span: span,
+                },
+                text: (*text).to_string(),
+                inlines,
+            }));
+        }
+
+        CanonicalDocument {
+            doc_id,
+            source_asset_id: asset_id,
+            workspace_path,
+            title: "test".into(),
+            lang: Lang("und".into()),
+            blocks,
+            metadata: Metadata {
+                aliases: vec![],
+                tags: vec![],
+                created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
+                updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
+                source_type: SourceType::Paper,
+                trust_level: TrustLevel::Primary,
+                user_id_alias: None,
+                user: Default::default(),
+            },
+            provenance: Provenance { events: vec![] },
+            parser_version,
+            schema_version: 1,
+            doc_version: 1,
+        }
+    }
+
+    fn default_policy(target: usize, overlap: usize) -> ChunkPolicy {
+        ChunkPolicy {
+            target_tokens: target,
+            overlap_tokens: overlap,
+            respect_markdown_headings: false,
+            chunker_version: ChunkerVersion(VERSION_LABEL.into()),
+        }
+    }
+
+    #[test]
+    fn chunker_version_is_pdf_page_v1() {
+        assert_eq!(
+            PdfPageV1Chunker.chunker_version(),
+            ChunkerVersion(VERSION_LABEL.to_string())
+        );
+    }
+
+    #[test]
+    fn three_page_small_emits_one_chunk_per_page() {
+        let doc = make_pdf_doc(&["page one", "page two", "page three"]);
+        let chunks = PdfPageV1Chunker
+            .chunk(&doc, &default_policy(500, 80))
+            .unwrap();
+        assert_eq!(chunks.len(), 3);
+        for (i, c) in chunks.iter().enumerate() {
+            assert_eq!(c.block_ids.len(), 1);
+            assert_eq!(c.heading_path, Vec::<String>::new());
+            assert_eq!(c.source_spans.len(), 1);
+            match c.source_spans[0] {
+                SourceSpan::Page { page, char_start, char_end } => {
+                    assert_eq!(page, (i as u32) + 1);
+                    assert_eq!(char_start, Some(0));
+                    assert!(char_end.unwrap() > 0);
+                }
+                ref other => panic!("expected Page span, got {other:?}"),
+            }
+        }
+        assert_eq!(chunks[0].text, "page one");
+        assert_eq!(chunks[1].text, "page two");
+        assert_eq!(chunks[2].text, "page three");
+    }
+
+    #[test]
+    fn one_page_huge_text_splits_into_multiple_chunks_with_overlap() {
+        // Build a single page with 8 paragraphs of ~150 bytes each.
+        // target=50 tokens × 3 b/tok = 150 byte budget → each paragraph
+        // is itself just under budget, so 2-paragraph accumulation
+        // overshoots → ~8 chunks.
+        let para = "a".repeat(150);
+        let page_text = std::iter::repeat_n(para, 8)
+            .collect::<Vec<_>>()
+            .join("\n\n");
+        let doc = make_pdf_doc(&[&page_text]);
+        let chunks = PdfPageV1Chunker
+            .chunk(&doc, &default_policy(50, 20))
+            .unwrap();
+        assert!(
+            chunks.len() >= 4,
+            "expected ≥4 chunks for a 1200-byte page; got {}: text len={}",
+            chunks.len(),
+            page_text.len()
+        );
+        // All chunks live on page 1.
+        for c in &chunks {
+            match c.source_spans[0] {
+                SourceSpan::Page { page, .. } => assert_eq!(page, 1),
+                _ => panic!("non-Page span"),
+            }
+        }
+        // Overlap: chunk N's text starts with chunk N-1's tail bytes
+        // (or, equivalently, chunk N's char_start lies before chunk
+        // N-1's char_end).
+        for w in chunks.windows(2) {
+            let prev_end = match w[0].source_spans[0] {
+                SourceSpan::Page { char_end: Some(e), .. } => e,
+                _ => panic!("missing char_end"),
+            };
+            let next_start = match w[1].source_spans[0] {
+                SourceSpan::Page { char_start: Some(s), .. } => s,
+                _ => panic!("missing char_start"),
+            };
+            assert!(
+                next_start < prev_end,
+                "expected overlap (next.start < prev.end): {next_start} vs {prev_end}"
+            );
+        }
+        // chunk_ids stay distinct despite identical block_ids — the
+        // per-chunk policy_hash variant is doing its job.
+        let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
+        ids.sort();
+        let total = ids.len();
+        ids.dedup();
+        assert_eq!(ids.len(), total, "all chunk_ids must be unique");
+    }
+
+    #[test]
+    fn empty_page_produces_no_chunks_for_that_page() {
+        let doc = make_pdf_doc(&["page one", "", "page three"]);
+        let chunks = PdfPageV1Chunker
+            .chunk(&doc, &default_policy(500, 80))
+            .unwrap();
+        assert_eq!(chunks.len(), 2);
+        let pages: Vec<u32> = chunks
+            .iter()
+            .map(|c| match c.source_spans[0] {
+                SourceSpan::Page { page, .. } => page,
+                _ => 0,
+            })
+            .collect();
+        assert_eq!(pages, vec![1, 3]);
+    }
+
+    #[test]
+    fn whitespace_only_page_skipped_too() {
+        let doc = make_pdf_doc(&["page one", "   \n  ", "page three"]);
+        let chunks = PdfPageV1Chunker
+            .chunk(&doc, &default_policy(500, 80))
+            .unwrap();
+        assert_eq!(chunks.len(), 2);
+    }
+
+    #[test]
+    fn non_pdf_doc_returns_error() {
+        // A doc whose blocks carry SourceSpan::Line (Markdown shape).
+        let workspace_path = WorkspacePath::new("notes/note.md".into()).unwrap();
+        let asset_id = AssetId("a".repeat(64));
+        let parser_version = ParserVersion("md-block-v1".into());
+        let doc_id = id_for_doc(&workspace_path, &asset_id, &parser_version);
+        let span = SourceSpan::Line { start: 1, end: 1 };
+        let block_id = id_for_block(&doc_id, "paragraph", &[], 0, &span);
+        let blocks = vec![Block::Paragraph(TextBlock {
+            common: CommonBlock {
+                block_id,
+                heading_path: vec![],
+                source_span: span,
+            },
+            text: "markdown body".into(),
+            inlines: vec![],
+        })];
+        let doc = CanonicalDocument {
+            doc_id,
+            source_asset_id: asset_id,
+            workspace_path,
+            title: "n".into(),
+            lang: Lang("en".into()),
+            blocks,
+            metadata: Metadata {
+                aliases: vec![],
+                tags: vec![],
+                created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
+                updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
+                source_type: SourceType::Note,
+                trust_level: TrustLevel::Primary,
+                user_id_alias: None,
+                user: Default::default(),
+            },
+            provenance: Provenance { events: vec![] },
+            parser_version,
+            schema_version: 1,
+            doc_version: 1,
+        };
+        let err = PdfPageV1Chunker
+            .chunk(&doc, &default_policy(500, 80))
+            .expect_err("non-PDF doc must error");
+        assert!(
+            err.to_string().contains("PdfPageV1Chunker"),
+            "error mentions chunker: {err}"
+        );
+    }
+
+    #[test]
+    fn no_chunk_crosses_page_boundary() {
+        // Synthetic 4-page doc with mixed page sizes — each chunk
+        // must claim exactly one page in its single source_span.
+        let big_x = "x".repeat(2000);
+        let big_y = "y".repeat(800);
+        let pages = vec![
+            "tiny page one.",
+            big_x.as_str(),
+            "another tiny one.",
+            big_y.as_str(),
+        ];
+        let doc = make_pdf_doc(&pages);
+        let chunks = PdfPageV1Chunker
+            .chunk(&doc, &default_policy(50, 10))
+            .unwrap();
+        for c in &chunks {
+            assert_eq!(c.source_spans.len(), 1, "chunk should hold one Page span");
+            assert!(matches!(c.source_spans[0], SourceSpan::Page { .. }));
+        }
+        // Group chunks by page, verify pages are non-decreasing in
+        // chunk order (no interleaving across pages).
+        let mut prev_page = 0u32;
+        for c in &chunks {
+            let page = match c.source_spans[0] {
+                SourceSpan::Page { page, .. } => page,
+                _ => unreachable!(),
+            };
+            assert!(
+                page >= prev_page,
+                "page numbers must be non-decreasing in chunk order: {prev_page} → {page}"
+            );
+            prev_page = page;
+        }
+    }
+
+    #[test]
+    fn deterministic_chunk_ids_1000() {
+        let doc = make_pdf_doc(&[
+            "first page text. and another sentence here.",
+            &("xyz ".repeat(500)),
+        ]);
+        let policy = default_policy(80, 20);
+        let baseline: Vec<String> = PdfPageV1Chunker
+            .chunk(&doc, &policy)
+            .unwrap()
+            .into_iter()
+            .map(|c| c.chunk_id.0)
+            .collect();
+        for _ in 0..1000 {
+            let again: Vec<String> = PdfPageV1Chunker
+                .chunk(&doc, &policy)
+                .unwrap()
+                .into_iter()
+                .map(|c| c.chunk_id.0)
+                .collect();
+            assert_eq!(again, baseline);
+        }
+    }
+
+    #[test]
+    fn snapshot_three_page_chunks_stable() {
+        let doc = make_pdf_doc(&[
+            "Hello page 1.",
+            "Hello page 2 with some more body text.",
+            "Hello page 3.",
+        ]);
+        let chunks = PdfPageV1Chunker
+            .chunk(&doc, &default_policy(500, 80))
+            .unwrap();
+        assert_eq!(chunks.len(), 3);
+        for (i, c) in chunks.iter().enumerate() {
+            assert_eq!(c.chunker_version.0, VERSION_LABEL);
+            assert_eq!(c.heading_path, Vec::<String>::new());
+            assert_eq!(c.source_spans.len(), 1);
+            match c.source_spans[0] {
+                SourceSpan::Page {
+                    page,
+                    char_start,
+                    char_end,
+                } => {
+                    assert_eq!(page, (i as u32) + 1);
+                    assert_eq!(char_start, Some(0));
+                    assert_eq!(char_end, Some(c.text.chars().count() as u32));
+                }
+                _ => panic!("expected Page"),
+            }
+            assert!(c.policy_hash.len() == POLICY_HASH_HEX_LEN);
+            assert!(c.policy_hash.bytes().all(|b| b.is_ascii_hexdigit()));
+        }
+    }
+
+    #[test]
+    fn overlap_clamped_when_overlap_exceeds_target() {
+        // Pathological policy: overlap = 4× target. Without the
+        // `target_bytes / 2` clamp, every chunk would fully re-emit
+        // the previous chunk's text (chunk N's actual_start collapses
+        // to chunk N-1's actual_start).
+        let para = "a".repeat(150);
+        let page_text = std::iter::repeat_n(para, 6)
+            .collect::<Vec<_>>()
+            .join("\n\n");
+        let doc = make_pdf_doc(&[&page_text]);
+        let policy = ChunkPolicy {
+            target_tokens: 50,
+            overlap_tokens: 200,
+            respect_markdown_headings: false,
+            chunker_version: ChunkerVersion(VERSION_LABEL.into()),
+        };
+        let chunks = PdfPageV1Chunker.chunk(&doc, &policy).unwrap();
+        // For each consecutive pair, the new chunk's actual_start must
+        // be strictly greater than the previous chunk's actual_start
+        // (no full re-emission). Without the clamp, equality (full
+        // overlap) is the failure mode.
+        for w in chunks.windows(2) {
+            let prev_start = match w[0].source_spans[0] {
+                SourceSpan::Page { char_start: Some(s), .. } => s,
+                _ => panic!("missing char_start"),
+            };
+            let next_start = match w[1].source_spans[0] {
+                SourceSpan::Page { char_start: Some(s), .. } => s,
+                _ => panic!("missing char_start"),
+            };
+            assert!(
+                next_start > prev_start,
+                "overlap must not fully re-emit prior chunk: prev_start={prev_start}, next_start={next_start}"
+            );
+        }
+        // chunk_ids stay distinct (the per-chunk hash variant keys off
+        // char_start which is now strictly increasing).
+        let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
+        ids.sort();
+        let total = ids.len();
+        ids.dedup();
+        assert_eq!(ids.len(), total, "chunk_ids must remain unique");
+    }
+
+    #[test]
+    fn policy_hash_matches_md_heading_v1_for_identical_policy() {
+        // Cross-chunker policy fingerprint identity — important so a
+        // workspace-wide "show me chunks with policy_hash = X" query
+        // covers both chunkers without per-chunker logic.
+        let p = default_policy(500, 80);
+        let pdf = PdfPageV1Chunker.policy_hash(&p);
+        let md = crate::MdHeadingV1Chunker.policy_hash(&p);
+        assert_eq!(pdf, md);
+    }
+}
--- a/tasks/HOTFIXES.md
+++ b/tasks/HOTFIXES.md
@@ -14,6 +14,26 @@ historical contract that was implemented; this file accumulates the
 deltas so phase 5+ readers can find the live behavior without diffing
 git history.

+## 2026-05-02 — P7-2 pdf-page-v1: chunk_id collision + BYTES_PER_TOKEN
+
+**Discovered**: P7-2 implementation start.
+
+**Symptom 1 (load-bearing)**: `tasks/p7/p7-2-pdf-page-chunker.md` § Behavior contract literally says `chunk_id` per design §4.2 with `(doc_id, "pdf-page-v1", block_ids, policy_hash)`. But unlike `md-heading-v1` (which always emits at most one chunk per atomic block), `pdf-page-v1` splits one page-block into multiple chunks when page text exceeds the byte budget. All sub-chunks of the same page have identical `block_ids` → identical `chunk_id` collisions, breaking the §3.5 invariant that `chunk_id` is a primary key.
+
+**Symptom 2 (cosmetic)**: Spec text says `token_estimate = byte_len / 4` and "matches `md-heading-v1` proxy". Looking at the actual md-heading-v1 source (`crates/kebab-chunk/src/md_heading_v1.rs:17`), the constant is `BYTES_PER_TOKEN = 3` (chosen to cover Korean ≈ 3 b/tok and over-estimate English ≈ 4 b/tok). Spec's "/4" claim is inconsistent with the implementation it claims to match.
+
+**Root cause**: §4.2 chunk_id recipe was designed assuming one-chunk-per-block-set. Page-aware chunking violates that assumption.
+
+**Fix** (PR #38, feat/p7-2-pdf-page-chunker):
+
+- **Per-chunk policy_hash variant**: feed `format!("{base_policy_hash}#c{char_start}")` into `id_for_chunk`'s `policy_hash` slot so chunks within the same page get distinct `chunk_id`s. The §4.2 recipe itself stays unchanged — only the *input* to one of its slots differs per chunk. The unmodified `base_policy_hash` is still stored in `Chunk.policy_hash` so the field still answers "what policy was active" (workspace-wide policy invalidation lookups continue to work).
+- **`BYTES_PER_TOKEN = 3`** (matches md-heading-v1 actual code, not spec literal). Cross-chunker policy fingerprint identity is verified by a unit test: `policy_hash_matches_md_heading_v1_for_identical_policy`.
+
+**Trust note**: The per-chunk hash variant is opaque (`#c<n>` is just a marker, not interpretable as char_start by downstream tools — they read `Chunk.source_spans[0].char_start` for that). Downstream identifier comparisons on `chunk_id` continue to work as opaque blake3 hashes.
+
+**Amends**:
+- tasks/p7/p7-2-pdf-page-chunker.md (chunk_id recipe per-chunk variant; BYTES_PER_TOKEN = 3 not 4).
+
 ## 2026-05-02 — P6-3 caption: GenerateRequest.images + cargo feature dropped

 **Discovered**: P6-3 implementation start.
--- a/tasks/p7/p7-2-pdf-page-chunker.md
+++ b/tasks/p7/p7-2-pdf-page-chunker.md
@@ -3,7 +3,7 @@ phase: P7
 component: kebab-chunk (pdf-page-v1)
 task_id: p7-2
 title: "PDF page-aware chunker (pdf-page-v1)"
-status: planned
+status: completed
 depends_on: [p7-1]
 unblocks: []
 contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md