feat(kebab-chunk): P7-2 pdf-page-v1 chunker #38

Merged
altair823 merged 2 commits from feat/p7-2-pdf-page-chunker into main 2026-05-02 09:02:18 +00:00
4 changed files with 698 additions and 1 deletions

View File

@@ -16,5 +16,7 @@
//! It consumes `CanonicalDocument` purely through `kb-core` types.
mod md_heading_v1;
mod pdf_page_v1;
pub use md_heading_v1::MdHeadingV1Chunker;
pub use pdf_page_v1::PdfPageV1Chunker;

View File

@@ -0,0 +1,675 @@
//! `pdf-page-v1` — page-aware PDF chunker.
//!
//! Consumes a [`CanonicalDocument`] produced by `kebab-parse-pdf` (one
//! [`Block::Paragraph`] per page, every block carrying [`SourceSpan::Page`])
//! and emits one or more [`Chunk`]s per page. Chunks NEVER cross a page
//! boundary (citation locality is the whole reason §3.4 introduced
//! `SourceSpan::Page`), and each chunk's `source_spans` is a single
//! `Page { page, char_start, char_end }` with positions in **characters**
//! within the page text — matching `Citation::Page` fragment semantics
//! across the rest of the workspace.
//!
//! Per design §3.5 (Chunk), §4.2 (chunk_id recipe — see deviation note
//! below), §0 Q3 (citation), §9 (versioning).
//!
//! ## Splitting policy
//!
//! - If a page's bytes fit under `policy.target_tokens * BYTES_PER_TOKEN`
//! the entire page is a single chunk.
//! - Otherwise the page text is segmented at paragraph breaks (`\n\n`) and
//! sentence ends (`.`/`?`/`!` followed by whitespace). Adjacent
//! segments are greedily glued until the running byte budget would be
//! exceeded; the chunk is emitted at that boundary. The next chunk's
//! prefix is seeded with the trailing `policy.overlap_tokens *
//! BYTES_PER_TOKEN` bytes of the prior chunk so retrieval handles
//! queries that fall on the boundary.
//! - A page with no qualifying segment boundary AND text exceeding the
//! budget (e.g. a 5,000-byte single sentence) emits one oversized
//! chunk rather than hard-splitting mid-word — a real tokenizer slot
//! in P+ replaces this proxy and can do better mid-sentence splitting
//! when needed.
//! - Common English abbreviations (`Mr.`, `i.e.`, `e.g.`, `Fig. 3`)
//! trip the sentence-end heuristic and produce spurious boundaries —
//! accepted as a v1 limit. A real sentence segmenter lands with the
//! P+ tokenizer slot.
//! - The effective overlap budget is clamped at `target_bytes / 2` so a
//! pathological policy (`overlap_tokens >= target_tokens`) cannot
//! make a chunk fully re-emit the previous chunk's text. Same guard
//! pattern as `md-heading-v1::collect_overlap_seed`.
//!
//! ## `BYTES_PER_TOKEN`
//!
//! 3 — same calibration as `md-heading-v1` (covers Korean ≈ 3 b/tok and
//! over-estimates English ≈ 4 b/tok). The original p7-2 spec literal said
//! `× 4`, but cross-chunker comparability outweighs the spec literal here.
//! Logged in `tasks/HOTFIXES.md`.
//!
//! ## `chunk_id` collision deviation
//!
//! Design §4.2's `chunk_id = blake3(doc_id || chunker_version || sort(block_ids)
//! || policy_hash)` collides when one block (= one PDF page) is split
//! into multiple chunks: every chunk on that page has identical
//! `block_ids`. md-heading-v1 sidesteps this by always emitting at most
//! one chunk per atomic block. PdfPageV1 cannot.
//!
//! Workaround that doesn't change the §4.2 recipe: feed a per-chunk
//! variant `format!("{base_policy_hash}#c{char_start}")` into the
//! recipe's `policy_hash` slot (so distinct chunks distinguish via
//! different policy_hash inputs), while storing the unmodified
//! `base_policy_hash` in `Chunk.policy_hash` so the field still answers
//! "what policy was active". Logged in `tasks/HOTFIXES.md`.
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "pdf-page-v1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
/// Page-aware PDF chunker. See module docs for the splitting policy and
/// the `chunk_id` collision-avoidance deviation.
#[derive(Clone, Copy, Debug, Default)]
pub struct PdfPageV1Chunker;
impl Chunker for PdfPageV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
/// blake3(canonical_json(policy)) truncated to 16 hex chars. Matches
/// the `md-heading-v1` recipe so a workspace-wide policy hash lookup
/// (e.g. for invalidation reports) yields the same digest across
/// chunkers.
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
// Validate up front — every block must be a Paragraph carrying
// SourceSpan::Page. A mixed document signals a routing bug in
// the caller (e.g. running this chunker on Markdown) and is
// worth surfacing loudly.
for b in &doc.blocks {
let common = match b {
Block::Paragraph(p) => &p.common,
_ => anyhow::bail!(
"PdfPageV1Chunker only handles PDF docs (got non-Paragraph block)"
),
};
if !matches!(common.source_span, SourceSpan::Page { .. }) {
anyhow::bail!(
"PdfPageV1Chunker only handles PDF docs (got non-Page source_span)"
);
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let target_bytes = policy
.target_tokens
.saturating_mul(BYTES_PER_TOKEN)
.max(1);
// Clamp the overlap to half the target. Without this, a policy
// with `overlap_tokens >= target_tokens` would make every chunk
// fully re-emit the previous chunk's text — mirrors
// md-heading-v1's `seed_budget = overlap_tokens.min(target/2)`.
let overlap_bytes = policy
.overlap_tokens
.saturating_mul(BYTES_PER_TOKEN)
.min(target_bytes / 2);
let mut out: Vec<Chunk> = Vec::new();
for b in &doc.blocks {
let p = match b {
Block::Paragraph(t) => t,
_ => unreachable!("validated above"),
};
let page_num = match p.common.source_span {
SourceSpan::Page { page, .. } => page,
_ => unreachable!("validated above"),
};

(nit / 보강) char_start as u32 / char_end as u32 가 silent truncation 입니다. PDF 한 페이지가 4 G chars 를 넘는 케이스는 현실적으로 없지만 (PDF spec 상 종이 1 장 한계), as 캐스팅은 fuzz / corrupted 파일에서 의미 없이 잘리고 잘못된 span 을 만듭니다.

How to apply: u32::try_from(char_start).expect("page chars fit in u32") 정도로 dev-build 에서 명시 panic. 본 task spec 의 char_start: Option<u32> 는 §3.4 의 SourceSpan::Page 시그니처로 변경 불가 — astry_from 로 의도 명시만 바꾸는 것이라 scope 안.

(nit / 보강) `char_start as u32` / `char_end as u32` 가 silent truncation 입니다. PDF 한 페이지가 4 G chars 를 넘는 케이스는 현실적으로 없지만 (PDF spec 상 종이 1 장 한계), `as` 캐스팅은 fuzz / corrupted 파일에서 의미 없이 잘리고 잘못된 span 을 만듭니다. How to apply: `u32::try_from(char_start).expect("page chars fit in u32")` 정도로 dev-build 에서 명시 panic. 본 task spec 의 `char_start: Option<u32>` 는 §3.4 의 SourceSpan::Page 시그니처로 변경 불가 — `as` → `try_from` 로 의도 명시만 바꾸는 것이라 scope 안.
// Empty page → 0 chunks. Page is still searchable via the
// CanonicalDocument's per-page `Provenance::Warning`
// ("scanned candidate") — chunking just has nothing to say
// about it.
if p.text.trim().is_empty() {
continue;
}
for (char_start, char_end, slice) in
chunk_page(&p.text, target_bytes, overlap_bytes)
{
// PDF chars-per-page comfortably fits in u32 (a single
// page maxes out around ~10k chars even for dense
// typography); silent `as u32` truncation would only
// surface on corrupted input, where an explicit panic
// is preferable to an off-by-2^32 span.
let char_start_u32 = u32::try_from(char_start)
.expect("page chars fit in u32");
let char_end_u32 =

(칭찬) target_bytes / 2 clamp 가 정확히 md-heading-v1 의 seed_budget = overlap_tokens.min(target_tokens / 2) 와 같은 형태로 들어갔습니다. 두 chunker 가 같은 invariant 를 같은 표현으로 보장하면 미래에 누군가 한 쪽만 만지면 즉시 발견할 수 있는 symmetry — 본 PR 의 cross-chunker 일관성 약속 (BYTES_PER_TOKEN 동일 + policy_hash 동일성 unit test) 라인을 한 줄 더 강화합니다.

(칭찬) `target_bytes / 2` clamp 가 정확히 md-heading-v1 의 `seed_budget = overlap_tokens.min(target_tokens / 2)` 와 같은 형태로 들어갔습니다. 두 chunker 가 같은 invariant 를 같은 표현으로 보장하면 미래에 누군가 한 쪽만 만지면 즉시 발견할 수 있는 symmetry — 본 PR 의 cross-chunker 일관성 약속 (BYTES_PER_TOKEN 동일 + policy_hash 동일성 unit test) 라인을 한 줄 더 강화합니다.
u32::try_from(char_end).expect("page chars fit in u32");
let span = SourceSpan::Page {
page: page_num,
char_start: Some(char_start_u32),
char_end: Some(char_end_u32),
};
let block_ids: Vec<BlockId> = vec![p.common.block_id.clone()];
// Per-chunk policy_hash variant prevents chunk_id
// collision when a page produces multiple chunks. See
// module docs for rationale.
let per_chunk_hash = format!("{base_policy_hash}#c{char_start}");
let chunk_id =
id_for_chunk(&doc.doc_id, &chunker_version, &block_ids, &per_chunk_hash);
let token_estimate = slice.len().div_ceil(BYTES_PER_TOKEN);
out.push(Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),

(칭찬) format!("{base_policy_hash}#c{char_start}") 라는 per-chunk policy_hash 변형이 §4.2 chunk_id recipe 자체를 손대지 않으면서 collision 을 푸는 가장 작은-surface 해법입니다. recipe 의 4 슬롯 (doc_id, chunker_version, block_ids, policy_hash) 중 "입력만" per-chunk 로 다양화하고 — Chunk.policy_hash 필드는 base 유지로 workspace 정책 lookup 호환 — 두 invariant 가 합쳐서 깔끔합니다. md-heading-v1 가 미래에 같은 패턴을 만나면 그대로 베껴 쓸 수 있는 모듈 boundary.

(칭찬) `format!("{base_policy_hash}#c{char_start}")` 라는 per-chunk policy_hash 변형이 §4.2 chunk_id recipe 자체를 손대지 않으면서 collision 을 푸는 가장 작은-surface 해법입니다. recipe 의 4 슬롯 (doc_id, chunker_version, block_ids, policy_hash) 중 "입력만" per-chunk 로 다양화하고 — `Chunk.policy_hash` 필드는 base 유지로 workspace 정책 lookup 호환 — 두 invariant 가 합쳐서 깔끔합니다. md-heading-v1 가 미래에 같은 패턴을 만나면 그대로 베껴 쓸 수 있는 모듈 boundary.
block_ids,
text: slice,
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.clone(),
});
}
}

(칭찬) as u32u32::try_from(...).expect(...) + 코멘트 (PDF chars-per-page comfortably fits in u32 (a single page maxes out around ~10k chars even for dense typography)) 로 silent truncation 의도성을 명시한 게 좋습니다. fuzzed / corrupted 페이지에서 의미 없는 잘림 대신 명시 panic — debugging 친화적이고, expect 메시지가 invariant 를 그대로 코드 안에 박아둡니다.

(칭찬) `as u32` → `u32::try_from(...).expect(...)` + 코멘트 (`PDF chars-per-page comfortably fits in u32 (a single page maxes out around ~10k chars even for dense typography)`) 로 silent truncation 의도성을 명시한 게 좋습니다. fuzzed / corrupted 페이지에서 의미 없는 잘림 대신 명시 panic — debugging 친화적이고, expect 메시지가 invariant 를 그대로 코드 안에 박아둡니다.
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = out.len(),
"pdf-page-v1 chunked",
);
Ok(out)
}
}
/// Split a single page's text into ordered chunks, each represented as
/// `(char_start, char_end, text_slice)`. Char positions are within the
/// page text, suitable for `SourceSpan::Page::char_start` / `char_end`.
///

(nit / 문서화) sentence-end 휴리스틱이 "Mr. Smith", "i.e.", "e.g.", "Fig. 3" 등의 흔한 약어에서 spurious 경계를 만듭니다. 결과적으로 chunk 가 약어 직후에 잘려 retrieval 시 "Mr. Smith was…" 의 "Smith was…" 부분이 별도 chunk 로 떨어질 수 있습니다.

Why: spec § Risks 에 "sentence-splitting uses simple regex; languages without clear sentence punctuation may produce uneven chunks" 만 적혀 있는데, 약어 케이스도 같은 카테고리이므로 모듈 doc 에 한 줄 명시하는 편이 미래 reader 에게 친절합니다.

How to apply: 모듈 doc ## Splitting policy 섹션에 "common English abbreviations (Mr., i.e., e.g.) trigger spurious sentence boundaries — accepted v1 limit, full sentence segmentation lands with the P+ tokenizer slot" 같은 두 줄 추가.

(nit / 문서화) sentence-end 휴리스틱이 "Mr. Smith", "i.e.", "e.g.", "Fig. 3" 등의 흔한 약어에서 spurious 경계를 만듭니다. 결과적으로 chunk 가 약어 직후에 잘려 retrieval 시 "Mr. Smith was…" 의 "Smith was…" 부분이 별도 chunk 로 떨어질 수 있습니다. Why: spec § Risks 에 "sentence-splitting uses simple regex; languages without clear sentence punctuation may produce uneven chunks" 만 적혀 있는데, 약어 케이스도 같은 카테고리이므로 모듈 doc 에 한 줄 명시하는 편이 미래 reader 에게 친절합니다. How to apply: 모듈 doc `## Splitting policy` 섹션에 "common English abbreviations (`Mr.`, `i.e.`, `e.g.`) trigger spurious sentence boundaries — accepted v1 limit, full sentence segmentation lands with the P+ tokenizer slot" 같은 두 줄 추가.
/// Returns an empty vector when `text` is empty or whitespace-only.
fn chunk_page(text: &str, target_bytes: usize, overlap_bytes: usize) -> Vec<(usize, usize, String)> {
let chars: Vec<char> = text.chars().collect();
let n = chars.len();
if n == 0 {
return Vec::new();
}
if text.len() <= target_bytes {
return vec![(0, n, text.to_string())];
}
// Build candidate boundary positions (char indices where a chunk
// *may* start). 0 and n are always boundaries; interior boundaries
// are after a paragraph break (`\n\n`) or after a sentence-ending
// punctuation followed by whitespace.
let mut bounds: Vec<usize> = vec![0];
let mut k = 0;
while k + 1 < n {
let c = chars[k];
let nx = chars[k + 1];
let is_paragraph_break = c == '\n' && nx == '\n';
let is_sentence_end =
matches!(c, '.' | '?' | '!') && nx.is_whitespace();
if (is_paragraph_break || is_sentence_end) && k + 2 <= n {
bounds.push(k + 2);
}
k += 1;
}
if *bounds.last().unwrap() != n {
bounds.push(n);
}
bounds.dedup();
// UTF-8 byte length of the slice between two char indices.
let byte_len = |a: usize, b: usize| -> usize {
chars[a..b].iter().map(|c| c.len_utf8()).sum()
};
let mut chunks: Vec<(usize, usize, String)> = Vec::new();
let mut seg_idx: usize = 0;
while seg_idx + 1 < bounds.len() {
let start = bounds[seg_idx];
let mut end_idx = seg_idx + 1;

(issue) overlap walk-back cap 가 부족합니다. 현재 prev_min = prev.0 (이전 chunk 의 시작 포함, overlap 까지 적용된 상태) 까지 backwalk 가 허용됩니다. overlap_bytes >> target_bytes/2 인 병적인 정책이 들어오면:

  • chunk 1: start=0, end=10, actual_start=0
  • chunk 2: start=10, end=20, overlap=200 → backwalk 200 bytes → prev_min=0 까지 → actual_start=0
    → chunk 2 의 text 가 chars[0..20] 으로 chunk 1 의 text 를 완전히 포함. 사실상 한 페이지가 1 chunk-같은-다른-chunk 를 무한 증식할 수 있는 패턴.

md-heading-v1 은 정확히 같은 함정을 seed_budget = overlap_tokens.min(target_tokens / 2) 로 막아둔 history 가 있습니다 (md_heading_v1.rs:258overlap_clamped_when_overlap_exceeds_target 회귀 테스트 — 본 PR 의 deviation 노트에서 이미 md-heading-v1 와 일관성을 맞추고 있는 만큼 이 가드도 같이 들여오는 게 자연스럽습니다).

Why: 사용자가 실수로 overlap_tokens >= target_tokens 정책을 넣었을 때 chunk-explosion 으로 KB 디스크 / 임베딩 비용이 폭발하는 걸 막는 운영 가드입니다.

How to apply: chunk 메서드 진입부에서 한 줄 clamp + 같은 패턴의 회귀 테스트 추가:

let overlap_bytes = policy.overlap_tokens
    .saturating_mul(BYTES_PER_TOKEN)
    .min(target_bytes / 2);

그리고 pdf_page_overlap_clamped_when_exceeds_target 같은 unit test 로 overlap=200, target=50 케이스에서 chunk 가 actual_start <= chunks[i-1].1 - overlap_clamped_chars 정도로 안정 되는지 확인.

(issue) overlap walk-back cap 가 부족합니다. 현재 `prev_min = prev.0` (이전 chunk 의 *시작* 포함, overlap 까지 적용된 상태) 까지 backwalk 가 허용됩니다. `overlap_bytes >> target_bytes/2` 인 병적인 정책이 들어오면: - chunk 1: start=0, end=10, actual_start=0 - chunk 2: start=10, end=20, overlap=200 → backwalk 200 bytes → `prev_min=0` 까지 → `actual_start=0` → chunk 2 의 text 가 `chars[0..20]` 으로 chunk 1 의 text 를 *완전히 포함*. 사실상 한 페이지가 1 chunk-같은-다른-chunk 를 무한 증식할 수 있는 패턴. `md-heading-v1` 은 정확히 같은 함정을 `seed_budget = overlap_tokens.min(target_tokens / 2)` 로 막아둔 history 가 있습니다 (`md_heading_v1.rs:258` 의 `overlap_clamped_when_overlap_exceeds_target` 회귀 테스트 — 본 PR 의 deviation 노트에서 이미 md-heading-v1 와 일관성을 맞추고 있는 만큼 이 가드도 같이 들여오는 게 자연스럽습니다). Why: 사용자가 실수로 `overlap_tokens >= target_tokens` 정책을 넣었을 때 chunk-explosion 으로 KB 디스크 / 임베딩 비용이 폭발하는 걸 막는 운영 가드입니다. How to apply: chunk 메서드 진입부에서 한 줄 clamp + 같은 패턴의 회귀 테스트 추가: ```rust let overlap_bytes = policy.overlap_tokens .saturating_mul(BYTES_PER_TOKEN) .min(target_bytes / 2); ``` 그리고 `pdf_page_overlap_clamped_when_exceeds_target` 같은 unit test 로 `overlap=200, target=50` 케이스에서 chunk 가 `actual_start <= chunks[i-1].1 - overlap_clamped_chars` 정도로 안정 되는지 확인.
let mut acc = byte_len(start, bounds[end_idx]);
// Greedy grow: glue subsequent segments while we stay under
// budget. We always include at least one segment per chunk
// (`acc > 0` guard) so a single oversize segment doesn't loop.
while end_idx + 1 < bounds.len() {
let next_bytes = byte_len(bounds[end_idx], bounds[end_idx + 1]);
if acc + next_bytes > target_bytes && acc > 0 {
break;
}
acc += next_bytes;
end_idx += 1;
}
let chunk_end = bounds[end_idx];
// Apply overlap: walk `actual_start` left of `start` until we
// have absorbed up to `overlap_bytes` of bytes, but never past
// the previous chunk's start (no full re-emission).
let actual_start = if let Some(prev) = chunks.last() {
let prev_min = prev.0;
let mut a = start;
let mut acc_o: usize = 0;
while a > prev_min {
let cl = chars[a - 1].len_utf8();
if acc_o + cl > overlap_bytes {
break;
}
acc_o += cl;
a -= 1;
}
a
} else {
start
};
let slice: String = chars[actual_start..chunk_end].iter().collect();
chunks.push((actual_start, chunk_end, slice));
seg_idx = end_idx;
}
chunks
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{
AssetId, CommonBlock, Inline, Lang, Metadata, ParserVersion, Provenance, SourceType,
TextBlock, TrustLevel, WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
fn make_pdf_doc(pages: &[&str]) -> CanonicalDocument {
let workspace_path = WorkspacePath::new("docs/test.pdf".into()).unwrap();
let asset_id = AssetId("a".repeat(64));
let parser_version = ParserVersion("pdf-text-v1".into());
let doc_id = id_for_doc(&workspace_path, &asset_id, &parser_version);
let mut blocks: Vec<Block> = Vec::new();
for (i, text) in pages.iter().enumerate() {
let page = (i as u32) + 1;
let char_count = text.chars().count() as u32;
let span = SourceSpan::Page {
page,
char_start: Some(0),
char_end: Some(char_count),
};
let block_id = id_for_block(&doc_id, "paragraph", &[], i as u32, &span);
let inlines = if text.is_empty() {
Vec::new()
} else {
vec![Inline::Text {
text: (*text).to_string(),
}]
};
blocks.push(Block::Paragraph(TextBlock {
common: CommonBlock {
block_id,
heading_path: Vec::new(),
source_span: span,
},
text: (*text).to_string(),
inlines,
}));
}
CanonicalDocument {
doc_id,
source_asset_id: asset_id,
workspace_path,
title: "test".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Paper,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
},
provenance: Provenance { events: vec![] },
parser_version,
schema_version: 1,
doc_version: 1,
}
}

(칭찬) 합성 make_pdf_doc(pages: &[&str]) 헬퍼가 kebab-parse-pdf 를 dev-dep 으로 끌어오지 않고 직접 Block::Paragraph + SourceSpan::Page 를 생성합니다. spec § Forbidden deps 의 "kebab-parse-pdf" 제약을 dev-dep level 까지 일관되게 지키는 동시에, 본 chunker 의 input shape 가 정확히 무엇인지 reader 가 같은 파일에서 한 눈에 볼 수 있게 됩니다 — kebab-parse-pdf 의 미래 변경에 영향 받지 않는 self-contained test 입니다.

(칭찬) 합성 `make_pdf_doc(pages: &[&str])` 헬퍼가 `kebab-parse-pdf` 를 dev-dep 으로 끌어오지 않고 직접 `Block::Paragraph` + `SourceSpan::Page` 를 생성합니다. spec § Forbidden deps 의 "kebab-parse-pdf" 제약을 dev-dep level 까지 일관되게 지키는 동시에, 본 chunker 의 input shape 가 정확히 무엇인지 reader 가 같은 파일에서 한 눈에 볼 수 있게 됩니다 — `kebab-parse-pdf` 의 미래 변경에 영향 받지 않는 self-contained test 입니다.
fn default_policy(target: usize, overlap: usize) -> ChunkPolicy {
ChunkPolicy {
target_tokens: target,
overlap_tokens: overlap,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()),
}
}
#[test]
fn chunker_version_is_pdf_page_v1() {
assert_eq!(
PdfPageV1Chunker.chunker_version(),
ChunkerVersion(VERSION_LABEL.to_string())
);
}
#[test]
fn three_page_small_emits_one_chunk_per_page() {
let doc = make_pdf_doc(&["page one", "page two", "page three"]);
let chunks = PdfPageV1Chunker
.chunk(&doc, &default_policy(500, 80))
.unwrap();
assert_eq!(chunks.len(), 3);
for (i, c) in chunks.iter().enumerate() {
assert_eq!(c.block_ids.len(), 1);
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.source_spans.len(), 1);
match c.source_spans[0] {
SourceSpan::Page { page, char_start, char_end } => {
assert_eq!(page, (i as u32) + 1);
assert_eq!(char_start, Some(0));
assert!(char_end.unwrap() > 0);
}
ref other => panic!("expected Page span, got {other:?}"),
}
}
assert_eq!(chunks[0].text, "page one");
assert_eq!(chunks[1].text, "page two");
assert_eq!(chunks[2].text, "page three");
}
#[test]
fn one_page_huge_text_splits_into_multiple_chunks_with_overlap() {
// Build a single page with 8 paragraphs of ~150 bytes each.
// target=50 tokens × 3 b/tok = 150 byte budget → each paragraph
// is itself just under budget, so 2-paragraph accumulation
// overshoots → ~8 chunks.
let para = "a".repeat(150);
let page_text = std::iter::repeat_n(para, 8)
.collect::<Vec<_>>()
.join("\n\n");
let doc = make_pdf_doc(&[&page_text]);
let chunks = PdfPageV1Chunker
.chunk(&doc, &default_policy(50, 20))
.unwrap();
assert!(
chunks.len() >= 4,
"expected ≥4 chunks for a 1200-byte page; got {}: text len={}",
chunks.len(),
page_text.len()
);
// All chunks live on page 1.
for c in &chunks {
match c.source_spans[0] {
SourceSpan::Page { page, .. } => assert_eq!(page, 1),
_ => panic!("non-Page span"),
}
}
// Overlap: chunk N's text starts with chunk N-1's tail bytes
// (or, equivalently, chunk N's char_start lies before chunk
// N-1's char_end).
for w in chunks.windows(2) {
let prev_end = match w[0].source_spans[0] {
SourceSpan::Page { char_end: Some(e), .. } => e,
_ => panic!("missing char_end"),
};
let next_start = match w[1].source_spans[0] {
SourceSpan::Page { char_start: Some(s), .. } => s,
_ => panic!("missing char_start"),
};
assert!(
next_start < prev_end,
"expected overlap (next.start < prev.end): {next_start} vs {prev_end}"
);
}
// chunk_ids stay distinct despite identical block_ids — the
// per-chunk policy_hash variant is doing its job.
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
ids.sort();
let total = ids.len();
ids.dedup();
assert_eq!(ids.len(), total, "all chunk_ids must be unique");
}
#[test]
fn empty_page_produces_no_chunks_for_that_page() {
let doc = make_pdf_doc(&["page one", "", "page three"]);
let chunks = PdfPageV1Chunker
.chunk(&doc, &default_policy(500, 80))
.unwrap();
assert_eq!(chunks.len(), 2);
let pages: Vec<u32> = chunks
.iter()
.map(|c| match c.source_spans[0] {
SourceSpan::Page { page, .. } => page,
_ => 0,
})
.collect();
assert_eq!(pages, vec![1, 3]);
}
#[test]
fn whitespace_only_page_skipped_too() {
let doc = make_pdf_doc(&["page one", " \n ", "page three"]);
let chunks = PdfPageV1Chunker
.chunk(&doc, &default_policy(500, 80))
.unwrap();
assert_eq!(chunks.len(), 2);
}
#[test]
fn non_pdf_doc_returns_error() {
// A doc whose blocks carry SourceSpan::Line (Markdown shape).
let workspace_path = WorkspacePath::new("notes/note.md".into()).unwrap();
let asset_id = AssetId("a".repeat(64));
let parser_version = ParserVersion("md-block-v1".into());
let doc_id = id_for_doc(&workspace_path, &asset_id, &parser_version);
let span = SourceSpan::Line { start: 1, end: 1 };
let block_id = id_for_block(&doc_id, "paragraph", &[], 0, &span);
let blocks = vec![Block::Paragraph(TextBlock {
common: CommonBlock {
block_id,
heading_path: vec![],
source_span: span,
},
text: "markdown body".into(),
inlines: vec![],
})];
let doc = CanonicalDocument {
doc_id,
source_asset_id: asset_id,
workspace_path,
title: "n".into(),
lang: Lang("en".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,

(칭찬) policy_hash_matches_md_heading_v1_for_identical_policy 가 cross-chunker fingerprint 동일성을 unit test 로 잠근 게 좋습니다. 본 PR 의 BYTES_PER_TOKEN deviation 노트가 "md-heading-v1 와 calibration 일치" 라고 주장하는데, 그 주장이 실제로 한 슬롯 (policy_hash) 에서라도 검증된다는 사실이 미래에 누군가 BYTES_PER_TOKEN 만 다시 4 로 되돌리려고 할 때 즉시 빨갛게 만듭니다.

(칭찬) `policy_hash_matches_md_heading_v1_for_identical_policy` 가 cross-chunker fingerprint 동일성을 unit test 로 잠근 게 좋습니다. 본 PR 의 BYTES_PER_TOKEN deviation 노트가 "md-heading-v1 와 calibration 일치" 라고 주장하는데, 그 주장이 실제로 한 슬롯 (`policy_hash`) 에서라도 검증된다는 사실이 미래에 누군가 BYTES_PER_TOKEN 만 다시 4 로 되돌리려고 할 때 즉시 빨갛게 만듭니다.
user_id_alias: None,
user: Default::default(),
},
provenance: Provenance { events: vec![] },
parser_version,
schema_version: 1,
doc_version: 1,
};
let err = PdfPageV1Chunker
.chunk(&doc, &default_policy(500, 80))
.expect_err("non-PDF doc must error");
assert!(
err.to_string().contains("PdfPageV1Chunker"),
"error mentions chunker: {err}"
);
}
#[test]
fn no_chunk_crosses_page_boundary() {
// Synthetic 4-page doc with mixed page sizes — each chunk
// must claim exactly one page in its single source_span.
let big_x = "x".repeat(2000);
let big_y = "y".repeat(800);
let pages = vec![
"tiny page one.",
big_x.as_str(),
"another tiny one.",
big_y.as_str(),
];
let doc = make_pdf_doc(&pages);
let chunks = PdfPageV1Chunker
.chunk(&doc, &default_policy(50, 10))
.unwrap();
for c in &chunks {
assert_eq!(c.source_spans.len(), 1, "chunk should hold one Page span");
assert!(matches!(c.source_spans[0], SourceSpan::Page { .. }));
}
// Group chunks by page, verify pages are non-decreasing in
// chunk order (no interleaving across pages).
let mut prev_page = 0u32;
for c in &chunks {
let page = match c.source_spans[0] {
SourceSpan::Page { page, .. } => page,
_ => unreachable!(),
};
assert!(
page >= prev_page,
"page numbers must be non-decreasing in chunk order: {prev_page} → {page}"
);
prev_page = page;
}
}
#[test]
fn deterministic_chunk_ids_1000() {
let doc = make_pdf_doc(&[
"first page text. and another sentence here.",
&("xyz ".repeat(500)),
]);
let policy = default_policy(80, 20);
let baseline: Vec<String> = PdfPageV1Chunker
.chunk(&doc, &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..1000 {
let again: Vec<String> = PdfPageV1Chunker
.chunk(&doc, &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, baseline);
}
}
#[test]
fn snapshot_three_page_chunks_stable() {
let doc = make_pdf_doc(&[
"Hello page 1.",
"Hello page 2 with some more body text.",
"Hello page 3.",
]);
let chunks = PdfPageV1Chunker
.chunk(&doc, &default_policy(500, 80))
.unwrap();
assert_eq!(chunks.len(), 3);
for (i, c) in chunks.iter().enumerate() {
assert_eq!(c.chunker_version.0, VERSION_LABEL);
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.source_spans.len(), 1);
match c.source_spans[0] {
SourceSpan::Page {
page,
char_start,
char_end,
} => {
assert_eq!(page, (i as u32) + 1);
assert_eq!(char_start, Some(0));
assert_eq!(char_end, Some(c.text.chars().count() as u32));
}
_ => panic!("expected Page"),
}
assert!(c.policy_hash.len() == POLICY_HASH_HEX_LEN);
assert!(c.policy_hash.bytes().all(|b| b.is_ascii_hexdigit()));
}
}
#[test]
fn overlap_clamped_when_overlap_exceeds_target() {
// Pathological policy: overlap = 4× target. Without the
// `target_bytes / 2` clamp, every chunk would fully re-emit
// the previous chunk's text (chunk N's actual_start collapses
// to chunk N-1's actual_start).
let para = "a".repeat(150);
let page_text = std::iter::repeat_n(para, 6)
.collect::<Vec<_>>()
.join("\n\n");
let doc = make_pdf_doc(&[&page_text]);
let policy = ChunkPolicy {
target_tokens: 50,
overlap_tokens: 200,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()),
};
let chunks = PdfPageV1Chunker.chunk(&doc, &policy).unwrap();
// For each consecutive pair, the new chunk's actual_start must
// be strictly greater than the previous chunk's actual_start
// (no full re-emission). Without the clamp, equality (full
// overlap) is the failure mode.
for w in chunks.windows(2) {
let prev_start = match w[0].source_spans[0] {
SourceSpan::Page { char_start: Some(s), .. } => s,
_ => panic!("missing char_start"),
};
let next_start = match w[1].source_spans[0] {
SourceSpan::Page { char_start: Some(s), .. } => s,
_ => panic!("missing char_start"),
};
assert!(
next_start > prev_start,
"overlap must not fully re-emit prior chunk: prev_start={prev_start}, next_start={next_start}"
);
}
// chunk_ids stay distinct (the per-chunk hash variant keys off
// char_start which is now strictly increasing).
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
ids.sort();
let total = ids.len();
ids.dedup();
assert_eq!(ids.len(), total, "chunk_ids must remain unique");
}
#[test]
fn policy_hash_matches_md_heading_v1_for_identical_policy() {
// Cross-chunker policy fingerprint identity — important so a
// workspace-wide "show me chunks with policy_hash = X" query
// covers both chunkers without per-chunker logic.
let p = default_policy(500, 80);

(칭찬) overlap_clamped_when_overlap_exceeds_targetactual_start 가 인접 chunk 사이에 엄격 증가 하는지 직접 단정합니다. 단순히 chunk 개수만 보지 않고 "chunk N 이 chunk N-1 을 포함하지 않는다" 라는 invariant 를 표현해서, 미래에 누군가 clamp 를 잘못 풀어 chunk-explosion 을 다시 만들면 정확한 위치에서 빨갛게 됩니다.

(칭찬) `overlap_clamped_when_overlap_exceeds_target` 가 `actual_start` 가 인접 chunk 사이에 *엄격 증가* 하는지 직접 단정합니다. 단순히 chunk 개수만 보지 않고 "chunk N 이 chunk N-1 을 포함하지 않는다" 라는 invariant 를 표현해서, 미래에 누군가 clamp 를 잘못 풀어 chunk-explosion 을 다시 만들면 정확한 위치에서 빨갛게 됩니다.
let pdf = PdfPageV1Chunker.policy_hash(&p);
let md = crate::MdHeadingV1Chunker.policy_hash(&p);
assert_eq!(pdf, md);
}
}

View File

@@ -14,6 +14,26 @@ historical contract that was implemented; this file accumulates the
deltas so phase 5+ readers can find the live behavior without diffing
git history.
## 2026-05-02 — P7-2 pdf-page-v1: chunk_id collision + BYTES_PER_TOKEN

(칭찬) HOTFIXES entry 가 두 가지 deviation (chunk_id 충돌 가드, BYTES_PER_TOKEN=3 vs 4) 을 각각 "Symptom → Root cause → Fix → Trust note → Amends" 의 동일한 형식으로 정리한 게 좋습니다. 특히 BYTES_PER_TOKEN 케이스에서 spec literal 이 md-heading-v1 의 실코드와 어긋나 있다는 사실 ("spec's '/4' claim is inconsistent with the implementation it claims to match") 을 직시한 것이 중요합니다 — frozen task spec 을 retroactively 손대지 않고 HOTFIXES 가 살아있는 source of truth 라는 워크스페이스 컨벤션을 정확히 활용한 사례.

(칭찬) HOTFIXES entry 가 두 가지 deviation (`chunk_id` 충돌 가드, `BYTES_PER_TOKEN=3 vs 4`) 을 각각 "Symptom → Root cause → Fix → Trust note → Amends" 의 동일한 형식으로 정리한 게 좋습니다. 특히 `BYTES_PER_TOKEN` 케이스에서 spec literal 이 `md-heading-v1` 의 실코드와 어긋나 있다는 사실 ("spec's '/4' claim is inconsistent with the implementation it claims to match") 을 직시한 것이 중요합니다 — frozen task spec 을 retroactively 손대지 않고 HOTFIXES 가 살아있는 source of truth 라는 워크스페이스 컨벤션을 정확히 활용한 사례.
**Discovered**: P7-2 implementation start.
**Symptom 1 (load-bearing)**: `tasks/p7/p7-2-pdf-page-chunker.md` § Behavior contract literally says `chunk_id` per design §4.2 with `(doc_id, "pdf-page-v1", block_ids, policy_hash)`. But unlike `md-heading-v1` (which always emits at most one chunk per atomic block), `pdf-page-v1` splits one page-block into multiple chunks when page text exceeds the byte budget. All sub-chunks of the same page have identical `block_ids` → identical `chunk_id` collisions, breaking the §3.5 invariant that `chunk_id` is a primary key.
**Symptom 2 (cosmetic)**: Spec text says `token_estimate = byte_len / 4` and "matches `md-heading-v1` proxy". Looking at the actual md-heading-v1 source (`crates/kebab-chunk/src/md_heading_v1.rs:17`), the constant is `BYTES_PER_TOKEN = 3` (chosen to cover Korean ≈ 3 b/tok and over-estimate English ≈ 4 b/tok). Spec's "/4" claim is inconsistent with the implementation it claims to match.
**Root cause**: §4.2 chunk_id recipe was designed assuming one-chunk-per-block-set. Page-aware chunking violates that assumption.
**Fix** (PR #38, feat/p7-2-pdf-page-chunker):
- **Per-chunk policy_hash variant**: feed `format!("{base_policy_hash}#c{char_start}")` into `id_for_chunk`'s `policy_hash` slot so chunks within the same page get distinct `chunk_id`s. The §4.2 recipe itself stays unchanged — only the *input* to one of its slots differs per chunk. The unmodified `base_policy_hash` is still stored in `Chunk.policy_hash` so the field still answers "what policy was active" (workspace-wide policy invalidation lookups continue to work).
- **`BYTES_PER_TOKEN = 3`** (matches md-heading-v1 actual code, not spec literal). Cross-chunker policy fingerprint identity is verified by a unit test: `policy_hash_matches_md_heading_v1_for_identical_policy`.
**Trust note**: The per-chunk hash variant is opaque (`#c<n>` is just a marker, not interpretable as char_start by downstream tools — they read `Chunk.source_spans[0].char_start` for that). Downstream identifier comparisons on `chunk_id` continue to work as opaque blake3 hashes.
**Amends**:
- tasks/p7/p7-2-pdf-page-chunker.md (chunk_id recipe per-chunk variant; BYTES_PER_TOKEN = 3 not 4).
## 2026-05-02 — P6-3 caption: GenerateRequest.images + cargo feature dropped
**Discovered**: P6-3 implementation start.

View File

@@ -3,7 +3,7 @@ phase: P7
component: kebab-chunk (pdf-page-v1)
task_id: p7-2
title: "PDF page-aware chunker (pdf-page-v1)"
status: planned
status: completed
depends_on: [p7-1]
unblocks: []
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md