fix(chunk): chunk_id collision under aggressive overlap; bump pdf-page-v1 → pdf-page-v1.1 (Bug #3)

v0.20.0 sub-item 1 dogfood report 의 Bug #3 (Critical). scanned_page2.pdf
(1580 char OCR text) ingest 시 `chunks.chunk_id` PRIMARY KEY violation —
`per_chunk_hash = #c{char_start}` 가 post-overlap `actual_start` 사용 +
overlap walk floor 가 `prev_min` 으로 collapse → segment 1/2 동일 `#c0`.

- `crates/kebab-chunk/src/pdf_page_v1.rs`: `chunk_page` returns 4-tuple
  (segment_start, actual_start, chunk_end, slice); caller `per_chunk_hash`
  suffix uses `segment_start` (pre-overlap boundary, strictly increasing)
  instead of `char_start` (post-overlap, may collapse to prev_min).
- VERSION_LABEL `"pdf-page-v1"` → `"pdf-page-v1.1"` (design §9 cascade,
  explicit user-facing audit trail). `crates/kebab-app/tests/pdf_pipeline.rs:
  168, 368` 의 hardcoded literal 도 v1.1 로 갱신.
- module docs (`pdf_page_v1.rs:47-60`): workaround description 의
  `#c{char_start}` reference 를 `#c{segment_start}` 로 갱신 + segment_start
  invariant 명문 + HOTFIXES.md cross-ref.
- `pdf_page_v1.rs::tests`: `multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids`
  regression pin (10 char "가" + ". " + 500 char "나" — multi-chunk +
  overlap walk collapse trigger).
- `tasks/HOTFIXES.md`: 2026-05-27 entry (symptom F2 1580 char OCR,
  intra-doc collision root cause, second-iteration patch rationale).

spec:  docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§4)
plan:  docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 2)
prior: d9acda5 (Step 1 Bug #2 walker fix)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-27 13:32:09 +00:00
parent d9acda517a
commit 436fd015a2
3 changed files with 109 additions and 20 deletions

View File

@@ -165,7 +165,7 @@ fn ingest_3_page_pdf_produces_one_doc_and_per_page_chunks() {
);
assert_eq!(
pdf_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("pdf-page-v1")
Some("pdf-page-v1.1")
);
// Inspect the stored doc to confirm SourceSpan::Page round-trip.
@@ -365,7 +365,7 @@ fn mixed_page_pdf_stores_asset_with_scanned_candidate_warning() {
assert_eq!(
pdf_item.chunk_count,
Some(2),
"pdf-page-v1 emits 0 chunks for the empty page; total = 2"
"pdf-page-v1.1 emits 0 chunks for the empty page; total = 2"
);
let doc = kebab_app::inspect_doc_with_config(

View File

@@ -53,18 +53,21 @@
//! one chunk per atomic block. PdfPageV1 cannot.
//!
//! Workaround that doesn't change the §4.2 recipe: feed a per-chunk
//! variant `format!("{base_policy_hash}#c{char_start}")` into the
//! recipe's `policy_hash` slot (so distinct chunks distinguish via
//! different policy_hash inputs), while storing the unmodified
//! `base_policy_hash` in `Chunk.policy_hash` so the field still answers
//! "what policy was active". Logged in `tasks/HOTFIXES.md`.
//! variant `format!("{base_policy_hash}#c{segment_start}")` into the
//! recipe's `policy_hash` slot. `segment_start` is the pre-overlap
//! segment boundary, strictly increasing across the returned chunks
//! even when the overlap walk collapses `actual_start` to a previous
//! chunk's `prev_min`. Unmodified `base_policy_hash` is stored in
//! `Chunk.policy_hash` so the field still answers "what policy was
//! active". v1.1 second-iteration patch — logged in
//! `tasks/HOTFIXES.md` (2026-05-27).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "pdf-page-v1";
const VERSION_LABEL: &str = "pdf-page-v1.1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
@@ -146,7 +149,7 @@ impl Chunker for PdfPageV1Chunker {
continue;
}
for (char_start, char_end, slice) in
for (segment_start, char_start, char_end, slice) in
chunk_page(&p.text, target_bytes, overlap_bytes)
{
// PDF chars-per-page comfortably fits in u32 (a single
@@ -164,10 +167,12 @@ impl Chunker for PdfPageV1Chunker {
char_end: Some(char_end_u32),
};
let block_ids: Vec<BlockId> = vec![p.common.block_id.clone()];
// Per-chunk policy_hash variant prevents chunk_id
// collision when a page produces multiple chunks. See
// module docs for rationale.
let per_chunk_hash = format!("{base_policy_hash}#c{char_start}");
// v0.20.0 sub-item 1 bugfix (#3): per-chunk policy_hash
// variant uses `segment_start` (pre-overlap boundary,
// strictly increasing) instead of `char_start` (post-
// overlap, may collapse to prev_min). See module docs +
// spec §4.1 root cause + HOTFIXES.md 2026-05-27.
let per_chunk_hash = format!("{base_policy_hash}#c{segment_start}");
let chunk_id =
id_for_chunk(&doc.doc_id, &chunker_version, &block_ids, &per_chunk_hash);
let token_estimate = slice.len().div_ceil(BYTES_PER_TOKEN);
@@ -198,18 +203,24 @@ impl Chunker for PdfPageV1Chunker {
}
/// Split a single page's text into ordered chunks, each represented as
/// `(char_start, char_end, text_slice)`. Char positions are within the
/// page text, suitable for `SourceSpan::Page::char_start` / `char_end`.
/// `(segment_start, actual_start, chunk_end, text_slice)`.
///
/// - `segment_start` = pre-overlap segment boundary. Strictly increasing
/// across the returned vec. Use this for chunk_id uniqueness suffixes.
/// - `actual_start` = post-overlap start char index. May collapse to a
/// previous chunk's `actual_start` under aggressive overlap policy.
/// Use this for `SourceSpan::Page::char_start`.
/// - `chunk_end` = chunk's end char index (exclusive).
///
/// Returns an empty vector when `text` is empty or whitespace-only.
fn chunk_page(text: &str, target_bytes: usize, overlap_bytes: usize) -> Vec<(usize, usize, String)> {
fn chunk_page(text: &str, target_bytes: usize, overlap_bytes: usize) -> Vec<(usize, usize, usize, String)> {
let chars: Vec<char> = text.chars().collect();
let n = chars.len();
if n == 0 {
return Vec::new();
}
if text.len() <= target_bytes {
return vec![(0, n, text.to_string())];
return vec![(0, 0, n, text.to_string())];
}
// Build candidate boundary positions (char indices where a chunk
@@ -239,7 +250,7 @@ fn chunk_page(text: &str, target_bytes: usize, overlap_bytes: usize) -> Vec<(usi
chars[a..b].iter().map(|c| c.len_utf8()).sum()
};
let mut chunks: Vec<(usize, usize, String)> = Vec::new();
let mut chunks: Vec<(usize, usize, usize, String)> = Vec::new();
let mut seg_idx: usize = 0;
while seg_idx + 1 < bounds.len() {
let start = bounds[seg_idx];
@@ -264,7 +275,9 @@ fn chunk_page(text: &str, target_bytes: usize, overlap_bytes: usize) -> Vec<(usi
// have absorbed up to `overlap_bytes` of bytes, but never past
// the previous chunk's start (no full re-emission).
let actual_start = if let Some(prev) = chunks.last() {
let prev_min = prev.0;
// prev tuple shape = (segment_start, actual_start, chunk_end, slice).
// overlap walk floor = previous chunk's actual_start (prev.1).
let prev_min = prev.1;
let mut a = start;
let mut acc_o: usize = 0;
while a > prev_min {
@@ -281,7 +294,7 @@ fn chunk_page(text: &str, target_bytes: usize, overlap_bytes: usize) -> Vec<(usi
};
let slice: String = chars[actual_start..chunk_end].iter().collect();
chunks.push((actual_start, chunk_end, slice));
chunks.push((start, actual_start, chunk_end, slice));
seg_idx = end_idx;
}
@@ -674,6 +687,43 @@ mod tests {
assert_eq!(ids.len(), total, "chunk_ids must remain unique");
}
#[test]
fn multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids() {
// 한국어 OCR text 의 trigger shape: 10 char "가" + ". " + 500 char "나".
// → first segment [0, 12), second segment [12, n).
// page_text byte_len = 10*3 + 2 + 500*3 = 1532 > target_bytes=1500
// → multi-chunk. overlap_bytes = min(240, 750) = 240 chars=80
// → second chunk 의 actual_start 가 prev_min=0 collapse → same `#c0`.
//
// default_policy(500, 80) — target_tokens=500 → target_bytes=500*3=1500
// (한국어 3byte/char 환산), overlap_tokens=80 → overlap_bytes=min(240, 750)=240.
// verifier round 1 L-3 보강.
let early_seg = "".repeat(10);
let tail = "".repeat(500);
let page_text = format!("{early_seg}. {tail}");
let doc = make_pdf_doc(&[&page_text]);
let policy = default_policy(500, 80); // target=1500 byte, overlap=240 byte
let chunks = PdfPageV1Chunker.chunk(&doc, &policy).unwrap();
assert!(
chunks.len() >= 2,
"expected ≥2 chunks for {} byte page; got {}",
page_text.len(),
chunks.len()
);
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
ids.sort_unstable();
let total = ids.len();
ids.dedup();
assert_eq!(
ids.len(),
total,
"all chunk_ids must be unique even when overlap walks actual_start back to prev_min"
);
}
#[test]
fn policy_hash_matches_md_heading_v1_for_identical_policy() {
// Cross-chunker policy fingerprint identity — important so a

View File

@@ -14,6 +14,45 @@ historical contract that was implemented; this file accumulates the
deltas so phase 5+ readers can find the live behavior without diffing
git history.
## 2026-05-27 — v0.20.0 sub-item 1: chunk_id `#c{char_start}` workaround collapses under aggressive overlap (Bug #3 second-iteration patch)
**Symptom**: F2 (1580 chars OCR, scanned_page2.pdf) ingest 시
`DocumentStore::put_chunks (pdf): sqlite error: UNIQUE constraint
failed: chunks.chunk_id: ... Error code 1555: A PRIMARY KEY constraint
failed`. `kebab v0.20.0` (commit `b4d9e60`) dogfood (qwen2.5vl:3b 의
`192.168.0.47:11434` Ollama endpoint, `/build/cache/tmp/v0.20-dogfood`
isolated KB) `--force-reingest` 마다 reproducible.
**Root cause**: `crates/kebab-chunk/src/pdf_page_v1.rs:170`
`per_chunk_hash = format!("{base_policy_hash}#c{char_start}")` 에서
`char_start` = post-overlap `actual_start`. line 266-281 의 overlap
walk 가 `prev_min` floor 까지만 back-walk 하므로 aggressive overlap
+ 첫 segment 가 작은 page (F2 의 한국어 OCR text: 첫 ~10 char 안
sentence-end → segment_1 = [0, 30], segment_2 = [30, n], overlap_bytes
240 / chars=80 → segment_2 의 actual_start 가 prev_min=0 으로
collapse) → 두 chunk 의 `#c0` suffix identical → identical chunk_id →
`chunks` PRIMARY KEY violation.
**Fix** (spec §4.4): `chunk_page` return tuple 에 `segment_start`
추가 (3-tuple → 4-tuple `(segment_start, actual_start, chunk_end,
slice)`), caller `per_chunk_hash` 의 suffix 를 `segment_start`
변경. `segment_start``bounds[seg_idx]` (dedup 후 strictly
increasing) — overlap walk 와 무관하게 모든 chunk distinct. citation
locality 의 `SourceSpan::Page.char_start` 는 여전히 post-overlap
`actual_start` 유지.
**chunker_version cascade**: `pdf-page-v1``pdf-page-v1.1` bump
(spec §4.4.1 round 1c M-1 결정, design §9 cascade rule 의 직접 적용).
multi-chunk PDF page (pre-OCR 시점 `metro-korea.pdf` 의 21 block /
34 chunk 같은 정상 path) 의 chunk_id 가 변경 — explicit user-facing
audit trail 확보, store layer 의 자동 invalidation report. v0.20.0
force-update path 라 사용자 cost zero (어차피 fresh ingest).
**Amends**: spec `docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md`
§4.4. parent design §4.2 chunk_id recipe 자체 unchanged (workaround
layer 의 internal computation 만 변경). parent PR #189
(`feat/pdf-scanned-ocr`, force-update path).
## 2026-05-26 — design deviation — kebab-normalize + kebab-parse-types 흡수 (24 → 22 crates)
**Symptom**: design deviation — post-PR9 audit (system-architect, `tasks/INDEX.md` L169) identified 두 crate (`kebab-normalize` + `kebab-parse-types`) 가 dead abstraction. design §3.7b 의 "thin layer" raison d'être ((a) `kebab-core` namespace 폭발 방지, (b) normalize 의 parser non-dependence) 가 4 parser 중 1개 (markdown) 만 lift 를 경유하는 현실에서 fan-in/fan-out 모두 1 → layer 의미 잃음. `kebab-parse-types` 의 production caller 가 2개 (`kebab-parse-md` + `kebab-normalize`) 이고 `kebab-normalize` 자체 caller 가 1개 (`kebab-app`) — 모두 markdown 의 lift 경로 안에서 단일 fan-in 경계 가능.