Deliverables of the 3-round dogfood-driven fix cycle:
- bugfix1 (Bug #2/#3/#4): spec 964 lines + plan 848 lines
- bugfix2 (Bug #6/#7, #8 falsified): spec 308 lines + plan 388 lines
- bugfix3 (Bug #9/#10/#11/#13/#14, #12 falsified): spec 410 lines + plan 1043 lines
- docs/DOGFOOD.md: the full all-round dogfood checklist (§0 environment through §13 reference corpus)
Each round's spec/plan was frozen after the critic + verifier round 2 closure ACCEPT, based on dogfood-driven evidence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| title | created | status | target_version | parent_spec | dogfood_evidence | review_history |
|---|---|---|---|---|---|---|
| v0.20.0 sub-item 1 bugfix: chunk_id collision + walker code limit + F4 fixture | 2026-05-27 | ACCEPT (round 2 closure, Phase A complete) | 0.20.0 (PR | docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md | .omc/reviews/2026-05-27-v0.20-dogfood-report.md | |
v0.20.0 sub-item 1 bugfix — chunk_id collision + walker code limit + F4 fixture
This spec documents the root-cause analysis, fix design, and acceptance criteria for the 3 bugs found while dogfooding PR #189 of v0.20.0 sub-item 1 (scanned PDF OCR). It is the source for the follow-up plan and executor stages.
§1 Background + dogfood evidence chain
§1.1 Dogfood environment
| Item | Value |
|---|---|
| Binary | kebab v0.20.0 (commit b4d9e60) |
| Ollama endpoint | http://192.168.0.47:11434 (qwen2.5vl:3b) |
| Isolated KB | /build/cache/tmp/v0.20-dogfood/ |
| Corpus | 9 PDFs (PoC + sub-item fixtures + 3 user PDFs, 466 KB to 58 MB) |
§1.2 Reproducibility of the 3 bugs
| Bug | Severity | Trigger | Reproducible |
|---|---|---|---|
| #2 walker code limit | Important | 256 KB+ PDF/image/markdown ingest | Always (default config) |
| #3 chunk_id collision | Critical | scanned_page2.pdf (1580 OCR chars) ingest | On every force-reingest |
| #4 F4 Pages tree | Important | mojibake.pdf (F4 fixture) ingest | Always |
§1.3 Dogfood report excerpts
Key excerpts from the dogfood report (.omc/reviews/2026-05-27-v0.20-dogfood-report.md):
- Bug #2: scanned=3, skipped_size_exceeded=6. Only 3 of the 9 workspace PDFs passed; 6 PDFs (F1 466KB / F2 756KB / metro 57MB / thermal-pos 1.1MB / thermal-label 2.7MB / internals 820KB) were skipped at the walker stage.
- Bug #3: "DocumentStore::put_chunks (pdf): sqlite error: UNIQUE constraint failed: chunks.chunk_id: ... Error code 1555: A PRIMARY KEY constraint failed". Raised at the chunk INSERT stage for scanned_page2.pdf.
- Bug #4: block_count: 0, chunk_count: 0. The ingest result for F4 mojibake.pdf violates PdfTextExtractor's "1 paragraph per page" invariant.
§2 Goals + non-goals
§2.1 Goals
- Fix all 3 bugs within v0.20.0 (PR #189 force-update path: new commits stack on the same branch).
- Preserve the invariants of the parent spec (docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md):
  - §1.4: PdfTextExtractor emits "1 Block::Paragraph per page".
  - §3.5: post-extract OCR enrichment preserves block_id (in-place mutate path).
  - §4.6: wire schema changes are additive only (no V00X migration needed).
- No conflict with the round 1c ACCEPT design decisions of the parent plan (docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md).
- Zero workspace test regressions (cargo test --workspace -j 1).
§2.2 Non-goals
- A new wire schema major bump (v1 → v2): these fixes add no schema changes.
- A new V00X sqlite migration: the chunks table DDL is unchanged; the fix is confined to the chunk_id computation path.
- Changing the F4 fixture invariant (the requirements "no ToUnicode CMap" and "valid 1-page PDF" stay).
- New config knobs (a per-media-type limit such as [ingest.pdf].max_file_bytes is v0.21+ scope; this fix only separates the walker's code path).
§3 Bug #2 — walker code limit
§3.1 Root cause (file:line evidence)
crates/kebab-source-fs/src/connector.rs:42-72: in Config::new, FsSourceConnector reads max_file_bytes and max_file_lines from the single config.ingest.code namespace:
Ok(Self {
default_root: root,
default_exclude: config.workspace.exclude.clone(),
copy_threshold_bytes,
skip_generated_header: config.ingest.code.skip_generated_header,
max_file_bytes: config.ingest.code.max_file_bytes, // <-- code-specific
max_file_lines: config.ingest.code.max_file_lines, // <-- code-specific
})
crates/kebab-source-fs/src/connector.rs:169-190: the walker's size check calls is_oversized(...) regardless of the path's media type:
if crate::code_meta::is_oversized(
&abs_path,
self.max_file_bytes, // generic limit, applied to every file
self.max_file_lines,
).unwrap_or(false) {
fs_skips.skipped_size_exceeded =
fs_skips.skipped_size_exceeded.saturating_add(1);
// ...
continue;
}
crates/kebab-source-fs/src/code_meta.rs:114-129: is_oversized(...) itself is a generic helper (extension-agnostic):
pub(crate) fn is_oversized(path: &Path, max_bytes: u64, max_lines: u32) -> Result<bool> {
let meta = std::fs::metadata(path)?;
if meta.len() > max_bytes {
return Ok(true);
}
// line cap (streaming)
...
}
crates/kebab-config/src/lib.rs:535-547: IngestCodeCfg::default() sets max_file_bytes = 262_144 (256 KiB), which most PDFs and images exceed.
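To make the effect of that default concrete, the sketch below compares the file sizes quoted in §1.3 against the 262_144-byte cap. The helper names (exceeds_code_cap, dogfood_sizes, count_skipped) are hypothetical, and the byte values assume the report's KB/MB figures are decimal units; the real check lives in is_oversized.

```rust
/// Default `[ingest.code].max_file_bytes` quoted in this spec (256 KiB).
const DEFAULT_MAX_FILE_BYTES: u64 = 262_144;

/// Hypothetical helper: mirrors only the byte-size half of `is_oversized`.
fn exceeds_code_cap(len_bytes: u64) -> bool {
    len_bytes > DEFAULT_MAX_FILE_BYTES
}

/// Sizes from the dogfood report, converted assuming decimal KB/MB.
fn dogfood_sizes() -> Vec<(&'static str, u64)> {
    vec![
        ("F1", 466_000),
        ("F2", 756_000),
        ("metro", 57_000_000),
        ("thermal-pos", 1_100_000),
        ("thermal-label", 2_700_000),
        ("internals", 820_000),
    ]
}

/// Number of listed files the generic cap would skip.
fn count_skipped() -> usize {
    dogfood_sizes()
        .iter()
        .filter(|(_, len)| exceeds_code_cap(*len))
        .count()
}
```

Under these assumptions every one of the six listed files is above the cap, which is consistent with the skipped_size_exceeded=6 figure in the report.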
§3.2 Decision matrix
| Option | Description | Pros | Cons |
|---|---|---|---|
| A: code path only | Apply the walker's size check only to code files (extension recognized by code_lang_for_path) | Simple / preserves existing default behavior / resolves Bug #2 immediately | No size limit for PDF/image/markdown: even a 1 GB PDF passes the walker |
| B: per-type config | Add new [ingest.pdf], [ingest.image], [ingest.markdown] sections with per-type limits | User-tunable | 3 new config fields + serde defaults + env overrides + tests: exceeds the v0.20 hotfix scope |
| C: generic limit + docs note | Keep the same generic limit but document the intent | Zero code changes | UX bug stays unresolved (the dogfood workaround config would be forced on production): rejected |
§3.3 Chosen path — Option A
The walker's size cap is intended to be code-specific. The size of PDF/image/markdown files is validated by each parser (lopdf's load_mem handles PDFs above 256 KB normally, and the image OCR call caps itself via max_pixels). If per-type config is needed in v0.21+, this evolves into Option B.
Add an is_code_file(path: &Path) -> bool helper: a file is a code file when code_meta::code_lang_for_path(path).is_some(). Reusing the existing helper keeps the mapping consistent (exactly the same Tier 1 + Tier 2 basename + extension list).
§3.4 Implementation (Rust diff)
crates/kebab-source-fs/src/code_meta.rs: add a pub(crate) helper:
/// Returns true when `path`'s filename/extension is recognised as a code file
/// (per `code_lang_for_path`). Used by the walker to apply
/// `[ingest.code].max_file_bytes` / `max_file_lines` only to code files,
/// not to PDF/image/markdown (which have their own size controls in their
/// respective parsers).
pub(crate) fn is_code_file(path: &Path) -> bool {
code_lang_for_path(path).is_some()
}
crates/kebab-source-fs/src/connector.rs:168-190: add the walker conditional:
// p10-1A-1: apply per-file generated-header + size-cap checks on files
// that passed the override (gitignore/builtin/kebabignore) matching.
// v0.20.0 sub-item 1 bugfix: size-cap (max_file_bytes / max_file_lines)
// applies ONLY to code files. PDF/image/markdown bypass — their parsers
// have their own size controls.
if crate::code_meta::is_code_file(&abs_path)
&& crate::code_meta::is_oversized(
&abs_path,
self.max_file_bytes,
self.max_file_lines,
)
.unwrap_or(false)
{
fs_skips.skipped_size_exceeded =
fs_skips.skipped_size_exceeded.saturating_add(1);
push_sample(
&mut fs_skips.skip_examples.size_exceeded,
&abs_path,
&root,
);
tracing::debug!(
path = %rel_path.display(),
max_bytes = self.max_file_bytes,
max_lines = self.max_file_lines,
"skip: code file exceeds size cap"
);
continue;
}
Conditional application of skip_generated_header is a separate matter: the generated-header sniff looks only at the ASCII content of the first 512 bytes, regardless of path extension. The first 512 bytes of a binary PDF or image practically never contain an ASCII string such as "do not edit", so no false positives are expected. Adding a walker conditional around is_generated_file is unnecessary; existing behavior is kept.
§3.5 Test additions
Add to the existing test module in crates/kebab-source-fs/src/connector.rs:
#[test]
fn size_cap_skips_only_code_files() {
let dir = tempfile::tempdir().unwrap();
let root = dir.path();
// 300 KB PDF (binary), 300 KB markdown (text), 300 KB Rust (code).
let big_blob: Vec<u8> = vec![b'x'; 300_000];
std::fs::write(root.join("paper.pdf"), &big_blob).unwrap();
std::fs::write(root.join("notes.md"), &big_blob).unwrap();
std::fs::write(root.join("big.rs"), &big_blob).unwrap();
let conn = FsSourceConnector::new(
&cfg_with_size_cap(root.to_str().unwrap(), 262_144, 5_000),
)
.unwrap();
let (assets, skips) = conn.scan_with_skips(&SourceScope::default()).unwrap();
let paths: Vec<_> = assets.iter().map(|a| a.workspace_path.0.clone()).collect();
// PDF + Markdown pass through walker.
assert!(paths.contains(&"paper.pdf".to_string()));
assert!(paths.contains(&"notes.md".to_string()));
// Code file gets skipped.
assert!(!paths.contains(&"big.rs".to_string()));
assert!(
skips.skip_examples.size_exceeded.iter().any(|p| p.contains("big.rs")),
"size_exceeded examples should contain only big.rs: {:?}",
skips.skip_examples.size_exceeded
);
assert!(
!skips.skip_examples.size_exceeded.iter().any(|p| p.contains("paper.pdf")),
"PDF must NOT appear in size_exceeded examples: {:?}",
skips.skip_examples.size_exceeded
);
}
In addition, the fixture of the existing test ingest_report_counts_oversized_files_by_bytes is named huge.rs, so its invariant is preserved. The same holds for ingest_report_size_cap_by_line_count, whose fixture is longfile.rs.
§4 Bug #3 — chunk_id collision (Critical)
§4.1 Root cause investigation
§4.1.1 The chunker's collision-avoidance workaround
The module doc at crates/kebab-chunk/src/pdf_page_v1.rs:47-60 describes the collision avoidance:
Design §4.2's chunk_id = blake3(doc_id || chunker_version || sort(block_ids)
|| policy_hash) collides when one block (= one PDF page) is split into
multiple chunks: every chunk on that page has identical block_ids.
Workaround: feed a per-chunk variant format!("{base_policy_hash}#c{char_start}")
into the recipe's policy_hash slot.
The actual call at crates/kebab-chunk/src/pdf_page_v1.rs:170-172:
let per_chunk_hash = format!("{base_policy_hash}#c{char_start}");
let chunk_id =
id_for_chunk(&doc.doc_id, &chunker_version, &block_ids, &per_chunk_hash);
Here char_start is the first tuple field returned by chunk_page(...), i.e. the post-overlap actual_start (NOT the original segment boundary start).
§4.1.2 How overlap computes actual_start
crates/kebab-chunk/src/pdf_page_v1.rs:266-281:
let actual_start = if let Some(prev) = chunks.last() {
let prev_min = prev.0; // actual_start of the previous chunk
let mut a = start;
let mut acc_o: usize = 0;
while a > prev_min {
let cl = chars[a - 1].len_utf8();
if acc_o + cl > overlap_bytes {
break;
}
acc_o += cl;
a -= 1;
}
a
} else {
start
};
while a > prev_min: the overlap walk back-walks only as far as the previous chunk's actual_start. When overlap_bytes is large enough and start - prev_min is small, actual_start = prev_min. Two chunks then share the same actual_start and therefore the same #c{N}.
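The collapse can be reproduced in isolation. The following is a minimal sketch of the back-walk loop, lifted out of chunk_page into a standalone function; the names walk_back and korean_page are hypothetical and exist only for this illustration.

```rust
/// Simplified model of the overlap back-walk. `chars` is the page text,
/// `start` the segment boundary, `prev_min` the previous chunk's
/// actual_start, and `overlap_bytes` the overlap budget in UTF-8 bytes.
fn walk_back(chars: &[char], start: usize, prev_min: usize, overlap_bytes: usize) -> usize {
    let mut a = start;
    let mut acc = 0usize;
    while a > prev_min {
        let cl = chars[a - 1].len_utf8();
        if acc + cl > overlap_bytes {
            break; // overlap budget exhausted before reaching prev_min
        }
        acc += cl;
        a -= 1;
    }
    a
}

/// A page made of `n` copies of a 3-byte Korean character.
fn korean_page(n: usize) -> Vec<char> {
    std::iter::repeat('가').take(n).collect()
}
```

With a 240-byte budget the walk can cover up to 80 three-byte characters, so a boundary at char 30 walks all the way back to prev_min = 0, while a boundary at char 200 stops at char 120.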
§4.1.3 Hypothesis check: F2 (1580 OCR chars)
Assumption: F2's OCR text contains a sentence end (. + whitespace) or a paragraph break (\n\n) within its first ~80 chars.
- Default chunking policy: target_tokens=500 → target_bytes=1500, overlap_tokens=80 → overlap_bytes = min(240, 750) = 240.
- A Korean char is 3 bytes in UTF-8, so overlap_bytes=240 allows a back-walk of up to 80 chars.
- Assumed bounds = [0, 30, ~n] (one sentence end within the first ~30 chars).
- segment 1: start=0, chunk_end=30 → chunks.push((0, 30, ...)). Suffix #c0.
- segment 2: start=30, byte_len(30, n) >> target_bytes → a single-segment chunk.
  - actual_start walk: a=30 → walk back while a > 0 and acc_o ≤ 240.
  - 30 chars * 3 bytes = 90 bytes ≤ 240, so the loop ends at a=0 (= prev_min).
  - actual_start = 0 = prev_min.
  - chunks.push((0, n, ...)). Suffix #c0: collision with chunk 1.
The chunk_id input of the two chunks within the same doc:
{kind:"chunk", doc_id:doc_id_F2, chunker_version:"pdf-page-v1", block_ids:[block_id_F2], policy_hash:"{base_hash}#c0"}
- Identical canonical JSON → identical blake3 → identical chunk_id.
→ In put_chunks, the first INSERT INTO chunks row succeeds and the second row hits the PRIMARY KEY violation.
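The arithmetic of this walkthrough can be checked with a short sketch. The factor of 3 bytes per token and the clamp to half the target are read off the numbers in this section rather than taken from the crate source, and budgets, old_suffix, and distinct_suffixes are hypothetical names. The sketch compares only the suffix strings; since every other chunk_id input is identical for chunks on one page, equal suffixes imply equal chunk_ids.

```rust
use std::collections::HashSet;

/// Byte budgets as used in the walkthrough: 3 bytes per token, with the
/// overlap clamped to half the target (assumption inferred from the text).
fn budgets(target_tokens: usize, overlap_tokens: usize) -> (usize, usize) {
    let target_bytes = target_tokens * 3;
    let overlap_bytes = (overlap_tokens * 3).min(target_bytes / 2);
    (target_bytes, overlap_bytes)
}

/// Old suffix recipe: keyed on the post-overlap actual_start.
fn old_suffix(base_policy_hash: &str, actual_start: usize) -> String {
    format!("{base_policy_hash}#c{actual_start}")
}

/// Counts distinct suffixes among chunks given as (actual_start, chunk_end).
fn distinct_suffixes(base: &str, chunks: &[(usize, usize)]) -> usize {
    chunks
        .iter()
        .map(|(start, _)| old_suffix(base, *start))
        .collect::<HashSet<_>>()
        .len()
}
```

For the assumed F2 shape, the chunks (0, 30) and (0, 1580) yield a single distinct suffix, which is the collision described above.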
§4.1.4 Why F1 (779 OCR chars) does not collide
F1's OCR text is also Korean, but either its character distribution differs or it has no sentence boundary within the first ~80 chars. In that case bounds = [0, n], or the first boundary lies beyond char 80 → chunk 2's actual_start is a value other than prev_min → a distinct #c{N} value → a distinct chunk_id.
→ This matches the dogfood observation that only F2 collides.
§4.1.5 Evaluating the dogfood report's hypotheses
The dogfood report guessed a cross-doc collision ("identical to the chunk_id of scanned_page1"). This spec's investigation found an intra-doc collision (inside F2). Rationale:
- The chunk_id input includes doc_id → chunk_ids of different docs differ automatically.
- Two chunks in the same doc with the same block_id and the same #c{N} policy_hash get an identical chunk_id.
- Hypothesis A (policy_hash default magic value): not confirmed; base_policy_hash is the blake3 of the policy's canonical JSON (deterministic).
- Hypothesis B (id_for_block's char_end is part of the hash): possible, but unrelated to the chunk_id collision itself (a block_id change produces a chunk_id change; that is a different collision pattern).
- Hypothesis C (the chunker's block_ids ordering): possible, but with a single block per chunk, ordering is N/A.
- Hypothesis D (OCR text identical inline to another doc): the chunk_id input does not include text, N/A.
Confirmed root cause = a variant of hypothesis C: when a single page becomes multi-chunk, the overlap's actual_start collapses to the previous chunk's actual_start, so the #c{N} suffix is identical.
§4.2 Decision matrix
| Option | Description | Pros | Cons |
|---|---|---|---|
| A: segment boundary start | Change the per_chunk_hash suffix from the post-overlap actual_start to the segment boundary start | Minimal change / segment boundaries are monotonically increasing (seg_idx loop invariant of chunk_page) → always distinct / preserves the semantic intent of chunk_id | Requires changing the shape of chunk_page's return tuple |
| B: chunk ordinal | per_chunk_hash = "#c{ordinal}" (chunk index within the page: 0, 1, 2, ...) | Simplest / independent of segment boundaries | Weakens the "meaningful hash input" semantics of chunk_id |
| C: (char_start, char_end) pair | per_chunk_hash = "#c{char_start}-{char_end}" | Distinct when two chunks share char_start but differ in char_end | char_end can also coincide through the overlap clamp (e.g. when the last chunk is split twice): weak invariant |
| D: sequence number + char_start | per_chunk_hash = "#c{ordinal}-{char_start}" | Invariant fully guaranteed | Redundant info, longer hash input |
§4.3 Chosen path — Option A
Rationale:
- In chunk_page's main loop, seg_idx is strictly increasing, and so is the segment boundary bounds[seg_idx] (bounds is unique after dedup). Using the segment boundary start as the hash suffix therefore guarantees distinct hash inputs for the chunks within one page.
- chunk_id semantics: "which segment does this chunk start from". The pre-overlap segment boundary is the true semantic origin; overlap is an enrichment for retrieval boundaries.
- Extend chunk_page's return tuple to the 4-tuple (segment_start, actual_start, chunk_end, slice) (or track segment_start separately inside the chunker loop): a minimal diff.
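The uniqueness argument can be illustrated with a small sketch: if the segment boundaries are strictly increasing, the suffixes built from them are pairwise distinct. The function names new_suffix, strictly_increasing, and suffixes_unique are hypothetical and not part of the crate.

```rust
use std::collections::HashSet;

/// New suffix recipe (Option A): keyed on the pre-overlap segment boundary.
fn new_suffix(base_policy_hash: &str, segment_start: usize) -> String {
    format!("{base_policy_hash}#c{segment_start}")
}

/// True when every boundary is strictly greater than the one before it.
fn strictly_increasing(bounds: &[usize]) -> bool {
    bounds.windows(2).all(|w| w[0] < w[1])
}

/// True when all suffixes derived from `segment_starts` are pairwise distinct.
fn suffixes_unique(base: &str, segment_starts: &[usize]) -> bool {
    let mut seen = HashSet::new();
    // `insert` returns false on the first repeated suffix.
    segment_starts
        .iter()
        .all(|s| seen.insert(new_suffix(base, *s)))
}
```

For the F2 shape from §4.1.3, the segment starts [0, 30] stay distinct even though both chunks share actual_start = 0.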
§4.4 Implementation
Extend the return signature of chunk_page in crates/kebab-chunk/src/pdf_page_v1.rs:
/// Split a single page's text into ordered chunks, each represented as
/// `(segment_start, actual_start, chunk_end, text_slice)`.
///
/// - `segment_start` = pre-overlap segment boundary. Strictly increasing
/// across the returned vec. Use this for chunk_id uniqueness suffixes.
/// - `actual_start` = post-overlap start char index. May collapse to a
/// previous chunk's `actual_start` under aggressive overlap policy.
/// Use this for `SourceSpan::Page::char_start`.
/// - `chunk_end` = chunk's end char index (exclusive).
fn chunk_page(
text: &str,
target_bytes: usize,
overlap_bytes: usize,
) -> Vec<(usize, usize, usize, String)> {
// ... (existing logic, but each push uses (segment_start, actual_start, chunk_end, slice))
chunks.push((start, actual_start, chunk_end, slice));
// ...
}
Update the caller, the chunk method, in the same way:
for (segment_start, char_start, char_end, slice) in
chunk_page(&p.text, target_bytes, overlap_bytes)
{
// ... existing u32 conversion + span construction ...
let span = SourceSpan::Page {
page: page_num,
char_start: Some(u32::try_from(char_start).expect("...")),
char_end: Some(u32::try_from(char_end).expect("...")),
};
let block_ids: Vec<BlockId> = vec![p.common.block_id.clone()];
// segment_start (pre-overlap boundary) is strictly increasing across
// chunks, even when overlap walk collapses actual_start to prev_min.
let per_chunk_hash = format!("{base_policy_hash}#c{segment_start}");
let chunk_id =
id_for_chunk(&doc.doc_id, &chunker_version, &block_ids, &per_chunk_hash);
// ... rest unchanged ...
}
Update the module doc at crates/kebab-chunk/src/pdf_page_v1.rs:47-60 at the same time, because the "#c{char_start}" in the existing description is stale under the new fix:
//! Design §4.2's chunk_id = blake3(doc_id || chunker_version || sort(block_ids)
//! || policy_hash) collides when one block (= one PDF page) is split into
//! multiple chunks: every chunk on that page has identical block_ids.
//!
//! Workaround that doesn't change the §4.2 recipe: feed a per-chunk
//! variant `format!("{base_policy_hash}#c{segment_start}")` into the
//! recipe's `policy_hash` slot. `segment_start` is the pre-overlap segment
//! boundary, strictly increasing across the returned chunks even when the
//! overlap walk collapses `actual_start` to a previous chunk's `prev_min`.
//! Logged in tasks/HOTFIXES.md (2026-05-27 — Bug #3 second-iteration patch).
Also add a dated entry to tasks/HOTFIXES.md (documenting that this fix is the second-iteration patch of the chunk_id deviation: the first iteration's #c{char_start} workaround left a collision in the aggressive-overlap case):
## 2026-05-27 — v0.20.0 sub-item 1: chunk_id `#c{char_start}` workaround
collapses under aggressive overlap (Bug #3 second-iteration patch)
**Symptom**: ingesting F2 (1580 chars OCR) fails with `DocumentStore::put_chunks (pdf):
UNIQUE constraint failed: chunks.chunk_id`. ...
**Root cause**: in `crates/kebab-chunk/src/pdf_page_v1.rs:170` ...
the post-overlap `actual_start` collapses to the previous chunk's actual_start ...
**Fix** (this spec, §4.4): add `segment_start` to the `chunk_page` return tuple
and switch the `per_chunk_hash` suffix to `segment_start` ...
**chunker_version cascade**: `pdf-page-v1` → `pdf-page-v1.1` bump
(see §4.4.1). The chunk_id of multi-chunk PDF pages changes; the design §9
cascade trigger makes the invalidation explicit.
**Amends**: spec `docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md`
§4.4. The parent design §4.2 chunk_id recipe itself is unchanged (only the
internal computation of the workaround layer changes).
§4.4.1 Preserving chunk_id determinism
Existing single-chunk-per-page case (e.g. small pages, text.len() <= target_bytes):
- chunk_page's early return vec![(0, n, text.to_string())] becomes vec![(0, 0, n, text.to_string())] in the new shape. segment_start = 0 = actual_start.
- Same #c0 suffix → same chunk_id.
First chunk of the multi-chunk case:
- segment_start = bounds[0] = 0, actual_start = start = 0 (no previous chunk).
- Same #c0 suffix → same chunk_id.
Second and later chunks of the multi-chunk case:
- Before: actual_start (overlap-walked, may be == 0).
- After: segment_start = bounds[seg_idx] > 0.
- → chunk_id changes (intentional, to avoid the collision).
→ The chunk_ids of multi-chunk pages in existing v0.19 (pre-OCR) PDF KBs change. The v0.20 force-reingest path refreshes them automatically.
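As a rough illustration of the three cases, the sketch below lists which chunks get a different hash suffix under the new recipe. It looks at the suffix only (any change to chunker_version would alter every chunk_id regardless), and the ChunkPos struct and changed_chunks function are hypothetical names used for this example.

```rust
/// One chunk as returned by the extended `chunk_page`, minus end and text.
struct ChunkPos {
    segment_start: usize,
    actual_start: usize,
}

/// Indices of chunks whose hash suffix differs between the old recipe
/// (`#c{actual_start}`) and the new one (`#c{segment_start}`).
fn changed_chunks(chunks: &[ChunkPos]) -> Vec<usize> {
    chunks
        .iter()
        .enumerate()
        .filter(|(_, c)| c.segment_start != c.actual_start)
        .map(|(i, _)| i)
        .collect()
}
```

A single-chunk page (segment_start = actual_start = 0) reports no change, and in a multi-chunk page only the chunks whose start was moved by the overlap walk are listed.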
Decision (round 1c, closes §7.2 Open Q1): bump chunker_version pdf-page-v1 → pdf-page-v1.1 (adopting the critic round 1 M-1 recommendation).
Rationale:
- The chunk_ids of normal multi-chunk PDF pages (e.g. the 21 blocks / 34 chunks of metro-korea.pdf in dogfood report Scenario 1, a normal path that did not trigger Bug #3) silently map to different values because of the internal computation change. If chunker_version stayed at pdf-page-v1, the cascade audit of the store/embedding layer would not fire → unless the user explicitly passes --force-reingest, chunk_id ↔ chunk_text in the vector store could silently mismatch.
- The original intent of the design §9 cascade rule: a chunker algorithm change gets an explicit version bump → the store layer reports the invalidation automatically. The pdf-page-v1.1 bump applies that rule directly.
- Bump cost = zero: v0.20.0 itself is a force-update release (a cumulative bugfix stack on top of the single release commit of PR #189), and enabling the OCR feature of the parent spec (2026-05-27-pdf-scanned-ocr-spec.md) already recommends the force-reingest path. Single-chunk PDF pages, which differ only in chunker_version, are recomputed by the same cascade inside the new doc_id chain.
- Benefit = an explicit user-facing audit trail. The cascade invalidation appears in the store layer report on the next ingest.
The other cascade version fields (parser_version / embedding_version / prompt_template_version / index_version) are unchanged: the patch is confined to the chunker layer.
Update the chunker_version() constant of PdfPageV1Chunker:
impl Chunker for PdfPageV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion("pdf-page-v1.1".to_string()) // was: "pdf-page-v1"
}
// ...
}
Also update PARSER_VERSION or const CHUNKER_VERSION in crates/kebab-chunk/src/pdf_page_v1.rs at the same time (matching the actual constant name in that crate).
§4.5 Test additions
Add to #[cfg(test)] mod tests in crates/kebab-chunk/src/pdf_page_v1.rs:
#[test]
fn multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids() {
// Regression test for v0.20.0 sub-item 1 Bug #3: post-overlap actual_start
// can collapse to prev_min, producing identical `#c{char_start}` suffixes
// and identical chunk_ids → sqlite chunks.chunk_id PRIMARY KEY violation
// at put_chunks INSERT time.
//
// Synthesises Korean OCR text shape: dense Korean characters (3 bytes
// per char) with a single early sentence-end boundary at char ~10 +
// long tail.
// 10 Korean chars (= 30 UTF-8 bytes) + "." + " " + ~500 more Korean chars.
let early_seg: String = std::iter::repeat('가').take(10).collect();
let tail: String = std::iter::repeat('나').take(500).collect();
let page_text = format!("{early_seg}. {tail}");
let doc = make_pdf_doc(&[&page_text]);
let policy = default_policy(500, 80); // target=1500 byte, overlap=240 byte
let chunks = PdfPageV1Chunker.chunk(&doc, &policy).unwrap();
assert!(
chunks.len() >= 2,
"expected ≥2 chunks for {} byte page; got {}",
page_text.len(),
chunks.len()
);
// Hard invariant: all chunk_ids must be unique. Without the fix, the
// second chunk would have actual_start = 0 (== first chunk's
// actual_start) under the aggressive-overlap walk → identical `#c0`
// hash suffix → identical chunk_id → PRIMARY KEY violation.
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
ids.sort_unstable();
let total = ids.len();
ids.dedup();
assert_eq!(
ids.len(),
total,
"all chunk_ids must be unique even when overlap walks actual_start back to prev_min"
);
}
(round 1c L-1: the second test from round 0, chunk_id_recipe_uses_segment_start_not_actual_start, was removed: it was redundant with this test's uniqueness check, and its only real assertion was assert!(chunks.len() >= 2), which did not match the intent in the test name.)
Also add an integration-level regression test under crates/kebab-app/tests/:
// crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs (new)
//
// v0.20.0 sub-item 1 Bug #3 regression: ingesting multiple scanned PDFs (each
// a single page with a different OCR text length) passes without any
// chunk_id collision.
//
// A mock OCR engine (MockOcrEngine from kebab-parse-image, or an inline impl)
// is configured to return a different text length per page (e.g. 30 chars,
// 200 chars, 800 chars), avoiding real Ollama calls.
#[test]
fn multi_scanned_pdf_ingest_no_chunk_id_collision() {
// ... setup: 3 scanned PDF fixture, mock OCR engine, isolated KB
// ... assert: every item in ingest_report.items has kind != Error
// ... assert: store.get_chunks_count() = sum of per-PDF chunk_counts
}
(round 1c NIT-1: the file name and function name are unified as multi_scanned_pdf_ingest_no_chunk_id_collision; the round 0 file name pdf_multi_scan_no_chunk_id_collision.rs did not match the fn name.)
§4.5.1 Pre-condition — MockOcrEngine availability (round 1c M-3)
This integration test requires a mock impl of the OcrEngine trait. First step of the executor stage:
- Locate MockOcrEngine with grep -rn "impl OcrEngine" crates/kebab-parse-image/src/ crates/kebab-app/tests/.
- Current state (2026-05-27 verifier probe):
  - crates/kebab-parse-image/src/ocr.rs:235: production impl OcrEngine for OllamaVisionOcr.
  - crates/kebab-app/tests/pdf_ocr_apply.rs:25: impl OcrEngine for MockOcrEngine (test-only).
- The new integration test (multi_scanned_pdf_ingest_no_chunk_id_collision.rs) is a separate test binary in the same crate (kebab-app), so it cannot directly import the private MockOcrEngine of pdf_ocr_apply.rs. The executor's options:
  - Option A (recommended): lift MockOcrEngine into crates/kebab-app/tests/common/mock_ocr.rs (parameterised, taking the per-page text length as a ctor argument). Both tests (pdf_ocr_apply.rs and the new test) share it via mod common;.
  - Option B: define a duplicate inline impl OcrEngine for LocalMock { ... } inside the new test (prioritising test isolation, avoiding the sharing cost).
- If MockOcrEngine is unavailable (or sharing is hard and Option B is also impractical), conditionally downgrade the acceptance of §6 row 7: the unit-level invariant in kebab-chunk (§6 row 4) alone pins the core regression of Bug #3, and the integration test is skipped.
The decision path of the executor's dependency-check task is closed in §7.2 Open Q4.
§4.6 Acceptance (Bug #3 fix)
- Zero chunk_id collisions when F1 (779 chars) and F2 (1580 chars) are ingested together.
- Zero collisions on every --force-reingest.
- Zero collisions in a KB of 5+ scanned PDFs (Korean OCR text ranging from 100 to 3000 chars).
- The existing 1000-determinism test of crates/kebab-chunk (deterministic_chunk_ids_1000) still passes.
- Zero workspace test regressions.
- The new test multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids and the integration test multi_scanned_pdf_ingest_no_chunk_id_collision are added.
§5 Bug #4 — F4 fixture Pages tree
§5.1 Root cause
§5.1.1 Symptom
{
"doc_path": "mojibake.pdf",
"kind": "new",
"byte_len": 22568,
"pdf_ocr_pages": 0,
"pdf_ocr_ms_total": 0,
"block_count": 0,
"chunk_count": 0,
"warnings": []
}
This violates PdfTextExtractor's invariant (§1.4 "1 Block::Paragraph per page").
§5.1.2 How lopdf get_pages() reacts
dogfood probe:
- lopdf::Document::load_mem(F4_bytes) → OK.
- pdf_doc.get_pages() → empty BTreeMap.
- In the PDF byte stream, /Type /Page count = 1 and /Count value = 1.
→ Structurally one page exists, but lopdf's Pages tree traversal (/Pages → /Kids chain) is broken.
§5.1.3 Analysis of the fixture generation path
tests/fixtures/_synth/mojibake.py:
c = canvas.Canvas(str(dst), pagesize=A4)
c.setFont(FONT_NAME, 12)
y = A4[1] - 30*mm
for line in ["Mojibake fixture (no ToUnicode CMap)", "..."]:
    c.drawString(30*mm, y, line)
    y -= 16
c.save()
data = dst.read_bytes()
# pattern: "/ToUnicode <objref>"; strips the indirect object reference
new_data = re.sub(rb"/ToUnicode\s+\d+\s+\d+\s+R\b", b"", data)
dst.write_bytes(new_data)
Step 2 analysis: re.sub removes the /ToUnicode N M R byte sequence, but:
- PDF byte offsets shift by the length of the removed bytes.
- The offset entries of the cross-reference table (xref) become stale.
- The offset in the startxref value becomes stale too.
The startxref fix of Step 3 (commit c2cd3a7 in tasks/HOTFIXES.md):
- A manual byte edit 22130 → 22114 updates startxref.
- However, the individual offsets in the xref table itself are stale as well: the actual byte position of the indirect objects referenced by the Pages tree's /Kids array does not match their xref entries.
- lopdf's strict load validates startxref + the xref table first; the load succeeds, but indirect object resolution fails during Pages tree traversal → empty Pages map.
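The offset drift can be modelled with a few lines of arithmetic. In the sketch below, the 16-byte cut matches the 22130 → 22114 startxref edit quoted above; the cut position (4000) and the object offsets used in the usage note are made-up values for illustration, and shifted_offsets and stale_entries are hypothetical helper names. The model assumes no object starts inside the removed span.

```rust
/// Given offsets recorded before a byte-level edit, return the offsets the
/// objects actually have after `removed_len` bytes were cut at `cut_pos`.
/// Objects located at or before the cut keep their offset.
fn shifted_offsets(recorded: &[usize], cut_pos: usize, removed_len: usize) -> Vec<usize> {
    recorded
        .iter()
        .map(|&off| {
            if off > cut_pos {
                off.saturating_sub(removed_len)
            } else {
                off
            }
        })
        .collect()
}

/// Indices of recorded xref entries that no longer point at their object.
fn stale_entries(recorded: &[usize], cut_pos: usize, removed_len: usize) -> Vec<usize> {
    let actual = shifted_offsets(recorded, cut_pos, removed_len);
    (0..recorded.len())
        .filter(|&i| recorded[i] != actual[i])
        .collect()
}
```

For recorded offsets [15, 300, 5000, 21000] and a 16-byte cut at position 4000, entries 2 and 3 are stale. Patching startxref alone repairs only one of the shifted values, which is why the manual edit was not sufficient.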
§5.2 Fixture re-generation strategy
| Option | Description | Pros | Cons |
|---|---|---|---|
| A: pikepdf | After reportlab synthesis, open with pikepdf + remove ToUnicode + save (xref auto-regen) | Proper xref regeneration / Pages tree intact / library available (pip install pikepdf) | New Python dependency (pikepdf) |
| B: qpdf normalize | After the byte edit, run qpdf --linearize input.pdf output.pdf | External tool (the sub-item 1 acceptance criteria already carry a qpdf hint) | qpdf's normalize may reject the broken xref (or may inline the ToUnicode reference again) |
| C: reportlab disable ToUnicode | Disable ToUnicode CMap generation for the Type 0 font during reportlab synthesis | Avoids the byte edit: clean | The reportlab API does not directly expose a ToUnicode disable (requires a font subclass or monkeypatch) |
§5.3 Chosen path — Option A (pikepdf)
Rationale:
- pikepdf is a proper PDF surgery library: the Python bindings for qpdf.
- It regenerates the xref table automatically and preserves Pages tree integrity.
- The dependency is added with pip install pikepdf; the Python venv for fixture generation already uses reportlab, so the extra install is trivial.
§5.3.1 The pikepdf approach to stripping ToUnicode
In reportlab's Type 0 font, the ToUnicode CMap reference appears as /ToUnicode <ref> inside the font dictionary. Use pikepdf to remove only the /ToUnicode entry of the font dictionary:
import pikepdf
with pikepdf.open(str(dst), allow_overwriting_input=True) as pdf:
    # Walk all indirect objects, delete /ToUnicode entry whenever found.
    # Per the PDF spec, /ToUnicode appears only as a child of a Font
    # dictionary, so the false-positive risk is practically zero. The
    # explicit font type check is omitted (same form as the actual
    # implementation in §5.4).
    for obj in pdf.objects:
        if isinstance(obj, pikepdf.Dictionary):
            if "/ToUnicode" in obj:
                del obj["/ToUnicode"]
    pdf.save(str(dst))
pikepdf's save preserves the integrity of the xref + Pages tree automatically.
§5.4 Implementation (mojibake.py revision)
Full rewrite of tests/fixtures/_synth/mojibake.py:
"""Synthesize mojibake fixture -- Type 0 font PDF without ToUnicode CMap.

Strategy:
  1. Synthesize a Korean PDF that uses a Type 0 (CID) font with reportlab
     (includes the normal ToUnicode CMap).
  2. Open with pikepdf, remove the /ToUnicode entry of the font dictionary,
     and save (xref regenerated automatically).

Dependency: reportlab + pikepdf. Install via `pip install reportlab pikepdf`.

Usage:
    python3 tests/fixtures/_synth/mojibake.py \
        crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf
"""
import sys
from pathlib import Path

from reportlab.lib.pagesizes import A4
from reportlab.lib.units import mm
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfgen import canvas
import pikepdf

DEJAVU_TTF = "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"
FONT_NAME = "DejaVuSans"
pdfmetrics.registerFont(TTFont(FONT_NAME, DEJAVU_TTF))
dst = Path(sys.argv[1])

# Step 1: synthesize a normal PDF.
c = canvas.Canvas(str(dst), pagesize=A4)
c.setFont(FONT_NAME, 12)
y = A4[1] - 30 * mm
for line in [
    "Mojibake fixture (no ToUnicode CMap)",
    "Text extraction yields garbage \x00\x01\x02",
]:
    c.drawString(30 * mm, y, line)
    y -= 16
c.save()

# Step 2: strip the /ToUnicode reference with pikepdf + xref regeneration.
removed = 0
with pikepdf.open(str(dst), allow_overwriting_input=True) as pdf:
    for obj in pdf.objects:
        if isinstance(obj, pikepdf.Dictionary):
            if "/ToUnicode" in obj:
                del obj["/ToUnicode"]
                removed += 1
    pdf.save(str(dst))
if removed == 0:
    print("ERROR: no /ToUnicode entry found in any dictionary", file=sys.stderr)
    sys.exit(2)

# Step 3: verify invariants (load + page count).
with pikepdf.open(str(dst)) as pdf:
    n_pages = len(pdf.pages)
if n_pages != 1:
    print(f"ERROR: expected 1 page, got {n_pages}", file=sys.stderr)
    sys.exit(3)

# Confirm the no-ToUnicode invariant.
raw = Path(dst).read_bytes()
if b"/ToUnicode" in raw:
    print("ERROR: /ToUnicode still present after strip", file=sys.stderr)
    sys.exit(4)
print(f"wrote {dst} ({dst.stat().st_size} bytes, ToUnicode stripped via pikepdf, 1 page)")
§5.5 Test additions
Add to crates/kebab-parse-pdf/tests/text_extractor.rs (or the relevant existing test file):
/// Pages tree invariant of F4 mojibake.pdf: the Step 2 fixture re-generation
/// (pikepdf-based) ensures that lopdf's get_pages() returns normally.
///
/// Bug #4 regression: the previous fixture (byte-edit + manual startxref)
/// passed lopdf's strict load, but broken indirect object resolution during
/// Pages tree traversal → empty pages map → block_count=0.
#[test]
fn mojibake_fixture_load_yields_one_page() {
let bytes = include_bytes!("../tests/fixtures/mojibake.pdf");
let doc = lopdf::Document::load_mem(bytes).expect("F4 fixture must lopdf-load");
let pages = doc.get_pages();
assert_eq!(
pages.len(),
1,
"F4 fixture must have exactly 1 page (Pages tree integrity)"
);
}
#[test]
fn mojibake_fixture_has_no_tounicode_cmap() {
// No-ToUnicode invariant from Step 2.
let bytes = std::fs::read("tests/fixtures/mojibake.pdf").unwrap();
let count = bytes.windows(b"/ToUnicode".len())
.filter(|w| *w == b"/ToUnicode")
.count();
assert_eq!(count, 0, "F4 fixture must not contain /ToUnicode marker");
}
#[test]
fn pdf_text_extractor_on_mojibake_yields_one_block() {
// PdfTextExtractor invariant: 1 Block::Paragraph per page.
// The F4 fixture lacks ToUnicode → text extraction yields garbage or
// empty → 1 empty Block::Paragraph + "scanned candidate" warning.
let bytes = include_bytes!("../tests/fixtures/mojibake.pdf");
// ... ExtractContext setup + extractor.extract(&ctx, bytes) ...
let canonical = extractor.extract(&ctx, bytes).unwrap();
assert_eq!(canonical.blocks.len(), 1, "expected 1 Block::Paragraph per page");
// The text is garbage or empty; the invariant is that the block itself exists.
let warning_present = canonical.provenance.events.iter().any(|e| {
matches!(e.kind, ProvenanceKind::Warning)
&& e.note.as_ref().is_some_and(|n| n.contains("scanned candidate"))
});
assert!(warning_present || !canonical.blocks[0].text_is_empty(),
"an empty text-detect-first fallback must emit the scanned-candidate warning");
}
§5.6 Acceptance (Bug #4 fix)
- After F4 fixture re-generation, lopdf::Document::load_mem(...).get_pages().len() = 1.
- The F4 fixture keeps its no-ToUnicode-CMap invariant (grep -c "/ToUnicode" mojibake.pdf = 0).
- When PdfTextExtractor ingests F4: block_count = 1, with the warning "page1 empty (scanned candidate)" or garbage text.
- On dogfood retest, mojibake.pdf shows block_count: 1 and chunk_count: 0~1 (depending on text content).
- Update the F4 baseline in the existing text_extractor_regression.rs: the old baseline is itself a snapshot of the broken invariant, so it needs an update.
- Zero workspace test regressions.
§6 Acceptance criteria (consolidated)
| # | Verifier | Bug | Command |
|---|---|---|---|
| 1 | walker bypasses size cap for PDF | #2 | cargo test -p kebab-source-fs size_cap_skips_only_code_files |
| 2 | walker still skips oversized code files | #2 | cargo test -p kebab-source-fs ingest_report_counts_oversized_files_by_bytes |
| 3 | 256 KB+ PDF/markdown ingests under default config | #2 | dogfood retest: kebab ingest reports skipped_size_exceeded = 0 for non-code files |
| 4 | chunker collision regression test | #3 | cargo test -p kebab-chunk multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids |
| 5 | chunker determinism preserved | #3 | cargo test -p kebab-chunk deterministic_chunk_ids_1000 |
| 6 | chunker overlap clamp preserved | #3 | cargo test -p kebab-chunk overlap_clamped_when_overlap_exceeds_target |
| 7 | integration: multi-scanned PDF ingest (conditional: only if the MockOcrEngine from §4.5.1 can be shared) | #3 | cargo test -p kebab-app multi_scanned_pdf_ingest_no_chunk_id_collision |
| 8 | dogfood: F1 + F2 force-reingest | #3 | dogfood retest: kebab ingest --force-reingest yields errors = 0 (excluding encrypted PDFs) |
| 9 | F4 fixture lopdf 1-page invariant | #4 | cargo test -p kebab-parse-pdf mojibake_fixture_load_yields_one_page |
| 10 | F4 fixture no-ToUnicode invariant | #4 | cargo test -p kebab-parse-pdf mojibake_fixture_has_no_tounicode_cmap |
| 11 | F4 PdfTextExtractor 1-block invariant | #4 | cargo test -p kebab-parse-pdf pdf_text_extractor_on_mojibake_yields_one_block |
| 12 | dogfood: F4 ingest yields block_count=1 | #4 | dogfood retest: the ingest item for mojibake.pdf shows block_count: 1 |
| 13 | workspace clippy clean | all | cargo clippy --workspace --all-targets -- -D warnings |
| 14 | workspace full test pass | all | cargo test --workspace --no-fail-fast -j 1 |
| 15 | dogfood end-to-end 9 PDF | all | dogfood retest: all 9 PDFs ingested, errors = 2 (encrypted only) |
| 16 | chunker_version cascade, final value | #3 | grep -nE 'pdf-page-v[0-9.]+' crates/kebab-chunk/src/pdf_page_v1.rs returns "pdf-page-v1.1" (round 1c M-1 decision) |
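
The behavior that rows 1 to 3 verify can be sketched as follows. This is a minimal illustration, not the code in connector.rs: the function skip_for_size, the extension table, and the 256 KB cap used in the usage note are assumptions, and the real code_lang_for_path in code_meta.rs may have a different signature.

```rust
use std::path::Path;

/// Hypothetical stand-in for the code-language lookup. The real
/// `code_lang_for_path` may use a different signature and extension set.
fn code_lang_for_path(path: &Path) -> Option<&'static str> {
    match path.extension()?.to_str()? {
        "rs" => Some("rust"),
        "py" => Some("python"),
        "ts" => Some("typescript"),
        _ => None,
    }
}

/// Returns true when the walker should skip a file because of its size.
/// Only files recognized as code are subject to the cap; PDFs, images,
/// and markdown pass through regardless of size.
fn skip_for_size(path: &Path, len_bytes: u64, code_cap_bytes: u64) -> bool {
    code_lang_for_path(path).is_some() && len_bytes > code_cap_bytes
}
```

With a 256 KB cap, a 58 MB scan.pdf would not be skipped under this sketch, while a 300 KB gen.rs would. That matches the intent of rows 1 and 2: the cap keeps applying to code and stops applying to everything else.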
§7 Risks + open questions
§7.1 Risks
- Bug #3 fix changes chunk_id: the chunk_id of every multi-chunk PDF page (a page exceeding 1500 bytes at the pre-OCR stage) changes. Users need to run --force-reingest once. This is acceptable because v0.20.0 is a force-update path (users do a fresh ingest anyway). State it in the README or the release notes.
- Side effect of the Bug #2 fix: PDFs of 1 GB or more now pass the walker, so lopdf's load_mem risks a memory blow-up. Out of scope for v0.20 (a streaming parser is to be considered from Phase 9 onward; see the future scope in design §9.2). Acceptable for this fix.
- Bug #4 fix changes the fixture binary: the SHA256 of the F4 mojibake.pdf changes, which adds git LFS / binary diff noise. The baseline in text_extractor_regression.rs is also updated to the new fixture's output, in the same commit.
- pikepdf install requirement: fixture re-generation requires pip install pikepdf. That would add a Python dependency to CI only if fixture regeneration were part of CI. This spec's fix commits the fixture itself, so generation is a one-off and no CI dependency arises.
§7.2 Open questions
- Cost-benefit of the chunker_version bump: ✅ CLOSED (round 1c M-1). Decision: bump pdf-page-v1 → pdf-page-v1.1. The cascade audit trail becomes explicit, and the cost is zero because v0.20 is a force-update path. Details are in the "Decision (round 1c, closes §7.2 Open Q1)" paragraph of §4.4.1.
- Including Bug #2 Option B (per-type config) in v0.20 scope: this spec defers it to v0.21+. critic round 1 ACCEPT, with no recommendation to include it in v0.20.
- F4 fixture invariant: critic round 1 ACCEPT. The combination of no ToUnicode + a valid Pages tree is exactly reproducible through proper PDF surgery with pikepdf. The Step 2 design (mojibake.py rewrite) is sound.
- Mock OCR for the integration test: ✅ CLOSED (round 1c M-3). impl OcrEngine for MockOcrEngine already exists at crates/kebab-app/tests/pdf_ocr_apply.rs:25. Only the executor's choice remains: a share path (Option A: lift into tests/common/mock_ocr.rs) or inline duplication (Option B). If sharing is not possible, §6 row 7 is conditionally downgraded. Details are in the "Pre-condition — MockOcrEngine availability" paragraph of §4.5.1.
- chunk_page tuple shape change: does Option A's 4-tuple (segment_start, actual_start, chunk_end, slice) affect external callers? chunk_page is module-private (fn chunk_page), so it has 0 external callers and the change is safe. critic round 1 ACCEPT.
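
To make the 4-tuple shape in the last item concrete, here is a minimal sketch of a page chunker that returns (segment_start, actual_start, chunk_end, slice). It is an assumption-laden illustration, not the PdfPageV1Chunker implementation: the parameters target and overlap, the clamp rule, and the helper floor_char_boundary are all hypothetical, and how chunk_id is derived from these offsets is not shown.

```rust
/// One chunk of a page: (segment_start, actual_start, chunk_end, slice).
/// `segment_start` is the start before overlap is applied; `actual_start`
/// is where the emitted slice really begins once overlap is pulled in.
type PageChunk<'a> = (usize, usize, usize, &'a str);

/// Hypothetical helper: largest char boundary <= i (or text length).
fn floor_char_boundary(s: &str, mut i: usize) -> usize {
    if i >= s.len() {
        return s.len();
    }
    while !s.is_char_boundary(i) {
        i -= 1; // index 0 is always a boundary, so this terminates
    }
    i
}

/// Splits `text` into chunks of about `target` bytes, each extended
/// backwards by up to `overlap` bytes. Assumed clamp: overlap < target.
fn chunk_page(text: &str, target: usize, overlap: usize) -> Vec<PageChunk<'_>> {
    assert!(target > 0, "target must be positive");
    let overlap = overlap.min(target - 1);
    let mut out = Vec::new();
    let mut segment_start = 0;
    while segment_start < text.len() {
        let mut chunk_end = floor_char_boundary(text, segment_start + target);
        if chunk_end <= segment_start {
            // target is smaller than one multi-byte char: take that char whole.
            let first = text[segment_start..].chars().next();
            chunk_end = segment_start + first.map_or(1, char::len_utf8);
        }
        let actual_start =
            floor_char_boundary(text, segment_start.saturating_sub(overlap));
        out.push((segment_start, actual_start, chunk_end, &text[actual_start..chunk_end]));
        segment_start = chunk_end; // strictly increases on every iteration
    }
    out
}
```

In this sketch segment_start advances strictly from one chunk to the next, so it stays distinct per chunk even when an aggressive overlap pulls actual_start far back. Because the function is private to its module, changing its return tuple only touches callers in the same file, which is the point the item above makes.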
§8 References
- dogfood report: .omc/reviews/2026-05-27-v0.20-dogfood-report.md
- parent spec (frozen): docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
- parent plan (round 1c ACCEPT): docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md
- source code (root cause evidence):
  - crates/kebab-source-fs/src/connector.rs (Bug #2)
  - crates/kebab-source-fs/src/code_meta.rs (is_oversized + code_lang_for_path)
  - crates/kebab-config/src/lib.rs (IngestCodeCfg)
  - crates/kebab-core/src/ids.rs (id_for_chunk / id_for_block recipes)
  - crates/kebab-chunk/src/pdf_page_v1.rs (PdfPageV1Chunker + chunk_page)
  - crates/kebab-app/src/pdf_ocr_apply.rs (post-extract OCR enrichment)
  - crates/kebab-app/src/lib.rs:1769-1968 (ingest_one_pdf_asset wiring)
  - crates/kebab-store-sqlite/src/documents.rs:103-155 (put_chunks DELETE+INSERT)
  - migrations/V001__init.sql:80-94 (chunks table DDL, chunk_id PRIMARY KEY)
  - tests/fixtures/_synth/mojibake.py (Bug #4 fixture source)
- design §3.4 (SourceSpan::Page), §3.5 (Chunk + chunk_id recipe), §4.2 (id_from canonical JSON), §5.2 (walker builtin blacklist), §9 (versioning cascade).