kebab/docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md
altair823 46e99470eb docs(superpowers): v0.20 sub-item 1 bugfix1/2/3 specs + plans + DOGFOOD.md
Output of a 3-round dogfood-driven fix cycle:

- bugfix1 (Bug #2/#3/#4): spec 964 line + plan 848 line
- bugfix2 (Bug #6/#7, #8 falsified): spec 308 line + plan 388 line
- bugfix3 (Bug #9/#10/#11/#13/#14, #12 falsified): spec 410 line + plan 1043 line
- docs/DOGFOOD.md: the full end-to-end dogfood checklist (§0 environment through §13 reference corpus)

Each round's spec/plan was frozen after critic + verifier round 2 closure ACCEPT, based on dogfood-driven evidence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 01:21:34 +00:00


title: v0.20.0 sub-item 1 bugfix — chunk_id collision + walker code limit + F4 fixture
created: 2026-05-27
status: ACCEPT (round 2 closure — Phase A complete)
target_version: 0.20.0 (PR
parent_spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
dogfood_evidence: .omc/reviews/2026-05-27-v0.20-dogfood-report.md
review_history:
2026-05-27 spec round 1 critic (opus, thorough) — ACCEPT, HIGH 0 + MEDIUM 3 + LOW 2 + NIT 2
2026-05-27 spec round 1c rewrite (opus, drafter) — MEDIUM/LOW/NIT all applied
2026-05-27 spec round 2 closure critic (opus) — ACCEPT, 7/7 applied + 1 NIT (frontmatter status, applied here)

v0.20.0 sub-item 1 bugfix — chunk_id collision + walker code limit + F4 fixture

This spec documents the root cause analysis, fix design, and acceptance criteria for the 3 bugs found while dogfooding PR #189 of v0.20.0 sub-item 1 (scanned PDF OCR). It is the source for the follow-up plan and executor stages.

§1 Background + dogfood evidence chain

§1.1 Dogfood environment

| Item | Value |
| --- | --- |
| Binary | kebab v0.20.0 (commit b4d9e60) |
| Ollama endpoint | http://192.168.0.47:11434 (qwen2.5vl:3b) |
| Isolated KB | /build/cache/tmp/v0.20-dogfood/ |
| Corpus | 9 PDF (PoC + sub-item fixture + 3 user PDF, 466 KB ~ 58 MB) |

§1.2 Reproducibility of the 3 bugs

| Bug | Severity | Trigger | Reproducible |
| --- | --- | --- | --- |
| #2 walker code limit | Important | Ingesting a 256 KB+ PDF/image/markdown file | Always (default config) |
| #3 chunk_id collision | Critical | Ingesting scanned_page2.pdf (1580 OCR chars) | On every force-reingest |
| #4 F4 Pages tree | Important | Ingesting mojibake.pdf (F4 fixture) | Always |

§1.3 Dogfood report excerpts

Key excerpts from the dogfood report (.omc/reviews/2026-05-27-v0.20-dogfood-report.md):

  • Bug #2: scanned=3, skipped_size_exceeded=6. Only 3 of the 9 workspace PDFs passed; 6 PDFs (F1 466KB / F2 756KB / metro 57MB / thermal-pos 1.1MB / thermal-label 2.7MB / internals 820KB) were skipped at the walker stage.
  • Bug #3: "DocumentStore::put_chunks (pdf): sqlite error: UNIQUE constraint failed: chunks.chunk_id: ... Error code 1555: A PRIMARY KEY constraint failed", raised at the chunk INSERT stage for scanned_page2.pdf.
  • Bug #4: block_count: 0, chunk_count: 0. The ingest result for F4 mojibake.pdf violates PdfTextExtractor's "1 paragraph per page" invariant.

§2 Goals + non-goals

§2.1 Goals

  • Fix all 3 bugs within v0.20.0 (PR #189 force-update path: new commits stack on the same branch).
  • Preserve the invariants of the parent spec (docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md):
    • §1.4 PdfTextExtractor's "1 Block::Paragraph per page".
    • §3.5 block_id preservation in post-extract OCR enrichment (in-place mutate path).
    • §4.6 wire schema changes are additive only (no V00X migration needed).
  • No conflicts with the design decisions accepted in round 1c of the parent plan (docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md).
  • Zero workspace test regressions (cargo test --workspace -j 1).

§2.2 Non-goals

  • A new wire schema major bump (v1 → v2): these fixes add no schema changes.
  • A new V00X sqlite migration: the chunks table DDL is unchanged; the fix is confined to the chunk_id computation path.
  • Changing the F4 fixture invariants (the requirements of no ToUnicode CMap and a valid 1-page PDF remain).
  • Adding new config knobs (per-media-type limits such as [ingest.pdf].max_file_bytes are v0.21+ scope; this fix only separates the walker's code path).

§3 Bug #2 — walker code limit

§3.1 Root cause (file:line evidence)

crates/kebab-source-fs/src/connector.rs:42-72: FsSourceConnectorConfig::new reads max_file_bytes and max_file_lines from the single config.ingest.code namespace:

Ok(Self {
    default_root: root,
    default_exclude: config.workspace.exclude.clone(),
    copy_threshold_bytes,
    skip_generated_header: config.ingest.code.skip_generated_header,
    max_file_bytes: config.ingest.code.max_file_bytes,    // <-- code-specific
    max_file_lines: config.ingest.code.max_file_lines,    // <-- code-specific
})

crates/kebab-source-fs/src/connector.rs:169-190: the walker's size check calls is_oversized(...) regardless of the path's media type:

if crate::code_meta::is_oversized(
    &abs_path,
    self.max_file_bytes,    // generic limit, applied to every file
    self.max_file_lines,
).unwrap_or(false) {
    fs_skips.skipped_size_exceeded =
        fs_skips.skipped_size_exceeded.saturating_add(1);
    // ...
    continue;
}

crates/kebab-source-fs/src/code_meta.rs:114-129: is_oversized(...) itself is a generic helper (extension-agnostic):

pub(crate) fn is_oversized(path: &Path, max_bytes: u64, max_lines: u32) -> Result<bool> {
    let meta = std::fs::metadata(path)?;
    if meta.len() > max_bytes {
        return Ok(true);
    }
    // line cap (streaming)
    ...
}

crates/kebab-config/src/lib.rs:535-547: IngestCodeCfg::default() sets max_file_bytes = 262_144 (256 KB), which most PDFs and images exceed.

§3.2 Decision matrix

| Option | Description | Pros | Cons |
| --- | --- | --- | --- |
| A: code path only | Apply the walker's size check only to code files (extension recognized by code_lang_for_path) | Simple / preserves existing default behavior / resolves Bug #2 immediately | No size limit for PDF/image/markdown: even a 1 GB PDF passes the walker |
| B: per-type config | Add [ingest.pdf], [ingest.image], [ingest.markdown] sections with per-type limits | User-tunable | 3 new config fields + serde defaults + env overrides + tests: exceeds v0.20 hotfix scope |
| C: generic limit + docs note | Keep the same generic limit but document the intent | No code change | UX bug unresolved (the dogfood workaround config would be forced on production) → rejected |

§3.3 Chosen path — Option A

The walker's size cap is intended to be code-specific. PDF/image/markdown sizes are validated by their own parsers (lopdf's load_mem handles PDFs above 256 KB normally, and image OCR calls are capped by max_pixels). If per-type config becomes necessary in v0.21+, evolve toward Option B.

Add an is_code_file(path: &Path) -> bool helper:

  • A path is a code file when code_meta::code_lang_for_path(path).is_some(). Reusing the existing helper keeps the mapping consistent (Tier 1 + Tier 2 basename + extension list are exactly the same).
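
The gating rule of Option A can be sketched in isolation. This is a minimal illustration, not the crate's code: code_lang_for_path_sketch is a hypothetical stand-in with a three-entry extension table (the real Tier 1 / Tier 2 mapping is not reproduced here), and skip_for_size is an invented name for the combined condition.

```rust
use std::path::Path;

/// Hypothetical stand-in for `code_meta::code_lang_for_path`; the real
/// mapping is larger and is not reproduced here.
fn code_lang_for_path_sketch(path: &Path) -> Option<&'static str> {
    match path.extension()?.to_str()? {
        "rs" => Some("rust"),
        "py" => Some("python"),
        "go" => Some("go"),
        _ => None,
    }
}

/// Option A: the byte cap is consulted only for recognised code files.
/// Non-code files (PDF, image, markdown) are never skipped for size here.
fn skip_for_size(path: &Path, len_bytes: u64, max_file_bytes: u64) -> bool {
    code_lang_for_path_sketch(path).is_some() && len_bytes > max_file_bytes
}
```

With the default cap of 262_144 bytes, a 300_000-byte big.rs is skipped while a paper.pdf or notes.md of the same size passes, which is the behavior the test in §3.5 pins down.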

§3.4 Implementation (Rust diff)

crates/kebab-source-fs/src/code_meta.rs: add a pub(crate) helper:

/// Returns true when `path`'s filename/extension is recognised as a code file
/// (per `code_lang_for_path`). Used by the walker to apply
/// `[ingest.code].max_file_bytes` / `max_file_lines` only to code files,
/// not to PDF/image/markdown (which have their own size controls in their
/// respective parsers).
pub(crate) fn is_code_file(path: &Path) -> bool {
    code_lang_for_path(path).is_some()
}

crates/kebab-source-fs/src/connector.rs:168-190: add the walker conditional:

// p10-1A-1: apply per-file generated-header + size-cap checks on files
// that passed the override (gitignore/builtin/kebabignore) matching.
// v0.20.0 sub-item 1 bugfix: size-cap (max_file_bytes / max_file_lines)
// applies ONLY to code files. PDF/image/markdown bypass — their parsers
// have their own size controls.
if crate::code_meta::is_code_file(&abs_path)
    && crate::code_meta::is_oversized(
        &abs_path,
        self.max_file_bytes,
        self.max_file_lines,
    )
    .unwrap_or(false)
{
    fs_skips.skipped_size_exceeded =
        fs_skips.skipped_size_exceeded.saturating_add(1);
    push_sample(
        &mut fs_skips.skip_examples.size_exceeded,
        &abs_path,
        &root,
    );
    tracing::debug!(
        path = %rel_path.display(),
        max_bytes = self.max_file_bytes,
        max_lines = self.max_file_lines,
        "skip: code file exceeds size cap"
    );
    continue;
}

Conditional application of skip_generated_header is a separate matter: the generated-header sniff inspects only the ASCII content of the first 512 bytes, regardless of path extension. The first 512 bytes of a binary PDF/image are not expected to contain an ASCII string such as "do not edit", so false positives are not expected. No walker conditional is needed for is_generated_file; existing behavior is kept.
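
To make the argument concrete, here is a hedged sketch of what a 512-byte header sniff could look like. The function name looks_generated and the marker list are assumptions for illustration; the actual is_generated_file implementation is not shown in this spec.

```rust
/// Sketch of a generated-header sniff: inspect at most the first 512 bytes
/// and look for an ASCII marker, case-insensitively. The marker list is an
/// assumption made for this example.
fn looks_generated(content: &[u8]) -> bool {
    const MARKERS: [&str; 2] = ["do not edit", "@generated"];
    let head = &content[..content.len().min(512)];
    // Non-UTF-8 bytes become U+FFFD, which cannot match an ASCII marker.
    let lowered = String::from_utf8_lossy(head).to_ascii_lowercase();
    MARKERS.iter().any(|m| lowered.contains(m))
}
```

A typical PDF opens with a %PDF-x.y header followed by binary bytes and object definitions, so a sniff of this shape would not match it, while a source file with a "DO NOT EDIT" banner in its first lines would.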

§3.5 Test additions

Add to the existing test module in crates/kebab-source-fs/src/connector.rs:

#[test]
fn size_cap_skips_only_code_files() {
    let dir = tempfile::tempdir().unwrap();
    let root = dir.path();
    // 300 KB PDF (binary), 300 KB markdown (text), 300 KB Rust (code).
    let big_blob: Vec<u8> = vec![b'x'; 300_000];
    std::fs::write(root.join("paper.pdf"), &big_blob).unwrap();
    std::fs::write(root.join("notes.md"), &big_blob).unwrap();
    std::fs::write(root.join("big.rs"), &big_blob).unwrap();

    let conn = FsSourceConnector::new(
        &cfg_with_size_cap(root.to_str().unwrap(), 262_144, 5_000),
    )
    .unwrap();
    let (assets, skips) = conn.scan_with_skips(&SourceScope::default()).unwrap();

    let paths: Vec<_> = assets.iter().map(|a| a.workspace_path.0.clone()).collect();
    // PDF + Markdown pass through walker.
    assert!(paths.contains(&"paper.pdf".to_string()));
    assert!(paths.contains(&"notes.md".to_string()));
    // Code file gets skipped.
    assert!(!paths.contains(&"big.rs".to_string()));
    assert!(
        skips.skip_examples.size_exceeded.iter().any(|p| p.contains("big.rs")),
        "size_exceeded examples should contain only big.rs: {:?}",
        skips.skip_examples.size_exceeded
    );
    assert!(
        !skips.skip_examples.size_exceeded.iter().any(|p| p.contains("paper.pdf")),
        "PDF must NOT appear in size_exceeded examples: {:?}",
        skips.skip_examples.size_exceeded
    );
}

In addition, the existing test ingest_report_counts_oversized_files_by_bytes uses a fixture named huge.rs, so its invariant is preserved. ingest_report_size_cap_by_line_count uses longfile.rs, so the same holds.

§4 Bug #3 — chunk_id collision (Critical)

§4.1 Root cause investigation

§4.1.1 The chunker's collision-avoidance workaround

The module doc at crates/kebab-chunk/src/pdf_page_v1.rs:47-60 describes the collision avoidance:

Design §4.2's chunk_id = blake3(doc_id || chunker_version || sort(block_ids)
|| policy_hash) collides when one block (= one PDF page) is split into
multiple chunks: every chunk on that page has identical block_ids.

Workaround: feed a per-chunk variant format!("{base_policy_hash}#c{char_start}")
into the recipe's policy_hash slot.

The actual call at crates/kebab-chunk/src/pdf_page_v1.rs:170-172:

let per_chunk_hash = format!("{base_policy_hash}#c{char_start}");
let chunk_id =
    id_for_chunk(&doc.doc_id, &chunker_version, &block_ids, &per_chunk_hash);

Here char_start is the first tuple field returned by chunk_page(...), i.e. the post-overlap actual_start (NOT the original segment boundary start).

§4.1.2 How overlap computes actual_start

crates/kebab-chunk/src/pdf_page_v1.rs:266-281:

let actual_start = if let Some(prev) = chunks.last() {
    let prev_min = prev.0;   // actual_start of the previous chunk
    let mut a = start;
    let mut acc_o: usize = 0;
    while a > prev_min {
        let cl = chars[a - 1].len_utf8();
        if acc_o + cl > overlap_bytes {
            break;
        }
        acc_o += cl;
        a -= 1;
    }
    a
} else {
    start
};

while a > prev_min: the overlap walk back-walks only as far as the previous chunk's actual_start. When overlap_bytes is large enough and start - prev_min is small, actual_start = prev_min. Two chunks then share the same actual_start and hence the same #c{N}.
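
The collapse can be reproduced in isolation. The sketch below copies the back-walk loop quoted above into a free function; the name overlap_walk and its signature are assumptions for illustration, not the crate's API.

```rust
/// Back-walk from `start` toward `prev_min`, consuming at most
/// `overlap_bytes` of UTF-8, mirroring the loop quoted above.
fn overlap_walk(chars: &[char], start: usize, prev_min: usize, overlap_bytes: usize) -> usize {
    let mut a = start;
    let mut acc: usize = 0;
    while a > prev_min {
        let cl = chars[a - 1].len_utf8();
        if acc + cl > overlap_bytes {
            break; // overlap budget exhausted before reaching prev_min
        }
        acc += cl;
        a -= 1;
    }
    a
}
```

For 3-byte Korean chars and overlap_bytes = 240, a segment starting at char 30 with prev_min = 0 walks all the way back to 0 (90 bytes fit in the budget), whereas a segment starting at char 200 stops at 120 once 80 chars are consumed.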

§4.1.3 Hypothesis check: F2 (1580 OCR chars)

Assumption: F2's OCR text contains a sentence end (. + whitespace) or a paragraph break (\n\n) within the first ~80 chars.

  • Default chunking policy: target_tokens=500 → target_bytes=1500, overlap_tokens=80 → overlap_bytes = min(240, 750) = 240.
  • A Korean char is 3 bytes in UTF-8, so overlap_bytes=240 back-walks up to 80 chars.
  • Assumed bounds = [0, 30, ~n] (one sentence end within the first ~30 chars).
  • segment 1: start=0, chunk_end=30 → chunks.push((0, 30, ...)). #c0.
  • segment 2: start=30, byte_len(30, n) >> target_bytes → single-segment chunk.
    • actual_start walk: a=30 → walk back while a > 0, acc_o ≤ 240.
    • 30 chars * 3 bytes = 90 bytes ≤ 240, so the loop ends at a=0 (= prev_min).
    • actual_start = 0 = prev_min.
  • chunks.push((0, n, ...)). #c0 → collision with chunk 1.
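
The byte budgets in the first two bullets can be checked with a few lines of arithmetic. The conversion below (3 bytes per token, overlap capped at half the target) is an assumption inferred from the figures 1500 and min(240, 750); the actual policy code is not quoted in this spec, and both function names are invented.

```rust
/// Assumed conversion, inferred from the figures above: 3 bytes per token,
/// with the overlap capped at half of the target.
fn byte_budgets(target_tokens: usize, overlap_tokens: usize) -> (usize, usize) {
    let target_bytes = target_tokens * 3;
    let overlap_bytes = (overlap_tokens * 3).min(target_bytes / 2);
    (target_bytes, overlap_bytes)
}

/// How many chars of a fixed UTF-8 width fit into the overlap budget.
fn overlap_reach_chars(overlap_bytes: usize, bytes_per_char: usize) -> usize {
    overlap_bytes / bytes_per_char
}
```

Under that assumption, (500, 80) gives (1500, 240), and a 240-byte budget reaches back 80 three-byte chars, so any segment boundary within the first 80 chars of a Korean page is within reach of the walk.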

Within the same doc, the chunk_id input of the two chunks:

  • {kind:"chunk", doc_id:doc_id_F2, chunker_version:"pdf-page-v1", block_ids:[block_id_F2], policy_hash:"{base_hash}#c0"}
  • Identical canonical JSON → identical blake3 → identical chunk_id.

In put_chunks's INSERT INTO chunks, the first row succeeds and the second row hits a PRIMARY KEY violation.
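
A dependency-free sketch of why identical inputs are fatal: the struct below holds the four fields that feed the chunk_id recipe, and a HashSet stands in for both the canonical-JSON + blake3 step and the PRIMARY KEY constraint. All names here (ChunkIdInput, input_for, count_duplicates) are invented for illustration.

```rust
use std::collections::HashSet;

/// The four fields that feed the chunk_id recipe. Derived `Hash`/`Eq`
/// stand in for canonical JSON + blake3 (an assumption of this sketch).
#[derive(Hash, PartialEq, Eq, Debug)]
struct ChunkIdInput {
    doc_id: String,
    chunker_version: String,
    block_ids: Vec<String>,
    policy_hash: String,
}

fn input_for(doc_id: &str, block_id: &str, base_hash: &str, suffix_start: usize) -> ChunkIdInput {
    ChunkIdInput {
        doc_id: doc_id.to_string(),
        chunker_version: "pdf-page-v1".to_string(),
        block_ids: vec![block_id.to_string()],
        policy_hash: format!("{base_hash}#c{suffix_start}"),
    }
}

/// Number of inputs rejected as duplicates, analogous to the rows a
/// PRIMARY KEY constraint would refuse.
fn count_duplicates(inputs: &[ChunkIdInput]) -> usize {
    let mut seen: HashSet<&ChunkIdInput> = HashSet::new();
    let mut duplicates = 0;
    for input in inputs {
        if !seen.insert(input) {
            duplicates += 1;
        }
    }
    duplicates
}
```

Two chunks of the same doc and block that both carry the #c0 suffix produce one duplicate, while two chunks with the same suffix in different docs produce none, since doc_id is part of the input.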

§4.1.4 Why F1 (779 OCR chars) does not collide

F1's OCR text is also Korean, but its character distribution differs or it has no sentence boundary within the first ~80 chars. In that case bounds = [0, n], or the first boundary falls after 80 chars, so chunk 2's actual_start differs from prev_min, yielding a distinct #c{N} value and a distinct chunk_id.

This matches the dogfood observation that only F2 collides.

§4.1.5 Assessment of the dogfood report's hypotheses

The dogfood report guessed a cross-doc collision ("identical to the chunk_id of scanned_page1"). This spec's investigation finds an intra-doc collision (within F2). Evidence:

  • The chunk_id input includes doc_id, so chunk_ids from different docs differ automatically.
  • Two chunks in the same doc with the same block_id and the same #c{N} policy_hash have an identical chunk_id.
  • Hypothesis A (policy_hash default magic value): not confirmed; base_policy_hash is the blake3 of the policy's canonical JSON (deterministic).
  • Hypothesis B (id_for_block's char_end is part of the hash): possible, but unrelated to the chunk_id collision itself (a block_id change produces a chunk_id change; that is a different collision pattern).
  • Hypothesis C (the chunker's block_ids ordering): possible, but ordering is N/A because each chunk has a single block.
  • Hypothesis D (OCR text identical inline to another doc): N/A, since text is not part of the chunk_id input.

Confirmed root cause = a variant of hypothesis C: when a single page yields multiple chunks, the overlap's actual_start collapses to the previous chunk's actual_start, producing the same #c{N} suffix.

§4.2 Decision matrix

| Option | Description | Pros | Cons |
| --- | --- | --- | --- |
| A: segment boundary start | Change the per_chunk_hash suffix from the post-overlap actual_start to the segment boundary start | Minimal change / the segment boundary is monotonically increasing (seg_idx loop invariant of chunk_page) → always distinct / preserves the semantic intent of chunk_id | Requires changing the shape of chunk_page's return tuple |
| B: chunk ordinal | per_chunk_hash = "#c{ordinal}" (chunk index within the page: 0, 1, 2, ...) | Simplest / independent of segment boundaries | Weakens the "meaningful hash input" semantics of chunk_id |
| C: (char_start, char_end) pair | per_chunk_hash = "#c{char_start}-{char_end}" | Distinct whenever char_end differs, even if two chunks share char_start | char_end can also coincide under the overlap clamp (e.g. if the last chunk is split twice): weak invariant |
| D: sequence number + char_start | per_chunk_hash = "#c{ordinal}-{char_start}" | Invariant fully guaranteed | Redundant info; longer hash input |

§4.3 Chosen path — Option A

Rationale:

  • In chunk_page's main loop, seg_idx is strictly increasing, and so is the segment boundary bounds[seg_idx] (bounds is unique after dedup). Using the segment boundary start as the hash suffix therefore guarantees distinct hash inputs for the chunks on the same page.
  • Semantics of chunk_id: "which segment does this chunk start from". The pre-overlap segment boundary is the true semantic origin; overlap is an enrichment for the retrieval boundary.
  • Extend chunk_page's return tuple to the 4-tuple (segment_start, actual_start, chunk_end, slice) (or track segment_start separately inside the chunker loop): minimal diff.
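
The difference between the two suffix choices can be shown on the F2 scenario from §4.1.3. In this sketch each chunk is reduced to (segment_start, actual_start); the type alias and function names are invented for illustration.

```rust
use std::collections::HashSet;

/// One chunk as `(segment_start, actual_start)`; the end index and text
/// are omitted because they do not feed the suffix.
type ChunkStarts = (usize, usize);

/// Suffixes as produced before the fix (keyed on the post-overlap start).
fn old_suffixes(chunks: &[ChunkStarts]) -> Vec<String> {
    chunks.iter().map(|&(_, actual)| format!("#c{actual}")).collect()
}

/// Suffixes under Option A (keyed on the pre-overlap segment boundary).
fn new_suffixes(chunks: &[ChunkStarts]) -> Vec<String> {
    chunks.iter().map(|&(segment, _)| format!("#c{segment}")).collect()
}

fn all_unique(items: &[String]) -> bool {
    let set: HashSet<&String> = items.iter().collect();
    set.len() == items.len()
}
```

For the chunks [(0, 0), (30, 0)], the old scheme yields #c0 twice, while the new scheme yields #c0 and #c30. The first chunk's suffix is the same under both schemes, which is the basis of the determinism argument in §4.4.1.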

§4.4 Implementation

crates/kebab-chunk/src/pdf_page_v1.rs: extend the chunk_page return signature:

/// Split a single page's text into ordered chunks, each represented as
/// `(segment_start, actual_start, chunk_end, text_slice)`.
///
/// - `segment_start` = pre-overlap segment boundary. Strictly increasing
///   across the returned vec. Use this for chunk_id uniqueness suffixes.
/// - `actual_start` = post-overlap start char index. May collapse to a
///   previous chunk's `actual_start` under aggressive overlap policy.
///   Use this for `SourceSpan::Page::char_start`.
/// - `chunk_end` = chunk's end char index (exclusive).
fn chunk_page(
    text: &str,
    target_bytes: usize,
    overlap_bytes: usize,
) -> Vec<(usize, usize, usize, String)> {
    // ... (existing logic, but each push uses (segment_start, actual_start, chunk_end, slice))
    chunks.push((start, actual_start, chunk_end, slice));
    // ...
}

Update the caller's chunk method accordingly:

for (segment_start, char_start, char_end, slice) in
    chunk_page(&p.text, target_bytes, overlap_bytes)
{
    // ... existing u32 conversion + span construction ...
    let span = SourceSpan::Page {
        page: page_num,
        char_start: Some(u32::try_from(char_start).expect("...")),
        char_end: Some(u32::try_from(char_end).expect("...")),
    };
    let block_ids: Vec<BlockId> = vec![p.common.block_id.clone()];
    // segment_start (pre-overlap boundary) is strictly increasing across
    // chunks, even when overlap walk collapses actual_start to prev_min.
    let per_chunk_hash = format!("{base_policy_hash}#c{segment_start}");
    let chunk_id =
        id_for_chunk(&doc.doc_id, &chunker_version, &block_ids, &per_chunk_hash);
    // ... rest unchanged ...
}

Update the module doc at crates/kebab-chunk/src/pdf_page_v1.rs:47-60 at the same time, since the existing description's "#c{char_start}" is stale under the new fix:

//! Design §4.2's chunk_id = blake3(doc_id || chunker_version || sort(block_ids)
//! || policy_hash) collides when one block (= one PDF page) is split into
//! multiple chunks: every chunk on that page has identical block_ids.
//!
//! Workaround that doesn't change the §4.2 recipe: feed a per-chunk
//! variant `format!("{base_policy_hash}#c{segment_start}")` into the
//! recipe's `policy_hash` slot. `segment_start` is the pre-overlap segment
//! boundary, strictly increasing across the returned chunks even when the
//! overlap walk collapses `actual_start` to a previous chunk's `prev_min`.
//! Logged in tasks/HOTFIXES.md (2026-05-27 — Bug #3 second-iteration patch).

Also add a dated entry to tasks/HOTFIXES.md (documenting that this fix is the second-iteration patch of the chunk_id deviation, and that the first iteration's #c{char_start} workaround left a collision in the aggressive-overlap case):

## 2026-05-27 — v0.20.0 sub-item 1: chunk_id `#c{char_start}` workaround
collapses under aggressive overlap (Bug #3 second-iteration patch)

**Symptom**: Ingesting F2 (1580 OCR chars) fails with `DocumentStore::put_chunks (pdf):
UNIQUE constraint failed: chunks.chunk_id`. ...

**Root cause**: at `crates/kebab-chunk/src/pdf_page_v1.rs:170` ...
the post-overlap `actual_start` collapses to the previous chunk's actual_start ...

**Fix** (this spec, §4.4): add `segment_start` to the `chunk_page` return tuple
and change the `per_chunk_hash` suffix to `segment_start` ...

**chunker_version cascade**: bump `pdf-page-v1` → `pdf-page-v1.1`
(see §4.4.1). The chunk_id of multi-chunk PDF pages changes: explicit
invalidation via the design §9 cascade trigger.

**Amends**: spec `docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md`
§4.4. The parent design §4.2 chunk_id recipe itself is unchanged (only the
workaround layer's internal computation changes).

§4.4.1 Preserving chunk_id determinism

Existing single-chunk-per-page case (e.g. small pages, text.len() <= target_bytes):

  • chunk_page's early return: vec![(0, n, text.to_string())] becomes vec![(0, 0, n, text.to_string())] in the new shape. segment_start = 0 = actual_start.
  • The #c0 suffix is unchanged → chunk_id unchanged.

First chunk of the multi-chunk case:

  • segment_start = bounds[0] = 0, actual_start = start = 0 (no previous chunk).
  • The #c0 suffix is unchanged → chunk_id unchanged.

Second and later chunks of the multi-chunk case:

  • Before: actual_start (overlap-walked, may be == 0).
  • After: segment_start = bounds[seg_idx] > 0.
  • → chunk_id changes (intentional, to avoid the collision).

→ The chunk_id of multi-chunk pages in an existing v0.19 (pre-OCR) PDF KB changes. The v0.20 force-reingest path refreshes them automatically.

Decision (round 1c, closes §7.2 Open Q1): bump chunker_version pdf-page-v1 → pdf-page-v1.1 (adopting critic round 1 recommendation M-1).

Rationale:

  • For a normal multi-chunk PDF page (e.g. the 21 blocks / 34 chunks of metro-korea.pdf in dogfood report Scenario 1, a normal path that did not trigger Bug #3), the internal computation change silently maps chunk_id to a different value. If chunker_version stayed at pdf-page-v1, the store/embedding layer's cascade audit would not fire, so unless the user explicitly runs --force-reingest, the vector store's chunk_id ↔ chunk_text mapping could silently mismatch.
  • The original intent of the design §9 cascade rule is that a chunker algorithm change gets an explicit version bump, so that the store layer reports the invalidation automatically. The pdf-page-v1.1 bump applies that rule directly.
  • Bump cost = zero: v0.20.0 is itself a force-update release (cumulative bugfixes stacked on the single PR #189 release commit), and enabling the OCR feature of the parent spec (2026-05-27-pdf-scanned-ocr-spec.md) already recommends force-reingest. For single-chunk PDF pages, a different chunker_version simply triggers the same cascade recomputation within the new doc_id chain.
  • Benefit = an explicit user-facing audit trail. The cascade invalidation appears in the store layer report at the next ingest.

The other cascade version fields (parser_version / embedding_version / prompt_template_version / index_version) are unchanged: the patch is confined to the chunker layer.

Update the chunker_version() constant of PdfPageV1Chunker:

impl Chunker for PdfPageV1Chunker {
    fn chunker_version(&self) -> ChunkerVersion {
        ChunkerVersion("pdf-page-v1.1".to_string())  // was: "pdf-page-v1"
    }
    // ...
}

Also update PARSER_VERSION or const CHUNKER_VERSION in crates/kebab-chunk/src/pdf_page_v1.rs (matching the actual constant name in that crate).

§4.5 Test additions

Add to #[cfg(test)] mod tests in crates/kebab-chunk/src/pdf_page_v1.rs:

#[test]
fn multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids() {
    // Regression test for v0.20.0 sub-item 1 Bug #3: post-overlap actual_start
    // can collapse to prev_min, producing identical `#c{char_start}` suffixes
    // and identical chunk_ids → sqlite chunks.chunk_id PRIMARY KEY violation
    // at put_chunks INSERT time.
    //
    // Synthesises Korean OCR text shape: dense Korean characters (3 bytes
    // per char) with a single early sentence-end boundary at char ~10 +
    // long tail.

    // 10 Korean chars (= 30 UTF-8 bytes) + "." + " " + ~500 more Korean chars.
    let early_seg: String = std::iter::repeat('가').take(10).collect();
    let tail: String = std::iter::repeat('나').take(500).collect();
    let page_text = format!("{early_seg}. {tail}");

    let doc = make_pdf_doc(&[&page_text]);
    let policy = default_policy(500, 80);  // target=1500 byte, overlap=240 byte
    let chunks = PdfPageV1Chunker.chunk(&doc, &policy).unwrap();

    assert!(
        chunks.len() >= 2,
        "expected ≥2 chunks for {} byte page; got {}",
        page_text.len(),
        chunks.len()
    );

    // Hard invariant: all chunk_ids must be unique. Without the fix, the
    // second chunk would have actual_start = 0 (== first chunk's
    // actual_start) under the aggressive-overlap walk → identical `#c0`
    // hash suffix → identical chunk_id → PRIMARY KEY violation.
    let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
    ids.sort_unstable();
    let total = ids.len();
    ids.dedup();
    assert_eq!(
        ids.len(),
        total,
        "all chunk_ids must be unique even when overlap walks actual_start back to prev_min"
    );
}

(round 1c L-1: the second test from round 0, chunk_id_recipe_uses_segment_start_not_actual_start, was redundant with this test's uniqueness check, and its only real assertion was assert!(chunks.len() >= 2), which did not match the intent of the test name. Removed.)

Also add an integration-level regression test under crates/kebab-app/tests/:

// crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs (new)
//
// v0.20.0 sub-item 1 Bug #3 regression: ingesting multiple scanned PDFs
// (each a single page with a different OCR text length) passes without
// any chunk_id collision.
//
// A mock OCR engine (MockOcrEngine from kebab-parse-image or an inline
// impl) is configured to return a different text length per page (e.g.
// 30 chars, 200 chars, 800 chars), avoiding real Ollama calls.

#[test]
fn multi_scanned_pdf_ingest_no_chunk_id_collision() {
    // ... setup: 3 scanned PDF fixtures, mock OCR engine, isolated KB
    // ... assert: every ingest_report.items entry has kind != Error
    // ... assert: store.get_chunks_count() == sum of per-PDF chunk_counts
}

(round 1c NIT-1: unified the file name and function name as multi_scanned_pdf_ingest_no_chunk_id_collision. The round 0 file name pdf_multi_scan_no_chunk_id_collision.rs did not match the fn name.)

§4.5.1 Pre-condition — MockOcrEngine availability (round 1c M-3)

This integration test requires a mock impl of the OcrEngine trait. First step of the executor stage:

  1. Locate MockOcrEngine with grep -rn "impl OcrEngine" crates/kebab-parse-image/src/ crates/kebab-app/tests/.
  2. Current state (2026-05-27 verifier probe):
    • crates/kebab-parse-image/src/ocr.rs:235: production impl OcrEngine for OllamaVisionOcr.
    • crates/kebab-app/tests/pdf_ocr_apply.rs:25: impl OcrEngine for MockOcrEngine (test-only).
  3. The new integration test (multi_scanned_pdf_ingest_no_chunk_id_collision.rs) is a separate test binary in the same crate (kebab-app), so it cannot directly import the private MockOcrEngine of pdf_ocr_apply.rs. Executor options:
    • Option A (recommended): lift MockOcrEngine into crates/kebab-app/tests/common/mock_ocr.rs (parameterised, taking per-page text lengths as a ctor argument). Both tests (pdf_ocr_apply.rs and the new test) share it via mod common;.
    • Option B: define a duplicate inline impl OcrEngine for LocalMock { ... } in the new test (prioritising test isolation, avoiding the sharing cost).
  4. If it is absent (or sharing is hard and Option B is also impractical), conditionally downgrade the acceptance in §6 row 7: the unit-level invariant in kebab-chunk (§6 row 4) alone pins the core regression of Bug #3, and the integration test is skipped.

The decision path of the executor's dependency-check task is closed in §7.2 Open Q4.

§4.6 Acceptance (Bug #3 fix)

  • Zero chunk_id collisions when ingesting F1 (779 chars) + F2 (1580 chars) together.
  • Zero collisions on every --force-reingest.
  • Zero collisions in a KB of 5+ scanned PDFs (Korean OCR text ranging 100~3000 chars).
  • The existing 1000-determinism test in crates/kebab-chunk (deterministic_chunk_ids_1000) still passes.
  • Zero workspace test regressions.
  • New test multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids and integration test multi_scanned_pdf_ingest_no_chunk_id_collision added.

§5 Bug #4 — F4 fixture Pages tree

§5.1 Root cause

§5.1.1 Symptom

{
  "doc_path": "mojibake.pdf",
  "kind": "new",
  "byte_len": 22568,
  "pdf_ocr_pages": 0,
  "pdf_ocr_ms_total": 0,
  "block_count": 0,
  "chunk_count": 0,
  "warnings": []
}

This violates PdfTextExtractor's invariant (§1.4 "1 Block::Paragraph per page").

§5.1.2 How lopdf get_pages() reacts

Dogfood probe:

  • lopdf::Document::load_mem(F4_bytes) → OK.
  • pdf_doc.get_pages() → empty BTreeMap.
  • In the PDF byte stream, the /Type /Page count = 1 and the /Count value = 1.

→ Structurally 1 page exists, but lopdf's Pages tree traversal (/Pages → /Kids chain) is broken.
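
The raw-byte part of this probe can be approximated with a short scan. This is a hedged sketch: count_page_markers is an invented helper, it assumes the single-space spelling /Type /Page, and it only sees uncompressed objects (pages stored inside compressed object streams would be missed).

```rust
/// Count `/Type /Page` markers that are not `/Type /Pages`, by raw byte
/// scan. Assumes the single-space spelling; real PDFs may vary whitespace.
fn count_page_markers(data: &[u8]) -> usize {
    let needle: &[u8] = b"/Type /Page";
    data.windows(needle.len())
        .enumerate()
        .filter(|(i, w)| {
            // Reject the `/Pages` tree node by checking the following byte.
            *w == needle && data.get(*i + needle.len()) != Some(&b's')
        })
        .count()
}
```

A result of 1 from such a scan, combined with an empty get_pages() map, points at the tree linkage rather than at a missing page object.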

§5.1.3 Analysis of the fixture generation path

tests/fixtures/_synth/mojibake.py:

c = canvas.Canvas(str(dst), pagesize=A4)
c.setFont(FONT_NAME, 12)
y = A4[1] - 30*mm
for line in ["Mojibake fixture (no ToUnicode CMap)", "..."]:
    c.drawString(30*mm, y, line)
    y -= 16
c.save()

data = dst.read_bytes()
# pattern: "/ToUnicode <objref>" — strip indirect object reference
new_data = re.sub(rb"/ToUnicode\s+\d+\s+\d+\s+R\b", b"", data)
dst.write_bytes(new_data)

Step 2 analysis: re.sub removes the /ToUnicode N M R byte sequence, but:

  • Byte offsets in the PDF shift by the length of the removed bytes.
  • The cross-reference table (xref) offset entries for objects located after the edit point become stale.
  • The offset in the startxref value becomes stale as well.

Step 3, the startxref fix (commit c2cd3a7 in tasks/HOTFIXES.md):

  • A manual byte edit updates startxref from 22130 to 22114.
  • However, the individual offsets in the xref table itself are also stale: the actual byte positions of the indirect objects referenced by the Pages tree's /Kids array do not match the xref entries.
  • lopdf's strict load validates startxref + the xref table first; the load succeeds, but indirect object resolution fails during Pages tree traversal → empty Pages map.
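
The offset shift is easy to demonstrate on a toy buffer. The sketch below is not a PDF parser; find and strip_first are invented helpers, and the object number in the sample reference is made up.

```rust
/// Find the first occurrence of a non-empty `needle` in `haystack`.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Remove the first occurrence of `needle`, returning the edited bytes.
/// Every byte after the removed span moves forward by `needle.len()`.
fn strip_first(data: &[u8], needle: &[u8]) -> Vec<u8> {
    match find(data, needle) {
        Some(pos) => {
            let mut out = data[..pos].to_vec();
            out.extend_from_slice(&data[pos + needle.len()..]);
            out
        }
        None => data.to_vec(),
    }
}
```

Stripping a 16-byte reference such as /ToUnicode 9 0 R moves every later object 16 bytes earlier, so any xref entry that still records the old position points into the wrong bytes. The 22130 → 22114 startxref adjustment is also a 16-byte shift, which is consistent with a single reference of that length having been removed.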

§5.2 Fixture re-generation strategy

| Option | Description | Pros | Cons |
| --- | --- | --- | --- |
| A: pikepdf | Synthesize with reportlab, then open with pikepdf + remove ToUnicode + save (xref auto-regen) | Proper xref regeneration / Pages tree intact / library available (pip install pikepdf) | New Python dependency (pikepdf) |
| B: qpdf normalize | After the byte edit, run qpdf --linearize input.pdf output.pdf | External tool (the sub-item 1 acceptance criteria already contain a qpdf hint) | qpdf's normalize may reject the broken xref (or may inline the ToUnicode reference again) |
| C: reportlab disable ToUnicode | Disable ToUnicode CMap generation for the Type 0 font during reportlab synthesis | Avoids the byte edit: clean | The reportlab API does not directly expose disabling ToUnicode (requires a font subclass or monkeypatch) |

§5.3 Chosen path — Option A (pikepdf)

Rationale:

  • pikepdf is a proper PDF surgery library: the Python bindings of qpdf.
  • It regenerates the xref table automatically and preserves the integrity of the Pages tree.
  • The dependency is added with pip install pikepdf; the Python venv for fixture generation already uses reportlab, so the extra install is trivial.

§5.3.1 The pikepdf approach to stripping ToUnicode

In a reportlab Type 0 font, the ToUnicode CMap reference appears as /ToUnicode <ref> inside the font dictionary. Use pikepdf to remove only the /ToUnicode entry from the font dictionary:

import pikepdf
with pikepdf.open(str(dst), allow_overwriting_input=True) as pdf:
    # Walk all indirect objects, delete /ToUnicode entry whenever found.
    # Per the PDF spec, /ToUnicode appears only as a child of a Font
    # dictionary, so the false-positive risk is practically zero. The
    # explicit font type check is omitted (same form as the actual
    # implementation in §5.4).
    for obj in pdf.objects:
        if isinstance(obj, pikepdf.Dictionary):
            if "/ToUnicode" in obj:
                del obj["/ToUnicode"]
    pdf.save(str(dst))

pikepdf's save automatically preserves the integrity of the xref + Pages tree.

§5.4 Implementation (mojibake.py revision)

Complete rewrite of tests/fixtures/_synth/mojibake.py:

"""Synthesize mojibake fixture -- Type 0 font PDF without ToUnicode CMap.

Strategy:
1. Synthesize a Korean PDF with reportlab using a Type 0 (CID) font (including the normal ToUnicode CMap).
2. Open with pikepdf, remove the /ToUnicode entry from the font dictionary, and save (xref auto-regenerated).

Dependency: reportlab + pikepdf. Install via `pip install reportlab pikepdf`.

Usage:
  python3 tests/fixtures/_synth/mojibake.py \
      crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf
"""
import sys
from pathlib import Path
from reportlab.lib.pagesizes import A4
from reportlab.lib.units import mm
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfgen import canvas
import pikepdf

DEJAVU_TTF = "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"
FONT_NAME = "DejaVuSans"
pdfmetrics.registerFont(TTFont(FONT_NAME, DEJAVU_TTF))

dst = Path(sys.argv[1])

# Step 1: synthesize a normal PDF.
c = canvas.Canvas(str(dst), pagesize=A4)
c.setFont(FONT_NAME, 12)
y = A4[1] - 30 * mm
for line in [
    "Mojibake fixture (no ToUnicode CMap)",
    "Text extraction yields garbage \x00\x01\x02",
]:
    c.drawString(30 * mm, y, line)
    y -= 16
c.save()

# Step 2: strip the /ToUnicode reference with pikepdf + regenerate xref.
removed = 0
with pikepdf.open(str(dst), allow_overwriting_input=True) as pdf:
    for obj in pdf.objects:
        if isinstance(obj, pikepdf.Dictionary):
            if "/ToUnicode" in obj:
                del obj["/ToUnicode"]
                removed += 1
    pdf.save(str(dst))

if removed == 0:
    print("ERROR: no /ToUnicode entry found in any dictionary", file=sys.stderr)
    sys.exit(2)

# Step 3: invariant check (load + page count).
with pikepdf.open(str(dst)) as pdf:
    n_pages = len(pdf.pages)
    if n_pages != 1:
        print(f"ERROR: expected 1 page, got {n_pages}", file=sys.stderr)
        sys.exit(3)
    # Check the no-ToUnicode invariant.
    raw = Path(dst).read_bytes()
    if b"/ToUnicode" in raw:
        print("ERROR: /ToUnicode still present after strip", file=sys.stderr)
        sys.exit(4)

print(f"wrote {dst} ({dst.stat().st_size} bytes, ToUnicode stripped via pikepdf, 1 page)")

§5.5 Test additions

Add to crates/kebab-parse-pdf/tests/text_extractor.rs (or the relevant existing test file):

/// Pages tree invariant of F4 mojibake.pdf: ensures the Step 2 fixture
/// re-generation (pikepdf-based) makes lopdf's get_pages() return normally.
///
/// Bug #4 regression: the previous fixture (byte-edit + manual startxref)
/// passed lopdf's strict load, but indirect object resolution broke during
/// Pages tree traversal → empty pages map → block_count=0.
#[test]
fn mojibake_fixture_load_yields_one_page() {
    let bytes = include_bytes!("../tests/fixtures/mojibake.pdf");
    let doc = lopdf::Document::load_mem(bytes).expect("F4 fixture must lopdf-load");
    let pages = doc.get_pages();
    assert_eq!(
        pages.len(),
        1,
        "F4 fixture must have exactly 1 page (Pages tree integrity)"
    );
}

#[test]
fn mojibake_fixture_has_no_tounicode_cmap() {
    // The no-ToUnicode invariant from Step 2.
    let bytes = std::fs::read("tests/fixtures/mojibake.pdf").unwrap();
    let count = bytes.windows(b"/ToUnicode".len())
        .filter(|w| *w == b"/ToUnicode")
        .count();
    assert_eq!(count, 0, "F4 fixture must not contain /ToUnicode marker");
}

#[test]
fn pdf_text_extractor_on_mojibake_yields_one_block() {
    // PdfTextExtractor invariant: 1 Block::Paragraph per page.
    // The F4 fixture has no ToUnicode, so text extraction yields garbage or
    // empty text → 1 empty Block::Paragraph + "scanned candidate" warning.
    let bytes = include_bytes!("../tests/fixtures/mojibake.pdf");
    // ... ExtractContext setup + extractor.extract(&ctx, bytes) ...
    let canonical = extractor.extract(&ctx, bytes).unwrap();
    assert_eq!(canonical.blocks.len(), 1, "expected 1 Block::Paragraph per page");
    // The text is garbage or empty; the invariant is that the block itself exists.
    let warning_present = canonical.provenance.events.iter().any(|e| {
        matches!(e.kind, ProvenanceKind::Warning)
            && e.note.as_ref().is_some_and(|n| n.contains("scanned candidate"))
    });
    assert!(warning_present || !canonical.blocks[0].text_is_empty(),
            "an empty text-detect-first fallback must carry the scanned-candidate warning");
}
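
The byte scan used in mojibake_fixture_has_no_tounicode_cmap can be factored into a small helper. The following is a minimal sketch using only the standard library; the name count_marker and the sample byte strings are illustrative assumptions, not part of the kebab codebase.

```rust
/// Hypothetical helper (not from the kebab codebase): counts how many times
/// `needle` occurs in `haystack`, including overlapping occurrences.
fn count_marker(haystack: &[u8], needle: &[u8]) -> usize {
    // slice::windows panics on a window size of 0, so guard the empty needle.
    if needle.is_empty() {
        return 0;
    }
    haystack
        .windows(needle.len())
        .filter(|w| *w == needle)
        .count()
}

fn main() {
    // Illustrative dictionary fragments, not bytes taken from the real fixture.
    let with_cmap = b"<< /Type /Font /ToUnicode 7 0 R >>";
    let stripped = b"<< /Type /Font /Subtype /Type0 >>";

    assert_eq!(count_marker(with_cmap, b"/ToUnicode"), 1);
    assert_eq!(count_marker(stripped, b"/ToUnicode"), 0);
    println!("marker checks passed");
}
```

One caveat, stated as a general property of PDF rather than an observed issue with this fixture: a raw byte scan only sees names that are stored uncompressed. If a file keeps its dictionaries inside compressed object streams, a /ToUnicode key could escape the scan, so the check is most trustworthy when the fixture is saved without object-stream compression.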

§5.6 Acceptance (Bug #4 fix)

  • After F4 fixture re-generation, lopdf::Document::load_mem(...).get_pages().len() = 1.
  • The F4 fixture keeps the ToUnicode CMap-absence invariant (grep -c "/ToUnicode" mojibake.pdf = 0).
  • When PdfTextExtractor ingests F4, block_count = 1, with either the warning "page1 empty (scanned candidate)" or garbage text.
  • On dogfood retest, mojibake.pdf reports block_count: 1, chunk_count: 0~1 (depending on text content).
  • Update the F4 baseline in the existing text_extractor_regression.rs: the old baseline is itself a snapshot of the broken invariant, so it must be refreshed.
  • Zero regressions in the workspace test suite.

§6 Acceptance criteria (consolidated)

| # | Verifier | Bug | Command |
|---|----------|-----|---------|
| 1 | walker bypasses size cap for PDF | #2 | cargo test -p kebab-source-fs size_cap_skips_only_code_files |
| 2 | walker still skips oversized code files | #2 | cargo test -p kebab-source-fs ingest_report_counts_oversized_files_by_bytes |
| 3 | 256KB+ PDF/markdown ingests under default config | #2 | dogfood retest: kebab ingest → skipped_size_exceeded = 0 for non-code |
| 4 | chunker collision regression test | #3 | cargo test -p kebab-chunk multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids |
| 5 | chunker determinism preserved | #3 | cargo test -p kebab-chunk deterministic_chunk_ids_1000 |
| 6 | chunker overlap clamp preserved | #3 | cargo test -p kebab-chunk overlap_clamped_when_overlap_exceeds_target |
| 7 | integration: multi-scanned PDF ingest (conditional: only if the MockOcrEngine from §4.5.1 can be shared) | #3 | cargo test -p kebab-app multi_scanned_pdf_ingest_no_chunk_id_collision |
| 8 | dogfood: F1 + F2 force-reingest | #3 | dogfood retest: kebab ingest --force-reingest yields errors = 0 (excluding encrypted) |
| 9 | F4 fixture lopdf 1-page invariant | #4 | cargo test -p kebab-parse-pdf mojibake_fixture_load_yields_one_page |
| 10 | F4 fixture ToUnicode-absence invariant | #4 | cargo test -p kebab-parse-pdf mojibake_fixture_has_no_tounicode_cmap |
| 11 | F4 PdfTextExtractor 1-block invariant | #4 | cargo test -p kebab-parse-pdf pdf_text_extractor_on_mojibake_yields_one_block |
| 12 | dogfood: F4 ingest yields block_count=1 | #4 | dogfood retest: the mojibake.pdf ingest item reports block_count: 1 |
| 13 | workspace clippy clean | all | cargo clippy --workspace --all-targets -- -D warnings |
| 14 | workspace full test pass | all | cargo test --workspace --no-fail-fast -j 1 |
| 15 | dogfood end-to-end 9 PDF | all | dogfood retest: all 9 PDFs ingested, errors = 2 (encrypted only) |
| 16 | chunker_version cascade: final value | #3 | grep -nE 'pdf-page-v[0-9.]+' crates/kebab-chunk/src/pdf_page_v1.rs returns "pdf-page-v1.1" (round 1c M-1 decision) |

§7 Risks + open questions

§7.1 Risks

  • Bug #3 fix changes chunk_id: for multi-chunk PDF pages (pages exceeding 1500 bytes at the pre-OCR point), chunk_id changes, so users must run --force-reingest once. This is acceptable because v0.20.0 is a force-update path (users perform a fresh ingest anyway). State this in the README or release notes.
  • Bug #2 fix side-effect: PDFs of 1 GB or more now pass the walker, so lopdf's load_mem risks a memory blow-up. Out of scope for v0.20 (a streaming parser is to be evaluated from Phase 9 onward; see the future scope in design §9.2). Acceptable for this fix.
  • Bug #4 fix changes the fixture binary: the SHA256 of the F4 mojibake.pdf changes, which adds noise to git LFS / binary diffs. The baseline in text_extractor_regression.rs must also be updated to the new fixture's output; handle both in a single commit.
  • pikepdf install requirement: fixture re-generation requires pip install pikepdf. This would add a Python dependency to CI if fixture regeneration were part of CI. Because this spec's fix commits the fixture itself, generation is a one-off and no CI dependency arises.

§7.2 Open questions

  1. Cost-benefit of the chunker_version bump: CLOSED (round 1c M-1). Decision: bump pdf-page-v1 → pdf-page-v1.1. The cascade audit trail is explicit, and the cost is zero because v0.20 is a force-update path. Details: the "Decision (round 1c, closes §7.2 Open Q1)" paragraph in §4.4.1.
  2. Including Bug #2 Option B (per-type config) in v0.20 scope: this spec defers it to v0.21+. Critic round 1 ACCEPT; no recommendation to include it in v0.20.
  3. F4 fixture invariant: critic round 1 ACCEPT. The combination of ToUnicode absence and a valid Pages tree is exactly reproducible through proper PDF surgery with pikepdf. The Step 2 design (mojibake.py rewrite) is sound.
  4. Mock OCR for the integration test: CLOSED (round 1c M-3). impl OcrEngine for MockOcrEngine already exists at crates/kebab-app/tests/pdf_ocr_apply.rs:25. The only remaining decision for the executor is whether to share it (Option A: lift into tests/common/mock_ocr.rs) or duplicate it inline (Option B). If sharing is not possible, §6 row 7 is downgraded as a conditional item; details: the "Pre-condition — MockOcrEngine availability" paragraph in §4.5.1.
  5. chunk_page tuple shape change: does the Option A 4-tuple (segment_start, actual_start, chunk_end, slice) affect external callers? chunk_page is module-private (fn chunk_page), so it has no external callers and the change is safe. Critic round 1 ACCEPT.

§8 References

  • dogfood report: .omc/reviews/2026-05-27-v0.20-dogfood-report.md
  • parent spec (frozen): docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
  • parent plan (round 1c ACCEPT): docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md
  • source code (root cause evidence):
    • crates/kebab-source-fs/src/connector.rs (Bug #2)
    • crates/kebab-source-fs/src/code_meta.rs (is_oversized + code_lang_for_path)
    • crates/kebab-config/src/lib.rs (IngestCodeCfg)
    • crates/kebab-core/src/ids.rs (id_for_chunk / id_for_block recipes)
    • crates/kebab-chunk/src/pdf_page_v1.rs (PdfPageV1Chunker + chunk_page)
    • crates/kebab-app/src/pdf_ocr_apply.rs (post-extract OCR enrichment)
    • crates/kebab-app/src/lib.rs:1769-1968 (ingest_one_pdf_asset wiring)
    • crates/kebab-store-sqlite/src/documents.rs:103-155 (put_chunks DELETE+INSERT)
    • migrations/V001__init.sql:80-94 (chunks table DDL — chunk_id PRIMARY KEY)
    • tests/fixtures/_synth/mojibake.py (Bug #4 fixture source)
  • design §3.4 (SourceSpan::Page), §3.5 (Chunk + chunk_id recipe), §4.2 (id_from canonical JSON), §5.2 (walker builtin blacklist), §9 (versioning cascade).