docs(superpowers): v0.20 sub-item 1 bugfix1/2/3 specs + plans + DOGFOOD.md

Deliverables of a 3-round, dogfood-driven fix cycle:

- bugfix1 (Bug #2/#3/#4): spec 964 lines + plan 848 lines
- bugfix2 (Bug #6/#7; #8 falsified): spec 308 lines + plan 388 lines
- bugfix3 (Bug #9/#10/#11/#13/#14; #12 falsified): spec 410 lines + plan 1043 lines
- docs/DOGFOOD.md: the complete all-around dogfood checklist (§0 environment through §13 reference corpus)

Each round's spec/plan was frozen after the critic + verifier round 2 closure ACCEPT, and is based on dogfood-driven evidence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 01:21:34 +00:00
parent 9b44e27dfe
commit 46e99470eb
7 changed files with 4794 additions and 0 deletions
--- a/docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md
+++ b/docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md
@@ -0,0 +1,965 @@
+---
+title: v0.20.0 sub-item 1 bugfix — chunk_id collision + walker code limit + F4 fixture
+created: 2026-05-27
+status: ACCEPT (round 2 closure — Phase A complete)
+target_version: 0.20.0 (PR #189 force-update)
+parent_spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
+dogfood_evidence: .omc/reviews/2026-05-27-v0.20-dogfood-report.md
+review_history:
+  - "2026-05-27 spec round 1 critic (opus, thorough) — ACCEPT, HIGH 0 + MEDIUM 3 + LOW 2 + NIT 2"
+  - "2026-05-27 spec round 1c rewrite (opus, drafter) — MEDIUM/LOW/NIT all applied"
+  - "2026-05-27 spec round 2 closure critic (opus) — ACCEPT, 7/7 applied + 1 NIT (frontmatter status, applied here)"
+---
+
+# v0.20.0 sub-item 1 bugfix — chunk_id collision + walker code limit + F4 fixture
+
+This spec documents the root-cause analysis, fix design, and acceptance criteria
+for the 3 bugs found while dogfooding PR #189 of v0.20.0 sub-item 1 (scanned PDF OCR).
+It is the source for the follow-up plan and executor stages.
+
+## §1 Background + dogfood evidence chain
+
+### §1.1 Dogfood environment
+
+| Item | Value |
+|------|----|
+| Binary | `kebab v0.20.0` (commit `b4d9e60`) |
+| Ollama endpoint | `http://192.168.0.47:11434` (qwen2.5vl:3b) |
+| Isolated KB | `/build/cache/tmp/v0.20-dogfood/` |
+| Corpus | 9 PDF (PoC + sub-item fixture + 3 user PDF, 466 KB ~ 58 MB) |
+
+### §1.2 Reproducibility of the 3 bugs
+
+| Bug | Severity | Trigger | Reproducible |
+|-----|----------|---------|--------------|
+| #2 walker code limit | Important | Ingesting a 256 KB+ PDF/image/markdown file | Always (default config) |
+| #3 chunk_id collision | **Critical** | Ingesting scanned_page2.pdf (1580 OCR chars) | On every force-reingest |
+| #4 F4 Pages tree | Important | Ingesting mojibake.pdf (F4 fixture) | Always |
+
+### §1.3 Dogfood report excerpts
+
+Key excerpts from the dogfood report (`.omc/reviews/2026-05-27-v0.20-dogfood-report.md`):
+
+- Bug #2: `scanned=3, skipped_size_exceeded=6`. Only 3 of the 9 workspace PDFs passed;
+  the other 6 (F1 466KB / F2 756KB / metro 57MB / thermal-pos 1.1MB / thermal-label
+  2.7MB / internals 820KB) were skipped at the walker stage.
+- Bug #3: `"DocumentStore::put_chunks (pdf): sqlite error: UNIQUE constraint
+  failed: chunks.chunk_id: ... Error code 1555: A PRIMARY KEY constraint
+  failed"`, raised at the chunk INSERT step for scanned_page2.pdf.
+- Bug #4: `block_count: 0, chunk_count: 0`. The ingest result for F4 mojibake.pdf
+  violates PdfTextExtractor's "1 paragraph per page" invariant.
+
+## §2 Goals + non-goals
+
+### §2.1 Goals
+
+- Fix all 3 bugs within v0.20.0 (PR #189 force-update path: new commits stack on
+  the same branch).
+- Preserve the invariants of the parent spec
+  (`docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md`):
+  - §1.4 PdfTextExtractor's "1 Block::Paragraph per page".
+  - §3.5 block_id preservation in post-extract OCR enrichment (in-place mutate path).
+  - §4.6 wire schema changes are additive only (no V00X migration needed).
+- Zero conflicts with the design decisions of the parent plan
+  (`docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md`), round 1c ACCEPT.
+- Zero workspace test regressions (`cargo test --workspace -j 1`).
+
+### §2.2 Non-goals
+
+- No new wire schema major bump (v1 → v2): these fixes add zero schema changes.
+- No new V00X sqlite migration: the `chunks` table DDL is unchanged, and the fix is
+  confined to the chunk_id computation path.
+- No change to the F4 fixture invariants (the "no ToUnicode CMap" and "valid 1-page
+  PDF" requirements stay).
+- No new config knobs (a per-media-type limit such as `[ingest.pdf].max_file_bytes`
+  is v0.21+ scope; this fix only separates the walker's code path).
+
+## §3 Bug #2 — walker code limit
+
+### §3.1 Root cause (file:line evidence)
+
+`crates/kebab-source-fs/src/connector.rs:42-72`: `FsSourceConnector` reads
+`max_file_bytes` and `max_file_lines` in `Config::new` from the single
+`config.ingest.code` namespace:
+
+```rust
+Ok(Self {
+    default_root: root,
+    default_exclude: config.workspace.exclude.clone(),
+    copy_threshold_bytes,
+    skip_generated_header: config.ingest.code.skip_generated_header,
+    max_file_bytes: config.ingest.code.max_file_bytes,    // <-- code-specific
+    max_file_lines: config.ingest.code.max_file_lines,    // <-- code-specific
+})
+```
+
+`crates/kebab-source-fs/src/connector.rs:169-190`: the walker's size check calls
+`is_oversized(...)` regardless of the path's media type:
+
+```rust
+if crate::code_meta::is_oversized(
+    &abs_path,
+    self.max_file_bytes,    // generic limit, applied to every file
+    self.max_file_lines,
+).unwrap_or(false) {
+    fs_skips.skipped_size_exceeded =
+        fs_skips.skipped_size_exceeded.saturating_add(1);
+    // ...
+    continue;
+}
+```
+
+`crates/kebab-source-fs/src/code_meta.rs:114-129`: `is_oversized(...)` itself is a
+generic helper (extension-agnostic):
+
+```rust
+pub(crate) fn is_oversized(path: &Path, max_bytes: u64, max_lines: u32) -> Result<bool> {
+    let meta = std::fs::metadata(path)?;
+    if meta.len() > max_bytes {
+        return Ok(true);
+    }
+    // line cap (streaming)
+    ...
+}
+```
+
+`crates/kebab-config/src/lib.rs:535-547`: `IngestCodeCfg::default()` sets
+`max_file_bytes = 262_144` (256 KB), which most PDFs and images exceed.
+
+### §3.2 Decision matrix
+
+| Option | Description | Pros | Cons |
+|--------|-------------|------|------|
+| **A: code path only** | Apply the walker's size check only to code files (extension recognized by `code_lang_for_path`) | Simple / preserves existing default behavior / resolves Bug #2 immediately | No size limit for PDF/image/markdown: even a 1 GB PDF passes the walker |
+| B: per-type config | Add new `[ingest.pdf]`, `[ingest.image]`, `[ingest.markdown]` sections with per-type limits | User-tunable | 3 new config fields + serde defaults + env overrides + tests, which exceeds v0.20 hotfix scope |
+| C: generic limit + docs note | Keep the same generic limit but document the intent | Zero code changes | Leaves the UX bug unresolved (the dogfood workaround config would be forced on production) → **rejected** |
+
+### §3.3 Chosen path — Option A
+
+The walker's size cap is intended to be code-specific. PDF/image/markdown sizes are
+validated by the parsers themselves (lopdf's load_mem handles PDFs above 256 KB
+without issue, and the image OCR call caps itself via max_pixels). If per-type
+config is needed in v0.21+, this can evolve into Option B.
+
+Add an `is_code_file(path: &Path) -> bool` helper:
+- A file is a code file when `code_meta::code_lang_for_path(path).is_some()`.
+  Reusing the existing helper keeps the mapping consistent (exactly the same
+  Tier 1 + Tier 2 basename + extension list).
+
+### §3.4 Implementation (Rust diff)
+
+`crates/kebab-source-fs/src/code_meta.rs`: add a `pub(crate)` helper:
+
+```rust
+/// Returns true when `path`'s filename/extension is recognised as a code file
+/// (per `code_lang_for_path`). Used by the walker to apply
+/// `[ingest.code].max_file_bytes` / `max_file_lines` only to code files,
+/// not to PDF/image/markdown (which have their own size controls in their
+/// respective parsers).
+pub(crate) fn is_code_file(path: &Path) -> bool {
+    code_lang_for_path(path).is_some()
+}
+```
+
+`crates/kebab-source-fs/src/connector.rs:168-190`: add the walker conditional:
+
+```rust
+// p10-1A-1: apply per-file generated-header + size-cap checks on files
+// that passed the override (gitignore/builtin/kebabignore) matching.
+// v0.20.0 sub-item 1 bugfix: size-cap (max_file_bytes / max_file_lines)
+// applies ONLY to code files. PDF/image/markdown bypass — their parsers
+// have their own size controls.
+if crate::code_meta::is_code_file(&abs_path)
+    && crate::code_meta::is_oversized(
+        &abs_path,
+        self.max_file_bytes,
+        self.max_file_lines,
+    )
+    .unwrap_or(false)
+{
+    fs_skips.skipped_size_exceeded =
+        fs_skips.skipped_size_exceeded.saturating_add(1);
+    push_sample(
+        &mut fs_skips.skip_examples.size_exceeded,
+        &abs_path,
+        &root,
+    );
+    tracing::debug!(
+        path = %rel_path.display(),
+        max_bytes = self.max_file_bytes,
+        max_lines = self.max_file_lines,
+        "skip: code file exceeds size cap"
+    );
+    continue;
+}
+```
+
+Whether to gate `skip_generated_header` is a separate question: the generated-header
+sniff ignores the path extension and only inspects the ASCII content of the first
+512 bytes. Since the first 512 bytes of a PDF/image binary never contain an ASCII
+string such as "do not edit", there are zero false positives. **No walker
+conditional is needed for `is_generated_file`**; the existing behavior stays.
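A minimal sketch of the gating described in §3.3 and §3.4, reduced to the byte-size decision only. The `code_lang_for_path` below is a hypothetical three-extension stand-in (the real helper also covers basenames and a longer extension list), and `skip_for_size` is an illustrative name rather than a function in the crate.

```rust
use std::path::Path;

/// Hypothetical stand-in for the crate's `code_lang_for_path`: maps a few
/// extensions to a language name. The real mapping is larger.
fn code_lang_for_path(path: &Path) -> Option<&'static str> {
    match path.extension()?.to_str()? {
        "rs" => Some("rust"),
        "py" => Some("python"),
        "go" => Some("go"),
        _ => None,
    }
}

/// Same shape as the helper proposed in §3.4.
fn is_code_file(path: &Path) -> bool {
    code_lang_for_path(path).is_some()
}

/// Size-only version of the walker decision (illustrative name): the cap is
/// consulted only for code files, so a large PDF or Markdown file is never
/// skipped at this stage.
fn skip_for_size(path: &Path, len_bytes: u64, max_file_bytes: u64) -> bool {
    is_code_file(path) && len_bytes > max_file_bytes
}
```

With the default 262_144-byte cap, `skip_for_size(Path::new("big.rs"), 300_000, 262_144)` is true while the same call for `paper.pdf` or `notes.md` is false, which is the behavior the §3.5 test pins down.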
+
+### §3.5 Test additions
+
+Add to the existing test module in `crates/kebab-source-fs/src/connector.rs`:
+
+```rust
+#[test]
+fn size_cap_skips_only_code_files() {
+    let dir = tempfile::tempdir().unwrap();
+    let root = dir.path();
+    // 300 KB PDF (binary), 300 KB markdown (text), 300 KB Rust (code).
+    let big_blob: Vec<u8> = vec![b'x'; 300_000];
+    std::fs::write(root.join("paper.pdf"), &big_blob).unwrap();
+    std::fs::write(root.join("notes.md"), &big_blob).unwrap();
+    std::fs::write(root.join("big.rs"), &big_blob).unwrap();
+
+    let conn = FsSourceConnector::new(
+        &cfg_with_size_cap(root.to_str().unwrap(), 262_144, 5_000),
+    )
+    .unwrap();
+    let (assets, skips) = conn.scan_with_skips(&SourceScope::default()).unwrap();
+
+    let paths: Vec<_> = assets.iter().map(|a| a.workspace_path.0.clone()).collect();
+    // PDF + Markdown pass through walker.
+    assert!(paths.contains(&"paper.pdf".to_string()));
+    assert!(paths.contains(&"notes.md".to_string()));
+    // Code file gets skipped.
+    assert!(!paths.contains(&"big.rs".to_string()));
+    assert!(
+        skips.skip_examples.size_exceeded.iter().any(|p| p.contains("big.rs")),
+        "size_exceeded examples should contain only big.rs: {:?}",
+        skips.skip_examples.size_exceeded
+    );
+    assert!(
+        !skips.skip_examples.size_exceeded.iter().any(|p| p.contains("paper.pdf")),
+        "PDF must NOT appear in size_exceeded examples: {:?}",
+        skips.skip_examples.size_exceeded
+    );
+}
+```
+
+The existing test `ingest_report_counts_oversized_files_by_bytes` uses a fixture
+named `huge.rs`, so its invariant is preserved. The same holds for
+`ingest_report_size_cap_by_line_count`, whose fixture is `longfile.rs`.
+
+## §4 Bug #3 — chunk_id collision (Critical)
+
+### §4.1 Root cause investigation
+
+#### §4.1.1 The chunker's collision-avoidance workaround
+
+The module doc at `crates/kebab-chunk/src/pdf_page_v1.rs:47-60` describes the
+collision workaround:
+
+```
+Design §4.2's chunk_id = blake3(doc_id || chunker_version || sort(block_ids)
+|| policy_hash) collides when one block (= one PDF page) is split into
+multiple chunks: every chunk on that page has identical block_ids.
+
+Workaround: feed a per-chunk variant format!("{base_policy_hash}#c{char_start}")
+into the recipe's policy_hash slot.
+```
+
+The actual call at `crates/kebab-chunk/src/pdf_page_v1.rs:170-172`:
+
+```rust
+let per_chunk_hash = format!("{base_policy_hash}#c{char_start}");
+let chunk_id =
+    id_for_chunk(&doc.doc_id, &chunker_version, &block_ids, &per_chunk_hash);
+```
+
+Here `char_start` is the first tuple field returned by `chunk_page(...)`, i.e. the
+**post-overlap `actual_start`** (NOT the original segment boundary `start`).
+
+#### §4.1.2 How the overlap computes actual_start
+
+`crates/kebab-chunk/src/pdf_page_v1.rs:266-281`:
+
+```rust
+let actual_start = if let Some(prev) = chunks.last() {
+    let prev_min = prev.0;   // actual_start of the previous chunk
+    let mut a = start;
+    let mut acc_o: usize = 0;
+    while a > prev_min {
+        let cl = chars[a - 1].len_utf8();
+        if acc_o + cl > overlap_bytes {
+            break;
+        }
+        acc_o += cl;
+        a -= 1;
+    }
+    a
+} else {
+    start
+};
+```
+
+`while a > prev_min`: the overlap walk only backs up as far as the previous chunk's
+actual_start. When overlap_bytes is large enough and `start - prev_min` is small,
+`actual_start = prev_min`. **Two chunks share the same actual_start, hence the same `#c{N}`**.
+
+#### §4.1.3 Hypothesis check: F2 (1580 OCR chars)
+
+Assumption: F2's OCR text contains a sentence end (`.` + whitespace) or a
+paragraph break (`\n\n`) within its first ~80 chars.
+
+- Default chunking policy: `target_tokens=500` → `target_bytes=1500`,
+  `overlap_tokens=80` → `overlap_bytes = min(240, 750) = 240`.
+- A Korean char is 3 bytes in UTF-8, so overlap_bytes=240 allows a back-walk of up to 80 chars.
+- Assumed bounds = `[0, 30, ~n]` (one sentence end within the first ~30 chars).
+- Segment 1: start=0, chunk_end=30 → chunks.push((0, 30, ...)). `#c0`.
+- Segment 2: start=30, byte_len(30, n) >> target_bytes → a single-segment chunk.
+  - actual_start walk: a=30 → walk back while a > 0 and acc_o ≤ 240.
+  - 30 chars * 3 bytes = 90 bytes ≤ 240, so the loop ends at a=0 (=prev_min).
+  - actual_start = 0 = prev_min.
+- chunks.push((0, n, ...)). `#c0`: **collision with chunk 1**.
+
+chunk_id input for both chunks in the same doc:
+- `{kind:"chunk", doc_id:doc_id_F2, chunker_version:"pdf-page-v1",
+   block_ids:[block_id_F2], policy_hash:"{base_hash}#c0"}`
+- Identical canonical JSON → identical blake3 → identical chunk_id.
+
+→ In `put_chunks`, the `INSERT INTO chunks` succeeds for the first row and the
+second row hits a PRIMARY KEY violation.
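The walk-through above can be replayed with a small sketch. `overlap_start` mirrors the back-walk loop quoted in §4.1.2; `budgets` is a hypothetical helper whose 3-bytes-per-token factor and half-of-target clamp are assumptions inferred from the `500 → 1500` and `min(240, 750)` figures, not confirmed against the chunker source.

```rust
/// Back-walk from the segment boundary `start` toward the previous chunk's
/// start `prev_min`, consuming at most `overlap_bytes` of UTF-8 text.
/// Mirrors the loop quoted in §4.1.2.
fn overlap_start(chars: &[char], start: usize, prev_min: usize, overlap_bytes: usize) -> usize {
    let mut a = start;
    let mut used = 0usize;
    while a > prev_min {
        let cl = chars[a - 1].len_utf8();
        if used + cl > overlap_bytes {
            break; // the next char would exceed the overlap budget
        }
        used += cl;
        a -= 1;
    }
    a
}

/// Byte budgets implied by the §4.1.3 numbers (ASSUMPTION: 3 bytes per
/// token, overlap clamped to half of the target).
fn budgets(target_tokens: usize, overlap_tokens: usize) -> (usize, usize) {
    let target_bytes = target_tokens * 3;
    let overlap_bytes = (overlap_tokens * 3).min(target_bytes / 2);
    (target_bytes, overlap_bytes)
}
```

For a page whose first 30 chars are Korean (3 bytes each), a boundary at char 30 walks all the way back to 0 because 90 bytes fit in the 240-byte budget, so the second chunk shares the first chunk's start. A boundary deep in the page (say char 200) stops after 80 chars, which is why a late first boundary does not collide.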
+
+#### §4.1.4 Why F1 (779 OCR chars) does not collide
+
+F1's OCR text is also Korean, but either its character distribution differs or it
+has no sentence boundary within the first ~80 chars. In that case bounds = `[0, n]`,
+or the first boundary falls after char 80, so chunk 2's actual_start is a value other
+than prev_min → a distinct `#c{N}` → a distinct chunk_id.
+
+→ This matches the dogfood observation that **only F2 collides**.
+
+#### §4.1.5 Assessing the dogfood report's hypotheses
+
+The dogfood report guessed a cross-doc collision ("same chunk_id as scanned_page1").
+This spec's investigation finds an **intra-doc collision (within F2)**.
+Evidence:
+- The chunk_id input includes `doc_id`, so chunk_ids from different docs differ automatically.
+- Two chunks in the same doc with the same block_id and the same `#c{N}` policy_hash
+  get an identical chunk_id.
+- Hypothesis A (policy_hash default magic value): not confirmed; base_policy_hash is
+  the blake3 of the policy's canonical JSON (deterministic).
+- Hypothesis B (id_for_block's char_end is part of the hash): possible, but unrelated
+  to the chunk_id collision itself (a block_id change produces a chunk_id change,
+  which is a different collision pattern).
+- Hypothesis C (the chunker's block_ids ordering): possible, but each chunk has a
+  single block, so ordering is N/A.
+- Hypothesis D (OCR text is identical inline to another doc): the chunk_id input
+  does not include text, so N/A.
+
+**Confirmed root cause** = a variant of hypothesis C: when a single page yields
+multiple chunks, the overlap's actual_start collapses to the previous chunk's
+actual_start, making the `#c{N}` suffix identical.
+
+### §4.2 Decision matrix
+
+| Option | Description | Pros | Cons |
+|--------|-------------|------|------|
+| **A: segment boundary `start`** | Change the `per_chunk_hash` suffix from the post-overlap `actual_start` to the segment boundary `start` | Minimal change / segment boundaries are monotonically increasing (seg_idx loop invariant of chunk_page) → always distinct / preserves the semantic intent of chunk_id | Requires changing the shape of chunk_page's return tuple |
+| B: chunk ordinal | `per_chunk_hash = "#c{ordinal}"` (chunk index within the page: 0, 1, 2, ...) | Simplest / independent of segment boundaries | Weakens the "meaningful hash input" semantic of chunk_id |
+| C: (`char_start`, `char_end`) pair | `per_chunk_hash = "#c{char_start}-{char_end}"` | Two chunks with the same char_start remain distinct if char_end differs | char_end can also coincide due to the overlap clamp (e.g. if the last chunk is split twice), so the invariant is weak |
+| D: sequence number + char_start | `per_chunk_hash = "#c{ordinal}-{char_start}"` | Fully guarantees the invariant | Redundant info; longer hash input |
+
+### §4.3 Chosen path — Option A
+
+Rationale:
+- In chunk_page's main loop `seg_idx` is strictly increasing, and so is the segment
+  boundary `bounds[seg_idx]` (bounds are unique after dedup). Using the segment
+  boundary `start` as the hash suffix therefore guarantees distinct hash inputs for
+  the chunks within a page.
+- Semantics of chunk_id: "which segment does this chunk start from". The pre-overlap
+  segment boundary is the true semantic origin; overlap is an enrichment for the
+  retrieval boundary.
+- Extend chunk_page's return tuple to the 4-tuple `(segment_start, actual_start,
+  chunk_end, slice)` (or track segment_start separately inside the chunker loop),
+  which keeps the diff minimal.
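To make the uniqueness argument concrete, here is a small sketch comparing the two suffix schemes over `(segment_start, actual_start)` pairs. The type alias and function names are illustrative only, and the sample values used with it are made up to match the collapsed case rather than taken from a real page.

```rust
use std::collections::HashSet;

/// One chunk as `(segment_start, actual_start)` in char indices.
type ChunkStarts = (usize, usize);

/// Old scheme: suffix derived from the post-overlap start.
fn suffixes_by_actual_start(chunks: &[ChunkStarts]) -> Vec<String> {
    chunks.iter().map(|&(_, actual)| format!("#c{actual}")).collect()
}

/// Option A: suffix derived from the pre-overlap segment boundary.
fn suffixes_by_segment_start(chunks: &[ChunkStarts]) -> Vec<String> {
    chunks.iter().map(|&(segment, _)| format!("#c{segment}")).collect()
}

/// True when no suffix repeats; `insert` returns false on a duplicate.
fn all_unique(items: &[String]) -> bool {
    let mut seen = HashSet::new();
    items.iter().all(|s| seen.insert(s.as_str()))
}
```

For the pairs `[(0, 0), (30, 0), (520, 440)]`, the actual_start scheme yields `#c0` twice, while the segment_start scheme yields three distinct suffixes. Because segment boundaries are strictly increasing, the second scheme stays unique no matter how far the overlap walks back.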
+
+### §4.4 Implementation
+
+Extend the return signature of `chunk_page` in `crates/kebab-chunk/src/pdf_page_v1.rs`:
+
+```rust
+/// Split a single page's text into ordered chunks, each represented as
+/// `(segment_start, actual_start, chunk_end, text_slice)`.
+///
+/// - `segment_start` = pre-overlap segment boundary. Strictly increasing
+///   across the returned vec. Use this for chunk_id uniqueness suffixes.
+/// - `actual_start` = post-overlap start char index. May collapse to a
+///   previous chunk's `actual_start` under aggressive overlap policy.
+///   Use this for `SourceSpan::Page::char_start`.
+/// - `chunk_end` = chunk's end char index (exclusive).
+fn chunk_page(
+    text: &str,
+    target_bytes: usize,
+    overlap_bytes: usize,
+) -> Vec<(usize, usize, usize, String)> {
+    // ... (existing logic, but each push uses (segment_start, actual_start, chunk_end, slice))
+    chunks.push((start, actual_start, chunk_end, slice));
+    // ...
+}
+```
+
+Update the caller, the `chunk` method, in the same way:
+
+```rust
+for (segment_start, char_start, char_end, slice) in
+    chunk_page(&p.text, target_bytes, overlap_bytes)
+{
+    // ... existing u32 conversion + span construction ...
+    let span = SourceSpan::Page {
+        page: page_num,
+        char_start: Some(u32::try_from(char_start).expect("...")),
+        char_end: Some(u32::try_from(char_end).expect("...")),
+    };
+    let block_ids: Vec<BlockId> = vec![p.common.block_id.clone()];
+    // segment_start (pre-overlap boundary) is strictly increasing across
+    // chunks, even when overlap walk collapses actual_start to prev_min.
+    let per_chunk_hash = format!("{base_policy_hash}#c{segment_start}");
+    let chunk_id =
+        id_for_chunk(&doc.doc_id, &chunker_version, &block_ids, &per_chunk_hash);
+    // ... rest unchanged ...
+}
+```
+
+Update the module doc at `crates/kebab-chunk/src/pdf_page_v1.rs:47-60` at the same
+time, since the `"#c{char_start}"` in its existing description is stale after this fix:
+
+```rust
+//! Design §4.2's chunk_id = blake3(doc_id || chunker_version || sort(block_ids)
+//! || policy_hash) collides when one block (= one PDF page) is split into
+//! multiple chunks: every chunk on that page has identical block_ids.
+//!
+//! Workaround that doesn't change the §4.2 recipe: feed a per-chunk
+//! variant `format!("{base_policy_hash}#c{segment_start}")` into the
+//! recipe's `policy_hash` slot. `segment_start` is the pre-overlap segment
+//! boundary, strictly increasing across the returned chunks even when the
+//! overlap walk collapses `actual_start` to a previous chunk's `prev_min`.
+//! Logged in tasks/HOTFIXES.md (2026-05-27 — Bug #3 second-iteration patch).
+```
+
+Also add a dated entry to `tasks/HOTFIXES.md` (this fix is the **second-iteration
+patch** of the chunk_id deviation; the entry records that the first-iteration
+`#c{char_start}` workaround left a collision in the aggressive-overlap case):
+
+```markdown
+## 2026-05-27 — v0.20.0 sub-item 1: chunk_id `#c{char_start}` workaround
+collapses under aggressive overlap (Bug #3 second-iteration patch)
+
+**Symptom**: ingesting F2 (1580 chars OCR) fails with `DocumentStore::put_chunks (pdf):
+UNIQUE constraint failed: chunks.chunk_id`. ...
+
+**Root cause**: at `crates/kebab-chunk/src/pdf_page_v1.rs:170`, ...
+the post-overlap `actual_start` collapses to the previous chunk's actual_start ...
+
+**Fix** (this spec, §4.4): add `segment_start` to the `chunk_page` return tuple
+and change the `per_chunk_hash` suffix to `segment_start` ...
+
+**chunker_version cascade**: `pdf-page-v1` → `pdf-page-v1.1` bump
+(see §4.4.1). chunk_ids of multi-chunk PDF pages change; the design §9
+cascade trigger invalidates them explicitly.
+
+**Amends**: spec `docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md`
+§4.4. The parent design §4.2 chunk_id recipe itself is unchanged (only the internal
+computation of the workaround layer changes).
+```
+
+#### §4.4.1 Preserving chunk_id determinism
+
+Existing single-chunk-per-page case (e.g. small pages, `text.len() <= target_bytes`):
+- `chunk_page`'s early return `vec![(0, n, text.to_string())]` becomes
+  `vec![(0, 0, n, text.to_string())]` in the new shape. `segment_start = 0 = actual_start`.
+- Same `#c0` suffix → same chunk_id.
+
+First chunk in the multi-chunk case:
+- segment_start = bounds[0] = 0, actual_start = start = 0 (no previous chunk).
+- Same `#c0` suffix → same chunk_id.
+
+Second and later chunks in the multi-chunk case:
+- Before: `actual_start` (overlap-walked, may be == 0).
+- After: `segment_start` = bounds[seg_idx] > 0.
+- → chunk_id changes (intentional, to avoid the collision).
+
+→ chunk_ids of multi-chunk pages in existing v0.19 (pre-OCR) PDF KBs change.
+They are refreshed automatically on the v0.20 force-reingest path.
+
+**Decision (round 1c, closes §7.2 Open Q1): bump chunker_version
+`pdf-page-v1` → `pdf-page-v1.1`** (adopting critic round 1 recommendation M-1).
+
+Rationale:
+- For normal multi-chunk PDF pages (e.g. metro-korea.pdf in dogfood report Scenario 1,
+  with 21 blocks / 34 chunks, a normal path that did not trigger Bug #3), the
+  internal computation change silently maps chunk_ids to different values.
+  If chunker_version stayed at `pdf-page-v1`, the store/embedding layer's cascade
+  audit would not fire, so unless the user explicitly invokes `--force-reingest`,
+  the vector store's chunk_id ↔ chunk_text mapping could silently mismatch.
+- The original intent of the design §9 cascade rule is that a chunker algorithm
+  change gets an explicit version bump, which makes the store layer report the
+  invalidation automatically. The `pdf-page-v1.1` bump applies that rule directly.
+- Bump cost = zero: v0.20.0 is itself a force-update release (cumulative bugfixes
+  stacked on the single PR #189 release commit), and enabling the OCR feature of the
+  parent spec (`2026-05-27-pdf-scanned-ocr-spec.md`) already recommends
+  force-reingest. With only the chunker_version differing, single-chunk PDF pages
+  are recomputed by the cascade in the same way within the new doc_id chain.
+- Benefit = an explicit user-facing audit trail. On the next ingest the cascade
+  invalidation is stated in the store layer report.
+
+The other cascade version fields (parser_version / embedding_version /
+prompt_template_version / index_version) are unchanged; the patch is confined
+to the chunker layer.
+
+Update the `chunker_version()` constant of `PdfPageV1Chunker`:
+```rust
+impl Chunker for PdfPageV1Chunker {
+    fn chunker_version(&self) -> ChunkerVersion {
+        ChunkerVersion("pdf-page-v1.1".to_string())  // was: "pdf-page-v1"
+    }
+    // ...
+}
+```
+
+Also update `PARSER_VERSION` or the `CHUNKER_VERSION` const in
+`crates/kebab-chunk/src/pdf_page_v1.rs` (matching the crate's actual constant name).
+
+### §4.5 Test additions
+
+Add to `#[cfg(test)] mod tests` in `crates/kebab-chunk/src/pdf_page_v1.rs`:
+
+```rust
+#[test]
+fn multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids() {
+    // Regression test for v0.20.0 sub-item 1 Bug #3: post-overlap actual_start
+    // can collapse to prev_min, producing identical `#c{char_start}` suffixes
+    // and identical chunk_ids → sqlite chunks.chunk_id PRIMARY KEY violation
+    // at put_chunks INSERT time.
+    //
+    // Synthesises Korean OCR text shape: dense Korean characters (3 bytes
+    // per char) with a single early sentence-end boundary at char ~10 +
+    // long tail.
+
+    // 10 Korean chars (= 30 UTF-8 bytes) + "." + " " + ~500 more Korean chars.
+    let early_seg: String = std::iter::repeat('가').take(10).collect();
+    let tail: String = std::iter::repeat('나').take(500).collect();
+    let page_text = format!("{early_seg}. {tail}");
+
+    let doc = make_pdf_doc(&[&page_text]);
+    let policy = default_policy(500, 80);  // target=1500 byte, overlap=240 byte
+    let chunks = PdfPageV1Chunker.chunk(&doc, &policy).unwrap();
+
+    assert!(
+        chunks.len() >= 2,
+        "expected ≥2 chunks for {} byte page; got {}",
+        page_text.len(),
+        chunks.len()
+    );
+
+    // Hard invariant: all chunk_ids must be unique. Without the fix, the
+    // second chunk would have actual_start = 0 (== first chunk's
+    // actual_start) under the aggressive-overlap walk → identical `#c0`
+    // hash suffix → identical chunk_id → PRIMARY KEY violation.
+    let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
+    ids.sort_unstable();
+    let total = ids.len();
+    ids.dedup();
+    assert_eq!(
+        ids.len(),
+        total,
+        "all chunk_ids must be unique even when overlap walks actual_start back to prev_min"
+    );
+}
+```
+
+(round 1c L-1: the second test from round 0,
+`chunk_id_recipe_uses_segment_start_not_actual_start`, was removed. It was redundant
+with this test's uniqueness check, and its only real assertion was
+`assert!(chunks.len() >= 2)`, which did not match the intent of the test name.)
+
+Also add an integration-level regression test under `crates/kebab-app/tests/`:
+
+```rust
+// crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs (new)
+//
+// v0.20.0 sub-item 1 Bug #3 regression: ingesting multiple scanned PDFs (each a
+// single page with a different OCR text length) passes without chunk_id collisions.
+//
+// Configure a mock OCR engine (MockOcrEngine from kebab-parse-image, or an inline
+// impl) to return a different text length per page (e.g. 30, 200, 800 chars).
+// This avoids calling a real Ollama instance.
+
+#[test]
+fn multi_scanned_pdf_ingest_no_chunk_id_collision() {
+    // ... setup: 3 scanned PDF fixture, mock OCR engine, isolated KB
+    // ... assert: every ingest_report.items entry has kind != Error
+    // ... assert: store.get_chunks_count() = sum of per-PDF chunk_counts
+}
+```
+
+(round 1c NIT-1: the file name and function name are unified as
+`multi_scanned_pdf_ingest_no_chunk_id_collision`. The round 0 file name
+`pdf_multi_scan_no_chunk_id_collision.rs` did not match the fn name.)
+
+#### §4.5.1 Pre-condition — MockOcrEngine availability (round 1c M-3)
+
+This integration test requires a mock impl of the `OcrEngine` trait. First step
+of the executor stage:
+
+1. Locate MockOcrEngine with
+   `grep -rn "impl OcrEngine" crates/kebab-parse-image/src/ crates/kebab-app/tests/`.
+2. **Current state** (2026-05-27 verifier probe):
+   - `crates/kebab-parse-image/src/ocr.rs:235`: production `impl OcrEngine for OllamaVisionOcr`.
+   - `crates/kebab-app/tests/pdf_ocr_apply.rs:25`: `impl OcrEngine for MockOcrEngine` (test-only).
+3. The new integration test (`multi_scanned_pdf_ingest_no_chunk_id_collision.rs`)
+   is a separate test binary in the same crate (`kebab-app`), so it cannot import
+   the private MockOcrEngine from `pdf_ocr_apply.rs` directly. Executor options:
+   - **Option A (recommended)**: lift MockOcrEngine into
+     `crates/kebab-app/tests/common/mock_ocr.rs` (parameterised: per-page text
+     lengths passed as a ctor argument). Both tests (`pdf_ocr_apply.rs` and the
+     new test) share it via `mod common;`.
+   - **Option B**: define a duplicate inline `impl OcrEngine for LocalMock { ... }`
+     inside the new test (favoring test isolation, avoiding the sharing cost).
+4. If no mock exists (or sharing is hard and Option B is also impractical),
+   **conditionally downgrade** the acceptance in §6 row 7: the unit-level
+   invariant in `kebab-chunk` (§6 row 4) alone pins the core regression of
+   Bug #3, and the integration test is skipped.
+
+The decision path for the executor's dependency-check task is closed in
+§7.2 Open Q4.
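A rough sketch of what the parameterised mock from Option A could look like. The `OcrEngine` trait here is a hypothetical, simplified stand-in defined only for this example; the real trait's method names and signatures are not shown in this spec and will differ.

```rust
/// HYPOTHETICAL, simplified stand-in for the project's `OcrEngine` trait.
/// The real trait signature is an unknown here; this exists only so the
/// sketch compiles on its own.
trait OcrEngine {
    fn recognize_page(&self, page_index: usize) -> String;
}

/// Mock that returns text of a configured length for each page.
struct MockOcrEngine {
    page_lens: Vec<usize>,
}

impl MockOcrEngine {
    /// `page_lens[i]` is the number of chars returned for page `i`.
    fn new(page_lens: Vec<usize>) -> Self {
        Self { page_lens }
    }
}

impl OcrEngine for MockOcrEngine {
    fn recognize_page(&self, page_index: usize) -> String {
        // Pages without a configured length yield empty text.
        let n = self.page_lens.get(page_index).copied().unwrap_or(0);
        // Korean filler keeps the 3-bytes-per-char shape of real OCR output.
        "가".repeat(n)
    }
}
```

Constructing it as `MockOcrEngine::new(vec![30, 200, 800])` gives the three text lengths mentioned in the test comment, without any network call.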
+
+### §4.6 Acceptance (Bug #3 fix)
+
+- Zero chunk_id collisions when F1 (779 chars) and F2 (1580 chars) are ingested together.
+- Zero collisions on every `--force-reingest`.
+- Zero collisions in a KB of 5+ scanned PDFs (Korean OCR text ranging 100~3000 chars).
+- The existing 1000-iteration determinism test in `crates/kebab-chunk`
+  (`deterministic_chunk_ids_1000`) keeps passing.
+- Zero workspace test regressions.
+- New test `multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids`
+  and integration test `multi_scanned_pdf_ingest_no_chunk_id_collision` are added.
+
+## §5 Bug #4 — F4 fixture Pages tree
+
+### §5.1 Root cause
+
+#### §5.1.1 Symptom
+
+```json
+{
+  "doc_path": "mojibake.pdf",
+  "kind": "new",
+  "byte_len": 22568,
+  "pdf_ocr_pages": 0,
+  "pdf_ocr_ms_total": 0,
+  "block_count": 0,
+  "chunk_count": 0,
+  "warnings": []
+}
+```
+
+This violates the PdfTextExtractor invariant (§1.4 "1 Block::Paragraph per page").
+
+#### §5.1.2 How lopdf get_pages() reacts
+
+Dogfood probe:
+- `lopdf::Document::load_mem(F4_bytes)` → OK.
+- `pdf_doc.get_pages()` → empty `BTreeMap`.
+- In the PDF byte stream, the `/Type /Page` count is 1 and the `/Count` value is 1.
+
+→ Structurally 1 page exists, but lopdf's Pages tree traversal
+(`/Pages` → `/Kids` chain) is broken.
+
+#### §5.1.3 Analysis of the fixture generation path
+
+`tests/fixtures/_synth/mojibake.py`:
+
+```python
+c = canvas.Canvas(str(dst), pagesize=A4)
+c.setFont(FONT_NAME, 12)
+y = A4[1] - 30*mm
+for line in ["Mojibake fixture (no ToUnicode CMap)", "..."]:
+    c.drawString(30*mm, y, line)
+    y -= 16
+c.save()
+
+data = dst.read_bytes()
+# pattern: "/ToUnicode <objref>" — strip indirect object reference
+new_data = re.sub(rb"/ToUnicode\s+\d+\s+\d+\s+R\b", b"", data)
+dst.write_bytes(new_data)
+```
+
+**Step 2 analysis**: `re.sub` removes the `/ToUnicode N M R` byte sequence, but:
+- Every byte offset after the removed span shifts by the length of the removed bytes.
+- The offset entries in the cross-reference table (`xref`) become stale.
+- The offset in the `startxref` value becomes stale too.
+
+**Step 3, the startxref fix** (commit `c2cd3a7` in `tasks/HOTFIXES.md`):
+- A manual byte edit `22130 → 22114` updated startxref.
+- However, the individual offsets in the xref table itself are also stale: the
+  actual byte position of the indirect object referenced by the Pages tree's
+  `/Kids` array does not match its xref entry.
+- lopdf's strict load primarily validates startxref + the xref table; the load
+  succeeds, but indirect object resolution fails during Pages tree traversal →
+  empty Pages map.
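The offset shift can be illustrated on a toy buffer. The sketch below is not a PDF parser: it only records where an object header sits, strips a byte sequence in front of it, and shows that the recorded offset no longer points at the header. All function names are illustrative.

```rust
/// Find the first occurrence of `needle` in `hay`.
fn find(hay: &[u8], needle: &[u8]) -> Option<usize> {
    hay.windows(needle.len()).position(|w| w == needle)
}

/// Remove the first occurrence of `needle`, similar in effect to a one-shot
/// byte-level substitution with an empty replacement.
fn strip_first(data: &[u8], needle: &[u8]) -> Vec<u8> {
    match find(data, needle) {
        Some(i) => [&data[..i], &data[i + needle.len()..]].concat(),
        None => data.to_vec(),
    }
}

/// True when a recorded offset (as an xref entry would hold) still points
/// at the expected object header.
fn offset_is_valid(data: &[u8], offset: usize, header: &[u8]) -> bool {
    data.get(offset..offset + header.len()) == Some(header)
}
```

If the offset of `2 0 obj` is recorded before stripping `/ToUnicode 9 0 R` from an earlier object, the object moves 16 bytes toward the start of the buffer and the recorded offset becomes invalid. Patching only the trailing `startxref` value leaves every such per-object entry wrong, which is consistent with the empty Pages map seen in the probe.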
+
+### §5.2 Fixture re-generation strategy
+
+| Option | Description | Pros | Cons |
+|--------|-------------|------|------|
+| **A: pikepdf** | Synthesize with reportlab, then open with pikepdf, remove ToUnicode, and save (xref auto-regen) | Proper xref regeneration / Pages tree intact / library available (pip install pikepdf) | New Python dependency (`pikepdf`) |
+| B: qpdf normalize | After the byte edit, run `qpdf --linearize input.pdf output.pdf` | External tool (the sub-item 1 acceptance criteria already hint at qpdf) | qpdf's normalize may reject the broken xref (or may inline the ToUnicode reference again) |
+| C: reportlab disable ToUnicode | Disable ToUnicode CMap generation for the Type 0 font during reportlab synthesis | Avoids the byte edit; clean | The reportlab API does not directly expose disabling ToUnicode (requires a font subclass or monkeypatch) |
+
+### §5.3 Chosen path — Option A (pikepdf)
+
+Rationale:
+- pikepdf is a proper PDF surgery library: the Python bindings for qpdf.
+- It regenerates the xref table automatically and preserves Pages tree integrity.
+- The dependency is added via `pip install pikepdf`. The Python venv for fixture
+  generation already uses reportlab, so the extra install is trivial.
+
+#### §5.3.1 Stripping ToUnicode with pikepdf
+
+In reportlab's Type 0 fonts, the ToUnicode CMap reference appears as `/ToUnicode <ref>`
+inside the font dictionary. Use pikepdf to remove only the `/ToUnicode` entry of the
+font dictionary:
+
+```python
+import pikepdf
+with pikepdf.open(str(dst), allow_overwriting_input=True) as pdf:
+    # Walk all indirect objects, delete /ToUnicode entry whenever found.
+    # Per the PDF spec, /ToUnicode appears only as a child of a Font dictionary,
+    # so the false-positive risk is practically zero. The explicit font-type check
+    # is omitted (same shape as the actual implementation in §5.4).
+    for obj in pdf.objects:
+        if isinstance(obj, pikepdf.Dictionary):
+            if "/ToUnicode" in obj:
+                del obj["/ToUnicode"]
+    pdf.save(str(dst))
+```
+
+pikepdf's save preserves xref + Pages tree integrity automatically.
+
+### §5.4 Implementation (mojibake.py revision)
+
+Complete rewrite of `tests/fixtures/_synth/mojibake.py`:
+
+```python
+"""Synthesize mojibake fixture -- Type 0 font PDF without ToUnicode CMap.
+
+Strategy:
+1. Synthesize a PDF with reportlab using a Type 0 (CID) font (includes a normal ToUnicode CMap).
+2. Open it with pikepdf, remove the /ToUnicode entry of the font dictionary, and save (xref auto-regen).
+
+Dependency: reportlab + pikepdf. Install via `pip install reportlab pikepdf`.
+
+Usage:
+  python3 tests/fixtures/_synth/mojibake.py \
+      crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf
+"""
+import sys
+from pathlib import Path
+from reportlab.lib.pagesizes import A4
+from reportlab.lib.units import mm
+from reportlab.pdfbase import pdfmetrics
+from reportlab.pdfbase.ttfonts import TTFont
+from reportlab.pdfgen import canvas
+import pikepdf
+
+DEJAVU_TTF = "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"
+FONT_NAME = "DejaVuSans"
+pdfmetrics.registerFont(TTFont(FONT_NAME, DEJAVU_TTF))
+
+dst = Path(sys.argv[1])
+
+# Step 1: synthesize a normal PDF.
+c = canvas.Canvas(str(dst), pagesize=A4)
+c.setFont(FONT_NAME, 12)
+y = A4[1] - 30 * mm
+for line in [
+    "Mojibake fixture (no ToUnicode CMap)",
+    "Text extraction yields garbage \x00\x01\x02",
+]:
+    c.drawString(30 * mm, y, line)
+    y -= 16
+c.save()
+
+# Step 2: strip the /ToUnicode reference with pikepdf and regenerate the xref.
+removed = 0
+with pikepdf.open(str(dst), allow_overwriting_input=True) as pdf:
+    for obj in pdf.objects:
+        if isinstance(obj, pikepdf.Dictionary):
+            if "/ToUnicode" in obj:
+                del obj["/ToUnicode"]
+                removed += 1
+    pdf.save(str(dst))
+
+if removed == 0:
+    print("ERROR: no /ToUnicode entry found in any dictionary", file=sys.stderr)
+    sys.exit(2)
+
+# Step 3: verify invariants (file loads, page count is 1).
+with pikepdf.open(str(dst)) as pdf:
+    n_pages = len(pdf.pages)
+    if n_pages != 1:
+        print(f"ERROR: expected 1 page, got {n_pages}", file=sys.stderr)
+        sys.exit(3)
+    # Confirm the ToUnicode-absence invariant.
+    raw = Path(dst).read_bytes()
+    if b"/ToUnicode" in raw:
+        print("ERROR: /ToUnicode still present after strip", file=sys.stderr)
+        sys.exit(4)
+
+print(f"wrote {dst} ({dst.stat().st_size} bytes, ToUnicode stripped via pikepdf, 1 page)")
+```
+
+### §5.5 Test additions
+
+Add to `crates/kebab-parse-pdf/tests/text_extractor.rs` (or the relevant
+existing test file):
+
+```rust
+/// Pages tree invariant of the F4 mojibake.pdf: guarantees that the Step 2
+/// fixture re-generation (pikepdf-based) lets lopdf's get_pages() return
+/// normally.
+///
+/// Bug #4 regression: the previous fixture (byte-edit + manual startxref)
+/// passed lopdf's strict load, but Pages tree traversal hit broken indirect
+/// object resolution, giving an empty pages map and block_count=0.
+#[test]
+fn mojibake_fixture_load_yields_one_page() {
+    let bytes = include_bytes!("../tests/fixtures/mojibake.pdf");
+    let doc = lopdf::Document::load_mem(bytes).expect("F4 fixture must lopdf-load");
+    let pages = doc.get_pages();
+    assert_eq!(
+        pages.len(),
+        1,
+        "F4 fixture must have exactly 1 page (Pages tree integrity)"
+    );
+}
+
+#[test]
+fn mojibake_fixture_has_no_tounicode_cmap() {
+    // ToUnicode-absence invariant from Step 2.
+    let bytes = std::fs::read("tests/fixtures/mojibake.pdf").unwrap();
+    let count = bytes.windows(b"/ToUnicode".len())
+        .filter(|w| *w == b"/ToUnicode")
+        .count();
+    assert_eq!(count, 0, "F4 fixture must not contain /ToUnicode marker");
+}
+
+#[test]
+fn pdf_text_extractor_on_mojibake_yields_one_block() {
+    // PdfTextExtractor invariant: 1 Block::Paragraph per page.
+    // The F4 fixture has no ToUnicode, so text extraction yields garbage or
+    // empty text: 1 empty Block::Paragraph + "scanned candidate" warning.
+    let bytes = include_bytes!("../tests/fixtures/mojibake.pdf");
+    // ... ExtractContext setup + extractor.extract(&ctx, bytes) ...
+    let canonical = extractor.extract(&ctx, bytes).unwrap();
+    assert_eq!(canonical.blocks.len(), 1, "expected 1 Block::Paragraph per page");
+    // Text is garbage or empty; the invariant is that the block itself exists.
+    let warning_present = canonical.provenance.events.iter().any(|e| {
+        matches!(e.kind, ProvenanceKind::Warning)
+            && e.note.as_ref().is_some_and(|n| n.contains("scanned candidate"))
+    });
+    assert!(warning_present || !canonical.blocks[0].text_is_empty(),
+            "an empty text-detect-first fallback must carry the scanned-candidate warning");
+}
+```
+
+### §5.6 Acceptance (Bug #4 fix)
+
+- After F4 fixture re-generation, `lopdf::Document::load_mem(...)` succeeds and
+  `get_pages().len() == 1`.
+- The F4 fixture keeps the ToUnicode-CMap-absence invariant
+  (`grep -c "/ToUnicode" mojibake.pdf` = 0).
+- When PdfTextExtractor ingests F4, `block_count = 1`, with either the
+  warning `"page1 empty (scanned candidate)"` or garbage text.
+- On dogfood retest, mojibake.pdf reports `block_count: 1` and
+  `chunk_count: 0~1` (depending on text content).
+- The F4 baseline in the existing `text_extractor_regression.rs` is updated:
+  the old baseline is itself a snapshot of the broken invariant, so it must
+  be refreshed.
+- Zero workspace test regressions.
+
+## §6 Acceptance criteria (consolidated)
+
+| # | Verifier | Bug | Command |
+|---|---------|-----|------|
+| 1 | walker bypasses size cap for PDF | #2 | `cargo test -p kebab-source-fs size_cap_skips_only_code_files` |
+| 2 | walker still skips oversized code files | #2 | `cargo test -p kebab-source-fs ingest_report_counts_oversized_files_by_bytes` |
+| 3 | 256KB+ PDF/markdown ingest default config | #2 | dogfood retest: `kebab ingest` reports `skipped_size_exceeded = 0` for non-code |
+| 4 | chunker collision regression test | #3 | `cargo test -p kebab-chunk multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids` |
+| 5 | chunker determinism preserved | #3 | `cargo test -p kebab-chunk deterministic_chunk_ids_1000` |
+| 6 | chunker overlap clamp preserved | #3 | `cargo test -p kebab-chunk overlap_clamped_when_overlap_exceeds_target` |
+| 7 | integration: multi-scanned PDF ingest (conditional: only if the MockOcrEngine from §4.5.1 can be shared) | #3 | `cargo test -p kebab-app multi_scanned_pdf_ingest_no_chunk_id_collision` |
+| 8 | dogfood: F1 + F2 force-reingest | #3 | dogfood retest: `kebab ingest --force-reingest` reports errors = 0 (excluding encrypted) |
+| 9 | F4 fixture lopdf 1-page invariant | #4 | `cargo test -p kebab-parse-pdf mojibake_fixture_load_yields_one_page` |
+| 10 | F4 fixture ToUnicode-absence invariant | #4 | `cargo test -p kebab-parse-pdf mojibake_fixture_has_no_tounicode_cmap` |
+| 11 | F4 PdfTextExtractor 1-block invariant | #4 | `cargo test -p kebab-parse-pdf pdf_text_extractor_on_mojibake_yields_one_block` |
+| 12 | dogfood: F4 ingest yields block_count=1 | #4 | dogfood retest: the mojibake.pdf ingest item reports `block_count: 1` |
+| 13 | workspace clippy clean | all | `cargo clippy --workspace --all-targets -- -D warnings` |
+| 14 | workspace full test pass | all | `cargo test --workspace --no-fail-fast -j 1` |
+| 15 | dogfood end-to-end 9 PDF | all | dogfood retest: all 9 PDFs go through ingest, errors = 2 (encrypted only) |
+| 16 | chunker_version cascade: final value | #3 | `grep -nE 'pdf-page-v[0-9.]+' crates/kebab-chunk/src/pdf_page_v1.rs` yields `"pdf-page-v1.1"` (round 1c M-1 decision) |
+
+## §7 Risks + open questions
+
+### §7.1 Risks
+
+- **The Bug #3 fix changes chunk_id**: chunk_ids change for multi-chunk PDF
+  pages (pages over 1500 bytes at the pre-OCR stage). Users need one
+  `--force-reingest` run. Acceptable because v0.20.0 is a force-update path
+  (users do a fresh ingest anyway). State this in the README or release notes.
+- **Side effect of the Bug #2 fix**: PDFs of 1 GB or more now pass the walker,
+  so lopdf's load_mem risks a memory blow-up. Out of v0.20 scope (a streaming
+  parser is to be considered from Phase 9 on; see the future scope in design
+  §9.2). Acceptable for this fix.
+- **The Bug #4 fix changes the fixture binary**: the SHA256 of F4 mojibake.pdf
+  changes, which adds git LFS / binary diff noise. The baseline in
+  `text_extractor_regression.rs` is also updated to the new fixture's output,
+  in the same commit.
+- **pikepdf install requirement**: fixture re-generation needs `pip install
+  pikepdf`. This would add a Python dependency to CI only if fixture
+  regeneration were part of CI; this spec commits the fixture itself, so
+  generation is a one-off and no CI dependency arises.
+
+### §7.2 Open questions
+
+1. **Cost-benefit of the chunker_version bump**: ✅ **CLOSED (round 1c M-1)**.
+   Decided to bump `pdf-page-v1` → `pdf-page-v1.1`. The cascade audit trail
+   becomes explicit, and the cost is zero on the v0.20 force-update path. For
+   details see the "Decision (round 1c, closes §7.2 Open Q1)" paragraph in §4.4.1.
+2. **Including Bug #2 Option B (per-type config) in v0.20 scope**: this spec
+   defers it to v0.21+. Critic round 1 ACCEPT: no recommendation to include it in v0.20.
+3. **F4 fixture invariant**: critic round 1 ACCEPT. The combination of absent
+   ToUnicode + valid Pages tree is exactly reproducible through pikepdf's proper
+   PDF surgery. The Step 2 design (mojibake.py rewrite) is sound.
+4. **Mock OCR for the integration test**: ✅ **CLOSED (round 1c M-3)**.
+   `impl OcrEngine for MockOcrEngine` already exists at
+   `crates/kebab-app/tests/pdf_ocr_apply.rs:25`. The executor only has to
+   choose between the share path (Option A: lift into
+   `tests/common/mock_ocr.rs`) and inline duplication (Option B). If sharing is
+   not possible, §6 row 7 is conditionally downgraded; for details see the
+   "Pre-condition — MockOcrEngine availability" paragraph in §4.5.1.
+5. **chunk_page tuple shape change**: does Option A's 4-tuple `(segment_start,
+   actual_start, chunk_end, slice)` affect external callers? `chunk_page` is
+   module-private (`fn chunk_page`), so there are zero external callers and the
+   change is safe. Critic round 1 ACCEPT.
+
+## §8 References
+
+- dogfood report: `.omc/reviews/2026-05-27-v0.20-dogfood-report.md`
+- parent spec (frozen): `docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md`
+- parent plan (round 1c ACCEPT): `docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md`
+- source code (root cause evidence):
+  - `crates/kebab-source-fs/src/connector.rs` (Bug #2)
+  - `crates/kebab-source-fs/src/code_meta.rs` (is_oversized + code_lang_for_path)
+  - `crates/kebab-config/src/lib.rs` (IngestCodeCfg)
+  - `crates/kebab-core/src/ids.rs` (id_for_chunk / id_for_block recipes)
+  - `crates/kebab-chunk/src/pdf_page_v1.rs` (PdfPageV1Chunker + chunk_page)
+  - `crates/kebab-app/src/pdf_ocr_apply.rs` (post-extract OCR enrichment)
+  - `crates/kebab-app/src/lib.rs:1769-1968` (ingest_one_pdf_asset wiring)
+  - `crates/kebab-store-sqlite/src/documents.rs:103-155` (put_chunks DELETE+INSERT)
+  - `migrations/V001__init.sql:80-94` (chunks table DDL — chunk_id PRIMARY KEY)
+  - `tests/fixtures/_synth/mojibake.py` (Bug #4 fixture source)
+- design §3.4 (SourceSpan::Page), §3.5 (Chunk + chunk_id recipe),
+  §4.2 (id_from canonical JSON), §5.2 (walker builtin blacklist),
+  §9 (versioning cascade).
--- a/docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
+++ b/docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
@@ -0,0 +1,308 @@
+---
+title: "v0.20.0 sub-item 1 bugfix round 2 — Identity-H mojibake marker + CLI --media help text"
+created: 2026-05-27
+status: "DRAFT round 1c"
+parent_spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
+contract_sections: ["§1.3 (text-detect threshold metric)", "§9 (version cascade)"]
+related_specs:
+  - docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
+  - docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
+related_dogfood:
+  - .omc/reviews/2026-05-27-v0.20-bugfix-dogfood-report.md (Bug #6 + #7)
+---
+
+# v0.20.0 sub-item 1 bugfix round 2 — Identity-H mojibake marker + CLI --media help text
+
+## §1 Problem statement
+
+### §1.1 Bug #6: Identity-H Unimplemented marker bypasses mojibake detection
+
+**Symptom**: `metro-korea.pdf` (58 MB, Identity-H CID font without ToUnicode CMap) ingests with `pdf_ocr_pages=0`. Full text contains `?Identity-H Unimplemented?` marker 1154 times. All 21 pages + 34 chunks are indexable, but content is unusable garbage — repeated marker literal instead of readable text.
+
+**Root cause**: `crates/kebab-parse-pdf/src/text_quality.rs` lines 9-37. The function `compute_valid_char_ratio()` via `is_valid_text_char()` treats ASCII printable range `0x0020..=0x007E` as unconditionally valid. lopdf emits `?Identity-H Unimplemented?` (26 ASCII printable chars) when it cannot decode a CID font lacking ToUnicode CMap. Result: valid_ratio = 1.0 → exceeds OCR fallback threshold 0.5 → text-detect first-pass incorrectly classifies mojibake as valid text → `pdf_ocr_pages` stays 0, no OCR fallback triggered.
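+
+To make the failure mode concrete, here is a minimal sketch. It assumes a simplified validity predicate that only knows the ASCII printable range; `is_ascii_printable` and `naive_ratio` are hypothetical stand-ins, not the real `is_valid_text_char()` / `compute_valid_char_ratio()`, which also accept other character ranges.
+
+```rust
+/// Hypothetical stand-in: treats only ASCII printable (0x20..=0x7E) as valid.
+fn is_ascii_printable(c: char) -> bool {
+    matches!(c, '\u{0020}'..='\u{007E}')
+}
+
+/// Hypothetical stand-in: fraction of chars accepted by the predicate.
+fn naive_ratio(s: &str) -> f32 {
+    let total = s.chars().count();
+    if total == 0 {
+        return 0.0;
+    }
+    let valid = s.chars().filter(|c| is_ascii_printable(*c)).count();
+    valid as f32 / total as f32
+}
+```
+
+Under these assumptions, `naive_ratio("?Identity-H Unimplemented?")` evaluates to 1.0, which is above the 0.5 threshold, so a page made only of the marker would never be routed to OCR.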
+
+**Design intent deviation**: Parent spec §1.3 (line 74) explicitly states "ratio metric judges mojibake page as scanned candidate." PoC example "֥ᬵᯝ₞e ࠦᯱᖝ░" (custom font, no ToUnicode) should trigger OCR. **Implementation gap**: literal ASCII marker case (Identity-H font) was not anticipated.
+
+### §1.2 Bug #7: CLI `--media` help text omits `code` from valid values
+
+**Symptom**: `kebab search --help` lists `--media` accepted values as "markdown, pdf, image, audio, other" — `code` is missing.
+
+**Actual behavior**: `kebab search "main" --media code --json -k 5` returns 5 hits (code/script.sh, code/rust_sample/src/main.rs, etc.). Schema `media_breakdown` includes `code: 6` as first-class. Functional correctness is complete; **CLI doc-comment is outdated only**.
+
+**Root cause**: `crates/kebab-cli/src/main.rs:148-165`. SearchArgs `--media` field clap doc-comment omits `code`. clap's `--help` renderer quotes this doc-comment directly.
+
+---
+
+## §2 Scope + non-scope
+
+### §2.1 Included in this spec
+
+- **Bug #6 fix**: Add known mojibake marker stripping to `compute_valid_char_ratio()`.
+- **Bug #6 test**: Two new unit tests covering the Identity-H marker (marker-dominant and marker-minority mixed-text cases).
+- **Bug #6 regression**: Verify existing 8 text_quality unit tests remain green.
+- **Bug #7 fix**: Update CLI `--media` doc-comment to include `code`.
+- **Bug #7 test**: Assert that `kebab search --help` output contains "`code`" substring.
+- **Traceability**: Link both fixes to parent spec §1.3 design intent.
+
+### §2.2 Explicitly out of scope
+
+**Bug #8 candidate (falsified)**: V007 trigram tokenizer already applied; 2-character query limitation is design-level constraint, not a bug. Handled in prior dogfood report §Bug #8.
+
+**Non-bug observations**:
+- `--readonly + ingest` exit=0: Graceful refusal per CLAUDE.md intent (exit codes 0/1/2/3 unchanged; `error.v1.code` handles agent branching).
+- Ask phrasing-sensitive refusal: RAG corner case; not a code defect.
+- Binary staleness: Environmental artifact, not applicable to spec.
+
+**Ancillary risks**:
+- Scan for other `--media` doc-comment locations (R-4): Plan drafter to use grep; not blockers for this spec.
+- Other lopdf unimplemented markers (R-1): Plan drafter to inspect lopdf source; marker array is extensible.
+
+---
+
+## §3 Decisions
+
+### §3.1 Bug #6: Known mojibake marker stripping
+
+Strip known mojibake marker substrings **before ratio calculation**, then force ratio to 0.0 if remainder is empty after marker removal. When stripped characters exceed remaining characters (marker dominance), cap ratio at 0.3 to trigger OCR fallback on marker-heavy mixed pages.
+
+**Rationale**: lopdf's unimplemented CID font handling consistently emits specific ASCII marker strings. Hardcoding them is lightweight, deterministic, and covers the known failure mode without requiring expensive heuristics (e.g., ML-based gibberish detection). Pages like `metro-korea.pdf` may contain mostly mojibake body text with small valid headers; the marker-dominance check ensures such pages fall below the 0.5 OCR threshold.
+
+**Marker list**: `?Identity-H Unimplemented?` only. lopdf 0.32.0 emits exactly one marker (verified per critic round 1 probe). Extensible if future lopdf versions emit additional markers.
+
+### §3.2 Bug #7: CLI doc-comment update
+
+Add `code` to the comma-separated list of valid `--media` values in the SearchArgs field's clap doc-comment. Single-line edit; no functional or schema changes.
+
+### §3.3 Parent spec traceability
+
+Both fixes uphold parent spec §1.3:
+- Bug #6 ensures mojibake pages (fonts missing a ToUnicode CMap) trigger OCR fallback per design intent.
+- Bug #7 corrects CLI documentation to match actual schema (first-class `code` media type supported since v0.18.0).
+
+No changes to parser_version, chunker_version, or wire schema.
+
+---
+
+## §4 Implementation specification
+
+### §4.1 Bug #6: text_quality.rs diff
+
+**File**: `crates/kebab-parse-pdf/src/text_quality.rs`
+
+**Change**:
+1. Add constant array of known mojibake markers (lines 8–10):
+   ```rust
+   // Source of truth: lopdf-0.32.0/src/document.rs:523 (Document::decode_text).
+   // Only one Unimplemented marker is emitted by lopdf 0.32.0; other CMap
+   // encodings fall through to `String::from_utf8_lossy(bytes)`, which yields
+   // PUA / replacement-char territory already covered by `pure_pua_zero`.
+   // Re-verify on lopdf dependency upgrade.
+   const MOJIBAKE_MARKERS: &[&str] = &[
+       "?Identity-H Unimplemented?",
+   ];
+   ```
+
+2. Refactor `compute_valid_char_ratio()` (lines 39–106):
+   ```rust
+   pub fn compute_valid_char_ratio(s: &str) -> f32 {
+       // 1) Strip known mojibake markers before counting valid chars.
+       //    Identity-H CID fonts without ToUnicode CMap emit ASCII-only marker
+       //    substrings (bypassing PUA detection).
+       let mut cleaned: String = s.to_string();
+       // `had_marker` guard preserves prior behavior for whitespace-only input
+       // (returns ratio of whitespace validity, not 0.0) when no markers found.
+       // With markers stripped, the guard enables the trim-empty check.
+       let mut had_marker = false;
+       for marker in MOJIBAKE_MARKERS {
+           if cleaned.contains(marker) {
+               had_marker = true;
+               cleaned = cleaned.replace(marker, "");
+           }
+       }
+       // 2) Whitespace-only cleaned text → 0.0 (marker-only page).
+       if had_marker && cleaned.trim().is_empty() {
+           return 0.0;
+       }
+       // 3) Marker-dominance heuristic — when stripped chars exceed remaining
+       //    chars (i.e. marker > 50% of original), the page is "mostly mojibake
+       //    with some decodeable page-furniture" (e.g. metro-korea.pdf has
+       //    header text in a separate font + body that is Identity-H CID).
+       //    Force ratio downward to trigger OCR fallback (parent spec §1.3 intent).
+       if had_marker {
+           let remaining_chars = cleaned.chars().count();
+           let stripped_chars = s.chars().count().saturating_sub(remaining_chars);
+           if stripped_chars > remaining_chars {
+               // Marker dominates — cap ratio at 0.3 (below 0.5 OCR threshold).
+               // The 0.3 cap (not 0.0) preserves a small signal that some text
+               // WAS decodeable, useful for downstream metrics if ever exposed.
+               let mut total = 0u32;
+               let mut valid = 0u32;
+               for c in cleaned.chars() {
+                   total += 1;
+                   if is_valid_text_char(c) {
+                       valid += 1;
+                   }
+               }
+               let raw_ratio = if total == 0 { 0.0 } else { valid as f32 / total as f32 };
+               return raw_ratio.min(0.3);
+           }
+       }
+       // 4) Otherwise compute ratio on cleaned text (existing logic).
+       let mut total = 0u32;
+       let mut valid = 0u32;
+       for c in cleaned.chars() {
+           total += 1;
+           if is_valid_text_char(c) {
+               valid += 1;
+           }
+       }
+       if total == 0 {
+           return 0.0;
+       }
+       valid as f32 / total as f32
+   }
+   ```
+
+**Invariants preserved**:
+- Function signature and return type unchanged (→ byte-identical caller surface).
+- Existing character category logic (hangul, CJK, Latin-1) unmodified.
+- Empty-string behavior (return 0.0) preserved.
+
+### §4.2 Bug #6: Unit tests
+
+Replace existing Bug #6 test set with two new tests reflecting marker-dominance heuristic:
+
+```rust
+#[test]
+fn identity_h_marker_dominance_caps_ratio_below_threshold() {
+    // metro-korea.pdf-class: 20× marker (520 char) + 11 char ASCII header.
+    // Without dominance heuristic: ratio = 11/11 = 1.0 (bypasses OCR).
+    // With dominance heuristic: ratio ≤ 0.3 (triggers OCR fallback).
+    let s = format!("Page 1 of 5 {}", "?Identity-H Unimplemented?".repeat(20));
+    let r = compute_valid_char_ratio(&s);
+    assert!(r <= 0.3, "marker-dominant mixed page → ratio ≤ 0.3 (OCR fallback); got {r}");
+}
+
+#[test]
+fn identity_h_marker_minority_with_long_valid_text_keeps_high_ratio() {
+    // Inverse case: short marker noise + long valid text → ratio stays high
+    // (no false OCR trigger on otherwise-good pages).
+    let header = "x".repeat(200);  // 200 char valid ASCII
+    let s = format!("{header} ?Identity-H Unimplemented?");  // 1× marker = 26 char
+    let r = compute_valid_char_ratio(&s);
+    assert!(r > 0.9, "marker-minority page keeps high ratio; got {r}");
+}
+```
+
+**Regression preservation**: Existing 8 tests (`empty_string_zero`, `pure_ascii_one`, `pure_hangul_syllables_one`, `pure_pua_zero`, `mixed_half`, `cjk_ideograph_valid`, `hangul_jamo_valid`, `f4_fixture_ratio_under_threshold`) must all remain green.
+
+### §4.3 Bug #7: CLI doc-comment diff
+
+**File**: `crates/kebab-cli/src/main.rs` (SearchArgs field, lines ~150–160)
+
+**Change**:
+```diff
+-/// p9-fb-36: filter by `assets.media_type` kind. Comma-separated. Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`, `image`, `audio`, `other`. Unknown values match nothing
++/// p9-fb-36: filter by `assets.media_type` kind. Comma-separated. Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`, `image`, `audio`, `code`, `other`. Unknown values match nothing
+```
+
+### §4.3a Bug #7 integration: `integrations/claude-code/kebab/SKILL.md:57` simultaneous update
+
+Per CLAUDE.md §Wire schema v1 invariant — in-tree integration docs must be synchronized when wire surface changes. This round has no wire schema change, but SKILL.md line 57 exhibits the same regression as §4.3 (Bug #7):
+
+**File**: `integrations/claude-code/kebab/SKILL.md` (line 57)
+
+**Change**:
+```diff
+-`media` (string array — IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"other"`; alias `"md"` → `"markdown"`)
++`media` (string array — IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"code"` | `"other"`; alias `"md"` → `"markdown"`)
+```
+
+### §4.4 Bug #7: CLI help assertion
+
+Add test to `crates/kebab-cli/tests/` (or extend existing help snapshot test):
+
+```rust
+#[test]
+fn search_help_lists_code_in_media_values() {
+    let out = std::process::Command::new(env!("CARGO_BIN_EXE_kebab"))
+        .args(["search", "--help"])
+        .output()
+        .expect("kebab search --help");
+    let stdout = String::from_utf8_lossy(&out.stdout);
+    assert!(stdout.contains("`code`"), "search --help must list 'code' as accepted --media value");
+}
+```
+
+### §4.5 Version cascade impact (CLAUDE.md §Versioning cascade)
+
+- **parser_version**: `"pdf-text-v1"` — unchanged. Text-detect threshold is internal metric, not surface.
+- **chunker_version**: `"pdf-page-v1.1"` — unchanged (no chunker logic affected).
+- **wire schema**: No new fields, no schema version bump. `compute_valid_char_ratio()` is internal to `PdfTextExtractor::extract()`.
+
+---
+
+## §5 Acceptance criteria
+
+- [ ] Text_quality unit test: `identity_h_marker_dominance_caps_ratio_below_threshold` passes.
+- [ ] Text_quality unit test: `identity_h_marker_minority_with_long_valid_text_keeps_high_ratio` passes.
+- [ ] Regression: All 8 existing text_quality tests remain green (no ratio behavior changes for valid text).
+- [ ] CLI help assertion: `cargo test search_help_lists_code_in_media_values` passes.
+- [ ] SKILL.md integration: `grep -F '"code"' integrations/claude-code/kebab/SKILL.md` returns ≥1 line.
+- [ ] Full workspace test suite: `cargo test --workspace --no-fail-fast -j 1` green (unit + integration), and `cargo clippy --workspace --all-targets -- -D warnings` clean.
+- [ ] Fresh binary builds: `CARGO_TARGET_DIR=/build/out/cargo-target/target cargo build --release -p kebab-cli` succeeds.
+
+---
+
+## §6 Risks + open questions
+
+### Identified risks
+
+**R-1 — Other lopdf unimplemented markers** (resolved per critic round 1 probe): lopdf 0.32.0 emits exactly one marker — `?Identity-H Unimplemented?` at `lopdf-0.32.0/src/document.rs:523` (`Document::decode_text`). Other CMap encoding arms (`UniCNS`, `UniJIS`, `UniKS`, `GBK-EUC`, `Adobe-*`) fall through to `String::from_utf8_lossy(bytes)` → PUA / replacement-char territory (already covered by `pure_pua_zero` test). Marker array adequacy = OK for current lopdf pin. **Re-verify on lopdf dependency upgrade.**
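+
+The fall-through behavior can be checked with the standard library alone. This is a minimal sketch, assuming the undecodable bytes are not valid UTF-8; the byte values are arbitrary examples rather than data from a real PDF, and both helper names are hypothetical.
+
+```rust
+/// Hypothetical helper: true if the decoded text contains the lopdf marker.
+fn contains_ascii_marker(decoded: &str) -> bool {
+    decoded.contains("?Identity-H Unimplemented?")
+}
+
+/// Decode bytes the way a lossy fall-through would, using only std.
+fn lossy_decode(bytes: &[u8]) -> String {
+    String::from_utf8_lossy(bytes).into_owned()
+}
+```
+
+Under this assumption, `lossy_decode(&[0xB0, 0xA1])` is two U+FFFD replacement characters: nothing in the ASCII printable range, and no marker to strip.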
+
+**R-2 — Whitespace-only edge case after stripping**: Handled by `.trim().is_empty()` check; returns 0.0 as intended.
+
+**R-3 — Version/wire schema impact**: None. text_quality is internal threshold metric, not exposed to wire schema or version cascade.
+
+**R-4: Other `--media` help locations** (revised per critic): the `--media` value list is scattered across 3 surfaces: `crates/kebab-cli/src/main.rs:157–159` (CLI doc-comment, covered by §4.3), `integrations/claude-code/kebab/SKILL.md:57` (skill doc, covered by §4.3a), `crates/kebab-cli/tests/cli_help_smoke.rs` (test, covered by §4.4). Plan drafter to run `grep -rn '\bmedia\b' integrations/ crates/kebab-cli/src docs/wire-schema/v1` to confirm no additional surfaces exist.
+
+**R-5 — Bulk mode media field parsing**: `crates/kebab-app/src/bulk.rs:161` handles media field parsing independently; string doc-comment update does not affect functional correctness.
+
+### Open questions
+
+**OQ-1 — Marker case sensitivity**: Does lopdf always emit markers in exact case `?Identity-H Unimplemented?`? Verify with lopdf source. If case variations exist, use case-insensitive matching or extend array.
+
+**OQ-2 — Marker stripping threshold policy** (resolved via §4.1 marker-dominance heuristic): When stripped characters exceed remaining characters, ratio is capped at 0.3 to trigger OCR fallback. This ensures marker-dominant mixed pages (e.g., 99% marker + 1% valid header) do not bypass OCR despite the header's high ratio. Design intent (parent spec §1.3) is upheld: all mojibake pages trigger OCR fallback.
+
+**OQ-3 — Alias expansion scope**: Bug #7 explicitly omits new aliases (e.g., `src` → `code`). Single additive fix to doc-comment, no enum variant changes.
+
+### UX consequence: `--force-reingest` for pre-bugfix2 v0.20 users
+
+This round preserves version cascade (no `parser_version` bump). The `try_skip_unchanged` path will match files indexed pre-bugfix2 with same `parser_version="pdf-text-v1"` + hash. Pre-indexed `metro-korea.pdf`-class pages will NOT automatically re-route through the corrected text-detect → OCR fallback.
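+
+A rough sketch of why the skip path still matches. The struct and function below are hypothetical illustrations of a skip decision keyed on `parser_version` and content hash; they are not the actual `try_skip_unchanged` signature.
+
+```rust
+/// Hypothetical record of what was stored at the last ingest.
+struct IndexedRecord {
+    parser_version: String,
+    content_hash: String,
+}
+
+/// Hypothetical skip check: skip when neither the parser version nor the
+/// file content changed, unless the caller forces a re-ingest.
+fn should_skip(prev: &IndexedRecord, parser_version: &str, content_hash: &str, force: bool) -> bool {
+    !force && prev.parser_version == parser_version && prev.content_hash == content_hash
+}
+```
+
+Because this round keeps `parser_version = "pdf-text-v1"`, a check of this shape returns true for an unchanged file, and only a forced run re-processes it.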
+
+**User action required**: Explicit `kebab ingest --force-reingest <workspace>` to purge cached skip decisions and re-process affected files.
+
+**Release notes** (v0.20.1 or whichever version ships this bugfix) **MUST include**: "If you indexed mojibake-heavy PDFs (esp. metro-korea.pdf class) on v0.20.0 pre-bugfix2, run `kebab ingest --force-reingest <workspace>` to apply the improved text detection. Otherwise, `ingest` will skip unchanged files and OCR fallback will not trigger." + link to design §9 cascade explanation.
+
+**Documentation updates** (same PR as code): README + HANDOFF + ARCHITECTURE per `feedback_readme_sync_rule` memory — mention the `--force-reingest` step in release highlights or migration notes.
+
+Deliberate design: automatic migration risks wedging stable v0.20.0 KBs. Manual `--force-reingest` is the correct escape hatch (parent spec §1.7 line 126–128 precedent).
+
+---
+
+## §7 References
+
+- **Parent spec**: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md §1.3 (line 74), §1.4, §9
+- **Dogfood evidence**: .omc/reviews/2026-05-27-v0.20-bugfix-dogfood-report.md §Bug #6, §Bug #7
+- **Critic result**: .omc/reviews/2026-05-27-v0.20-bugfix2-spec-critic-r1-result.md (findings H-1 through NIT-2, parent invariant audit)
+- **External source**: lopdf-0.32.0/src/document.rs:523 (`Document::decode_text` — sole emitter of `?Identity-H Unimplemented?` marker)
+- **Code locations**:
+  - text_quality.rs: `crates/kebab-parse-pdf/src/text_quality.rs:9-106`
+  - CLI help: `crates/kebab-cli/src/main.rs:157–159`
+  - Skill integration: `integrations/claude-code/kebab/SKILL.md:57`
+  - CLI test: `crates/kebab-cli/tests/` (search_help_lists_code_in_media_values)
+
+---
+
+**Status**: Round 1c rewrite COMPLETE. All 9 critic findings (H-1 + M-1/M-2/M-3 + L-1/L-2 + NIT-1/NIT-2 + invariant audit) applied in-session.
+
+**Prior round reference**: Round 1 commits (d9acda5, 436fd01, 241ded5, e674ff4) are merged on branch; this round is independent (text_quality.rs vs. source-fs/connector.rs + chunk/pdf_page_v1.rs).
--- a/docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix3-spec.md
+++ b/docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix3-spec.md
@@ -0,0 +1,410 @@
+---
+title: "v0.20.0 sub-item 1 bugfix round 3 — final-dogfood findings"
+created: 2026-05-27
+status: DRAFT
+round: 1c
+parent_spec: 2026-04-27-kebab-final-form-design.md
+contract_sections:
+  - "1.1 (ask streaming)"
+  - "2.2 (error handling)"
+  - "2.4 (JSON wire schema)"
+  - "3.1 (config XDG)"
+  - "4.1 (capabilities schema)"
+source_report: .omc/reviews/2026-05-27-v0.20-final-dogfood-report.md
+---
+
+# v0.20.0 sub-item 1 bugfix round 3 — final-dogfood findings
+
+Fix design for the **5 bugs** found in the post-bugfix2 final dogfood (2026-05-27). PR #189 force-update (base=main). Spec scope: root cause + fix decision + acceptance criteria + parent spec traceability. Bug #12 is falsified (out of scope). All 5 fixes range from trivial to a small refactor (existing 1350 tests + 5 or more added tests).
+
+---
+
+## §1 Problem statement
+
+### §1.1 Bug #9: capabilities false negative (Critical)
+
+In `kebab schema --json`, `capabilities.streaming_ask` and `capabilities.single_file_ingest` are both hardcoded to `false`. The actual implementation, however:
+- `kebab ask --stream` → emits `answer_event.v1` ndjson events correctly (191 events verified).
+- `kebab ingest-file <path>` → produces `ingest_report.v1` correctly for new and updated files.
+- `kebab ingest-stdin --title <T>` → works correctly.
+
+**Impact**: agents such as an MCP host or the Claude Code skill that make routing decisions from `capabilities: { streaming_ask: false, single_file_ingest: false }` get a false negative. Users are misled into thinking that features which actually work are unavailable.
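+
+A minimal sketch of the routing consequence, assuming a hypothetical agent that trusts the flag. The `AskMode` enum and `choose_ask_mode` function are illustrative only and are not part of kebab or any integration.
+
+```rust
+/// Hypothetical subset of the capabilities object an agent might read.
+struct Capabilities {
+    streaming_ask: bool,
+}
+
+#[derive(Debug, PartialEq)]
+enum AskMode {
+    Streaming,
+    Blocking,
+}
+
+/// Hypothetical routing rule: use streaming only when it is advertised.
+fn choose_ask_mode(caps: &Capabilities) -> AskMode {
+    if caps.streaming_ask {
+        AskMode::Streaming
+    } else {
+        AskMode::Blocking
+    }
+}
+```
+
+With the flag reported as `false`, an agent of this shape never picks the streaming path, even though `kebab ask --stream` works.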
+
+### §1.2 Bug #10: config fail-fast (UX)
+
+```bash
+kebab search "rust" --config /tmp/nonexistent.toml --json
+# exit=0, {"hits":[],"schema_version":"search_response.v1"}
+```
+
+When an explicit path is missing, the CLI silently falls back to the default config (XDG path). This is a debugging nightmare: a typo or wrong path surfaces only as 0 hits.
+
+### §1.3 Bug #11: OCR timeout 600s (Critical UX)
+
+`config.pdf.ocr.request_timeout_secs = 600` (a default of 10 min/page). Evidence from the metro-korea.pdf dogfood:
+- On page 8 and page 13, slow responses from the remote Ollama ran into the full 600s timeout.
+- Result: `ms: 600000, chars: 0, skipped: true` is emitted, so the body text is not indexed and 20 minutes are wasted.
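+
+The cost figure follows from simple arithmetic. A small sketch (the function name is hypothetical), assuming each timed-out page blocks for the full timeout:
+
+```rust
+use std::time::Duration;
+
+/// Hypothetical helper: total time lost when `pages` requests each run
+/// into the full `timeout_secs` timeout.
+fn worst_case_wait(pages: u64, timeout_secs: u64) -> Duration {
+    Duration::from_secs(pages * timeout_secs)
+}
+```
+
+Two pages at 600s give 1200s (20 minutes); the same two pages at a 60s timeout would give 120s.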
+
+**Production impact**: the user receives no ingest-completion signal, and some pages are unsearchable.
+
+### §1.4 Bug #13: schema.models single value (UX)
+
+```json
+{
+  "chunker_version": "md-heading-v1",
+  "parser_version": "md-frontmatter-v2",
+  ...
+}
+```
+
+However, several versions are active at once in the corpus:
+- parsers: `md-frontmatter-v2`, `pdf-text-v1`, `code-rust-v1`, `code-python-v1`, `none-v1`.
+- chunkers: `md-heading-v1`, `pdf-page-v1.1`, `code-rust-ast-v1`, `code-python-ast-v1`, `dockerfile-file-v1`, `k8s-manifest-resource-v1`, `manifest-file-v1`, `code-text-paragraph-v1`.
+
+**Impact**: users reading `kebab schema` cannot fully identify the active versions, which risks omissions during a version cascade audit.
+
+### §1.5 Bug #14: empty query silent (Minor UX)
+
+```bash
+kebab search "" --json
+# exit=0, {"hits":[],"next_cursor":null,"schema_version":"search_response.v1"}
+```
+
+An empty (or whitespace-only) query silently returns 0 hits. This is a user mistake, so an explicit error is the consistent behavior.
+
+---
+
+## §2 Scope + non-scope
+
+### §2.1 Included: 5 bug fixes
+
+| Bug | Category | Severity | Fix type |
+|-----|----------|----------|----------|
+| #9 | wire schema | critical | hardcoded `false` capability flags → `true` (match actual features) |
+| #10 | config UX | medium | silent fallback → error.v1 with config_not_found |
+| #11 | OCR config | critical | default 600s → 60s timeout |
+| #13 | wire schema | medium | single field → additive array fields (backward compat) |
+| #14 | input validation | minor | empty query silent → error.v1 with invalid_input |
+
+### §2.2 Out of scope
+
+- **Bug #12 (falsified)**: `inspect doc` shows a "?" placeholder in blocks[].text for code parsers. Root cause: the content is not in `.text`; the `.code` field is emitted correctly. The user workflow can reach it via `.code`, so this is outside the spec.
+- Other axes from dogfood report §12 (ranking bias, multi-root caveat) → separate phase.
+
+---
+
+## §3 Decisions
+
+### §3.1 Bug #9: capabilities correction
+
+**Decision**: update the two fields in `schema.rs::capabilities_snapshot()` to true.
+
+```rust
+fn capabilities_snapshot() -> Capabilities {
+    Capabilities {
+        json_mode: true,
+        ingest_progress: true,
+        ingest_cancellation: true,
+        rag_multi_turn: true,
+        search_cache: true,
+        incremental_ingest: true,
+        streaming_ask: true,        // ← WAS FALSE, actual TRUE
+        http_daemon: false,         // ← preserved (not-impl, separate sub-item)
+        mcp_server: true,
+        single_file_ingest: true,   // ← WAS FALSE, actual TRUE
+        bulk_search: true,
+    }
+}
+```
+
+**Rationale**: the actual implementation supports production-grade streaming ask and single-file ingest. The schema report must match reality for agent routing to be accurate.
+
+### §3.2 Bug #10: config_not_found error
+
+**Decision**: `kebab-config` defines its own error type `ConfigNotFound`, and `kebab-app::error_wire` adds a classify arm for it.
+
+Pseudo-code:
+```rust
+// crates/kebab-config/src/lib.rs (or a suitable error module)
+use std::path::{Path, PathBuf};
+
+#[derive(Debug, thiserror::Error)]
+#[error("config file does not exist: {path}")]
+pub struct ConfigNotFound {
+    pub path: PathBuf,
+}
+
+// Inside `impl Config`:
+pub fn load(opt_path: Option<&Path>) -> anyhow::Result<Config> {
+    match opt_path {
+        // An explicit --config path that does not exist is an error, not a silent fallback.
+        Some(p) if !p.exists() => Err(anyhow::Error::new(ConfigNotFound { path: p.to_path_buf() })),
+        Some(p) => Self::from_file(p),
+        // No --config: keep the XDG default / built-in defaults resolution.
+        None => Self::from_xdg_default_or_defaults(),
+    }
+}
+```
+
+Classify arm in `kebab-app/src/error_wire.rs`:
+```rust
+if let Some(e) = err.downcast_ref::<kebab_config::ConfigNotFound>() {
+    return ErrorV1 {
+        schema_version: ERROR_V1_ID.to_string(),
+        code: "config_not_found".to_string(),
+        message: format!("config file does not exist: {}", e.path.display()),
+        details: json!({ "path": e.path }),
+        hint: Some("verify --config argument; use --config to point to an existing toml file, or omit to use XDG default".to_string()),
+    };
+}
+```
+
+**Exit code**: 2 (config error, not 0 silent).
+
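+The downcast-based classification above can be illustrated with the standard library alone. The following is a minimal sketch, not the project's code: `ConfigNotFound` here is a std-only stand-in for the `thiserror` struct, and `check_explicit_path` and `classify_code` are hypothetical helper names introduced only for illustration. It also shows that `Path::exists()` resolves a relative path against the current working directory.
+
+```rust
+use std::error::Error;
+use std::fmt;
+use std::path::{Path, PathBuf};
+
+/// Std-only stand-in for `kebab_config::ConfigNotFound` (assumption: no thiserror).
+#[derive(Debug)]
+pub struct ConfigNotFound {
+    pub path: PathBuf,
+}
+
+impl fmt::Display for ConfigNotFound {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        write!(f, "config file does not exist: {}", self.path.display())
+    }
+}
+
+impl Error for ConfigNotFound {}
+
+/// Hypothetical helper: reject an explicit path that does not exist.
+/// A relative path is checked against the current working directory.
+pub fn check_explicit_path(path: &Path) -> Result<(), Box<dyn Error>> {
+    if path.exists() {
+        Ok(())
+    } else {
+        Err(Box::new(ConfigNotFound { path: path.to_path_buf() }))
+    }
+}
+
+/// Hypothetical classifier: maps a known error type to its wire error code.
+pub fn classify_code(err: &(dyn Error + 'static)) -> Option<&'static str> {
+    err.downcast_ref::<ConfigNotFound>().map(|_| "config_not_found")
+}
+```
+
+Keeping the typed error in `kebab-config` and the wire mapping in `kebab-app` means the config crate does not need to know about `error.v1`; the classifier recovers the type by downcasting.
+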
+### §3.3 Bug #11: OCR timeout 60s
+
+**Decision**: reduce the return value of `default_pdf_ocr_request_timeout_secs()` from 600 to 60.
+
+```rust
+fn default_pdf_ocr_request_timeout_secs() -> u64 {
+    60  // 1 min, production-friendly per dogfood evidence
+}
+```
+
+**Doc-comment to add**:
+```rust
+/// Default OCR request timeout in seconds. Most pages complete in 6-32s,
+/// so the default sits above the observed range; a request that exceeds
+/// 60s may indicate Ollama unavailability or a very dense/high-res page.
+/// Override via `[pdf.ocr] request_timeout_secs = N` in config.toml.
+```
+
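+How the default and the `[pdf.ocr] request_timeout_secs` override combine can be sketched as follows. This is an illustration under assumptions: `PdfOcrConfig` and its `request_timeout` method are hypothetical names, and the `Option<u64>` field stands in for "key absent from config.toml"; the real config type may be shaped differently.
+
+```rust
+use std::time::Duration;
+
+/// Mirrors the default named in this section.
+fn default_pdf_ocr_request_timeout_secs() -> u64 {
+    60
+}
+
+/// Hypothetical shape of the parsed `[pdf.ocr]` table; `None` means the
+/// key was not set in config.toml.
+pub struct PdfOcrConfig {
+    pub request_timeout_secs: Option<u64>,
+}
+
+impl PdfOcrConfig {
+    /// Effective per-request timeout: the explicit override if present,
+    /// otherwise the 60s default.
+    pub fn request_timeout(&self) -> Duration {
+        Duration::from_secs(
+            self.request_timeout_secs
+                .unwrap_or_else(default_pdf_ocr_request_timeout_secs),
+        )
+    }
+}
+```
+
+A user with very dense pages would set, for example, `request_timeout_secs = 180` and get a 180s timeout; everyone else gets 60s.
+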
+### §3.4 Bug #13: active_parsers + active_chunkers (additive)
+
+**Decision**: additive minor change to the wire schema: add two arrays to the `Models` struct and keep the existing single fields (backward compat). `kebab-store-sqlite` provides the fetch methods.
+
+**Store API** (crates/kebab-store-sqlite/src/lib.rs):
+```rust
+impl SqliteStore {
+    /// SELECT DISTINCT parser_version FROM documents WHERE parser_version IS NOT NULL ORDER BY parser_version
+    pub fn fetch_distinct_parser_versions(&self) -> anyhow::Result<Vec<String>> {
+        let conn = self.conn()?;
+        let mut stmt = conn.prepare(
+            "SELECT DISTINCT parser_version FROM documents
+              WHERE parser_version IS NOT NULL
+              ORDER BY parser_version"
+        )?;
+        let rows = stmt.query_map([], |row| row.get::<_, String>(0))?;
+        let mut out = Vec::new();
+        for r in rows { out.push(r?); }
+        Ok(out)
+    }
+    
+    pub fn fetch_distinct_chunker_versions(&self) -> anyhow::Result<Vec<String>> {
+        let conn = self.conn()?;
+        let mut stmt = conn.prepare(
+            "SELECT DISTINCT chunker_version FROM chunks
+              WHERE chunker_version IS NOT NULL
+              ORDER BY chunker_version"
+        )?;
+        let rows = stmt.query_map([], |row| row.get::<_, String>(0))?;
+        let mut out = Vec::new();
+        for r in rows { out.push(r?); }
+        Ok(out)
+    }
+}
+```
+
+**Models struct** (crates/kebab-app/src/schema.rs):
+```rust
+pub struct Models {
+    /// Deprecated since v0.20.1. Use active_parsers for multi-parser corpus.
+    /// Reports default parser version (markdown path).
+    pub parser_version: String,
+    
+    /// Deprecated since v0.20.1. Use active_chunkers for multi-chunker corpus.
+    pub chunker_version: String,
+    
+    /// All parser versions active in corpus (v0.20.1+). May be empty if corpus is empty.
+    pub active_parsers: Vec<String>,
+    
+    /// All chunker versions active in corpus (v0.20.1+). May be empty if corpus is empty.
+    pub active_chunkers: Vec<String>,
+    
+    pub embedding_version: String,
+    pub prompt_template_version: String,
+    pub index_version: String,
+    pub corpus_revision: u64,
+}
+```
+
+**Computation** (crates/kebab-app/src/schema.rs::collect_models):
+```rust
+let store = open_store_for_stats(cfg)?;
+let active_parsers = store.fetch_distinct_parser_versions().unwrap_or_default();
+let active_chunkers = store.fetch_distinct_chunker_versions().unwrap_or_default();
+
+Ok(Models {
+    parser_version: active_parsers.first().cloned().unwrap_or_else(|| kebab_parse_md::PARSER_VERSION.to_string()),
+    chunker_version: active_chunkers.first().cloned().unwrap_or_else(|| kebab_chunk::md_heading_v1::VERSION_LABEL.to_string()),
+    active_parsers,
+    active_chunkers,
+    ...
+})
+```
+
+**Fallback**: the markdown fallback is kept. The existing hardcoded `parser_version` + `chunker_version` values are preserved (backward compat).
+
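+What the two `SELECT DISTINCT ... ORDER BY` queries compute can be shown with an in-memory analogue. This is only a sketch: `distinct_versions` is a hypothetical function, and the slice of `Option<&str>` stands in for a nullable version column; the real data comes from SQLite.
+
+```rust
+use std::collections::BTreeSet;
+
+/// In-memory analogue of
+/// `SELECT DISTINCT v FROM t WHERE v IS NOT NULL ORDER BY v`.
+/// `rows` stands in for the per-document (or per-chunk) version column.
+pub fn distinct_versions(rows: &[Option<&str>]) -> Vec<String> {
+    rows.iter()
+        .flatten() // skip NULL (None) entries
+        .map(|v| v.to_string())
+        .collect::<BTreeSet<_>>() // deduplicate and sort byte-wise
+        .into_iter()
+        .collect()
+}
+```
+
+An empty input yields an empty `Vec`, which matches the requirement that an empty corpus reports empty `active_parsers` / `active_chunkers` arrays rather than null or an error.
+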
+### §3.5 Bug #14: empty query validation
+
+**Decision**: both the `search` and `ask` commands check for an empty query and emit error.v1.
+
+**Search command** (crates/kebab-cli/src/main.rs::search arm):
+```rust
+if let Some(q) = query.as_ref() {
+    if q.trim().is_empty() {
+        return Err(anyhow::Error::new(kebab_app::StructuredError(ErrorV1 {
+            schema_version: ERROR_V1_ID.to_string(),
+            code: "invalid_input".to_string(),
+            message: "query is empty; provide a non-empty search term or use --bulk".into(),
+            details: Value::Null,
+            hint: Some("e.g. `kebab search 'rust async'` or `kebab search --bulk < queries.ndjson`".into()),
+        })));
+    }
+}
+```
+
+**Ask command** (crates/kebab-cli/src/main.rs::ask arm):
+```rust
+if query.trim().is_empty() {
+    return Err(anyhow::Error::new(kebab_app::StructuredError(ErrorV1 {
+        schema_version: ERROR_V1_ID.to_string(),
+        code: "invalid_input".to_string(),
+        message: "query is empty; provide a non-empty prompt".into(),
+        details: Value::Null,
+        hint: Some("e.g. `kebab ask 'explain this code'`".into()),
+    })));
+}
+```
+
+Both commands now validate; no silent fallback.
+
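+Since both arms apply the same rule, the check could be factored into one function. The sketch below is an assumption, not part of the decided design: `validate_query` is a hypothetical helper, and it returns the bare `invalid_input` code as a string instead of a full `ErrorV1` to stay self-contained.
+
+```rust
+/// Hypothetical helper shared by `search` and `ask`: returns the trimmed
+/// query, or the `invalid_input` code when nothing is left after trimming.
+pub fn validate_query(raw: &str) -> Result<&str, &'static str> {
+    let trimmed = raw.trim();
+    if trimmed.is_empty() {
+        Err("invalid_input")
+    } else {
+        Ok(trimmed)
+    }
+}
+```
+
+Whether the trimmed or the original string is passed on to retrieval is a separate choice; the snippets above only validate and leave the query untouched.
+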
+---
+
+## §4 Implementation specification
+
+### §4.1 Files to modify
+
+1. **Bug #9 capability fix**: `crates/kebab-app/src/schema.rs`
+   - line 137–151: `capabilities_snapshot()` — flip `streaming_ask: false` → `true`, `single_file_ingest: false` → `true`.
+   - add test: `capabilities_streaming_ask_matches_cli_surface()`.
+   - add test: `capabilities_single_file_ingest_matches_cli_surface()`.
+
+2. **Bug #10 config_not_found**: Two files
+   - `crates/kebab-config/src/lib.rs`:
+     - Define `ConfigNotFound` error struct (with `#[derive(Debug, thiserror::Error)]`).
+     - Modify `Config::load(opt_path: Option<&Path>)` — path existence check, `return Err(anyhow::Error::new(ConfigNotFound { ... }))`.
+     - add test: `config_load_explicit_nonexistent_path_returns_error()`.
+   - `crates/kebab-app/src/error_wire.rs`:
+     - Add classify arm after existing `ConfigInvalid` case.
+     - Map `kebab_config::ConfigNotFound` → `ErrorV1 { code: "config_not_found", ... }`.
+
+3. **Bug #13 schema.models**: Three components
+   - `crates/kebab-store-sqlite/src/lib.rs`:
+     - Implement `fetch_distinct_parser_versions()` — SQL SELECT DISTINCT on documents.parser_version + ORDER BY.
+     - Implement `fetch_distinct_chunker_versions()` — SQL SELECT DISTINCT on chunks.chunker_version + ORDER BY.
+   - `crates/kebab-app/src/schema.rs`:
+     - Modify `Models` struct — add `active_parsers: Vec<String>`, `active_chunkers: Vec<String>` fields.
+     - Modify computation logic (`collect_models` or equiv) — call store methods, populate arrays, fallback to markdown defaults for single fields.
+     - add test: `schema_models_active_arrays_empty_on_empty_corpus()`.
+     - add test: `schema_models_active_arrays_populated_after_mixed_ingest()`.
+   - `docs/wire-schema/v1/schema.schema.json`:
+     - `Models` object — add `"active_parsers": { "type": "array", "items": { "type": "string" } }`.
+     - add `"active_chunkers": { "type": "array", "items": { "type": "string" } }`.
+     - Mark deprecated in comment: `parser_version` + `chunker_version` (additive, backward compat).
+
+4. **Bug #14 empty query validation**: `crates/kebab-cli/src/main.rs`
+   - search command arm: when a query argument is present, add `q.trim().is_empty()` check → error.v1 code=invalid_input (see §3.5).
+   - ask command arm: add identical `if query.trim().is_empty()` check → error.v1 code=invalid_input.
+
+5. **Wire schema v1 doc update**: `docs/wire-schema/v1/`
+   - Update schema doc to note `active_parsers` / `active_chunkers` optional (additive).
+
+6. **Integration**: `integrations/claude-code/kebab/SKILL.md`
+   - Update `schema.models` surface docs — reference new `active_*` arrays for multi-version corpora.
+
+7. **Tests** (new or extended):
+   - `crates/kebab-cli/tests/`: invalid --config path (absolute + relative) → error.v1 + exit≠0.
+   - `crates/kebab-cli/tests/`: empty query (search + ask) → error.v1 code=invalid_input + exit≠0.
+   - `crates/kebab-config/tests/`: config file not found → ConfigNotFound error.
+   - `crates/kebab-app/tests/`: mixed corpus schema — active_parsers/chunkers include all ingested versions.
+
+8. **Bug #11 OCR timeout default**: `crates/kebab-config/src/lib.rs`
+   - `default_pdf_ocr_request_timeout_secs()`: return 60 instead of 600, with the doc-comment from §3.3.
+   - add test: `pdf_ocr_request_timeout_default_is_60s()` (AC-3).
+
+### §4.2 Regression checks
+
+- Existing 1350 workspace tests: `cargo test --workspace --no-fail-fast -j 1` must pass green.
+- All non-bug capabilities (json_mode, ingest_progress, ingest_cancellation, rag_multi_turn, search_cache, incremental_ingest, mcp_server, bulk_search) stay true.
+- Default config path resolution (no --config) unchanged — silent fallback to XDG only if `--config` not passed.
+- Relative path behavior (cwd-relative, Rust std path::Path::exists()) preserved.
+- Empty corpus → empty `active_parsers` / `active_chunkers` array (not null, not error).
+- Existing hardcoded `parser_version` + `chunker_version` fields continue to report markdown defaults (backward compat).
+- Schema version bump not required (wire schema additive minor, backward compat).
+
+---
+
+## §5 Acceptance criteria
+
+| # | Criterion | Evidence |
+|----|-----------|----------|
+| AC-1 | `kebab schema --json` emits `streaming_ask: true` + `single_file_ingest: true` | `cargo test -p kebab-app capabilities_ -j 4` green (the filter is a substring match, not a glob) |
+| AC-2 | `kebab search "x" --config /nonexistent.toml --json` emits exit≠0 + error.v1 code=config_not_found | `cargo test -p kebab-config config_load_explicit_nonexistent_path_returns_error -j 4` green |
+| AC-3 | `cargo test -p kebab-config pdf_ocr_request_timeout_default_is_60s -j 4` → green | unit test confirms default = 60s (no manual timing) |
+| AC-4 | After mixed ingest (MD + PDF + code), `kebab schema --json` emits both `active_parsers` + `active_chunkers` arrays containing all versions | integration test pass |
+| AC-5 | `kebab search "" --json` and `kebab search "  " --json` both emit exit≠0 + error.v1 code=invalid_input | integration test pass |
+| AC-6 | `kebab ask "" --json` emits exit≠0 + error.v1 code=invalid_input (ask symmetry) | integration test pass |
+| AC-7 | `kebab search "rust" --config nonexistent-relative.toml --json` (relative path) emits exit≠0 + error.v1 code=config_not_found | integration test pass |
+| AC-8 | All 1350+ workspace tests pass; no new failures | `cargo test --workspace --no-fail-fast -j 1` exit=0 |
+| AC-9 | Wire schema backward compat: old clients reading `parser_version` + `chunker_version` still work; `active_*` arrays optional per schema | JSON schema `additionalProperties: false` review |
+| AC-10 | `kebab ask --stream` still works; streaming events emitted (no regression) | manual `kebab ask --stream "explain this" 2>&1 \| head -3` |
+
+---
+
+## §6 Risks + resolutions
+
+### Risks
+
+- **R-1** (Bug #10): Relative path `./config.toml` must resolve from cwd, not from binary location. **Resolution**: Rust `std::path::Path::exists()` is cwd-relative; no workaround needed.
+- **R-2** (Bug #13): Empty corpus → empty `active_parsers` / `active_chunkers` array. **Resolution**: Unit test `schema_models_active_arrays_empty_on_empty_corpus()` mandated (§4.2 regression check).
+- **R-3** (resolved): `collect_models` uses no cache (every-call re-computation). `active_parsers/chunkers` reflect corpus state at invocation time. If future caching is added, `corpus_revision` increment signals invalidation — document at that time.
+- **R-4** (Bug #14): `ask` command validation — covered by same fix (§3.5 mandates both search + ask).
+- **R-5** (Bug #11): 60s may still timeout on very dense/high-res pages. **Mitigation**: User can override via `config.toml [pdf.ocr] request_timeout_secs = N`. Release notes explicitly call this out.
+
+---
+
+## §7 Parent spec deviation (HOTFIXES handoff)
+
+**F-11 MEDIUM finding**: parent spec `2026-04-27-kebab-final-form-design.md` (frozen) specifies PDF OCR request_timeout_secs = 600s (§1000 + §1628 OQ-1, rationale: "5x headroom over the 105s observed in a CPU environment"). Bug #11 (dogfood evidence) contradicts this: 600s causes timeouts, while 60s is production-optimal.
+
+**Deviation handling**:
+1. Parent spec stays frozen (no edits).
+2. **HOTFIXES entry (executor Step N)**: `tasks/HOTFIXES.md` receives dated entry:
+   ```markdown
+   2026-05-27 — PDF OCR request_timeout_secs default 600s → 60s (v0.20.0 bugfix3 dogfood evidence). Bug #11.
+   ```
+3. **Parent spec cross-link (executor Step N)**: parent spec `2026-04-27-kebab-final-form-design.md` receives inline comment at §1000 (default value code block) or §1628 (OQ-1 paragraph):
+   ```markdown
+   <!-- HOTFIX 2026-05-27: default 60s (Bug #11). See tasks/HOTFIXES.md 2026-05-27 entry. -->
+   ```
+
+**Parent spec invariant**: No changes to parent spec text; only cross-link comment + HOTFIXES.md entry. Frozen design contract preserved.
+
+---
+
+## §8 References
+
+- [Dogfood report](../../../.omc/reviews/2026-05-27-v0.20-final-dogfood-report.md) — 5 bugs discovered + decisions.
+- [Parent spec (frozen contract)](2026-04-27-kebab-final-form-design.md) — §1, §2, §4 (capabilities, error handling, JSON schema, config XDG).
+- `crates/kebab-app/src/schema.rs:137–151` (capabilities_snapshot).
+- `crates/kebab-config/src/lib.rs` (Config::load, default_pdf_ocr_request_timeout_secs).
+- `crates/kebab-app/src/error_wire.rs` (classify ConfigNotFound).
+- `crates/kebab-store-sqlite/src/lib.rs` (fetch_distinct_parser_versions, fetch_distinct_chunker_versions).
+- `crates/kebab-cli/src/main.rs` (search + ask query validation).
+- `docs/wire-schema/v1/schema.schema.json` (Models + Capabilities objects).
+- `tasks/HOTFIXES.md` (2026-05-27 entry, Bug #11 deviation record).