docs(superpowers): v0.20 sub-item 1 bugfix1/2/3 specs + plans + DOGFOOD.md

Artifacts of a 3-round dogfood-driven fix cycle:

- bugfix1 (Bug #2/#3/#4): spec 964 lines + plan 848 lines
- bugfix2 (Bug #6/#7, #8 falsified): spec 308 lines + plan 388 lines
- bugfix3 (Bug #9/#10/#11/#13/#14, #12 falsified): spec 410 lines + plan 1043 lines
- docs/DOGFOOD.md: the complete all-round dogfood checklist (§0 environment through §13 reference corpus)

Each round's spec/plan was frozen after critic + verifier round 2 closure ACCEPT, based on dogfood-driven evidence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 01:21:34 +00:00
parent 9b44e27dfe
commit 46e99470eb
7 changed files with 4794 additions and 0 deletions


@@ -0,0 +1,849 @@
---
title: v0.20.0 sub-item 1 bugfix — implementation plan
created: 2026-05-27
status: ACCEPT (round 2 closure — Phase B complete)
target_version: 0.20.0 (PR #189 force-update)
spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md
contract_sections: ["§9 (chunker_version cascade)"]
parent_plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md
review_history:
- 2026-05-27 plan round 0 (opus, drafter) — 5 step group A-E, 18 sub-action
- 2026-05-27 plan round 1 critic (opus, thorough) — NEEDS_DISCUSSION, HIGH 1 + MEDIUM 2 + LOW 3 + NIT 1 (7 finding)
- 2026-05-27 plan round 1 verifier (opus, thorough) — NEEDS_DISCUSSION, HIGH 4 + MEDIUM 3 + LOW 4 + NIT 3 (14 finding)
- 2026-05-27 plan round 1c rewrite (opus, drafter) — all 21 findings applied (critic 7 + verifier 14); details in §8 round 1c rewrite changelog
- 2026-05-27 plan round 2 closure critic (opus) — ACCEPT, 21/21 applied + 4 NIT cosmetic
---
# v0.20.0 sub-item 1 bugfix plan
> Step decomposition of the ACCEPTed spec (`docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md`, 965 lines, round 2 closure). It covers the force-update path for 3 bugs (#2 walker code limit / #3 chunk_id collision, Critical / #4 F4 fixture Pages tree): fix commits are stacked on the PR #189 base branch `feat/pdf-scanned-ocr`. **5 steps (Group A-E), 18 sub-actions, 4 commits + 1 verify-only step (= 5 steps total, 4 commit boundaries).** The 16-row acceptance table in spec §6 maps 1:1 onto the §4 verifier checklist of this plan.
## §0 Pre-flight + branch state
- **Branch**: `feat/pdf-scanned-ocr` (PR #189 base, HEAD = `b4d9e60` "chore(release): bump version 0.19.0 → 0.20.0"). Per user memory `feedback_pr_workflow`, this takes the force-update path: add fix commits on the same branch, then force-push PR #189.
- **Working dir**: `/home/altair823/kebab`.
- **Required env** (`~/.claude/CLAUDE.md`, "Disk Layout: protecting the root disk is the top priority"):
- `export CARGO_TARGET_DIR=/build/out/cargo-target/target`: isolates builds on the dedicated 4 TB XFS disk and prevents a `target/` directory from being created in the repo root.
- `export RELEASE_BIN="${CARGO_TARGET_DIR:-target}/release/kebab"`: release binary alias, used by every acceptance command in the Step 5 dogfood.
- `export TMPDIR=/build/cache/tmp`: keeps large temporary files off the root disk.
- **Serialized cargo builds** (memory `feedback_serial_build_only`):
- per-crate: `-j 4` by default (e.g. `cargo test -p kebab-chunk -j 4`).
- workspace: `-j 1` is mandatory (`cargo test --workspace -j 1`, `cargo clippy --workspace -j 1`), because linking 18 integration-test binaries concurrently causes OOM.
- **`target/` clean policy** (memory `feedback_cargo_clean_policy`): `/build` is a separate 4 TB XFS disk, so routine cleans are forbidden. Clean only when `df -h /build` shows `Avail < 500G` OR `du -sh $CARGO_TARGET_DIR` exceeds 200G. Run this conditional check once, right before the first cargo invocation in Step 5 E1; if neither threshold is hit, skip the clean and record one line in the commit body: "skipped cargo clean — /build avail X TB".
- **Assumed dogfood KB layout** (Step 5 E3 prerequisite, critic round 1 H-1 closure): the canonical config path is `/build/cache/tmp/v0.20-dogfood/config.toml` (in-place, inside the KB dir). The external backup file `/build/cache/tmp/v0.20-dogfood-config.toml` **does not exist**; every acceptance command in this plan assumes the in-place config. The KB clean in Step 5 E3 must **not use a destructive `rm -rf`**; use the config-preserving selective clean instead (see the E3 details). The canonical dogfood config path is defined only here in §0, and the Step 5 E3 commands reference this path.
- **Impact on HOTFIXES.md / README / HANDOFF / ARCHITECTURE**: Step 2 B4 adds a HOTFIXES.md entry (Bug #3 second-iteration patch). There are no other changes to user-visible surfaces, so README, HANDOFF, and ARCHITECTURE need no updates (no CLI flag, wire schema, TUI key, or config additions; the chunker_version bump is an internal cascade and appears only in release notes).
- **No wire schema changes**: no fields added to `ingest_progress.v1` or `ingest_report.v1`. No V00X migration. `chunks` table DDL unchanged.
- **No frozen design contract changes**: the design §9 cascade rule itself is unchanged (only chunker_version is bumped, as a direct application of the rule).
- **No workspace version bump**: v0.20.0 is already cut (commit `b4d9e60`). This plan is a cumulative bugfix within the same v0.20.0 (PR #189 force-update). Step 5 E5 only force-pushes the PR; the release tag is not re-cut.
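The `target/` clean policy in this section reduces to a two-threshold check. The following is a minimal sketch, assuming both sizes have already been read (for example from `df` and `du`) and converted to whole gigabytes. `should_clean`, its parameters, and the constant names are hypothetical and not part of the kebab codebase.

```rust
/// Thresholds quoted from the clean policy: clean only when free space on
/// /build drops below 500G or the cargo target dir grows past 200G.
const MIN_AVAIL_GB: u64 = 500;
const MAX_TARGET_DIR_GB: u64 = 200;

/// Hypothetical helper: returns true when `cargo clean` is warranted.
/// Both inputs are assumed to be whole gigabytes.
pub fn should_clean(avail_gb: u64, target_dir_gb: u64) -> bool {
    avail_gb < MIN_AVAIL_GB || target_dir_gb > MAX_TARGET_DIR_GB
}
```

With roughly 3 TB free and a 120G target dir, the check returns `false`, which corresponds to the "skipped cargo clean" record in the commit body.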
## §1 Plan overview + spec linkage
The fix designs in spec §3 (Bug #2), §4 (Bug #3), and §5 (Bug #4) are decomposed into atomic steps. Key sequencing:
1. **Bug #2 walker code limit fix** (Step 1): `is_code_file` helper + walker conditional + unit test. Applies the diffs from spec §3.4 + §3.5 as-is. 1 commit.
2. **Bug #3 chunk_id fix + chunker_version bump** (Step 2): extend the `chunk_page` return tuple to a 4-tuple + switch the caller's `per_chunk_hash` suffix to `segment_start` + bump `VERSION_LABEL` from `"pdf-page-v1"` to `"pdf-page-v1.1"` + update the module doc + add a HOTFIXES.md entry + add a unit regression test. Applies the diffs from spec §4.4 + §4.4.1 + §4.5 as-is. 1 commit.
3. **Bug #3 integration test** (Step 3): a multi-scanned-PDF chunk_id collision-free integration test under `crates/kebab-app/tests/`. The executor's first sub-action is the MockOcrEngine pre-condition decision from spec §4.5.1 (Option A share, or Option B inline). 1 commit.
4. **Bug #4 F4 fixture re-generation** (Step 4): pikepdf-based rewrite of `tests/fixtures/_synth/mojibake.py` + regenerate the F4 fixture binary + 3 new invariant tests in parse-pdf. Applies the diffs from spec §5.4 + §5.5 + §5.6 as-is. 1 commit.
5. **Workspace verify + commit + PR force-push** (Step 5): cargo workspace test with `-j 1` + clippy `-D warnings` + dogfood re-run (isolated KB at `/build/cache/tmp/v0.20-dogfood`, qwen2.5vl:3b on the Ollama endpoint `192.168.0.47:11434`) + PR #189 force-push. The 16-row consolidated acceptance table in spec §6 is the verifier checklist for this step.
Ordering invariants:
- **Step 1 || Step 2 || Step 4 are mutually independent**: the 3 fixes are confined to file paths in different crates (`kebab-source-fs` / `kebab-chunk` / `tests/fixtures` + `kebab-parse-pdf`), so they could proceed in parallel. Consistency takes priority, so they run sequentially: Step 1 → Step 2 → Step 4.
- **Step 2 < Step 3**: the integration test depends on the fixed chunk_id computation path in `kebab-chunk`. A GREEN Step 2 is the prerequisite.
- **Step 4 < Step 5 dogfood**: the binary produced by the F4 fixture regeneration is 1 of the 9 dogfood PDFs (mojibake), so it is a prerequisite for verifying the `block_count: 1` invariant in the Step 5 E3 dogfood.
- **Steps 1-4 all < Step 5 workspace test**: a full-workspace test is meaningful only against the final state of the production code and tests.
Each logical group is 1 atomic commit, following the 5-commit table in the §7 sequencing summary. Per user memory `feedback_pr_workflow` (gitea-pr + review loop), after the force-update the plan enters the review loop of the `gitea-pr-review` skill.
---
## §2 Step group structure (Group A-E)
| Step | Group | Category | Sub-actions |
|---:|---|---|---|
| 1 | A | Bug #2 walker code limit fix | A1 `is_code_file` helper + A2 walker conditional + A3 unit test |
| 2 | B | Bug #3 chunk_id collision fix + chunker_version bump | B1 `chunk_page` 4-tuple + B2 caller `per_chunk_hash` + B3 `VERSION_LABEL` bump + B4 module doc + HOTFIXES.md + B5 unit regression test |
| 3 | C | Bug #3 multi-scanned PDF integration test | C1 MockOcrEngine share decision + C2 integration test (conditional) |
| 4 | D | Bug #4 F4 fixture re-generation | D1 mojibake.py pikepdf rewrite + D2 fixture regenerate + commit + D3 parse-pdf 3 invariant test |
| 5 | E | Workspace verify + commit + PR force-push | E1 cargo workspace test -j 1 + E2 clippy -D warnings + E3 dogfood re-run + E4 commit + E5 PR #189 force-push |
---
## §3 Per-step detail
### Step 1 (Group A): Bug #2 walker code limit fix
Option A from spec §3 (code path only): gate the `is_oversized` call behind an `is_code_file(path)` conditional. Size limits for PDF/image/markdown are enforced by the parser stage itself (lopdf `load_mem` handles 256 KB+ normally; image OCR self-caps via max_pixels).
#### Sub-action A1 — add the `is_code_file` helper
- **Files affected**: `crates/kebab-source-fs/src/code_meta.rs` (right after the `is_oversized` function at line 129, or right after the `code_lang_for_path` definition).
- **Action** (spec §3.4 diff as-is):
```rust
/// Returns true when `path`'s filename/extension is recognised as a code
/// file (per `code_lang_for_path`). Used by the walker to apply
/// `[ingest.code].max_file_bytes` / `max_file_lines` only to code files,
/// not to PDF/image/markdown (which have their own size controls in
/// their respective parsers).
pub(crate) fn is_code_file(path: &Path) -> bool {
    code_lang_for_path(path).is_some()
}
```
- **Acceptance**:
- `grep -c "pub(crate) fn is_code_file" crates/kebab-source-fs/src/code_meta.rs` = **1**.
- `cargo build -p kebab-source-fs -j 4` green.
#### Sub-action A2 — walker conditional size check
- **Files affected**: `crates/kebab-source-fs/src/connector.rs:168-190` (currently verified line range).
- **Action** (spec §3.4 diff as-is: an `is_code_file` short-circuit in front of the `is_oversized` call):
```diff
-    // Size-cap check (byte or line limit).
-    if crate::code_meta::is_oversized(
-        &abs_path,
-        self.max_file_bytes,
-        self.max_file_lines,
-    )
-    .unwrap_or(false)
+    // v0.20.0 sub-item 1 bugfix (#2): size-cap applies ONLY to
+    // code files. PDF/image/markdown bypass — their parsers
+    // have their own size controls. spec §3.3.
+    if crate::code_meta::is_code_file(&abs_path)
+        && crate::code_meta::is_oversized(
+            &abs_path,
+            self.max_file_bytes,
+            self.max_file_lines,
+        )
+        .unwrap_or(false)
     {
         fs_skips.skipped_size_exceeded =
             fs_skips.skipped_size_exceeded.saturating_add(1);
         ...
         tracing::debug!(
             path = %rel_path.display(),
             max_bytes = self.max_file_bytes,
             max_lines = self.max_file_lines,
-            "skip: file exceeds size cap"
+            "skip: code file exceeds size cap"
         );
         continue;
     }
```
- **Acceptance**:
- `grep -nE "is_code_file\(&abs_path\)\s*$" crates/kebab-source-fs/src/connector.rs` ≥ **1**.
- `grep -c "skip: code file exceeds size cap" crates/kebab-source-fs/src/connector.rs` ≥ **1**.
- `cargo build -p kebab-source-fs -j 4` green.
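The gate introduced in A2 can be illustrated in isolation. The sketch below is a simplified model, not the code in `connector.rs`: `code_lang_for_path` is replaced by a two-extension stand-in (the real table is not shown in this plan), `is_oversized` is reduced to a byte comparison (the real function also checks line count and returns a fallible value), and `skip_for_size` is a hypothetical name.

```rust
use std::path::Path;

/// Stand-in for the real `code_lang_for_path`; only two extensions are
/// modelled here, purely for illustration.
fn code_lang_for_path(path: &Path) -> Option<&'static str> {
    match path.extension()?.to_str()? {
        "rs" => Some("rust"),
        "py" => Some("python"),
        _ => None,
    }
}

fn is_code_file(path: &Path) -> bool {
    code_lang_for_path(path).is_some()
}

/// Byte-only simplification of the size cap.
fn is_oversized(len_bytes: u64, max_file_bytes: u64) -> bool {
    len_bytes > max_file_bytes
}

/// Hypothetical helper: true when the walker should skip the file.
/// Non-code files short-circuit before the size check is consulted.
pub fn skip_for_size(path: &Path, len_bytes: u64, max_file_bytes: u64) -> bool {
    is_code_file(path) && is_oversized(len_bytes, max_file_bytes)
}
```

Under this model a 300 KB `paper.pdf` or `notes.md` passes the walker, while a 300 KB `big.rs` is skipped against the 262,144-byte cap, which is the behaviour the A3 unit test pins.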
#### Sub-action A3 — add the Bug #2 unit test
- **Files affected**: the existing `#[cfg(test)] mod tests` in `crates/kebab-source-fs/src/connector.rs` (spec §3.5 explicitly says "add to the existing test module", not a new file).
- **Action** (body of the `size_cap_skips_only_code_files` test from spec §3.5, as-is):
- Synthesize 3 files in a tempdir: a 300 KB PDF, a 300 KB markdown file, and a 300 KB `big.rs`.
- Call `scan_with_skips(&SourceScope::default())` on a `FsSourceConnector` (`max_file_bytes = 262_144`, `max_file_lines = 5_000`).
- assertions:
- `paths.contains("paper.pdf")` (the PDF passes the walker).
- `paths.contains("notes.md")` (the Markdown file passes the walker).
- `!paths.contains("big.rs")` (the code file is skipped by the walker).
- `skips.skip_examples.size_exceeded` contains 1 entry for `big.rs` and 0 entries for `paper.pdf`.
- cfg helper: reuse the existing test module's `cfg_with_size_cap(root, max_bytes, max_lines)` pattern (add the helper if needed).
- **Acceptance**:
- `cargo test -p kebab-source-fs size_cap_skips_only_code_files -j 4` green.
- No regression in the existing `ingest_report_counts_oversized_files_by_bytes` (fixture `huge.rs`) and `ingest_report_size_cap_by_line_count` (fixture `longfile.rs`): both fixtures are named `.rs`, so they still pass through the new conditional (invariant preserved).
- `cargo test -p kebab-source-fs -j 4` fully green.
#### Commit (all of Step 1)
```
fix(source-fs): apply size limit only to code files; PDF/image/markdown bypass walker cap (Bug #2)
- crates/kebab-source-fs/src/code_meta.rs: add pub(crate) fn is_code_file
- crates/kebab-source-fs/src/connector.rs: walker conditional `is_code_file && is_oversized`
- crates/kebab-source-fs/src/connector.rs mod tests: size_cap_skips_only_code_files unit test
- spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md §3
```
---
### Step 2 (Group B): Bug #3 chunk_id collision fix + chunker_version bump
Option A from spec §4.3 (use the segment boundary `start` as the `per_chunk_hash` suffix). Extend the `chunk_page` return tuple from the 3-tuple `(actual_start, chunk_end, slice)` to the 4-tuple `(segment_start, actual_start, chunk_end, slice)` and switch the caller's `per_chunk_hash` suffix to `segment_start`. Bump `VERSION_LABEL` from `"pdf-page-v1"` to `"pdf-page-v1.1"` (spec §4.4.1 round 1c M-1 decision: an explicit cascade audit trail).
#### Sub-action B1 — `chunk_page` 4-tuple expansion
- **Files affected**: `crates/kebab-chunk/src/pdf_page_v1.rs:200-204` (doc comment) + `:205-289` (signature at line 205 through the closing `}` at line 289). Line ranges corrected from the actual probes in critic round 1 + verifier round 1 (L-1).
- **Action** (spec §4.4 diff as-is):
- Update the doc comment: `(char_start, char_end, text_slice)` → `(segment_start, actual_start, chunk_end, text_slice)`:
```rust
/// Split a single page's text into ordered chunks, each represented as
/// `(segment_start, actual_start, chunk_end, text_slice)`.
///
/// - `segment_start` = pre-overlap segment boundary. Strictly increasing
/// across the returned vec. Use this for chunk_id uniqueness suffixes.
/// - `actual_start` = post-overlap start char index. May collapse to a
/// previous chunk's `actual_start` under aggressive overlap policy.
/// Use this for `SourceSpan::Page::char_start`.
/// - `chunk_end` = chunk's end char index (exclusive).
fn chunk_page(text: &str, target_bytes: usize, overlap_bytes: usize)
-> Vec<(usize, usize, usize, String)>
```
- early return: `vec![(0, n, text.to_string())]` → `vec![(0, 0, n, text.to_string())]`.
- `push` in the loop body: `chunks.push((actual_start, chunk_end, slice))` → `chunks.push((start, actual_start, chunk_end, slice))`. (`start = bounds[seg_idx]` already exists as a local variable at line 245.)
- The overlap walk's `let prev_min = prev.0` reads the first field of the old tuple; in the post-fix tuple shape that value is `prev.1` (actual_start), so change it to preserve the spec §4.4 invariant:
```diff
-    let actual_start = if let Some(prev) = chunks.last() {
-        let prev_min = prev.0;
+    let actual_start = if let Some(prev) = chunks.last() {
+        // prev tuple shape = (segment_start, actual_start, chunk_end, slice).
+        // overlap walk floor = previous chunk's actual_start (prev.1).
+        let prev_min = prev.1;
         ...
```
- **Acceptance**:
- `grep -nE "fn chunk_page.*-> Vec<\(usize, usize, usize, String\)>" crates/kebab-chunk/src/pdf_page_v1.rs` = **1**.
- `grep -c "let prev_min = prev.1" crates/kebab-chunk/src/pdf_page_v1.rs` ≥ **1**.
- `cargo build -p kebab-chunk -j 4` green (the red state clears once the caller change in sub-action B2 is applied together).
#### Sub-action B2 — caller `per_chunk_hash` suffix → `segment_start`
- **Files affected**: `crates/kebab-chunk/src/pdf_page_v1.rs:149-186` (currently verified: the `for (...) in chunk_page(...)` loop in the `chunk` method starts at line 149 and ends at line 186; corrected per verifier round 1 L-2).
- **Action** (spec §4.4 diff as-is):
```diff
-    for (char_start, char_end, slice) in
-        chunk_page(&p.text, target_bytes, overlap_bytes)
+    for (segment_start, char_start, char_end, slice) in
+        chunk_page(&p.text, target_bytes, overlap_bytes)
     {
         ...
         let span = SourceSpan::Page {
             page: page_num,
             char_start: Some(char_start_u32),
             char_end: Some(char_end_u32),
         };
         let block_ids: Vec<BlockId> = vec![p.common.block_id.clone()];
-        // Per-chunk policy_hash variant prevents chunk_id
-        // collision when a page produces multiple chunks. See
-        // module docs for rationale.
-        let per_chunk_hash = format!("{base_policy_hash}#c{char_start}");
+        // v0.20.0 sub-item 1 bugfix (#3): per-chunk policy_hash
+        // variant uses `segment_start` (pre-overlap boundary,
+        // strictly increasing) instead of `char_start` (post-
+        // overlap, may collapse to prev_min). See module docs +
+        // spec §4.1 root cause + HOTFIXES.md 2026-05-27.
+        let per_chunk_hash = format!("{base_policy_hash}#c{segment_start}");
         let chunk_id =
             id_for_chunk(&doc.doc_id, &chunker_version, &block_ids, &per_chunk_hash);
         ...
     }
```
- `SourceSpan::Page.char_start` still keeps the post-overlap `char_start` (= `actual_start`), preserving citation locality semantics.
- **Acceptance** (verifier round 1 M-2: B2 and B4 land in the same logical commit, so the greps are evaluated at Step 2 commit time, i.e. post-B4):
- `grep -c "#c{segment_start}" crates/kebab-chunk/src/pdf_page_v1.rs` ≥ **1** (with B2 alone = 1 call site; after the B4 module doc change = 2, and the B4 acceptance checks ≥ 2).
- `grep -c "#c{char_start}" crates/kebab-chunk/src/pdf_page_v1.rs` = **0** (both the call site and the module doc are switched to segment_start; this is the consolidated same-commit invariant of B2+B4).
- When verifying sub-action by sub-action, the B2-only grep for `#c{char_start}` is still ≥ 1 because the literal remains in the module doc at line 56; it settles at 0 once the Step 2 commit boundary is reached.
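The collapse that B1 and B2 address can be reproduced with a toy model. The sketch below is not the real `chunk_page`: it works on precomputed char boundaries, walks back by a fixed char count, and ignores byte and sentence handling. `toy_chunk_bounds`, `old_suffixes`, and `new_suffixes` are hypothetical names used only for illustration.

```rust
/// Toy model of the overlap walk. `bounds` are segment boundaries in
/// chars; each chunk is `(segment_start, actual_start, chunk_end)`.
pub fn toy_chunk_bounds(bounds: &[usize], overlap_chars: usize) -> Vec<(usize, usize, usize)> {
    let mut chunks: Vec<(usize, usize, usize)> = Vec::new();
    for pair in bounds.windows(2) {
        let (start, end) = (pair[0], pair[1]);
        let actual_start = match chunks.last() {
            // Walk back by the overlap, but never below the previous
            // chunk's actual_start (the `prev_min` floor).
            Some(prev) => start.saturating_sub(overlap_chars).max(prev.1),
            None => start,
        };
        chunks.push((start, actual_start, end));
    }
    chunks
}

/// Pre-fix suffix: derived from the post-overlap `actual_start`.
pub fn old_suffixes(chunks: &[(usize, usize, usize)]) -> Vec<String> {
    chunks.iter().map(|c| format!("#c{}", c.1)).collect()
}

/// Post-fix suffix: derived from the pre-overlap `segment_start`.
pub fn new_suffixes(chunks: &[(usize, usize, usize)]) -> Vec<String> {
    chunks.iter().map(|c| format!("#c{}", c.0)).collect()
}
```

With boundaries `[0, 12, 512]` and an 80-char overlap, the second chunk walks back to 0, so the pre-fix suffixes are both `#c0`, while the post-fix suffixes are `#c0` and `#c12`.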
#### Sub-action B3 — bump `VERSION_LABEL` `"pdf-page-v1"` → `"pdf-page-v1.1"` + update 2 hardcoded literal sites
- **Files affected** (enumerated from verifier round 1 H-1's actual probe, `grep -rn '"pdf-page-v1"' crates/ --include='*.rs'`, which returned 2 sites):
- `crates/kebab-chunk/src/pdf_page_v1.rs:67` (currently verified: `const VERSION_LABEL: &str = "pdf-page-v1";`).
- `crates/kebab-app/tests/pdf_pipeline.rs:168` (currently verified: the hard assertion `assert_eq!(pdf_item.chunker_version.as_ref().map(|c| c.0.as_str()), Some("pdf-page-v1"))`, which fails after the v1.1 bump).
- `crates/kebab-app/tests/pdf_pipeline.rs:368` (currently verified: the error message string literal `"pdf-page-v1 emits 0 chunks for the empty page; total = 2"`; not a hard assertion, but updated to avoid going stale).
- **Action** (spec §4.4.1 decision):
- **(a) primary const bump** (`crates/kebab-chunk/src/pdf_page_v1.rs:67`):
```diff
-const VERSION_LABEL: &str = "pdf-page-v1";
+const VERSION_LABEL: &str = "pdf-page-v1.1";
```
The assertion in the existing test `chunker_version_is_pdf_page_v1` (pdf_page_v1.rs:374) references the `VERSION_LABEL` const, so it updates automatically and needs no test code change.
- **(b) update the test assertion literal** (`crates/kebab-app/tests/pdf_pipeline.rs:168`, required):
```diff
- Some("pdf-page-v1")
+ Some("pdf-page-v1.1")
```
- **(c) update the test error message literal** (`crates/kebab-app/tests/pdf_pipeline.rs:368`, recommended):
```diff
- "pdf-page-v1 emits 0 chunks for the empty page; total = 2"
+ "pdf-page-v1.1 emits 0 chunks for the empty page; total = 2"
```
- **Acceptance**:
- `grep -nE 'const VERSION_LABEL: &str = "pdf-page-v[0-9.]+";' crates/kebab-chunk/src/pdf_page_v1.rs` outputs the `"pdf-page-v1.1"` line.
- `cargo test -p kebab-chunk chunker_version_is_pdf_page_v1 -j 4` green (passes automatically because it references VERSION_LABEL).
- `grep -rn '"pdf-page-v1"' crates/ --include='*.rs' | grep -v 'pdf-page-v1\.1'` prints **0** lines. The `grep -v` stage guards against false positives: lines where `pdf-page-v1` appears only as part of `pdf-page-v1.1` are excluded by the ".1" suffix. Zero lines after the filter means no stale literal remains.
- `cargo test -p kebab-app pdf_pipeline -j 4` green (after the line 168 assertion update).
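The two-stage stale-literal probe above can be mirrored as a plain string filter. The sketch below is illustrative only and assumes the source lines are already loaded into memory; `stale_literal_lines` is a hypothetical helper, not part of the repository.

```rust
/// Mirrors `grep '"pdf-page-v1"' | grep -v 'pdf-page-v1\.1'`:
/// keep lines containing the exact quoted old label, then drop any
/// line that mentions the new label.
pub fn stale_literal_lines<'a>(lines: &[&'a str]) -> Vec<&'a str> {
    lines
        .iter()
        .copied()
        .filter(|l| l.contains("\"pdf-page-v1\""))
        .filter(|l| !l.contains("pdf-page-v1.1"))
        .collect()
}
```

Because the first stage requires a closing quote directly after `v1`, a message literal such as `"pdf-page-v1 emits 0 chunks ..."` is not matched by this filter, so the line 368 update in (c) has to be checked by reading the test rather than by this probe.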
#### Sub-action B4 — update the module doc + add a HOTFIXES.md entry
- **Files affected**:
- `crates/kebab-chunk/src/pdf_page_v1.rs:47-60` (currently verified: the `## chunk_id collision deviation` paragraph of the module doc).
- `tasks/HOTFIXES.md` (append a new dated entry at the usual position: the file's latest entry is `2026-05-26`, so insert the `2026-05-27 — v0.20.0 sub-item 1` entry above it, following the file's chronological pattern).
- **Action**:
- **(a) module doc**: the updated text from spec §4.4, as-is:
```diff
-//! Workaround that doesn't change the §4.2 recipe: feed a per-chunk
-//! variant `format!("{base_policy_hash}#c{char_start}")` into the
-//! recipe's `policy_hash` slot (so distinct chunks distinguish via
-//! different policy_hash inputs), while storing the unmodified
-//! `base_policy_hash` in `Chunk.policy_hash` so the field still answers
-//! "what policy was active". Logged in `tasks/HOTFIXES.md`.
+//! Workaround that doesn't change the §4.2 recipe: feed a per-chunk
+//! variant `format!("{base_policy_hash}#c{segment_start}")` into the
+//! recipe's `policy_hash` slot. `segment_start` is the pre-overlap
+//! segment boundary, strictly increasing across the returned chunks
+//! even when the overlap walk collapses `actual_start` to a previous
+//! chunk's `prev_min`. Unmodified `base_policy_hash` is stored in
+//! `Chunk.policy_hash` so the field still answers "what policy was
+//! active". v1.1 second-iteration patch — logged in
+//! `tasks/HOTFIXES.md` (2026-05-27).
```
- **(b) HOTFIXES.md entry** (entry body from spec §4.4, as-is):
```markdown
## 2026-05-27 — v0.20.0 sub-item 1: chunk_id `#c{char_start}` workaround collapses under aggressive overlap (Bug #3 second-iteration patch)
**Symptom**: ingesting F2 (1580 chars OCR, scanned_page2.pdf) fails with
`DocumentStore::put_chunks (pdf): sqlite error: UNIQUE constraint
failed: chunks.chunk_id: ... Error code 1555: A PRIMARY KEY constraint
failed`. Reproducible on every `--force-reingest` in the `kebab v0.20.0`
(commit `b4d9e60`) dogfood (qwen2.5vl:3b on the `192.168.0.47:11434`
Ollama endpoint, isolated KB at `/build/cache/tmp/v0.20-dogfood`).
**Root cause**: in `crates/kebab-chunk/src/pdf_page_v1.rs:170`,
`per_chunk_hash = format!("{base_policy_hash}#c{char_start}")` uses
`char_start` = post-overlap `actual_start`. The overlap walk at lines
266-281 back-walks only as far as the `prev_min` floor, so with
aggressive overlap and a page whose first segment is small (F2's Korean
OCR text: a sentence end within the first ~10 chars → segment_1 =
[0, 30], segment_2 = [30, n], overlap_bytes 240 / chars=80 →
segment_2's actual_start collapses to prev_min=0), both chunks get an
identical `#c0` suffix → identical chunk_id → `chunks` PRIMARY KEY
violation.
**Fix** (spec §4.4): add `segment_start` to the `chunk_page` return
tuple (3-tuple → 4-tuple `(segment_start, actual_start, chunk_end,
slice)`) and switch the caller's `per_chunk_hash` suffix to
`segment_start`. `segment_start` is `bounds[seg_idx]` (strictly
increasing after dedup), so every chunk is distinct regardless of the
overlap walk. `SourceSpan::Page.char_start`, used for citation
locality, still keeps the post-overlap `actual_start`.
**chunker_version cascade**: bump `pdf-page-v1` → `pdf-page-v1.1`
(spec §4.4.1 round 1c M-1 decision, a direct application of the design
§9 cascade rule). chunk_ids change for multi-chunk PDF pages (normal
paths such as the pre-OCR `metro-korea.pdf` with 21 blocks / 34
chunks), which gives an explicit user-facing audit trail and an
automatic invalidation report from the store layer. Because v0.20.0 is
on the force-update path, user cost is zero (a fresh ingest happens
anyway).
**Amends**: spec `docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md`
§4.4. The parent design §4.2 chunk_id recipe itself is unchanged (only
the internal computation of the workaround layer changes). Parent PR #189
(`feat/pdf-scanned-ocr`, force-update path).
```
- **Acceptance**:
- `grep -c "#c{segment_start}" crates/kebab-chunk/src/pdf_page_v1.rs` ≥ **2** (module doc + the actual call at line 170).
- `grep -c "2026-05-27 — v0.20.0 sub-item 1: chunk_id" tasks/HOTFIXES.md` = **1**.
#### Sub-action B5 — unit regression test `multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids`
- **Files affected**: the `#[cfg(test)] mod tests` in `crates/kebab-chunk/src/pdf_page_v1.rs` (currently verified: the `make_pdf_doc(&[&str])` and `default_policy(target, overlap)` helpers already exist, lines 300-371).
- **Action** (test body from spec §4.5, as-is):
```rust
#[test]
fn multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids() {
    // Trigger shape of the Korean OCR text: 10 chars of "가" + ". " + 500 chars of "나".
    // → first segment [0, 12), second segment [12, n).
    // page_text byte_len = 10*3 + 2 + 500*3 = 1532 > target_bytes=1500
    // → multi-chunk. overlap_bytes = min(240, 750) = 240, i.e. 80 chars
    // → the second chunk's actual_start collapses to prev_min=0 → same `#c0`.
    //
    // default_policy(500, 80): target_tokens=500 → target_bytes=500*3=1500
    // (Korean at 3 bytes/char), overlap_tokens=80 → overlap_bytes=min(240, 750)=240.
    // Added per verifier round 1 L-3.
    let early_seg: String = std::iter::repeat('가').take(10).collect();
    let tail: String = std::iter::repeat('나').take(500).collect();
    let page_text = format!("{early_seg}. {tail}");
    let doc = make_pdf_doc(&[&page_text]);
    let policy = default_policy(500, 80); // target=1500 byte, overlap=240 byte
    let chunks = PdfPageV1Chunker.chunk(&doc, &policy).unwrap();
    assert!(
        chunks.len() >= 2,
        "expected ≥2 chunks for {} byte page; got {}",
        page_text.len(),
        chunks.len()
    );
    let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
    ids.sort_unstable();
    let total = ids.len();
    ids.dedup();
    assert_eq!(
        ids.len(),
        total,
        "all chunk_ids must be unique even when overlap walks actual_start back to prev_min"
    );
}
```
- **Acceptance**:
- `cargo test -p kebab-chunk multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids -j 4` green.
- `cargo test -p kebab-chunk deterministic_chunk_ids_1000 -j 4` green (existing determinism invariant preserved).
- `cargo test -p kebab-chunk overlap_clamped_when_overlap_exceeds_target -j 4` green (existing overlap clamp invariant preserved).
- `cargo test -p kebab-chunk -j 4` fully green.
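The byte arithmetic quoted in the B5 test comments can be checked on its own. The sketch below rebuilds the trigger text and recomputes the budgets. `trigger_page_text` and `policy_bytes` are hypothetical helpers, and the conversion rules (3 bytes per token, overlap clamped to half the target) are assumptions inferred from the numbers in those comments, not taken from the `kebab-chunk` source.

```rust
/// Rebuilds the trigger text: 10 x '가' + ". " + 500 x '나'.
pub fn trigger_page_text() -> String {
    let early_seg = "가".repeat(10);
    let tail = "나".repeat(500);
    format!("{early_seg}. {tail}")
}

/// Assumed conversion: 3 bytes per token, overlap clamped to half of
/// the target (this reproduces the 1500 / min(240, 750) figures).
pub fn policy_bytes(target_tokens: usize, overlap_tokens: usize) -> (usize, usize) {
    let target_bytes = target_tokens * 3;
    let overlap_bytes = (overlap_tokens * 3).min(target_bytes / 2);
    (target_bytes, overlap_bytes)
}
```

Each Hangul syllable occupies 3 bytes in UTF-8, so the text is 512 chars and 1532 bytes long, just above the 1500-byte target that forces a second chunk.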
#### Commit (all of Step 2)
```
fix(chunk): chunk_id collision under aggressive overlap; bump pdf-page-v1 → pdf-page-v1.1 (Bug #3)
- crates/kebab-chunk/src/pdf_page_v1.rs: chunk_page returns 4-tuple
(segment_start, actual_start, chunk_end, slice); caller per_chunk_hash
suffix uses segment_start (pre-overlap boundary, strictly increasing)
instead of char_start (post-overlap, may collapse to prev_min).
- VERSION_LABEL "pdf-page-v1" → "pdf-page-v1.1" (design §9 cascade,
explicit user-facing audit trail).
- module docs: workaround description updated to segment_start.
- mod tests: multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids
regression pin.
- tasks/HOTFIXES.md: 2026-05-27 entry (symptom F2 1580 char OCR,
intra-doc collision root cause, second-iteration patch rationale).
- spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md §4
```
---
### Step 3 (Group C): Bug #3 multi-scanned PDF integration test
spec §4.5 + §4.5.1: a chunk_id collision-free regression at the `kebab-app` integration level. To avoid a real Ollama dependency, it uses a MockOcrEngine implementing the `OcrEngine` trait. The existing private MockOcrEngine in `crates/kebab-app/tests/pdf_ocr_apply.rs:20-45` lives in a separate test binary of the same crate and cannot be imported directly, so the executor's first sub-action is to decide the share path.
#### Sub-action C1 — MockOcrEngine share decision (dependency-check task for the executor)
- **Files affected** (branches per Option):
- **Option A (share via `tests/common/`)**, corrected from verifier round 1 H-2's actual probe:
- `crates/kebab-app/tests/common/mod.rs` **already exists** (172 lines of `TestEnv` infrastructure: `#![allow(dead_code)]` + `pub struct TestEnv` + `pub fn ingest_md` + `pub fn lexical_query`, etc.). Action = **append the single line `pub mod mock_ocr;`** (do not create a new mod.rs).
- `crates/kebab-app/tests/common/mock_ocr.rs` (**new** file: MockOcrEngine lift + per-page ctor).
- Remove the MockOcrEngine struct + impl from the existing `pdf_ocr_apply.rs:20-45`, add the `mod common; use common::mock_ocr::MockOcrEngine;` import, and migrate the ctor call sites (see M-3).
- The new integration test shares it via `mod common; use common::mock_ocr::MockOcrEngine;`.
- **Option B (inline duplicate)**: an inline `struct LocalMockOcr` + `impl OcrEngine for LocalMockOcr` inside the new test file `multi_scanned_pdf_ingest_no_chunk_id_collision.rs` (prioritizes test isolation; does not touch common/mod.rs).
- **Action**:
- **(a) dependency probe**, following the decision path in spec §4.5.1:
```bash
grep -rn "impl OcrEngine" crates/kebab-parse-image/src/ crates/kebab-app/tests/
# actual output:
# crates/kebab-parse-image/src/ocr.rs:235 (production OllamaVisionOcr)
# crates/kebab-app/tests/pdf_ocr_apply.rs:25 (test-only MockOcrEngine)
ls crates/kebab-app/tests/common/mod.rs
# actual output: -rw-r--r-- ... 172 lines (TestEnv infrastructure already exists).
```
- **(b) executor decision**:
- The existing MockOcrEngine is constructed as `MockOcrEngine { expected_text: String, fail: bool }`. Supporting a different text length per page requires extending the ctor signature (e.g. `expected_text: Vec<String>` + an internal `Mutex<usize>` cursor). The extension is trivial and both tests live in the same crate → **Option A recommended**.
- With Option A, the MockOcrEngine ctor call sites in `pdf_ocr_apply.rs` (actual verifier probe = **10 instantiation sites** at lines 140, 170, 193, 210, 242, 284, 311, 334, 359, 399; this corrects the "9 → 10" off-by-1 from critic round 1 L-2, and excludes the struct definition at line 21) migrate to the new ctor signature. For backward compatibility, provide two ctors (`MockOcrEngine::single(text, fail)` + `MockOcrEngine::per_page(texts, fail)`). **Mechanical migration**: at each site, `MockOcrEngine { expected_text: <text>, fail: <bool> }` → `MockOcrEngine::single(<text>, <bool>)` (10 sites × 1 line edit, the actual cost from verifier round 1 M-3).
- Option B (inline) applies when the sharing cost outweighs the test-isolation value. This plan's first preference is Option A.
- **(c) record the decision**: record the chosen path under §6 open question 1 in the closing summary of the result file (`.omc/reviews/2026-05-27-v0.20-bugfix-plan-drafter-r1c-result.md`). With Option A, the file edits in sub-action C2 are (existing) `common/mod.rs` 1-line append + (new) `common/mock_ocr.rs` + (modify) `pdf_ocr_apply.rs` + (new) `multi_scanned_pdf_ingest_no_chunk_id_collision.rs` = 4 files. With Option B, only 1 new file.
- **Acceptance**:
- probe grep returns ≥ 2 lines (production + existing mock).
- probe ls confirms that `common/mod.rs` exists.
- The executor's decision is stated under open question OQ-1 in §6 of the plan.
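One possible shape for the two-ctor mock described under Option A is sketched below. It is a sketch under stated assumptions: the real `OcrEngine` trait in `kebab-parse-image` is not shown in this plan, so the trait here is a stand-in with a hypothetical `recognize` method and `String` errors. Only the `single` / `per_page` ctor names and the `Mutex<usize>` cursor idea come from the plan.

```rust
use std::sync::Mutex;

/// Stand-in trait; the real `OcrEngine` signature is an assumption.
pub trait OcrEngine {
    fn recognize(&self, image: &[u8]) -> Result<String, String>;
}

pub struct MockOcrEngine {
    texts: Vec<String>,
    fail: bool,
    cursor: Mutex<usize>,
}

impl MockOcrEngine {
    /// Backward-compatible ctor: every call returns the same text.
    pub fn single(text: impl Into<String>, fail: bool) -> Self {
        Self::per_page(vec![text.into()], fail)
    }

    /// Per-page ctor: the n-th call returns the n-th text.
    pub fn per_page(texts: Vec<String>, fail: bool) -> Self {
        Self { texts, fail, cursor: Mutex::new(0) }
    }
}

impl OcrEngine for MockOcrEngine {
    fn recognize(&self, _image: &[u8]) -> Result<String, String> {
        if self.fail {
            return Err("mock OCR failure".to_string());
        }
        let mut cursor = self.cursor.lock().unwrap();
        // Clamp to the last entry so `single` repeats its text forever.
        let idx = (*cursor).min(self.texts.len().saturating_sub(1));
        *cursor += 1;
        self.texts
            .get(idx)
            .cloned()
            .ok_or_else(|| "no mock text configured".to_string())
    }
}
```

Clamping the cursor to the last entry is a design choice of this sketch: it lets `single` delegate to `per_page` without a separate code path, at the cost of silently repeating the last page text if a test calls the mock more often than expected.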
#### Sub-action C2 — write the integration test (conditional on the C1 decision)
- **Files affected** (assuming Option A is adopted; corrected per verifier round 1 H-2):
- `crates/kebab-app/tests/common/mod.rs` (**existing**, 172 lines: only append the single line `pub mod mock_ocr;`).
- `crates/kebab-app/tests/common/mock_ocr.rs` (**new**: MockOcrEngine lift + per-page ctor).
- `crates/kebab-app/tests/pdf_ocr_apply.rs:20-45` (remove the existing inline impl + add `mod common; use common::mock_ocr::MockOcrEngine;`, i.e. 1 mod declaration line at the file head) + mechanical migration of the 10 ctor call sites (M-3).
- `crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs` (**new**): imports via `mod common; use common::mock_ocr::MockOcrEngine;`.
- **Action** (test body from spec §4.5, expanded within this sub-action):
- **fixture**: F1 (`scanned_page1.pdf`, 779 char OCR) + F2 (`scanned_page2.pdf`, 1580 char OCR) + 1 synthetic small-page PDF (300 char), 3 scanned PDFs in total.
- **MockOcrEngine ctor**: per-page text vec `["text for F1", "text for F2 (1580 char string)", "text for synthetic 300 char"]` + `fail: false`.
- **isolated KB**: `tempfile::tempdir()` + `Config::default()` with only `data_dir` overridden + workspace `[ingest.pdf].enabled = true`.
- **assertion path**:
1. Call `kebab_app::ingest_with_config_opts(&cfg, ...)` (facade).
2. `report.items.iter().filter(|i| i.kind == IngestItemKind::Error).count() == 0`: no `ErrorKind::Storage` row, which a chunk_id collision would produce.
3. `store.get_chunks_count() == sum(per-PDF chunk_counts)`: final row count of the DELETE+INSERT path.
4. `store.get_all_chunk_ids().iter().collect::<HashSet<_>>().len() == chunks_count`: global chunk_id uniqueness.
- **executor degradation path** (spec §4.5.1 conditional downgrade): if sharing under Option A proves too costly or risky and Option B is also impractical (e.g. the ExtractContext / Facade wiring of the integration setup exceeds the scope of this sub-action), conditionally downgrade the acceptance of §6 row 7: the unit-level invariant in `kebab-chunk` (Step 2 B5) alone pins the core regression for Bug #3, and the integration test is skipped.
- **Acceptance**:
- `cargo test -p kebab-app multi_scanned_pdf_ingest_no_chunk_id_collision -j 4` green.
- `cargo test -p kebab-app pdf_ocr_apply -j 4` green (no regression in the existing tests once the 10 ctor sites using the literal struct construction `MockOcrEngine { expected_text, fail }` are migrated to `MockOcrEngine::single(text, fail)`; count per critic round 1 L-2).
- On the downgrade path: record one line in the result file + commit body: "§6 row 7 conditional skip — Bug #3 core regression = kebab-chunk unit B5".
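The global uniqueness assertion in step 4 of the assertion path compares a `HashSet` size with the row count. When that assertion fails, naming the colliding id makes the failure easier to read. A minimal sketch of such a helper follows; `first_duplicate` is a hypothetical name and it operates on plain string slices rather than on the store's actual chunk id type.

```rust
use std::collections::HashSet;

/// Returns the first id that appears a second time, or `None` when all
/// ids are distinct. `HashSet::insert` returns false for a repeat.
pub fn first_duplicate<'a>(ids: &[&'a str]) -> Option<&'a str> {
    let mut seen = HashSet::with_capacity(ids.len());
    ids.iter().copied().find(|id| !seen.insert(*id))
}
```

An integration test could then assert `first_duplicate(&ids).is_none()` and print the offending id on failure, which is equivalent in outcome to the set-size comparison.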
#### Commit (all of Step 3)
```
test(app): multi-scanned PDF chunk_id collision-free integration test (Bug #3 regression)
- crates/kebab-app/tests/common/{mod,mock_ocr}.rs: MockOcrEngine lift
with per-page text ctor (shared by pdf_ocr_apply.rs + new test).
- crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs:
3 scanned PDF (F1 + F2 + synthetic 300char) ingest via mock OCR,
assert all chunk_ids globally unique + zero ErrorKind::Storage rows.
- spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md §4.5
```
If Option B (inline) or the conditional downgrade is adopted, adjust the commit body and file list accordingly.
---
### Step 4 (Group D): Bug #4 F4 fixture re-generation
spec §5 — `tests/fixtures/_synth/mojibake.py` 의 byte-level `re.sub` + 수작업 startxref edit 를 pikepdf 의 proper PDF surgery (open + delete /ToUnicode + save 자동 xref regen) 로 교체. F4 fixture 자체 (`crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf`) regenerate + 3 신규 invariant test.
#### Sub-action D1 — `tests/fixtures/_synth/mojibake.py` pikepdf rewrite
- **Files affected**: `tests/fixtures/_synth/mojibake.py` (full rewrite; the existing byte-edit pattern is discarded).
- **Action** (the body of spec §5.4, as is):
- Step 1: use reportlab to synthesize a Korean PDF with a Type 0 (CID) font (including a normal ToUnicode CMap).
- Step 2: open with pikepdf, remove the `/ToUnicode` entry from every dictionary, then save (xref is regenerated automatically). Note that in pikepdf `allow_overwriting_input=True` is an argument of `pikepdf.open()`, not of `pdf.save()`; it is only needed when saving back over the input file.
- Step 3: verify the invariants: `len(pdf.pages) == 1` and `b"/ToUnicode" not in dst.read_bytes()`.
- On failure: non-zero exit code + stderr message (Step 2 removed count = 0 → exit 2; Step 3 page count mismatch → exit 3; ToUnicode still present → exit 4).
- **Dep install** (executor pre-action):
```bash
pip install --cache-dir /build/cache/pip pikepdf reportlab
python -c "import pikepdf; import reportlab; print(pikepdf.__version__, reportlab.Version)"
# font availability probe (critic round 1 L-3): this path is hardcoded in mojibake.py.
test -f /usr/share/fonts/truetype/dejavu/DejaVuSans.ttf \
|| sudo apt-get install -y fonts-dejavu-core
```
Not reflected in the CI environment: the fixture itself is committed, so generation is a one-off (local to the executor of Step 4 D2). Optionally add a one-line pikepdf install hint to `tasks/HOTFIXES.md`.
- **Acceptance**:
- `grep -c "import pikepdf" tests/fixtures/_synth/mojibake.py` = **1**.
- `grep -c "re.sub" tests/fixtures/_synth/mojibake.py` = **0** (confirms the byte-edit pattern is gone).
- `test -f /usr/share/fonts/truetype/dejavu/DejaVuSans.ttf` exit 0 (font probe, critic round 1 L-3 fast failover signal).
- `python tests/fixtures/_synth/mojibake.py /tmp/mojibake_dryrun.pdf && echo OK` exits 0 with empty stderr.
#### Sub-action D2 — F4 fixture binary regenerate + snapshot regen + commit
- **Files affected** (2 consumers enumerated from the actual probe `grep -rn 'fixtures/mojibake.pdf' crates/`, per verifier round 1 H-4 + critic round 1 M-2):
- `crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf` (regenerate).
- `crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json` (**the snapshot baseline file itself**: delete + auto-regen). Per the actual probe in verifier round 1 H-4, `text_extractor_regression.rs:59-64` uses a hand-rolled `unwrap_or_else { write baseline }` pattern.
- `crates/kebab-parse-pdf/tests/text_extractor_regression.rs` (existing test: zero code changes, it only triggers the snapshot regen path).
- `crates/kebab-parse-pdf/src/text_quality.rs:96` (the second consumer from verifier round 1 H-4: the unit test/doctest containing `let bytes = include_bytes!("../tests/fixtures/mojibake.pdf");` is verified at the same time when the fixture binary changes; zero code changes).
- **Action**:
- **(a) regenerate command**:
```bash
python tests/fixtures/_synth/mojibake.py \
crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf
```
- **(b) manual probe after regeneration**:
```bash
python -c "import pikepdf; pdf = pikepdf.open('crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf'); print(len(pdf.pages))"
# expected: 1
grep -c "/ToUnicode" crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf
# expected: 0 (binary grep; no ToUnicode in the Pages dict)
```
- **(c) snapshot baseline regen** (the actual mechanic per verifier round 1 H-4 + critic round 1 M-2; closes OQ-2):
- `text_extractor_regression.rs:59-64` is a hand-rolled pattern: `let baseline = std::fs::read_to_string("tests/snapshots/vector_pdf_canonical.json").unwrap_or_else(|_| { std::fs::write(baseline_path, &actual).expect(...); actual.clone() })` (the insta crate is not used).
- The fixture binary changes → on the next cargo test, `actual != baseline` → `assert_eq!` fails.
- Executor regen step:
```bash
rm crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json
cargo test -p kebab-parse-pdf vector_pdf_extract_byte_identical_to_baseline -j 4
# 1st run: snapshot file is missing → the unwrap_or_else write pattern creates the new baseline → assert passes.
cargo test -p kebab-parse-pdf vector_pdf_extract_byte_identical_to_baseline -j 4
# 2nd run: byte-identical to the new baseline → assert passes (regression invariant established).
```
- OQ-2 closure: the insta crate is not used and the cargo-insta CLI is not needed. This spells out the actual mechanic behind spec §5.6's "update the F4 baseline of the existing `text_extractor_regression.rs`".
- **Acceptance**:
- `stat crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf` size > 0.
- `grep -c "/ToUnicode" crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf` = **0**.
- The Python page count probe prints `1`.
- `test -f crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json` exit 0 (after snapshot regen).
- `cargo test -p kebab-parse-pdf vector_pdf_extract_byte_identical_to_baseline -j 4` green twice in a row (regen + verify, H-4 mechanic).
- `cargo test -p kebab-parse-pdf -j 4` fully green: this also verifies the new D3 tests and the second consumer at text_quality.rs:96.
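
A minimal sketch of the hand-rolled baseline mechanic described in Action (c), assuming a helper named `load_or_write_baseline` (hypothetical; the real test inlines the `unwrap_or_else` at `text_extractor_regression.rs:59-64`). One deliberate difference from the inline version: this sketch auto-writes only when the file is missing, so other I/O errors surface instead of silently rewriting the baseline. It also shows why two runs are needed: the first run can only ever pass, and the regression check starts with the second.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns the stored baseline. When the baseline file does not exist yet,
/// writes `actual` as the new baseline and returns it (first-run regen).
/// (Hypothetical helper for illustration only.)
fn load_or_write_baseline(path: &Path, actual: &str) -> io::Result<String> {
    match fs::read_to_string(path) {
        // Baseline exists: hand it back unchanged for comparison.
        Ok(baseline) => Ok(baseline),
        // Baseline missing: this run establishes it.
        Err(e) if e.kind() == io::ErrorKind::NotFound => {
            fs::write(path, actual)?;
            Ok(actual.to_owned())
        }
        // Any other I/O problem is a real failure.
        Err(e) => Err(e),
    }
}

/// Comparison step as a test would use it: true when `actual` matches
/// the (possibly just-written) baseline byte for byte.
fn matches_baseline(path: &Path, actual: &str) -> io::Result<bool> {
    Ok(load_or_write_baseline(path, actual)? == actual)
}
```
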
#### Sub-action D3 — 3 new parse-pdf invariant tests
- **Files affected** (decided from the actual probe in verifier round 1 H-3):
- Actual `ls crates/kebab-parse-pdf/tests/` output: `common`, `extractor.rs`, `fixtures`, `ocr_e2e.rs`, `page_image.rs`, `snapshots`, `text_extractor_regression.rs`. The plan round 0 primary candidate `text_extractor.rs` **does not exist**.
- **Decision**: append to `crates/kebab-parse-pdf/tests/text_extractor_regression.rs` (locality with the F4 fixture consumer; same file as the D2 snapshot regen mechanic). Append 3 new `#[test] fn`s.
- Alternative (if the executor decides to split for file size / cohesion): a new `crates/kebab-parse-pdf/tests/mojibake_invariants.rs`. Plan first preference = append to `text_extractor_regression.rs`.
- **Action** (path corrections to the 3 test bodies in spec §5.5, per verifier round 1 H-3):
1. `mojibake_fixture_load_yields_one_page`: `let bytes = include_bytes!("fixtures/mojibake.pdf");` (integration tests already live at the `crates/kebab-parse-pdf/tests/` root, and this follows the canonical pattern at `text_extractor_regression.rs:42`; the `"../tests/fixtures/mojibake.pdf"` in spec §5.5 is wrong, use `"fixtures/mojibake.pdf"` directly). Assert `lopdf::Document::load_mem(bytes).unwrap().get_pages().len() == 1`.
2. `mojibake_fixture_has_no_tounicode_cmap`: avoid the risk of a CWD-relative `std::fs::read("tests/fixtures/mojibake.pdf")` (CWD may differ from CARGO_MANIFEST_DIR in some cargo test environments) by using `let bytes = include_bytes!("fixtures/mojibake.pdf");`. Assert `bytes.windows(b"/ToUnicode".len()).filter(|w| *w == b"/ToUnicode").count() == 0`.
3. `pdf_text_extractor_on_mojibake_yields_one_block`: `let bytes = include_bytes!("fixtures/mojibake.pdf");`, then verify the PdfTextExtractor invariant of `1 Block::Paragraph per page`: `canonical.blocks.len() == 1`, plus either a `scanned candidate` warning or non-empty text. For the actual ExtractContext setup, the executor uses the existing helper in `text_extractor_regression.rs` (if any) or expands the placeholder in spec §5.5.
- **Acceptance**:
- `cargo test -p kebab-parse-pdf mojibake_fixture_load_yields_one_page -j 4` green.
- `cargo test -p kebab-parse-pdf mojibake_fixture_has_no_tounicode_cmap -j 4` green.
- `cargo test -p kebab-parse-pdf pdf_text_extractor_on_mojibake_yields_one_block -j 4` green.
- `cargo test -p kebab-parse-pdf -j 4` fully green.
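
The byte-window scan used by `mojibake_fixture_has_no_tounicode_cmap` can be isolated into a small helper. This is a sketch under assumptions: the function name `count_marker` is hypothetical, and the sample byte strings in the usage function are made up for illustration rather than taken from the fixture. The guard matters because `slice::windows(0)` panics.

```rust
/// Counts (possibly overlapping) occurrences of `needle` in `haystack`
/// by sliding a window of `needle.len()` bytes across the input.
/// (Hypothetical helper for illustration only.)
fn count_marker(haystack: &[u8], needle: &[u8]) -> usize {
    // `windows(0)` panics, and a needle longer than the input cannot match.
    if needle.is_empty() || haystack.len() < needle.len() {
        return 0;
    }
    haystack
        .windows(needle.len())
        .filter(|w| *w == needle)
        .count()
}

/// Usage shape: a font dictionary that still carries a ToUnicode entry
/// versus one where it has been stripped (both byte strings are invented).
fn example_counts() -> (usize, usize) {
    let with_cmap: &[u8] = b"<< /Type /Font /ToUnicode 7 0 R >>";
    let stripped: &[u8] = b"<< /Type /Font /Subtype /Type0 >>";
    (
        count_marker(with_cmap, b"/ToUnicode"),
        count_marker(stripped, b"/ToUnicode"),
    )
}
```
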
#### Commit (all of Step 4)
```
fix(parse-pdf): F4 mojibake.pdf via pikepdf surgery; preserve 1-page invariant (Bug #4)
- tests/fixtures/_synth/mojibake.py: full rewrite — replace byte-level
re.sub + manual startxref edit with pikepdf open+del+save (auto xref
regen). Type 0 font + ToUnicode strip via dictionary walk.
- crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf: regenerate.
- crates/kebab-parse-pdf/tests/text_extractor_regression.rs: append 3
invariant tests (lopdf 1-page / no ToUnicode marker / PdfTextExtractor
1-block); file path decision per verifier round 1 H-3 (same locality
with snapshot regen).
- crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json:
delete + auto-regen via 2-run cargo test (hand-rolled unwrap_or_else
pattern, verifier round 1 H-4).
- spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md §5
```
Correction per verifier round 1 NIT-2: commit scope `test-fixture` → `parse-pdf` (crate name, the typical conventional-commit scope).
---
### Step 5 (Group E): Workspace verify + commit + PR #189 force-push
The 16-row consolidated acceptance of spec §6 serves as the verifier checklist for this step. Every acceptance command is scriptable.
#### Sub-action E1 — cargo workspace test `-j 1`
- **Files affected**: none (verification only).
- **Action**:
- **(a) conditional cargo clean**, per memory `feedback_cargo_clean_policy` (verifier round 1 NIT-1 correction: force GB units with `-BG` to avoid TB vs GB unit confusion):
```bash
# Read the available space on /build directly as an integer in GB
# (strip only the 'G' suffix from the df -BG output).
AVAIL_GB=$(df -BG --output=avail /build | tail -1 | tr -d ' G')
# Size of CARGO_TARGET_DIR, also as an integer in GB (du -BG output).
TARGET_GB=$(du -BG -s "${CARGO_TARGET_DIR:-target}" 2>/dev/null | awk '{print $1}' | tr -d 'G')
# /build avail < 500 GB OR target > 200 GB → clean
if [[ "${AVAIL_GB:-9999}" -lt 500 ]] || [[ "${TARGET_GB:-0}" -gt 200 ]]; then
cargo clean
fi
```
If neither threshold is tripped, skip the clean and record one line in the commit body / result file (e.g. "skipped cargo clean — /build avail ${AVAIL_GB}G, target ${TARGET_GB}G").
- **(b) workspace test**:
```bash
set -o pipefail  # make the pipeline report cargo's exit code, not tail's
cargo test --workspace --no-fail-fast -j 1 2>&1 | tail -100
```
- Check the last 100 lines and the final summary "test result: ok. N passed; 0 failed".
- **Acceptance**:
- exit code 0.
- stdout contains "test result: ok" and "0 failed".
- spec §6 row 14 (workspace full test pass) satisfied.
#### Sub-action E2 — `cargo clippy --workspace -- -D warnings`
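- **Files affected**: none.
- **Action**:
```bash
set -o pipefail  # make the pipeline report cargo's exit code, not tail's
cargo clippy --workspace --all-targets -j 1 -- -D warnings 2>&1 | tail -50
```
- **Acceptance**:
- exit code 0.
- Zero occurrences of the "warning" keyword (`-D warnings` turns any warning into an error anyway).
- spec §6 row 13 (workspace clippy clean) satisfied.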
#### Sub-action E3 — dogfood re-run (Ollama qwen2.5vl:3b environment)
- **Files affected**: none. `/build/cache/tmp/v0.20-dogfood/` (isolated KB, reusing the same dogfood setup).
- **Action** (following memory `feedback_pr_workflow` + the `_external/` invariant):
- **(a) release build**:
```bash
cargo build --release -p kebab-cli -j 4 2>&1 | tail -10
"${CARGO_TARGET_DIR:-target}/release/kebab" --version
# expected: kebab 0.20.0
```
- **(b) dogfood KB clean + re-ingest of 9 PDFs** (same dogfood environment as spec §1.1).
Canonical config path = `/build/cache/tmp/v0.20-dogfood/config.toml` (assumed in §0). The external backup file `/build/cache/tmp/v0.20-dogfood-config.toml` **does not exist**, per the actual probe in critic round 1 H-1. The procedure therefore **backs up the config itself, then cleans and restores** (this prevents the destructive `rm -rf` from deleting the config along with the KB):
```bash
# Step A: temporary backup of the config (preserved before the KB clean).
cp /build/cache/tmp/v0.20-dogfood/config.toml \
/build/cache/tmp/v0.20-dogfood-config.toml.bak
# Step B: clean the whole KB (destructive, config included; the backup preserves it).
rm -rf /build/cache/tmp/v0.20-dogfood/
mkdir -p /build/cache/tmp/v0.20-dogfood/
# Step C: restore the config from the backup.
cp /build/cache/tmp/v0.20-dogfood-config.toml.bak \
/build/cache/tmp/v0.20-dogfood/config.toml
# config.toml contains [ingest.pdf].enabled = true, ollama endpoint =
# http://192.168.0.47:11434, ocr_model = qwen2.5vl:3b
# Step D: ingest.
"$RELEASE_BIN" ingest --config /build/cache/tmp/v0.20-dogfood/config.toml \
--json --force-reingest 2>&1 | tee /build/cache/tmp/v0.20-dogfood-ingest.ndjson
# Step E (optional): backup file cleanup. Prevents redundant backups from
# piling up across dogfood iterations. The config itself lives in place at
# v0.20-dogfood/config.toml (canonical); the .bak is transient.
rm /build/cache/tmp/v0.20-dogfood-config.toml.bak
```
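(Alternative: a single-step selective delete that preserves the config and destroys everything else. `-maxdepth 1` keeps `find` from descending into directories it has just removed and from sparing a nested `config.toml`.)
```bash
find /build/cache/tmp/v0.20-dogfood/ -mindepth 1 -maxdepth 1 -not -name 'config.toml' \
-exec rm -rf {} +
```
The actual procedure favors the clarity of the plain 5-step path (the executor default).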
- **(c) acceptance**:
- spec §6 row 3: `skipped_size_exceeded == 0` for non-code files across the 9 PDFs (= all 0, since the workspace contains no code files).
- spec §6 row 8: `kind != "Error"` for F1 + F2 (no chunk_id collision).
- spec §6 row 12: the mojibake.pdf ingest item has `block_count: 1`.
- spec §6 row 15: all 9 PDFs ingested, `errors = 2` (encrypted only, identical to the pre-existing dogfood baseline).
- **Fallback when Ollama is unavailable**: if the endpoint is unreachable, this sub-action may be partially skipped. The unit/integration-level evidence from the workspace test (E1) + clippy (E2) satisfies spec §6 rows 1, 2, 4-7, 9-11, 13-14, 16, and the skip of dogfood rows 3, 8, 12, 15 is recorded in one line (commit body + result file).
- **Acceptance**:
- The ingest report ndjson shows errors = 2 (encrypted only).
- For each of F1/F2/mojibake, the `kind` field of the item line is on the success path (= `"new"` or `"unchanged"`, not `"Error"`).
- dogfood log path: `/build/cache/tmp/v0.20-dogfood-ingest.ndjson` (referenced in the commit body).
#### Sub-action E4 — commit check + final organize
- **Files affected**: cumulative changes from all steps.
- **Action**:
- `git status` + `git log --oneline b4d9e60..HEAD`: 4 commits from Steps 1-4 and zero commits from Step 5 (verification only, no commit).
- If any work-in-progress files remain, reset them.
- Check the `Co-Authored-By:` line of each commit message (CLAUDE.md gitea-pr workflow).
- **Acceptance**:
- `git log --oneline b4d9e60..HEAD | wc -l` = **4** (one commit each for Steps 1-4).
- `git status` shows 0 untracked + modified files.
#### Sub-action E5 — PR #189 force-push
- **Files affected**: remote ref `gitea/feat/pdf-scanned-ocr`.
- **Action** (can be invoked directly via the gitea-ops skill):
```bash
git push gitea feat/pdf-scanned-ocr --force-with-lease
```
- `--force-with-lease`: force-push only when the local fetch state matches the remote HEAD (protects pushes from other collaborators; a cheap safety net in this single-user environment).
- Update the PR #189 body: add the Bug #2/#3/#4 fix summary + dogfood evidence (gitea API `PATCH /repos/altair823-org/kebab/pulls/189`).
- Per user memory `feedback_pr_workflow`, enter the review loop of the `gitea-pr-review` skill (multi-round critic + verifier).
- **Acceptance**:
- The head SHA reported by the `gh` equivalent (gitea-ops `gitea-pr-status 189`) = local `git rev-parse HEAD`.
- PR #189 commit count = commit count at the previous force-push + 4.
- The final state matches the 5-row table in the sequencing summary (§7): 4 commits + 1 verify-only step.
#### Commit (Step 5)
Verification only: 0 git commits. The 4 commits from Steps 1-4 form the final tree.

---
## §4 Verifier checklist (spec §6 16-row 1:1 mapping)
Each row is a scriptable command. All of them can be covered by running Step 5 E1-E3 cumulatively.
| # | Verifier | Bug | step | Command |
|---|---------|-----|------|------|
| 1 | walker bypasses size cap for PDF | #2 | A3 | `cargo test -p kebab-source-fs size_cap_skips_only_code_files -j 4` |
| 2 | walker still skips oversized code files | #2 | A3 | `cargo test -p kebab-source-fs ingest_report_counts_oversized_files_by_bytes -j 4` |
| 3 | 256KB+ PDF/markdown ingest default config | #2 | E3 | dogfood: `skipped_size_exceeded = 0` for non-code in the ingest report of `$RELEASE_BIN ingest ...` |
| 4 | chunker collision regression test | #3 | B5 | `cargo test -p kebab-chunk multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids -j 4` |
| 5 | chunker determinism preserved | #3 | B5 | `cargo test -p kebab-chunk deterministic_chunk_ids_1000 -j 4` |
| 6 | chunker overlap clamp preserved | #3 | B5 | `cargo test -p kebab-chunk overlap_clamped_when_overlap_exceeds_target -j 4` |
| 7 | integration: multi-scanned PDF ingest (conditional, §4.5.1) | #3 | C2 | `cargo test -p kebab-app multi_scanned_pdf_ingest_no_chunk_id_collision -j 4` (skip + record if the downgrade path is taken) |
| 8 | dogfood: F1 + F2 force-reingest errors=0 | #3 | E3 | dogfood: errors = 0 (excluding encrypted) in the output of `$RELEASE_BIN ingest --force-reingest ...` |
| 9 | F4 fixture lopdf 1-page invariant | #4 | D3 | `cargo test -p kebab-parse-pdf mojibake_fixture_load_yields_one_page -j 4` |
| 10 | F4 fixture no-ToUnicode invariant | #4 | D3 | `cargo test -p kebab-parse-pdf mojibake_fixture_has_no_tounicode_cmap -j 4` |
| 11 | F4 PdfTextExtractor 1-block invariant | #4 | D3 | `cargo test -p kebab-parse-pdf pdf_text_extractor_on_mojibake_yields_one_block -j 4` |
| 12 | dogfood: F4 ingest block_count=1 | #4 | E3 | dogfood: the mojibake.pdf ingest item has `block_count: 1` |
| 13 | workspace clippy clean | all | E2 | `cargo clippy --workspace --all-targets -j 1 -- -D warnings` |
| 14 | workspace full test pass | all | E1 | `cargo test --workspace --no-fail-fast -j 1` |
| 15 | dogfood end-to-end 9 PDF | all | E3 | dogfood: all 9 PDFs ingested, errors = 2 (encrypted only) |
| 16 | chunker_version cascade final value | #3 | B3 | `grep -nE 'pdf-page-v[0-9.]+' crates/kebab-chunk/src/pdf_page_v1.rs` yields `"pdf-page-v1.1"` |
In the final step (E1-E3) the executor runs all 16 rows as scripts and records row-by-row pass/fail/skip in the result file.
#### Workspace baseline expected test count delta (verifier round 1 M-1 closure)
Delta arithmetic for the expected `test result: ok. N passed` of `cargo test --workspace -j 1` (Step 5 E1), relative to the pre-fix baseline:
| Step | Sub-action | new test name | crate | type |
|---|---|---|---|---|
| 1 | A3 | `size_cap_skips_only_code_files` | kebab-source-fs | unit (in `mod tests`) |
| 2 | B5 | `multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids` | kebab-chunk | unit (in `mod tests`) |
| 3 | C2 | `multi_scanned_pdf_ingest_no_chunk_id_collision` | kebab-app | integration (new test binary) |
| 4 | D3 | `mojibake_fixture_load_yields_one_page` | kebab-parse-pdf | unit-style integration |
| 4 | D3 | `mojibake_fixture_has_no_tounicode_cmap` | kebab-parse-pdf | unit-style integration |
| 4 | D3 | `pdf_text_extractor_on_mojibake_yields_one_block` | kebab-parse-pdf | unit-style integration |
- **Option A (full path, C2 active)**: total = **+6 unit/integration test cases + 1 new integration test binary**.
- **Option B (C2 conditional downgrade per §4.5.1)**: total = **+5 test cases + 0 new binary**.
Other: the existing B3 test `chunker_version_is_pdf_page_v1` has no change in assertion content (it references the VERSION_LABEL const) → test count delta 0. For the existing D2 test `vector_pdf_extract_byte_identical_to_baseline`, only the assertion result changes (fixture change → baseline change) → test count delta 0; only the snapshot regen action is added.
When comparing N in the E1 acceptance, the executor confirms that it matches this delta arithmetic (a mismatch signals a regression).

---
## §5 Risks (plan stage)
- **R-1 (MockOcrEngine sharing complexity)**: Option A of spec §4.5.1 (`tests/common/mock_ocr.rs` lift) requires migrating the 10 existing ctor sites in `pdf_ocr_apply.rs` (around lines 20-45; count per critic round 1 L-2). This is trivial if 2 backward-compat ctors (single + per_page) are introduced; on failure, downgrade to Option B (inline). spec §6 row 7 may be conditionally skipped.
- **R-2 (chunker_version bump cascade scope)**: the impact of `pdf-page-v1.1` = chunk_id changes for multi-chunk PDF pages. parser_version / embedding_version / prompt_template_version / index_version are unchanged: only the chunker_version field of the 5-version snapshot in `kebab-eval::eval_runs.config_snapshot_json` takes the new value. The cascade rule invariant of parent design §9 is preserved; re-running the eval baseline is recommended (user-facing note in spec §7.1 Risk 1).
- **R-3 (F4 fixture binary churn)**: the pikepdf save output has a different PDF object ordering than reportlab + byte-edit → SHA256 changes + git binary diff noise. The `text_extractor_regression.rs` baseline is updated in the same commit to the actual output of the new fixture, handled together in Step 4 D2.
- **R-4 (dogfood depends on Ollama)**: the dogfood acceptance of spec §6 rows 3 + 8 + 12 + 15 calls the real qwen2.5vl:3b at `192.168.0.47:11434`. If the endpoint is unavailable, close partially via unit/integration evidence (rows 1-2, 4-7, 9-11, 13-14, 16) and record the skip in the commit body / result file.
- **R-5 (pikepdf dependency install)**: the `import pikepdf` in Step 4 D1's mojibake.py requires a pip install into the Python venv on this machine. No CI dependency arises (generation is a one-off; the fixture is committed).
- **R-6 (confirming zero conflict with concurrent work on the parent plan)**: Step 11 (final verify + PR open) of the parent plan (`2026-05-27-pdf-scanned-ocr-plan.md` round 1c ACCEPT) is already closed as commit `b4d9e60`. The fix commits of this plan stack on top of that commit, so there is no branch ordering conflict.
---
## §6 Open questions deferred to executor
- **OQ-1 (MockOcrEngine sharing path)**: decide between Option A (`tests/common/mock_ocr.rs` lift) and Option B (inline) of spec §4.5.1. This is the executor's first action in Step 3 C1: probe with `grep -rn "impl OcrEngine"`, decide, and record the decision in the result file. Plan first preference = Option A.
- **OQ-2 (F4 baseline snapshot update tool)**: ✅ **CLOSED (round 1c, critic M-2 + verifier H-4)**. The actual pattern at `text_extractor_regression.rs:59-64` = hand-rolled `unwrap_or_else { write baseline }` (insta crate not used). Regen procedure = delete the snapshot file `tests/snapshots/vector_pdf_canonical.json` + run cargo test twice (1st auto-regen, 2nd verify). The cargo-insta CLI is not needed. Details = Action (c) of §3 Step 4 D2.
- **OQ-3 (pikepdf install command)**: choose the cache-dir + venv for `pip install`: global `--user` pip, a venv dedicated to fixture generation, or a conda environment. Plan default = `pip install --cache-dir /build/cache/pip pikepdf reportlab` (memory `feedback_disk_layout`).
- **OQ-4 (when the endpoint in the dogfood config.toml changes)**: if the `192.168.0.47:11434` Ollama endpoint of this dogfood environment changes, the executor overrides it with an alternative endpoint (e.g. `localhost:11434`) and records that in the result file.
- **OQ-5 (number of rounds in the PR #189 review loop)**: the gitea-pr + review loop of memory `feedback_pr_workflow` may enter round 2/2c depending on the round 1 critic + verifier results. This plan is round 0 (drafter); the outcome of the review rounds is outside the plan's scope.
---
## §7 Sequencing summary (logical commit boundaries)
| commit # | step range | logical scope | file count |
|---:|---|---|---:|
| 1 | Step 1 (A1+A2+A3) | `fix(source-fs): apply size limit only to code files; PDF/image/markdown bypass walker cap (Bug #2)` | 2 |
| 2 | Step 2 (B1+B2+B3+B4+B5) | `fix(chunk): chunk_id collision under aggressive overlap; bump pdf-page-v1 → pdf-page-v1.1 (Bug #3)` | 4 (pdf_page_v1.rs + HOTFIXES.md + pdf_pipeline.rs:168 + :368, verifier H-1) |
| 3 | Step 3 (C1+C2) | `test(app): multi-scanned PDF chunk_id collision-free integration test (Bug #3 regression)` | **4 (Option A: existing common/mod.rs append + new common/mock_ocr.rs + modify pdf_ocr_apply.rs + new multi_scanned_pdf_ingest_no_chunk_id_collision.rs, verifier H-2)** / 1 (Option B) |
| 4 | Step 4 (D1+D2+D3) | `fix(parse-pdf): F4 mojibake.pdf via pikepdf surgery; preserve 1-page invariant (Bug #4)` | **5 (mojibake.py + fixtures/mojibake.pdf + snapshots/vector_pdf_canonical.json + text_extractor_regression.rs (D3 append) + src/text_quality.rs:96 consumer verify, verifier H-4 + H-3 + NIT-2)** |
| 5 | Step 5 (E1-E5) | verification only: 0 git commits; final state = PR #189 force-push on top of commits 1-4 | 0 |
Total: 4 commits + 1 verify-only step. After the force-push, the PR #189 head = local HEAD.

---
## §8 Round 1c rewrite changelog (drafter trace)
Applies the combined 21 findings from the round 1 critic + verifier (critic 7 + verifier 14). For details, see the §1 traceability matrix in the result file (`.omc/reviews/2026-05-27-v0.20-bugfix-plan-drafter-r1c-result.md`). This §8 summarizes the substantive changes to the plan body.
### Critic r1 (7 finding)
| ID | Severity | Action | Plan section |
|---|---|---|---|
| critic H-1 | HIGH | 5-step backup-then-clean + restore procedure for the E3 dogfood config (reflects the reality that no external backup file exists) | §3 Step 5 E3 (b) |
| critic M-1 | MEDIUM | line 15 "17 sub-action" → "18 sub-action" | §0 prelude line |
| critic M-2 | MEDIUM | Document the D2 snapshot baseline update mechanic (hand-rolled `unwrap_or_else` pattern, OQ-2 closure) | §3 Step 4 D2 + §6 OQ-2 |
| critic L-1 | LOW | B1 line range "200-289" → "200-204 (doc) + 205-289 (body)" made explicit | §3 Step 2 B1 |
| critic L-2 | LOW | MockOcrEngine ctor count "9 test (existing)" → "10 instantiation site" (actual probe) | §3 Step 3 C1 + C2 |
| critic L-3 | LOW | Add a one-line DejaVuSans.ttf existence probe to the D1 pre-action | §3 Step 4 D1 |
| critic NIT-1 | NIT | "5 logical commit" → "4 commit + 1 verify-only step (= 5 step total, 4 commit boundary)" | §0 prelude line |
### Verifier r1 (14 finding)
| ID | Severity | Action | Plan section |
|---|---|---|---|
| verifier H-1 | HIGH | B3 sub-action: explicitly update the literals at `pdf_pipeline.rs:168` (hard assertion) + `:368` (error message) and tighten the acceptance grep regex (`grep -v 'pdf-page-v1\.1'`) | §3 Step 2 B3 |
| verifier H-2 | HIGH | Step 3 Option A: reflect that `common/mod.rs` is existing infrastructure: append one line `pub mod mock_ocr;` + new `common/mock_ocr.rs` + `pdf_ocr_apply.rs` lift + new integration test = 4 file edits | §3 Step 3 C1 + C2 + §7 commit 3 file count |
| verifier H-3 | HIGH | D3 file path: correct the nonexistent `text_extractor.rs` → append to `text_extractor_regression.rs` (locality with D2 snapshot regen). `include_bytes!` path also changes from `../tests/fixtures/...` → `fixtures/...` directly, and the CWD-relative `std::fs::read` is avoided | §3 Step 4 D3 |
| verifier H-4 | HIGH | D2 snapshot regen mechanic: delete the snapshot file `tests/snapshots/vector_pdf_canonical.json` + run cargo test twice (1st auto-regen, 2nd verify) + enumerate the second consumer `src/text_quality.rs:96` | §3 Step 4 D2 + §7 commit 4 file count |
| verifier M-1 | MEDIUM | Add an expected workspace test count delta table after the §4 verifier checklist (+6 unit + 1 integration, Option A / +5 + 0, Option B) | §4 (sub-section) |
| verifier M-2 | MEDIUM | Update the B2 acceptance phrasing: state "Step 2 commit time" explicitly and spell out the grep timing per sub-action | §3 Step 2 B2 acceptance |
| verifier M-3 | MEDIUM | State the command for the "mechanical migration of the 10 existing ctor sites" in C2 Option A | §3 Step 3 C1 (b) |
| verifier L-1 | LOW | pdf_page_v1.rs line range 200-289 → 205-289 (same edit pass as critic L-1) | §3 Step 2 B1 |
| verifier L-2 | LOW | caller line range 155-185 → 149-186 | §3 Step 2 B2 |
| verifier L-3 | LOW | Bolster the arithmetic in the B5 test scenario comment (target=1500 bytes + overlap=240 bytes) | §3 Step 2 B5 |
| verifier L-4 | LOW | Confirmed that the B3 grep scope is safe for `ingest_pdf_ocr_smoke.rs` (no separate action; the finding itself = no action) | (verified safe) |
| verifier NIT-1 | NIT | E1: make the `df -h` unit handling precise → force GB units with `df -BG --output=avail` | §3 Step 5 E1 (a) |
| verifier NIT-2 | NIT | Step 4 commit scope `test-fixture` → `parse-pdf` (crate name) | §3 Step 4 commit |
| verifier NIT-3 | NIT | Single definition of the canonical dogfood config path (in §0), referenced by every acceptance command | §0 pre-flight + §3 Step 5 E3 |
### Summary
- frontmatter `status` `draft (round 0)` → `draft (round 1c)`.
- Add 3 lines to frontmatter `review_history`: round 1 critic + verifier + round 1c rewrite.
- Correct 2 tokens in the prelude statement at plan body line 15 (sub-action count + commit boundary wording).
- Add 1 bullet on the assumed dogfood KB layout to §0 pre-flight.
- Flesh out the sub-action bodies of the 5 steps in §3 (file path / acceptance grep / mechanic / migration cost).
- Add the expected test count delta sub-section to the §4 verifier checklist.
- Mark §6 OQ-2 as closed (✅ CLOSED).
- Update the file counts in the §7 sequencing summary (commit 2: 2→4, commit 3: 3-4→4, commit 4: 3-4→5).
- Populate the §8 round 1c rewrite changelog (this section).
Total plan body line change = ~+250 net added (round 0 698 lines → round 1c ~950 lines).


@@ -0,0 +1,388 @@
---
title: "v0.20.0 sub-item 1 bugfix round 2 — plan"
created: 2026-05-27
status: "DRAFT round 0"
spec_path: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
spec_status: ACCEPT (round 1c, 308 line)
critic_round_1: .omc/reviews/2026-05-27-v0.20-bugfix2-spec-critic-r1-result.md
critic_round_2: .omc/reviews/2026-05-27-v0.20-bugfix2-spec-critic-r2-result.md
step_count: 4 (3 commit + 1 sanity-only)
commit_count: 3
branch: feat/pdf-scanned-ocr
head_at_draft: e674ff4
---
# v0.20.0 sub-item 1 bugfix round 2 — plan
## §0 Overview
Implementation map for the ACCEPTed spec (round 1c, 9/9 critic findings applied). Fix scope = 2 bugs:
- **Bug #6** (Critical): `?Identity-H Unimplemented?` mojibake marker bypass. Fix = marker strip + dominance heuristic in `crates/kebab-parse-pdf/src/text_quality.rs::compute_valid_char_ratio()`.
- **Bug #7** (Minor doc): `code` is missing from the `--media` value list in `kebab search --help`. Fix = update the clap doc-comment and SKILL.md in sync + add a CLI help regression test.
Total: 5 changed files + 1 new test file + HOTFIXES entry. Branch = `feat/pdf-scanned-ocr` (HEAD = `e674ff4`; the 3 commits of round 2 are appended on top of the 4 stacked commits of round 1). env = `CARGO_TARGET_DIR=/build/out/cargo-target/target`. Fresh release binary = `/build/out/cargo-target/target/release/kebab`.
**non-scope (spec §2.2 + §5 OQ-4 of this plan)**: beyond the surface of the ACCEPTed spec, `crates/kebab-mcp/src/tools/search.rs:44`, `crates/kebab-core/src/search.rs:32+52`, `crates/kebab-app/src/ingest_progress.rs:69`, and `crates/kebab-cli/tests/wire_schema_breakdowns.rs:35` hold the same stale value list (`markdown, pdf, image, audio, other`, with `code` missing). They lie outside the frozen grep boundary of the spec (`integrations/` + `crates/kebab-cli/src` + `docs/wire-schema/v1`), so they are not commit targets for this round; a follow-up issue is recommended.
---
## §1 Step table
| Step | Title | Scope summary | Commit subject | Files touched |
|------|-------|----------------|----------------|----------------|
| 1 | Bug #6 implementation | `MOJIBAKE_MARKERS` const + `compute_valid_char_ratio()` rewrite + 2 new unit test + HOTFIXES entry | `fix(parse-pdf): strip Identity-H Unimplemented marker + dominance heuristic in compute_valid_char_ratio (Bug #6)` | `crates/kebab-parse-pdf/src/text_quality.rs`, `tasks/HOTFIXES.md` |
| 2 | Bug #7 doc-comment + SKILL.md | add `code` to the `--media` value list in the clap doc-comment + sync SKILL.md line 57 | `docs(cli): list 'code' in --media help string + SKILL.md (Bug #7)` | `crates/kebab-cli/src/main.rs`, `integrations/claude-code/kebab/SKILL.md` |
| 3 | Bug #7 CLI help assertion | `search_help_lists_code_in_media_values` test in the new test file `crates/kebab-cli/tests/cli_help_smoke.rs` | `test(cli): assert 'code' in search --help output (Bug #7 regression pin)` | `crates/kebab-cli/tests/cli_help_smoke.rs` (new) |
| 4 | Final sanity (no commit) | workspace test + workspace clippy + optional dogfood retest | — | none |
---
## §2 Per-step detail
### Step 1 — Bug #6 implementation
#### §2.1 Files affected
- `crates/kebab-parse-pdf/src/text_quality.rs` (currently 103 lines: lines 1-37 body, lines 39-103 tests).
- `tasks/HOTFIXES.md` (dated entry append).
#### §2.2 Action
**§2.2.1**: **rewrite** `text_quality.rs` lines 1-18 (file header comment + `compute_valid_char_ratio` body) per the diff in spec §4.1. Additions:
- New const `MOJIBAKE_MARKERS: &[&str] = &["?Identity-H Unimplemented?"]` (at lines 8-12, with a comment tracing the lopdf 0.32.0 source).
- The `compute_valid_char_ratio()` body becomes 4 stages: marker strip → trim-empty zero → dominance cap-0.3 → existing ratio computation.
- `is_valid_text_char()` (lines 20-37): **unchanged** (signature + range list preserved).
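
One possible shape of the 4-stage body, as a rough sketch only: the authoritative diff is spec §4.1, which is not reproduced here. Assumptions are labeled in the comments: `is_valid_text_char` below is a simplified stand-in whose ranges are inferred from the existing test names (ASCII printable, Hangul Jamo, Hangul syllables, CJK ideographs), the `f64` return type is a guess, and "dominance" is read as "more characters were stripped than remain".

```rust
/// Marker strings that lopdf emits in place of undecodable text.
const MOJIBAKE_MARKERS: &[&str] = &["?Identity-H Unimplemented?"];

/// ASSUMPTION: simplified stand-in for the real range list in text_quality.rs.
fn is_valid_text_char(c: char) -> bool {
    matches!(
        c as u32,
        0x0020..=0x007E      // ASCII printable
            | 0x1100..=0x11FF // Hangul Jamo
            | 0x4E00..=0x9FFF // CJK Unified Ideographs
            | 0xAC00..=0xD7A3 // Hangul Syllables
    )
}

/// Sketch of the 4 stages: strip → trim-empty zero → dominance cap → ratio.
fn compute_valid_char_ratio(text: &str) -> f64 {
    // Stage 1: strip every known marker.
    let mut remainder = text.to_owned();
    for marker in MOJIBAKE_MARKERS {
        remainder = remainder.replace(marker, "");
    }
    let remaining = remainder.chars().count();
    let stripped = text.chars().count() - remaining;

    // Stage 2: nothing but markers/whitespace left → treat as unreadable.
    if remainder.trim().is_empty() {
        return 0.0;
    }

    // Stage 4 (computed first so stage 3 can cap it): valid / total on the remainder.
    let valid = remainder.chars().filter(|c| is_valid_text_char(*c)).count();
    let ratio = valid as f64 / remaining as f64;

    // Stage 3: markers dominated the page → cap below the 0.5 OCR threshold.
    if stripped > remaining {
        ratio.min(0.3)
    } else {
        ratio
    }
}
```
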
**§2.2.2**: **append** 2 new tests to the `text_quality.rs::tests` module (lines 39-103):
```rust
#[test]
fn identity_h_marker_dominance_caps_ratio_below_threshold() {
let s = format!("Page 1 of 5 {}", "?Identity-H Unimplemented?".repeat(20));
let r = compute_valid_char_ratio(&s);
assert!(r <= 0.3, "marker-dominant mixed page → ratio ≤ 0.3 (OCR fallback); got {r}");
}
#[test]
fn identity_h_marker_minority_with_long_valid_text_keeps_high_ratio() {
let header = "x".repeat(200);
let s = format!("{header} ?Identity-H Unimplemented?");
let r = compute_valid_char_ratio(&s);
assert!(r > 0.9, "marker-minority page keeps high ratio; got {r}");
}
```
**Important: correction to the spec §4.2 wording (critic r2 NEW-1)**: the phrase "Replace existing Bug #6 test set with two new tests" in spec §4.2 is stale. The current `text_quality.rs::tests` holds 8 tests, **none of which relate to the Identity-H marker**, so the net change = **+2 / -0**. The instruction in brief §2.1 to "remove the existing test `identity_h_marker_mixed_with_some_real_text_low_ratio`" is equally stale: that test does not exist. The executor **preserves** all 8 existing tests (`empty_string_zero`, `pure_ascii_one`, `pure_hangul_syllables_one`, `pure_pua_zero`, `mixed_half`, `cjk_ideograph_valid`, `hangul_jamo_valid`, `f4_fixture_ratio_under_threshold`).
**§2.2.3**: append an entry above the latest dated section of `tasks/HOTFIXES.md`:
```markdown
## 2026-05-27 — Identity-H mojibake marker bypassed OCR fallback (Bug #6)
- **Symptom**: ingest of `metro-korea.pdf` (Identity-H CID font without ToUnicode CMap) finished with `pdf_ocr_pages=0`. The entire text was the `?Identity-H Unimplemented?` marker repeated 1154 times (emitted by lopdf 0.32.0). text-detect ratio = 1.0 → bypasses the OCR fallback threshold of 0.5.
- **Root cause**: in `crates/kebab-parse-pdf/src/text_quality.rs::compute_valid_char_ratio()`, `is_valid_text_char()` treats the ASCII printable range (0x0020..=0x007E) as unconditionally valid. The marker (26 ASCII chars) is counted as valid.
- **Fix**: introduce the `MOJIBAKE_MARKERS` const + marker strip; after the strip, trim-empty → 0.0 + dominance heuristic (cap at 0.3 when stripped > remainder). spec ACCEPT: `docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md` §4.1. Zero impact on parser_version/wire schema.
- **User action**: if mojibake-heavy PDFs of the `metro-korea.pdf` class were already indexed with the v0.20.0 pre-bugfix2 binary, invalidate the cached skip with `kebab ingest --force-reingest <workspace>` (same guidance as in the release notes).
```
#### §2.3 Acceptance
actionable verify command (per-step):
```bash
# A) 2 new text_quality tests + 8 existing = 10, all green
CARGO_TARGET_DIR=/build/out/cargo-target/target cargo test -p kebab-parse-pdf text_quality -j 4 2>&1 | tail -10
# B) parse-pdf crate clean compile
CARGO_TARGET_DIR=/build/out/cargo-target/target cargo build -p kebab-parse-pdf -j 4 2>&1 | tail -3
# C) parse-pdf clippy clean (-D warnings)
CARGO_TARGET_DIR=/build/out/cargo-target/target cargo clippy -p kebab-parse-pdf --all-targets -j 4 -- -D warnings 2>&1 | tail -5
```
Expected: tail of A = `test result: ok. 10 passed; 0 failed`, B = `Finished`, C = 0 warnings.
#### §2.4 Commit
```bash
git add crates/kebab-parse-pdf/src/text_quality.rs tasks/HOTFIXES.md
git commit -m "$(cat <<'EOF'
fix(parse-pdf): strip Identity-H Unimplemented marker + dominance heuristic in compute_valid_char_ratio (Bug #6)
Why: ingest of metro-korea.pdf (Identity-H CID font without ToUnicode
CMap) wrongly finished with pdf_ocr_pages=0. The marker emitted by
lopdf 0.32.0, `?Identity-H Unimplemented?` (28 ASCII chars), passes the
0x0020..=0x007E range in is_valid_text_char(), so ratio=1.0 and the
0.5 OCR fallback threshold is bypassed.
Change: MOJIBAKE_MARKERS const + 4-stage compute_valid_char_ratio()
(strip -> trim-empty zero -> dominance cap-0.3 -> existing ratio). The
marker list is extensible. is_valid_text_char() body is unchanged.
Tests: +2 unit (dominance + minority) on top of the existing 8. No
parser_version / wire schema change.
Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
§4.1 / §4.2 / §6 R-1.
EOF
)"
```
---
### Step 2 — Bug #7 doc-comment + SKILL.md
#### §2.5 Files affected
- `crates/kebab-cli/src/main.rs` lines 158-160 (measured: the 3-line clap doc-comment on SearchArgs `media`).
- `integrations/claude-code/kebab/SKILL.md` line 57.
#### §2.6 Action
**§2.6.1**: edit the doc-comment at `crates/kebab-cli/src/main.rs` lines 158-160:
```diff
/// p9-fb-36: filter by `assets.media_type` kind. Comma-separated.
- /// Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`,
- /// `image`, `audio`, `other`. Unknown values match nothing.
+ /// Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`,
+ /// `image`, `audio`, `code`, `other`. Unknown values match nothing.
```
(Correction per critic r2 NEW-2: spec §4.3 shows a 1-line form, whereas the actual clap doc-comment spans 3 lines. Keep the file's existing multi-line layout and insert `code` between `audio` and `other` on line 160.)
**§2.6.2**: edit `integrations/claude-code/kebab/SKILL.md` line 57:
```diff
-`media` (string array — IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"other"`; alias `"md"` → `"markdown"`)
+`media` (string array — IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"code"` | `"other"`; alias `"md"` → `"markdown"`)
```
#### §2.7 Acceptance
```bash
# A) cli crate compiles cleanly (doc-comment edit, so no compile impact expected)
CARGO_TARGET_DIR=/build/out/cargo-target/target cargo build -p kebab-cli -j 4 2>&1 | tail -3
# B) grep SKILL.md for the `"code"` substring
grep -nF '"code"' integrations/claude-code/kebab/SKILL.md
# C) search --help of the fresh binary lists `code`
CARGO_TARGET_DIR=/build/out/cargo-target/target cargo build --release -p kebab-cli -j 4 2>&1 | tail -3
/build/out/cargo-target/target/release/kebab search --help 2>&1 | grep -F 'code'
```
Expected: A = `Finished`, B = 1 hit on line 57, C = at least 1 line containing `code`.
#### §2.8 Commit
```bash
git add crates/kebab-cli/src/main.rs integrations/claude-code/kebab/SKILL.md
git commit -m "$(cat <<'EOF'
docs(cli): list 'code' in --media help string + SKILL.md (Bug #7)
Why: kebab search --media code has been functionally supported since
v0.18.0 (handled first-class via a path outside MEDIA_KINDS;
schema.v1.media_breakdown.code exists). However, the value lists in the
SearchArgs clap doc-comment and SKILL.md line 57 are stale: `code` is
missing. A user reading only --help could conclude code is unsupported.
Change: sync 2 surfaces: the multi-line clap doc-comment at main.rs
lines 158-160 and integrations/claude-code/kebab/SKILL.md line 57.
No Rust binary surface / wire schema change.
Out of scope (follow-up): crates/kebab-mcp/src/tools/search.rs:44,
crates/kebab-core/src/search.rs:32+52, crates/kebab-app/src/
ingest_progress.rs:69, crates/kebab-cli/tests/wire_schema_breakdowns.rs:35
carry the same stale list. They fall outside the grep boundary of the
ACCEPTed spec (round 1c), so this round does not include them.
Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
§4.3 / §4.3a.
EOF
)"
```
---
### Step 3 — Bug #7 CLI help assertion test
#### §2.9 Files affected
- `crates/kebab-cli/tests/cli_help_smoke.rs` (new; not present in the existing file list).
#### §2.10 Action
Create a new file. Follow the existing test conventions (`cli_*` prefix, `Command::new(env!("CARGO_BIN_EXE_kebab"))` pattern; see `cli_readonly_quiet.rs`, `cli_schema.rs`):
```rust
// crates/kebab-cli/tests/cli_help_smoke.rs
//
// Regression pin: the `--media` value list in `kebab search --help`
// must list `code`. Bug #7 (v0.20.0 bugfix round 2 spec §4.4).
#[test]
fn search_help_lists_code_in_media_values() {
    let out = std::process::Command::new(env!("CARGO_BIN_EXE_kebab"))
        .args(["search", "--help"])
        .output()
        .expect("kebab search --help");
    let stdout = String::from_utf8_lossy(&out.stdout);
    assert!(
        stdout.contains("`code`"),
        "search --help must list 'code' as accepted --media value; stdout = {stdout}"
    );
}
```
#### §2.11 Acceptance
```bash
# A) build + run the new test target
CARGO_TARGET_DIR=/build/out/cargo-target/target cargo test -p kebab-cli --test cli_help_smoke -j 4 2>&1 | tail -10
# B) cli crate test targets compile cleanly (all)
CARGO_TARGET_DIR=/build/out/cargo-target/target cargo build -p kebab-cli --tests -j 4 2>&1 | tail -3
# C) cli clippy clean (-D warnings), including the new test file
CARGO_TARGET_DIR=/build/out/cargo-target/target cargo clippy -p kebab-cli --all-targets -j 4 -- -D warnings 2>&1 | tail -5
```
Expected: A = `test result: ok. 1 passed; 0 failed`, B = `Finished`, C = 0 warnings.
#### §2.12 Commit
```bash
git add crates/kebab-cli/tests/cli_help_smoke.rs
git commit -m "$(cat <<'EOF'
test(cli): assert 'code' in search --help output (Bug #7 regression pin)
Why: the Step 2 doc-comment edit risks disappearing silently if someone
later reorders the value list or splits it into an alias section. clap
renders --help from the doc-comment's free-form text, so a grep-only
smoke test is the only means of detection.
Change: new test file (follows the kebab-cli `cli_*` prefix convention).
Runs the fresh binary via CARGO_BIN_EXE_kebab and asserts the `code`
substring in stdout. Maps 1:1 to the spec §4.4 acceptance row.
Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md §4.4
/ §5 (acceptance row 4).
EOF
)"
```
---
### Step 4 — Final sanity (no commit)
#### §2.13 Scope
After the 3 commits are appended, verify the whole workspace and optionally run the dogfood retest. **No commit is created** (0 code changes; verification only).
#### §2.14 Acceptance
```bash
# A) full workspace test run: 1316 existing + this round's +2 unit and +1 cli = 1319 expected
CARGO_TARGET_DIR=/build/out/cargo-target/target cargo test --workspace --no-fail-fast -j 1 2>&1 | tee /tmp/v0.20-bugfix2-test.log | tail -15
echo "exit = ${PIPESTATUS[0]:-$?}"
# B) workspace clippy clean (-D warnings)
CARGO_TARGET_DIR=/build/out/cargo-target/target cargo clippy --workspace --all-targets -j 4 -- -D warnings 2>&1 | tail -8
# C) (optional) dogfood retest: metro-korea.pdf
# A fresh binary build was already completed in the Step 2 acceptance.
# After --force-reingest, observe pdf_ocr_pages change from 0 to 21+.
# OCR latency costs about 10 min; the plan drafter marks this optional for the executor.
# Can be skipped when the measured corpus is user-private (KEBAB_WORKSPACE or ~/Documents/test/).
```
Expected: A = `test result: ok. <N> passed; 0 failed` (N ≥ 1319), B = 0 warnings, C = user's choice (evaluated in verifier round 0).
#### §2.15 Commit
None (sanity only). Once sanity is green, the executor proceeds to the PR push stage.
---
## §3 Verifier checklist (cumulative)
Maps 1:1 to the 7-row acceptance criteria in spec §5. Actionable commands for verifier round 0:
| # | Spec §5 criterion | Verifier command | Step coverage | Pass condition |
|---|-------------------|------------------|---------------|----------------|
| 1 | `identity_h_marker_dominance_caps_ratio_below_threshold` green | `cargo test -p kebab-parse-pdf identity_h_marker_dominance_caps_ratio_below_threshold -j 4 2>&1 \| tail -3` | Step 1 | `1 passed; 0 failed` |
| 2 | `identity_h_marker_minority_with_long_valid_text_keeps_high_ratio` green | `cargo test -p kebab-parse-pdf identity_h_marker_minority_with_long_valid_text_keeps_high_ratio -j 4 2>&1 \| tail -3` | Step 1 | `1 passed; 0 failed` |
| 3 | Existing 8 text_quality tests green (0 regressions) | `cargo test -p kebab-parse-pdf text_quality -j 4 2>&1 \| tail -5` | Step 1 | `10 passed; 0 failed` (8 existing + 2 new) |
| 4 | `search_help_lists_code_in_media_values` green | `cargo test -p kebab-cli --test cli_help_smoke -j 4 2>&1 \| tail -3` | Step 3 | `1 passed; 0 failed` |
| 5 | `"code"` substring present in SKILL.md | `grep -nF '"code"' integrations/claude-code/kebab/SKILL.md` | Step 2 | 1 hit on line 57 |
| 6 | Full workspace tests green | `cargo test --workspace --no-fail-fast -j 1 2>&1 \| tail -10` | Step 4 | `0 failed`, N ≥ 1319 |
| 7 | workspace clippy clean (-D warnings) | `cargo clippy --workspace --all-targets -j 4 -- -D warnings 2>&1 \| tail -5` | Step 4 | 0 warnings |
| 8 (optional) | dogfood retest: `pdf_ocr_pages` of metro-korea.pdf goes 0 → 21+ | manual: run `kebab ingest --force-reingest <ws>`, then inspect `items[].pdf_ocr_pages` in ingest_report.v1 | Step 4 | `pdf_ocr_pages > 0` for metro-korea.pdf row |
The executor passes the PR push gate when rows 1-7 are all green. Row 8 is optional for verifier round 0 (weigh user corpus availability against the 10 min cost).
---
## §4 Risks resolution (plan-level view of spec §6)
| ID | Spec §6 status | Plan-level action |
|----|----------------|--------------------|
| R-1 | resolved per critic r1 (lopdf 0.32.0 = 1 marker entry) | The source comment in §2.2.1 of this plan triggers re-verification on a lopdf upgrade. |
| R-2 | resolved (covered by `trim().is_empty()`) | Stage 2 of the 4 stages in the Step 1 implementation (§2.2.1) is the trim-empty zero. |
| R-3 | resolved (no wire schema change) | parser_version `"pdf-text-v1"` / chunker_version `"pdf-page-v1.1"` preserved. No version cascade impact (CLAUDE.md §Versioning cascade). |
| R-4 | resolved per critic r1 (grep boundary = `integrations/` + `crates/kebab-cli/src` + `docs/wire-schema/v1`) | Step 2 covers both surfaces within spec scope. **Additional findings (out of scope)**: see §5 OQ-4. |
| R-5 | resolved (no impact, via the `bulk.rs:161` alias normalize) | No behavior change in this plan. |
Additional risk identified by the plan drafter:
- **R-6 (NEW)**: the optional dogfood retest in Step 4 depends on `KEBAB_WORKSPACE` or a user-private corpus. It cannot be verified in CI, so verifier round 0 is advised to mark row 8 as skipped when evidence is absent.
---
## §5 Open questions for executor
Because the spec ACCEPT is unambiguous, OQ-1/2/3 are all resolved. Additional items identified by the plan drafter:
- **OQ-4 (NEW)**: per the R-4 grep in spec §2.2, surfaces outside the frozen boundary carry the same stale list:
  - `crates/kebab-mcp/src/tools/search.rs:44`: the MCP tool's `--media` doc.
  - `crates/kebab-core/src/search.rs:32`: `MEDIA_KINDS` const = `&["markdown", "pdf", "image", "audio", "other"]`. Caution: this const may be functional. `code` has been handled first-class via a separate path since v0.18.0 (`schema.v1.media_breakdown.code` confirmed to exist per spec §1.2). Editing the const itself carries behavior-change risk, so it is split into a separate spec.
  - `crates/kebab-core/src/search.rs:52`: `MediaFilter::media` doc-comment.
  - `crates/kebab-app/src/ingest_progress.rs:69`: progress label doc-comment.
  - `crates/kebab-cli/tests/wire_schema_breakdowns.rs:35`: test fixture array (functional; changing it alters the test's meaning).
  **executor action**: not included in this round. Recommended: add one line to the PR description or the Step 2 commit body, "follow-up: open issue for stale --media value list in 5 additional surfaces".
- **OQ-5 (NEW)**: UX consequence from spec §6. The `--force-reingest` advice for pre-bugfix2 v0.20.0 users must go into the release notes, which is work for a separate phase (PR review/merge time). The HOTFIXES entry from Step 1 §2.2.3 is part of the user-facing surface, and the release notes can quote its user action item.
---
## §6 References
- **Spec ACCEPT (parent contract)**: `docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md` (308 lines, round 1c).
- **Critic round 1**: `.omc/reviews/2026-05-27-v0.20-bugfix2-spec-critic-r1-result.md` (H-1 + M-1/M-2/M-3 + L-1/L-2 + NIT-1/NIT-2 + invariant audit; all 9 findings applied to the spec).
- **Critic round 2**: `.omc/reviews/2026-05-27-v0.20-bugfix2-spec-critic-r2-result.md` (NEW-1 = §4.2 stale arithmetic, NEW-2 = §4.3 scope description drift; corrections applied in §2.2.2 and §2.6.1 of this plan).
- **Plan drafter brief**: `.omc/reviews/2026-05-27-v0.20-bugfix2-plan-drafter-brief.md`.
- **Parent design**: `docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md` §1.3 (text-detect threshold metric), §9 (version cascade).
- **Round 1 history**: branch `feat/pdf-scanned-ocr` HEAD = `e674ff4`, 4 commits stacked (Bug #2 source-fs, Bug #3 chunk_id collision, Bug #3 test, Bug #4 pikepdf F4 fixture).
- **Code locations (measured line numbers)**:
  - `crates/kebab-parse-pdf/src/text_quality.rs:1-103` (entire file).
  - `crates/kebab-cli/src/main.rs:158-160` (SearchArgs `media` clap doc-comment, 3-line multi-line attribute).
  - `integrations/claude-code/kebab/SKILL.md:57` (search input filter description).
  - `crates/kebab-cli/tests/cli_help_smoke.rs` (new, Step 3).
- **External source**: `lopdf-0.32.0/src/document.rs:523` (`Document::decode_text` — sole emitter of `?Identity-H Unimplemented?`).
---
## §7 Constraints (spec §9 + brief §9)
1. **No branch changes**: the plan itself is documentation only. This file is the plan deliverable.
2. **Spec ACCEPT is frozen**: the round 1c body is not amended. The wording corrections in §2.2 / §2.6 of this plan (`Replace existing` → `+2 / -0 additive`) are recorded as local notes in the plan; the spec body is unchanged.
3. **0 regressions**: workspace test N ≥ 1319.
4. **No wire schema / version cascade change**: `parser_version="pdf-text-v1"`, `chunker_version="pdf-page-v1.1"` preserved.
5. **Skip subagents**: the executor runs in-session on a single thread (worker protocol per task assignment).
6. **Lightweight scope**: line target for this plan = 200-400 (less than 1/3 of the 849-line round 1 plan).
**Status**: DRAFT round 0, awaiting verifier review.


@@ -0,0 +1,965 @@
---
title: v0.20.0 sub-item 1 bugfix — chunk_id collision + walker code limit + F4 fixture
created: 2026-05-27
status: ACCEPT (round 2 closure — Phase A complete)
target_version: 0.20.0 (PR #189 force-update)
parent_spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
dogfood_evidence: .omc/reviews/2026-05-27-v0.20-dogfood-report.md
review_history:
- "2026-05-27 spec round 1 critic (opus, thorough) — ACCEPT, HIGH 0 + MEDIUM 3 + LOW 2 + NIT 2"
- "2026-05-27 spec round 1c rewrite (opus, drafter) — MEDIUM/LOW/NIT all applied"
- "2026-05-27 spec round 2 closure critic (opus) — ACCEPT, 7/7 applied + 1 NIT (frontmatter status, applied here)"
---
# v0.20.0 sub-item 1 bugfix — chunk_id collision + walker code limit + F4 fixture
This spec documents the root cause analysis, fix design, and acceptance
criteria for the 3 bugs found while dogfooding PR #189 of v0.20.0 sub-item 1
(scanned PDF OCR). It is the source for the subsequent plan and executor stages.
## §1 Background + dogfood evidence chain
### §1.1 Dogfood environment
| Item | Value |
|------|----|
| Binary | `kebab v0.20.0` (commit `b4d9e60`) |
| Ollama endpoint | `http://192.168.0.47:11434` (qwen2.5vl:3b) |
| Isolated KB | `/build/cache/tmp/v0.20-dogfood/` |
| Corpus | 9 PDF (PoC + sub-item fixture + 3 user PDF, 466 KB ~ 58 MB) |
### §1.2 Reproducibility of the 3 bugs
| Bug | Severity | Trigger | Reproducible |
|-----|----------|---------|--------------|
| #2 walker code limit | Important | ingest of a 256 KB+ PDF/image/markdown | always (default config) |
| #3 chunk_id collision | **Critical** | ingest of scanned_page2.pdf (1580 OCR chars) | on every force-reingest |
| #4 F4 Pages tree | Important | ingest of mojibake.pdf (F4 fixture) | always |
### §1.3 Quotes from the dogfood report
Key quotes from the dogfood report (`.omc/reviews/2026-05-27-v0.20-dogfood-report.md`):
- Bug #2: `scanned=3, skipped_size_exceeded=6`. Only 3 of the 9 workspace PDFs pass;
6 PDFs (F1 466KB / F2 756KB / metro 57MB / thermal-pos 1.1MB / thermal-label
2.7MB / internals 820KB) are skipped at the walker stage.
- Bug #3: `"DocumentStore::put_chunks (pdf): sqlite error: UNIQUE constraint
failed: chunks.chunk_id: ... Error code 1555: A PRIMARY KEY constraint
failed"`, raised at the chunk INSERT stage for scanned_page2.pdf.
- Bug #4: `block_count: 0, chunk_count: 0`. The ingest result for F4 mojibake.pdf
violates the PdfTextExtractor "1 paragraph per page" invariant.
## §2 Goals + non-goals
### §2.1 Goals
- Fix all 3 bugs within v0.20.0 (PR #189 force-update path: the new commits stack
on the same branch).
- Preserve the invariants of the parent spec
(`docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md`):
  - §1.4: PdfTextExtractor emits "1 Block::Paragraph per page".
  - §3.5: post-extract OCR enrichment preserves block_id (in-place mutate path).
  - §4.6: wire schema changes are additive only (no V00X migration needed).
- No conflict with the design decisions of the round 1c ACCEPT parent plan
(`docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md`).
- 0 workspace test regressions (`cargo test --workspace -j 1`).
### §2.2 Non-goals
- A new wire schema major bump (v1 → v2): these fixes add 0 schema changes.
- A new V00X sqlite migration: the `chunks` table DDL is unchanged; the fix is
limited to the chunk_id computation path.
- Changing the F4 fixture invariants (the requirements of no ToUnicode CMap and a
valid 1-page PDF stay).
- Adding new config knobs (per-media-type limits such as `[ingest.pdf].max_file_bytes`
are v0.21+ scope; this fix only separates the walker's code path).
## §3 Bug #2 — walker code limit
### §3.1 Root cause (file:line evidence)
`crates/kebab-source-fs/src/connector.rs:42-72`: in `Config::new`,
`FsSourceConnector` reads `max_file_bytes` and `max_file_lines` from the single
`config.ingest.code` namespace:
```rust
Ok(Self {
    default_root: root,
    default_exclude: config.workspace.exclude.clone(),
    copy_threshold_bytes,
    skip_generated_header: config.ingest.code.skip_generated_header,
    max_file_bytes: config.ingest.code.max_file_bytes, // <-- code-specific
    max_file_lines: config.ingest.code.max_file_lines, // <-- code-specific
})
```
`crates/kebab-source-fs/src/connector.rs:169-190`: the walker's size check calls
`is_oversized(...)` regardless of the path's media type:
```rust
if crate::code_meta::is_oversized(
    &abs_path,
    self.max_file_bytes, // generic limit, applied to every file
    self.max_file_lines,
).unwrap_or(false) {
    fs_skips.skipped_size_exceeded =
        fs_skips.skipped_size_exceeded.saturating_add(1);
    // ...
    continue;
}
```
`crates/kebab-source-fs/src/code_meta.rs:114-129`: `is_oversized(...)` itself is a
generic helper (extension-agnostic):
```rust
pub(crate) fn is_oversized(path: &Path, max_bytes: u64, max_lines: u32) -> Result<bool> {
    let meta = std::fs::metadata(path)?;
    if meta.len() > max_bytes {
        return Ok(true);
    }
    // line cap (streaming)
    ...
}
```
`crates/kebab-config/src/lib.rs:535-547`: `IngestCodeCfg::default()` sets
`max_file_bytes = 262_144` (256 KB), which most PDFs and images exceed.
### §3.2 Decision matrix
| Option | Description | Pros | Cons |
|--------|------|------|------|
| **A: code path only** | Apply the walker's size check only to code files (extension recognized by `code_lang_for_path`) | Simple / preserves existing default behavior / resolves Bug #2 immediately | No size limit for PDF/image/markdown; even a 1 GB PDF passes the walker |
| B: per-type config | Add new `[ingest.pdf]`, `[ingest.image]`, `[ingest.markdown]` sections with per-type limits | user-tunable | 3 new config fields + serde defaults + env overrides + tests; exceeds v0.20 hotfix scope |
| C: generic limit + docs note | Keep the same generic limit but document the intent | 0 code changes | UX bug unresolved (the dogfood workaround config would be forced on production) → **rejected** |
### §3.3 Chosen path — Option A
The walker's size cap is intended to be code-specific. PDF/image/markdown sizes
are validated by their own parsers (lopdf's load_mem for PDF handles files over
256 KB normally, and the image OCR call caps itself via max_pixels). If per-type
config becomes necessary in v0.21+, evolve to Option B.
Add an `is_code_file(path: &Path) -> bool` helper:
- `code_meta::code_lang_for_path(path).is_some()` means a code file. Reusing the
existing helper guarantees mapping consistency (the Tier 1 + Tier 2 basename +
extension list is exactly identical).
### §3.4 Implementation (Rust diff)
`crates/kebab-source-fs/src/code_meta.rs`: add a `pub(crate)` helper:
```rust
/// Returns true when `path`'s filename/extension is recognised as a code file
/// (per `code_lang_for_path`). Used by the walker to apply
/// `[ingest.code].max_file_bytes` / `max_file_lines` only to code files,
/// not to PDF/image/markdown (which have their own size controls in their
/// respective parsers).
pub(crate) fn is_code_file(path: &Path) -> bool {
    code_lang_for_path(path).is_some()
}
```
`crates/kebab-source-fs/src/connector.rs:168-190`: add a conditional to the walker:
```rust
// p10-1A-1: apply per-file generated-header + size-cap checks on files
// that passed the override (gitignore/builtin/kebabignore) matching.
// v0.20.0 sub-item 1 bugfix: size-cap (max_file_bytes / max_file_lines)
// applies ONLY to code files. PDF/image/markdown bypass it; their parsers
// have their own size controls.
if crate::code_meta::is_code_file(&abs_path)
    && crate::code_meta::is_oversized(
        &abs_path,
        self.max_file_bytes,
        self.max_file_lines,
    )
    .unwrap_or(false)
{
    fs_skips.skipped_size_exceeded =
        fs_skips.skipped_size_exceeded.saturating_add(1);
    push_sample(
        &mut fs_skips.skip_examples.size_exceeded,
        &abs_path,
        &root,
    );
    tracing::debug!(
        path = %rel_path.display(),
        max_bytes = self.max_file_bytes,
        max_lines = self.max_file_lines,
        "skip: code file exceeds size cap"
    );
    continue;
}
```
Conditional application of `skip_generated_header` is a separate matter: the
generated-header sniff inspects only the ASCII content of the first 512 bytes,
regardless of path extension. The first 512 bytes of a PDF or image binary
practically never contain an ASCII string such as "do not edit", so false
positives are not a concern. **No walker conditional is needed for
`is_generated_file`**; existing behavior stays.
### §3.5 Test additions
Add to the existing test module in `crates/kebab-source-fs/src/connector.rs`:
```rust
#[test]
fn size_cap_skips_only_code_files() {
    let dir = tempfile::tempdir().unwrap();
    let root = dir.path();
    // 300 KB PDF (binary), 300 KB markdown (text), 300 KB Rust (code).
    let big_blob: Vec<u8> = vec![b'x'; 300_000];
    std::fs::write(root.join("paper.pdf"), &big_blob).unwrap();
    std::fs::write(root.join("notes.md"), &big_blob).unwrap();
    std::fs::write(root.join("big.rs"), &big_blob).unwrap();
    let conn = FsSourceConnector::new(
        &cfg_with_size_cap(root.to_str().unwrap(), 262_144, 5_000),
    )
    .unwrap();
    let (assets, skips) = conn.scan_with_skips(&SourceScope::default()).unwrap();
    let paths: Vec<_> = assets.iter().map(|a| a.workspace_path.0.clone()).collect();
    // PDF + Markdown pass through walker.
    assert!(paths.contains(&"paper.pdf".to_string()));
    assert!(paths.contains(&"notes.md".to_string()));
    // Code file gets skipped.
    assert!(!paths.contains(&"big.rs".to_string()));
    assert!(
        skips.skip_examples.size_exceeded.iter().any(|p| p.contains("big.rs")),
        "size_exceeded examples should contain only big.rs: {:?}",
        skips.skip_examples.size_exceeded
    );
    assert!(
        !skips.skip_examples.size_exceeded.iter().any(|p| p.contains("paper.pdf")),
        "PDF must NOT appear in size_exceeded examples: {:?}",
        skips.skip_examples.size_exceeded
    );
}
```
In addition, the fixture of the existing test
`ingest_report_counts_oversized_files_by_bytes` is named `huge.rs`, so its invariant
is preserved. The same holds for `ingest_report_size_cap_by_line_count`, whose
fixture is `longfile.rs`.
## §4 Bug #3 — chunk_id collision (Critical)
### §4.1 Root cause investigation
#### §4.1.1 The chunker's collision-avoidance workaround
The module doc at `crates/kebab-chunk/src/pdf_page_v1.rs:47-60` describes the
collision avoidance:
```
Design §4.2's chunk_id = blake3(doc_id || chunker_version || sort(block_ids)
|| policy_hash) collides when one block (= one PDF page) is split into
multiple chunks: every chunk on that page has identical block_ids.
Workaround: feed a per-chunk variant format!("{base_policy_hash}#c{char_start}")
into the recipe's policy_hash slot.
```
The actual call at `crates/kebab-chunk/src/pdf_page_v1.rs:170-172`:
```rust
let per_chunk_hash = format!("{base_policy_hash}#c{char_start}");
let chunk_id =
    id_for_chunk(&doc.doc_id, &chunker_version, &block_ids, &per_chunk_hash);
```
Here `char_start` is the first tuple field returned by `chunk_page(...)`, that is,
the **post-overlap `actual_start`** (NOT the original segment boundary `start`).
#### §4.1.2 How overlap computes actual_start
`crates/kebab-chunk/src/pdf_page_v1.rs:266-281`:
```rust
let actual_start = if let Some(prev) = chunks.last() {
    let prev_min = prev.0; // actual_start of the previous chunk
    let mut a = start;
    let mut acc_o: usize = 0;
    while a > prev_min {
        let cl = chars[a - 1].len_utf8();
        if acc_o + cl > overlap_bytes {
            break;
        }
        acc_o += cl;
        a -= 1;
    }
    a
} else {
    start
};
```
`while a > prev_min`: the overlap walk goes back only as far as the previous
chunk's actual_start. When overlap_bytes is large enough and `start - prev_min` is
small, `actual_start = prev_min`. **Both chunks then share the same actual_start, hence the same `#c{N}`**.
#### §4.1.3 Hypothesis check: F2 (1580 chars OCR)
Assumption: the OCR text of F2 contains a sentence end (`.` + whitespace)
or a paragraph break (`\n\n`) within its first ~80 chars.
- Default chunking policy: `target_tokens=500` → `target_bytes=1500`,
`overlap_tokens=80` → `overlap_bytes = min(240, 750) = 240`.
- A Korean char is 3 bytes in UTF-8, so overlap_bytes=240 allows a back-walk of up to 80 chars.
- Assumed bounds = `[0, 30, ~n]` (1 sentence end within the first ~30 chars).
- segment 1: start=0, chunk_end=30 → chunks.push((0, 30, ...)). `#c0`.
- segment 2: start=30, byte_len(30, n) >> target_bytes → single-segment chunk.
  - actual_start walk: a=30 → walk back while a > 0, acc_o ≤ 240.
  - 30 chars * 3 bytes = 90 bytes ≤ 240, so the loop ends at a=0 (=prev_min).
  - actual_start = 0 = prev_min.
  - chunks.push((0, n, ...)). `#c0`: **collision with chunk 1**.
The chunk_id inputs of the two chunks within the same doc:
- `{kind:"chunk", doc_id:doc_id_F2, chunker_version:"pdf-page-v1",
block_ids:[block_id_F2], policy_hash:"{base_hash}#c0"}`
- Identical canonical JSON → identical blake3 → identical chunk_id.
→ In the `INSERT INTO chunks` of `put_chunks`, the first row succeeds and the
second row hits a PRIMARY KEY violation.
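The walk-through above can be reproduced with a small stand-alone model of the back-walk. This is a sketch under stated assumptions: `byte_budgets` infers the tokens-to-bytes conversion (3 bytes per token, overlap capped at half the target) from the numbers in the list, and `overlap_start` / `second_chunk_start` are hypothetical free functions that mirror the loop quoted in §4.1.2 rather than the actual code of `chunk_page`.

```rust
/// Assumed conversion, inferred from 500 -> 1500 and min(240, 750) in the text.
fn byte_budgets(target_tokens: usize, overlap_tokens: usize) -> (usize, usize) {
    let target_bytes = target_tokens * 3;
    let overlap_bytes = (overlap_tokens * 3).min(target_bytes / 2);
    (target_bytes, overlap_bytes)
}

/// Walk back from `start` by at most `overlap_bytes`, never past `prev_min`.
fn overlap_start(chars: &[char], start: usize, prev_min: usize, overlap_bytes: usize) -> usize {
    let mut a = start;
    let mut acc = 0usize;
    while a > prev_min {
        let cl = chars[a - 1].len_utf8();
        if acc + cl > overlap_bytes {
            break;
        }
        acc += cl;
        a -= 1;
    }
    a
}

/// Post-overlap start of chunk 2 for a page of `len` Korean chars whose first
/// segment boundary sits at `boundary` (chunk 1 starts at 0, so prev_min = 0).
fn second_chunk_start(len: usize, boundary: usize) -> usize {
    let chars: Vec<char> = std::iter::repeat('가').take(len).collect();
    let (_, overlap_bytes) = byte_budgets(500, 80);
    overlap_start(&chars, boundary, 0, overlap_bytes)
}
```

With a boundary at char 30, `second_chunk_start(1580, 30)` walks back 90 bytes and stops at 0, the same start as chunk 1, so both chunks would carry the `#c0` suffix. With a boundary at char 100 the 240-byte budget runs out after 80 chars and the result is 20, which is the distinct-suffix situation described for F1.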
#### §4.1.4 Why F1 (779 chars OCR) does not collide
The F1 OCR text is also Korean, but either its character distribution differs or
it has no sentence boundary within the first ~80 chars. In that case bounds =
`[0, n]`, or the first boundary falls after 80 chars, so the actual_start of
chunk 2 is a value other than prev_min → distinct `#c{N}` → distinct chunk_id.
→ This matches the dogfood observation that **only F2 collides**.
#### §4.1.5 Assessment of the dogfood report's hypotheses
The dogfood report guessed a cross-doc collision ("identical to a chunk_id of
scanned_page1"). The investigation in this spec finds an **intra-doc collision
(inside F2)**.
Evidence:
- The chunk_id input includes `doc_id`, so chunk_ids of different docs differ automatically.
- Two chunks in the same doc with the same block_id and the same `#c{N}`
policy_hash have identical chunk_ids.
- Hypothesis A (policy_hash default magic value): not confirmed; base_policy_hash
is the blake3 of the policy's canonical JSON (deterministic).
- Hypothesis B (id_for_block's char_end is part of the hash): possible, but
unrelated to the chunk_id collision itself (a block_id change produces a chunk_id
change; a different collision pattern).
- Hypothesis C (the chunker's block_ids ordering): possible, but N/A because each
chunk has a single block.
- Hypothesis D (OCR text identical to another doc inline): N/A, the chunk_id input
does not include the text.
**Confirmed root cause** = a variant of hypothesis C: when a single page yields
multiple chunks, the overlap's actual_start collapses to the previous chunk's
actual_start, making the `#c{N}` suffix identical.
### §4.2 Decision matrix
| Option | Description | Pros | Cons |
|--------|------|------|------|
| **A: segment boundary `start`** | Change the `per_chunk_hash` suffix from the post-overlap `actual_start` to the segment boundary `start` | minimal change / the segment boundary is monotonically increasing (seg_idx loop invariant of chunk_page) → always distinct / preserves the semantic intent of chunk_id | requires changing the return tuple shape of chunk_page |
| B: chunk ordinal | `per_chunk_hash = "#c{ordinal}"` (chunk index 0, 1, 2, ... within the page) | simplest / independent of segment boundaries | weakens the "meaningful hash input" semantics of chunk_id |
| C: (`char_start`, `char_end`) pair | `per_chunk_hash = "#c{char_start}-{char_end}"` | distinct when char_end differs, even if two chunks share char_start | char_end can also coincide due to the overlap clamp (e.g. when the last chunk is split twice); weak invariant |
| D: sequence number + char_start | `per_chunk_hash = "#c{ordinal}-{char_start}"` | invariant fully guaranteed | redundant info, longer hash input |
### §4.3 Chosen path — Option A
Rationale:
- In the main loop of chunk_page, `seg_idx` is strictly increasing, and so is the
segment boundary `bounds[seg_idx]` (bounds is unique after dedup). Using the
segment boundary `start` as the hash suffix therefore guarantees distinct hash
inputs for the chunks within a page.
- Semantics of chunk_id: "which segment does the chunk start from". The
pre-overlap segment boundary is the true semantic origin; overlap is an
enrichment for retrieval boundaries.
- Extend the return tuple of chunk_page to the 4-tuple
`(segment_start, actual_start, chunk_end, slice)` (or track segment_start
separately inside the chunker loop): minimal diff.
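A small model illustrates why the segment boundary is the safer suffix. This is a sketch only: `ChunkKey`, `keys_for_page`, and `all_distinct` are hypothetical names, the literal ids are placeholders, and key equality stands in for hash equality (the real recipe hashes canonical JSON with blake3).

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the chunk_id hash input.
#[derive(Hash, PartialEq, Eq)]
struct ChunkKey {
    doc_id: String,
    chunker_version: String,
    block_ids: Vec<String>,
    policy_hash: String,
}

/// Build one key per chunk on a single page; `suffixes` carries the value
/// placed after `#c` (either actual_start or segment_start).
fn keys_for_page(base_policy_hash: &str, suffixes: &[usize]) -> Vec<ChunkKey> {
    suffixes
        .iter()
        .map(|s| ChunkKey {
            doc_id: "doc_F2".to_string(),
            chunker_version: "pdf-page-v1.1".to_string(),
            // One PDF page = one block, so every chunk shares the block id.
            block_ids: vec!["block_F2".to_string()],
            policy_hash: format!("{base_policy_hash}#c{s}"),
        })
        .collect()
}

/// True when no two keys are equal, i.e. no two chunk_ids would collide.
fn all_distinct(keys: &[ChunkKey]) -> bool {
    let unique: HashSet<&ChunkKey> = keys.iter().collect();
    unique.len() == keys.len()
}
```

Feeding the collapsed post-overlap starts `[0, 0]` produces two equal keys, while the segment boundaries `[0, 30]` of the same two chunks stay distinct, which is the property Option A relies on.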
### §4.4 Implementation
Extend the return signature of `chunk_page` in `crates/kebab-chunk/src/pdf_page_v1.rs`:
```rust
/// Split a single page's text into ordered chunks, each represented as
/// `(segment_start, actual_start, chunk_end, text_slice)`.
///
/// - `segment_start` = pre-overlap segment boundary. Strictly increasing
///   across the returned vec. Use this for chunk_id uniqueness suffixes.
/// - `actual_start` = post-overlap start char index. May collapse to a
///   previous chunk's `actual_start` under aggressive overlap policy.
///   Use this for `SourceSpan::Page::char_start`.
/// - `chunk_end` = chunk's end char index (exclusive).
fn chunk_page(
    text: &str,
    target_bytes: usize,
    overlap_bytes: usize,
) -> Vec<(usize, usize, usize, String)> {
    // ... (existing logic, but each push uses (segment_start, actual_start, chunk_end, slice))
    chunks.push((start, actual_start, chunk_end, slice));
    // ...
}
```
Update the caller's `chunk` method in the same way:
```rust
for (segment_start, char_start, char_end, slice) in
    chunk_page(&p.text, target_bytes, overlap_bytes)
{
    // ... existing u32 conversion + span construction ...
    let span = SourceSpan::Page {
        page: page_num,
        char_start: Some(u32::try_from(char_start).expect("...")),
        char_end: Some(u32::try_from(char_end).expect("...")),
    };
    let block_ids: Vec<BlockId> = vec![p.common.block_id.clone()];
    // segment_start (pre-overlap boundary) is strictly increasing across
    // chunks, even when overlap walk collapses actual_start to prev_min.
    let per_chunk_hash = format!("{base_policy_hash}#c{segment_start}");
    let chunk_id =
        id_for_chunk(&doc.doc_id, &chunker_version, &block_ids, &per_chunk_hash);
    // ... rest unchanged ...
}
```
Update the module doc at `crates/kebab-chunk/src/pdf_page_v1.rs:47-60` at the same
time, since the existing description's `"#c{char_start}"` becomes stale with the new fix:
```rust
//! Design §4.2's chunk_id = blake3(doc_id || chunker_version || sort(block_ids)
//! || policy_hash) collides when one block (= one PDF page) is split into
//! multiple chunks: every chunk on that page has identical block_ids.
//!
//! Workaround that doesn't change the §4.2 recipe: feed a per-chunk
//! variant `format!("{base_policy_hash}#c{segment_start}")` into the
//! recipe's `policy_hash` slot. `segment_start` is the pre-overlap segment
//! boundary, strictly increasing across the returned chunks even when the
//! overlap walk collapses `actual_start` to a previous chunk's `prev_min`.
//! Logged in tasks/HOTFIXES.md (2026-05-27 — Bug #3 second-iteration patch).
```
Also add a dated entry to `tasks/HOTFIXES.md` (this fix is the **second-iteration
patch** of the chunk_id deviation; the entry documents that the first iteration's
`#c{char_start}` workaround left a collision in the aggressive overlap case):
```markdown
## 2026-05-27 — v0.20.0 sub-item 1: chunk_id `#c{char_start}` workaround
collapses under aggressive overlap (Bug #3 second-iteration patch)
**Symptom**: on F2 (1580 chars OCR) ingest, `DocumentStore::put_chunks (pdf):
UNIQUE constraint failed: chunks.chunk_id`. ...
**Root cause**: at `crates/kebab-chunk/src/pdf_page_v1.rs:170`, ...
the post-overlap `actual_start` collapses to the previous chunk's actual_start ...
**Fix** (this spec, §4.4): add `segment_start` to the `chunk_page` return tuple
and change the `per_chunk_hash` suffix to `segment_start` ...
**chunker_version cascade**: `pdf-page-v1` → `pdf-page-v1.1` bump
(see §4.4.1). The chunk_id of multi-chunk PDF pages changes, so the design §9
cascade trigger performs explicit invalidation.
**Amends**: spec `docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md`
§4.4. The parent design §4.2 chunk_id recipe itself is unchanged (only the
internal computation of the workaround layer changes).
```
#### §4.4.1 Preserving chunk_id determinism
Existing single-chunk-per-page case (e.g. small pages, `text.len() <= target_bytes`):
- Early return of `chunk_page`: `vec![(0, n, text.to_string())]` becomes
`vec![(0, 0, n, text.to_string())]` in the new shape. `segment_start = 0 = actual_start`.
- Same `#c0` suffix → same chunk_id.
First chunk of the multi-chunk case:
- segment_start = bounds[0] = 0, actual_start = start = 0 (no previous chunk).
- Same `#c0` suffix → same chunk_id.
Second and later chunks of the multi-chunk case:
- Before: `actual_start` (overlap-walked, may be == 0).
- After: `segment_start` = bounds[seg_idx] > 0.
- → chunk_id changes (intentional, to avoid the collision).
→ The chunk_ids of multi-chunk pages in an existing v0.19 (pre-OCR) PDF KB change.
The v0.20 force-reingest path refreshes them automatically.
**Decision (round 1c, closes §7.2 Open Q1): bump chunker_version
`pdf-page-v1` → `pdf-page-v1.1`** (adopts the critic round 1 M-1 recommendation).
Rationale:
- For normal multi-chunk PDF pages (e.g. metro-korea.pdf in dogfood report
  Scenario 1, with 21 blocks / 34 chunks, a normal path that does not trigger
  Bug #3), the internal computation change silently maps chunk_ids to different values.
  If chunker_version stays at `pdf-page-v1`, the cascade audit of the
  store/embedding layer never fires, so unless the user explicitly runs
  `--force-reingest`, chunk_id ↔ chunk_text in the vector store can silently mismatch.
- The original intent of the design §9 cascade rule is that a chunker algorithm
  change gets an explicit version bump, which in turn produces an automatic
  invalidation report from the store layer. The `pdf-page-v1.1` bump applies that rule directly.
- Bump cost is zero: v0.20.0 is itself a force-update release (a cumulative
  bugfix stack on top of the single PR #189 release commit), and enabling the
  OCR feature of the parent spec (`2026-05-27-pdf-scanned-ocr-spec.md`) already
  makes force-reingest the recommended path. Once chunker_version differs,
  single-chunk PDF pages are likewise recomputed by the cascade within a new doc_id chain.
- Benefit: an explicit user-facing audit trail. On the next ingest, the cascade
  invalidation is stated in the store layer report.
The other cascade version fields (parser_version / embedding_version /
prompt_template_version / index_version) are unchanged; the patch is limited
to the chunker layer.
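A minimal sketch of how a version comparison can surface exactly one changed cascade field. The struct, the function names, and the placeholder values for the four unchanged fields are all hypothetical; the spec does not show the store layer's real invalidation report.

```rust
/// Hypothetical bundle of the cascade version fields named in this spec.
#[derive(Debug, Clone, PartialEq, Eq)]
struct CascadeVersions {
    parser_version: String,
    chunker_version: String,
    embedding_version: String,
    prompt_template_version: String,
    index_version: String,
}

/// Builds a version set that varies only in `chunker_version`.
/// The other values are placeholders, not the project's real versions.
fn versions_with_chunker(chunker: &str) -> CascadeVersions {
    CascadeVersions {
        parser_version: "placeholder-parser".to_string(),
        chunker_version: chunker.to_string(),
        embedding_version: "placeholder-embedding".to_string(),
        prompt_template_version: "placeholder-prompt".to_string(),
        index_version: "placeholder-index".to_string(),
    }
}

/// Returns the names of the fields that differ between the stored and the
/// current version set, in declaration order.
fn changed_fields(stored: &CascadeVersions, current: &CascadeVersions) -> Vec<&'static str> {
    let pairs = [
        ("parser_version", &stored.parser_version, &current.parser_version),
        ("chunker_version", &stored.chunker_version, &current.chunker_version),
        ("embedding_version", &stored.embedding_version, &current.embedding_version),
        (
            "prompt_template_version",
            &stored.prompt_template_version,
            &current.prompt_template_version,
        ),
        ("index_version", &stored.index_version, &current.index_version),
    ];
    pairs
        .iter()
        .filter(|(_, old, new)| old != new)
        .map(|(name, _, _)| *name)
        .collect()
}
```

Comparing a stored `pdf-page-v1` set against a current `pdf-page-v1.1` set reports only `chunker_version`, which is the explicit audit signal the decision above is after.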
Update the `chunker_version()` constant of `PdfPageV1Chunker`:
```rust
impl Chunker for PdfPageV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion("pdf-page-v1.1".to_string()) // was: "pdf-page-v1"
}
// ...
}
```
Also update `PARSER_VERSION` or the const `CHUNKER_VERSION` in
`crates/kebab-chunk/src/pdf_page_v1.rs` at the same time (use whichever constant name the crate actually defines).
### §4.5 Test additions
Add to `#[cfg(test)] mod tests` in `crates/kebab-chunk/src/pdf_page_v1.rs`:
```rust
#[test]
fn multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids() {
// Regression test for v0.20.0 sub-item 1 Bug #3: post-overlap actual_start
// can collapse to prev_min, producing identical `#c{char_start}` suffixes
// and identical chunk_ids → sqlite chunks.chunk_id PRIMARY KEY violation
// at put_chunks INSERT time.
//
// Synthesises Korean OCR text shape: dense Korean characters (3 bytes
// per char) with a single early sentence-end boundary at char ~10 +
// long tail.
// 10 Korean chars (= 30 UTF-8 bytes) + "." + " " + ~500 more Korean chars.
    let early_seg = "가".repeat(10);
    let tail = "나".repeat(500);
let page_text = format!("{early_seg}. {tail}");
let doc = make_pdf_doc(&[&page_text]);
let policy = default_policy(500, 80); // target=1500 byte, overlap=240 byte
let chunks = PdfPageV1Chunker.chunk(&doc, &policy).unwrap();
assert!(
chunks.len() >= 2,
"expected ≥2 chunks for {} byte page; got {}",
page_text.len(),
chunks.len()
);
// Hard invariant: all chunk_ids must be unique. Without the fix, the
// second chunk would have actual_start = 0 (== first chunk's
// actual_start) under the aggressive-overlap walk → identical `#c0`
// hash suffix → identical chunk_id → PRIMARY KEY violation.
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
ids.sort_unstable();
let total = ids.len();
ids.dedup();
assert_eq!(
ids.len(),
total,
"all chunk_ids must be unique even when overlap walks actual_start back to prev_min"
);
}
```
(round 1c L-1: the second test from round 0,
`chunk_id_recipe_uses_segment_start_not_actual_start`, was removed. It was
redundant with this test's uniqueness check, and its only real assertion was
`assert!(chunks.len() >= 2)`, which did not match the intent of the test name.)
In addition, an integration-level regression test in `crates/kebab-app/tests/`:
```rust
// crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs (new)
//
// v0.20.0 sub-item 1 Bug #3 regression: ingesting multiple scanned PDFs (each a
// single page with a different OCR text length) passes with no chunk_id collision.
//
// A mock OCR engine (MockOcrEngine from kebab-parse-image, or an inline impl) is
// configured to return a different text length per page (e.g. 30, 200, 800 chars).
// Avoids real Ollama calls.
#[test]
fn multi_scanned_pdf_ingest_no_chunk_id_collision() {
// ... setup: 3 scanned PDF fixture, mock OCR engine, isolated KB
// ... assert: ingest_report.items 모두 kind != Error
// ... assert: store.get_chunks_count() = sum of per-PDF chunk_counts
}
```
(round 1c NIT-1: the file name and the function name are unified as
`multi_scanned_pdf_ingest_no_chunk_id_collision`; the round 0 file name
`pdf_multi_scan_no_chunk_id_collision.rs` did not match the fn name.)
#### §4.5.1 Pre-condition — MockOcrEngine availability (round 1c M-3)
This integration test requires a mock impl of the `OcrEngine` trait. The
executor's first step:
1. Locate MockOcrEngine with `grep -rn "impl OcrEngine" crates/kebab-parse-image/src/ crates/kebab-app/tests/`.
2. **Current state** (2026-05-27 verifier probe):
   - `crates/kebab-parse-image/src/ocr.rs:235`: production `impl OcrEngine for OllamaVisionOcr`.
   - `crates/kebab-app/tests/pdf_ocr_apply.rs:25`: `impl OcrEngine for MockOcrEngine` (test-only).
3. The new integration test (`multi_scanned_pdf_ingest_no_chunk_id_collision.rs`)
   is a separate test binary in the same crate (`kebab-app`), so it cannot
   directly import the private MockOcrEngine of `pdf_ocr_apply.rs`. The executor's options:
   - **Option A (recommended)**: lift MockOcrEngine into
     `crates/kebab-app/tests/common/mock_ocr.rs` (parameterised, taking the
     per-page text lengths as a ctor argument). Both tests (`pdf_ocr_apply.rs`
     and the new test) share it via `mod common;`.
   - **Option B**: define an inline `impl OcrEngine for LocalMock { ... }` in the
     new test, duplicating the mock (prioritises test isolation, avoids the sharing cost).
4. If no mock is available (or sharing is hard and Option B is also impractical),
   **conditionally downgrade** the acceptance of §6 row 7: the unit-level
   invariant in `kebab-chunk` (§6 row 4) alone pins the core regression of
   Bug #3, and the integration test is skipped.
The decision path for this executor dependency-check task is closed in §7.2
Open Q4.
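A minimal sketch of the parameterised mock that Option A describes. The real `OcrEngine` trait signature is not shown in this spec, so `PageOcr` below is a hypothetical stand-in, and `ParamMockOcr` is an illustrative name rather than the existing `MockOcrEngine`.

```rust
/// Hypothetical stand-in for the real `OcrEngine` trait (signature assumed).
trait PageOcr {
    fn ocr_page(&self, page_index: usize) -> Result<String, String>;
}

/// Mock that returns Korean filler text of a configured length per page.
struct ParamMockOcr {
    page_text_lens: Vec<usize>,
}

impl ParamMockOcr {
    /// `page_text_lens[i]` is the number of chars returned for page `i`.
    fn new(page_text_lens: Vec<usize>) -> Self {
        Self { page_text_lens }
    }
}

impl PageOcr for ParamMockOcr {
    fn ocr_page(&self, page_index: usize) -> Result<String, String> {
        self.page_text_lens
            .get(page_index)
            // Each '가' is 3 UTF-8 bytes, matching the dense-Korean OCR shape.
            .map(|&len| "가".repeat(len))
            .ok_or_else(|| format!("no mock text configured for page {page_index}"))
    }
}
```

Constructing it with `vec![30, 200, 800]` would give the three text lengths mentioned for the integration test, without any real Ollama call.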
### §4.6 Acceptance (Bug #3 fix)
- Ingesting F1 (779 chars) and F2 (1580 chars) together yields 0 chunk_id collisions.
- 0 collisions on every `--force-reingest`.
- 0 collisions in a KB of 5+ scanned PDFs (Korean OCR text ranging from 100 to 3000 chars).
- The existing 1000-determinism test in `crates/kebab-chunk`
  (`deterministic_chunk_ids_1000`) keeps passing.
- 0 workspace test regressions.
- New test `multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids`
  and integration test `multi_scanned_pdf_ingest_no_chunk_id_collision` are added.
## §5 Bug #4 — F4 fixture Pages tree
### §5.1 Root cause
#### §5.1.1 Symptom
```json
{
"doc_path": "mojibake.pdf",
"kind": "new",
"byte_len": 22568,
"pdf_ocr_pages": 0,
"pdf_ocr_ms_total": 0,
"block_count": 0,
"chunk_count": 0,
"warnings": []
}
```
This violates the PdfTextExtractor invariant (§1.4 "1 Block::Paragraph per page").
#### §5.1.2 How lopdf get_pages() reacts
dogfood probe:
- `lopdf::Document::load_mem(F4_bytes)` → OK.
- `pdf_doc.get_pages()` → empty `BTreeMap`.
- In the PDF byte stream, `/Type /Page` count = 1 and `/Count` value = 1.
→ Structurally one page exists, but lopdf's Pages tree traversal
(`/Pages` → `/Kids` chain) is broken.
#### §5.1.3 Analysis of the fixture generation path
`tests/fixtures/_synth/mojibake.py`:
```python
c = canvas.Canvas(str(dst), pagesize=A4)
c.setFont(FONT_NAME, 12)
y = A4[1] - 30*mm
for line in ["Mojibake fixture (no ToUnicode CMap)", "..."]:
c.drawString(30*mm, y, line)
y -= 16
c.save()
data = dst.read_bytes()
# pattern: "/ToUnicode <objref>" — strip indirect object reference
new_data = re.sub(rb"/ToUnicode\s+\d+\s+\d+\s+R\b", b"", data)
dst.write_bytes(new_data)
```
**Step 2 analysis**: `re.sub` removes the `/ToUnicode N M R` byte sequence, but:
- Every byte offset after the cut shifts by the length of the removed bytes.
- The offset entries in the cross-reference table (`xref`) become stale.
- The offset in the `startxref` value becomes stale as well.
**Step 3 startxref fix** (commit `c2cd3a7` in `tasks/HOTFIXES.md`):
- A manual byte edit `22130 → 22114` updates startxref.
- However, the individual offsets in the xref table itself are still stale:
  the actual byte position of the indirect object referenced by the Pages tree
  `/Kids` array does not match its xref entry.
- lopdf's strict load validates startxref and the xref table first; the load
  succeeds, but indirect object resolution fails during Pages tree traversal,
  leaving an empty Pages map.
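The offset shift can be reproduced with a minimal sketch. The helper names are hypothetical, and the `9 0 R` reference is made up: it was chosen only because `/ToUnicode 9 0 R` is 16 bytes long, the same delta as the `22130 → 22114` startxref edit. The actual byte sequence removed from the fixture is not shown in this spec.

```rust
/// Position of the first occurrence of `needle` in `haystack`.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return None; // `windows(0)` would panic
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Removes the first occurrence of `needle`; returns the new bytes and the cut position.
fn strip_first(data: &[u8], needle: &[u8]) -> Option<(Vec<u8>, usize)> {
    let at = find(data, needle)?;
    let mut out = Vec::with_capacity(data.len() - needle.len());
    out.extend_from_slice(&data[..at]);
    out.extend_from_slice(&data[at + needle.len()..]);
    Some((out, at))
}

/// Where a byte that used to live at `old` ends up after the cut.
/// Offsets before the cut are unchanged; offsets after it shift left by `cut_len`.
fn shifted_offset(old: usize, cut_at: usize, cut_len: usize) -> usize {
    if old >= cut_at + cut_len {
        old - cut_len
    } else {
        old
    }
}
```

Every xref entry that points past the cut needs this same correction, which is why patching `startxref` alone leaves the table inconsistent.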
### §5.2 Fixture re-generation strategy
| Option | Description | Pros | Cons |
|--------|------|------|------|
| **A: pikepdf** | synthesise with reportlab, then open with pikepdf, remove ToUnicode, and save (xref auto-regen) | proper xref regeneration / Pages tree intact / library available (pip install pikepdf) | new Python dependency (`pikepdf`) |
| B: qpdf normalize | after the byte edit, run `qpdf --linearize input.pdf output.pdf` | external tool (the sub-item 1 acceptance criteria already carry a qpdf hint) | qpdf normalization may reject the broken xref (or may inline the ToUnicode reference again) |
| C: reportlab disable ToUnicode | disable ToUnicode CMap generation for the Type 0 font during reportlab synthesis | avoids the byte edit, clean | the reportlab API does not directly expose disabling ToUnicode (needs a font subclass or monkeypatch) |
### §5.3 Chosen path — Option A (pikepdf)
Rationale:
- pikepdf is a proper PDF surgery library (the Python bindings for qpdf).
- It regenerates the xref table automatically and preserves Pages tree integrity.
- The dependency is added with `pip install pikepdf`; the Python venv used for
  fixture generation already uses reportlab, so the extra install is trivial.
#### §5.3.1 pikepdf approach to the ToUnicode strip
In a reportlab Type 0 font, the ToUnicode CMap reference appears in the font
dictionary as `/ToUnicode <ref>`. With pikepdf, remove only the `/ToUnicode`
entry of the font dictionary:
```python
import pikepdf
with pikepdf.open(str(dst), allow_overwriting_input=True) as pdf:
# Walk all indirect objects, delete /ToUnicode entry whenever found.
    # Per the PDF spec, /ToUnicode appears only as a child of a Font dictionary,
    # so the false-positive risk is practically zero. The explicit Font type
    # check is omitted (same form as the actual implementation in §5.4).
for obj in pdf.objects:
if isinstance(obj, pikepdf.Dictionary):
if "/ToUnicode" in obj:
del obj["/ToUnicode"]
pdf.save(str(dst))
```
pikepdf's save preserves xref and Pages tree integrity automatically.
### §5.4 Implementation (mojibake.py revision)
Complete rewrite of `tests/fixtures/_synth/mojibake.py`:
```python
"""Synthesize mojibake fixture -- Type 0 font PDF without ToUnicode CMap.
Strategy:
  1. Synthesise a Korean PDF that uses a Type 0 (CID) font with reportlab (includes a normal ToUnicode CMap).
  2. Open with pikepdf, remove the /ToUnicode entry of the font dictionary, and save (xref regenerates automatically).
Dependency: reportlab + pikepdf. Install via `pip install reportlab pikepdf`.
Usage:
python3 tests/fixtures/_synth/mojibake.py \
crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf
"""
import sys
from pathlib import Path
from reportlab.lib.pagesizes import A4
from reportlab.lib.units import mm
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfgen import canvas
import pikepdf
DEJAVU_TTF = "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"
FONT_NAME = "DejaVuSans"
pdfmetrics.registerFont(TTFont(FONT_NAME, DEJAVU_TTF))
dst = Path(sys.argv[1])
# Step 1: synthesise a normal PDF.
c = canvas.Canvas(str(dst), pagesize=A4)
c.setFont(FONT_NAME, 12)
y = A4[1] - 30 * mm
for line in [
"Mojibake fixture (no ToUnicode CMap)",
"Text extraction yields garbage \x00\x01\x02",
]:
c.drawString(30 * mm, y, line)
y -= 16
c.save()
# Step 2: strip the /ToUnicode reference with pikepdf and regenerate the xref.
removed = 0
with pikepdf.open(str(dst), allow_overwriting_input=True) as pdf:
for obj in pdf.objects:
if isinstance(obj, pikepdf.Dictionary):
if "/ToUnicode" in obj:
del obj["/ToUnicode"]
removed += 1
pdf.save(str(dst))
if removed == 0:
print("ERROR: no /ToUnicode entry found in any dictionary", file=sys.stderr)
sys.exit(2)
# Step 3: verify invariants (load + page count).
with pikepdf.open(str(dst)) as pdf:
n_pages = len(pdf.pages)
if n_pages != 1:
print(f"ERROR: expected 1 page, got {n_pages}", file=sys.stderr)
sys.exit(3)
# Confirm the ToUnicode-absent invariant.
raw = Path(dst).read_bytes()
if b"/ToUnicode" in raw:
print("ERROR: /ToUnicode still present after strip", file=sys.stderr)
sys.exit(4)
print(f"wrote {dst} ({dst.stat().st_size} bytes, ToUnicode stripped via pikepdf, 1 page)")
```
### §5.5 Test additions
Add to `crates/kebab-parse-pdf/tests/text_extractor.rs` (or the relevant
existing test file):
```rust
/// Pages tree invariant of F4 mojibake.pdf: the Step 2 fixture re-generation
/// (pikepdf-based) guarantees that lopdf's get_pages() returns normally.
///
/// Bug #4 regression: the previous fixture (byte edit + manual startxref)
/// passed lopdf's strict load, but indirect object resolution broke during
/// Pages tree traversal → empty pages map → block_count=0.
#[test]
fn mojibake_fixture_load_yields_one_page() {
let bytes = include_bytes!("../tests/fixtures/mojibake.pdf");
let doc = lopdf::Document::load_mem(bytes).expect("F4 fixture must lopdf-load");
let pages = doc.get_pages();
assert_eq!(
pages.len(),
1,
"F4 fixture must have exactly 1 page (Pages tree integrity)"
);
}
#[test]
fn mojibake_fixture_has_no_tounicode_cmap() {
    // ToUnicode-absent invariant from Step 2.
let bytes = std::fs::read("tests/fixtures/mojibake.pdf").unwrap();
let count = bytes.windows(b"/ToUnicode".len())
.filter(|w| *w == b"/ToUnicode")
.count();
assert_eq!(count, 0, "F4 fixture must not contain /ToUnicode marker");
}
#[test]
fn pdf_text_extractor_on_mojibake_yields_one_block() {
    // PdfTextExtractor invariant: 1 Block::Paragraph per page.
    // The F4 fixture lacks ToUnicode → text extraction yields garbage or
    // empty text → 1 empty Block::Paragraph + "scanned candidate" warning.
let bytes = include_bytes!("../tests/fixtures/mojibake.pdf");
// ... ExtractContext setup + extractor.extract(&ctx, bytes) ...
let canonical = extractor.extract(&ctx, bytes).unwrap();
assert_eq!(canonical.blocks.len(), 1, "expected 1 Block::Paragraph per page");
    // The text is garbage or empty; the invariant is that the block itself exists.
let warning_present = canonical.provenance.events.iter().any(|e| {
matches!(e.kind, ProvenanceKind::Warning)
&& e.note.as_ref().is_some_and(|n| n.contains("scanned candidate"))
});
    assert!(warning_present || !canonical.blocks[0].text_is_empty(),
        "a scanned-candidate warning is required when text-detect-first falls back to empty text");
}
```
### §5.6 Acceptance (Bug #4 fix)
- After F4 fixture re-generation, `lopdf::Document::load_mem(...).get_pages().len() == 1`.
- The ToUnicode-CMap-absent invariant of the F4 fixture is preserved
  (`grep -c "/ToUnicode" mojibake.pdf` = 0).
- On F4 ingest, PdfTextExtractor yields `block_count = 1` with either the
  warning `"page1 empty (scanned candidate)"` or garbage text.
- On dogfood retest, mojibake.pdf shows `block_count: 1` and
  `chunk_count: 0~1` (depending on text content).
- Update the F4 baseline in the existing `text_extractor_regression.rs`: the
  old baseline is itself a snapshot of the broken invariant, so it needs an update.
- 0 workspace test regressions.
## §6 Acceptance criteria (consolidated)
| # | Verifier | Bug | Command |
|---|---------|-----|------|
| 1 | walker bypasses size cap for PDF | #2 | `cargo test -p kebab-source-fs size_cap_skips_only_code_files` |
| 2 | walker still skips oversized code files | #2 | `cargo test -p kebab-source-fs ingest_report_counts_oversized_files_by_bytes` |
| 3 | 256KB+ PDF/markdown ingest default config | #2 | dogfood retest: `kebab ingest` reports `skipped_size_exceeded = 0` for non-code |
| 4 | chunker collision regression test | #3 | `cargo test -p kebab-chunk multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids` |
| 5 | chunker determinism preserved | #3 | `cargo test -p kebab-chunk deterministic_chunk_ids_1000` |
| 6 | chunker overlap clamp preserved | #3 | `cargo test -p kebab-chunk overlap_clamped_when_overlap_exceeds_target` |
| 7 | integration: multi-scanned PDF ingest (conditional: only if the MockOcrEngine of §4.5.1 can be shared) | #3 | `cargo test -p kebab-app multi_scanned_pdf_ingest_no_chunk_id_collision` |
| 8 | dogfood: F1 + F2 force-reingest | #3 | dogfood retest: `kebab ingest --force-reingest` reports errors = 0 (excluding encrypted) |
| 9 | F4 fixture lopdf 1-page invariant | #4 | `cargo test -p kebab-parse-pdf mojibake_fixture_load_yields_one_page` |
| 10 | F4 fixture ToUnicode-absent invariant | #4 | `cargo test -p kebab-parse-pdf mojibake_fixture_has_no_tounicode_cmap` |
| 11 | F4 PdfTextExtractor 1-block invariant | #4 | `cargo test -p kebab-parse-pdf pdf_text_extractor_on_mojibake_yields_one_block` |
| 12 | dogfood: F4 ingest yields block_count=1 | #4 | dogfood retest: the ingest item for mojibake.pdf shows `block_count: 1` |
| 13 | workspace clippy clean | all | `cargo clippy --workspace --all-targets -- -D warnings` |
| 14 | workspace full test pass | all | `cargo test --workspace --no-fail-fast -j 1` |
| 15 | dogfood end-to-end 9 PDF | all | dogfood retest: all 9 PDFs ingested, errors = 2 (encrypted only) |
| 16 | chunker_version cascade: final value | #3 | `grep -nE 'pdf-page-v[0-9.]+' crates/kebab-chunk/src/pdf_page_v1.rs` shows `"pdf-page-v1.1"` (round 1c M-1 decision) |
## §7 Risks + open questions
### §7.1 Risks
- **The Bug #3 fix changes chunk_ids**: chunk_ids change for multi-chunk PDF
  pages (pages exceeding 1500 bytes at the pre-OCR stage). Users need one
  `--force-reingest`. Acceptable because v0.20.0 is a force-update path (users
  do a fresh ingest anyway). To be stated in the README or release notes.
- **Side effect of the Bug #2 fix**: PDFs of 1 GB or more now pass the walker,
  so lopdf's load_mem risks a memory blow-up. Out of v0.20 scope (a streaming
  parser is to be considered from Phase 9, the future scope of design §9.2).
  Acceptable for this fix.
- **The Bug #4 fix changes the fixture binary**: the SHA256 of F4 mojibake.pdf
  changes, producing git LFS / binary diff noise. The baseline in
  `text_extractor_regression.rs` is also updated to the new fixture's output,
  handled in the same commit.
- **pikepdf install requirement**: fixture re-generation needs `pip install
  pikepdf`. This would add a Python dependency to CI only if fixture
  regeneration were part of CI; since this spec's fix commits the fixture
  itself, generation is a one-off and creates no CI dependency.
### §7.2 Open questions
1. **Cost-benefit of the chunker_version bump**: ✅ **CLOSED (round 1c M-1)**:
   decided to bump `pdf-page-v1` → `pdf-page-v1.1`. The cascade audit trail
   becomes explicit, and the cost is zero on the v0.20 force-update path.
   Details: the "Decision (round 1c, closes §7.2 Open Q1)" paragraph in §4.4.1.
2. **Including Bug #2 Option B (per-type config) in v0.20 scope**: this spec
   defers it to v0.21+. critic round 1 ACCEPT, with no recommendation to include it in v0.20.
3. **F4 fixture invariant**: critic round 1 ACCEPT. The combination of absent
   ToUnicode and a valid Pages tree is exactly reproducible with pikepdf's
   proper PDF surgery. The Step 2 design (mojibake.py rewrite) is sound.
4. **Mock OCR for the integration test**: ✅ **CLOSED (round 1c M-3)**:
   `impl OcrEngine for MockOcrEngine` already exists at
   `crates/kebab-app/tests/pdf_ocr_apply.rs:25`. Only the executor's choice
   between the share path (Option A, lifting into `tests/common/mock_ocr.rs`)
   and inline duplication (Option B) remains. If sharing is not possible,
   §6 row 7 is conditionally downgraded; details in §4.5.1.
5. **chunk_page tuple shape change**: does the Option A 4-tuple `(segment_start,
   actual_start, chunk_end, slice)` affect external callers?
   `chunk_page` is module-private (`fn chunk_page`), so there are 0 external
   callers and the change is safe. critic round 1 ACCEPT.
## §8 References
- dogfood report: `.omc/reviews/2026-05-27-v0.20-dogfood-report.md`
- parent spec (frozen): `docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md`
- parent plan (round 1c ACCEPT): `docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md`
- source code (root cause evidence):
- `crates/kebab-source-fs/src/connector.rs` (Bug #2)
- `crates/kebab-source-fs/src/code_meta.rs` (is_oversized + code_lang_for_path)
- `crates/kebab-config/src/lib.rs` (IngestCodeCfg)
- `crates/kebab-core/src/ids.rs` (id_for_chunk / id_for_block recipes)
- `crates/kebab-chunk/src/pdf_page_v1.rs` (PdfPageV1Chunker + chunk_page)
- `crates/kebab-app/src/pdf_ocr_apply.rs` (post-extract OCR enrichment)
- `crates/kebab-app/src/lib.rs:1769-1968` (ingest_one_pdf_asset wiring)
- `crates/kebab-store-sqlite/src/documents.rs:103-155` (put_chunks DELETE+INSERT)
- `migrations/V001__init.sql:80-94` (chunks table DDL — chunk_id PRIMARY KEY)
- `tests/fixtures/_synth/mojibake.py` (Bug #4 fixture source)
- design §3.4 (SourceSpan::Page), §3.5 (Chunk + chunk_id recipe),
§4.2 (id_from canonical JSON), §5.2 (walker builtin blacklist),
§9 (versioning cascade).

@@ -0,0 +1,308 @@
---
title: "v0.20.0 sub-item 1 bugfix round 2 — Identity-H mojibake marker + CLI --media help text"
created: 2026-05-27
status: "DRAFT round 1c"
parent_spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
contract_sections: ["§1.3 (text-detect threshold metric)", "§9 (version cascade)"]
related_specs:
- docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
- docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
related_dogfood:
- .omc/reviews/2026-05-27-v0.20-bugfix-dogfood-report.md (Bug #6 + #7)
---
# v0.20.0 sub-item 1 bugfix round 2 — Identity-H mojibake marker + CLI --media help text
## §1 Problem statement
### §1.1 Bug #6: Identity-H Unimplemented marker bypasses mojibake detection
**Symptom**: `metro-korea.pdf` (58 MB, Identity-H CID font without ToUnicode CMap) ingests with `pdf_ocr_pages=0`. Full text contains `?Identity-H Unimplemented?` marker 1154 times. All 21 pages + 34 chunks are indexable, but content is unusable garbage — repeated marker literal instead of readable text.
**Root cause**: `crates/kebab-parse-pdf/src/text_quality.rs` lines 9-37. The function `compute_valid_char_ratio()` via `is_valid_text_char()` treats ASCII printable range `0x0020..=0x007E` as unconditionally valid. lopdf emits `?Identity-H Unimplemented?` (26 ASCII printable chars) when it cannot decode a CID font lacking ToUnicode CMap. Result: valid_ratio = 1.0 → exceeds OCR fallback threshold 0.5 → text-detect first-pass incorrectly classifies mojibake as valid text → `pdf_ocr_pages` stays 0, no OCR fallback triggered.
**Design intent deviation**: Parent spec §1.3 (line 74) explicitly states "ratio metric judges mojibake page as scanned candidate." PoC example "֥ᬵᯝ₞e ࠦᯱᖝ░" (custom font, no ToUnicode) should trigger OCR. **Implementation gap**: literal ASCII marker case (Identity-H font) was not anticipated.
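The bypass can be checked with a minimal sketch. It models only the ASCII-printable arm of the validity check; per this spec the real `is_valid_text_char()` also accepts Hangul, CJK, and Latin-1, and the function name `ascii_printable_ratio` is hypothetical.

```rust
/// The literal marker string quoted in this spec.
const MARKER: &str = "?Identity-H Unimplemented?";

/// Share of chars in the ASCII printable range 0x0020..=0x007E.
/// (Simplified stand-in for the full validity check.)
fn ascii_printable_ratio(s: &str) -> f32 {
    let total = s.chars().count();
    if total == 0 {
        return 0.0;
    }
    let valid = s
        .chars()
        .filter(|c| ('\u{0020}'..='\u{007E}').contains(c))
        .count();
    valid as f32 / total as f32
}
```

Because every char of the marker is printable ASCII, a page made only of repeated markers scores 1.0 and clears the 0.5 threshold, regardless of how many times the marker repeats.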
### §1.2 Bug #7: CLI `--media` help text omits `code` from valid values
**Symptom**: `kebab search --help` lists `--media` accepted values as "markdown, pdf, image, audio, other" — `code` is missing.
**Actual behavior**: `kebab search "main" --media code --json -k 5` returns 5 hits (code/script.sh, code/rust_sample/src/main.rs, etc.). Schema `media_breakdown` includes `code: 6` as first-class. Functional correctness is complete; **CLI doc-comment is outdated only**.
**Root cause**: `crates/kebab-cli/src/main.rs:148-165`. SearchArgs `--media` field clap doc-comment omits `code`. clap's `--help` renderer quotes this doc-comment directly.
---
## §2 Scope + non-scope
### §2.1 Included in this spec
- **Bug #6 fix**: Add known mojibake marker stripping to `compute_valid_char_ratio()`.
- **Bug #6 test**: Two new unit tests covering the Identity-H marker (marker-dominant and marker-minority mixed-text cases).
- **Bug #6 regression**: Verify existing 8 text_quality unit tests remain green.
- **Bug #7 fix**: Update CLI `--media` doc-comment to include `code`.
- **Bug #7 test**: Assert that `kebab search --help` output contains "`code`" substring.
- **Traceability**: Link both fixes to parent spec §1.3 design intent.
### §2.2 Explicitly out of scope
**Bug #8 candidate (falsified)**: V007 trigram tokenizer already applied; 2-character query limitation is design-level constraint, not a bug. Handled in prior dogfood report §Bug #8.
**Non-bug observations**:
- `--readonly + ingest` exit=0: Graceful refusal per CLAUDE.md intent (exit codes 0/1/2/3 unchanged; `error.v1.code` handles agent branching).
- Ask phrasing-sensitive refusal: RAG corner case; not a code defect.
- Binary staleness: Environmental artifact, not applicable to spec.
**Ancillary risks**:
- Scan for other `--media` doc-comment locations (R-4): Plan drafter to use grep; not blockers for this spec.
- Other lopdf unimplemented markers (R-1): Plan drafter to inspect lopdf source; marker array is extensible.
---
## §3 Decisions
### §3.1 Bug #6: Known mojibake marker stripping
Strip known mojibake marker substrings **before ratio calculation**, then force ratio to 0.0 if remainder is empty after marker removal. When stripped characters exceed remaining characters (marker dominance), cap ratio at 0.3 to trigger OCR fallback on marker-heavy mixed pages.
**Rationale**: lopdf's unimplemented CID font handling consistently emits specific ASCII marker strings. Hardcoding them is lightweight, deterministic, and covers the known failure mode without requiring expensive heuristics (e.g., ML-based gibberish detection). Pages like `metro-korea.pdf` may contain mostly mojibake body text with small valid headers; the marker-dominance check ensures such pages fall below the 0.5 OCR threshold.
**Marker list**: `?Identity-H Unimplemented?` only. lopdf 0.32.0 emits exactly one marker (verified per critic round 1 probe). Extensible if future lopdf versions emit additional markers.
### §3.2 Bug #7: CLI doc-comment update
Add `code` to the comma-separated list of valid `--media` values in the SearchArgs field's clap doc-comment. Single-line edit; no functional or schema changes.
### §3.3 Parent spec traceability
Both fixes uphold parent spec §1.3:
- Bug #6 ensures mojibake pages (fonts lacking a ToUnicode CMap) trigger OCR fallback per design intent.
- Bug #7 corrects CLI documentation to match actual schema (first-class `code` media type supported since v0.18.0).
No changes to parser_version, chunker_version, or wire schema.
---
## §4 Implementation specification
### §4.1 Bug #6: text_quality.rs diff
**File**: `crates/kebab-parse-pdf/src/text_quality.rs`
**Change**:
1. Add constant array of known mojibake markers (lines 8-10):
```rust
// Source of truth: lopdf-0.32.0/src/document.rs:523 (Document::decode_text).
// Only one Unimplemented marker is emitted by lopdf 0.32.0; other CMap
// encodings fall through to `String::from_utf8_lossy(bytes)`, which yields
// PUA / replacement-char territory already covered by `pure_pua_zero`.
// Re-verify on lopdf dependency upgrade.
const MOJIBAKE_MARKERS: &[&str] = &[
"?Identity-H Unimplemented?",
];
```
2. Refactor `compute_valid_char_ratio()` (lines 39-106):
```rust
pub fn compute_valid_char_ratio(s: &str) -> f32 {
// 1) Strip known mojibake markers before counting valid chars.
// Identity-H CID fonts without ToUnicode CMap emit ASCII-only marker
// substrings (bypassing PUA detection).
let mut cleaned: String = s.to_string();
// `had_marker` guard preserves prior behavior for whitespace-only input
// (returns ratio of whitespace validity, not 0.0) when no markers found.
// With markers stripped, the guard enables the trim-empty check.
let mut had_marker = false;
for marker in MOJIBAKE_MARKERS {
if cleaned.contains(marker) {
had_marker = true;
cleaned = cleaned.replace(marker, "");
}
}
// 2) Whitespace-only cleaned text → 0.0 (marker-only page).
if had_marker && cleaned.trim().is_empty() {
return 0.0;
}
    // 3) Marker-dominance heuristic: when stripped bytes exceed remaining
    //    bytes (`str::len` counts UTF-8 bytes; marker > 50% of original), the page is "mostly mojibake
// with some decodeable page-furniture" (e.g. metro-korea.pdf has
// header text in a separate font + body that is Identity-H CID).
// Force ratio downward to trigger OCR fallback (parent spec §1.3 intent).
if had_marker {
let stripped_chars = s.len().saturating_sub(cleaned.len());
if stripped_chars > cleaned.len() {
// Marker dominates — cap ratio at 0.3 (below 0.5 OCR threshold).
// The 0.3 cap (not 0.0) preserves a small signal that some text
// WAS decodeable, useful for downstream metrics if ever exposed.
let mut total = 0u32;
let mut valid = 0u32;
for c in cleaned.chars() {
total += 1;
if is_valid_text_char(c) {
valid += 1;
}
}
let raw_ratio = if total == 0 { 0.0 } else { valid as f32 / total as f32 };
return raw_ratio.min(0.3);
}
}
// 4) Otherwise compute ratio on cleaned text (existing logic).
let mut total = 0u32;
let mut valid = 0u32;
for c in cleaned.chars() {
total += 1;
if is_valid_text_char(c) {
valid += 1;
}
}
if total == 0 {
return 0.0;
}
valid as f32 / total as f32
}
```
**Invariants preserved**:
- Function signature and return type unchanged (→ byte-identical caller surface).
- Existing character category logic (hangul, CJK, Latin-1) unmodified.
- Empty-string behavior (return 0.0) preserved.
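One property of the snippet as written is worth a sketch: `str::len` counts UTF-8 bytes, so the dominance comparison is byte-based even though the comments speak of chars. The two helper functions below are hypothetical and exist only to contrast the two measures; whether the byte-based comparison is the intended one is not stated in this spec.

```rust
const MARKER: &str = "?Identity-H Unimplemented?";

/// Dominance check as in the snippet: compares UTF-8 byte lengths.
fn marker_dominant_by_bytes(s: &str) -> bool {
    let cleaned = s.replace(MARKER, "");
    let stripped = s.len().saturating_sub(cleaned.len());
    stripped > cleaned.len()
}

/// Alternative measure for comparison: compares char counts.
fn marker_dominant_by_chars(s: &str) -> bool {
    let cleaned = s.replace(MARKER, "");
    let remaining = cleaned.chars().count();
    let stripped = s.chars().count().saturating_sub(remaining);
    stripped > remaining
}
```

For ASCII remainders the two measures agree. For a remainder of 10 Hangul syllables (30 bytes) next to one 26-byte marker, the byte-based check reports no dominance while the char-based one does, so pages with a multi-byte remainder need proportionally more marker text before the 0.3 cap applies.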
### §4.2 Bug #6: Unit tests
Replace existing Bug #6 test set with two new tests reflecting marker-dominance heuristic:
```rust
#[test]
fn identity_h_marker_dominance_caps_ratio_below_threshold() {
    // metro-korea.pdf-class: 20× marker (520 chars) + 12-char ASCII header (incl. trailing space).
    // Without dominance heuristic: ratio = 12/12 = 1.0 (bypasses OCR).
// With dominance heuristic: ratio ≤ 0.3 (triggers OCR fallback).
let s = format!("Page 1 of 5 {}", "?Identity-H Unimplemented?".repeat(20));
let r = compute_valid_char_ratio(&s);
assert!(r <= 0.3, "marker-dominant mixed page → ratio ≤ 0.3 (OCR fallback); got {r}");
}
#[test]
fn identity_h_marker_minority_with_long_valid_text_keeps_high_ratio() {
// Inverse case: short marker noise + long valid text → ratio stays high
// (no false OCR trigger on otherwise-good pages).
let header = "x".repeat(200); // 200 char valid ASCII
let s = format!("{header} ?Identity-H Unimplemented?"); // 1× marker = 26 char
let r = compute_valid_char_ratio(&s);
assert!(r > 0.9, "marker-minority page keeps high ratio; got {r}");
}
```
**Regression preservation**: Existing 8 tests (`empty_string_zero`, `pure_ascii_one`, `pure_hangul_syllables_one`, `pure_pua_zero`, `mixed_half`, `cjk_ideograph_valid`, `hangul_jamo_valid`, `f4_fixture_ratio_under_threshold`) must all remain green.
### §4.3 Bug #7: CLI doc-comment diff
**File**: `crates/kebab-cli/src/main.rs` (SearchArgs field, lines ~150-160)
**Change**:
```diff
-/// p9-fb-36: filter by `assets.media_type` kind. Comma-separated. Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`, `image`, `audio`, `other`. Unknown values match nothing
+/// p9-fb-36: filter by `assets.media_type` kind. Comma-separated. Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`, `image`, `audio`, `code`, `other`. Unknown values match nothing
```
### §4.3a Bug #7 integration: `integrations/claude-code/kebab/SKILL.md:57` simultaneous update
Per CLAUDE.md §Wire schema v1 invariant — in-tree integration docs must be synchronized when wire surface changes. This round has no wire schema change, but SKILL.md line 57 exhibits the same regression as §4.3 (Bug #7):
**File**: `integrations/claude-code/kebab/SKILL.md` (line 57)
**Change**:
```diff
-`media` (string array — IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"other"`; alias `"md"` → `"markdown"`)
+`media` (string array — IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"code"` | `"other"`; alias `"md"` → `"markdown"`)
```
### §4.4 Bug #7: CLI help assertion
Add test to `crates/kebab-cli/tests/` (or extend existing help snapshot test):
```rust
#[test]
fn search_help_lists_code_in_media_values() {
let out = std::process::Command::new(env!("CARGO_BIN_EXE_kebab"))
.args(["search", "--help"])
.output()
.expect("kebab search --help");
let stdout = String::from_utf8_lossy(&out.stdout);
assert!(stdout.contains("`code`"), "search --help must list 'code' as accepted --media value");
}
```
### §4.5 Version cascade impact (CLAUDE.md §Versioning cascade)
- **parser_version**: `"pdf-text-v1"` — unchanged. Text-detect threshold is internal metric, not surface.
- **chunker_version**: `"pdf-page-v1.1"` — unchanged (no chunker logic affected).
- **wire schema**: No new fields, no schema version bump. `compute_valid_char_ratio()` is internal to `PdfTextExtractor::extract()`.
---
## §5 Acceptance criteria
- [ ] Text_quality unit test: `identity_h_marker_dominance_caps_ratio_below_threshold` passes.
- [ ] Text_quality unit test: `identity_h_marker_minority_with_long_valid_text_keeps_high_ratio` passes.
- [ ] Regression: All 8 existing text_quality tests remain green (no ratio behavior changes for valid text).
- [ ] CLI help assertion: `cargo test search_help_lists_code_in_media_values` passes.
- [ ] SKILL.md integration: `grep -F '"code"' integrations/claude-code/kebab/SKILL.md` returns ≥1 line.
- [ ] Full workspace test suite: `cargo test --workspace --no-fail-fast -j 1` green (unit + integration), with `cargo clippy --workspace --all-targets -- -D warnings` clean.
- [ ] Fresh binary builds: `CARGO_TARGET_DIR=/build/out/cargo-target/target cargo build --release -p kebab-cli` succeeds.
---
## §6 Risks + open questions
### Identified risks
**R-1 — Other lopdf unimplemented markers** (resolved per critic round 1 probe): lopdf 0.32.0 emits exactly one marker — `?Identity-H Unimplemented?` at `lopdf-0.32.0/src/document.rs:523` (`Document::decode_text`). Other CMap encoding arms (`UniCNS`, `UniJIS`, `UniKS`, `GBK-EUC`, `Adobe-*`) fall through to `String::from_utf8_lossy(bytes)` → PUA / replacement-char territory (already covered by `pure_pua_zero` test). Marker array adequacy = OK for current lopdf pin. **Re-verify on lopdf dependency upgrade.**
**R-2 — Whitespace-only edge case after stripping**: Handled by `.trim().is_empty()` check; returns 0.0 as intended.
**R-3 — Version/wire schema impact**: None. text_quality is internal threshold metric, not exposed to wire schema or version cascade.
**R-4 — Other `--media` help locations** (revised per critic): the `--media` value list is scattered across 3 surfaces: `crates/kebab-cli/src/main.rs:157-159` (CLI doc-comment, covered by §4.3), `integrations/claude-code/kebab/SKILL.md:57` (skill doc, covered by §4.3a), `crates/kebab-cli/tests/cli_help_smoke.rs` (test, covered by §4.4). Plan drafter to run `grep '\bmedia\b' integrations/ crates/kebab-cli/src docs/wire-schema/v1` to confirm no additional surfaces exist.
**R-5 — Bulk mode media field parsing**: `crates/kebab-app/src/bulk.rs:161` handles media field parsing independently; string doc-comment update does not affect functional correctness.
### Open questions
**OQ-1 — Marker case sensitivity**: Does lopdf always emit markers in exact case `?Identity-H Unimplemented?`? Verify with lopdf source. If case variations exist, use case-insensitive matching or extend array.
**OQ-2 — Marker stripping threshold policy** (resolved via §4.1 marker-dominance heuristic): When stripped characters exceed remaining characters, ratio is capped at 0.3 to trigger OCR fallback. This ensures marker-dominant mixed pages (e.g., 99% marker + 1% valid header) do not bypass OCR despite the header's high ratio. Design intent (parent spec §1.3) is upheld: all mojibake pages trigger OCR fallback.
**OQ-3 — Alias expansion scope**: Bug #7 explicitly omits new aliases (e.g., `src` → `code`). Single additive fix to doc-comment, no enum variant changes.
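The marker-dominance heuristic from OQ-2 and the whitespace edge case from R-2 can be sketched as follows. This is a minimal std-only sketch, not the actual `text_quality.rs` code: the `is_valid_char` predicate (reject U+FFFD, Private Use Area code points, and control characters) and the constant names are assumptions made for illustration. Only the marker string, the 0.3 cap, and the `.trim().is_empty()` → 0.0 rule come from this spec.

```rust
/// Marker that lopdf 0.32.0 emits for Identity-H text it cannot decode (see R-1).
const UNIMPLEMENTED_MARKERS: [&str; 1] = ["?Identity-H Unimplemented?"];

/// Cap applied when stripped marker text outweighs the remaining text (see OQ-2).
const MARKER_DOMINANCE_CAP: f64 = 0.3;

/// ASSUMPTION: a char counts as "valid" unless it is U+FFFD, a Private Use
/// Area code point, or a control char. The real predicate may differ.
fn is_valid_char(c: char) -> bool {
    let pua = matches!(
        c as u32,
        0xE000..=0xF8FF | 0xF0000..=0xFFFFD | 0x100000..=0x10FFFD
    );
    !(c == '\u{FFFD}' || pua || c.is_control())
}

fn compute_valid_char_ratio(text: &str) -> f64 {
    // Strip every known marker before measuring anything.
    let mut stripped = text.to_string();
    for marker in UNIMPLEMENTED_MARKERS.iter() {
        stripped = stripped.replace(*marker, "");
    }
    // R-2: nothing but whitespace left after stripping means no usable text.
    if stripped.trim().is_empty() {
        return 0.0;
    }
    let remaining = stripped.chars().count();
    let removed = text.chars().count() - remaining;
    // Ratio of valid chars among the non-whitespace chars that remain.
    let considered: Vec<char> = stripped.chars().filter(|c| !c.is_whitespace()).collect();
    let valid = considered.iter().filter(|c| is_valid_char(**c)).count();
    let ratio = valid as f64 / considered.len() as f64;
    // OQ-2: marker-dominant pages are capped so they still route to OCR.
    if removed > remaining {
        ratio.min(MARKER_DOMINANCE_CAP)
    } else {
        ratio
    }
}
```

With this shape, a page that is mostly marker text plus a short clean header cannot exceed 0.3, while a single stray marker in a long valid page leaves the ratio untouched.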
### UX consequence — pre-bugfix2 v0.20 user's `--force` re-ingest
This round preserves version cascade (no `parser_version` bump). The `try_skip_unchanged` path will match files indexed pre-bugfix2 with same `parser_version="pdf-text-v1"` + hash. Pre-indexed `metro-korea.pdf`-class pages will NOT automatically re-route through the corrected text-detect → OCR fallback.
**User action required**: Explicit `kebab ingest --force-reingest <workspace>` to purge cached skip decisions and re-process affected files.
**Release notes** (v0.20.1 or whichever version ships this bugfix) **MUST include**: "If you indexed mojibake-heavy PDFs (esp. metro-korea.pdf class) on v0.20.0 pre-bugfix2, run `kebab ingest --force-reingest <workspace>` to apply the improved text detection. Otherwise, `ingest` will skip unchanged files and OCR fallback will not trigger." + link to design §9 cascade explanation.
**Documentation updates** (same PR as code): README + HANDOFF + ARCHITECTURE per `feedback_readme_sync_rule` memory — mention the `--force-reingest` step in release highlights or migration notes.
Deliberate design: automatic migration risks wedging stable v0.20.0 KBs. Manual `--force-reingest` is the correct escape hatch (parent spec §1.7 line 126-128 precedent).
---
## §7 References
- **Parent spec**: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md §1.3 (line 74), §1.4, §9
- **Dogfood evidence**: .omc/reviews/2026-05-27-v0.20-bugfix-dogfood-report.md §Bug #6, §Bug #7
- **Critic result**: .omc/reviews/2026-05-27-v0.20-bugfix2-spec-critic-r1-result.md (findings H-1 through NIT-2, parent invariant audit)
- **External source**: lopdf-0.32.0/src/document.rs:523 (`Document::decode_text` — sole emitter of `?Identity-H Unimplemented?` marker)
- **Code locations**:
  - text_quality.rs: `crates/kebab-parse-pdf/src/text_quality.rs:9-106`
  - CLI help: `crates/kebab-cli/src/main.rs:157-159`
  - Skill integration: `integrations/claude-code/kebab/SKILL.md:57`
  - CLI test: `crates/kebab-cli/tests/` (search_help_lists_code_in_media_values)
---
**Status**: Round 1c rewrite COMPLETE. All 9 critic findings (H-1 + M-1/M-2/M-3 + L-1/L-2 + NIT-1/NIT-2 + invariant audit) applied in-session.
**Prior round reference**: Round 1 commits (d9acda5, 436fd01, 241ded5, e674ff4) are merged on branch; this round is independent (text_quality.rs vs. source-fs/connector.rs + chunk/pdf_page_v1.rs).

@@ -0,0 +1,410 @@
---
title: "v0.20.0 sub-item 1 bugfix round 3 — final-dogfood findings"
created: 2026-05-27
status: DRAFT
round: 1c
parent_spec: 2026-04-27-kebab-final-form-design.md
contract_sections:
- "1.1 (ask streaming)"
- "2.2 (error handling)"
- "2.4 (JSON wire schema)"
- "3.1 (config XDG)"
- "4.1 (capabilities schema)"
source_report: .omc/reviews/2026-05-27-v0.20-final-dogfood-report.md
---
# v0.20.0 sub-item 1 bugfix round 3 — final-dogfood findings
Fix design for the **5 bugs** found in the post-bugfix2 final dogfood (2026-05-27). PR #189 force-update (base=main). Spec scope: root cause + fix decision + acceptance criteria + parent spec traceability. Bug #12 was falsified (out of scope). All 5 fixes range from trivial to a small refactor (existing 1350 tests + 5 or more new tests).
---
## §1 Problem statement
### §1.1 Bug #9: capabilities false negative (Critical)
In `kebab schema --json`, both `capabilities.streaming_ask` and `capabilities.single_file_ingest` are hardcoded to `false`. The actual implementation, however:
- `kebab ask --stream` → emits `answer_event.v1` ndjson events correctly (191 events verified).
- `kebab ingest-file <path>` → emits `ingest_report.v1` correctly for new and updated files.
- `kebab ingest-stdin --title <T>` → works correctly.
**Impact**: agents such as an MCP host or the Claude Code skill get a false negative when they make routing decisions from `capabilities: { streaming_ask: false, single_file_ingest: false }`. Users conclude that features which actually work are unavailable.
### §1.2 Bug #10: config fail-fast (UX)
```bash
kebab search "rust" --config /tmp/nonexistent.toml --json
# exit=0, {"hits":[],"schema_version":"search_response.v1"}
```
When an explicit path is missing, the CLI silently falls back to the default config (XDG path). This is a debugging nightmare: a typo or wrong path surfaces only as 0 hits.
### §1.3 Bug #11: OCR timeout 600s (Critical UX)
`config.pdf.ocr.request_timeout_secs = 600` (default of 10 min/page). Evidence from the metro-korea.pdf dogfood:
- On page 8 and page 13, slow responses from the remote Ollama ran into the full 600s timeout.
- Result: `ms: 600000, chars: 0, skipped: true` was emitted → page body not indexed + 20 minutes wasted.
**Production impact**: the user gets no ingest-completion signal, and some pages are unsearchable.
### §1.4 Bug #13: schema.models single value (UX)
```json
{
"chunker_version": "md-heading-v1",
"parser_version": "md-frontmatter-v2",
...
}
```
The corpus, however, has multiple active versions:
- parsers: `md-frontmatter-v2`, `pdf-text-v1`, `code-rust-v1`, `code-python-v1`, `none-v1`.
- chunkers: `md-heading-v1`, `pdf-page-v1.1`, `code-rust-ast-v1`, `code-python-ast-v1`, `dockerfile-file-v1`, `k8s-manifest-resource-v1`, `manifest-file-v1`, `code-text-paragraph-v1`.
**Impact**: users reading `kebab schema` cannot fully identify the active versions, and a version cascade audit risks missing some.
### §1.5 Bug #14: empty query silent (Minor UX)
```bash
kebab search "" --json
# exit=0, {"hits":[],"next_cursor":null,"schema_version":"search_response.v1"}
```
An empty (or whitespace-only) query silently returns 0 hits. This is a user mistake, so an explicit error is the consistent behavior.
---
## §2 Scope + non-scope
### §2.1 Included: 5 bug fix
| Bug | Category | Severity | Fix type |
|-----|----------|----------|----------|
| #9 | wire schema | critical | capability flag hardcoded boolean → actual feature check |
| #10 | config UX | medium | silent fallback → error.v1 with config_not_found |
| #11 | OCR config | critical | default 600s → 60s timeout |
| #13 | wire schema | medium | single field → additive array fields (backward compat) |
| #14 | input validation | minor | empty query silent → error.v1 with invalid_input |
### §2.2 Out of scope
- **Bug #12 (falsified)**: `inspect doc` shows a "?" placeholder in blocks[].text for the code parser. Root cause: the content is emitted correctly in the `.code` field, not `.text`. The user workflow can access it via `.code` → outside this spec's scope.
- Other axes from dogfood report §12 (ranking bias, multi-root caveat) → separate phase.
---
## §3 Decisions
### §3.1 Bug #9: capabilities 정정
**Decision**: update the two fields in `schema.rs::capabilities_snapshot()` to `true`.
```rust
fn capabilities_snapshot() -> Capabilities {
    Capabilities {
        json_mode: true,
        ingest_progress: true,
        ingest_cancellation: true,
        rag_multi_turn: true,
        search_cache: true,
        incremental_ingest: true,
        streaming_ask: true,      // ← WAS FALSE, actual TRUE
        http_daemon: false,       // ← preserved (not-impl, separate sub-item)
        mcp_server: true,
        single_file_ingest: true, // ← WAS FALSE, actual TRUE
        bulk_search: true,
    }
}
```
**Rationale**: the actual implementation supports production-grade streaming ask + single-file ingest. The schema report must match reality for agent routing to be accurate.
### §3.2 Bug #10: config_not_found error
**Decision**: `kebab-config` defines its own error type `ConfigNotFound`, and `kebab-app::error_wire` adds a classify arm.
Pseudo-code:
```rust
// crates/kebab-config/src/lib.rs (or an appropriate error module)
use std::path::{Path, PathBuf};

#[derive(Debug, thiserror::Error)]
#[error("config file does not exist: {path}")]
pub struct ConfigNotFound {
    pub path: PathBuf,
}

// Inside Config::load:
impl Config {
    pub fn load(opt_path: Option<&Path>) -> anyhow::Result<Config> {
        match opt_path {
            // An explicit path that does not exist fails fast instead of falling back.
            Some(p) if !p.exists() => Err(anyhow::Error::new(ConfigNotFound { path: p.to_path_buf() })),
            Some(p) => Self::from_file(p),
            None => Self::from_xdg_default_or_defaults(),
        }
    }
}
```
Classify arm in `kebab-app/src/error_wire.rs`:
```rust
if let Some(e) = err.downcast_ref::<kebab_config::ConfigNotFound>() {
    return ErrorV1 {
        schema_version: ERROR_V1_ID.to_string(),
        code: "config_not_found".to_string(),
        message: format!("config file does not exist: {}", e.path.display()),
        // display() avoids a serialization failure on non-UTF-8 paths.
        details: json!({ "path": e.path.display().to_string() }),
        hint: Some("verify --config argument; use --config to point to a writable toml file, or omit to use XDG default".to_string()),
    };
}
```
**Exit code**: 2 (config error, not 0 silent).
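The fail-fast check plus the downcast-based classification can be illustrated without `anyhow` or `thiserror`. The following is a std-only sketch under stated assumptions: `check_explicit_config` is a hypothetical stand-in for the path check inside `Config::load`, `classify` is a hypothetical reduction of the `error_wire` arm to a `(code, exit code)` pair, and the `("internal", 1)` fallback is invented for the example.

```rust
use std::error::Error;
use std::fmt;
use std::path::{Path, PathBuf};

#[derive(Debug)]
struct ConfigNotFound {
    path: PathBuf,
}

impl fmt::Display for ConfigNotFound {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "config file does not exist: {}", self.path.display())
    }
}

impl Error for ConfigNotFound {}

/// Hypothetical stand-in for the check in Config::load: an explicit path
/// that does not exist is an error; no path means "use the XDG default".
fn check_explicit_config(opt_path: Option<&Path>) -> Result<(), Box<dyn Error>> {
    match opt_path {
        Some(p) if !p.exists() => Err(Box::new(ConfigNotFound { path: p.to_path_buf() })),
        _ => Ok(()),
    }
}

/// Hypothetical classifier: maps an error to (wire code, process exit code).
/// The fallback pair is an assumption, not part of this spec.
fn classify(err: &(dyn Error + 'static)) -> (&'static str, i32) {
    if err.downcast_ref::<ConfigNotFound>().is_some() {
        ("config_not_found", 2)
    } else {
        ("internal", 1)
    }
}
```

The typed error keeps the path check in `kebab-config` free of wire-schema concerns, while the classifier recovers the concrete type at the boundary.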
### §3.3 Bug #11: OCR timeout 60s
**Decision**: reduce `default_pdf_ocr_request_timeout_secs()` from 600 to 60.
```rust
fn default_pdf_ocr_request_timeout_secs() -> u64 {
    60 // 1 min, production-friendly per dogfood evidence
}
```
**Doc-comment to add**:
```rust
/// Default OCR request timeout in seconds. Most pages complete in 6-32s.
/// Set to upper-bound valid throughput; exceeding 60s may indicate
/// Ollama unavailability or very dense/high-res pages.
/// Override via [pdf.ocr] request_timeout_secs = N in config.toml.
```
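How the new default interacts with a user override can be sketched as below. `effective_ocr_timeout` is a hypothetical helper, and its `configured` parameter models the optional `[pdf.ocr] request_timeout_secs` value after config parsing; the real config struct may wire this differently (for example through a serde default).

```rust
use std::time::Duration;

fn default_pdf_ocr_request_timeout_secs() -> u64 {
    60
}

/// Hypothetical helper: an explicit config value wins, otherwise the
/// 60s default applies.
fn effective_ocr_timeout(configured: Option<u64>) -> Duration {
    Duration::from_secs(configured.unwrap_or_else(default_pdf_ocr_request_timeout_secs))
}
```

A user who still needs the old behavior for dense pages would pass `Some(600)` here, which corresponds to setting `request_timeout_secs = 600` in config.toml.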
### §3.4 Bug #13: active_parsers + active_chunkers (additive)
**Decision**: additive minor change to the wire schema: add two arrays to the `Models` struct and keep the existing single fields (backward compat). `kebab-store-sqlite` provides the fetch methods.
**Store API** (crates/kebab-store-sqlite/src/lib.rs):
```rust
impl SqliteStore {
    /// SELECT DISTINCT parser_version FROM documents WHERE parser_version IS NOT NULL ORDER BY parser_version
    pub fn fetch_distinct_parser_versions(&self) -> anyhow::Result<Vec<String>> {
        let conn = self.conn()?;
        let mut stmt = conn.prepare(
            "SELECT DISTINCT parser_version FROM documents
             WHERE parser_version IS NOT NULL
             ORDER BY parser_version"
        )?;
        let rows = stmt.query_map([], |row| row.get::<_, String>(0))?;
        let mut out = Vec::new();
        for r in rows { out.push(r?); }
        Ok(out)
    }

    /// Same query shape, against chunks.chunker_version.
    pub fn fetch_distinct_chunker_versions(&self) -> anyhow::Result<Vec<String>> {
        let conn = self.conn()?;
        let mut stmt = conn.prepare(
            "SELECT DISTINCT chunker_version FROM chunks
             WHERE chunker_version IS NOT NULL
             ORDER BY chunker_version"
        )?;
        let rows = stmt.query_map([], |row| row.get::<_, String>(0))?;
        let mut out = Vec::new();
        for r in rows { out.push(r?); }
        Ok(out)
    }
}
```
**Models struct** (crates/kebab-app/src/schema.rs):
```rust
pub struct Models {
    /// Deprecated since v0.20.1. Use active_parsers for multi-parser corpus.
    /// Reports default parser version (markdown path).
    pub parser_version: String,
    /// Deprecated since v0.20.1. Use active_chunkers for multi-chunker corpus.
    pub chunker_version: String,
    /// All parser versions active in corpus (v0.20.1+). May be empty if corpus is empty.
    pub active_parsers: Vec<String>,
    /// All chunker versions active in corpus (v0.20.1+). May be empty if corpus is empty.
    pub active_chunkers: Vec<String>,
    pub embedding_version: String,
    pub prompt_template_version: String,
    pub index_version: String,
    pub corpus_revision: u64,
}
```
**Computation** (crates/kebab-app/src/schema.rs::collect_models):
```rust
let store = open_store_for_stats(cfg)?;
let active_parsers = store.fetch_distinct_parser_versions().unwrap_or_default();
let active_chunkers = store.fetch_distinct_chunker_versions().unwrap_or_default();
Ok(Models {
    parser_version: active_parsers.first().cloned().unwrap_or_else(|| kebab_parse_md::PARSER_VERSION.to_string()),
    chunker_version: active_chunkers.first().cloned().unwrap_or_else(|| kebab_chunk::md_heading_v1::VERSION_LABEL.to_string()),
    active_parsers,
    active_chunkers,
    ...
})
```
**Fallback**: the markdown fallback is kept. The existing hardcoded `parser_version` + `chunker_version` are preserved (backward compat).
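The `.first().cloned().unwrap_or_else(...)` expression used for the single fields can be exercised in isolation. In this std-only sketch, `legacy_single_version` is a hypothetical helper and `MD_PARSER_VERSION` is an assumed stand-in for `kebab_parse_md::PARSER_VERSION` (its value is taken from the `parser_version` shown in §1.4). Because the store queries use `ORDER BY`, `.first()` returns the lexicographically smallest label in a mixed corpus, which is not necessarily the markdown one; whether that matches the intended markdown fallback is worth confirming during plan drafting.

```rust
/// ASSUMPTION: stands in for kebab_parse_md::PARSER_VERSION.
const MD_PARSER_VERSION: &str = "md-frontmatter-v2";

/// Hypothetical helper mirroring
/// `active.first().cloned().unwrap_or_else(|| default.to_string())`.
/// `active_sorted` is expected in the store's ORDER BY (ascending) order.
fn legacy_single_version(active_sorted: &[String], default: &str) -> String {
    active_sorted
        .first()
        .cloned()
        .unwrap_or_else(|| default.to_string())
}
```

On an empty corpus the helper returns the markdown default; on a corpus holding `code-python-v1`, `md-frontmatter-v2`, and `pdf-text-v1` it returns `code-python-v1`.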
### §3.5 Bug #14: empty query validation
**Decision**: add an empty-query check that emits error.v1 to both the `search` and `ask` commands.
**Search command** (crates/kebab-cli/src/main.rs::search arm):
```rust
if let Some(q) = query.as_ref() {
    if q.trim().is_empty() {
        return Err(anyhow::Error::new(kebab_app::StructuredError(ErrorV1 {
            schema_version: ERROR_V1_ID.to_string(),
            code: "invalid_input".to_string(),
            message: "query is empty; provide a non-empty search term or use --bulk".into(),
            details: Value::Null,
            hint: Some("e.g. `kebab search 'rust async'` or `kebab search --bulk < queries.ndjson`".into()),
        })));
    }
}
```
**Ask command** (crates/kebab-cli/src/main.rs::ask arm):
```rust
if query.trim().is_empty() {
    return Err(anyhow::Error::new(kebab_app::StructuredError(ErrorV1 {
        schema_version: ERROR_V1_ID.to_string(),
        code: "invalid_input".to_string(),
        message: "query is empty; provide a non-empty prompt".into(),
        details: Value::Null,
        hint: Some("e.g. `kebab ask 'explain this code'`".into()),
    })));
}
```
Both commands now validate; no silent fallback.
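Since both arms apply the same `trim().is_empty()` rule, the check could be factored into one helper. The sketch below is std-only and hypothetical: `validate_query` and the `InvalidInput` type are illustration names, standing in for the `StructuredError(ErrorV1 { code: "invalid_input", .. })` value built above.

```rust
/// Hypothetical stand-in for the error.v1 payload with code=invalid_input.
#[derive(Debug, PartialEq)]
struct InvalidInput {
    code: &'static str,
    message: String,
}

/// Rejects empty and whitespace-only queries; returns the query unchanged otherwise.
fn validate_query(query: &str) -> Result<&str, InvalidInput> {
    if query.trim().is_empty() {
        Err(InvalidInput {
            code: "invalid_input",
            message: "query is empty".to_string(),
        })
    } else {
        Ok(query)
    }
}
```

Calling the same helper from the `search` and `ask` arms would keep AC-5 and AC-6 from drifting apart, at the cost of per-command messages and hints needing to be attached by the caller.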
---
## §4 Implementation specification
### §4.1 Files to modify
1. **Bug #9 capability fix**: `crates/kebab-app/src/schema.rs`
   - line 137-151: `capabilities_snapshot()`: flip `streaming_ask: false` → `true`, `single_file_ingest: false` → `true`.
   - add test: `capabilities_streaming_ask_matches_cli_surface()`.
   - add test: `capabilities_single_file_ingest_matches_cli_surface()`.
2. **Bug #10 config_not_found**: Two files
   - `crates/kebab-config/src/lib.rs`:
     - Define `ConfigNotFound` error struct (with `#[derive(Debug, thiserror::Error)]`).
     - Modify `Config::load(opt_path: Option<&Path>)`: path existence check, `return Err(anyhow::Error::new(ConfigNotFound { ... }))`.
     - add test: `config_load_explicit_nonexistent_path_returns_error()`.
   - `crates/kebab-app/src/error_wire.rs`:
     - Add classify arm after existing `ConfigInvalid` case.
     - Map `kebab_config::ConfigNotFound` → `ErrorV1 { code: "config_not_found", ... }`.
3. **Bug #13 schema.models**: Three components
   - `crates/kebab-store-sqlite/src/lib.rs`:
     - Implement `fetch_distinct_parser_versions()`: SQL SELECT DISTINCT on documents.parser_version + ORDER BY.
     - Implement `fetch_distinct_chunker_versions()`: SQL SELECT DISTINCT on chunks.chunker_version + ORDER BY.
   - `crates/kebab-app/src/schema.rs`:
     - Modify `Models` struct: add `active_parsers: Vec<String>`, `active_chunkers: Vec<String>` fields.
     - Modify computation logic (`collect_models` or equiv): call store methods, populate arrays, fallback to markdown defaults for single fields.
     - add test: `schema_models_active_arrays_empty_on_empty_corpus()`.
     - add test: `schema_models_active_arrays_populated_after_mixed_ingest()`.
   - `docs/wire-schema/v1/schema.schema.json`:
     - `Models` object: add `"active_parsers": { "type": "array", "items": { "type": "string" } }`.
     - add `"active_chunkers": { "type": "array", "items": { "type": "string" } }`.
     - Mark deprecated in comment: `parser_version` + `chunker_version` (additive, backward compat).
4. **Bug #14 empty query validation**: `crates/kebab-cli/src/main.rs`
   - search command arm: add `if query.trim().is_empty()` check → error.v1 code=invalid_input.
   - ask command arm: add identical `if query.trim().is_empty()` check → error.v1 code=invalid_input.
5. **Wire schema v1 doc update**: `docs/wire-schema/v1/`
   - Update schema doc to note `active_parsers` / `active_chunkers` optional (additive).
6. **Integration**: `integrations/claude-code/kebab/SKILL.md`
   - Update `schema.models` surface docs: reference new `active_*` arrays for multi-version corpora.
7. **Tests** (new or extended):
   - `crates/kebab-cli/tests/`: invalid --config path (absolute + relative) → error.v1 + exit≠0.
   - `crates/kebab-cli/tests/`: empty query (search + ask) → error.v1 code=invalid_input + exit≠0.
   - `crates/kebab-config/tests/`: config file not found → ConfigNotFound error.
   - `crates/kebab-app/tests/`: mixed corpus schema: active_parsers/chunkers include all ingested versions.
### §4.2 Regression checks
- Existing 1350 workspace tests: `cargo test --workspace --no-fail-fast -j 1` must pass green.
- All non-bug capabilities (json_mode, ingest_progress, ingest_cancellation, rag_multi_turn, search_cache, incremental_ingest, mcp_server, bulk_search) stay true.
- Default config path resolution (no --config) unchanged — silent fallback to XDG only if `--config` not passed.
- Relative path behavior (cwd-relative, Rust std path::Path::exists()) preserved.
- Empty corpus → empty `active_parsers` / `active_chunkers` array (not null, not error).
- Existing hardcoded `parser_version` + `chunker_version` fields continue to report markdown defaults (backward compat).
- Schema version bump not required (wire schema additive minor, backward compat).
---
## §5 Acceptance criteria
| # | Criterion | Evidence |
|----|-----------|----------|
| AC-1 | `kebab schema --json` emit `streaming_ask: true` + `single_file_ingest: true` | `cargo test -p kebab-app capabilities_* -j 4` green |
| AC-2 | `kebab search "x" --config /nonexistent.toml --json` emit exit≠0 + error.v1 code=config_not_found | `cargo test -p kebab-config config_load_explicit_nonexistent_path_returns_error -j 4` green |
| AC-3 | `default_pdf_ocr_request_timeout_secs()` returns 60 | `cargo test -p kebab-config pdf_ocr_request_timeout_default_is_60s -j 4` green (unit test; no manual timing) |
| AC-4 | After mixed ingest (MD + PDF + code), `kebab schema --json` emits both `active_parsers` + `active_chunkers` arrays containing all versions | integration test pass |
| AC-5 | `kebab search "" --json` and `kebab search " " --json` both emit exit≠0 + error.v1 code=invalid_input | integration test pass |
| AC-6 | `kebab ask "" --json` emit exit≠0 + error.v1 code=invalid_input (ask symmetry) | integration test pass |
| AC-7 | `kebab search "rust" --config nonexistent-relative.toml --json` (relative path) emit exit≠0 + error.v1 code=config_not_found | integration test pass |
| AC-8 | All 1350+ workspace tests pass; no new failures | `cargo test --workspace --no-fail-fast -j 1` exit=0 |
| AC-9 | Wire schema backward compat: old clients reading `parser_version` + `chunker_version` still work; `active_*` arrays optional per schema | JSON schema `additionalProperties: false` review |
| AC-10 | `kebab ask --stream` still works; streaming events emitted (no regression) | manual `kebab ask --stream "explain this" 2>&1 \| head -3` |
---
## §6 Risks + resolutions
### Risks
- **R-1** (Bug #10): Relative path `./config.toml` must resolve from cwd, not from binary location. **Resolution**: Rust `std::path::Path::exists()` is cwd-relative; no workaround needed.
- **R-2** (Bug #13): Empty corpus → empty `active_parsers` / `active_chunkers` array. **Resolution**: Unit test `schema_models_active_arrays_empty_on_empty_corpus()` mandated (AC-4).
- **R-3** (resolved): `collect_models` uses no cache (every-call re-computation). `active_parsers/chunkers` reflect corpus state at invocation time. If future caching is added, `corpus_revision` increment signals invalidation — document at that time.
- **R-4** (Bug #14): `ask` command validation — covered by same fix (§3.5 mandates both search + ask).
- **R-5** (Bug #11): 60s may still timeout on very dense/high-res pages. **Mitigation**: User can override via `config.toml [pdf.ocr] request_timeout_secs = N`. Release notes explicitly call this out.
---
## §7 Parent spec deviation (HOTFIXES handoff)
**F-11 MEDIUM finding**: parent spec `2026-04-27-kebab-final-form-design.md` (frozen) specifies PDF OCR request_timeout_secs = 600s (§1000 + §1628 OQ-1, rationale: "5x headroom over the 105s observed in a CPU environment"). Bug #11 (dogfood evidence) contradicts this: 600s causes timeouts, and 60s is production-optimal.
**Deviation handling**:
1. Parent spec stays frozen (no edits).
2. **HOTFIXES entry (executor Step N)**: `tasks/HOTFIXES.md` receives dated entry:
```markdown
2026-05-27 — PDF OCR request_timeout_secs default 600s → 60s (v0.20.0 bugfix3 dogfood evidence). Bug #11.
```
3. **Parent spec cross-link (executor Step N)**: parent spec `2026-04-27-kebab-final-form-design.md` receives inline comment at §1000 (default value code block) or §1628 (OQ-1 paragraph):
```markdown
<!-- HOTFIX 2026-05-27: default 60s (Bug #11). See tasks/HOTFIXES.md 2026-05-27 entry. -->
```
**Parent spec invariant**: No changes to parent spec text; only cross-link comment + HOTFIXES.md entry. Frozen design contract preserved.
---
## §8 References
- [Dogfood report](../../../.omc/reviews/2026-05-27-v0.20-final-dogfood-report.md) — 5 bugs discovered + decisions.
- [Parent spec (frozen contract)](2026-04-27-kebab-final-form-design.md) — §1, §2, §4 (capabilities, error handling, JSON schema, config XDG).
- `crates/kebab-app/src/schema.rs:137-151` (capabilities_snapshot).
- `crates/kebab-config/src/lib.rs` (Config::load, default_pdf_ocr_request_timeout_secs).
- `crates/kebab-app/src/error_wire.rs` (classify ConfigNotFound).
- `crates/kebab-store-sqlite/src/lib.rs` (fetch_distinct_parser_versions, fetch_distinct_chunker_versions).
- `crates/kebab-cli/src/main.rs` (search + ask query validation).
- `docs/wire-schema/v1/schema.schema.json` (Models + Capabilities objects).
- `tasks/HOTFIXES.md` (2026-05-27 entry, Bug #11 deviation record).