docs(superpowers): v0.20 sub-item 1 bugfix1/2/3 specs + plans + DOGFOOD.md

Artifacts of the 3-round dogfood-driven fix cycle:

- bugfix1 (Bug #2/#3/#4): spec 964 lines + plan 848 lines
- bugfix2 (Bug #6/#7, #8 falsified): spec 308 lines + plan 388 lines
- bugfix3 (Bug #9/#10/#11/#13/#14, #12 falsified): spec 410 lines + plan 1043 lines
- docs/DOGFOOD.md: the full end-to-end dogfood checklist (§0 environment through §13 reference corpus)

The spec and plan of each round were frozen after the critic + verifier round 2 closure ACCEPT. Both are grounded in dogfood-driven evidence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 01:21:34 +00:00
parent 9b44e27dfe
commit 46e99470eb
7 changed files with 4794 additions and 0 deletions


@@ -0,0 +1,965 @@
---
title: v0.20.0 sub-item 1 bugfix — chunk_id collision + walker code limit + F4 fixture
created: 2026-05-27
status: ACCEPT (round 2 closure — Phase A complete)
target_version: 0.20.0 (PR #189 force-update)
parent_spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
dogfood_evidence: .omc/reviews/2026-05-27-v0.20-dogfood-report.md
review_history:
- "2026-05-27 spec round 1 critic (opus, thorough) — ACCEPT, HIGH 0 + MEDIUM 3 + LOW 2 + NIT 2"
- "2026-05-27 spec round 1c rewrite (opus, drafter) — MEDIUM/LOW/NIT all applied"
- "2026-05-27 spec round 2 closure critic (opus) — ACCEPT, 7/7 applied + 1 NIT (frontmatter status, applied here)"
---
# v0.20.0 sub-item 1 bugfix — chunk_id collision + walker code limit + F4 fixture
This spec documents the root-cause analysis, fix design, and acceptance criteria for the
3 bugs found while dogfooding PR #189 of v0.20.0 sub-item 1 (scanned PDF OCR).
It is the source for the follow-up plan and executor phases.
## §1 Background + dogfood evidence chain
### §1.1 Dogfood environment
| Item | Value |
|------|----|
| Binary | `kebab v0.20.0` (commit `b4d9e60`) |
| Ollama endpoint | `http://192.168.0.47:11434` (qwen2.5vl:3b) |
| Isolated KB | `/build/cache/tmp/v0.20-dogfood/` |
| Corpus | 9 PDF (PoC + sub-item fixture + 3 user PDF, 466 KB ~ 58 MB) |
### §1.2 Reproducibility of the 3 bugs
| Bug | Severity | Trigger | Reproducible |
|-----|----------|---------|--------------|
| #2 walker code limit | Important | Ingesting a 256 KB+ PDF/image/markdown file | Always (default config) |
| #3 chunk_id collision | **Critical** | Ingesting scanned_page2.pdf (1580 OCR chars) | On every force-reingest |
| #4 F4 Pages tree | Important | Ingesting mojibake.pdf (F4 fixture) | Always |
### §1.3 Quotes from the dogfood report
Key quotes from the dogfood report (`.omc/reviews/2026-05-27-v0.20-dogfood-report.md`):
- Bug #2: `scanned=3, skipped_size_exceeded=6`. Only 3 of the 9 workspace PDFs passed;
6 PDFs (F1 466KB / F2 756KB / metro 57MB / thermal-pos 1.1MB / thermal-label
2.7MB / internals 820KB) were skipped at the walker stage.
- Bug #3: `"DocumentStore::put_chunks (pdf): sqlite error: UNIQUE constraint
failed: chunks.chunk_id: ... Error code 1555: A PRIMARY KEY constraint
failed"`, raised at the chunk INSERT stage for scanned_page2.pdf.
- Bug #4: `block_count: 0, chunk_count: 0`. The ingest result for F4 mojibake.pdf
violates PdfTextExtractor's "1 paragraph per page" invariant.
## §2 Goals + non-goals
### §2.1 Goals
- Fix all 3 bugs within v0.20.0 (PR #189 force-update path: new commits stack on
the same branch).
- Preserve the invariants of the parent spec
(`docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md`):
  - §1.4: PdfTextExtractor emits "1 Block::Paragraph per page".
  - §3.5: post-extract OCR enrichment preserves block_id (in-place mutate path).
  - §4.6: wire schema changes are additive only (no V00X migration needed).
- Zero conflicts with the design decisions of the parent plan
(`docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md`), round 1c ACCEPT.
- Zero workspace test regressions (`cargo test --workspace -j 1`).
### §2.2 Non-goals
- A new wire schema major bump (v1 → v2): these fixes add zero schema changes.
- A new V00X sqlite migration: the `chunks` table DDL is unchanged, and the fix is limited to
the chunk_id computation path.
- Changing the F4 fixture's invariants (the requirements of no ToUnicode CMap and a valid
1-page PDF stay in place).
- Adding new config knobs (a per-media-type limit such as `[ingest.pdf].max_file_bytes`
is v0.21+ scope; this fix only separates the walker's code path).
## §3 Bug #2 — walker code limit
### §3.1 Root cause (file:line evidence)
`crates/kebab-source-fs/src/connector.rs:42-72`: `FsSourceConnector` reads
`max_file_bytes` and `max_file_lines` from the single `config.ingest.code`
namespace in `Config::new`:
```rust
Ok(Self {
default_root: root,
default_exclude: config.workspace.exclude.clone(),
copy_threshold_bytes,
skip_generated_header: config.ingest.code.skip_generated_header,
max_file_bytes: config.ingest.code.max_file_bytes, // <-- code-specific
max_file_lines: config.ingest.code.max_file_lines, // <-- code-specific
})
```
`crates/kebab-source-fs/src/connector.rs:169-190`: the walker's size check calls
`is_oversized(...)` regardless of the path's media type:
```rust
if crate::code_meta::is_oversized(
&abs_path,
self.max_file_bytes, // generic limit, applied to every file
self.max_file_lines,
).unwrap_or(false) {
fs_skips.skipped_size_exceeded =
fs_skips.skipped_size_exceeded.saturating_add(1);
// ...
continue;
}
```
`crates/kebab-source-fs/src/code_meta.rs:114-129`: `is_oversized(...)` itself is a
generic helper (extension-agnostic):
```rust
pub(crate) fn is_oversized(path: &Path, max_bytes: u64, max_lines: u32) -> Result<bool> {
let meta = std::fs::metadata(path)?;
if meta.len() > max_bytes {
return Ok(true);
}
// line cap (streaming)
...
}
```
`crates/kebab-config/src/lib.rs:535-547`: `IngestCodeCfg::default()` sets
`max_file_bytes = 262_144` (256 KB), which most PDFs and images exceed.
### §3.2 Decision matrix
| Option | Description | Pros | Cons |
|--------|------|------|------|
| **A: code path only** | Apply the walker's size check only to code files (extension recognized by `code_lang_for_path`) | Simple / preserves the existing default behavior / resolves Bug #2 immediately | No size limit for PDF/image/markdown: even a 1 GB PDF passes the walker |
| B: per-type config | Add new `[ingest.pdf]`, `[ingest.image]`, `[ingest.markdown]` sections with per-type limits | User-tunable | 3 new config fields + serde defaults + env overrides + tests, which exceeds the v0.20 hotfix scope |
| C: generic limit + docs note | Keep the same generic limit but document the intent | Zero code changes | UX bug stays unresolved (the dogfood workaround config would be forced onto production) → **rejected** |
### §3.3 Chosen path — Option A
The walker's size cap is intended to be code-specific. The size of PDF/image/markdown files is
validated by each parser on its own (lopdf's load_mem handles PDFs over 256 KB normally,
and the image OCR call is capped by max_pixels). If per-type config becomes necessary in
v0.21+, this evolves into Option B.
Add an `is_code_file(path: &Path) -> bool` helper:
- `code_meta::code_lang_for_path(path).is_some()` means a code file. Reusing the existing
helper keeps the mapping consistent (Tier 1 + Tier 2 basename + extension list are
exactly the same).
### §3.4 Implementation (Rust diff)
`crates/kebab-source-fs/src/code_meta.rs`: add a `pub(crate)` helper:
```rust
/// Returns true when `path`'s filename/extension is recognised as a code file
/// (per `code_lang_for_path`). Used by the walker to apply
/// `[ingest.code].max_file_bytes` / `max_file_lines` only to code files,
/// not to PDF/image/markdown (which have their own size controls in their
/// respective parsers).
pub(crate) fn is_code_file(path: &Path) -> bool {
code_lang_for_path(path).is_some()
}
```
`crates/kebab-source-fs/src/connector.rs:168-190`: add the walker conditional:
```rust
// p10-1A-1: apply per-file generated-header + size-cap checks on files
// that passed the override (gitignore/builtin/kebabignore) matching.
// v0.20.0 sub-item 1 bugfix: size-cap (max_file_bytes / max_file_lines)
// applies ONLY to code files. PDF/image/markdown bypass — their parsers
// have their own size controls.
if crate::code_meta::is_code_file(&abs_path)
&& crate::code_meta::is_oversized(
&abs_path,
self.max_file_bytes,
self.max_file_lines,
)
.unwrap_or(false)
{
fs_skips.skipped_size_exceeded =
fs_skips.skipped_size_exceeded.saturating_add(1);
push_sample(
&mut fs_skips.skip_examples.size_exceeded,
&abs_path,
&root,
);
tracing::debug!(
path = %rel_path.display(),
max_bytes = self.max_file_bytes,
max_lines = self.max_file_lines,
"skip: code file exceeds size cap"
);
continue;
}
```
Conditional application of `skip_generated_header` is a separate matter: the generated-header
sniff looks only at the ASCII content of the first 512 bytes, regardless of path extension.
The first 512 bytes of a binary PDF/image do not contain an ASCII string such as
"do not edit", so there are zero false positives. **Adding a walker conditional around
`is_generated_file` is unnecessary**; the existing behavior is kept.
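A minimal sketch of a sniff of this shape, under stated assumptions: the marker list `MARKERS` and the names `head_looks_generated` / `looks_generated` are hypothetical, and the real `is_generated_file` may differ. It only illustrates why a binary PDF prefix does not match: the check lowercases at most the first 512 bytes and searches them for ASCII markers.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Hypothetical marker list; the real sniffer's markers are not shown in this spec.
const MARKERS: [&str; 2] = ["do not edit", "@generated"];

/// Returns true when an ASCII-lowercased copy of `head` contains any marker.
fn head_looks_generated(head: &[u8]) -> bool {
    let lowered: Vec<u8> = head.iter().map(|b| b.to_ascii_lowercase()).collect();
    MARKERS
        .iter()
        .any(|m| lowered.windows(m.len()).any(|w| w == m.as_bytes()))
}

/// Reads at most the first 512 bytes of `path` and sniffs them.
/// The file extension is never consulted.
fn looks_generated(path: &Path) -> io::Result<bool> {
    let mut head = Vec::with_capacity(512);
    File::open(path)?.take(512).read_to_end(&mut head)?;
    Ok(head_looks_generated(&head))
}
```

Because the decision depends only on the sniffed bytes, gating it on `is_code_file` would not change the outcome for typical binary inputs.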
### §3.5 Test additions
Add to the existing test module in `crates/kebab-source-fs/src/connector.rs`:
```rust
#[test]
fn size_cap_skips_only_code_files() {
let dir = tempfile::tempdir().unwrap();
let root = dir.path();
// 300 KB PDF (binary), 300 KB markdown (text), 300 KB Rust (code).
let big_blob: Vec<u8> = vec![b'x'; 300_000];
std::fs::write(root.join("paper.pdf"), &big_blob).unwrap();
std::fs::write(root.join("notes.md"), &big_blob).unwrap();
std::fs::write(root.join("big.rs"), &big_blob).unwrap();
let conn = FsSourceConnector::new(
&cfg_with_size_cap(root.to_str().unwrap(), 262_144, 5_000),
)
.unwrap();
let (assets, skips) = conn.scan_with_skips(&SourceScope::default()).unwrap();
let paths: Vec<_> = assets.iter().map(|a| a.workspace_path.0.clone()).collect();
// PDF + Markdown pass through walker.
assert!(paths.contains(&"paper.pdf".to_string()));
assert!(paths.contains(&"notes.md".to_string()));
// Code file gets skipped.
assert!(!paths.contains(&"big.rs".to_string()));
assert!(
skips.skip_examples.size_exceeded.iter().any(|p| p.contains("big.rs")),
"size_exceeded examples should contain only big.rs: {:?}",
skips.skip_examples.size_exceeded
);
assert!(
!skips.skip_examples.size_exceeded.iter().any(|p| p.contains("paper.pdf")),
"PDF must NOT appear in size_exceeded examples: {:?}",
skips.skip_examples.size_exceeded
);
}
```
In addition, the existing test `ingest_report_counts_oversized_files_by_bytes` uses a fixture
named `huge.rs`, so its invariant is preserved. `ingest_report_size_cap_by_line_count`
uses `longfile.rs`, so the same holds.
## §4 Bug #3 — chunk_id collision (Critical)
### §4.1 Root cause investigation
#### §4.1.1 The chunker's collision-avoidance workaround
The module doc at `crates/kebab-chunk/src/pdf_page_v1.rs:47-60` describes the collision
avoidance:
```
Design §4.2's chunk_id = blake3(doc_id || chunker_version || sort(block_ids)
|| policy_hash) collides when one block (= one PDF page) is split into
multiple chunks: every chunk on that page has identical block_ids.
Workaround: feed a per-chunk variant format!("{base_policy_hash}#c{char_start}")
into the recipe's policy_hash slot.
```
The actual call at `crates/kebab-chunk/src/pdf_page_v1.rs:170-172`:
```rust
let per_chunk_hash = format!("{base_policy_hash}#c{char_start}");
let chunk_id =
id_for_chunk(&doc.doc_id, &chunker_version, &block_ids, &per_chunk_hash);
```
Here `char_start` is the first tuple field returned by `chunk_page(...)`, i.e. the
**post-overlap `actual_start`** (NOT the original segment boundary `start`).
#### §4.1.2 How overlap computes actual_start
`crates/kebab-chunk/src/pdf_page_v1.rs:266-281`:
```rust
let actual_start = if let Some(prev) = chunks.last() {
let prev_min = prev.0; // actual_start of the previous chunk
let mut a = start;
let mut acc_o: usize = 0;
while a > prev_min {
let cl = chars[a - 1].len_utf8();
if acc_o + cl > overlap_bytes {
break;
}
acc_o += cl;
a -= 1;
}
a
} else {
start
};
```
`while a > prev_min`: the overlap walk only walks back as far as the previous chunk's
actual_start. When overlap_bytes is large enough and `start - prev_min` is small,
`actual_start = prev_min`. **Two chunks then share the same actual_start and therefore the same `#c{N}`**.
#### §4.1.3 Hypothesis check: F2 (1580 OCR chars)
Assumption: F2's OCR text contains a sentence end (`.` + whitespace)
or a paragraph break (`\n\n`) within its first ~80 chars.
- Default chunking policy: `target_tokens=500` → `target_bytes=1500`,
`overlap_tokens=80` → `overlap_bytes = min(240, 750) = 240`.
- A Korean char is 3 bytes in UTF-8, so overlap_bytes=240 allows a back-walk of up to 80 chars.
- Assumed bounds = `[0, 30, ~n]` (one sentence end within the first ~30 chars).
- segment 1: start=0, chunk_end=30 → chunks.push((0, 30, ...)). `#c0`.
- segment 2: start=30, byte_len(30, n) >> target_bytes → single-segment chunk.
  - actual_start walk: a=30 → walk back while a > 0 and acc_o ≤ 240.
  - 30 chars * 3 bytes = 90 bytes ≤ 240 → the loop ends at a=0 (=prev_min).
  - actual_start = 0 = prev_min.
  - chunks.push((0, n, ...)). `#c0`: **collision with chunk 1**.
The chunk_id input of both chunks in the same doc:
- `{kind:"chunk", doc_id:doc_id_F2, chunker_version:"pdf-page-v1",
block_ids:[block_id_F2], policy_hash:"{base_hash}#c0"}`
- identical canonical JSON → identical blake3 → identical chunk_id.
→ In `put_chunks`, the `INSERT INTO chunks` succeeds for the first row and the second row
hits a PRIMARY KEY violation.
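The walk-through above can be replayed with a small standalone sketch. `overlap_actual_start` mirrors the loop quoted in §4.1.2. `policy_bytes` is an assumption inferred from the numbers in this section (3 bytes per token, overlap capped at half the target), not the chunker's confirmed formula.

```rust
/// Mirror of the overlap back-walk from §4.1.2: walk back from `start` while the
/// accumulated UTF-8 byte length stays within `overlap_bytes`, never going below
/// `prev_min` (the previous chunk's actual_start).
fn overlap_actual_start(
    chars: &[char],
    start: usize,
    prev_min: usize,
    overlap_bytes: usize,
) -> usize {
    let mut a = start;
    let mut acc = 0usize;
    while a > prev_min {
        let cl = chars[a - 1].len_utf8();
        if acc + cl > overlap_bytes {
            break;
        }
        acc += cl;
        a -= 1;
    }
    a
}

/// ASSUMPTION: token-to-byte arithmetic inferred from the worked example
/// (500 tokens → 1500 bytes, 80 tokens → min(240, 750) bytes).
fn policy_bytes(target_tokens: usize, overlap_tokens: usize) -> (usize, usize) {
    let target_bytes = target_tokens * 3;
    let overlap_bytes = (overlap_tokens * 3).min(target_bytes / 2);
    (target_bytes, overlap_bytes)
}
```

With a segment boundary at char 30 the walk has only 90 bytes to cover before it reaches `prev_min = 0`, so it collapses. With a boundary at char 200 it stops after 80 chars and the two starts stay distinct.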
#### §4.1.4 Why F1 (779 OCR chars) does not collide
F1's OCR text is also Korean, but either its character distribution differs or it has no
sentence boundary within the first ~80 chars. In that case bounds = `[0, n]`, or the first
boundary falls after 80 chars
→ chunk 2's actual_start is a value other than prev_min → distinct `#c{N}`
values → distinct chunk_id.
→ This matches the dogfood observation that **only F2 collides**.
#### §4.1.5 Assessment of the dogfood report's hypotheses
The dogfood report guessed at a cross-doc collision ("identical to the chunk_id of
scanned_page1"). This spec's investigation concludes it is an **intra-doc (within F2) collision**.
Evidence:
- The chunk_id input includes `doc_id` → chunk_ids of different docs differ automatically.
- Two chunks in the same doc with the same block_id and the same `#c{N}` policy_hash get an
identical chunk_id.
- Hypothesis A (policy_hash default magic value): not confirmed; base_policy_hash is
the blake3 of the policy's canonical JSON (deterministic).
- Hypothesis B (char_end of id_for_block is part of the hash): possible, but unrelated to the
chunk_id collision itself (a block_id change produces a chunk_id change, which is a different
collision pattern).
- Hypothesis C (the chunker's block_ids ordering): possible, but each chunk has a single
block, so ordering is N/A.
- Hypothesis D (OCR text identical to another doc's inline text): the chunk_id input does not
include text, so N/A.
**Confirmed root cause** = a variant of hypothesis C: when a single page yields multiple
chunks, the overlap's actual_start collapses to the previous chunk's actual_start, so the
`#c{N}` suffix is identical.
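The intra-doc versus cross-doc argument can be illustrated with a stand-in for the recipe. This is a sketch only: the real recipe is blake3 over canonical JSON, while `sketch_chunk_id` (a hypothetical name) feeds the same four inputs into the standard library's `DefaultHasher`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for the chunk_id recipe. NOT the real implementation: it only
/// shows that the id is a pure function of (doc_id, chunker_version,
/// sorted block_ids, policy_hash) and never sees the chunk text.
fn sketch_chunk_id(
    doc_id: &str,
    chunker_version: &str,
    block_ids: &[&str],
    policy_hash: &str,
) -> u64 {
    let mut sorted: Vec<&str> = block_ids.to_vec();
    sorted.sort_unstable();
    let mut hasher = DefaultHasher::new();
    ("chunk", doc_id, chunker_version, sorted, policy_hash).hash(&mut hasher);
    hasher.finish()
}
```

Two chunks of one doc that share a block and a `#c0` suffix feed identical inputs and get the same id, whereas changing only `doc_id` changes the input and, barring a hash collision, the id.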
### §4.2 Decision matrix
| Option | Description | Pros | Cons |
|--------|------|------|------|
| **A: segment boundary `start`** | Change the `per_chunk_hash` suffix from the post-overlap `actual_start` to the segment boundary `start` | Minimal change / segment boundaries are monotonically increasing (seg_idx loop invariant of chunk_page) → always distinct / preserves the semantic intent of chunk_id | Requires changing the shape of chunk_page's return tuple |
| B: chunk ordinal | `per_chunk_hash = "#c{ordinal}"` (chunk index within the page: 0, 1, 2, ...) | Simplest / independent of segment boundaries | Weakens the "meaningful hash input" semantics of chunk_id |
| C: (`char_start`, `char_end`) pair | `per_chunk_hash = "#c{char_start}-{char_end}"` | Two chunks with the same char_start stay distinct if char_end differs | char_end can also coincide due to the overlap clamp (e.g. if the last chunk is split twice), so the invariant is weak |
| D: sequence number + char_start | `per_chunk_hash = "#c{ordinal}-{char_start}"` | Fully guarantees the invariant | Redundant info; the hash input gets longer |
### §4.3 Chosen path — Option A
Rationale:
- In chunk_page's main loop `seg_idx` is strictly increasing, and the segment
boundary `bounds[seg_idx]` is strictly increasing too (bounds are unique after dedup).
Using the segment boundary `start` as the hash suffix therefore guarantees distinct hash
inputs for the chunks of one page.
- Semantics of chunk_id: "which segment does this chunk start from". The pre-overlap
segment boundary is the true semantic origin; overlap is an enrichment for the retrieval
boundary.
- Extend chunk_page's return tuple to the 4-tuple `(segment_start, actual_start, chunk_end,
slice)` (or track segment_start separately inside the chunker loop), which keeps the diff
minimal.
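The uniqueness argument can be checked in isolation. In the sketch below, `ChunkStarts` and `suffixes_unique` are hypothetical names; the pair mirrors only the first two fields of the proposed 4-tuple.

```rust
use std::collections::HashSet;

/// (segment_start, actual_start) of one chunk on a page.
type ChunkStarts = (usize, usize);

/// Builds the per-chunk policy-hash variants `"{base}#c{start}"` from the
/// chosen start offsets and reports whether all of them are distinct.
fn suffixes_unique(base_policy_hash: &str, starts: impl Iterator<Item = usize>) -> bool {
    let mut seen = HashSet::new();
    starts
        .map(|s| format!("{base_policy_hash}#c{s}"))
        .all(|variant| seen.insert(variant))
}
```

For the F2-shaped page of §4.1.3 the chunks are `[(0, 0), (30, 0)]`: keying on the second field (`actual_start`) yields `#c0` twice, while keying on the first field (`segment_start`) yields `#c0` and `#c30`.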
### §4.4 Implementation
Extend the return signature of `chunk_page` in `crates/kebab-chunk/src/pdf_page_v1.rs`:
```rust
/// Split a single page's text into ordered chunks, each represented as
/// `(segment_start, actual_start, chunk_end, text_slice)`.
///
/// - `segment_start` = pre-overlap segment boundary. Strictly increasing
/// across the returned vec. Use this for chunk_id uniqueness suffixes.
/// - `actual_start` = post-overlap start char index. May collapse to a
/// previous chunk's `actual_start` under aggressive overlap policy.
/// Use this for `SourceSpan::Page::char_start`.
/// - `chunk_end` = chunk's end char index (exclusive).
fn chunk_page(
text: &str,
target_bytes: usize,
overlap_bytes: usize,
) -> Vec<(usize, usize, usize, String)> {
// ... (existing logic, but each push uses (segment_start, actual_start, chunk_end, slice))
chunks.push((start, actual_start, chunk_end, slice));
// ...
}
```
Update the caller, the `chunk` method, in the same way:
```rust
for (segment_start, char_start, char_end, slice) in
chunk_page(&p.text, target_bytes, overlap_bytes)
{
// ... existing u32 conversion + span construction ...
let span = SourceSpan::Page {
page: page_num,
char_start: Some(u32::try_from(char_start).expect("...")),
char_end: Some(u32::try_from(char_end).expect("...")),
};
let block_ids: Vec<BlockId> = vec![p.common.block_id.clone()];
// segment_start (pre-overlap boundary) is strictly increasing across
// chunks, even when overlap walk collapses actual_start to prev_min.
let per_chunk_hash = format!("{base_policy_hash}#c{segment_start}");
let chunk_id =
id_for_chunk(&doc.doc_id, &chunker_version, &block_ids, &per_chunk_hash);
// ... rest unchanged ...
}
```
Update the module doc at `crates/kebab-chunk/src/pdf_page_v1.rs:47-60` at the same time,
because the `"#c{char_start}"` in the existing description is stale after this fix:
```rust
//! Design §4.2's chunk_id = blake3(doc_id || chunker_version || sort(block_ids)
//! || policy_hash) collides when one block (= one PDF page) is split into
//! multiple chunks: every chunk on that page has identical block_ids.
//!
//! Workaround that doesn't change the §4.2 recipe: feed a per-chunk
//! variant `format!("{base_policy_hash}#c{segment_start}")` into the
//! recipe's `policy_hash` slot. `segment_start` is the pre-overlap segment
//! boundary, strictly increasing across the returned chunks even when the
//! overlap walk collapses `actual_start` to a previous chunk's `prev_min`.
//! Logged in tasks/HOTFIXES.md (2026-05-27 — Bug #3 second-iteration patch).
```
Also add a dated entry to `tasks/HOTFIXES.md` (this fix is the **second-iteration patch** of
the chunk_id deviation; the entry records that the first iteration's `#c{char_start}`
workaround left a collision in the aggressive-overlap case):
```markdown
## 2026-05-27 — v0.20.0 sub-item 1: chunk_id `#c{char_start}` workaround
collapses under aggressive overlap (Bug #3 second-iteration patch)
**Symptom**: ingesting F2 (1580 OCR chars) fails with `DocumentStore::put_chunks (pdf):
UNIQUE constraint failed: chunks.chunk_id`. ...
**Root cause**: at `crates/kebab-chunk/src/pdf_page_v1.rs:170` ...
the post-overlap `actual_start` collapses to the previous chunk's actual_start ...
**Fix** (this spec, §4.4): add `segment_start` to the `chunk_page` return tuple
and change the `per_chunk_hash` suffix to `segment_start` ...
**chunker_version cascade**: `pdf-page-v1` → `pdf-page-v1.1` bump
(see §4.4.1). The chunk_id of multi-chunk PDF pages changes, so the design §9
cascade trigger performs explicit invalidation.
**Amends**: spec `docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md`
§4.4. The parent design §4.2 chunk_id recipe itself is unchanged (only the internal
computation of the workaround layer changes).
```
#### §4.4.1 Preserving chunk_id determinism
Existing single-chunk-per-page case (e.g. small pages, `text.len() <= target_bytes`):
- Early return of `chunk_page`: `vec![(0, n, text.to_string())]` becomes
`vec![(0, 0, n, text.to_string())]` in the new shape. `segment_start = 0 = actual_start`.
- Same `#c0` suffix → same chunk_id (for an unchanged chunker_version).
First chunk of the multi-chunk case:
- segment_start = bounds[0] = 0, actual_start = start = 0 (no previous chunk).
- Same `#c0` suffix → same chunk_id (for an unchanged chunker_version).
Second and later chunks of the multi-chunk case:
- Before: `actual_start` (overlap-walked, may be == 0).
- After: `segment_start` = bounds[seg_idx] > 0.
- → chunk_id changes (intentional, to avoid the collision).
→ The chunk_id of multi-chunk pages in an existing v0.19 (pre-OCR) PDF KB changes.
The v0.20 force-reingest path refreshes them automatically.
**Decision (round 1c, closes §7.2 Open Q1): chunker_version bump
`pdf-page-v1` → `pdf-page-v1.1`** (adopts critic round 1 recommendation M-1).
Rationale:
- For normal multi-chunk PDF pages (e.g. metro-korea.pdf in dogfood report Scenario 1,
with 21 blocks / 34 chunks, a normal path that did not trigger Bug #3), the internal
computation change silently maps chunk_id to a different value.
If chunker_version stays at `pdf-page-v1`, the cascade audit of the store/embedding layer
never fires → unless the user explicitly invokes `--force-reingest`,
chunk_id ↔ chunk_text in the vector store can silently mismatch.
- The original intent of the design §9 cascade rule: an explicit version bump on any chunker
algorithm change → automatic invalidation report from the store layer. The `pdf-page-v1.1`
bump applies that rule directly.
- Bump cost = zero. v0.20.0 itself is a force-update release (a cumulative bugfix stack on top
of the single PR #189 release commit), and enabling the OCR feature of the parent spec
(`2026-05-27-pdf-scanned-ocr-spec.md`) already makes force-reingest the
recommended path. Single-chunk PDF pages, differing only in chunker_version, are recomputed
by the same cascade within the new doc_id chain.
- Benefit = an explicit user-facing audit trail. On the next ingest the cascade
invalidation is stated in the store layer report.
The other version fields of the cascade (parser_version / embedding_version /
prompt_template_version / index_version) are unchanged; the patch is limited to the
chunker layer.
Update the `chunker_version()` constant of `PdfPageV1Chunker`:
```rust
impl Chunker for PdfPageV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion("pdf-page-v1.1".to_string()) // was: "pdf-page-v1"
}
// ...
}
```
Also update `PARSER_VERSION` or the const `CHUNKER_VERSION` in
`crates/kebab-chunk/src/pdf_page_v1.rs` at the same time (match the actual constant name in that crate).
### §4.5 Test additions
`crates/kebab-chunk/src/pdf_page_v1.rs` 의 `#[cfg(test)] mod tests` 에 추가:
```rust
#[test]
fn multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids() {
// Regression test for v0.20.0 sub-item 1 Bug #3: post-overlap actual_start
// can collapse to prev_min, producing identical `#c{char_start}` suffixes
// and identical chunk_ids → sqlite chunks.chunk_id PRIMARY KEY violation
// at put_chunks INSERT time.
//
// Synthesises Korean OCR text shape: dense Korean characters (3 bytes
// per char) with a single early sentence-end boundary at char ~10 +
// long tail.
// 10 Korean chars (= 30 UTF-8 bytes) + "." + " " + ~500 more Korean chars.
let early_seg: String = std::iter::repeat('가').take(10).collect();
let tail: String = std::iter::repeat('나').take(500).collect();
let page_text = format!("{early_seg}. {tail}");
let doc = make_pdf_doc(&[&page_text]);
let policy = default_policy(500, 80); // target=1500 byte, overlap=240 byte
let chunks = PdfPageV1Chunker.chunk(&doc, &policy).unwrap();
assert!(
chunks.len() >= 2,
"expected ≥2 chunks for {} byte page; got {}",
page_text.len(),
chunks.len()
);
// Hard invariant: all chunk_ids must be unique. Without the fix, the
// second chunk would have actual_start = 0 (== first chunk's
// actual_start) under the aggressive-overlap walk → identical `#c0`
// hash suffix → identical chunk_id → PRIMARY KEY violation.
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
ids.sort_unstable();
let total = ids.len();
ids.dedup();
assert_eq!(
ids.len(),
total,
"all chunk_ids must be unique even when overlap walks actual_start back to prev_min"
);
}
```
(round 1c L-1: the second test from round 0,
`chunk_id_recipe_uses_segment_start_not_actual_start`, was removed. It was redundant with
this test's uniqueness check, and its only real assertion was `assert!(chunks.len() >= 2)`,
which did not match the intent of the test name.)
Also add an integration-level regression test under `crates/kebab-app/tests/`:
```rust
// crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs (new)
//
// v0.20.0 sub-item 1 Bug #3 regression: ingesting multiple scanned PDFs (each a single
// page with a different OCR text length) passes without any chunk_id collision.
//
// A mock OCR engine (MockOcrEngine from kebab-parse-image or an inline impl) is
// configured to return a different text length per page (e.g. 30 chars, 200 chars,
// 800 chars). This avoids real Ollama calls.
#[test]
fn multi_scanned_pdf_ingest_no_chunk_id_collision() {
// ... setup: 3 scanned PDF fixture, mock OCR engine, isolated KB
// ... assert: ingest_report.items 모두 kind != Error
// ... assert: store.get_chunks_count() = sum of per-PDF chunk_counts
}
```
(round 1c NIT-1: the file name and function name are unified as
`multi_scanned_pdf_ingest_no_chunk_id_collision`; the round 0 file name
`pdf_multi_scan_no_chunk_id_collision.rs` did not match the fn name.)
#### §4.5.1 Pre-condition — MockOcrEngine availability (round 1c M-3)
This integration test requires a mock impl of the `OcrEngine` trait. First step of the
executor phase:
1. Locate MockOcrEngine with
`grep -rn "impl OcrEngine" crates/kebab-parse-image/src/ crates/kebab-app/tests/`.
2. **Current state** (2026-05-27 verifier probe):
   - `crates/kebab-parse-image/src/ocr.rs:235`: production `impl OcrEngine for OllamaVisionOcr`.
   - `crates/kebab-app/tests/pdf_ocr_apply.rs:25`: `impl OcrEngine for MockOcrEngine` (test-only).
3. The new integration test (`multi_scanned_pdf_ingest_no_chunk_id_collision.rs`)
is a separate test binary in the same crate (`kebab-app`), so it cannot directly import the
private MockOcrEngine of `pdf_ocr_apply.rs`. Options for the executor:
   - **Option A (recommended)**: lift MockOcrEngine into
`crates/kebab-app/tests/common/mock_ocr.rs` (parameterised, taking the per-page text
lengths as a ctor argument). Both tests (`pdf_ocr_apply.rs` and the new test) share it
via `mod common;`.
   - **Option B**: define a duplicate inline `impl OcrEngine for LocalMock { ... }` inside
the new test (prioritises test isolation, avoids the sharing cost).
4. If no mock is available (or sharing is difficult and even Option B is impractical), apply
a **conditional downgrade** to the acceptance of §6 row 7: the unit-level invariant in
`kebab-chunk` (§6 row 4) alone pins the core regression of Bug #3, and the integration
test is skipped.
The decision path of the executor's dependency-check task is closed in §7.2 Open Q4.
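A sketch of what a parameterised mock in the spirit of Option A might look like. Everything here is hypothetical: the real `OcrEngine` trait signature is not shown in this spec, so the block defines its own stand-in trait `OcrEngineSketch`, and `ParamMockOcr` is an illustrative name.

```rust
/// Hypothetical stand-in for the project's `OcrEngine` trait.
/// The real trait's methods and error type may differ.
trait OcrEngineSketch {
    fn ocr_page(&self, page_index: usize) -> Result<String, String>;
}

/// Mock that returns synthetic Korean text whose length (in chars) is
/// configured per page through the constructor.
struct ParamMockOcr {
    chars_per_page: Vec<usize>,
}

impl ParamMockOcr {
    fn new(chars_per_page: Vec<usize>) -> Self {
        Self { chars_per_page }
    }
}

impl OcrEngineSketch for ParamMockOcr {
    fn ocr_page(&self, page_index: usize) -> Result<String, String> {
        self.chars_per_page
            .get(page_index)
            .map(|&n| std::iter::repeat('가').take(n).collect::<String>())
            .ok_or_else(|| format!("no mock text configured for page {page_index}"))
    }
}
```

Taking the lengths as a constructor argument lets one shared mock serve both the existing test and the new multi-PDF test, e.g. `ParamMockOcr::new(vec![30, 200, 800])`.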
### §4.6 Acceptance (Bug #3 fix)
- Zero chunk_id collisions when F1 (779 chars) and F2 (1580 chars) are ingested together.
- Zero collisions on every `--force-reingest`.
- Zero collisions in a KB of 5+ scanned PDFs (Korean OCR text, 100 to 3000 chars).
- The existing 1000-determinism test in `crates/kebab-chunk`
(`deterministic_chunk_ids_1000`) keeps passing.
- Zero workspace test regressions.
- New test `multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids`
and integration test `multi_scanned_pdf_ingest_no_chunk_id_collision` are added.
## §5 Bug #4 — F4 fixture Pages tree
### §5.1 Root cause
#### §5.1.1 Symptom
```json
{
"doc_path": "mojibake.pdf",
"kind": "new",
"byte_len": 22568,
"pdf_ocr_pages": 0,
"pdf_ocr_ms_total": 0,
"block_count": 0,
"chunk_count": 0,
"warnings": []
}
```
This violates the PdfTextExtractor invariant (§1.4 "1 Block::Paragraph per page").
#### §5.1.2 How lopdf get_pages() reacts
Dogfood probe:
- `lopdf::Document::load_mem(F4_bytes)` → OK.
- `pdf_doc.get_pages()` → empty `BTreeMap`.
- Within the PDF byte stream, the `/Type /Page` count = 1 and the `/Count` value = 1.
→ Structurally 1 page exists, but lopdf's Pages tree traversal
(`/Pages` → `/Kids` chain) is broken.
#### §5.1.3 Analysis of the fixture generation path
`tests/fixtures/_synth/mojibake.py`:
```python
c = canvas.Canvas(str(dst), pagesize=A4)
c.setFont(FONT_NAME, 12)
y = A4[1] - 30*mm
for line in ["Mojibake fixture (no ToUnicode CMap)", "..."]:
    c.drawString(30*mm, y, line)
    y -= 16
c.save()
data = dst.read_bytes()
# pattern: "/ToUnicode <objref>" — strip indirect object reference
new_data = re.sub(rb"/ToUnicode\s+\d+\s+\d+\s+R\b", b"", data)
dst.write_bytes(new_data)
```
**Step 2 analysis**: `re.sub` removes the `/ToUnicode N M R` byte sequence, but:
- The byte offsets in the PDF shift by the length of the removed bytes.
- The offset entries of the cross-reference table (`xref`) become stale.
- The offset in the `startxref` value becomes stale too.
**The Step 3 startxref fix** (commit `c2cd3a7` in `tasks/HOTFIXES.md`):
- A manual byte edit `22130 → 22114` updates startxref.
- However, the individual offsets in the xref table itself remain stale: the actual byte
position of the indirect objects referenced by the Pages tree's `/Kids` array no longer
matches the xref entries.
- lopdf's strict load validates startxref + the xref table first; the load succeeds,
but indirect object resolution fails during Pages tree traversal → empty
Pages map.
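The offset shift can be demonstrated on a toy byte buffer. This is a sketch, not a PDF parser: `find` and `strip_first` are hypothetical helpers, and the sample bytes are not a valid PDF. It only shows that every object located after a removed span moves by exactly the removed length, which is why recorded offsets go stale.

```rust
/// Byte offset of the first occurrence of `needle` in `haystack`
/// (`needle` must be non-empty).
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Removes the first occurrence of `needle`, returning the edited bytes and
/// the number of bytes removed (0 when `needle` is absent).
fn strip_first(data: &[u8], needle: &[u8]) -> (Vec<u8>, usize) {
    match find(data, needle) {
        Some(at) => {
            let mut out = data[..at].to_vec();
            out.extend_from_slice(&data[at + needle.len()..]);
            (out, needle.len())
        }
        None => (data.to_vec(), 0),
    }
}
```

For example, stripping the 16-byte sequence `/ToUnicode 9 0 R` from an earlier object moves a later `2 0 obj` marker 16 bytes toward the start of the file, so any table that still records the old position points at the wrong bytes.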
### §5.2 Fixture re-generation strategy
| Option | Description | Pros | Cons |
|--------|------|------|------|
| **A: pikepdf** | After reportlab synthesis, open with pikepdf, remove ToUnicode, and save (xref auto-regen) | Proper xref regeneration / Pages tree intact / library available (pip install pikepdf) | New Python dependency (`pikepdf`) |
| B: qpdf normalize | After the byte edit, run `qpdf --linearize input.pdf output.pdf` | External tool (the sub-item 1 acceptance criteria already include a qpdf hint) | qpdf's normalize may reject the broken xref (or may inline the ToUnicode reference again) |
| C: reportlab disable ToUnicode | Disable ToUnicode CMap generation for the Type 0 font during reportlab synthesis | Avoids the byte edit: clean | The reportlab API does not directly expose a way to disable ToUnicode (needs a font subclass or monkeypatch) |
### §5.3 Chosen path — Option A (pikepdf)
Rationale:
- pikepdf is a proper PDF surgery library: the Python bindings for qpdf.
- It regenerates the xref table automatically and preserves the integrity of the Pages tree.
- The dependency is added via `pip install pikepdf`. The Python venv for fixture generation
already uses reportlab, so the extra install is trivial.
#### §5.3.1 The pikepdf approach to stripping ToUnicode
In a reportlab Type 0 font, the ToUnicode CMap reference appears inside the font dictionary
as `/ToUnicode <ref>`. Use pikepdf to remove only the `/ToUnicode` entry of the font
dictionary:
```python
import pikepdf
with pikepdf.open(str(dst), allow_overwriting_input=True) as pdf:
    # Walk all indirect objects, delete /ToUnicode entry whenever found.
    # Per the PDF spec, /ToUnicode appears only as a child of a Font dictionary,
    # so the false-positive risk is practically zero. The explicit font-type
    # check is omitted (same shape as the actual implementation in §5.4).
    for obj in pdf.objects:
        if isinstance(obj, pikepdf.Dictionary):
            if "/ToUnicode" in obj:
                del obj["/ToUnicode"]
    pdf.save(str(dst))
```
pikepdf's save automatically preserves the integrity of the xref and the Pages tree.
### §5.4 Implementation (mojibake.py revision)
Complete rewrite of `tests/fixtures/_synth/mojibake.py`:
```python
"""Synthesize mojibake fixture -- Type 0 font PDF without ToUnicode CMap.
Strategy:
  1. Synthesize a Korean PDF with a Type 0 (CID) font using reportlab (includes a normal ToUnicode CMap).
  2. Open it with pikepdf, remove the /ToUnicode entry from the font dictionary, and save (xref is regenerated automatically).
Dependency: reportlab + pikepdf. Install via `pip install reportlab pikepdf`.
Usage:
    python3 tests/fixtures/_synth/mojibake.py \
        crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf
"""
import sys
from pathlib import Path
from reportlab.lib.pagesizes import A4
from reportlab.lib.units import mm
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfgen import canvas
import pikepdf
DEJAVU_TTF = "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"
FONT_NAME = "DejaVuSans"
pdfmetrics.registerFont(TTFont(FONT_NAME, DEJAVU_TTF))
dst = Path(sys.argv[1])
# Step 1: synthesize a normal PDF.
c = canvas.Canvas(str(dst), pagesize=A4)
c.setFont(FONT_NAME, 12)
y = A4[1] - 30 * mm
for line in [
    "Mojibake fixture (no ToUnicode CMap)",
    "Text extraction yields garbage \x00\x01\x02",
]:
    c.drawString(30 * mm, y, line)
    y -= 16
c.save()
# Step 2: strip the /ToUnicode reference with pikepdf and regenerate the xref.
removed = 0
with pikepdf.open(str(dst), allow_overwriting_input=True) as pdf:
    for obj in pdf.objects:
        if isinstance(obj, pikepdf.Dictionary):
            if "/ToUnicode" in obj:
                del obj["/ToUnicode"]
                removed += 1
    pdf.save(str(dst))
if removed == 0:
    print("ERROR: no /ToUnicode entry found in any dictionary", file=sys.stderr)
    sys.exit(2)
# Step 3: verify the invariants (load + page count).
with pikepdf.open(str(dst)) as pdf:
    n_pages = len(pdf.pages)
if n_pages != 1:
    print(f"ERROR: expected 1 page, got {n_pages}", file=sys.stderr)
    sys.exit(3)
# Verify the ToUnicode-absence invariant.
raw = Path(dst).read_bytes()
if b"/ToUnicode" in raw:
    print("ERROR: /ToUnicode still present after strip", file=sys.stderr)
    sys.exit(4)
print(f"wrote {dst} ({dst.stat().st_size} bytes, ToUnicode stripped via pikepdf, 1 page)")
```
### §5.5 Test additions
Add to `crates/kebab-parse-pdf/tests/text_extractor.rs` (or the relevant existing test
file):
```rust
/// Pages tree invariant of F4 mojibake.pdf: ensures that the Step 2 fixture
/// re-generation (pikepdf-based) makes lopdf's get_pages() return normally.
///
/// Bug #4 regression: the previous fixture (byte-edit + manual startxref) passed
/// lopdf's strict load, but indirect object resolution broke during Pages tree
/// traversal → empty pages map → block_count=0.
#[test]
fn mojibake_fixture_load_yields_one_page() {
let bytes = include_bytes!("../tests/fixtures/mojibake.pdf");
let doc = lopdf::Document::load_mem(bytes).expect("F4 fixture must lopdf-load");
let pages = doc.get_pages();
assert_eq!(
pages.len(),
1,
"F4 fixture must have exactly 1 page (Pages tree integrity)"
);
}
#[test]
fn mojibake_fixture_has_no_tounicode_cmap() {
// ToUnicode-absence invariant from Step 2.
let bytes = std::fs::read("tests/fixtures/mojibake.pdf").unwrap();
let count = bytes.windows(b"/ToUnicode".len())
.filter(|w| *w == b"/ToUnicode")
.count();
assert_eq!(count, 0, "F4 fixture must not contain /ToUnicode marker");
}
#[test]
fn pdf_text_extractor_on_mojibake_yields_one_block() {
// PdfTextExtractor invariant: 1 Block::Paragraph per page.
// The F4 fixture has no ToUnicode, so text extraction yields garbage or
// empty text → 1 empty Block::Paragraph + "scanned candidate" warning.
let bytes = include_bytes!("../tests/fixtures/mojibake.pdf");
// ... ExtractContext setup + extractor.extract(&ctx, bytes) ...
let canonical = extractor.extract(&ctx, bytes).unwrap();
assert_eq!(canonical.blocks.len(), 1, "expected 1 Block::Paragraph per page");
// The text is garbage or empty; the invariant is that the block itself exists.
let warning_present = canonical.provenance.events.iter().any(|e| {
matches!(e.kind, ProvenanceKind::Warning)
&& e.note.as_ref().is_some_and(|n| n.contains("scanned candidate"))
});
assert!(warning_present || !canonical.blocks[0].text_is_empty(),
"an empty text-detect-first fallback must carry the scanned-candidate warning");
}
```
### §5.6 Acceptance (Bug #4 fix)
- After F4 fixture re-generation, `lopdf::Document::load_mem(...).get_pages().len() == 1`.
- The F4 fixture keeps its ToUnicode CMap-absence invariant
(`grep -c "/ToUnicode" mojibake.pdf` = 0).
- When PdfTextExtractor ingests F4, `block_count = 1`, with either the
warning `"page1 empty (scanned candidate)"` or garbage text.
- On dogfood retest, mojibake.pdf reports `block_count: 1` and
`chunk_count: 0~1` (depending on text content).
- Update the F4 baseline in the existing `text_extractor_regression.rs`: the old baseline
is itself a snapshot of the broken invariant, so it must be refreshed.
- Zero workspace test regressions.
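
The ToUnicode-absence check in the acceptance list reduces to a byte-substring count. The following is a minimal sketch using only the standard library; `count_marker` is a hypothetical helper name for illustration, not a function from the kebab codebase.

```rust
/// Count occurrences of `needle` in `haystack` (overlapping matches included).
/// `count_marker` is a hypothetical helper, not part of kebab.
fn count_marker(haystack: &[u8], needle: &[u8]) -> usize {
    // `windows(0)` panics, so an empty needle is handled explicitly.
    if needle.is_empty() {
        return 0;
    }
    haystack
        .windows(needle.len())
        .filter(|w| *w == needle)
        .count()
}

fn main() {
    // Illustrative dictionary fragments, not bytes from the real fixture.
    let with_cmap = b"<< /Type /Font /ToUnicode 12 0 R >>";
    let stripped = b"<< /Type /Font >>";
    println!("with CMap: {}", count_marker(with_cmap, b"/ToUnicode"));
    println!("stripped:  {}", count_marker(stripped, b"/ToUnicode"));
}
```

A raw byte scan only sees uncompressed objects. If a fixture were ever saved with compressed object streams, a dictionary key could escape this scan, so the lopdf-level page test remains the stronger check.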
## §6 Acceptance criteria (consolidated)
| # | Verifier | Bug | Command |
|---|---------|-----|------|
| 1 | walker bypasses size cap for PDF | #2 | `cargo test -p kebab-source-fs size_cap_skips_only_code_files` |
| 2 | walker still skips oversized code files | #2 | `cargo test -p kebab-source-fs ingest_report_counts_oversized_files_by_bytes` |
| 3 | 256KB+ PDF/markdown ingest default config | #2 | dogfood retest: `kebab ingest` reports `skipped_size_exceeded = 0` for non-code |
| 4 | chunker collision regression test | #3 | `cargo test -p kebab-chunk multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids` |
| 5 | chunker determinism preserved | #3 | `cargo test -p kebab-chunk deterministic_chunk_ids_1000` |
| 6 | chunker overlap clamp preserved | #3 | `cargo test -p kebab-chunk overlap_clamped_when_overlap_exceeds_target` |
| 7 | integration: multi-scanned PDF ingest (conditional: only if the MockOcrEngine from §4.5.1 can be shared) | #3 | `cargo test -p kebab-app multi_scanned_pdf_ingest_no_chunk_id_collision` |
| 8 | dogfood: F1 + F2 force-reingest | #3 | dogfood retest: `kebab ingest --force-reingest` reports errors = 0 (excluding encrypted) |
| 9 | F4 fixture lopdf 1-page invariant | #4 | `cargo test -p kebab-parse-pdf mojibake_fixture_load_yields_one_page` |
| 10 | F4 fixture ToUnicode 부재 invariant | #4 | `cargo test -p kebab-parse-pdf mojibake_fixture_has_no_tounicode_cmap` |
| 11 | F4 PdfTextExtractor 1-block invariant | #4 | `cargo test -p kebab-parse-pdf pdf_text_extractor_on_mojibake_yields_one_block` |
| 12 | dogfood: F4 ingest yields block_count=1 | #4 | dogfood retest: the mojibake.pdf ingest item reports `block_count: 1` |
| 13 | workspace clippy clean | all | `cargo clippy --workspace --all-targets -- -D warnings` |
| 14 | workspace full test pass | all | `cargo test --workspace --no-fail-fast -j 1` |
| 15 | dogfood end-to-end 9 PDF | all | dogfood retest: all 9 PDFs ingested, errors = 2 (encrypted only) |
| 16 | chunker_version cascade: final value | #3 | `grep -nE 'pdf-page-v[0-9.]+' crates/kebab-chunk/src/pdf_page_v1.rs` yields `"pdf-page-v1.1"` (round 1c M-1 decision) |
## §7 Risks + open questions
### §7.1 Risks
- **The Bug #3 fix changes chunk_id**: the chunk_id of multi-chunk PDF pages (pages
that exceed 1500 bytes before OCR) changes. Users need to run
`--force-reingest` once. Acceptable because v0.20.0 is on the force-update path
(users do a fresh ingest anyway). State this in the README or release notes.
- **Side effect of the Bug #2 fix**: PDFs of 1 GB or more now pass the walker, so lopdf's
load_mem risks exhausting memory. Out of v0.20 scope (a streaming parser is to be
evaluated from Phase 9 on; future scope in design §9.2). Acceptable for this fix.
- **The Bug #4 fix changes the fixture binary**: the SHA256 of F4 mojibake.pdf changes,
which adds git LFS / binary diff noise. The baseline in `text_extractor_regression.rs`
is also updated to the new fixture's output, in the same commit.
- **pikepdf install requirement**: fixture re-generation requires `pip install pikepdf`.
This would add a Python dependency to CI only if fixture regeneration were part of CI.
This spec commits the fixture itself, so generation is a one-off and no CI
dependency arises.
### §7.2 Open questions
1. **Cost-benefit of the chunker_version bump**: ✅ **CLOSED (round 1c M-1)**.
Decision: bump `pdf-page-v1` → `pdf-page-v1.1`. The cascade audit trail becomes explicit,
and the cost is zero because v0.20 is on the force-update path. Details: the "Decision
(round 1c, closes §7.2 Open Q1)" paragraph in §4.4.1.
2. **Including Option B (per-type config) for Bug #2 in v0.20 scope**: this spec
defers it to v0.21+. Critic round 1 ACCEPT: no recommendation to include it in v0.20.
3. **F4 fixture invariant**: critic round 1 ACCEPT. The combination of ToUnicode absence and a
valid Pages tree is exactly reproducible with proper PDF surgery via pikepdf.
The Step 2 design (mojibake.py rewrite) is sound.
4. **Mock OCR for the integration test**: ✅ **CLOSED (round 1c M-3)**.
`impl OcrEngine for MockOcrEngine` already exists at `crates/kebab-app/tests/pdf_ocr_apply.rs:25`.
The executor only has to choose between the share path (Option A: lift to
`tests/common/mock_ocr.rs`) and inline duplication (Option B). If sharing is not
possible, row 7 of §6 is conditionally downgraded. Details: the "Pre-condition
— MockOcrEngine availability" paragraph in §4.5.1.
5. **Change to the chunk_page tuple shape**: does Option A's 4-tuple `(segment_start,
actual_start, chunk_end, slice)` affect external callers?
`chunk_page` is module-private (`fn chunk_page`), so there are zero external callers and
the change is safe. Critic round 1 ACCEPT.
## §8 References
- dogfood report: `.omc/reviews/2026-05-27-v0.20-dogfood-report.md`
- parent spec (frozen): `docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md`
- parent plan (round 1c ACCEPT): `docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md`
- source code (root cause evidence):
- `crates/kebab-source-fs/src/connector.rs` (Bug #2)
- `crates/kebab-source-fs/src/code_meta.rs` (is_oversized + code_lang_for_path)
- `crates/kebab-config/src/lib.rs` (IngestCodeCfg)
- `crates/kebab-core/src/ids.rs` (id_for_chunk / id_for_block recipes)
- `crates/kebab-chunk/src/pdf_page_v1.rs` (PdfPageV1Chunker + chunk_page)
- `crates/kebab-app/src/pdf_ocr_apply.rs` (post-extract OCR enrichment)
- `crates/kebab-app/src/lib.rs:1769-1968` (ingest_one_pdf_asset wiring)
- `crates/kebab-store-sqlite/src/documents.rs:103-155` (put_chunks DELETE+INSERT)
- `migrations/V001__init.sql:80-94` (chunks table DDL — chunk_id PRIMARY KEY)
- `tests/fixtures/_synth/mojibake.py` (Bug #4 fixture source)
- design §3.4 (SourceSpan::Page), §3.5 (Chunk + chunk_id recipe),
§4.2 (id_from canonical JSON), §5.2 (walker builtin blacklist),
§9 (versioning cascade).

@@ -0,0 +1,308 @@
---
title: "v0.20.0 sub-item 1 bugfix round 2 — Identity-H mojibake marker + CLI --media help text"
created: 2026-05-27
status: "DRAFT round 1c"
parent_spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
contract_sections: ["§1.3 (text-detect threshold metric)", "§9 (version cascade)"]
related_specs:
- docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
- docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
related_dogfood:
- .omc/reviews/2026-05-27-v0.20-bugfix-dogfood-report.md (Bug #6 + #7)
---
# v0.20.0 sub-item 1 bugfix round 2 — Identity-H mojibake marker + CLI --media help text
## §1 Problem statement
### §1.1 Bug #6: Identity-H Unimplemented marker bypasses mojibake detection
**Symptom**: `metro-korea.pdf` (58 MB, Identity-H CID font without ToUnicode CMap) ingests with `pdf_ocr_pages=0`. Full text contains `?Identity-H Unimplemented?` marker 1154 times. All 21 pages + 34 chunks are indexable, but content is unusable garbage — repeated marker literal instead of readable text.
**Root cause**: `crates/kebab-parse-pdf/src/text_quality.rs` lines 9-37. The function `compute_valid_char_ratio()`, via `is_valid_text_char()`, treats the ASCII printable range `0x0020..=0x007E` as unconditionally valid. lopdf emits `?Identity-H Unimplemented?` (26 ASCII printable chars) when it cannot decode a CID font lacking a ToUnicode CMap. Result: valid_ratio = 1.0 → exceeds the OCR fallback threshold of 0.5 → the text-detect first pass incorrectly classifies mojibake as valid text → `pdf_ocr_pages` stays 0 and no OCR fallback is triggered.
**Design intent deviation**: Parent spec §1.3 (line 74) explicitly states "ratio metric judges mojibake page as scanned candidate." PoC example "֥ᬵᯝ₞e ࠦᯱᖝ░" (custom font, no ToUnicode) should trigger OCR. **Implementation gap**: literal ASCII marker case (Identity-H font) was not anticipated.
### §1.2 Bug #7: CLI `--media` help text omits `code` from valid values
**Symptom**: `kebab search --help` lists `--media` accepted values as "markdown, pdf, image, audio, other" — `code` is missing.
**Actual behavior**: `kebab search "main" --media code --json -k 5` returns 5 hits (code/script.sh, code/rust_sample/src/main.rs, etc.). Schema `media_breakdown` includes `code: 6` as first-class. Functional correctness is complete; **CLI doc-comment is outdated only**.
**Root cause**: `crates/kebab-cli/src/main.rs:148-165`. SearchArgs `--media` field clap doc-comment omits `code`. clap's `--help` renderer quotes this doc-comment directly.
---
## §2 Scope + non-scope
### §2.1 Included in this spec
- **Bug #6 fix**: Add known mojibake marker stripping to `compute_valid_char_ratio()`.
- **Bug #6 test**: Two new unit tests covering the Identity-H marker (marker-dominant and marker-minority mixed-text cases).
- **Bug #6 regression**: Verify existing 8 text_quality unit tests remain green.
- **Bug #7 fix**: Update CLI `--media` doc-comment to include `code`.
- **Bug #7 test**: Assert that `kebab search --help` output contains "`code`" substring.
- **Traceability**: Link both fixes to parent spec §1.3 design intent.
### §2.2 Explicitly out of scope
**Bug #8 candidate (falsified)**: V007 trigram tokenizer already applied; 2-character query limitation is design-level constraint, not a bug. Handled in prior dogfood report §Bug #8.
**Non-bug observations**:
- `--readonly + ingest` exit=0: Graceful refusal per CLAUDE.md intent (exit codes 0/1/2/3 unchanged; `error.v1.code` handles agent branching).
- Ask phrasing-sensitive refusal: RAG corner case; not a code defect.
- Binary staleness: Environmental artifact, not applicable to spec.
**Ancillary risks**:
- Scan for other `--media` doc-comment locations (R-4): Plan drafter to use grep; not blockers for this spec.
- Other lopdf unimplemented markers (R-1): Plan drafter to inspect lopdf source; marker array is extensible.
---
## §3 Decisions
### §3.1 Bug #6: Known mojibake marker stripping
Strip known mojibake marker substrings **before ratio calculation**, then force ratio to 0.0 if remainder is empty after marker removal. When stripped characters exceed remaining characters (marker dominance), cap ratio at 0.3 to trigger OCR fallback on marker-heavy mixed pages.
**Rationale**: lopdf's unimplemented CID font handling consistently emits specific ASCII marker strings. Hardcoding them is lightweight, deterministic, and covers the known failure mode without requiring expensive heuristics (e.g., ML-based gibberish detection). Pages like `metro-korea.pdf` may contain mostly mojibake body text with small valid headers; the marker-dominance check ensures such pages fall below the 0.5 OCR threshold.
**Marker list**: `?Identity-H Unimplemented?` only. lopdf 0.32.0 emits exactly one marker (verified per critic round 1 probe). Extensible if future lopdf versions emit additional markers.
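
The decision above distinguishes several regimes of extracted page text. As a rough illustration, the sketch below classifies a page into those regimes; the `MarkerRegime` enum and `classify` function are hypothetical names introduced here, and the sketch assumes that the dominance comparison counts Unicode scalar values rather than bytes.

```rust
// Marker literal taken from the spec; everything else here is illustrative.
const MARKER: &str = "?Identity-H Unimplemented?";

/// Hypothetical classification of a page's extracted text.
#[derive(Debug, PartialEq)]
enum MarkerRegime {
    /// No marker present: the existing ratio logic applies unchanged.
    NoMarker,
    /// Nothing but markers and whitespace: ratio forced to 0.0.
    MarkerOnly,
    /// Stripped chars outnumber remaining chars: ratio capped at 0.3.
    MarkerDominant,
    /// Marker is a minority: ratio computed on the cleaned text.
    MarkerMinority,
}

fn classify(page_text: &str) -> MarkerRegime {
    if !page_text.contains(MARKER) {
        return MarkerRegime::NoMarker;
    }
    let cleaned = page_text.replace(MARKER, "");
    if cleaned.trim().is_empty() {
        return MarkerRegime::MarkerOnly;
    }
    // Assumption: compare character counts, not byte lengths.
    let remaining = cleaned.chars().count();
    let stripped = page_text.chars().count() - remaining;
    if stripped > remaining {
        MarkerRegime::MarkerDominant
    } else {
        MarkerRegime::MarkerMinority
    }
}

fn main() {
    let dominant = format!("Page 1 of 5 {}", MARKER.repeat(20));
    let minority = format!("{} {}", "x".repeat(200), MARKER);
    println!("{:?}", classify(MARKER));
    println!("{:?}", classify(&dominant));
    println!("{:?}", classify(&minority));
    println!("{:?}", classify("plain text"));
}
```

Only the two marker-heavy regimes fall below the 0.5 OCR threshold by construction; the minority case still depends on the validity of the remaining characters.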
### §3.2 Bug #7: CLI doc-comment update
Add `code` to the comma-separated list of valid `--media` values in the SearchArgs field's clap doc-comment. Single-line edit; no functional or schema changes.
### §3.3 Parent spec traceability
Both fixes uphold parent spec §1.3:
- Bug #6 ensures mojibake pages (fonts missing a ToUnicode CMap) trigger OCR fallback per design intent.
- Bug #7 corrects CLI documentation to match actual schema (first-class `code` media type supported since v0.18.0).
No changes to parser_version, chunker_version, or wire schema.
---
## §4 Implementation specification
### §4.1 Bug #6: text_quality.rs diff
**File**: `crates/kebab-parse-pdf/src/text_quality.rs`
**Change**:
1. Add constant array of known mojibake markers (lines 8-10):
```rust
// Source of truth: lopdf-0.32.0/src/document.rs:523 (Document::decode_text).
// Only one Unimplemented marker is emitted by lopdf 0.32.0; other CMap
// encodings fall through to `String::from_utf8_lossy(bytes)`, which yields
// PUA / replacement-char territory already covered by `pure_pua_zero`.
// Re-verify on lopdf dependency upgrade.
const MOJIBAKE_MARKERS: &[&str] = &[
"?Identity-H Unimplemented?",
];
```
2. Refactor `compute_valid_char_ratio()` (lines 39-106):
```rust
pub fn compute_valid_char_ratio(s: &str) -> f32 {
// 1) Strip known mojibake markers before counting valid chars.
// Identity-H CID fonts without ToUnicode CMap emit ASCII-only marker
// substrings (bypassing PUA detection).
let mut cleaned: String = s.to_string();
// `had_marker` guard preserves prior behavior for whitespace-only input
// (returns ratio of whitespace validity, not 0.0) when no markers found.
// With markers stripped, the guard enables the trim-empty check.
let mut had_marker = false;
for marker in MOJIBAKE_MARKERS {
if cleaned.contains(marker) {
had_marker = true;
cleaned = cleaned.replace(marker, "");
}
}
// 2) Whitespace-only cleaned text → 0.0 (marker-only page).
if had_marker && cleaned.trim().is_empty() {
return 0.0;
}
// 3) Marker-dominance heuristic — when stripped chars exceed remaining
// chars (i.e. marker > 50% of original), the page is "mostly mojibake
// with some decodeable page-furniture" (e.g. metro-korea.pdf has
// header text in a separate font + body that is Identity-H CID).
// Force ratio downward to trigger OCR fallback (parent spec §1.3 intent).
if had_marker {
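// Count chars, not bytes: `str::len()` would weight each Hangul syllable as 3 bytes.
let remaining_chars = cleaned.chars().count();
let stripped_chars = s.chars().count().saturating_sub(remaining_chars);
if stripped_chars > remaining_chars {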
// Marker dominates — cap ratio at 0.3 (below 0.5 OCR threshold).
// The 0.3 cap (not 0.0) preserves a small signal that some text
// WAS decodeable, useful for downstream metrics if ever exposed.
let mut total = 0u32;
let mut valid = 0u32;
for c in cleaned.chars() {
total += 1;
if is_valid_text_char(c) {
valid += 1;
}
}
let raw_ratio = if total == 0 { 0.0 } else { valid as f32 / total as f32 };
return raw_ratio.min(0.3);
}
}
// 4) Otherwise compute ratio on cleaned text (existing logic).
let mut total = 0u32;
let mut valid = 0u32;
for c in cleaned.chars() {
total += 1;
if is_valid_text_char(c) {
valid += 1;
}
}
if total == 0 {
return 0.0;
}
valid as f32 / total as f32
}
```
**Invariants preserved**:
- Function signature and return type unchanged (→ byte-identical caller surface).
- Existing character category logic (hangul, CJK, Latin-1) unmodified.
- Empty-string behavior (return 0.0) preserved.
### §4.2 Bug #6: Unit tests
Replace existing Bug #6 test set with two new tests reflecting marker-dominance heuristic:
```rust
#[test]
fn identity_h_marker_dominance_caps_ratio_below_threshold() {
// metro-korea.pdf-class: 20× marker (520 char) + 11 char ASCII header.
// Without dominance heuristic: ratio = 11/11 = 1.0 (bypasses OCR).
// With dominance heuristic: ratio ≤ 0.3 (triggers OCR fallback).
let s = format!("Page 1 of 5 {}", "?Identity-H Unimplemented?".repeat(20));
let r = compute_valid_char_ratio(&s);
assert!(r <= 0.3, "marker-dominant mixed page → ratio ≤ 0.3 (OCR fallback); got {r}");
}
#[test]
fn identity_h_marker_minority_with_long_valid_text_keeps_high_ratio() {
// Inverse case: short marker noise + long valid text → ratio stays high
// (no false OCR trigger on otherwise-good pages).
let header = "x".repeat(200); // 200 char valid ASCII
let s = format!("{header} ?Identity-H Unimplemented?"); // 1× marker = 26 char
let r = compute_valid_char_ratio(&s);
assert!(r > 0.9, "marker-minority page keeps high ratio; got {r}");
}
```
**Regression preservation**: Existing 8 tests (`empty_string_zero`, `pure_ascii_one`, `pure_hangul_syllables_one`, `pure_pua_zero`, `mixed_half`, `cjk_ideograph_valid`, `hangul_jamo_valid`, `f4_fixture_ratio_under_threshold`) must all remain green.
### §4.3 Bug #7: CLI doc-comment diff
**File**: `crates/kebab-cli/src/main.rs` (SearchArgs field, lines ~150-160)
**Change**:
```diff
-/// p9-fb-36: filter by `assets.media_type` kind. Comma-separated. Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`, `image`, `audio`, `other`. Unknown values match nothing
+/// p9-fb-36: filter by `assets.media_type` kind. Comma-separated. Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`, `image`, `audio`, `code`, `other`. Unknown values match nothing
```
### §4.3a Bug #7 integration: `integrations/claude-code/kebab/SKILL.md:57` simultaneous update
Per CLAUDE.md §Wire schema v1 invariant — in-tree integration docs must be synchronized when wire surface changes. This round has no wire schema change, but SKILL.md line 57 exhibits the same regression as §4.3 (Bug #7):
**File**: `integrations/claude-code/kebab/SKILL.md` (line 57)
**Change**:
```diff
-`media` (string array — IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"other"`; alias `"md"` → `"markdown"`)
+`media` (string array — IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"code"` | `"other"`; alias `"md"` → `"markdown"`)
```
### §4.4 Bug #7: CLI help assertion
Add test to `crates/kebab-cli/tests/` (or extend existing help snapshot test):
```rust
#[test]
fn search_help_lists_code_in_media_values() {
let out = std::process::Command::new(env!("CARGO_BIN_EXE_kebab"))
.args(["search", "--help"])
.output()
.expect("kebab search --help");
let stdout = String::from_utf8_lossy(&out.stdout);
assert!(stdout.contains("`code`"), "search --help must list 'code' as accepted --media value");
}
```
### §4.5 Version cascade impact (CLAUDE.md §Versioning cascade)
- **parser_version**: `"pdf-text-v1"` — unchanged. Text-detect threshold is internal metric, not surface.
- **chunker_version**: `"pdf-page-v1.1"` — unchanged (no chunker logic affected).
- **wire schema**: No new fields, no schema version bump. `compute_valid_char_ratio()` is internal to `PdfTextExtractor::extract()`.
---
## §5 Acceptance criteria
- [ ] `text_quality` unit test: `identity_h_marker_dominance_caps_ratio_below_threshold` passes.
- [ ] `text_quality` unit test: `identity_h_marker_minority_with_long_valid_text_keeps_high_ratio` passes.
- [ ] Regression: All 8 existing text_quality tests remain green (no ratio behavior changes for valid text).
- [ ] CLI help assertion: `cargo test search_help_lists_code_in_media_values` passes.
- [ ] SKILL.md integration: `grep -F '"code"' integrations/claude-code/kebab/SKILL.md` returns ≥1 line.
- [ ] Full workspace test suite: `cargo test --workspace --no-fail-fast -j 1` green (clippy + unit + integration).
- [ ] Fresh binary builds: `CARGO_TARGET_DIR=/build/out/cargo-target/target cargo build --release -p kebab-cli` succeeds.
---
## §6 Risks + open questions
### Identified risks
**R-1 — Other lopdf unimplemented markers** (resolved per critic round 1 probe): lopdf 0.32.0 emits exactly one marker — `?Identity-H Unimplemented?` at `lopdf-0.32.0/src/document.rs:523` (`Document::decode_text`). Other CMap encoding arms (`UniCNS`, `UniJIS`, `UniKS`, `GBK-EUC`, `Adobe-*`) fall through to `String::from_utf8_lossy(bytes)` → PUA / replacement-char territory (already covered by `pure_pua_zero` test). Marker array adequacy = OK for current lopdf pin. **Re-verify on lopdf dependency upgrade.**
**R-2 — Whitespace-only edge case after stripping**: Handled by `.trim().is_empty()` check; returns 0.0 as intended.
**R-3 — Version/wire schema impact**: None. text_quality is internal threshold metric, not exposed to wire schema or version cascade.
**R-4 — Other `--media` help locations** (revised per critic): the `--media` value list is scattered across 3 surfaces: `crates/kebab-cli/src/main.rs:157-159` (CLI doc-comment, covered by §4.3), `integrations/claude-code/kebab/SKILL.md:57` (skill doc, covered by §4.3a), `crates/kebab-cli/tests/cli_help_smoke.rs` (test, covered by §4.4). Plan drafter to run `grep -rn '\bmedia\b' integrations/ crates/kebab-cli/src docs/wire-schema/v1` to confirm no additional surfaces exist.
**R-5 — Bulk mode media field parsing**: `crates/kebab-app/src/bulk.rs:161` handles media field parsing independently; string doc-comment update does not affect functional correctness.
### Open questions
**OQ-1 — Marker case sensitivity**: Does lopdf always emit markers in exact case `?Identity-H Unimplemented?`? Verify with lopdf source. If case variations exist, use case-insensitive matching or extend array.
**OQ-2 — Marker stripping threshold policy** (resolved via §4.1 marker-dominance heuristic): When stripped characters exceed remaining characters, ratio is capped at 0.3 to trigger OCR fallback. This ensures marker-dominant mixed pages (e.g., 99% marker + 1% valid header) do not bypass OCR despite the header's high ratio. Design intent (parent spec §1.3) is upheld: all mojibake pages trigger OCR fallback.
**OQ-3 — Alias expansion scope**: Bug #7 explicitly omits new aliases (e.g., `src` → `code`). Single additive fix to doc-comment, no enum variant changes.
### UX consequence — pre-bugfix2 v0.20 user's `--force` re-ingest
This round preserves version cascade (no `parser_version` bump). The `try_skip_unchanged` path will match files indexed pre-bugfix2 with same `parser_version="pdf-text-v1"` + hash. Pre-indexed `metro-korea.pdf`-class pages will NOT automatically re-route through the corrected text-detect → OCR fallback.
**User action required**: Explicit `kebab ingest --force-reingest <workspace>` to purge cached skip decisions and re-process affected files.
**Release notes** (v0.20.1 or whichever version ships this bugfix) **MUST include**: "If you indexed mojibake-heavy PDFs (esp. metro-korea.pdf class) on v0.20.0 pre-bugfix2, run `kebab ingest --force-reingest <workspace>` to apply the improved text detection. Otherwise, `ingest` will skip unchanged files and OCR fallback will not trigger." + link to design §9 cascade explanation.
**Documentation updates** (same PR as code): README + HANDOFF + ARCHITECTURE per `feedback_readme_sync_rule` memory — mention the `--force-reingest` step in release highlights or migration notes.
Deliberate design: automatic migration risks wedging stable v0.20.0 KBs. Manual `--force-reingest` is the correct escape hatch (parent spec §1.7 lines 126-128 precedent).
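
To make the skip behavior concrete, here is a minimal model of why an unchanged file is not re-processed when `parser_version` stays the same. This is a sketch under stated assumptions: `IndexKey` and `should_skip` are hypothetical names, and the real `try_skip_unchanged` signature is not shown in this spec.

```rust
/// Hypothetical model of what a skip decision compares.
#[derive(Debug, Clone, PartialEq, Eq)]
struct IndexKey {
    content_hash: String,
    parser_version: String,
}

/// Skip only when the stored key matches and the user did not force a re-ingest.
fn should_skip(stored: Option<&IndexKey>, current: &IndexKey, force_reingest: bool) -> bool {
    !force_reingest && stored == Some(current)
}

fn main() {
    // Same bytes, same parser_version before and after the bugfix.
    let key = IndexKey {
        content_hash: "abc123".to_string(), // placeholder hash
        parser_version: "pdf-text-v1".to_string(),
    };
    println!("default ingest skips: {}", should_skip(Some(&key), &key, false));
    println!("--force-reingest skips: {}", should_skip(Some(&key), &key, true));
}
```

Because the bugfix changes neither the file hash nor `parser_version`, the stored key and the current key compare equal, which is why the explicit flag is the only way to re-route affected pages in this model.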
---
## §7 References
- **Parent spec**: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md §1.3 (line 74), §1.4, §9
- **Dogfood evidence**: .omc/reviews/2026-05-27-v0.20-bugfix-dogfood-report.md §Bug #6, §Bug #7
- **Critic result**: .omc/reviews/2026-05-27-v0.20-bugfix2-spec-critic-r1-result.md (findings H-1 through NIT-2, parent invariant audit)
- **External source**: lopdf-0.32.0/src/document.rs:523 (`Document::decode_text` — sole emitter of `?Identity-H Unimplemented?` marker)
- **Code locations**:
- text_quality.rs: `crates/kebab-parse-pdf/src/text_quality.rs:9-106`
- CLI help: `crates/kebab-cli/src/main.rs:157-159`
- Skill integration: `integrations/claude-code/kebab/SKILL.md:57`
- CLI test: `crates/kebab-cli/tests/` (search_help_lists_code_in_media_values)
---
**Status**: Round 1c rewrite COMPLETE. All 9 critic findings (H-1 + M-1/M-2/M-3 + L-1/L-2 + NIT-1/NIT-2 + invariant audit) applied in-session.
**Prior round reference**: Round 1 commits (d9acda5, 436fd01, 241ded5, e674ff4) are merged on branch; this round is independent (text_quality.rs vs. source-fs/connector.rs + chunk/pdf_page_v1.rs).

@@ -0,0 +1,410 @@
---
title: "v0.20.0 sub-item 1 bugfix round 3 — final-dogfood findings"
created: 2026-05-27
status: DRAFT
round: 1c
parent_spec: 2026-04-27-kebab-final-form-design.md
contract_sections:
- "1.1 (ask streaming)"
- "2.2 (error handling)"
- "2.4 (JSON wire schema)"
- "3.1 (config XDG)"
- "4.1 (capabilities schema)"
source_report: .omc/reviews/2026-05-27-v0.20-final-dogfood-report.md
---
# v0.20.0 sub-item 1 bugfix round 3 — final-dogfood findings
This spec is the fix design for the **5 bugs** found in the post-bugfix2 final dogfood (2026-05-27). PR #189 force-update (base=main). Spec scope: root cause + fix decision + acceptance criteria + parent spec traceability. Bug #12 was falsified (out of scope). All 5 fixes range from trivial to a small refactor (existing 1350 tests + 5 or more added tests).
---
## §1 Problem statement
### §1.1 Bug #9: capabilities false negative (Critical)
In `kebab schema --json`, both `capabilities.streaming_ask` and `capabilities.single_file_ingest` are hardcoded to `false`. The actual implementation, however:
- `kebab ask --stream` → emits `answer_event.v1` ndjson events correctly (191 events verified).
- `kebab ingest-file <path>` → `ingest_report.v1` for new/updated files works correctly.
- `kebab ingest-stdin --title <T>` → works correctly.
**Impact**: when agents such as an MCP host or the Claude Code skill make routing decisions based on `capabilities: { streaming_ask: false, single_file_ingest: false }`, the result is a false negative. Users are led to believe that working features are unavailable.
### §1.2 Bug #10: config fail-fast (UX)
```bash
kebab search "rust" --config /tmp/nonexistent.toml --json
# exit=0, {"hits":[],"schema_version":"search_response.v1"}
```
When an explicit path is missing, the CLI silently falls back to the default config (XDG path). This is a debugging nightmare: a typo or wrong path surfaces only as 0 hits.
### §1.3 Bug #11: OCR timeout 600s (Critical UX)
`config.pdf.ocr.request_timeout_secs = 600` (a default of 10 minutes per page). Evidence from the metro-korea.pdf dogfood:
- On page 8 and page 13, slow responses from the remote Ollama ran into the full 600s timeout.
- Result: `ms: 600000, chars: 0, skipped: true` was emitted → the body text was not indexed, and 20 minutes were wasted.
**Production impact**: users get no ingest-completion signal, and some pages are not searchable.
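
The 20-minute figure follows directly from two pages each hitting the full timeout. A small worked sketch of that arithmetic, where `worst_case_wait` is a hypothetical helper name and the 60s value anticipates the default proposed later in this spec:

```rust
use std::time::Duration;

/// Time spent waiting on pages that run into the full request timeout.
/// Hypothetical helper for illustration only.
fn worst_case_wait(timed_out_pages: u32, request_timeout: Duration) -> Duration {
    request_timeout * timed_out_pages
}

fn main() {
    let old = worst_case_wait(2, Duration::from_secs(600));
    let new = worst_case_wait(2, Duration::from_secs(60));
    println!("600s default: {} min wasted", old.as_secs() / 60);
    println!("60s default:  {} min wasted", new.as_secs() / 60);
}
```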
### §1.4 Bug #13: schema.models single value (UX)
```json
{
"chunker_version": "md-heading-v1",
"parser_version": "md-frontmatter-v2",
...
}
```
However, multiple versions are active in the corpus:
- parsers: `md-frontmatter-v2`, `pdf-text-v1`, `code-rust-v1`, `code-python-v1`, `none-v1`.
- chunkers: `md-heading-v1`, `pdf-page-v1.1`, `code-rust-ast-v1`, `code-python-ast-v1`, `dockerfile-file-v1`, `k8s-manifest-resource-v1`, `manifest-file-v1`, `code-text-paragraph-v1`.
**Impact**: `kebab schema` gives users an incomplete picture of the active versions, with a risk of omissions during a version cascade audit.
### §1.5 Bug #14: empty query silent (Minor UX)
```bash
kebab search "" --json
# exit=0, {"hits":[],"next_cursor":null,"schema_version":"search_response.v1"}
```
An empty (or whitespace-only) query silently returns 0 hits. This is a user mistake, so an explicit error is the consistent response.
---
## §2 Scope + non-scope
### §2.1 Included: 5 bug fix
| Bug | Category | Severity | Fix type |
|-----|----------|----------|----------|
| #9 | wire schema | critical | capability flag hardcoded boolean → actual feature check |
| #10 | config UX | medium | silent fallback → error.v1 with config_not_found |
| #11 | OCR config | critical | default 600s → 60s timeout |
| #13 | wire schema | medium | single field → additive array fields (backward compat) |
| #14 | input validation | minor | empty query silent → error.v1 with invalid_input |
### §2.2 Out of scope
- **Bug #12 (falsified)**: in `inspect doc`, blocks[].text shows a "?" placeholder for the code parser. Root cause: the content is not in `.text`; the `.code` field is emitted correctly. The user workflow can reach the content via `.code` → out of spec scope.
- Other axes from dogfood report §12 (ranking bias, multi-root caveat) → separate phase.
---
## §3 Decisions
### §3.1 Bug #9: capabilities correction
**Decision**: update the two fields in `schema.rs::capabilities_snapshot()` to true.
```rust
fn capabilities_snapshot() -> Capabilities {
Capabilities {
json_mode: true,
ingest_progress: true,
ingest_cancellation: true,
rag_multi_turn: true,
search_cache: true,
incremental_ingest: true,
streaming_ask: true, // ← WAS FALSE, actual TRUE
http_daemon: false, // ← preserved (not-impl, separate sub-item)
mcp_server: true,
single_file_ingest: true, // ← WAS FALSE, actual TRUE
bulk_search: true,
}
}
```
**Rationale**: the actual implementation supports production-grade streaming ask and single-file ingest. The schema report must match reality for agent routing to be accurate.
### §3.2 Bug #10: config_not_found error
**Decision**: `kebab-config` defines its own error type `ConfigNotFound`, and `kebab-app::error_wire` adds a classify arm.
Pseudo-code:
```rust
// crates/kebab-config/src/lib.rs (or an appropriate error module)
use std::path::{Path, PathBuf};
#[derive(Debug, thiserror::Error)]
#[error("config file does not exist: {path}")]
pub struct ConfigNotFound {
pub path: PathBuf,
}
// Inside Config::load:
pub fn load(opt_path: Option<&Path>) -> anyhow::Result<Config> {
match opt_path {
Some(p) if !p.exists() => Err(anyhow::Error::new(ConfigNotFound { path: p.to_path_buf() })),
Some(p) => Self::from_file(p),
None => Self::from_xdg_default_or_defaults(),
}
}
```
Classify arm in `kebab-app/src/error_wire.rs`:
```rust
if let Some(e) = err.downcast_ref::<kebab_config::ConfigNotFound>() {
return ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "config_not_found".to_string(),
message: format!("config file does not exist: {}", e.path.display()),
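// Render the path lossily: serializing a non-UTF-8 `PathBuf` directly would make `json!` panic.
details: json!({ "path": e.path.display().to_string() }),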
hint: Some("verify --config argument; use --config to point to a writable toml file, or omit to use XDG default".to_string()),
};
}
```
**Exit code**: 2 (config error, not 0 silent).
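
The fail-fast rule can be exercised without `anyhow` or `thiserror`. The following is a standard-library-only sketch of the same three-way match; `ConfigSource` and `resolve_config` are hypothetical names introduced here, and the sketch only decides which source would be used rather than parsing any TOML.

```rust
use std::fmt;
use std::path::{Path, PathBuf};

/// Std-only stand-in for the `ConfigNotFound` error described above.
#[derive(Debug)]
struct ConfigNotFound {
    path: PathBuf,
}

impl fmt::Display for ConfigNotFound {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "config file does not exist: {}", self.path.display())
    }
}

impl std::error::Error for ConfigNotFound {}

/// Hypothetical result type: which config source would be loaded.
#[derive(Debug, PartialEq)]
enum ConfigSource {
    Explicit(PathBuf),
    XdgDefault,
}

fn resolve_config(opt_path: Option<&Path>) -> Result<ConfigSource, ConfigNotFound> {
    match opt_path {
        // An explicit but missing path is an error, never a silent fallback.
        Some(p) if !p.exists() => Err(ConfigNotFound { path: p.to_path_buf() }),
        Some(p) => Ok(ConfigSource::Explicit(p.to_path_buf())),
        // Only an omitted --config falls back to the XDG default.
        None => Ok(ConfigSource::XdgDefault),
    }
}

fn main() {
    // The path below is assumed not to exist on the machine running this sketch.
    match resolve_config(Some(Path::new("/nonexistent-dir/kebab-config.toml"))) {
        Ok(src) => println!("using {src:?}"),
        Err(e) => println!("error: {e}"),
    }
    println!("{:?}", resolve_config(None).ok());
}
```

The guard arm has to come before the plain `Some(p)` arm; with the order reversed, the missing-path case would never be reached.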
### §3.3 Bug #11: OCR timeout 60s
**Decision**: lower `default_pdf_ocr_request_timeout_secs()` from 600 to 60.
```rust
fn default_pdf_ocr_request_timeout_secs() -> u64 {
60 // 1 min, production-friendly per dogfood evidence
}
```
**Doc-comment to add**:
```rust
/// Default OCR request timeout in seconds. Most pages complete in 6-32s.
/// The 60s default is an upper bound on that normal range; a request that
/// exceeds it may indicate Ollama unavailability or a very dense/high-res page.
/// Override via [pdf.ocr] request_timeout_secs = N in config.toml.
```
### §3.4 Bug #13: active_parsers + active_chunkers (additive)
**Decision**: additive minor change to the wire schema. Add two arrays to the `Models` struct and keep the existing single fields (backward compat). `kebab-store-sqlite` provides the fetch methods.
**Store API** (crates/kebab-store-sqlite/src/lib.rs):
```rust
impl SqliteStore {
/// SELECT DISTINCT parser_version FROM documents WHERE parser_version IS NOT NULL ORDER BY parser_version
pub fn fetch_distinct_parser_versions(&self) -> anyhow::Result<Vec<String>> {
let conn = self.conn()?;
let mut stmt = conn.prepare(
"SELECT DISTINCT parser_version FROM documents
WHERE parser_version IS NOT NULL
ORDER BY parser_version"
)?;
let rows = stmt.query_map([], |row| row.get::<_, String>(0))?;
let mut out = Vec::new();
for r in rows { out.push(r?); }
Ok(out)
}
pub fn fetch_distinct_chunker_versions(&self) -> anyhow::Result<Vec<String>> {
let conn = self.conn()?;
let mut stmt = conn.prepare(
"SELECT DISTINCT chunker_version FROM chunks
WHERE chunker_version IS NOT NULL
ORDER BY chunker_version"
)?;
let rows = stmt.query_map([], |row| row.get::<_, String>(0))?;
let mut out = Vec::new();
for r in rows { out.push(r?); }
Ok(out)
}
}
```
**Models struct** (crates/kebab-app/src/schema.rs):
```rust
pub struct Models {
/// Deprecated since v0.20.1. Use active_parsers for multi-parser corpus.
/// Reports default parser version (markdown path).
pub parser_version: String,
/// Deprecated since v0.20.1. Use active_chunkers for multi-chunker corpus.
pub chunker_version: String,
/// All parser versions active in corpus (v0.20.1+). May be empty if corpus is empty.
pub active_parsers: Vec<String>,
/// All chunker versions active in corpus (v0.20.1+). May be empty if corpus is empty.
pub active_chunkers: Vec<String>,
pub embedding_version: String,
pub prompt_template_version: String,
pub index_version: String,
pub corpus_revision: u64,
}
```
**Computation** (crates/kebab-app/src/schema.rs::collect_models):
```rust
let store = open_store_for_stats(cfg)?;
let active_parsers = store.fetch_distinct_parser_versions().unwrap_or_default();
let active_chunkers = store.fetch_distinct_chunker_versions().unwrap_or_default();
Ok(Models {
    parser_version: active_parsers
        .first()
        .cloned()
        .unwrap_or_else(|| kebab_parse_md::PARSER_VERSION.to_string()),
    chunker_version: active_chunkers
        .first()
        .cloned()
        .unwrap_or_else(|| kebab_chunk::md_heading_v1::VERSION_LABEL.to_string()),
    active_parsers,
    active_chunkers,
    ...
})
```
**Fallback**: the markdown fallback is kept. The existing hardcoded `parser_version` + `chunker_version` values are preserved (backward compat).
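
The interplay between the `active_*` arrays and the deprecated single-value fields can be sketched with the standard library alone. This is a minimal illustration, not the store implementation: `distinct_sorted`, `legacy_single`, and the `MD_PARSER_FALLBACK` value are hypothetical names introduced here, and the sketch assumes the SQL `ORDER BY` uses byte-wise ordering (SQLite's default `BINARY` collation), which matches Rust's `String` ordering.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for `kebab_parse_md::PARSER_VERSION`;
/// the real value is not shown in this spec.
const MD_PARSER_FALLBACK: &str = "md-parser-fallback";

/// Mirrors `SELECT DISTINCT col ... WHERE col IS NOT NULL ORDER BY col`:
/// drops NULLs (`None`), de-duplicates, and sorts ascending (byte-wise).
fn distinct_sorted(rows: &[Option<&str>]) -> Vec<String> {
    rows.iter()
        .flatten()
        .map(|s| s.to_string())
        .collect::<BTreeSet<_>>()
        .into_iter()
        .collect()
}

/// Deprecated single-value field: the first active version if any,
/// otherwise the markdown default.
fn legacy_single(active: &[String], fallback: &str) -> String {
    active
        .first()
        .cloned()
        .unwrap_or_else(|| fallback.to_string())
}
```

With this logic, an empty corpus yields an empty array and the markdown default in the single field. On a non-empty corpus the single field reports whichever active version sorts first, so it equals the markdown label only when that label sorts ahead of the others.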
### §3.5 Bug #14: empty query validation
**Decision**: both the `search` and `ask` commands check for an empty query and emit error.v1.
**Search command** (crates/kebab-cli/src/main.rs::search arm):
```rust
if let Some(q) = query.as_ref() {
    if q.trim().is_empty() {
        return Err(anyhow::Error::new(kebab_app::StructuredError(ErrorV1 {
            schema_version: ERROR_V1_ID.to_string(),
            code: "invalid_input".to_string(),
            message: "query is empty; provide a non-empty search term or use --bulk".into(),
            details: Value::Null,
            hint: Some("e.g. `kebab search 'rust async'` or `kebab search --bulk < queries.ndjson`".into()),
        })));
    }
}
```
**Ask command** (crates/kebab-cli/src/main.rs::ask arm):
```rust
if query.trim().is_empty() {
    return Err(anyhow::Error::new(kebab_app::StructuredError(ErrorV1 {
        schema_version: ERROR_V1_ID.to_string(),
        code: "invalid_input".to_string(),
        message: "query is empty; provide a non-empty prompt".into(),
        details: Value::Null,
        hint: Some("e.g. `kebab ask 'explain this code'`".into()),
    })));
}
```
Both commands now validate; no silent fallback.
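
Since the two arms differ only in message and hint, the check itself could be factored into one helper. The following is a std-only sketch under stated assumptions: `InvalidInput` is a simplified stand-in for the `ErrorV1` wire struct, and `require_non_empty_query` is a hypothetical name that does not appear in the codebase.

```rust
/// Simplified stand-in for the `ErrorV1` wire struct (only the fields used here).
#[derive(Debug, PartialEq)]
struct InvalidInput {
    code: &'static str,
    message: String,
}

/// Rejects empty and whitespace-only queries with an `invalid_input` error;
/// any other query passes through unchanged.
fn require_non_empty_query(query: &str, message: &str) -> Result<(), InvalidInput> {
    if query.trim().is_empty() {
        return Err(InvalidInput {
            code: "invalid_input",
            message: message.to_string(),
        });
    }
    Ok(())
}
```

A search arm would call it only when a query is present (`--bulk` supplies none), while the ask arm would call it unconditionally.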
---
## §4 Implementation specification
### §4.1 Files to modify
1. **Bug #9 capability fix**: `crates/kebab-app/src/schema.rs`
- lines 137-151, `capabilities_snapshot()`: flip `streaming_ask: false` → `true`, `single_file_ingest: false` → `true`.
- add test: `capabilities_streaming_ask_matches_cli_surface()`.
- add test: `capabilities_single_file_ingest_matches_cli_surface()`.
2. **Bug #10 config_not_found**: Two files
- `crates/kebab-config/src/lib.rs`:
- Define `ConfigNotFound` error struct (with `#[derive(Debug, thiserror::Error)]`).
- Modify `Config::load(opt_path: Option<&Path>)` — path existence check, `return Err(anyhow::Error::new(ConfigNotFound { ... }))`.
- add test: `config_load_explicit_nonexistent_path_returns_error()`.
- `crates/kebab-app/src/error_wire.rs`:
- Add classify arm after existing `ConfigInvalid` case.
- Map `kebab_config::ConfigNotFound` → `ErrorV1 { code: "config_not_found", ... }`.
3. **Bug #13 schema.models**: Three components
- `crates/kebab-store-sqlite/src/lib.rs`:
- Implement `fetch_distinct_parser_versions()` — SQL SELECT DISTINCT on documents.parser_version + ORDER BY.
- Implement `fetch_distinct_chunker_versions()` — SQL SELECT DISTINCT on chunks.chunker_version + ORDER BY.
- `crates/kebab-app/src/schema.rs`:
- Modify `Models` struct — add `active_parsers: Vec<String>`, `active_chunkers: Vec<String>` fields.
- Modify computation logic (`collect_models` or equiv) — call store methods, populate arrays, fallback to markdown defaults for single fields.
- add test: `schema_models_active_arrays_empty_on_empty_corpus()`.
- add test: `schema_models_active_arrays_populated_after_mixed_ingest()`.
- `docs/wire-schema/v1/schema.schema.json`:
- `Models` object — add `"active_parsers": { "type": "array", "items": { "type": "string" } }`.
- add `"active_chunkers": { "type": "array", "items": { "type": "string" } }`.
- Mark deprecated in comment: `parser_version` + `chunker_version` (additive, backward compat).
4. **Bug #14 empty query validation**: `crates/kebab-cli/src/main.rs`
- search command arm: add `if query.trim().is_empty()` check → error.v1 code=invalid_input.
- ask command arm: add identical `if query.trim().is_empty()` check → error.v1 code=invalid_input.
5. **Wire schema v1 doc update**: `docs/wire-schema/v1/`
- Update schema doc to note `active_parsers` / `active_chunkers` optional (additive).
6. **Integration**: `integrations/claude-code/kebab/SKILL.md`
- Update `schema.models` surface docs — reference new `active_*` arrays for multi-version corpora.
7. **Tests** (new or extended):
- `crates/kebab-cli/tests/`: invalid --config path (absolute + relative) → error.v1 + exit≠0.
- `crates/kebab-cli/tests/`: empty query (search + ask) → error.v1 code=invalid_input + exit≠0.
- `crates/kebab-config/tests/`: config file not found → ConfigNotFound error.
- `crates/kebab-app/tests/`: mixed corpus schema — active_parsers/chunkers include all ingested versions.
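
The Bug #10 items above (a typed `ConfigNotFound` error, an explicit-path existence check, and a classify arm that downcasts to it) can be illustrated without external crates. This is a rough sketch, not the planned implementation: the `Display`/`Error` impls are written by hand in place of `thiserror`, and `ConfigSource`, `resolve_config_source`, `error_code`, and the `"internal"` fallback code are hypothetical names used only for illustration.

```rust
use std::fmt;
use std::path::{Path, PathBuf};

/// Typed error so the wire layer can recognise it by downcast.
#[derive(Debug)]
struct ConfigNotFound {
    path: PathBuf,
}

impl fmt::Display for ConfigNotFound {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "config file not found: {}", self.path.display())
    }
}

impl std::error::Error for ConfigNotFound {}

/// Hypothetical result of deciding where the config comes from.
#[derive(Debug, PartialEq)]
enum ConfigSource {
    Explicit(PathBuf),
    DefaultLookup,
}

/// An explicit path must exist (checked relative to the cwd for relative
/// paths); without an explicit path the default lookup is left untouched.
fn resolve_config_source(opt_path: Option<&Path>) -> Result<ConfigSource, ConfigNotFound> {
    match opt_path {
        Some(p) if !p.exists() => Err(ConfigNotFound { path: p.to_path_buf() }),
        Some(p) => Ok(ConfigSource::Explicit(p.to_path_buf())),
        None => Ok(ConfigSource::DefaultLookup),
    }
}

/// Classify arm: map the typed error to its wire code.
/// "internal" is a placeholder for whatever the other arms return.
fn error_code(err: &(dyn std::error::Error + 'static)) -> &'static str {
    if err.downcast_ref::<ConfigNotFound>().is_some() {
        "config_not_found"
    } else {
        "internal"
    }
}
```

Keeping the error as a distinct type, rather than a formatted string, is what lets the classify arm stay a simple downcast.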
### §4.2 Regression checks
- Existing 1350 workspace tests: `cargo test --workspace --no-fail-fast -j 1` must pass green.
- All non-bug capabilities (json_mode, ingest_progress, ingest_cancellation, rag_multi_turn, search_cache, incremental_ingest, mcp_server, bulk_search) stay true.
- Default config path resolution (no --config) unchanged — silent fallback to XDG only if `--config` not passed.
- Relative path behavior (cwd-relative, Rust std path::Path::exists()) preserved.
- Empty corpus → empty `active_parsers` / `active_chunkers` array (not null, not error).
- Existing hardcoded `parser_version` + `chunker_version` fields continue to report markdown defaults (backward compat).
- Schema version bump not required (wire schema additive minor, backward compat).
---
## §5 Acceptance criteria
| # | Criterion | Evidence |
|----|-----------|----------|
| AC-1 | `kebab schema --json` emits `streaming_ask: true` + `single_file_ingest: true` | `cargo test -p kebab-app capabilities_ -j 4` green |
| AC-2 | `kebab search "x" --config /nonexistent.toml --json` emits exit≠0 + error.v1 code=config_not_found | `cargo test -p kebab-config config_load_explicit_nonexistent_path_returns_error -j 4` green |
| AC-3 | `cargo test -p kebab-config pdf_ocr_request_timeout_default_is_60s -j 4` → green | unit test confirms default = 60s (no manual timing) |
| AC-4 | After mixed ingest (MD + PDF + code), `kebab schema --json` emits both `active_parsers` + `active_chunkers` arrays containing all versions | integration test pass |
| AC-5 | `kebab search "" --json` and `kebab search " " --json` both emit exit≠0 + error.v1 code=invalid_input | integration test pass |
| AC-6 | `kebab ask "" --json` emits exit≠0 + error.v1 code=invalid_input (ask symmetry) | integration test pass |
| AC-7 | `kebab search "rust" --config nonexistent-relative.toml --json` (relative path) emits exit≠0 + error.v1 code=config_not_found | integration test pass |
| AC-8 | All 1350+ workspace tests pass; no new failures | `cargo test --workspace --no-fail-fast -j 1` exit=0 |
| AC-9 | Wire schema backward compat: old clients reading `parser_version` + `chunker_version` still work; `active_*` arrays optional per schema | JSON schema `additionalProperties: false` review |
| AC-10 | `kebab ask --stream` still works; streaming events emitted (no regression) | manual `kebab ask --stream "explain this" 2>&1 \| head -3` |
---
## §6 Risks + resolutions
### Risks
- **R-1** (Bug #10): Relative path `./config.toml` must resolve from cwd, not from binary location. **Resolution**: Rust `std::path::Path::exists()` is cwd-relative; no workaround needed.
- **R-2** (Bug #13): Empty corpus → empty `active_parsers` / `active_chunkers` array. **Resolution**: Unit test `schema_models_active_arrays_empty_on_empty_corpus()` mandated (AC-4).
- **R-3** (resolved): `collect_models` uses no cache (every-call re-computation). `active_parsers/chunkers` reflect corpus state at invocation time. If future caching is added, `corpus_revision` increment signals invalidation — document at that time.
- **R-4** (Bug #14): `ask` command validation — covered by same fix (§3.5 mandates both search + ask).
- **R-5** (Bug #11): 60s may still timeout on very dense/high-res pages. **Mitigation**: User can override via `config.toml [pdf.ocr] request_timeout_secs = N`. Release notes explicitly call this out.
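
For R-5, the default-plus-override behaviour amounts to a single fallback. A minimal sketch, assuming the parsed `[pdf.ocr] request_timeout_secs` value arrives as an `Option<u64>`; the constant and function names here are illustrative, not the ones in `kebab-config`:

```rust
use std::time::Duration;

/// New default from Bug #11 (previously 600s).
const DEFAULT_PDF_OCR_REQUEST_TIMEOUT_SECS: u64 = 60;

/// Uses the user's `request_timeout_secs` when set, otherwise the 60s default.
fn pdf_ocr_request_timeout(override_secs: Option<u64>) -> Duration {
    Duration::from_secs(override_secs.unwrap_or(DEFAULT_PDF_OCR_REQUEST_TIMEOUT_SECS))
}
```

A user with dense or high-resolution scans would set a larger `N` in `config.toml`, and the rest of the pipeline would see the longer `Duration` without further changes.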
---
## §7 Parent spec deviation (HOTFIXES handoff)
**F-11 MEDIUM finding**: parent spec `2026-04-27-kebab-final-form-design.md` (frozen) specifies PDF OCR request_timeout_secs = 600s (§1000 + §1628 OQ-1, rationale: "5x headroom over the 105s measured in a CPU environment"). Bug #11 (dogfood evidence) contradicts this: 600s causes timeouts; 60s is production-optimal.
**Deviation handling**:
1. Parent spec stays frozen (no edits).
2. **HOTFIXES entry (executor Step N)**: `tasks/HOTFIXES.md` receives dated entry:
```markdown
2026-05-27 — PDF OCR request_timeout_secs default 600s → 60s (v0.20.0 bugfix3 dogfood evidence). Bug #11.
```
3. **Parent spec cross-link (executor Step N)**: parent spec `2026-04-27-kebab-final-form-design.md` receives inline comment at §1000 (default value code block) or §1628 (OQ-1 paragraph):
```markdown
<!-- HOTFIX 2026-05-27: default 60s (Bug #11). See tasks/HOTFIXES.md 2026-05-27 entry. -->
```
**Parent spec invariant**: No changes to parent spec text; only cross-link comment + HOTFIXES.md entry. Frozen design contract preserved.
---
## §8 References
- [Dogfood report](../../../.omc/reviews/2026-05-27-v0.20-final-dogfood-report.md) — 5 bugs discovered + decisions.
- [Parent spec (frozen contract)](2026-04-27-kebab-final-form-design.md) — §1, §2, §4 (capabilities, error handling, JSON schema, config XDG).
- `crates/kebab-app/src/schema.rs:137-151` (capabilities_snapshot).
- `crates/kebab-config/src/lib.rs` (Config::load, default_pdf_ocr_request_timeout_secs).
- `crates/kebab-app/src/error_wire.rs` (classify ConfigNotFound).
- `crates/kebab-store-sqlite/src/lib.rs` (fetch_distinct_parser_versions, fetch_distinct_chunker_versions).
- `crates/kebab-cli/src/main.rs` (search + ask query validation).
- `docs/wire-schema/v1/schema.schema.json` (Models + Capabilities objects).
- `tasks/HOTFIXES.md` (2026-05-27 entry, Bug #11 deviation record).