From 46e99470ebb3561b26d66eec71baeafdd67bdfca Mon Sep 17 00:00:00 2001
From: altair823
Date: Thu, 28 May 2026 01:21:34 +0000
Subject: [PATCH] docs(superpowers): v0.20 sub-item 1 bugfix1/2/3 specs + plans + DOGFOOD.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Deliverables of the 3-round dogfood-driven fix cycle:
- bugfix1 (Bug #2/#3/#4): spec 964 lines + plan 848 lines
- bugfix2 (Bug #6/#7, #8 falsified): spec 308 lines + plan 388 lines
- bugfix3 (Bug #9/#10/#11/#13/#14, #12 falsified): spec 410 lines + plan 1043 lines
- docs/DOGFOOD.md: the complete all-round dogfood checklist (§0 environment ~ §13 reference corpus)

Each round's spec/plan was frozen after the critic + verifier round 2 closure ACCEPT, based on dogfood-driven evidence.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/DOGFOOD.md | 831 +++++++++++++
 .../2026-05-27-v0.20-sub1-bugfix-plan.md | 849 ++++++++++++++
 .../2026-05-27-v0.20-sub1-bugfix2-plan.md | 388 ++++++
 .../2026-05-27-v0.20-sub1-bugfix3-plan.md | 1043 +++++++++++++++++
 .../2026-05-27-v0.20-sub1-bugfix-spec.md | 965 +++++++++++++++
 .../2026-05-27-v0.20-sub1-bugfix2-spec.md | 308 +++++
 .../2026-05-27-v0.20-sub1-bugfix3-spec.md | 410 +++++++
 7 files changed, 4794 insertions(+)
 create mode 100644 docs/DOGFOOD.md
 create mode 100644 docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md
 create mode 100644 docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix2-plan.md
 create mode 100644 docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix3-plan.md
 create mode 100644 docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md
 create mode 100644 docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
 create mode 100644 docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix3-spec.md

diff --git a/docs/DOGFOOD.md b/docs/DOGFOOD.md
new file mode 100644
index 0000000..e38b894
--- /dev/null
+++ b/docs/DOGFOOD.md
@@ -0,0 +1,831 @@
+# DOGFOOD: kebab dogfooding scenarios
+
+> This file is the single reference for dogfood scenarios covering every kebab feature. For each new dogfood request, consult the scenario list in this file. Update the scenarios in the relevant § section with every new sub-item or release.
+>
+> Usage: on a dogfooding request, go through this doc's scenario catalog → pick the appropriate scenario list → set up the environment (§0) → run the scenarios → log bugs → run an immediate fix cycle for every bug found.
+
+## Table of contents
+- [§0 Environment + prerequisites](#0-environment--prerequisites)
+- [§1 Ingest](#1-ingest)
+- [§2 Search](#2-search)
+- [§3 Ask (RAG)](#3-ask-rag)
+- [§4 Inspect / Fetch / List](#4-inspect--fetch--list)
+- [§5 Version cascade](#5-version-cascade)
+- [§6 Wire schema (`--json`)](#6-wire-schema---json)
+- [§7 TUI (P9)](#7-tui-p9)
+- [§8 MCP (P9-fb-30)](#8-mcp-p9-fb-30)
+- [§9 Doctor / Schema / Reset](#9-doctor--schema--reset)
+- [§10 Eval (P5)](#10-eval-p5)
+- [§11 Edge cases](#11-edge-cases)
+- [§12 Bug discovery checklist](#12-bug-discovery-checklist)
+- [§13 Reference dogfood corpus](#13-reference-dogfood-corpus)
+
+---
+
+## §0 Environment + prerequisites
+
+### 0.1 release binary
+
+```bash
+export CARGO_TARGET_DIR=/build/out/cargo-target/target
+export RELEASE_BIN="${CARGO_TARGET_DIR:-target}/release/kebab"
+cargo build --release -p kebab-cli -j 4 2>&1 | tail -5
+"$RELEASE_BIN" --version # expected: kebab X.Y.Z (current workspace version)
+```
+
+### 0.2 Ollama endpoint check
+
+```bash
+# Remote (the user's dogfood host)
+curl -s --connect-timeout 3 http://192.168.0.47:11434/api/tags | jq -r '.models[]?.name'
+# expected: qwen2.5vl:3b / gemma4:e4b / gemma4:26b / bge-m3:latest / nomic-embed-text:latest
+
+# Local fallback
+curl -s --connect-timeout 3 http://localhost:11434/api/tags 2>&1 | head -3
+```
+
+### 0.3 isolated KB workspace
+
+```bash
+DOGFOOD=/build/cache/tmp/-dogfood
+mkdir -p "$DOGFOOD/kb" "$DOGFOOD/data"
+
+# Default config from `kebab init` then customize
+HOME=/tmp/dogfood-home XDG_CONFIG_HOME="$DOGFOOD/xdg" "$RELEASE_BIN" init
+cp "$DOGFOOD/xdg/kebab/config.toml" "$DOGFOOD/config.toml"
+
+# User customization: workspace.root, storage.data_dir, models.{embedding,llm}.endpoint, feature enablement
+```
+
+### 0.4 disk + cargo clean policy
+
+memory `feedback-cargo-clean-policy`:
+- `/build` avail < 500GB OR
+  target > 200GB: run `cargo clean` only in that case.
+- Otherwise, use incremental builds.
+
+### 0.5 dogfood corpus location
+
+See §13 of this doc. Standard corpus = PoC 9 PDF + markdown notes + code samples + image set.
+
+---
+
+## §1 Ingest
+
+### §1.1 Markdown ingest (P1)
+
+**Basics**:
+````bash
+# Test markdown KB
+mkdir -p "$DOGFOOD/kb/notes"
+cat > "$DOGFOOD/kb/notes/sample.md" <<'EOF'
+# Title
+
+Body paragraph. **bold** + *italic*.
+
+## Heading 2
+
+- list item 1
+- list item 2
+
+```rust
+fn main() { println!("code block"); }
+```
+EOF
+
+"$RELEASE_BIN" ingest --config "$DOGFOOD/config.toml" --json | tail -1 | jq '.items[] | {doc_path, kind, block_count, chunk_count}'
+````
+
+**verify**:
+- `kind` = `"new"` (first ingest) / `"unchanged"` (no change) / `"updated"` (modified).
+- `block_count` ≥ heading/paragraph count.
+- `chunk_count` ≥ 1.
+
+**scenarios**:
+- 1.1.a empty markdown (0 byte) → expected warning + 0 chunks.
+- 1.1.b deeply nested heading (h1-h6) → all blocks captured.
+- 1.1.c frontmatter (YAML/TOML) → metadata preserved.
+- 1.1.d code fence (Rust/Python/etc.) → chunked as a code block.
+- 1.1.e mixed inline (link/image/inline-code/strong/em) → inline preserved.
+- 1.1.f huge markdown (1 MB+) → walker pass + chunk count delta.
+
+### §1.2 Image ingest (P6)
+
+**Config**:
+```toml
+[image.ocr]
+enabled = true
+engine = "ollama-vision"
+model = "gemma4:e4b"
+endpoint = "http://192.168.0.47:11434"
+
+[image.caption]
+enabled = false # opt-in
+```
+
+**verify**:
+- Only `*.png` / `*.jpg` / `*.jpeg` are ingest targets.
+- OCR text lands in `Block::ImageRef.ocr.joined`.
+- With `[image.caption].enabled=true`, the caption is stored as well.
+
+**scenarios**:
+- 1.2.a Korean OCR (scanned Korean PNG) → OCR text + search hit.
+- 1.2.b English OCR (typed text screenshot) → alnum > 90%.
+- 1.2.c photo (natural photo, no OCR text) → empty OCR or warning.
+- 1.2.d corrupt image → graceful error.
+- 1.2.e oversized image (> max_pixels) → downscale.
+
+### §1.3 PDF text ingest (P7-1)
+
+**Basics**:
+- vector PDF (embedded text): `PdfTextExtractor` extracts text page by page.
+- 1 `Block::Paragraph` per page (P7-1 invariant).
+
+**verify**:
+- `parser_version = "pdf-text-v1"`.
+- `chunker_version = "pdf-page-v1"` (or `"pdf-page-v1.1"` from v0.20.1).
+- `block_count` ≥ page count.
+
+**scenarios**:
+- 1.3.a single page vector PDF → 1 block.
+- 1.3.b multi-page (10+ pages) → block per page.
+- 1.3.c large PDF (50+ pages, 50 MB+) → monitor ingest time + memory.
+- 1.3.d encrypted PDF → friendly error wording (`qpdf --decrypt` hint).
+- 1.3.e corrupt PDF (truncated bytes) → graceful error.
+- 1.3.f PDF with annotations / forms → extraction works.
+
+### §1.4 PDF scanned OCR ingest (v0.20.0 sub-item 1)
+
+**Config**:
+```toml
+[pdf.ocr]
+enabled = true
+always_on = false
+model = "qwen2.5vl:3b"
+endpoint = "http://192.168.0.47:11434"
+valid_ratio_threshold = 0.5
+min_char_count = 20
+```
+
+**verify**:
+- IngestEvent::PdfOcrStarted / PdfOcrFinished emit.
+- `IngestItem.pdf_ocr_pages > 0` for scanned PDF.
+- `IngestItem.pdf_ocr_ms_total > 0`.
+- CLI printer: `📷 OCR page N...` / `✓ OCR page N (chars chars, msms via ollama-vision)`.
+
+**scenarios** (the 9 scenarios from the previous dogfood report):
+- 1.4.a scanned Korean (F1 / F2) → OCR text indexed + search hit.
+- 1.4.b multi-scanned PDF (5+ files) → chunk_id collision 0 (Bug #3 fix verify).
+- 1.4.c always_on=true → vector PDF pages also get dual-block OCR.
+- 1.4.d valid_ratio_threshold variation (0.3 / 0.5 / 0.8) → classification of mojibake vs. normal pages.
+- 1.4.e min_char_count variation (5 / 20 / 100) → OCR invoked for short pages.
+- 1.4.f DCTDecode-only skip (F6 FlateDecode / F7 CCITTFax) → warning + skip.
+- 1.4.g force-reingest (`--force-reingest`) → OCR re-runs.
+- 1.4.h cancel handle (Ctrl+C or SIGINT) → graceful abort.
+- 1.4.i v0.19 → v0.20 upgrade UX (parser_version preserved + manual force-reingest required).
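Scenarios 1.4.d and 1.4.e vary `valid_ratio_threshold` and `min_char_count`. As a rough illustration only, the sketch below shows one way two such thresholds could combine into a per-page "needs OCR" decision. The function name `page_needs_ocr`, the definition of a "valid" character (Unicode alphanumeric), and the exact comparison rules are assumptions made for this example; the actual rule in kebab's PDF OCR path is not described in this doc.

```rust
/// Hypothetical decision helper (NOT kebab's actual implementation).
/// Returns true when the text extracted from a PDF page looks too short
/// or too garbled to trust, so the page would be handed to OCR.
fn page_needs_ocr(extracted: &str, valid_ratio_threshold: f64, min_char_count: usize) -> bool {
    // Ignore whitespace so that padding does not inflate the character count.
    let visible: Vec<char> = extracted.chars().filter(|c| !c.is_whitespace()).collect();

    // Assumption: an empty or very short page is treated as "no usable text".
    if visible.is_empty() || visible.len() < min_char_count {
        return true;
    }

    // Assumption: "valid" means Unicode alphanumeric (covers Hangul and Latin).
    let valid = visible.iter().filter(|c| c.is_alphanumeric()).count();
    let ratio = valid as f64 / visible.len() as f64;

    // A low ratio of valid characters is treated as mojibake.
    ratio < valid_ratio_threshold
}
```

Under these assumptions, a 20-character page with 12 alphanumeric characters (ratio 0.6) stays on the text path at thresholds 0.3 and 0.5 but is sent to OCR at 0.8, which is the kind of boundary scenario 1.4.d is meant to probe.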
+
+### §1.5 Code ingest (P10)
+
+**Tier 1 (AST chunkers)**: Rust / Python / TS / JS / Go / Java / Kotlin / C / C++
+
+**Config**:
+```toml
+[ingest.code]
+max_file_bytes = 262144
+max_file_lines = 5000
+ast_chunk_max_lines = 200
+```
+
+**verify per lang**:
+- `chunker_version` = `code-{lang}-ast-v1`.
+- symbol path correct (file-scope nesting / module-path / class-method nesting).
+- `extra_skip_globs` is honored.
+
+**scenarios**:
+- 1.5.a Rust crate (workspace + multiple modules) → mod / fn / impl chunks.
+- 1.5.b Python package (src/ layout, dotted module path).
+- 1.5.c TypeScript/JavaScript (decorators, generators, classes).
+- 1.5.d Go (package + struct methods).
+- 1.5.e Java/Kotlin (class + inner class + method).
+- 1.5.f C (typedef alias unit, header).
+- 1.5.g C++ (namespace::Class::method recursive).
+- 1.5.h `.h` files (C vs C++ syntax) → Tier 3 fallback when the tree-sitter-c parse fails.
+
+**Tier 2 (resource-aware)**: yaml/k8s, dockerfile, manifest (toml/json/xml/groovy)
+
+**verify**:
+- k8s YAML: `apiVersion+kind` per resource.
+- Dockerfile: whole-file `dockerfile-file-v1`.
+- Cargo.toml / package.json / pom.xml: whole-file `manifest-file-v1`.
+
+**scenarios**:
+- 1.5.i k8s manifest (multi-resource via `---`).
+- 1.5.j Dockerfile (multi-stage build).
+- 1.5.k Cargo.toml workspace (members + dependencies).
+- 1.5.l invalid YAML → Tier 3 fallback.
+
+**Tier 3 (paragraph fallback)**: shell, non-k8s YAML, AST failure
+
+**verify**:
+- `chunker_version = "code-text-paragraph-v1"`.
+- Blank-line paragraph segmentation + 80-line / 20-overlap window for oversize.
+
+**scenarios**:
+- 1.5.m `.sh` / `.bash` / `.zsh` → paragraph chunks.
+- 1.5.n empty file → 0 chunks + warning.
+- 1.5.o very long shell script (1000+ lines) → line-window split.
+
+### §1.6 Single-file / stdin ingest (p9-fb-31)
+
+```bash
+# File outside the workspace
+"$RELEASE_BIN" ingest-file ~/Documents/external.md --config "$DOGFOOD/config.toml"
+# expected: copied to _external/.md, then ingested.
+
+# stdin
+echo "# stdin content" | "$RELEASE_BIN" ingest-stdin --title "from stdin" --config "$DOGFOOD/config.toml"
+```
+
+**scenarios**:
+- 1.6.a external markdown ingest.
+- 1.6.b stdin with `--source-uri` flag.
+- 1.6.c .kebabignore matched file → warn + proceed.
+- 1.6.d binary file → reject (markdown only).
+
+### §1.7 Incremental ingest
+
+**verify**:
+- Ingest again right after the first ingest → everything is `unchanged`.
+- After modifying some files → only those files are `updated`.
+- Adding a new file → `new`.
+- Deleted files → `purged_deleted_files` count.
+
+**scenarios**:
+- 1.7.a unchanged path (parser_version + chunker_version + embedding_version match).
+- 1.7.b stale file purge.
+- 1.7.c `--force-reingest` (force-update path).
+
+### §1.8 Ingest progress (wire `ingest_progress.v1`)
+
+**ndjson stream in `--json` mode**:
+- `scan_started` → `scan_completed` → `(asset_started → [pdf_ocr_*]* → asset_finished)+` → `completed` | `aborted`.
+
+**verify**:
+- ordering invariant (design §2.4a).
+- per-asset `idx/total/path/media/result/chunks`.
+- aggregate `counts` on `completed` / `aborted`.
+
+---
+
+## §2 Search
+
+### §2.1 Lexical search (P2)
+
+```bash
+"$RELEASE_BIN" search --config "$DOGFOOD/config.toml" "한국어" --mode lexical --k 10
+```
+
+**verify**:
+- FTS5 trigram (v0.17.0): Korean queries of 2+ characters hit.
+- `chunks_fts` schema (separate `text` and `heading_path` columns).
+
+**scenarios**:
+- 2.1.a Korean trigram query (`해시 충돌`).
+- 2.1.b English/Korean mixed (`Rust 충돌`).
+- 2.1.c 1-char query → 0 hit + hint.
+- 2.1.d raw mode escape (`heading_path : `).
+- 2.1.e FTS5 phrase query (`"specific phrase"`).
+- 2.1.f exclusion (`-token`).
+
+### §2.2 Vector search (P3)
+
+```bash
+"$RELEASE_BIN" search --config "$DOGFOOD/config.toml" "어떤 문장의 의미적 검색" --mode vector --k 10
+```
+
+**verify**:
+- `multilingual-e5-small` (384d) or `bge-m3` embedding.
+- Separate LanceDB table per model.
+- similarity score normalized.
+
+**scenarios**:
+- 2.2.a Korean semantic query.
+- 2.2.b English semantic query.
+- 2.2.c domain-specific (code semantic).
+- 2.2.d cross-lingual (mixed Korean/English query).
+
+### §2.3 Hybrid search (RRF, P3)
+
+```bash
+"$RELEASE_BIN" search --config "$DOGFOOD/config.toml" "query" --mode hybrid --k 10
+```
+
+**verify**:
+- `fusion_score` in `[0, 1]` (normalized).
+- lexical + vector → RRF fusion.
+
+**scenarios**:
+- 2.3.a cases that lexical-only / vector-only miss but hybrid catches (RRF win).
+- 2.3.b fusion_score ordering.
+
+### §2.4 Search filters (p9-fb-36)
+
+```bash
+"$RELEASE_BIN" search --config "$DOGFOOD/config.toml" "query" --tag rust --tag api --lang en --path-glob 'src/**'
+```
+
+**verify**:
+- `--tag` repeatable (OR within).
+- `--lang` ISO code.
+- `--path-glob` workspace_path glob.
+
+### §2.5 Search pagination (p9-fb-34)
+
+```bash
+"$RELEASE_BIN" search "query" --max-tokens 1000 --snippet-chars 200 --json | jq '.next_cursor'
+"$RELEASE_BIN" search "query" --cursor "$(...)" --json
+```
+
+**verify**:
+- `next_cursor` opaque token.
+- `corpus_revision` mismatch → `stale_cursor` error.
+
+### §2.6 Search cache (p9-fb-19)
+
+```bash
+"$RELEASE_BIN" search "query" --json # first call
+"$RELEASE_BIN" search "query" --json # cached (in-process LRU, no-op in CLI)
+"$RELEASE_BIN" search "query" --no-cache --json # force fresh
+```
+
+### §2.7 Bulk search
+
+Batch of queries from stdin:
+```bash
+echo -e "query 1\nquery 2\nquery 3" | "$RELEASE_BIN" search --bulk --json
+```
+
+---
+
+## §3 Ask (RAG)
+
+### §3.1 Basic ask (P4)
+
+```bash
+"$RELEASE_BIN" ask --config "$DOGFOOD/config.toml" "어떤 질문" --json
+```
+
+**verify**:
+- `grounded` field (boolean).
+- `citations` list (chunk references).
+- `answer` text.
+
+**scenarios**:
+- 3.1.a in-corpus question (grounded=true).
+- 3.1.b out-of-corpus question (grounded=false + refusal).
+- 3.1.c hallucination check (paraphrase test, fb-41).
+
+### §3.2 Streaming ask (v0.17.1)
+
+```bash
+"$RELEASE_BIN" ask "query" --stream
+```
+
+**verify**:
+- per-event `answer.v1` (delta tokens).
+- final event with `verification` + `citations`.
+
+### §3.3 Multi-hop RAG (fb-41 / v0.18.0)
+
+```bash
+"$RELEASE_BIN" ask "complex question" --multi-hop --json
+```
+
+**verify**:
+- decompose → decide → synthesize loop.
+- honors `multi_hop_max_depth` / `multi_hop_max_sub_queries_per_iter`.
+
+### §3.4 NLI verification (fb-41 / v0.18.0)
+
+```toml
+[models.nli]
+model = "Xenova/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7"
+
+[rag]
+nli_threshold = 0.5
+```
+
+**verify**:
+- `verification.entailment_score`.
+- `refusal_reason = "nli_verification_failed"` (low score).
+
+**scenarios**:
+- 3.4.a known hallucination case (S7 caffeine) → reject with nli_score < threshold.
+- 3.4.b legitimate grounded answer → entailment > 0.5.
+- 3.4.c NLI model unavailable → `refusal_reason = "nli_model_unavailable"`.
+
+### §3.5 Ask filters
+
+```bash
+"$RELEASE_BIN" ask "query" --tag rust --path-glob 'src/**' --lang en
+```
+
+---
+
+## §4 Inspect / Fetch / List
+
+### §4.1 Inspect
+
+```bash
+"$RELEASE_BIN" inspect document --json
+"$RELEASE_BIN" inspect chunk --json
+```
+
+**verify**:
+- `chunk_inspection.v1` schema.
+- `canonical_document.parser_version` / `chunker_version`.
+
+### §4.2 Fetch (p9-fb-35)
+
+Verbatim chunk / doc / span fetch:
+```bash
+"$RELEASE_BIN" fetch --json
+"$RELEASE_BIN" fetch --span 12-34 --json
+```
+
+**verify**:
+- `fetch_result.v1` schema.
+- verbatim text (no chunk wrapping).
+
+### §4.3 List
+
+```bash
+"$RELEASE_BIN" list documents --json
+"$RELEASE_BIN" list chunks --json
+```
+
+---
+
+## §5 Version cascade
+
+design §9 cascade rule:
+- `parser_version` change → all chunks from that parser are invalidated.
+- `chunker_version` change → all chunks from that chunker are invalidated.
+- `embedding_version` change → all embeddings are invalidated.
+
+**verify scenarios**:
+- 5.a parser_version bump (e.g. `pdf-text-v1` → `pdf-text-v2`) → automatic invalidation + the next ingest reprocesses.
+- 5.b chunker_version bump (e.g. `pdf-page-v1` → `pdf-page-v1.1` in v0.20.1) → chunk_id recomputed.
+- 5.c embedding_version bump (e.g. `multilingual-e5-small/v1` → `/v2`) → a separate LanceDB table.
+- 5.d same asset with a different doc_id → `purge_workspace_path_for_parser_bump` cascade.
+- 5.e user-facing UX of force-reingest.
+
+---
+
+## §6 Wire schema (`--json`)
+
+### §6.1 schemas list
+
+`docs/wire-schema/v1/`:
+- `ingest_progress.v1`, `ingest_report.v1`
+- `search_hit.v1`, `search_response.v1`, `bulk_search_item.v1`, `bulk_search_response.v1`
+- `answer.v1`, `answer_event.v1`
+- `chunk_inspection.v1`, `citation.v1`, `doc_summary.v1`
+- `doctor.v1`, `schema.v1`, `fetch_result.v1`, `reset_report.v1`
+- `error.v1`
+
+### §6.2 verify per schema
+
+For each schema:
+```bash
+$RELEASE_BIN --json | jq '.schema_version'
+# expected: ".v1"
+```
+
+JSON schema validity:
+```bash
+jq -e 'has("schema_version")'
+ajv-cli validate -s docs/wire-schema/v1/.schema.json -d
+```
+
+### §6.3 wire backward-compat
+
+**verify**:
+- additive minor (new field / new enum value) → older consumers degrade gracefully.
+- breaking change (field removal / type change) → `v2` major bump.
+
+### §6.4 schema_version cascade
+
+The `wire_schemas` field in the `schema.v1` (`kebab schema --json`) output:
+- 16+ entries of `{name, version, capabilities}`.
+
+---
+
+## §7 TUI (P9)
+
+### §7.1 Launch TUI
+
+```bash
+"$RELEASE_BIN" tui --config "$DOGFOOD/config.toml"
+```
+
+### §7.2 4 panel verify
+
+- **P9-1 Library**: workspace document tree + recent assets.
+- **P9-2 Search**: lexical / vector / hybrid search panel.
+- **P9-3 Ask**: question + answer pane + citations.
+- **P9-4 Inspect**: chunk / document detail.
+
+### §7.3 keyboard shortcuts
+
+- Tab / Shift+Tab: switch panel.
+- Esc: cancel ongoing op.
+- q: quit.
+- Others are panel-specific.
+
+### §7.4 scenarios
+
+- 7.4.a library tree navigation.
+- 7.4.b search query + result selection → fetch.
+- 7.4.c ask question + answer + citation click → inspect.
+- 7.4.d cancel mid-ingest (Esc).
+- 7.4.e quit + restart → state preserved.
+
+---
+
+## §8 MCP (P9-fb-30)
+
+### §8.1 Launch MCP stdio server
+
+```bash
+"$RELEASE_BIN" mcp
+# (invoked by the agent host over stdio)
+```
+
+### §8.2 6 MCP tools
+
+- `search`, `ask`, `schema`, `doctor`, `ingest_file`, `ingest_stdin`.
+
+### §8.3 verify per tool
+
+- input schema (JSON Schema).
+- output schema (wire `*.v1`).
+- error path (graceful, exit code).
+
+---
+
+## §9 Doctor / Schema / Reset
+
+### §9.1 Doctor (health check)
+
+```bash
+"$RELEASE_BIN" doctor --json
+```
+
+**verify**:
+- `doctor.v1` schema.
+- workspace / storage / models / ollama_reachability.
+
+### §9.2 Schema (introspection report)
+
+```bash
+"$RELEASE_BIN" schema --json
+```
+
+**verify**:
+- `schema.v1` schema.
+- wire_schemas / capabilities / model_versions / stats.
+
+### §9.3 Reset
+
+```bash
+"$RELEASE_BIN" reset --yes
+"$RELEASE_BIN" reset --data-only --yes
+"$RELEASE_BIN" reset --vector-only --yes
+```
+
+**verify**:
+- XDG data dirs are wiped (irreversible: requires TTY confirm or `--yes`).
+- macOS path collision avoided (HOTFIXES 2026-05-07).
+- `--data-only` preserves the config.
+
+---
+
+## §10 Eval (P5)
+
+```bash
+"$RELEASE_BIN" eval --config "$DOGFOOD/config.toml"
+```
+
+**verify**:
+- metrics of the golden query suite (MRR / Recall / NDCG).
+- regression detection (snapshot comparison).
+
+---
+
+## §11 Edge cases
+
+### §11.1 Encrypted PDF
+
+- input: thermal-pos-printer.pdf / thermal-label.pdf (user's dogfood corpus).
+- expected: `kind: "Error"` + friendly wording (`qpdf --decrypt` hint).
+
+### §11.2 Corrupt file
+
+- input: truncated bytes / invalid magic.
+- expected: `kind: "Error"` + graceful error.
+
+### §11.3 Empty file
+
+- input: 0-byte file.
+- expected: 0 chunks + warning.
+
+### §11.4 Very large file
+
+- markdown 100 MB+, PDF 100 MB+.
+- expected: walker pass (per file type limit), parser graceful.
+
+### §11.5 env variable overrides
+
+```bash
+KEBAB_PDF_OCR_ENABLED=true \
+  KEBAB_PDF_OCR_MODEL=qwen2.5vl:7b \
+  "$RELEASE_BIN" ingest --config "$DOGFOOD/config.toml"
+```
+
+**verify per env**:
+- `KEBAB_PDF_OCR_*` (11 env, v0.20.0).
+- `KEBAB_IMAGE_OCR_*` (P6).
+- `KEBAB_MODELS_LLM_*`, `KEBAB_MODELS_EMBEDDING_*`.
+- `KEBAB_READONLY` (blocks write-path subcommands).
+
+### §11.6 Cancel handle
+
+- Send SIGINT (Ctrl+C) while ingesting a large PDF (metro-korea 58MB).
+- expected: graceful abort + `IngestEvent::Aborted` + partial counts.
+
+### §11.7 `--readonly` mode
+
+```bash
+KEBAB_READONLY=1 "$RELEASE_BIN" ingest # expected: refuse
+KEBAB_READONLY=1 "$RELEASE_BIN" search "query" # expected: OK
+```
+
+### §11.8 `--quiet` mode
+
+```bash
+"$RELEASE_BIN" ingest --quiet # stderr 0
+"$RELEASE_BIN" ingest --json # implies quiet
+```
+
+### §11.9 `.kebabignore`
+
+```text
+# In workspace root
+node_modules/
+*.tmp
+draft/**
+```
+
+**verify**:
+- ignore patterns are applied exactly.
+- per-directory `.kebabignore` cascading.
+
+### §11.10 Config edge cases
+
+- missing config → fall back to XDG default.
+- malformed TOML → `error.v1` with `config_invalid`.
+- unknown field → tolerant (forward-compat) or strict (TBD).
+- workspace.root path variables (`~`, `${XDG_DATA_HOME}`, relative path) all work.
+
+### §11.11 Concurrent access
+
+- multiple ingests running at once against the same KB → SQLite lock or graceful queue.
+- search during ingest → consistent snapshot.
+
+### §11.12 macOS path collision (HOTFIXES 2026-05-07)
+
+- `config_dir()` ≠ `data_dir()` is guaranteed.
+- automatic migration of the legacy `~/Library/...` path.
+
+---
+
+## §12 Bug discovery checklist
+
+Run this checklist for every new feature or new release:
+
+### §12.1 Pre-flight
+- [ ] release binary build clean.
+- [ ] Ollama endpoint reachable + models in use pulled.
+- [ ] isolated KB workspace set apart.
+- [ ] config.toml = default + minimal customize.
+
+### §12.2 Ingest path
+- [ ] baseline for each media type (markdown / image / pdf / code / Tier2/3).
+- [ ] empty / corrupt / encrypted file.
+- [ ] very large file (size limit verify).
+- [ ] incremental + force-reingest cycle.
+- [ ] env var override.
+- [ ] cancel handle mid-ingest.
+
+### §12.3 Search path
+- [ ] lexical (Korean + English + mixed).
+- [ ] vector (semantic).
+- [ ] hybrid (RRF).
+- [ ] filter (tag/lang/path-glob).
+- [ ] pagination cursor.
+- [ ] bulk search.
+
+### §12.4 Ask path
+- [ ] in-corpus question (grounded=true).
+- [ ] out-of-corpus question (grounded=false).
+- [ ] streaming (`--stream`).
+- [ ] multi-hop (`--multi-hop`).
+- [ ] NLI verification (known hallucination case).
+
+### §12.5 Surface verify
+- [ ] CLI flags (`--help` + actual behavior of each subcommand).
+- [ ] TUI panel + keyboard shortcuts.
+- [ ] MCP stdio (each tool).
+- [ ] Wire schema (schema_version + jq validity in `--json` mode).
+
+### §12.6 Version cascade
+- [ ] parser_version bump → automatic invalidation.
+- [ ] chunker_version bump → chunk_id recomputed.
+- [ ] embedding_version bump → separate LanceDB table.
+
+### §12.7 Edge cases
+- [ ] env override (each KEBAB_* env).
+- [ ] readonly mode.
+- [ ] quiet mode.
+- [ ] .kebabignore.
+- [ ] config edge cases.
+- [ ] concurrent access.
+
+### §12.8 Doc + Wire schema consistency
+- [ ] the user-visible surface matches across README + HANDOFF + ARCHITECTURE.
+- [ ] the fields the wire schema actually emits match schema.json.
+- [ ] the `code` values of error.v1 match the actual surface.
+
+### §12.9 Bug discovery → immediate fix cycle
+
+User directive: "No matter how small it is, once something is found, move straight on to the work of fixing it."
+
+When a bug is found:
+1. **immediate stop dogfood** (abort the current scenario).
+2. **bug log**: record the following fields in `.omc/reviews/--dogfood-report.md`:
+   - file:line of root cause.
+   - reproduction command.
+   - expected vs actual.
+   - severity (Critical / Important / Minor).
+3. **spec/plan/executor cycle** (size-adapted):
+   - **small bug** (1-line fix / wording typo / config field rename): simplified spec (1 page) + plan (1-2 step) + executor.
+   - **large bug** (cross-crate / wire schema / new invariant): full spec/plan/executor cycle (on the scale of the earlier v0.20-sub1-bugfix).
+4.
+   **fix verify**: re-check workspace test + clippy + the dogfood scenario.
+5. **dogfood resume** from where it left off.
+
+---
+
+## §13 Reference dogfood corpus
+
+### §13.1 PDF (9 file, PoC + sub-item 1)
+
+| # | File | Size | Source | Use case |
+|---|------|------|--------|----------|
+| 1 | scanned_page1.pdf (F1) | 466 KB | `crates/kebab-parse-pdf/tests/fixtures/` | scanned Korean, general (OCR alnum > 85%) |
+| 2 | scanned_page2.pdf (F2) | 773 KB | `crates/kebab-parse-pdf/tests/fixtures/` | scanned Korean, batchim (final consonant) intensive (OCR alnum > 70%) |
+| 3 | metro-korea.pdf | 58 MB | `/build/cache/pdf-ocr-poc/fixtures/` | real-world newspaper, multi-page vector PDF |
+| 4 | mojibake.pdf (F4) | 23 KB | `crates/kebab-parse-pdf/tests/fixtures/` | vector PDF (Latin + no ToUnicode CMap) |
+| 5 | flate_raw.pdf (F6) | 872 B | `crates/kebab-parse-pdf/tests/fixtures/` | FlateDecode skip path |
+| 6 | ccitt.pdf (F7) | 2 KB | `crates/kebab-parse-pdf/tests/fixtures/` | CCITTFax skip path |
+| 7 | thermal-pos-printer.pdf | 1.1 MB | `~/paperboy/` | ENG manual PDF (encrypted) |
+| 8 | thermal-label.pdf | 2.7 MB | `~/paperboy/` | ENG manual PDF (encrypted) |
+| 9 | internals-presentation.pdf | 820 KB | `~/namu-crawler/docs/` | slide deck PDF (vector text) |
+
+### §13.2 PoC ground-truth (alnum e2e)
+
+`/build/cache/pdf-ocr-poc/ground-truth/`:
+- `page1.txt` (1489 byte): F1 ground-truth.
+- `page2-batchim.txt` (3282 byte): F2 ground-truth (batchim intensive).
+
+### §13.3 Markdown / code corpus
+
+(To be added per sub-item later: `~/Documents/notes/` or separately prepared fixtures.)
+
+### §13.4 Image corpus
+
+(P6 dogfood: to be added later.)
+
+---
+
+## Appendix A: update policy for this doc
+
+- When a new release / sub-item is merged, update §1 (Ingest) or §2-§9 (the relevant section).
+- Scenarios in each new dogfood report (`.omc/reviews/--dogfood-report.md`) align with the standard scenarios in this doc.
+- When new scenarios are added to this doc itself, update the §12 checklist in the same PR.
+- When a deviation is found in HOTFIXES.md, strengthen the verify steps of the relevant section in this doc.
+
+## Appendix B: dogfood verification by reviewers
+
+During PR review, the reviewer consults the §12 checklist of this doc:
+- Check that the changed surface of the new feature (CLI flag / wire schema / config field) is covered by the scenarios in §1-§9.
+- When a new bug is discovered, apply the immediate fix cycle of §12.9.
diff --git a/docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md b/docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md
new file mode 100644
index 0000000..697aec6
--- /dev/null
+++ b/docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md
@@ -0,0 +1,849 @@
+---
+title: v0.20.0 sub-item 1 bugfix: implementation plan
+created: 2026-05-27
+status: ACCEPT (round 2 closure, Phase B complete)
+target_version: 0.20.0 (PR #189 force-update)
+spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md
+contract_sections: ["§9 (chunker_version cascade)"]
+parent_plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md
+review_history:
+  - 2026-05-27 plan round 0 (opus, drafter): 5 step group A-E, 18 sub-action
+  - 2026-05-27 plan round 1 critic (opus, thorough): NEEDS_DISCUSSION, HIGH 1 + MEDIUM 2 + LOW 3 + NIT 1 (7 finding)
+  - 2026-05-27 plan round 1 verifier (opus, thorough): NEEDS_DISCUSSION, HIGH 4 + MEDIUM 3 + LOW 4 + NIT 3 (14 finding)
+  - 2026-05-27 plan round 1c rewrite (opus, drafter): all 21 findings applied (critic 7 + verifier 14). detail = §8 round 1c rewrite changelog
+  - 2026-05-27 plan round 2 closure critic (opus): ACCEPT, 21/21 applied + 4 NIT cosmetic
+---
+
+# v0.20.0 sub-item 1 bugfix plan
+
+> Step decomposition of the ACCEPTed spec (`docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md`, 965 lines, round 2 closure). It covers the force-update path for 3 bugs (#2 walker code limit / #3 chunk_id collision, Critical / #4 F4 fixture Pages tree): fix commits are stacked on top of the PR #189 base branch `feat/pdf-scanned-ocr`. **5 steps (Group A-E), 18 sub-actions, 4 commits + 1 verify-only step (= 5 steps total, 4 commit boundaries).** The 16-row acceptance in spec §6 maps 1:1 to the §4 verifier checklist of this plan.
+
+## §0 Pre-flight + branch state
+
+- **Branch**: `feat/pdf-scanned-ocr` (PR #189 base, HEAD = `b4d9e60` "chore(release): bump version 0.19.0 → 0.20.0"). Per user memory `feedback_pr_workflow`, this follows the force-update path: add fix commits on the same branch + force-push PR #189.
+- **Working dir**: `/home/altair823/kebab`.
+- **Enforced env** (`~/.claude/CLAUDE.md` "Disk Layout: protecting the root disk comes first"):
+  - `export CARGO_TARGET_DIR=/build/out/cargo-target/target`: isolates builds on the dedicated XFS 4TB disk and prevents a `target/` directory at the repo root.
+  - `export RELEASE_BIN="${CARGO_TARGET_DIR:-target}/release/kebab"`: release binary alias (used by every acceptance command in the Step 5 dogfood).
+  - `export TMPDIR=/build/cache/tmp`: protection for large temporary files.
+- **Serialized cargo builds** (memory `feedback_serial_build_only`):
+  - per-crate: `-j 4` default (e.g. `cargo test -p kebab-chunk -j 4`).
+  - workspace: `-j 1` enforced (`cargo test --workspace -j 1`, `cargo clippy --workspace -j 1`; linking 18 integration-test binaries at once causes OOM).
+- **`target/` clean policy** (memory `feedback_cargo_clean_policy`): `/build` is a separate XFS 4TB disk, so routine cleaning is forbidden. Clean only when `df -h /build` shows `Avail < 500G` OR `du -sh $CARGO_TARGET_DIR` > 200G. Run this conditional check once, right before the first cargo invocation of Step 5 E1; if the thresholds are not reached, skip and record one line, "skipped cargo clean - /build avail X TB", in the commit body.
+- **dogfood KB layout assumption** (Step 5 E3 prerequisite, critic round 1 H-1 closure): canonical config path = `/build/cache/tmp/v0.20-dogfood/config.toml` (in-place, inside the KB dir). The external backup file `/build/cache/tmp/v0.20-dogfood-config.toml` **does not exist**, so every acceptance command in this plan is based on the in-place config. The KB clean in Step 5 E3 **must not use a destructive `rm -rf`**; it uses a selective clean that preserves the config (see the E3 detail). The canonical dogfood config path is defined only here in §0, and the Step 5 E3 commands refer to this path.
+- **Impact on HOTFIXES.md / README / HANDOFF / ARCHITECTURE**: Step 2 B4 adds a HOTFIXES.md entry (Bug #3 second-iteration patch).
+  Beyond that, there are 0 user-visible surface changes, so README + HANDOFF + ARCHITECTURE need 0 updates (0 CLI flag / wire schema / TUI key / config additions; the chunker_version bump is an internal cascade, so it goes in the release notes only).
+- **0 wire schema changes**: 0 fields added to `ingest_progress.v1` + `ingest_report.v1`. 0 V00X migrations. `chunks` table DDL unchanged.
+- **0 frozen design contract changes**: the design §9 cascade rule itself is unchanged (only chunker_version is bumped, as a direct application of the rule).
+- **0 workspace version bumps**: v0.20.0 is already cut (commit `b4d9e60`). This plan is a cumulative bugfix within the same v0.20.0 (PR #189 force-update). Step 5 E5 only force-pushes the PR; the release tag is not re-cut.
+
+## §1 Plan overview + spec linkage
+
+This plan decomposes the fix designs in spec §3 (Bug #2) + §4 (Bug #3) + §5 (Bug #4) into atomic steps. Core sequencing:
+
+1. **Bug #2 walker code limit fix** (Step 1): `is_code_file` helper + walker conditional + unit test. Apply the spec §3.4 + §3.5 diffs verbatim. 1 commit.
+2. **Bug #3 chunk_id fix + chunker_version bump** (Step 2): extend the `chunk_page` return tuple to a 4-tuple + change the caller's `per_chunk_hash` suffix to `segment_start` + bump `VERSION_LABEL` `"pdf-page-v1"` → `"pdf-page-v1.1"` + update the module doc + HOTFIXES.md entry + unit regression test. Apply the spec §4.4 + §4.4.1 + §4.5 diffs verbatim. 1 commit.
+3. **Bug #3 integration test** (Step 3): a multi-scanned PDF chunk_id collision-free integration test under `crates/kebab-app/tests/`. Deciding the MockOcrEngine pre-condition of spec §4.5.1 (Option A share or Option B inline) is the executor's first sub-action. 1 commit.
+4. **Bug #4 F4 fixture re-generation** (Step 4): pikepdf-based rewrite of `tests/fixtures/_synth/mojibake.py` + regenerate the F4 fixture binary + 3 new invariant tests in parse-pdf. Apply the spec §5.4 + §5.5 + §5.6 diffs verbatim. 1 commit.
+5. **Workspace verify + commit + PR force-push** (Step 5): cargo workspace test `-j 1` + clippy `-D warnings` + dogfood re-run (`/build/cache/tmp/v0.20-dogfood` isolated KB, qwen2.5vl:3b on the Ollama endpoint `192.168.0.47:11434`) + PR #189 force-push.
+   The spec §6 16-row consolidated acceptance is this step's verifier checklist.
+
+ordering invariant:
+
+- **Step 1 || Step 2 || Step 4 mutually independent**: the three bug fixes are confined to file paths in different crates (`kebab-source-fs` / `kebab-chunk` / `tests/fixtures` + `kebab-parse-pdf`), so they could proceed in parallel. Consistency takes priority, so they run sequentially: Step 1 → Step 2 → Step 4.
+- **Step 2 < Step 3**: the integration test depends on the fixed chunk_id computation path in `kebab-chunk`. Step 2 being GREEN is the prerequisite.
+- **Step 4 < Step 5 dogfood**: the binary produced by the F4 fixture regeneration is 1 of the 9 dogfood PDFs (mojibake), so it is a prerequisite for verifying the `block_count: 1` invariant in the Step 5 E3 dogfood.
+- **Step 1-4 all < Step 5 workspace test**: the workspace-wide test is meaningful only on the final state of production code + tests.
+
+The commit unit is 1 commit per logical group (atomic), following the 5-commit table of the §7 sequencing summary. Per user memory `feedback_pr_workflow` (gitea-pr + review loop), enter the review loop of the `gitea-pr-review` skill after the force-update.
+
+---
+
+## §2 Step group structure (Group A-E)
+
+| Step | Group | Category | sub-action |
+|---:|---|---|---|
+| 1 | A | Bug #2 walker code limit fix | A1 `is_code_file` helper + A2 walker conditional + A3 unit test |
+| 2 | B | Bug #3 chunk_id collision fix + chunker_version bump | B1 `chunk_page` 4-tuple + B2 caller `per_chunk_hash` + B3 `VERSION_LABEL` bump + B4 module doc + HOTFIXES.md + B5 unit regression test |
+| 3 | C | Bug #3 multi-scanned PDF integration test | C1 MockOcrEngine share decision + C2 integration test (conditional) |
+| 4 | D | Bug #4 F4 fixture re-generation | D1 mojibake.py pikepdf rewrite + D2 fixture regenerate + commit + D3 parse-pdf 3 invariant test |
+| 5 | E | Workspace verify + commit + PR force-push | E1 cargo workspace test -j 1 + E2 clippy -D warnings + E3 dogfood re-run + E4 commit + E5 PR #189 force-push |
+
+---
+
+## §3 Per-step detail
+
+### Step 1 (Group A): Bug #2 walker code limit fix
+
+Option A of spec §3 (code path only): gate the `is_oversized` call behind an `is_code_file(path)` conditional.
PDF/image/markdown 의 size 는 parser 단계 자체 검증 (lopdf load_mem 256 KB+ 정상, image OCR 의 max_pixels self-cap). + +#### Sub-action A1 — `is_code_file` helper 추가 + +- **Files affected**: `crates/kebab-source-fs/src/code_meta.rs` (line 129 `is_oversized` 함수 직후, 또는 `code_lang_for_path` 정의 직후). +- **Action** (spec §3.4 diff 그대로): + ```rust + /// Returns true when `path`'s filename/extension is recognised as a code + /// file (per `code_lang_for_path`). Used by the walker to apply + /// `[ingest.code].max_file_bytes` / `max_file_lines` only to code files, + /// not to PDF/image/markdown (which have their own size controls in + /// their respective parsers). + pub(crate) fn is_code_file(path: &Path) -> bool { + code_lang_for_path(path).is_some() + } + ``` +- **Acceptance**: + - `grep -c "pub(crate) fn is_code_file" crates/kebab-source-fs/src/code_meta.rs` = **1**. + - `cargo build -p kebab-source-fs -j 4` green. + +#### Sub-action A2 — walker conditional size check + +- **Files affected**: `crates/kebab-source-fs/src/connector.rs:168-190` (현재 verified line range). +- **Action** (spec §3.4 diff 그대로 — `is_oversized` 호출 앞에 `is_code_file` short-circuit): + ```diff + - // Size-cap check (byte or line limit). + - if crate::code_meta::is_oversized( + - &abs_path, + - self.max_file_bytes, + - self.max_file_lines, + - ) + - .unwrap_or(false) + + // v0.20.0 sub-item 1 bugfix (#2): size-cap applies ONLY to + + // code files. PDF/image/markdown bypass — their parsers + + // have their own size controls. spec §3.3. + + if crate::code_meta::is_code_file(&abs_path) + + && crate::code_meta::is_oversized( + + &abs_path, + + self.max_file_bytes, + + self.max_file_lines, + + ) + + .unwrap_or(false) + { + fs_skips.skipped_size_exceeded = + fs_skips.skipped_size_exceeded.saturating_add(1); + ... 
+ tracing::debug!( + path = %rel_path.display(), + max_bytes = self.max_file_bytes, + max_lines = self.max_file_lines, + - "skip: file exceeds size cap" + + "skip: code file exceeds size cap" + ); + continue; + } + ``` +- **Acceptance**: + - `grep -nE "is_code_file\(&abs_path\)\s*$" crates/kebab-source-fs/src/connector.rs` ≥ **1**. + - `grep -c "skip: code file exceeds size cap" crates/kebab-source-fs/src/connector.rs` ≥ **1**. + - `cargo build -p kebab-source-fs -j 4` green. + +#### Sub-action A3 — Bug #2 unit test 추가 + +- **Files affected**: `crates/kebab-source-fs/src/connector.rs` 의 기존 `#[cfg(test)] mod tests` (spec §3.5 "기존 test module 에 추가" 명시 — 새 file 아님). +- **Action** (spec §3.5 의 `size_cap_skips_only_code_files` test body 그대로): + - 300 KB PDF / 300 KB markdown / 300 KB `big.rs` (3 file) tempdir 합성. + - `FsSourceConnector` (`max_file_bytes = 262_144`, `max_file_lines = 5_000`) 의 `scan_with_skips(&SourceScope::default())`. + - assertions: + - `paths.contains("paper.pdf")` (PDF walker pass). + - `paths.contains("notes.md")` (Markdown walker pass). + - `!paths.contains("big.rs")` (code file walker skip). + - `skips.skip_examples.size_exceeded` 안 `big.rs` 1 entry, `paper.pdf` 0 entry. + - cfg helper: 기존 test module 의 `cfg_with_size_cap(root, max_bytes, max_lines)` 패턴 재사용 (필요 시 helper 추가). +- **Acceptance**: + - `cargo test -p kebab-source-fs size_cap_skips_only_code_files -j 4` green. + - 기존 `ingest_report_counts_oversized_files_by_bytes` (fixture `huge.rs`) + `ingest_report_size_cap_by_line_count` (fixture `longfile.rs`) regression 0 — fixture 명이 `.rs` 라 새 conditional 통과 (invariant preserved). + - `cargo test -p kebab-source-fs -j 4` 전체 green. 
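The walker gate from Sub-actions A1 and A2 can be sketched in isolation. This is a minimal illustration under stated assumptions, not the crate's code: the extension table inside `code_lang_for_path` is hypothetical, `is_oversized` is reduced to a byte comparison (the real helper also checks line count and returns a `Result`), and `should_skip` is a name introduced only for this example.

```rust
use std::path::Path;

/// Hypothetical stand-in for `code_lang_for_path`; the real extension
/// table is not shown in this plan.
fn code_lang_for_path(path: &Path) -> Option<&'static str> {
    match path.extension()?.to_str()? {
        "rs" => Some("rust"),
        "py" => Some("python"),
        "go" => Some("go"),
        _ => None,
    }
}

/// Same shape as the helper added in A1.
fn is_code_file(path: &Path) -> bool {
    code_lang_for_path(path).is_some()
}

/// Simplified assumption: byte size only.
fn is_oversized(size_bytes: u64, max_file_bytes: u64) -> bool {
    size_bytes > max_file_bytes
}

/// The A2 conditional: the size cap short-circuits for non-code files.
fn should_skip(path: &Path, size_bytes: u64, max_file_bytes: u64) -> bool {
    is_code_file(path) && is_oversized(size_bytes, max_file_bytes)
}

fn main() {
    let cap = 262_144; // max_file_bytes used by the A3 unit test
    let size = 300 * 1024; // 300 KB, above the cap
    for name in ["paper.pdf", "notes.md", "big.rs"] {
        println!("{name}: skip = {}", should_skip(Path::new(name), size, cap));
    }
}
```

Putting `is_code_file` first keeps the non-code path free of any file-size probing, which matches the short-circuit order in the A2 diff.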
+ +#### Commit (Step 1 전체) + +``` +fix(source-fs): apply size limit only to code files; PDF/image/markdown bypass walker cap (Bug #2) + +- crates/kebab-source-fs/src/code_meta.rs: add pub(crate) fn is_code_file +- crates/kebab-source-fs/src/connector.rs: walker conditional `is_code_file && is_oversized` +- crates/kebab-source-fs/src/connector.rs mod tests: size_cap_skips_only_code_files unit test +- spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md §3 +``` + +--- + +### Step 2 (Group B): Bug #3 chunk_id collision fix + chunker_version bump + +spec §4.3 의 Option A (segment boundary `start` 를 `per_chunk_hash` suffix 로). `chunk_page` return tuple 을 3-tuple `(actual_start, chunk_end, slice)` → 4-tuple `(segment_start, actual_start, chunk_end, slice)` 로 확장 + caller `per_chunk_hash` suffix 를 `segment_start` 로 변경. `VERSION_LABEL` `"pdf-page-v1"` → `"pdf-page-v1.1"` bump (spec §4.4.1 round 1c M-1 decision — explicit cascade audit trail). + +#### Sub-action B1 — `chunk_page` 4-tuple expansion + +- **Files affected**: `crates/kebab-chunk/src/pdf_page_v1.rs:200-204` (doc comment) + `:205-289` (signature line 205 → closing `}` line 289). 본 critic round 1 + verifier round 1 의 actual probe 결과 정정 (L-1). +- **Action** (spec §4.4 diff 그대로): + - doc comment 갱신 — `(char_start, char_end, text_slice)` → `(segment_start, actual_start, chunk_end, text_slice)`: + ```rust + /// Split a single page's text into ordered chunks, each represented as + /// `(segment_start, actual_start, chunk_end, text_slice)`. + /// + /// - `segment_start` = pre-overlap segment boundary. Strictly increasing + /// across the returned vec. Use this for chunk_id uniqueness suffixes. + /// - `actual_start` = post-overlap start char index. May collapse to a + /// previous chunk's `actual_start` under aggressive overlap policy. + /// Use this for `SourceSpan::Page::char_start`. + /// - `chunk_end` = chunk's end char index (exclusive). 
+ fn chunk_page(text: &str, target_bytes: usize, overlap_bytes: usize) + -> Vec<(usize, usize, usize, String)> + ``` + - early return: `vec![(0, n, text.to_string())]` → `vec![(0, 0, n, text.to_string())]`. + - loop body 의 push: `chunks.push((actual_start, chunk_end, slice))` → `chunks.push((start, actual_start, chunk_end, slice))`. (`start = bounds[seg_idx]` 는 이미 local var 로 존재 — line 245.) + - overlap walk 의 `let prev_min = prev.0` 가 기존 tuple 의 첫 field = post-fix tuple shape 에서는 `prev.1` (actual_start) — spec §4.4 의 invariant 보존 위해 변경: + ```diff + - let actual_start = if let Some(prev) = chunks.last() { + - let prev_min = prev.0; + + let actual_start = if let Some(prev) = chunks.last() { + + // prev tuple shape = (segment_start, actual_start, chunk_end, slice). + + // overlap walk floor = previous chunk's actual_start (prev.1). + + let prev_min = prev.1; + ... + ``` +- **Acceptance**: + - `grep -nE "fn chunk_page.*-> Vec<\(usize, usize, usize, String\)>" crates/kebab-chunk/src/pdf_page_v1.rs` = **1**. + - `grep -c "let prev_min = prev.1" crates/kebab-chunk/src/pdf_page_v1.rs` ≥ **1**. + - `cargo build -p kebab-chunk -j 4` green (caller B2 sub-action 동시 적용 후 red 해소). + +#### Sub-action B2 — caller `per_chunk_hash` suffix → `segment_start` + +- **Files affected**: `crates/kebab-chunk/src/pdf_page_v1.rs:149-186` (현재 verified — `chunk` method 의 `for (...) in chunk_page(...)` loop start line 149 → loop end line 186, verifier round 1 L-2 정정). +- **Action** (spec §4.4 diff 그대로): + ```diff + - for (char_start, char_end, slice) in + - chunk_page(&p.text, target_bytes, overlap_bytes) + + for (segment_start, char_start, char_end, slice) in + + chunk_page(&p.text, target_bytes, overlap_bytes) + { + ... 
+ let span = SourceSpan::Page { + page: page_num, + char_start: Some(char_start_u32), + char_end: Some(char_end_u32), + }; + let block_ids: Vec = vec![p.common.block_id.clone()]; + - // Per-chunk policy_hash variant prevents chunk_id + - // collision when a page produces multiple chunks. See + - // module docs for rationale. + - let per_chunk_hash = format!("{base_policy_hash}#c{char_start}"); + + // v0.20.0 sub-item 1 bugfix (#3): per-chunk policy_hash + + // variant uses `segment_start` (pre-overlap boundary, + + // strictly increasing) instead of `char_start` (post- + + // overlap, may collapse to prev_min). See module docs + + + // spec §4.1 root cause + HOTFIXES.md 2026-05-27. + + let per_chunk_hash = format!("{base_policy_hash}#c{segment_start}"); + let chunk_id = + id_for_chunk(&doc.doc_id, &chunker_version, &block_ids, &per_chunk_hash); + ... + } + ``` + - `SourceSpan::Page.char_start` 는 여전히 post-overlap `char_start` (= `actual_start`) 보존 — citation locality semantic 유지. +- **Acceptance** (verifier round 1 M-2: B2+B4 가 같은 logical commit 안 → grep 시점 = Step 2 commit time, 즉 post-B4): + - `grep -c "#c{segment_start}" crates/kebab-chunk/src/pdf_page_v1.rs` ≥ **1** (B2 단독 적용 시 = 1 call site; B4 module doc 적용 후 = 2 — B4 acceptance 가 ≥ 2 검증). + - `grep -c "#c{char_start}" crates/kebab-chunk/src/pdf_page_v1.rs` = **0** (call site + module doc 모두 segment_start 로 교체 — B2+B4 의 same-commit consolidated invariant). + - sub-action-by-sub-action 분리 검증 시 B2 단독 grep `#c{char_start}` 는 module doc line 56 의 literal 잔존으로 ≥ 1 — Step 2 commit boundary 도달 후 = 0 으로 확정. + +#### Sub-action B3 — `VERSION_LABEL` bump `"pdf-page-v1"` → `"pdf-page-v1.1"` + hardcoded literal 2 site 갱신 + +- **Files affected** (verifier round 1 H-1 의 actual probe `grep -rn '"pdf-page-v1"' crates/ --include='*.rs'` 결과 2 site enumerate): + - `crates/kebab-chunk/src/pdf_page_v1.rs:67` (현재 verified — `const VERSION_LABEL: &str = "pdf-page-v1";`). 
+  - `crates/kebab-app/tests/pdf_pipeline.rs:168` (currently verified: `assert_eq!(pdf_item.chunker_version.as_ref().map(|c| c.0.as_str()), Some("pdf-page-v1"))` is a hard assertion and fails after the v1.1 bump).
+  - `crates/kebab-app/tests/pdf_pipeline.rs:368` (currently verified: error message string literal `"pdf-page-v1 emits 0 chunks for the empty page; total = 2"`; not a hard assertion, updated only to avoid a stale message).
+- **Action** (decided in spec §4.4.1):
+  - **(a) primary const bump** (`crates/kebab-chunk/src/pdf_page_v1.rs:67`):
+    ```diff
+    -const VERSION_LABEL: &str = "pdf-page-v1";
+    +const VERSION_LABEL: &str = "pdf-page-v1.1";
+    ```
+    The existing test `chunker_version_is_pdf_page_v1` (pdf_page_v1.rs:374) asserts against the `VERSION_LABEL` const, so it follows the bump automatically and needs no test code change.
+  - **(b) test assertion literal update** (`crates/kebab-app/tests/pdf_pipeline.rs:168`, required):
+    ```diff
+    -        Some("pdf-page-v1")
+    +        Some("pdf-page-v1.1")
+    ```
+  - **(c) test error message literal update** (`crates/kebab-app/tests/pdf_pipeline.rs:368`, recommended):
+    ```diff
+    -        "pdf-page-v1 emits 0 chunks for the empty page; total = 2"
+    +        "pdf-page-v1.1 emits 0 chunks for the empty page; total = 2"
+    ```
+- **Acceptance**:
+  - `grep -nE 'const VERSION_LABEL: &str = "pdf-page-v[0-9.]+";' crates/kebab-chunk/src/pdf_page_v1.rs` reports `"pdf-page-v1.1"`.
+  - `cargo test -p kebab-chunk chunker_version_is_pdf_page_v1 -j 4` green (passes automatically because it references `VERSION_LABEL`).
+  - `grep -rn '"pdf-page-v1"' crates/ --include='*.rs' | grep -v 'pdf-page-v1\.1'` returns **0** lines. The `grep -v` filter guards against false positives by excluding any line that carries the `pdf-page-v1.1` form; zero lines after the filter means no stale literal remains.
+  - `cargo test -p kebab-app pdf_pipeline -j 4` green (after the line 168 assertion update).
+
+#### Sub-action B4 — module doc update + HOTFIXES.md entry
+
+- **Files affected**:
+  - `crates/kebab-chunk/src/pdf_page_v1.rs:47-60` (currently verified: the module doc paragraph `## chunk_id collision deviation`).
+ - `tasks/HOTFIXES.md` (new dated entry append, 기존 entry 위치 — file 의 latest entry 가 `2026-05-26` 이므로 그 위에 `2026-05-27 — v0.20.0 sub-item 1` entry insert; 본 file 의 chronological pattern 따름). +- **Action**: + - **(a) module doc** — spec §4.4 의 갱신본 그대로: + ```diff + -//! Workaround that doesn't change the §4.2 recipe: feed a per-chunk + -//! variant `format!("{base_policy_hash}#c{char_start}")` into the + -//! recipe's `policy_hash` slot (so distinct chunks distinguish via + -//! different policy_hash inputs), while storing the unmodified + -//! `base_policy_hash` in `Chunk.policy_hash` so the field still answers + -//! "what policy was active". Logged in `tasks/HOTFIXES.md`. + +//! Workaround that doesn't change the §4.2 recipe: feed a per-chunk + +//! variant `format!("{base_policy_hash}#c{segment_start}")` into the + +//! recipe's `policy_hash` slot. `segment_start` is the pre-overlap + +//! segment boundary, strictly increasing across the returned chunks + +//! even when the overlap walk collapses `actual_start` to a previous + +//! chunk's `prev_min`. Unmodified `base_policy_hash` is stored in + +//! `Chunk.policy_hash` so the field still answers "what policy was + +//! active". v1.1 second-iteration patch — logged in + +//! `tasks/HOTFIXES.md` (2026-05-27). + ``` + - **(b) HOTFIXES.md entry** (spec §4.4 의 entry body 그대로): + ```markdown + ## 2026-05-27 — v0.20.0 sub-item 1: chunk_id `#c{char_start}` workaround collapses under aggressive overlap (Bug #3 second-iteration patch) + + **Symptom**: F2 (1580 chars OCR, scanned_page2.pdf) ingest 시 + `DocumentStore::put_chunks (pdf): sqlite error: UNIQUE constraint + failed: chunks.chunk_id: ... Error code 1555: A PRIMARY KEY constraint + failed`. `kebab v0.20.0` (commit `b4d9e60`) dogfood (qwen2.5vl:3b 의 + `192.168.0.47:11434` Ollama endpoint, `/build/cache/tmp/v0.20-dogfood` + isolated KB) `--force-reingest` 마다 reproducible. 
+ + **Root cause**: `crates/kebab-chunk/src/pdf_page_v1.rs:170` 의 + `per_chunk_hash = format!("{base_policy_hash}#c{char_start}")` 에서 + `char_start` = post-overlap `actual_start`. line 266-281 의 overlap + walk 가 `prev_min` floor 까지만 back-walk 하므로 aggressive overlap + + 첫 segment 가 작은 page (F2 의 한국어 OCR text: 첫 ~10 char 안 + sentence-end → segment_1 = [0, 30], segment_2 = [30, n], overlap_bytes + 240 / chars=80 → segment_2 의 actual_start 가 prev_min=0 으로 + collapse) → 두 chunk 의 `#c0` suffix identical → identical chunk_id → + `chunks` PRIMARY KEY violation. + + **Fix** (spec §4.4): `chunk_page` return tuple 에 `segment_start` + 추가 (3-tuple → 4-tuple `(segment_start, actual_start, chunk_end, + slice)`), caller `per_chunk_hash` 의 suffix 를 `segment_start` 로 + 변경. `segment_start` 는 `bounds[seg_idx]` (dedup 후 strictly + increasing) — overlap walk 와 무관하게 모든 chunk distinct. citation + locality 의 `SourceSpan::Page.char_start` 는 여전히 post-overlap + `actual_start` 유지. + + **chunker_version cascade**: `pdf-page-v1` → `pdf-page-v1.1` bump + (spec §4.4.1 round 1c M-1 결정, design §9 cascade rule 의 직접 적용). + multi-chunk PDF page (pre-OCR 시점 `metro-korea.pdf` 의 21 block / + 34 chunk 같은 정상 path) 의 chunk_id 가 변경 — explicit user-facing + audit trail 확보, store layer 의 자동 invalidation report. v0.20.0 + force-update path 라 사용자 cost zero (어차피 fresh ingest). + + **Amends**: spec `docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md` + §4.4. parent design §4.2 chunk_id recipe 자체 unchanged (workaround + layer 의 internal computation 만 변경). parent PR #189 + (`feat/pdf-scanned-ocr`, force-update path). + ``` +- **Acceptance**: + - `grep -c "#c{segment_start}" crates/kebab-chunk/src/pdf_page_v1.rs` ≥ **2** (module doc + line 170 의 actual call). + - `grep -c "2026-05-27 — v0.20.0 sub-item 1: chunk_id" tasks/HOTFIXES.md` = **1**. 
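The root cause recorded in the HOTFIXES entry can be reproduced with a small model. This is a simplified sketch, not the `chunk_page` implementation: `boundaries` reduces the overlap walk to "step back by `overlap_chars`, but never before the previous chunk's `actual_start`", and the names `boundaries`, `suffixes`, and `unique_count` are introduced only for illustration. The real walk also respects character boundaries and byte budgets.

```rust
use std::collections::HashSet;

/// Returns `(segment_start, actual_start)` per chunk for the given
/// pre-overlap segment boundaries (simplified model of the overlap walk).
fn boundaries(bounds: &[usize], overlap_chars: usize) -> Vec<(usize, usize)> {
    let mut out: Vec<(usize, usize)> = Vec::new();
    for &segment_start in bounds {
        let actual_start = match out.last() {
            // Walk back by the overlap, floored at the previous chunk's
            // actual_start (the `prev_min` of the plan).
            Some(&(_, prev_min)) => segment_start.saturating_sub(overlap_chars).max(prev_min),
            None => segment_start,
        };
        out.push((segment_start, actual_start));
    }
    out
}

/// Builds the per-chunk policy hash variants `"{base}#c{start}"`.
fn suffixes(base: &str, starts: impl Iterator<Item = usize>) -> Vec<String> {
    starts.map(|s| format!("{base}#c{s}")).collect()
}

fn unique_count(ids: &[String]) -> usize {
    ids.iter().collect::<HashSet<_>>().len()
}

fn main() {
    // Shape from the HOTFIXES entry: segments start at 0 and 30, overlap = 80 chars.
    let b = boundaries(&[0, 30], 80);
    let old = suffixes("policy", b.iter().map(|&(_, actual)| actual));
    let new = suffixes("policy", b.iter().map(|&(segment, _)| segment));
    println!("actual_start suffixes:  {old:?} ({} unique)", unique_count(&old));
    println!("segment_start suffixes: {new:?} ({} unique)", unique_count(&new));
}
```

Because the second chunk's `actual_start` collapses to the floor of 0, both chunks receive the same suffix under the old scheme, while `segment_start` stays strictly increasing as long as the boundaries are deduplicated.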
+ +#### Sub-action B5 — unit regression test `multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids` + +- **Files affected**: `crates/kebab-chunk/src/pdf_page_v1.rs` 의 `#[cfg(test)] mod tests` (현재 verified — `make_pdf_doc(&[&str])` + `default_policy(target, overlap)` helper 이미 존재, line 300-371). +- **Action** (spec §4.5 의 test body 그대로): + ```rust + #[test] + fn multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids() { + // 한국어 OCR text 의 trigger shape: 10 char "가" + ". " + 500 char "나". + // → first segment [0, 12), second segment [12, n). + // page_text byte_len = 10*3 + 2 + 500*3 = 1532 > target_bytes=1500 + // → multi-chunk. overlap_bytes = min(240, 750) = 240 chars=80 + // → second chunk 의 actual_start 가 prev_min=0 collapse → same `#c0`. + // + // default_policy(500, 80) — target_tokens=500 → target_bytes=500*3=1500 + // (한국어 3byte/char 환산), overlap_tokens=80 → overlap_bytes=min(240, 750)=240. + // verifier round 1 L-3 보강. + let early_seg: String = std::iter::repeat('가').take(10).collect(); + let tail: String = std::iter::repeat('나').take(500).collect(); + let page_text = format!("{early_seg}. {tail}"); + + let doc = make_pdf_doc(&[&page_text]); + let policy = default_policy(500, 80); // target=1500 byte, overlap=240 byte + let chunks = PdfPageV1Chunker.chunk(&doc, &policy).unwrap(); + + assert!( + chunks.len() >= 2, + "expected ≥2 chunks for {} byte page; got {}", + page_text.len(), + chunks.len() + ); + + let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect(); + ids.sort_unstable(); + let total = ids.len(); + ids.dedup(); + assert_eq!( + ids.len(), + total, + "all chunk_ids must be unique even when overlap walks actual_start back to prev_min" + ); + } + ``` +- **Acceptance**: + - `cargo test -p kebab-chunk multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids -j 4` green. + - `cargo test -p kebab-chunk deterministic_chunk_ids_1000 -j 4` green (기존 determinism invariant 보존). 
+  - `cargo test -p kebab-chunk overlap_clamped_when_overlap_exceeds_target -j 4` green (existing overlap clamp invariant preserved).
+  - `cargo test -p kebab-chunk -j 4` fully green.
+
+#### Commit (all of Step 2)
+
+```
+fix(chunk): chunk_id collision under aggressive overlap; bump pdf-page-v1 → pdf-page-v1.1 (Bug #3)
+
+- crates/kebab-chunk/src/pdf_page_v1.rs: chunk_page returns 4-tuple
+  (segment_start, actual_start, chunk_end, slice); caller per_chunk_hash
+  suffix uses segment_start (pre-overlap boundary, strictly increasing)
+  instead of char_start (post-overlap, may collapse to prev_min).
+- VERSION_LABEL "pdf-page-v1" → "pdf-page-v1.1" (design §9 cascade,
+  explicit user-facing audit trail).
+- module docs: workaround description updated to segment_start.
+- mod tests: multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids
+  regression pin.
+- tasks/HOTFIXES.md: 2026-05-27 entry (symptom F2 1580 char OCR,
+  intra-doc collision root cause, second-iteration patch rationale).
+- spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md §4
+```
+
+---
+
+### Step 3 (Group C): Bug #3 multi-scanned PDF integration test
+
+Spec §4.5 + §4.5.1: a chunk_id collision-free regression test at the `kebab-app` integration level. To avoid a real Ollama dependency it uses a MockOcrEngine that implements the `OcrEngine` trait. The existing private MockOcrEngine at `crates/kebab-app/tests/pdf_ocr_apply.rs:20-45` lives in a separate test binary of the same crate and cannot be imported directly, so the executor's first sub-action is to decide how to share it.
+
+#### Sub-action C1 — MockOcrEngine share decision (executor's dependency-check task)
+
+- **Files affected** (branches per option):
+  - **Option A (share via `tests/common/`)**, corrected per the actual probe in verifier round 1 H-2:
+    - `crates/kebab-app/tests/common/mod.rs` **already exists** (172-line `TestEnv` infrastructure: `#![allow(dead_code)]`, `pub struct TestEnv`, `pub fn ingest_md`, `pub fn lexical_query`, and so on). Action = **append the single line `pub mod mock_ocr;`** (no new mod.rs).
+    - `crates/kebab-app/tests/common/mock_ocr.rs` (**new** file: MockOcrEngine lift + per-page ctor).
+    - Remove the existing MockOcrEngine struct + impl at `pdf_ocr_apply.rs:20-45`, add the `mod common; use common::mock_ocr::MockOcrEngine;` import, and migrate the ctor call sites (see M-3).
+    - The new integration test shares it via `mod common; use common::mock_ocr::MockOcrEngine;`.
+  - **Option B (inline duplicate)**: an inline `struct LocalMockOcr` + `impl OcrEngine for LocalMockOcr` inside the new test file `multi_scanned_pdf_ingest_no_chunk_id_collision.rs` (prioritizes test isolation; common/mod.rs is not touched).
+- **Action**:
+  - **(a) dependency probe**, following the decision path in spec §4.5.1:
+    ```bash
+    grep -rn "impl OcrEngine" crates/kebab-parse-image/src/ crates/kebab-app/tests/
+    # actual result:
+    #   crates/kebab-parse-image/src/ocr.rs:235 (production OllamaVisionOcr).
+    #   crates/kebab-app/tests/pdf_ocr_apply.rs:25 (test-only MockOcrEngine).
+    ls crates/kebab-app/tests/common/mod.rs
+    # actual result: -rw-r--r-- ... 172 line (TestEnv infrastructure already exists).
+    ```
+  - **(b) executor decision**:
+    - The existing MockOcrEngine is constructed as `MockOcrEngine { expected_text: String, fail: bool }`. Supporting a different text length per page requires extending the ctor signature (for example `expected_text: Vec<String>` plus an internal `Mutex<usize>` cursor). The extension is trivial and both tests live in the same crate, so **Option A is recommended**.
+    - Under Option A, the MockOcrEngine ctor call sites in `pdf_ocr_apply.rs` migrate to the new ctor signature. The actual verifier probe found **10 instantiation sites** at lines 140, 170, 193, 210, 242, 284, 311, 334, 359, 399 (this corrects the "9 → 10" off-by-one from critic round 1 L-2; the struct definition at line 21 is excluded). For backward compatibility two ctors are provided: `MockOcrEngine::single(text, fail)` and `MockOcrEngine::per_page(texts, fail)`. **Mechanical migration**: at each site, `MockOcrEngine { expected_text: ..., fail: ... }` → `MockOcrEngine::single(..., ...)` (10 sites × 1-line edit, the actual cost per verifier round 1 M-3).
+    - Option B (inline) applies only when the cost of sharing exceeds the value of test isolation. This plan's first preference is Option A.
+  - **(c) record the decision**: write the chosen path under §6 open question 1 in the closing summary of the result file (`.omc/reviews/2026-05-27-v0.20-bugfix-plan-drafter-r1c-result.md`). Under Option A the file edits for sub-action C2 are: append 1 line to the (existing) `common/mod.rs` + (new) `common/mock_ocr.rs` + (modify) `pdf_ocr_apply.rs` + (new) `multi_scanned_pdf_ingest_no_chunk_id_collision.rs` = 4 files. Under Option B there is only 1 new file.
+- **Acceptance**:
+  - The probe grep returns ≥ 2 lines (production + existing mock).
+  - The probe ls confirms that `common/mod.rs` exists.
+  - The executor's decision is stated under open question OQ-1 in §6 of the plan.
+
+#### Sub-action C2 — write the integration test (conditional on the C1 decision)
+
+- **Files affected** (assuming Option A, corrected per verifier round 1 H-2):
+  - `crates/kebab-app/tests/common/mod.rs` (**existing**, 172 lines: only append the single line `pub mod mock_ocr;`).
+  - `crates/kebab-app/tests/common/mock_ocr.rs` (**new**: MockOcrEngine lift + per-page ctor).
+  - `crates/kebab-app/tests/pdf_ocr_apply.rs:20-45` (remove the existing inline impl and add `mod common; use common::mock_ocr::MockOcrEngine;`, a one-line mod declaration at the file head) + mechanical migration of the 10 ctor call sites (M-3).
+  - `crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs` (**new**), importing via `mod common; use common::mock_ocr::MockOcrEngine;`.
+- **Action** (the spec §4.5 test body, expanded within this plan's sub-action):
+  - **fixture**: F1 (`scanned_page1.pdf`, 779 char OCR) + F2 (`scanned_page2.pdf`, 1580 char OCR) + 1 synthetic small-page PDF (300 char), for 3 scanned PDFs in total.
+  - **MockOcrEngine ctor**: per-page text vec `["text for F1", "1580-char string for F2", "text for synthetic 300 char"]` + `fail: false`.
+  - **isolated KB**: `tempfile::tempdir()` + `Config::default()` with only `data_dir` overridden + workspace `[ingest.pdf].enabled = true`.
+  - **assertion path**:
+    1. Call `kebab_app::ingest_with_config_opts(&cfg, ...)` (facade).
+    2. 
`report.items.iter().filter(|i| i.kind == IngestItemKind::Error).count() == 0`: no `ErrorKind::Storage` row, which is what a chunk_id collision would produce.
+    3. `store.get_chunks_count() == sum(per-PDF chunk_counts)`: the final row count of the DELETE+INSERT path.
+    4. `store.get_all_chunk_ids().iter().collect::<HashSet<_>>().len() == chunks_count`: global chunk_id uniqueness.
+  - **executor degradation path** (spec §4.5.1 conditional downgrade): if sharing under Option A is too costly or risky and Option B is also impractical (for example, the ExtractContext / Facade wiring of the integration setup exceeds the scope of this sub-action), conditionally downgrade the acceptance of §6 row 7. The unit-level invariant in `kebab-chunk` (Step 2 B5) then pins the core Bug #3 regression on its own, and the integration test is skipped.
+- **Acceptance**:
+  - `cargo test -p kebab-app multi_scanned_pdf_ingest_no_chunk_id_collision -j 4` green.
+  - `cargo test -p kebab-app pdf_ocr_apply -j 4` green (zero regressions in existing tests after the 10 literal `MockOcrEngine { expected_text, fail }` struct constructions migrate to `MockOcrEngine::single(text, fail)`; count per critic round 1 L-2).
+  - On the downgrade path: record one line in the result file and the commit body, "§6 row 7 conditional skip — Bug #3 core regression = kebab-chunk unit B5".
+
+#### Commit (all of Step 3)
+
+```
+test(app): multi-scanned PDF chunk_id collision-free integration test (Bug #3 regression)
+
+- crates/kebab-app/tests/common/{mod,mock_ocr}.rs: MockOcrEngine lift
+  with per-page text ctor (shared by pdf_ocr_apply.rs + new test).
+- crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs:
+  3 scanned PDF (F1 + F2 + synthetic 300char) ingest via mock OCR,
+  assert all chunk_ids globally unique + zero ErrorKind::Storage rows.
+- spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md §4.5
+```
+
+If Option B (inline) or the conditional downgrade is chosen, adjust the commit body and file list accordingly.
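For the Option A path in Step 3, a shared mock with both constructors might look like the sketch below. The `OcrEngine` trait shown here is a hypothetical one-method stand-in, since the real trait signature is not reproduced in this plan; the field names and the clamping behaviour are likewise assumptions made for illustration.

```rust
use std::sync::Mutex;

/// Hypothetical minimal trait; the real `OcrEngine` signature may differ.
trait OcrEngine {
    fn recognize(&self, image: &[u8]) -> Result<String, String>;
}

struct MockOcrEngine {
    texts: Vec<String>,
    cursor: Mutex<usize>,
    fail: bool,
}

impl MockOcrEngine {
    /// Backward-compatible ctor: the same text for every call.
    fn single(text: impl Into<String>, fail: bool) -> Self {
        Self::per_page(vec![text.into()], fail)
    }

    /// One text per call, in order.
    fn per_page(texts: Vec<String>, fail: bool) -> Self {
        Self { texts, cursor: Mutex::new(0), fail }
    }
}

impl OcrEngine for MockOcrEngine {
    fn recognize(&self, _image: &[u8]) -> Result<String, String> {
        if self.fail {
            return Err("mock OCR failure".to_string());
        }
        let mut cursor = self.cursor.lock().unwrap();
        // Clamp to the last entry so `single` keeps returning its text
        // and `per_page` repeats the final page once exhausted.
        let idx = (*cursor).min(self.texts.len().saturating_sub(1));
        *cursor += 1;
        self.texts
            .get(idx)
            .cloned()
            .ok_or_else(|| "no text configured".to_string())
    }
}

fn main() {
    let mock = MockOcrEngine::per_page(vec!["page one".into(), "page two".into()], false);
    for _ in 0..3 {
        println!("{:?}", mock.recognize(&[]));
    }
}
```

The interior `Mutex` keeps `recognize` callable through `&self`, so the mock can sit behind a shared reference the same way a stateless engine would; whether the real trait requires `Send + Sync` is not confirmed here.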
+
+---
+
+### Step 4 (Group D): Bug #4 F4 fixture re-generation
+
+Spec §5: replace the byte-level `re.sub` + hand-edited startxref in `tests/fixtures/_synth/mojibake.py` with proper PDF surgery via pikepdf (open + delete /ToUnicode + save, with the xref regenerated automatically). Regenerate the F4 fixture itself (`crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf`) and add 3 new invariant tests.
+
+#### Sub-action D1 — `tests/fixtures/_synth/mojibake.py` pikepdf rewrite
+
+- **Files affected**: `tests/fixtures/_synth/mojibake.py` (full rewrite; the existing byte-edit pattern is discarded).
+- **Action** (the spec §5.4 body as-is):
+  - Step 1: synthesize a Korean PDF with reportlab using a Type 0 (CID) font (includes a valid ToUnicode CMap).
+  - Step 2: open it with `pikepdf.open(..., allow_overwriting_input=True)`, remove the `/ToUnicode` entry from every dictionary, then call `pdf.save()` (xref regenerated automatically).
+  - Step 3: verify the invariants `len(pdf.pages) == 1` and `b"/ToUnicode" not in dst.read_bytes()`.
+  - On failure, exit with a non-zero code and a stderr message (Step 2 removed count = 0 → exit 2; Step 3 page count mismatch → exit 3; ToUnicode still present → exit 4).
+- **Dep install** (executor pre-action):
+  ```bash
+  pip install --cache-dir /build/cache/pip pikepdf reportlab
+  python -c "import pikepdf; import reportlab; print(pikepdf.__version__, reportlab.Version)"
+  # font availability probe (critic round 1 L-3): path hardcoded in mojibake.py.
+  test -f /usr/share/fonts/truetype/dejavu/DejaVuSans.ttf \
+    || sudo apt-get install -y fonts-dejavu-core
+  ```
+  Not wired into CI: the fixture itself is committed, so generation is a one-off (run locally by the executor in Step 4 D2). Optionally add a one-line pikepdf install hint to `tasks/HOTFIXES.md`.
+- **Acceptance**:
+  - `grep -c "import pikepdf" tests/fixtures/_synth/mojibake.py` = **1**.
+  - `grep -c "re.sub" tests/fixtures/_synth/mojibake.py` = **0** (confirms the byte-edit pattern is gone).
+  - `test -f /usr/share/fonts/truetype/dejavu/DejaVuSans.ttf` exits 0 (font probe; fast failover signal per critic round 1 L-3).
+  - `python tests/fixtures/_synth/mojibake.py /tmp/mojibake_dryrun.pdf && echo OK` exits 0 with empty stderr.
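The Step 3 invariant of D1 (no `/ToUnicode` marker left in the output bytes) can also be checked from the Rust side with a plain byte scan. This is a minimal sketch: `count_marker` is a hypothetical helper name, and the sample byte strings are made up for illustration rather than taken from the fixture. It assumes the marker would appear uncompressed; a name stored inside a compressed object stream would not be visible to a raw scan.

```rust
/// Counts occurrences of `needle` in `haystack` using a sliding window.
fn count_marker(haystack: &[u8], needle: &[u8]) -> usize {
    if needle.is_empty() {
        // `slice::windows(0)` panics, so treat an empty needle as "no match".
        return 0;
    }
    haystack
        .windows(needle.len())
        .filter(|window| *window == needle)
        .count()
}

fn main() {
    // Illustrative font dictionaries, before and after stripping the entry.
    let before: &[u8] = b"<< /Type /Font /Subtype /Type0 /ToUnicode 9 0 R >>";
    let after: &[u8] = b"<< /Type /Font /Subtype /Type0 >>";
    println!("before: {}", count_marker(before, b"/ToUnicode"));
    println!("after:  {}", count_marker(after, b"/ToUnicode"));
}
```

In a real test the haystack would come from `include_bytes!` on the fixture, so the check does not depend on the working directory of the test run.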
+ +#### Sub-action D2 — F4 fixture binary regenerate + snapshot regen + commit + +- **Files affected** (verifier round 1 H-4 + critic round 1 M-2 의 actual probe `grep -rn 'fixtures/mojibake.pdf' crates/` 결과 2 consumer enumerate): + - `crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf` (regenerate). + - `crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json` (**snapshot baseline file 자체** — delete + auto-regen). verifier round 1 H-4 의 actual probe `text_extractor_regression.rs:59-64` 의 hand-rolled `unwrap_or_else { write baseline }` 패턴. + - `crates/kebab-parse-pdf/tests/text_extractor_regression.rs` (existing test — 코드 자체 변경 0, snapshot regen path 만 trigger). + - `crates/kebab-parse-pdf/src/text_quality.rs:96` (verifier round 1 H-4 의 2번째 consumer — `let bytes = include_bytes!("../tests/fixtures/mojibake.pdf");` 의 unit test/doctest 가 fixture binary 변경 시 동시 verify, 코드 변경 0). +- **Action**: + - **(a) regenerate command**: + ```bash + python tests/fixtures/_synth/mojibake.py \ + crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf + ``` + - **(b) regenerate 후 manual probe**: + ```bash + python -c "import pikepdf; pdf = pikepdf.open('crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf'); print(len(pdf.pages))" + # expected: 1 + grep -c "/ToUnicode" crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf + # expected: 0 (binary grep — Pages dict 안 ToUnicode 부재) + ``` + - **(c) snapshot baseline regen** (verifier round 1 H-4 + critic round 1 M-2 의 actual mechanic — OQ-2 closure): + - `text_extractor_regression.rs:59-64` 는 `let baseline = std::fs::read_to_string("tests/snapshots/vector_pdf_canonical.json").unwrap_or_else(|_| { std::fs::write(baseline_path, &actual).expect(...); actual.clone() })` 의 hand-rolled pattern (insta crate 사용 X). + - fixture binary 변경 → 다음 cargo test 시 `actual != baseline` → `assert_eq!` fail. 
+ - executor 의 regen step: + ```bash + rm crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json + cargo test -p kebab-parse-pdf vector_pdf_extract_byte_identical_to_baseline -j 4 + # 1st run: snapshot file 부재 → unwrap_or_else write 패턴이 새 baseline 작성 → assert pass. + cargo test -p kebab-parse-pdf vector_pdf_extract_byte_identical_to_baseline -j 4 + # 2nd run: 새 baseline 와 byte-identical → assert pass (regression invariant 확립). + ``` + - OQ-2 closure — insta crate 미사용, cargo-insta CLI 불요. spec §5.6 의 "기존 `text_extractor_regression.rs` 의 F4 baseline 갱신" 의 actual mechanic 명문화. +- **Acceptance**: + - `stat crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf` size > 0. + - `grep -c "/ToUnicode" crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf` = **0**. + - python 의 page count probe = `1`. + - `test -f crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json` exit 0 (snapshot regen 후). + - `cargo test -p kebab-parse-pdf vector_pdf_extract_byte_identical_to_baseline -j 4` 2회 연속 green (regen + verify, H-4 mechanic). + - `cargo test -p kebab-parse-pdf -j 4` 전체 green — D3 의 신규 test + text_quality.rs:96 의 2번째 consumer 도 동시 verify. + +#### Sub-action D3 — parse-pdf 의 3 신규 invariant test + +- **Files affected** (verifier round 1 H-3 의 actual probe 결과 결정): + - actual `ls crates/kebab-parse-pdf/tests/` 결과: `common`, `extractor.rs`, `fixtures`, `ocr_e2e.rs`, `page_image.rs`, `snapshots`, `text_extractor_regression.rs` — plan round 0 의 primary candidate `text_extractor.rs` 는 **존재 안 함**. + - **결정**: `crates/kebab-parse-pdf/tests/text_extractor_regression.rs` append (F4 fixture consumer locality + D2 snapshot regen mechanic 와 same file). 3 신규 `#[test] fn` append. + - 대안 (executor 가 file size / cohesion 고려해 split 결정 시): 신규 `crates/kebab-parse-pdf/tests/mojibake_invariants.rs`. plan first preference = append to `text_extractor_regression.rs`. +- **Action** (spec §5.5 의 3 test body 의 path 정정 — verifier round 1 H-3): + 1. 
`mojibake_fixture_load_yields_one_page` — `let bytes = include_bytes!("fixtures/mojibake.pdf");` (integration test 는 이미 `crates/kebab-parse-pdf/tests/` root, `text_extractor_regression.rs:42` 의 canonical pattern 따름; spec §5.5 의 `"../tests/fixtures/mojibake.pdf"` 가 잘못 — `"fixtures/mojibake.pdf"` 직접). `lopdf::Document::load_mem(bytes).unwrap().get_pages().len() == 1`. + 2. `mojibake_fixture_has_no_tounicode_cmap` — CWD-relative `std::fs::read("tests/fixtures/mojibake.pdf")` 위험 회피 (cargo test 의 CARGO_MANIFEST_DIR ≠ CWD 환경 가능): `let bytes = include_bytes!("fixtures/mojibake.pdf");` 사용. `bytes.windows(b"/ToUnicode".len()).filter(|w| *w == b"/ToUnicode").count() == 0`. + 3. `pdf_text_extractor_on_mojibake_yields_one_block` — `let bytes = include_bytes!("fixtures/mojibake.pdf");` + PdfTextExtractor 의 `1 Block::Paragraph per page` invariant 검증, `canonical.blocks.len() == 1`, `scanned candidate` warning 또는 non-empty text. ExtractContext setup 의 actual body 는 executor 가 `text_extractor_regression.rs` 의 existing helper (있을 시) 또는 spec §5.5 의 placeholder 의 expansion. +- **Acceptance**: + - `cargo test -p kebab-parse-pdf mojibake_fixture_load_yields_one_page -j 4` green. + - `cargo test -p kebab-parse-pdf mojibake_fixture_has_no_tounicode_cmap -j 4` green. + - `cargo test -p kebab-parse-pdf pdf_text_extractor_on_mojibake_yields_one_block -j 4` green. + - `cargo test -p kebab-parse-pdf -j 4` 전체 green. + +#### Commit (Step 4 전체) + +``` +fix(parse-pdf): F4 mojibake.pdf via pikepdf surgery; preserve 1-page invariant (Bug #4) + +- tests/fixtures/_synth/mojibake.py: full rewrite — replace byte-level + re.sub + manual startxref edit with pikepdf open+del+save (auto xref + regen). Type 0 font + ToUnicode strip via dictionary walk. +- crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf: regenerate. 
+- crates/kebab-parse-pdf/tests/text_extractor_regression.rs: append 3 + invariant tests (lopdf 1-page / no ToUnicode marker / PdfTextExtractor + 1-block) — verifier round 1 H-3 의 file path decision (same locality + with snapshot regen). +- crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json: + delete + auto-regen via 2-run cargo test (hand-rolled unwrap_or_else + pattern, verifier round 1 H-4). +- spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md §5 +``` + +verifier round 1 NIT-2 정정 — commit scope `test-fixture` → `parse-pdf` (crate name, conventional commit typical scope). + +--- + +### Step 5 (Group E): Workspace verify + commit + PR #189 force-push + +spec §6 의 16-row consolidated acceptance 를 본 step 의 verifier checklist. 모든 acceptance command 가 scriptable. + +#### Sub-action E1 — cargo workspace test `-j 1` + +- **Files affected**: 변경 0 (verification only). +- **Action**: + - **(a) conditional cargo clean** — memory `feedback_cargo_clean_policy` (verifier round 1 NIT-1 정정: TB vs GB unit 혼동 회피 위해 `-BG` 으로 GB unit 강제): + ```bash + # /build avail 을 GB 단위 정수로 직접 가져옴 (df -BG output 의 'G' suffix 만 strip). + AVAIL_GB=$(df -BG --output=avail /build | tail -1 | tr -d ' G') + # CARGO_TARGET_DIR 의 size 도 GB 단위 정수로 (du -BG output). + TARGET_GB=$(du -BG -s "${CARGO_TARGET_DIR:-target}" 2>/dev/null | awk '{print $1}' | tr -d 'G') + # /build avail < 500 GB OR target > 200 GB → clean + if [[ "${AVAIL_GB:-9999}" -lt 500 ]] || [[ "${TARGET_GB:-0}" -gt 200 ]]; then + cargo clean + fi + ``` + 임계 미달 시 skip + commit body / result file 안 1줄 record (예: "skipped cargo clean — /build avail ${AVAIL_GB}G, target ${TARGET_GB}G"). + - **(b) workspace test**: + ```bash + cargo test --workspace --no-fail-fast -j 1 2>&1 | tail -100 + ``` + - tail 100 line + final summary "test result: ok. N passed; 0 failed" 확인. +- **Acceptance**: + - exit code 0. + - stdout 의 "test result: ok" + "0 failed". + - spec §6 row 14 (workspace full test pass) 충족. 
+ +#### Sub-action E2: `cargo clippy --workspace -- -D warnings` + +- **Files affected**: none. +- **Action**: + ```bash + cargo clippy --workspace --all-targets -j 1 -- -D warnings 2>&1 | tail -50 + ``` +- **Acceptance**: + - exit code 0. + - zero occurrences of the keyword "warning" (`-D warnings` turns any warning into an error anyway). + - satisfies spec §6 row 13 (workspace clippy clean). + +#### Sub-action E3: dogfood re-run (Ollama qwen2.5vl:3b environment) + +- **Files affected**: none. Uses `/build/cache/tmp/v0.20-dogfood/` (isolated KB; the same dogfood workspace is reused). +- **Action** (follows memory `feedback_pr_workflow` + the `_external/` invariant): + - **(a) release build**: + ```bash + cargo build --release -p kebab-cli -j 4 2>&1 | tail -10 + "${CARGO_TARGET_DIR:-target}/release/kebab" --version + # expected: kebab 0.20.0 + ``` + - **(b) clean the dogfood KB + re-ingest 9 PDFs** (same dogfood environment as spec §1.1). + + The canonical config path is `/build/cache/tmp/v0.20-dogfood/config.toml` (assumed in §0). The external backup file `/build/cache/tmp/v0.20-dogfood-config.toml` **does not exist**, as the actual probe for critic round 1 H-1 showed. The procedure therefore **backs up the config itself, then cleans and restores** (this keeps the destructive `rm -rf` from deleting the config along with the KB): + + ```bash + # Step A: temporary backup of the config (preserved before the KB clean). + cp /build/cache/tmp/v0.20-dogfood/config.toml \ + /build/cache/tmp/v0.20-dogfood-config.toml.bak + + # Step B: clean the whole KB (destructive, config included; the backup preserves it). + rm -rf /build/cache/tmp/v0.20-dogfood/ + mkdir -p /build/cache/tmp/v0.20-dogfood/ + + # Step C: restore the config from the backup. + cp /build/cache/tmp/v0.20-dogfood-config.toml.bak \ + /build/cache/tmp/v0.20-dogfood/config.toml + + # In config.toml: [ingest.pdf].enabled = true, ollama endpoint = + # http://192.168.0.47:11434, ocr_model = qwen2.5vl:3b + + # Step D: ingest. + "$RELEASE_BIN" ingest --config /build/cache/tmp/v0.20-dogfood/config.toml \ + --json --force-reingest 2>&1 | tee /build/cache/tmp/v0.20-dogfood-ingest.ndjson + + # Step E (optional): backup file cleanup. Prevents redundant + # backups from piling up across dogfood iterations. 
The config itself lives in place at v0.20-dogfood/config.toml + # (canonical); the .bak is transient. + rm /build/cache/tmp/v0.20-dogfood-config.toml.bak + ``` + + (Alternative: a single-step selective delete that preserves the config and destroys everything else): + ```bash + find /build/cache/tmp/v0.20-dogfood/ -mindepth 1 -not -name 'config.toml' \ + -exec rm -rf {} + + ``` + The actual procedure favors the clarity of the plain 5-step version (the executor's default). + - **(c) acceptance**: + - spec §6 row 3: `skipped_size_exceeded == 0` for non-code files across the 9 PDFs (effectively all 0, since the workspace contains no code files). + - spec §6 row 8: F1 + F2 have `kind != "Error"` (no chunk_id collision). + - spec §6 row 12: the ingest item for mojibake.pdf has `block_count: 1`. + - spec §6 row 15: all 9 PDFs ingested, `errors = 2` (encrypted only, same as the pre-existing dogfood baseline). + - **Fallback when Ollama is unavailable**: if the endpoint is unreachable, this sub-action may be partially skipped. The unit/integration-level evidence from the workspace test (E1) + clippy (E2) then satisfies spec §6 rows 1, 2, 4-7, 9-11, 13-14, 16, and the skip of dogfood rows 3, 8, 12, 15 is recorded in one line (commit body + result file). +- **Acceptance**: + - errors = 2 (encrypted only) in the ndjson ingest report. + - The `kind` field on each of the F1/F2/mojibake item lines is on the success path (= `"new"` or `"unchanged"`, not `"Error"`). + - dogfood log path: `/build/cache/tmp/v0.20-dogfood-ingest.ndjson` (referenced in the commit body). + +#### Sub-action E4: commit check + final organize + +- **Files affected**: the cumulative changes from all steps. +- **Action**: + - `git status` + `git log --oneline b4d9e60..HEAD`: expect the 4 commits from Steps 1-4 and 0 commits from Step 5 (verification only, no commit). + - If any work-in-progress files remain, reset them. + - Check the `Co-Authored-By:` line in each commit message (CLAUDE.md gitea-pr workflow). +- **Acceptance**: + - `git log --oneline b4d9e60..HEAD | wc -l` = **4** (one commit per step for Steps 1-4). + - `git status` shows untracked + modified = 0. + +#### Sub-action E5: PR #189 force-push + +- **Files affected**: remote ref `gitea/feat/pdf-scanned-ocr`. 
+- **Action** (the gitea-ops skill can be invoked directly): + ```bash + git push gitea feat/pdf-scanned-ocr --force-with-lease + ``` + - `--force-with-lease`: force-push only when the local fetch state matches the remote HEAD (protects pushes by other collaborators; a cheap safety net in this single-user environment). + - Update the PR #189 body: add the Bug #2/#3/#4 fix summary + dogfood evidence (gitea API `PATCH /repos/altair823-org/kebab/pulls/189`). + - Per user memory `feedback_pr_workflow`, enter the review loop of the `gitea-pr-review` skill (multi-round critic + verifier). +- **Acceptance**: + - The head SHA reported by the `gh-equivalent` (gitea-ops `gitea-pr-status 189`) = local `git rev-parse HEAD`. + - PR #189 commit count = commit count at the previous force-push + 4. + - The final state matches the 5-row sequencing summary table (§7), which lists 4 commits plus 1 verify-only step. + +#### Commit (Step 5) + +Verification only: 0 git commits. The 4 commits from Steps 1-4 are the final tree. + +--- + +## §4 Verifier checklist (spec §6 16-row 1:1 mapping) + +Each row is a scriptable command. All of them can be covered by the cumulative runs of Step 5 E1-E3. + +| # | Verifier | Bug | step | Command | +|---|---------|-----|------|------| +| 1 | walker bypasses size cap for PDF | #2 | A3 | `cargo test -p kebab-source-fs size_cap_skips_only_code_files -j 4` | +| 2 | walker still skips oversized code files | #2 | A3 | `cargo test -p kebab-source-fs ingest_report_counts_oversized_files_by_bytes -j 4` | +| 3 | 256KB+ PDF/markdown ingest default config | #2 | E3 | dogfood: the ingest report from `$RELEASE_BIN ingest ...` shows `skipped_size_exceeded = 0` for non-code | +| 4 | chunker collision regression test | #3 | B5 | `cargo test -p kebab-chunk multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids -j 4` | +| 5 | chunker determinism preserved | #3 | B5 | `cargo test -p kebab-chunk deterministic_chunk_ids_1000 -j 4` | +| 6 | chunker overlap clamp preserved | #3 | B5 | `cargo test -p kebab-chunk overlap_clamped_when_overlap_exceeds_target -j 4` | +| 7 | integration: multi-scanned PDF ingest (conditional, §4.5.1) | #3 | C2 | `cargo test -p kebab-app 
multi_scanned_pdf_ingest_no_chunk_id_collision -j 4` (skip + record when the Option A/B downgrade path is taken) | +| 8 | dogfood: F1 + F2 force-reingest errors=0 | #3 | E3 | dogfood: `$RELEASE_BIN ingest --force-reingest ...` reports errors = 0 (excluding encrypted) | +| 9 | F4 fixture lopdf 1-page invariant | #4 | D3 | `cargo test -p kebab-parse-pdf mojibake_fixture_load_yields_one_page -j 4` | +| 10 | F4 fixture no-ToUnicode invariant | #4 | D3 | `cargo test -p kebab-parse-pdf mojibake_fixture_has_no_tounicode_cmap -j 4` | +| 11 | F4 PdfTextExtractor 1-block invariant | #4 | D3 | `cargo test -p kebab-parse-pdf pdf_text_extractor_on_mojibake_yields_one_block -j 4` | +| 12 | dogfood: F4 ingest block_count=1 | #4 | E3 | dogfood: the ingest item for mojibake.pdf has `block_count: 1` | +| 13 | workspace clippy clean | all | E2 | `cargo clippy --workspace --all-targets -j 1 -- -D warnings` | +| 14 | workspace full test pass | all | E1 | `cargo test --workspace --no-fail-fast -j 1` | +| 15 | dogfood end-to-end 9 PDF | all | E3 | dogfood: all 9 PDFs ingested, errors = 2 (encrypted only) | +| 16 | chunker_version cascade final value | #3 | B3 | `grep -nE 'pdf-page-v[0-9.]+' crates/kebab-chunk/src/pdf_page_v1.rs` shows `"pdf-page-v1.1"` | + +In the final step (E1-E3) the executor runs all 16 rows as scripts and records pass/fail/skip row by row in the result file. + +#### Workspace baseline expected test count delta (verifier round 1 M-1 closure) + +For `cargo test --workspace -j 1` (Step 5 E1), the expected `test result: ok. 
N passed` count changes as follows relative to the pre-fix baseline: + +| Step | Sub-action | new test name | crate | type | +|---|---|---|---|---| +| 1 | A3 | `size_cap_skips_only_code_files` | kebab-source-fs | unit (in `mod tests`) | +| 2 | B5 | `multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids` | kebab-chunk | unit (in `mod tests`) | +| 3 | C2 | `multi_scanned_pdf_ingest_no_chunk_id_collision` | kebab-app | integration (new test binary) | +| 4 | D3 | `mojibake_fixture_load_yields_one_page` | kebab-parse-pdf | unit-style integration | +| 4 | D3 | `mojibake_fixture_has_no_tounicode_cmap` | kebab-parse-pdf | unit-style integration | +| 4 | D3 | `pdf_text_extractor_on_mojibake_yields_one_block` | kebab-parse-pdf | unit-style integration | + +- **Option A (full path, C2 active)**: total = **+6 unit/integration test cases + 1 new integration test binary**. +- **Option B (C2 conditional downgrade per §4.5.1)**: total = **+5 test cases + 0 new binary**. + +Other notes: the existing B3 test `chunker_version_is_pdf_page_v1` keeps its assertion content unchanged (it refers to the VERSION_LABEL const) → test count delta 0. For the existing D2 test `vector_pdf_extract_byte_identical_to_baseline`, only the assertion result changes (fixture change → baseline change) → test count delta 0; only a snapshot regen action is added. + +When comparing N in the E1 acceptance, the executor confirms that it matches this delta arithmetic (so that a regression is detected). + +--- + +## §5 Risks (plan stage) + +- **R-1 (MockOcrEngine sharing complexity)**: Option A in spec §4.5.1 (lifting into `tests/common/mock_ocr.rs`) requires migrating the constructors used by the 9 existing tests at `pdf_ocr_apply.rs:20-45`. This is trivial if 2 backward-compatible constructors (single + per_page) are introduced; on failure, downgrade to Option B (inline). Spec §6 row 7 may then be conditionally skipped. +- **R-2 (chunker_version bump cascade scope)**: the effect of `pdf-page-v1.1` = changed chunk_id values for multi-chunk PDF pages. parser_version / embedding_version / prompt_template_version / index_version are unchanged; in the 5-version snapshot at `kebab-eval::eval_runs.config_snapshot_json`, only the chunker_version field takes a new value. 
The cascade rule invariant of parent design §9 is preserved; re-running the eval baseline is recommended (the user-facing note in spec §7.1 Risk 1). +- **R-3 (F4 fixture binary churn)**: the pikepdf save output orders PDF objects differently from reportlab+byte-edit → SHA256 change + git binary diff noise. The `text_extractor_regression.rs` baseline is also updated in the same commit to the new fixture's actual output, handled together in Step 4 D2. +- **R-4 (dogfood depends on Ollama)**: the dogfood acceptance in spec §6 rows 3 + 8 + 12 + 15 calls the real qwen2.5vl:3b at `192.168.0.47:11434`. If the endpoint is unavailable, close partially on unit/integration evidence (rows 1-2, 4-7, 9-11, 13-14, 16) and record the skip in the commit body / result file. +- **R-5 (pikepdf dependency install)**: `import pikepdf` in mojibake.py (Step 4 D1) requires a pip install into this machine's Python venv. No CI dependency arises (one-off generation, after which the fixture is committed). +- **R-6 (no conflict with concurrent parent-plan work)**: Step 11 (final verify + PR open) of the parent plan (`2026-05-27-pdf-scanned-ocr-plan.md`, round 1c ACCEPT) is already closed by commit `b4d9e60`. The fix commits of this plan stack on top of that commit, so there is no branch-ordering conflict. + +--- + +## §6 Open questions deferred to executor + +- **OQ-1 (MockOcrEngine sharing path)**: decide between Option A (lifting into `tests/common/mock_ocr.rs`) and Option B (inline) from spec §4.5.1. This is the executor's first action in Step 3 C1: probe with `grep -rn "impl OcrEngine"`, decide, and record the decision in the result file. The plan's first preference is Option A. +- **OQ-2 (F4 baseline snapshot update tool)**: ✅ **CLOSED (round 1c, critic M-2 + verifier H-4)**. The actual pattern at `text_extractor_regression.rs:59-64` is a hand-rolled `unwrap_or_else { write baseline }` (the insta crate is not used). Regen procedure = delete the snapshot file `tests/snapshots/vector_pdf_canonical.json` + run cargo test twice (1st run auto-regenerates, 2nd run verifies). The cargo-insta CLI is not needed. Details are in Action (c) of §3 Step 4 D2. +- **OQ-3 (pikepdf install command)**: choose the cache-dir + venv for `pip install`: a global `--user` pip, a venv dedicated to fixture generation, or a conda environment. 
The plan's default is `pip install --cache-dir /build/cache/pip pikepdf reportlab` (memory `feedback_disk_layout`). +- **OQ-4 (when the endpoint in the dogfood config.toml changes)**: if the `192.168.0.47:11434` Ollama endpoint of this dogfood environment changes, the executor overrides it with an alternative endpoint (`localhost:11434`, etc.) and records that in the result file. +- **OQ-5 (number of rounds in the PR #189 review loop)**: the gitea-pr + review loop from memory `feedback_pr_workflow`. Depending on the round 1 critic + verifier results, rounds 2/2c may follow. This plan is round 0 (drafter); the outcome of the review rounds is outside the plan's scope. + +--- + +## §7 Sequencing summary (logical commit boundaries) + +| commit # | step range | logical scope | file count | +|---:|---|---|---:| +| 1 | Step 1 (A1+A2+A3) | `fix(source-fs): apply size limit only to code files; PDF/image/markdown bypass walker cap (Bug #2)` | 2 | +| 2 | Step 2 (B1+B2+B3+B4+B5) | `fix(chunk): chunk_id collision under aggressive overlap; bump pdf-page-v1 → pdf-page-v1.1 (Bug #3)` | 4 (pdf_page_v1.rs + HOTFIXES.md + pdf_pipeline.rs:168 + :368, verifier H-1) | +| 3 | Step 3 (C1+C2) | `test(app): multi-scanned PDF chunk_id collision-free integration test (Bug #3 regression)` | **4 (Option A: existing common/mod.rs append + new common/mock_ocr.rs + modify pdf_ocr_apply.rs + new multi_scanned_pdf_ingest_no_chunk_id_collision.rs, verifier H-2)** / 1 (Option B) | +| 4 | Step 4 (D1+D2+D3) | `fix(parse-pdf): F4 mojibake.pdf via pikepdf surgery; preserve 1-page invariant (Bug #4)` | **5 (mojibake.py + fixtures/mojibake.pdf + snapshots/vector_pdf_canonical.json + text_extractor_regression.rs (D3 append) + src/text_quality.rs:96 consumer verify, verifier H-4 + H-3 + NIT-2)** | +| 5 | Step 5 (E1-E5) | verification only, 0 git commits; final state = PR #189 force-pushed on top of commits 1-4 | 0 | + +In total: 4 commits + 1 verify-only step. After the force-push, the PR #189 head = local HEAD. + +--- + +## §8 Round 1c rewrite changelog (drafter trace) + +Applies the combined 21 findings from the round 1 critic + verifier (critic 7 + verifier 14). 
For details, see the §1 traceability matrix in the result file (`.omc/reviews/2026-05-27-v0.20-bugfix-plan-drafter-r1c-result.md`). This §8 summarizes the substantive changes to the plan body. + +### Critic r1 (7 findings) + +| ID | Severity | Action | Plan section | +|---|---|---|---| +| critic H-1 | HIGH | E3 dogfood config: 5-step backup, clean + restore procedure (reflects the fact that no external backup file exists) | §3 Step 5 E3 (b) | +| critic M-1 | MEDIUM | line 15 "17 sub-action" → "18 sub-action" | §0 prelude line | +| critic M-2 | MEDIUM | Spell out the D2 snapshot baseline update mechanic (hand-rolled `unwrap_or_else` pattern, OQ-2 closure) | §3 Step 4 D2 + §6 OQ-2 | +| critic L-1 | LOW | State B1 line range "200-289" → "200-204 (doc) + 205-289 (body)" | §3 Step 2 B1 | +| critic L-2 | LOW | MockOcrEngine ctor count "9 test (existing)" → "10 instantiation site" (actual probe) | §3 Step 3 C1 + C2 | +| critic L-3 | LOW | Add a one-line DejaVuSans.ttf existence probe to the D1 pre-action | §3 Step 4 D1 | +| critic NIT-1 | NIT | "5 logical commit" → "4 commit + 1 verify-only step (= 5 step total, 4 commit boundary)" | §0 prelude line | + +### Verifier r1 (14 findings) + +| ID | Severity | Action | Plan section | +|---|---|---|---| +| verifier H-1 | HIGH | In sub-action B3, spell out the literal updates at `pdf_pipeline.rs:168` (hard assertion) + `:368` (error message) + tighten the acceptance grep regex (`grep -v 'pdf-page-v1\.1'`) | §3 Step 2 B3 | +| verifier H-2 | HIGH | Step 3 Option A now reflects that `common/mod.rs` is existing infrastructure: append one line `pub mod mock_ocr;` + new `common/mock_ocr.rs` + lift from `pdf_ocr_apply.rs` + new integration test = 4 file edits | §3 Step 3 C1 + C2 + §7 commit 3 file count | +| verifier H-3 | HIGH | Correct the D3 file path (`text_extractor.rs` does not exist) → append to `text_extractor_regression.rs` (locality with D2 snapshot regen). 
`include_bytes!` path also changed from `../tests/fixtures/...` → `fixtures/...` directly + avoid the CWD-relative `std::fs::read` | §3 Step 4 D3 | +| verifier H-4 | HIGH | D2 snapshot regen mechanic: delete the snapshot file `tests/snapshots/vector_pdf_canonical.json` + run cargo test twice (1st auto-regen, 2nd verify) + enumerate the second consumer at `src/text_quality.rs:96` | §3 Step 4 D2 + §7 commit 4 file count | +| verifier M-1 | MEDIUM | Add an expected workspace test count delta table after the §4 verifier checklist (+6 unit + 1 integration, Option A / +5 + 0, Option B) | §4 (sub-section) | +| verifier M-2 | MEDIUM | Update the B2 acceptance phrasing: state "Step 2 commit time" explicitly + spell out when each sub-action's grep runs | §3 Step 2 B2 acceptance | +| verifier M-3 | MEDIUM | Spell out the command for the "mechanical migration of the 10 existing ctor sites" in C2 Option A | §3 Step 3 C1 (b) | +| verifier L-1 | LOW | pdf_page_v1.rs line range 200-289 → 205-289 (same edit pass as critic L-1) | §3 Step 2 B1 | +| verifier L-2 | LOW | caller line range 155-185 → 149-186 | §3 Step 2 B2 | +| verifier L-3 | LOW | Reinforce the arithmetic (target=1500 byte + overlap=240 byte) in the B5 test scenario comment | §3 Step 2 B5 | +| verifier L-4 | LOW | Confirm that `ingest_pdf_ocr_smoke.rs` is safe with respect to the B3 grep scope (no separate action; the finding itself = no action) | (verified safe) | +| verifier NIT-1 | NIT | Tighten the `df -h` unit handling in E1 → force GB units with `df -BG --output=avail` | §3 Step 5 E1 (a) | +| verifier NIT-2 | NIT | Step 4 commit scope `test-fixture` → `parse-pdf` (crate name) | §3 Step 4 commit | +| verifier NIT-3 | NIT | Single definition of the canonical dogfood config path (in §0) + referenced by every acceptance command | §0 pre-flight + §3 Step 5 E3 | + +### Summary + +- frontmatter `status`: `draft (round 0)` → `draft (round 1c)`. +- frontmatter `review_history`: 3 lines added (round 1 critic + verifier + round 1c rewrite). +- plan body line 15: 2 tokens corrected in the prelude statement (sub-action count + commit boundary wording). +- §0 pre-flight: 1 bullet added on the assumed dogfood KB layout. 
+- §3: sub-action bodies of the 5 steps fleshed out (file path / acceptance grep / mechanic / migration cost). +- §4 verifier checklist: expected test count delta sub-section added. +- §6: OQ-2 marked closed (✅ CLOSED). +- §7 sequencing summary: file counts updated (commit 2: 2→4, commit 3: 3-4→4, commit 4: 3-4→5). +- §8 round 1c rewrite changelog (this section) populated. + +Total plan body line change = ~+250 net added (round 0: 698 lines → round 1c: ~950 lines). diff --git a/docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix2-plan.md b/docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix2-plan.md new file mode 100644 index 0000000..a9ddc31 --- /dev/null +++ b/docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix2-plan.md @@ -0,0 +1,388 @@ +--- +title: "v0.20.0 sub-item 1 bugfix round 2: plan" +created: 2026-05-27 +status: "DRAFT round 0" +spec_path: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md +spec_status: ACCEPT (round 1c, 308 line) +critic_round_1: .omc/reviews/2026-05-27-v0.20-bugfix2-spec-critic-r1-result.md +critic_round_2: .omc/reviews/2026-05-27-v0.20-bugfix2-spec-critic-r2-result.md +step_count: 4 (3 commit + 1 sanity-only) +commit_count: 3 +branch: feat/pdf-scanned-ocr +head_at_draft: e674ff4 +--- + +# v0.20.0 sub-item 1 bugfix round 2: plan + +## §0 Overview + +Implementation map for the ACCEPTed spec (round 1c, 9/9 critic findings applied). Fix scope = 2 bugs: + +- **Bug #6** (Critical): the `?Identity-H Unimplemented?` mojibake marker bypasses OCR fallback; fixed with a marker strip + dominance heuristic in `crates/kebab-parse-pdf/src/text_quality.rs::compute_valid_char_ratio()`. +- **Bug #7** (Minor doc): `code` is missing from the `--media` value list in `kebab search --help`; fixed by updating the clap doc-comment and SKILL.md in sync + a CLI help regression test. + +Total: 5 changed files + 1 new test file + a HOTFIXES entry. branch = `feat/pdf-scanned-ocr` (HEAD = `e674ff4`; the 3 commits of round 2 are appended on top of the 4 stacked commits of round 1). env = `CARGO_TARGET_DIR=/build/out/cargo-target/target`. fresh release binary = `/build/out/cargo-target/target/release/kebab`. 
+ +**non-scope (spec §2.2 + OQ-4 in §5 of this plan)**: beyond the surfaces covered by the ACCEPTed spec, `crates/kebab-mcp/src/tools/search.rs:44`, `crates/kebab-core/src/search.rs:32+52`, `crates/kebab-app/src/ingest_progress.rs:69`, and `crates/kebab-cli/tests/wire_schema_breakdowns.rs:35` carry the same stale value list (`markdown, pdf, image, audio, other`, with `code` missing). They lie outside the spec's frozen grep boundary (`integrations/` + `crates/kebab-cli/src` + `docs/wire-schema/v1`), so they are not commit targets for this round; a follow-up issue is recommended. + +--- + +## §1 Step table + +| Step | Title | Scope summary | Commit subject | Files touched | +|------|-------|----------------|----------------|----------------| +| 1 | Bug #6 implementation | `MOJIBAKE_MARKERS` const + `compute_valid_char_ratio()` rewrite + 2 new unit test + HOTFIXES entry | `fix(parse-pdf): strip Identity-H Unimplemented marker + dominance heuristic in compute_valid_char_ratio (Bug #6)` | `crates/kebab-parse-pdf/src/text_quality.rs`, `tasks/HOTFIXES.md` | +| 2 | Bug #7 doc-comment + SKILL.md | add `code` to the `--media` value list in the clap doc-comment + sync SKILL.md line 57 | `docs(cli): list 'code' in --media help string + SKILL.md (Bug #7)` | `crates/kebab-cli/src/main.rs`, `integrations/claude-code/kebab/SKILL.md` | +| 3 | Bug #7 CLI help assertion | `search_help_lists_code_in_media_values` test in the new test file `crates/kebab-cli/tests/cli_help_smoke.rs` | `test(cli): assert 'code' in search --help output (Bug #7 regression pin)` | `crates/kebab-cli/tests/cli_help_smoke.rs` (new) | +| 4 | Final sanity (no commit) | workspace test + workspace clippy + optional dogfood retest | (none) | none | + +--- + +## §2 Per-step detail + +### Step 1: Bug #6 implementation + +#### §2.1 Files affected + +- `crates/kebab-parse-pdf/src/text_quality.rs` (currently 103 lines: lines 1-37 body, lines 39-103 tests). +- `tasks/HOTFIXES.md` (append a dated entry). + +#### §2.2 Action + +**§2.2.1**: **rewrite** `text_quality.rs` lines 1-18 (file header comment + `compute_valid_char_ratio` body) per the diff in spec §4.1. 
Additions: + +- New const `MOJIBAKE_MARKERS: &[&str] = &["?Identity-H Unimplemented?"]` (at lines 8-12, including a comment that traces it to the lopdf 0.32.0 source). +- Four stages in the `compute_valid_char_ratio()` body: marker strip → trim-empty zero → dominance cap-0.3 → the existing ratio calculation. +- `is_valid_text_char()` (lines 20-37) is **unchanged** (signature + range list preserved). + +**§2.2.2**: **append** 2 new tests to the `text_quality.rs::tests` module (lines 39-103): + +```rust +#[test] +fn identity_h_marker_dominance_caps_ratio_below_threshold() { + let s = format!("Page 1 of 5 {}", "?Identity-H Unimplemented?".repeat(20)); + let r = compute_valid_char_ratio(&s); + assert!(r <= 0.3, "marker-dominant mixed page → ratio ≤ 0.3 (OCR fallback); got {r}"); +} + +#[test] +fn identity_h_marker_minority_with_long_valid_text_keeps_high_ratio() { + let header = "x".repeat(200); + let s = format!("{header} ?Identity-H Unimplemented?"); + let r = compute_valid_char_ratio(&s); + assert!(r > 0.9, "marker-minority page keeps high ratio; got {r}"); +} +``` + +**Important: wording correction to spec §4.2 (critic r2 NEW-1)**: the spec §4.2 phrase "Replace existing Bug #6 test set with two new tests" is stale. The current `text_quality.rs::tests` holds 8 tests, **0 of which concern the Identity-H marker**. The net change is therefore **+2 / -0**. Brief §2.1's "remove the existing test `identity_h_marker_mixed_with_some_real_text_low_ratio`" is stale in the same way; that test does not exist. The executor **preserves** all 8 existing tests (`empty_string_zero`, `pure_ascii_one`, `pure_hangul_syllables_one`, `pure_pua_zero`, `mixed_half`, `cjk_ideograph_valid`, `hangul_jamo_valid`, `f4_fixture_ratio_under_threshold`). + +**§2.2.3**: append an entry above the latest dated section of `tasks/HOTFIXES.md`: + +```markdown +## 2026-05-27: Identity-H mojibake marker bypassed OCR fallback (Bug #6) + +- **Symptom**: ingest of `metro-korea.pdf` (Identity-H CID font without ToUnicode CMap) finishes with `pdf_ocr_pages=0`. The whole text is the `?Identity-H Unimplemented?` marker repeated 1154 times (emitted by lopdf 0.32.0). The text-detect ratio is 1.0 → the 0.5 OCR fallback threshold is bypassed. 
+- **Root cause**: `is_valid_text_char()`, used by `crates/kebab-parse-pdf/src/text_quality.rs::compute_valid_char_ratio()`, treats the ASCII printable range (0x0020..=0x007E) as unconditionally valid. The marker (26 ASCII chars) is counted as valid. +- **Fix**: introduce the `MOJIBAKE_MARKERS` const + strip the markers; if what remains is trim-empty → 0.0; plus a dominance heuristic (cap at 0.3 when the stripped amount exceeds the remainder). spec ACCEPT: `docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md` §4.1. No impact on parser_version / wire schema. +- **User action**: if a mojibake-heavy PDF of the `metro-korea.pdf` class was already indexed with the v0.20.0 pre-bugfix2 binary, run `kebab ingest --force-reingest ` to invalidate the cached skip (the release notes carry the equivalent guidance). +``` + +#### §2.3 Acceptance + +Actionable verify commands (per step): + +```bash +# A) 2 new text_quality tests + 8 existing = 10, all green +CARGO_TARGET_DIR=/build/out/cargo-target/target cargo test -p kebab-parse-pdf text_quality -j 4 2>&1 | tail -10 + +# B) parse-pdf crate clean compile +CARGO_TARGET_DIR=/build/out/cargo-target/target cargo build -p kebab-parse-pdf -j 4 2>&1 | tail -3 + +# C) parse-pdf clippy clean (-D warnings) +CARGO_TARGET_DIR=/build/out/cargo-target/target cargo clippy -p kebab-parse-pdf --all-targets -j 4 -- -D warnings 2>&1 | tail -5 +``` + +Expected: the tail of A = `test result: ok. 10 passed; 0 failed`, B = `Finished`, C = 0 warnings. + +#### §2.4 Commit + +```bash +git add crates/kebab-parse-pdf/src/text_quality.rs tasks/HOTFIXES.md +git commit -m "$(cat <<'EOF' +fix(parse-pdf): strip Identity-H Unimplemented marker + dominance heuristic in compute_valid_char_ratio (Bug #6) + +Why: ingest of metro-korea.pdf (Identity-H CID font without ToUnicode +CMap) wrongly finishes with pdf_ocr_pages=0. The 26 ASCII chars of the +`?Identity-H Unimplemented?` marker emitted by lopdf 0.32.0 pass the +0x0020..=0x007E range in is_valid_text_char() → ratio=1.0 → the 0.5 +OCR fallback threshold is bypassed. + +Change: MOJIBAKE_MARKERS const + 4 stages in compute_valid_char_ratio() +(strip → trim-empty zero → dominance cap-0.3 → existing ratio). The marker +list is extensible. 
The body of is_valid_text_char() is unchanged. + +Tests: +2 unit (dominance + minority) on top of the existing 8. +No parser_version / wire schema change. + +Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md +§4.1 / §4.2 / §6 R-1. +EOF +)" +``` + +--- + +### Step 2: Bug #7 doc-comment + SKILL.md + +#### §2.5 Files affected + +- `crates/kebab-cli/src/main.rs` lines 158-160 (measured: the 3-line clap doc-comment on SearchArgs `media`). +- `integrations/claude-code/kebab/SKILL.md` line 57. + +#### §2.6 Action + +**§2.6.1**: edit the doc-comment at `crates/kebab-cli/src/main.rs` lines 158-160: + +```diff + /// p9-fb-36: filter by `assets.media_type` kind. Comma-separated. +- /// Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`, +- /// `image`, `audio`, `other`. Unknown values match nothing. ++ /// Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`, ++ /// `image`, `audio`, `code`, `other`. Unknown values match nothing. +``` + +(Correction per critic r2 NEW-2: spec §4.3 shows this as 1 line, whereas the actual clap doc-comment spans 3 lines. Keep the file's existing multi-line layout and insert `code` on line 160 between `audio` and `other`.) 
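The filter semantics that the doc-comment describes (comma-separated values, the `md` → `markdown` alias, unknown values matching nothing) can be illustrated with a small sketch. This is not kebab's actual parser: `ACCEPTED_MEDIA`, `normalize_media_filter`, and `has_known_kind` are hypothetical names introduced here, and the real code reportedly routes `code` through a separate path (see OQ-4).

```rust
/// Hypothetical list mirroring the values named in the updated help text.
const ACCEPTED_MEDIA: &[&str] = &["markdown", "pdf", "image", "audio", "code", "other"];

/// Split a comma-separated `--media` value and apply the `md` -> `markdown` alias.
/// Unknown values are kept as-is; they simply match nothing later on.
fn normalize_media_filter(raw: &str) -> Vec<String> {
    raw.split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(|s| if s == "md" { "markdown".to_string() } else { s.to_string() })
        .collect()
}

/// True when at least one requested kind is one of the accepted values.
fn has_known_kind(filter: &[String]) -> bool {
    filter
        .iter()
        .any(|kind| ACCEPTED_MEDIA.iter().any(|accepted| *accepted == kind.as_str()))
}
```

With this sketch, `--media md,code` normalizes to `markdown` and `code`, while a value such as `video` survives normalization but is not in the accepted list.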
+ +**§2.6.2**: edit `integrations/claude-code/kebab/SKILL.md` line 57: + +```diff +-`media` (string array — IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"other"`; alias `"md"` → `"markdown"`) ++`media` (string array — IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"code"` | `"other"`; alias `"md"` → `"markdown"`) +``` + +#### §2.7 Acceptance + +```bash +# A) cli crate clean compile (doc-comment edit; no compile impact expected) +CARGO_TARGET_DIR=/build/out/cargo-target/target cargo build -p kebab-cli -j 4 2>&1 | tail -3 + +# B) grep for the `code` substring in SKILL.md +grep -nF '"code"' integrations/claude-code/kebab/SKILL.md + +# C) search --help of the fresh binary exposes `code` +CARGO_TARGET_DIR=/build/out/cargo-target/target cargo build --release -p kebab-cli -j 4 2>&1 | tail -3 +/build/out/cargo-target/target/release/kebab search --help 2>&1 | grep -F 'code' +``` + +Expected: A = `Finished`, B = 1 hit on line 57, C = 1 or more lines containing `code`. + +#### §2.8 Commit + +```bash +git add crates/kebab-cli/src/main.rs integrations/claude-code/kebab/SKILL.md +git commit -m "$(cat <<'EOF' +docs(cli): list 'code' in --media help string + SKILL.md (Bug #7) + +Why: kebab search --media code has been functionally supported since +v0.18.0 (handled first-class via a path outside MEDIA_KINDS; +schema.v1.media_breakdown.code exists). However, the value list in the +SearchArgs clap doc-comment and in SKILL.md line 57 is stale: `code` is +missing. A user reading only --help could conclude that code is +unsupported. + +Change: sync 2 surfaces: the multi-line clap doc-comment at main.rs +lines 158-160 + integrations/claude-code/kebab/SKILL.md line 57. +No change to the Rust binary surface / wire schema. + +Out of scope (follow-up): crates/kebab-mcp/src/tools/search.rs:44, +crates/kebab-core/src/search.rs:32+52, crates/kebab-app/src/ +ingest_progress.rs:69, crates/kebab-cli/tests/wire_schema_breakdowns.rs:35 +carry the same stale list. They are outside the grep boundary of the +ACCEPTed spec (round 1c), so this round does not include them. + +Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md +§4.3 / §4.3a. 
+EOF +)" +``` + +--- + +### Step 3: Bug #7 CLI help assertion test + +#### §2.9 Files affected + +- `crates/kebab-cli/tests/cli_help_smoke.rs` (new; not present in the existing file list). + +#### §2.10 Action + +Create the new file. Follow the existing test convention (`cli_*` prefix, `Command::new(env!("CARGO_BIN_EXE_kebab"))` pattern; for reference see `cli_readonly_quiet.rs`, `cli_schema.rs`): + +```rust +// crates/kebab-cli/tests/cli_help_smoke.rs +// +// Regression pin: the `--media` value list in `kebab search --help` +// exposes `code`. Bug #7 (v0.20.0 bugfix round 2 spec §4.4). + +#[test] +fn search_help_lists_code_in_media_values() { + let out = std::process::Command::new(env!("CARGO_BIN_EXE_kebab")) + .args(["search", "--help"]) + .output() + .expect("kebab search --help"); + let stdout = String::from_utf8_lossy(&out.stdout); + assert!( + stdout.contains("`code`"), + "search --help must list 'code' as accepted --media value; stdout = {stdout}" + ); +} +``` + +#### §2.11 Acceptance + +```bash +# A) build + run the new test target +CARGO_TARGET_DIR=/build/out/cargo-target/target cargo test -p kebab-cli --test cli_help_smoke -j 4 2>&1 | tail -10 + +# B) clean compile of all cli crate test targets +CARGO_TARGET_DIR=/build/out/cargo-target/target cargo build -p kebab-cli --tests -j 4 2>&1 | tail -3 + +# C) cli clippy clean (-D warnings), including the new test file +CARGO_TARGET_DIR=/build/out/cargo-target/target cargo clippy -p kebab-cli --all-targets -j 4 -- -D warnings 2>&1 | tail -5 +``` + +Expected: A = `test result: ok. 1 passed; 0 failed`, B = `Finished`, C = 0 warnings. + +#### §2.12 Commit + +```bash +git add crates/kebab-cli/tests/cli_help_smoke.rs +git commit -m "$(cat <<'EOF' +test(cli): assert 'code' in search --help output (Bug #7 regression pin) + +Why: the doc-comment edit from Step 2 risks silently disappearing when +someone later reorders the value list or splits it into an alias +section. clap renders --help from the free-form doc-comment text, so a +grep-only smoke test is the only means of detection. + +Change: new test file (following the kebab-cli `cli_*` prefix convention). 
+Runs the fresh binary via CARGO_BIN_EXE_kebab and asserts the `code` +substring in stdout. Maps 1:1 to the acceptance row in spec §4.4. + +Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md §4.4 +/ §5 (acceptance row 4). +EOF +)" +``` + +--- + +### Step 4: Final sanity (no commit) + +#### §2.13 Scope + +After the 3 commits are appended, verify the whole workspace + optionally dogfood. **No commit is produced** (0 code changes, verification only). + +#### §2.14 Acceptance + +```bash +# A) full workspace test: existing 1316 + this round's +2 unit + +1 cli = 1319 expected +CARGO_TARGET_DIR=/build/out/cargo-target/target cargo test --workspace --no-fail-fast -j 1 2>&1 | tee /tmp/v0.20-bugfix2-test.log | tail -15 +echo "exit = ${PIPESTATUS[0]:-$?}" + +# B) workspace clippy clean (-D warnings) +CARGO_TARGET_DIR=/build/out/cargo-target/target cargo clippy --workspace --all-targets -j 4 -- -D warnings 2>&1 | tail -8 + +# C) (optional) dogfood retest: metro-korea.pdf +# The fresh binary build was already completed in the Step 2 acceptance. +# After --force-reingest, observe pdf_ocr_pages changing from 0 → 21+. +# OCR latency costs ≈ 10 min; the plan drafter marks this optional for the executor. +# Can be skipped if the measured corpus is user-private (KEBAB_WORKSPACE or ~/Documents/test/). +``` + +Expected: A = `test result: ok. N passed; 0 failed` (N ≥ 1319), B = 0 warnings, C = at the user's discretion (evaluated in verifier round 0). + +#### §2.15 Commit + +None (sanity-only). Once the sanity checks are green, the executor proceeds to the PR push stage. + +--- + +## §3 Verifier checklist (cumulative) + +Maps 1:1 to the 7-row acceptance criteria in spec §5. 
Actionable commands for verifier round 0: + +| # | Spec §5 criterion | Verifier command | Step coverage | Pass condition | +|---|-------------------|------------------|---------------|----------------| +| 1 | `identity_h_marker_dominance_caps_ratio_below_threshold` green | `cargo test -p kebab-parse-pdf identity_h_marker_dominance_caps_ratio_below_threshold -j 4 2>&1 \| tail -3` | Step 1 | `1 passed; 0 failed` | +| 2 | `identity_h_marker_minority_with_long_valid_text_keeps_high_ratio` green | `cargo test -p kebab-parse-pdf identity_h_marker_minority_with_long_valid_text_keeps_high_ratio -j 4 2>&1 \| tail -3` | Step 1 | `1 passed; 0 failed` | +| 3 | the 8 existing text_quality tests green (0 regressions) | `cargo test -p kebab-parse-pdf text_quality -j 4 2>&1 \| tail -5` | Step 1 | `10 passed; 0 failed` (8 existing + 2 new) | +| 4 | `search_help_lists_code_in_media_values` green | `cargo test -p kebab-cli --test cli_help_smoke -j 4 2>&1 \| tail -3` | Step 3 | `1 passed; 0 failed` | +| 5 | the `"code"` substring exists in SKILL.md | `grep -nF '"code"' integrations/claude-code/kebab/SKILL.md` | Step 2 | 1 hit on line 57 | +| 6 | full workspace test green | `cargo test --workspace --no-fail-fast -j 1 2>&1 \| tail -10` | Step 4 | `0 failed`, N ≥ 1319 | +| 7 | workspace clippy clean (-D warnings) | `cargo clippy --workspace --all-targets -j 4 -- -D warnings 2>&1 \| tail -5` | Step 4 | 0 warnings | +| 8 (optional) | dogfood retest: `pdf_ocr_pages` of metro-korea.pdf goes 0 → 21+ | manual: run `kebab ingest --force-reingest `, then inspect `items[].pdf_ocr_pages` in ingest_report.v1 | Step 4 | `pdf_ocr_pages > 0` for metro-korea.pdf row | + +The executor passes the PR push gate when rows 1-7 are all green. Row 8 is optional for verifier round 0 (weighing the availability of the user corpus + the 10 min cost). 
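Rows 1-3 exercise the four-stage ratio calculation described in §2.2.1. The following is a minimal sketch of that flow, assuming a simplified `is_valid_text_char` (its range list here is illustrative, not the crate's actual list) and assuming the dominance test compares stripped character count against the remaining character count; the authoritative version is the diff in spec §4.1, which is not reproduced in this plan.

```rust
/// Marker strings that lopdf emits in place of undecodable text.
const MOJIBAKE_MARKERS: &[&str] = &["?Identity-H Unimplemented?"];

/// Simplified stand-in for the crate's validity check (assumed ranges).
fn is_valid_text_char(c: char) -> bool {
    matches!(
        c as u32,
        0x0020..=0x007E     // ASCII printable
        | 0x1100..=0x11FF   // Hangul Jamo
        | 0x4E00..=0x9FFF   // CJK Unified Ideographs
        | 0xAC00..=0xD7A3   // Hangul syllables
    )
}

fn compute_valid_char_ratio(text: &str) -> f32 {
    // Stage 1: strip every known marker.
    let mut stripped = text.to_string();
    for marker in MOJIBAKE_MARKERS {
        stripped = stripped.replace(marker, "");
    }
    let removed = text.chars().count() - stripped.chars().count();

    // Stage 2: nothing but markers and whitespace means no usable text.
    if stripped.trim().is_empty() {
        return 0.0;
    }

    // Stage 4 (computed first so stage 3 can cap it): plain valid-char ratio.
    let total = stripped.chars().count();
    let valid = stripped.chars().filter(|c| is_valid_text_char(*c)).count();
    let ratio = valid as f32 / total as f32;

    // Stage 3: when markers outweigh the remaining text, cap the ratio at 0.3
    // so the page falls below the 0.5 OCR fallback threshold.
    if removed > total {
        ratio.min(0.3)
    } else {
        ratio
    }
}
```

Under these assumptions a marker-dominated page is capped at 0.3, while a page with 200 valid characters and a single marker keeps a ratio near 1.0, which is the behavior the two new unit tests pin.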
+ +--- + +## §4 Risks resolution (spec §6 의 plan-level) + +| ID | Spec §6 status | Plan-level action | +|----|----------------|--------------------| +| R-1 | resolved per critic r1 (lopdf 0.32.0 = marker 1 entry) | 본 plan §2.2.1 의 source comment 가 lopdf upgrade 시 re-verify trigger. | +| R-2 | resolved (`trim().is_empty()` cover) | Step 1 implementation 의 §2.2.1 4-단계 중 2-단계 = trim-empty zero. | +| R-3 | resolved (wire schema 변경 0) | parser_version `"pdf-text-v1"` / chunker_version `"pdf-page-v1.1"` 보존. version cascade 영향 0 (CLAUDE.md §Versioning cascade). | +| R-4 | resolved per critic r1 (grep boundary = `integrations/` + `crates/kebab-cli/src` + `docs/wire-schema/v1`) | Step 2 가 spec 범위 내 2 surface 모두 커버. **추가 발견 (out of scope)** → §5 OQ-4. | +| R-5 | resolved (`bulk.rs:161` alias normalize 통해 영향 0) | 본 plan 동작 변경 0. | + +추가 risk — 본 plan drafter 가 식별: + +- **R-6 (NEW)**: Step 4 의 optional dogfood retest 가 `KEBAB_WORKSPACE` 또는 user-private corpus 의존. CI 환경에서 verify 불가 — verifier round 0 가 evidence 부재 시 row 8 skip 명시 권고. + +--- + +## §5 Open questions for executor + +spec ACCEPT 가 명확하므로 OQ-1/2/3 모두 resolved. 본 plan drafter 가 추가 식별: + +- **OQ-4 (NEW)**: spec §2.2 의 R-4 grep 결과, frozen boundary 외부 surface 가 동일 stale list 보유: + - `crates/kebab-mcp/src/tools/search.rs:44` — MCP tool 의 `--media` doc. + - `crates/kebab-core/src/search.rs:32` — `MEDIA_KINDS` const = `&["markdown", "pdf", "image", "audio", "other"]`. 주의: 이 const 가 functional 일 수 있음 — `code` 는 v0.18.0 부터 separate path 로 first-class 처리 (`schema.v1.media_breakdown.code` 존재 확인 per spec §1.2). const 자체 수정은 behavior change risk 동반 → 별도 spec 으로 분리. + - `crates/kebab-core/src/search.rs:52` — `MediaFilter::media` doc-comment. + - `crates/kebab-app/src/ingest_progress.rs:69` — progress label doc-comment. + - `crates/kebab-cli/tests/wire_schema_breakdowns.rs:35` — test fixture array (functional, 변경 시 test 의미 영향). + + **executor action**: 본 round 미포함. 
PR description 또는 Step 2 commit body 에 "follow-up: open issue for stale --media value list in 5 additional surfaces" 한 줄 명시 권장. + +- **OQ-5 (NEW)**: spec §6 의 UX consequence — pre-bugfix2 v0.20.0 user 의 `--force-reingest` 권고가 release notes 에 들어가야 하며, 별도 phase (PR review/merge 시점) 의 작업. 본 plan 의 Step 1 §2.2.3 HOTFIXES entry 가 user-facing surface 의 일부 — release notes 가 HOTFIXES 의 user action 항목을 인용 가능. + +--- + +## §6 References + +- **Spec ACCEPT (parent contract)**: `docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md` (308 line, round 1c). +- **Critic round 1**: `.omc/reviews/2026-05-27-v0.20-bugfix2-spec-critic-r1-result.md` (H-1 + M-1/M-2/M-3 + L-1/L-2 + NIT-1/NIT-2 + invariant audit, 9 finding 모두 spec 에 반영). +- **Critic round 2**: `.omc/reviews/2026-05-27-v0.20-bugfix2-spec-critic-r2-result.md` (NEW-1 = §4.2 stale arithmetic, NEW-2 = §4.3 scope description drift — 본 plan §2.2.2 + §2.6.1 에 정정 반영). +- **Plan drafter brief**: `.omc/reviews/2026-05-27-v0.20-bugfix2-plan-drafter-brief.md`. +- **Parent design**: `docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md` §1.3 (text-detect threshold metric), §9 (version cascade). +- **Round 1 history**: branch `feat/pdf-scanned-ocr` HEAD = `e674ff4`, 4 commit 적층 (Bug #2 source-fs, Bug #3 chunk_id collision, Bug #3 test, Bug #4 pikepdf F4 fixture). +- **Code locations (line 실측)**: + - `crates/kebab-parse-pdf/src/text_quality.rs:1-103` (전체 file). + - `crates/kebab-cli/src/main.rs:158-160` (SearchArgs `media` clap doc-comment, 3-line multi-line attribute). + - `integrations/claude-code/kebab/SKILL.md:57` (search input filter 설명). + - `crates/kebab-cli/tests/cli_help_smoke.rs` (신규, Step 3). +- **External source**: `lopdf-0.32.0/src/document.rs:523` (`Document::decode_text` — sole emitter of `?Identity-H Unimplemented?`). + +--- + +## §7 Constraints (spec §9 + brief §9) + +1. **branch 변경 0** — plan 자체는 documentation only. 본 file = plan deliverable. +2. **spec ACCEPT frozen** — round 1c body 보수 X. 
본 plan 의 §2.2 / §2.6 의 wording 정정 (`Replace existing` → `+2 / -0 additive`) 은 plan 의 local note 로 명문, spec 본문 미변경. +3. **regression 0** — workspace test N ≥ 1319. +4. **wire schema / version cascade 변경 0** — `parser_version="pdf-text-v1"`, `chunker_version="pdf-page-v1.1"` 보존. +5. **subagent skip** — executor 가 in-session 단일 thread 실행 (worker protocol per task assignment). +6. **lightweight scope** — 본 plan 의 line target = 200-400 (round 1 plan = 849 line 의 1/3 미만). + +**Status**: DRAFT round 0 — verifier review 대기. diff --git a/docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix3-plan.md b/docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix3-plan.md new file mode 100644 index 0000000..8e9ca60 --- /dev/null +++ b/docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix3-plan.md @@ -0,0 +1,1043 @@ +--- +title: "v0.20.0 sub-item 1 bugfix round 3 — plan" +created: 2026-05-27 +status: DRAFT +round: 0 +spec_path: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix3-spec.md +parent_spec: docs/superpowers/specs/2026-04-27-kebab-final-form-design.md +brief: .omc/reviews/2026-05-27-v0.20-bugfix3-plan-drafter-brief.md +branch: feat/pdf-scanned-ocr +base_head: f763049 +step_count: 7 +commit_count: 6 +estimated_minutes: 60 +--- + +# v0.20.0 sub-item 1 bugfix round 3 — plan + +Spec ACCEPT (`2026-05-27-v0.20-sub1-bugfix3-spec.md`, 410 line, 11/11 critic finding 반영) 의 step-level decomposition. 5 bug (#9 / #10 / #11 / #13 / #14) + HOTFIXES + parent spec cross-link 까지 한 round 에서 처리. 
+
+## §0 Overview
+
+### §0.1 Scope
+
+| Bug | Severity | Surface | Type |
+|-----|----------|---------|------|
+| #9 | critical | `schema.v1.capabilities` | wire field correction (false → true) |
+| #10 | medium | `error.v1` | additive error code `config_not_found` |
+| #11 | critical UX | `config.pdf.ocr.request_timeout_secs` default | numeric default change (600 → 60) |
+| #13 | medium | `schema.v1.models` | additive array fields (backward compat) |
+| #14 | minor UX | `kebab search` / `kebab ask` input validation | additive error code path (`invalid_input`) |
+| — | — | `tasks/HOTFIXES.md` + parent spec | docs handoff for Bug #11 deviation |
+
+### §0.2 Strategy
+
+- **Per-bug commit boundary** (option A): one bug per commit, so revert and bisect stay precise. Only Step 6 is doc-only.
+- **wire schema = additive minor**. The two new fields on Bug #13 `models` are optional, with no impact on existing clients. Matches spec §3.4.
+- **parent spec frozen**: no text changes, only an inline HTML comment as a cross-link. HOTFIXES.md is the live source of truth.
+- **subagent skip**: in-session direct execution, following the worker protocol in spec §7.
+- **regression budget**: the existing 1350 workspace tests plus the 10 new tests from this round give at least 1360 tests, all green.
+
+### §0.3 Environment
+
+```bash
+cd /home/altair823/kebab
+export CARGO_TARGET_DIR=/build/out/cargo-target/target
+git status                  # working tree clean expected
+git rev-parse HEAD          # f763049
+```
+
+`-j 4` is the default (workspace memory budget). `-j 1` is an OOM fallback only, used for full workspace integration runs.
+
+---
+
+## §1 Step table
+
+| Step | Subject | Files | New tests | Commit |
+|------|---------|-------|-----------|--------|
+| 1 | Bug #9 capabilities flip | `crates/kebab-app/src/schema.rs` | 2 unit | `fix(app): flip streaming_ask + single_file_ingest capabilities to actual surface (Bug #9)` |
+| 2 | Bug #10 config_not_found error | `crates/kebab-config/src/lib.rs`, `crates/kebab-app/src/error_signal.rs`, `crates/kebab-app/src/error_wire.rs`, `crates/kebab-app/src/lib.rs` (re-export) | 1 unit + 2 integration | `fix(config): emit error.v1 code=config_not_found for missing --config path (Bug #10)` |
+| 3 | Bug #11 OCR timeout 60s | `crates/kebab-config/src/lib.rs` | 1 unit | `fix(config): pdf.ocr.request_timeout_secs default 600 → 60 per dogfood evidence (Bug #11)` |
+| 4 | Bug #13 active_parsers + active_chunkers (additive) | `crates/kebab-store-sqlite/src/store.rs` (or lib.rs), `crates/kebab-app/src/schema.rs`, `docs/wire-schema/v1/schema.schema.json`, `integrations/claude-code/kebab/SKILL.md` | 2 integration | `feat(schema): add active_parsers + active_chunkers arrays to schema.v1.models (Bug #13)` |
+| 5 | Bug #14 empty query invalid_input | `crates/kebab-cli/src/main.rs` | 2 integration | `fix(cli): empty query emits error.v1 invalid_input for search + ask (Bug #14)` |
+| 6 | HOTFIXES + parent spec cross-link | `tasks/HOTFIXES.md`, `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` | 0 | `docs(spec): HOTFIXES entry + parent spec cross-link for Bug #11 timeout deviation` |
+| 7 | Final sanity (no commit) | — | — | n/a |
+
+**Total**: 7 steps, 6 commits, 10 new tests (4 unit + 6 integration). Reaches spec §5 AC-1 ~ AC-10.
+
+---
+
+## §2 Per-step detail
+
+### Step 1 — Bug #9 capabilities flip
+
+#### §2.1.1 Files affected
+
+- `crates/kebab-app/src/schema.rs:137-151` (`capabilities_snapshot()`).
+- `crates/kebab-app/src/schema.rs` (a `#[cfg(test)] mod`, or a new `mod tests_capabilities` placed alongside the existing `mod tests_stats_ext`).
+ +#### §2.1.2 Action + +`capabilities_snapshot()` body 의 두 줄만 변경: + +```diff + fn capabilities_snapshot() -> Capabilities { + Capabilities { + json_mode: true, + ingest_progress: true, + ingest_cancellation: true, + rag_multi_turn: true, + search_cache: true, + incremental_ingest: true, +- streaming_ask: false, ++ streaming_ask: true, + http_daemon: false, + mcp_server: true, +- single_file_ingest: false, ++ single_file_ingest: true, + bulk_search: true, + } + } +``` + +`http_daemon: false` 는 보존 — 별도 sub-item 의 non-impl. spec §3.1 의 결정과 일치. + +#### §2.1.3 New tests + +`crates/kebab-app/src/schema.rs` 의 `#[cfg(test)]` 영역에 unit test 2 개 추가: + +```rust +#[test] +fn capabilities_streaming_ask_matches_cli_surface() { + // Bug #9: kebab ask --stream 가 answer_event.v1 ndjson 정상 emit (191 event 검증) → + // capabilities.streaming_ask 가 true 여야 함. + let caps = super::capabilities_snapshot(); + assert!(caps.streaming_ask, "streaming_ask must be true (Bug #9)"); +} + +#[test] +fn capabilities_single_file_ingest_matches_cli_surface() { + // Bug #9: kebab ingest-file + kebab ingest-stdin --title 양쪽 모두 + // ingest_report.v1 정상 emit → capabilities.single_file_ingest 가 true 여야 함. + let caps = super::capabilities_snapshot(); + assert!(caps.single_file_ingest, "single_file_ingest must be true (Bug #9)"); +} +``` + +추가로 `capabilities_snapshot` 가 `pub(crate)` 또는 module-internal 일 경우 `super::` path 로 접근. private 라면 같은 module 의 child mod 에서 호출 가능. + +#### §2.1.4 Per-step acceptance + +```bash +cargo test -p kebab-app capabilities_streaming_ask_matches_cli_surface -j 4 +cargo test -p kebab-app capabilities_single_file_ingest_matches_cli_surface -j 4 +cargo test -p kebab-app schema -j 4 # 기존 schema_report.rs integration 도 green 유지 +cargo clippy -p kebab-app --all-targets -- -D warnings +``` + +기존 `mod tests_stats_ext` 의 `stats_includes_breakdowns_and_bytes_on_fresh_corpus` (schema_with_config 경유) 가 `streaming_ask`/`single_file_ingest` 를 assert 안 함 — regression 없음. 
+ +#### §2.1.5 Commit + +```bash +git add crates/kebab-app/src/schema.rs +git commit -m "$(cat <<'EOF' +fix(app): flip streaming_ask + single_file_ingest capabilities to actual surface (Bug #9) + +capabilities_snapshot() 가 streaming_ask + single_file_ingest 를 hardcoded false 로 +보고했으나 실제 구현은 v0.20 final-dogfood 에서 production-grade: +- kebab ask --stream → answer_event.v1 ndjson 191 event 정상 emit +- kebab ingest-file / kebab ingest-stdin --title → ingest_report.v1 정상 + +MCP host + Claude Code skill 등 agent 가 schema.capabilities 로 routing 결정 시 +false negative → 사용자가 실제 동작 feature 를 사용 불가능하다고 오인. + +http_daemon 은 false 유지 (별도 sub-item 의 non-impl). + +EOF +)" +``` + +--- + +### Step 2 — Bug #10 ConfigNotFound + classify arm + +#### §2.2.1 Files affected + +1. `crates/kebab-config/src/lib.rs` + - 19-22 line 의 `ConfigInvalid` 정의 옆에 `ConfigNotFound` 추가. + - 688-722 line 의 `Config::load` 안 `Some(_) => Self::defaults(),` arm 을 `Some(_) => Err(...)` 로 변경. +2. `crates/kebab-app/src/error_signal.rs` + - `pub use kebab_config::ConfigNotFound;` 추가 (기존 `ConfigInvalid` 와 동등 pattern). +3. `crates/kebab-app/src/error_wire.rs` + - `classify` 안 `ConfigInvalid` arm 다음에 `ConfigNotFound` arm 추가. +4. `crates/kebab-app/src/lib.rs` + - `pub use kebab_config::ConfigInvalid;` 옆에 `pub use kebab_config::ConfigNotFound;` 추가 (기존 14 line pattern). + +#### §2.2.2 Action + +**(a) `crates/kebab-config/src/lib.rs` — error type 추가** (line ~25, `ConfigInvalid` 직후): + +```rust +/// p20-bugfix3 Bug #10: explicit `--config ` 가 missing 시 silent +/// fallback to defaults 대신 fail-fast. `kebab-app::error_wire::classify` +/// 가 downcast → `code: "config_not_found"` ErrorV1. 
+#[derive(Debug, thiserror::Error)]
+#[error("config file does not exist: {path}")]
+pub struct ConfigNotFound {
+    pub path: PathBuf,
+}
+```
+
+**(b) `crates/kebab-config/src/lib.rs:688-722`: change the `Config::load` branch**:
+
+```diff
+     pub fn load(path: Option<&Path>) -> anyhow::Result<Self> {
+         let from_disk = match path {
+             Some(p) if p.exists() => Self::from_file(p)?,
+-            Some(_) => Self::defaults(),
++            Some(p) => {
++                // Bug #10: an explicit --config that is missing must not silently fall back to defaults.
++                return Err(anyhow::Error::new(ConfigNotFound {
++                    path: p.to_path_buf(),
++                }));
++            }
+             None => {
+                 let p = Self::xdg_config_path();
+                 ...
+```
+
+Relative paths are covered: `Path::exists()` resolves relative to the cwd, which resolves spec §6 R-1 with no extra work.
+
+**(c) `crates/kebab-app/src/error_signal.rs`: re-export**:
+
+```rust
+pub use kebab_config::ConfigNotFound;
+```
+
+(Same position as the existing `ConfigInvalid` re-export. If the file already has `use kebab_config::ConfigInvalid;`, place it next to that.)
+
+**(d) `crates/kebab-app/src/error_wire.rs::classify`**, right after the `ConfigInvalid` arm:
+
+```rust
+if let Some(s) = err.downcast_ref::<ConfigNotFound>() {
+    return ErrorV1 {
+        schema_version: ERROR_V1_ID.to_string(),
+        code: "config_not_found".to_string(),
+        message: s.to_string(),
+        details: json!({
+            "path": s.path.to_string_lossy(),
+        }),
+        hint: Some(
+            "verify --config ; pass an existing toml file or omit --config to use XDG default"
+                .to_string(),
+        ),
+    };
+}
+```
+
+Add `ConfigNotFound` to the `use crate::error_signal::{ConfigInvalid, LlmError, NotIndexed};` at the top of the file.
+
+**(e) `crates/kebab-app/src/lib.rs:14`: public re-export**:
+
+```rust
+pub use kebab_config::{ConfigInvalid, ConfigNotFound};
+```
+
+#### §2.2.3 New tests
+
+**(a) `crates/kebab-config/src/lib.rs`: unit test (inside the existing `tests` mod)**:
+
+```rust
+#[test]
+fn config_load_explicit_nonexistent_path_returns_config_not_found() {
+    // Bug #10: --config /tmp/nonexistent.toml must not fall back silently.
+    let p = std::path::Path::new("/tmp/__kebab_bugfix3_nonexistent.toml");
+    assert!(!p.exists(), "test precondition: path must not exist");
+
+    let err = Config::load(Some(p)).expect_err("expected ConfigNotFound");
+    let signal = err
+        .downcast_ref::<ConfigNotFound>()
+        .expect("from_load error should downcast to ConfigNotFound");
+    assert_eq!(signal.path, p.to_path_buf());
+}
+```
+
+**(b) `crates/kebab-cli/tests/cli_error_wire.rs` or a new `crates/kebab-cli/tests/cli_config_not_found.rs`: 2 integration tests**:
+
+```rust
+use std::process::Command;
+use serde_json::Value;
+
+fn kebab_bin() -> String {
+    env!("CARGO_BIN_EXE_kebab").to_string()
+}
+
+#[test]
+fn invalid_config_path_emits_error_v1_with_nonzero_exit() {
+    let absent = "/tmp/__kebab_bugfix3_absolute_nonexistent.toml";
+    assert!(!std::path::Path::new(absent).exists());
+
+    let out = Command::new(kebab_bin())
+        .args(["search", "rust", "--config", absent, "--json"])
+        .output()
+        .expect("spawn kebab");
+
+    assert_ne!(out.status.code(), Some(0), "exit must be nonzero on missing --config");
+    let stderr = String::from_utf8_lossy(&out.stderr);
+    let last_line = stderr.lines().last().expect("error.v1 line on stderr");
+    let v: Value = serde_json::from_str(last_line)
+        .unwrap_or_else(|e| panic!("expected error.v1 ndjson on stderr: {e}\nstderr={stderr}"));
+    assert_eq!(v["schema_version"], "error.v1");
+    assert_eq!(v["code"], "config_not_found");
+    assert!(v["hint"].is_string(), "hint must be present");
+}
+
+#[test]
+fn invalid_relative_config_path_emits_config_not_found() {
+    // Bug #10 spec §6 R-1: relative paths are also covered, resolved against the cwd.
+ let tmp = tempfile::tempdir().unwrap(); + let out = Command::new(kebab_bin()) + .args(["search", "rust", "--config", "nonexistent-rel.toml", "--json"]) + .current_dir(tmp.path()) + .output() + .expect("spawn kebab"); + + assert_ne!(out.status.code(), Some(0)); + let stderr = String::from_utf8_lossy(&out.stderr); + let last_line = stderr.lines().last().expect("error.v1 line"); + let v: Value = serde_json::from_str(last_line).expect("ndjson"); + assert_eq!(v["code"], "config_not_found"); +} +``` + +기존 `cli_error_wire.rs` 의 ConfigInvalid integration test 패턴을 참고 (existing test 그대로 green 유지 — fail-fast 가 `ConfigInvalid` (file 존재 + parse 실패) 와 별개 path). + +#### §2.2.4 Per-step acceptance + +```bash +cargo test -p kebab-config config_load_explicit_nonexistent_path -j 4 +cargo test -p kebab-cli invalid_config_path_emits_error_v1_with_nonzero_exit -j 4 +cargo test -p kebab-cli invalid_relative_config_path_emits_config_not_found -j 4 +cargo test -p kebab-config -j 4 # 기존 18 test 전수 green +cargo test -p kebab-app error_wire -j 4 # 기존 classify test 전수 green (ConfigInvalid 등) +cargo clippy -p kebab-config -p kebab-app -p kebab-cli --all-targets -- -D warnings +``` + +`Config::load` 의 `None → XDG default` path 는 변경 0 — `kebab doctor` (config 없는 fresh clone) regression 없음. + +#### §2.2.5 Commit + +```bash +git add crates/kebab-config/src/lib.rs \ + crates/kebab-app/src/error_signal.rs \ + crates/kebab-app/src/error_wire.rs \ + crates/kebab-app/src/lib.rs \ + crates/kebab-cli/tests/ +git commit -m "$(cat <<'EOF' +fix(config): emit error.v1 code=config_not_found for missing --config path (Bug #10) + +이전: `kebab search "rust" --config /tmp/nonexistent.toml --json` 가 exit=0 + +`{"hits":[]}` silent fallback to XDG default. typo / wrong path 가 0-hit 으로만 +surface — debugging nightmare. + +이후: kebab_config::ConfigNotFound thiserror::Error 추가, Config::load 의 +`Some(p) if !p.exists()` arm 이 anyhow::Error::new(ConfigNotFound { path }) +return. 
kebab_app::error_wire::classify 가 downcast → ErrorV1 code=config_not_found, +hint, details.path 채워서 stderr 에 ndjson 으로 emit. + +R-1 (relative path): std::path::Path::exists() 는 cwd-relative — 별도 작업 없이 +absolute + relative 모두 cover. integration test 두 개로 검증. + +EOF +)" +``` + +--- + +### Step 3 — Bug #11 OCR timeout 60s + +#### §2.3.1 Files affected + +- `crates/kebab-config/src/lib.rs:477` (`default_pdf_ocr_request_timeout_secs`). +- `crates/kebab-config/src/lib.rs` `#[cfg(test)] mod tests` (또는 적합한 위치) — 신규 unit test 추가. + +#### §2.3.2 Action + +```diff +-fn default_pdf_ocr_request_timeout_secs() -> u64 { 600 } ++/// PDF OCR per-page request timeout 의 기본값. ++/// 6-32s 가 정상 throughput; 60s 초과는 Ollama 다운 / 매우 dense·고해상도 page 의 신호. ++/// `config.toml` 의 `[pdf.ocr] request_timeout_secs = N` 로 override. ++/// ++/// HOTFIXES 2026-05-27 (Bug #11): metro-korea.pdf dogfood 에서 page 8/13 모두 ++/// 기존 600s default 까지 완전 timeout (`chars: 0, skipped: true` × 20분 cost) → ++/// 60s 로 하향. parent spec §1000 / §1628 OQ-1 (CPU 환경 105s 의 5x 여유) 가 ++/// 가정한 "page 당 평균 105s" 보다 실측 cloud GPU Ollama 가 6-32s 로 훨씬 빠름. ++fn default_pdf_ocr_request_timeout_secs() -> u64 { 60 } +``` + +기존 470 line 의 `request_timeout_secs: default_pdf_ocr_request_timeout_secs(),` 는 동일 함수 호출이라 추가 변경 0. + +#### §2.3.3 New tests + +`crates/kebab-config/src/lib.rs` 의 `#[cfg(test)] mod tests`: + +```rust +#[test] +fn pdf_ocr_request_timeout_default_is_60s() { + // Bug #11 (dogfood 2026-05-27): default 600s → 60s. + let cfg = PdfOcrCfg::defaults(); + assert_eq!( + cfg.request_timeout_secs, 60, + "pdf.ocr.request_timeout_secs default must be 60s (Bug #11, HOTFIXES 2026-05-27)" + ); +} +``` + +기존 unit test 중 600 magic number 를 검증하는 항목이 있다면 동일 commit 안에서 60 으로 갱신. (verify: `grep -rn "request_timeout_secs.*600\|600.*request_timeout_secs" crates/kebab-config/src/` — 발견 시 그 test 만 expect 값 갱신, 새 unit test 와 같은 의미라면 기존 test 만 갱신하고 신규 test 생략 가능. 본 plan 은 spec ACCEPT 의 보수적 선택: 신규 test 도 추가해 unique name 으로 보존.) 
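The default-versus-override behavior described in §2.3.2 can be sketched as follows. This is a minimal, self-contained illustration and not the kebab-config implementation: `resolve_timeout` and its `Option<u64>` parameter are hypothetical, standing in for however the `[pdf.ocr] request_timeout_secs` value is actually deserialized.

```rust
use std::time::Duration;

/// Mirrors the new default proposed in this plan (60 seconds).
fn default_pdf_ocr_request_timeout_secs() -> u64 {
    60
}

/// Hypothetical helper (not in the source): use the value from
/// `config.toml` when the user set one, otherwise fall back to the default.
fn resolve_timeout(override_secs: Option<u64>) -> Duration {
    Duration::from_secs(override_secs.unwrap_or_else(default_pdf_ocr_request_timeout_secs))
}

fn main() {
    // No override in config.toml: the 60s default applies.
    assert_eq!(resolve_timeout(None), Duration::from_secs(60));
    // A user with dense, high-resolution pages can raise the limit.
    assert_eq!(resolve_timeout(Some(300)), Duration::from_secs(300));
    println!("default = {:?}", resolve_timeout(None));
}
```

The point of the sketch is only that lowering the default does not remove the escape hatch: an explicit value always wins over the default.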
+ +#### §2.3.4 Per-step acceptance + +```bash +cargo test -p kebab-config pdf_ocr_request_timeout_default_is_60s -j 4 +cargo test -p kebab-config -j 4 # 18 test 전수 green; 만일 기존 test 가 600 expect 면 같은 commit 에서 갱신 +cargo clippy -p kebab-config --all-targets -- -D warnings +``` + +`PdfOcrCfg::defaults()` 의 다른 field 는 변경 0 — `max_pixels` (2048), `valid_ratio_threshold` (0.5), `min_char_count` (20), `lang_hint` (`"kor"`) 보존. + +#### §2.3.5 Commit + +```bash +git add crates/kebab-config/src/lib.rs +git commit -m "$(cat <<'EOF' +fix(config): pdf.ocr.request_timeout_secs default 600 → 60 per dogfood evidence (Bug #11) + +metro-korea.pdf v0.20 final-dogfood (2026-05-27): +- page 8 + page 13 양쪽 모두 600s default 까지 완전 timeout + (`ms: 600000, chars: 0, skipped: true`) +- 결과: 본문 indexed 안 됨 + page 당 20분 cost 낭비 + +cloud GPU Ollama 의 실측 per-page throughput 는 6-32s (parent spec 가정 105s 보다 +훨씬 빠름). 60s 면 production-friendly upper-bound. dense/고해상도 page 는 +config.toml override (`[pdf.ocr] request_timeout_secs = N`) 로 user 가 늘릴 수 +있음 — Step 6 에서 HOTFIXES + parent spec cross-link. + +EOF +)" +``` + +--- + +### Step 4 — Bug #13 active_parsers + active_chunkers (additive minor) + +#### §2.4.1 Files affected + +1. `crates/kebab-store-sqlite/src/store.rs` (또는 lib.rs — `impl SqliteStore` 의 다른 fetch_* method 와 같은 file). +2. `crates/kebab-app/src/schema.rs` (`Models` struct 정의 위치 — `kebab-app` 안 `pub struct Models` 검색해 동일 file 안에 추가). +3. `docs/wire-schema/v1/schema.schema.json` — `models.properties` 에 두 array 추가. +4. `integrations/claude-code/kebab/SKILL.md` — `models` description 갱신. +5. `crates/kebab-app/tests/schema_report.rs` (또는 신규 file) — integration test 2개. + +#### §2.4.2 Action + +**(a) `crates/kebab-store-sqlite/src/store.rs` 의 `impl SqliteStore`** — 신규 method 2개: + +```rust +/// p20-bugfix3 Bug #13: schema.v1.models.active_parsers 의 source. +/// `documents.parser_version` 컬럼의 DISTINCT 값을 정렬해 반환. +/// 빈 corpus → 빈 Vec. 
+pub fn fetch_distinct_parser_versions(&self) -> anyhow::Result<Vec<String>> {
+    let conn = self.conn()?;
+    let mut stmt = conn.prepare(
+        "SELECT DISTINCT parser_version FROM documents
+         WHERE parser_version IS NOT NULL AND parser_version != ''
+         ORDER BY parser_version",
+    )?;
+    let rows = stmt.query_map([], |row| row.get::<_, String>(0))?;
+    let mut out = Vec::new();
+    for r in rows {
+        out.push(r?);
+    }
+    Ok(out)
+}
+
+/// p20-bugfix3 Bug #13: source for schema.v1.models.active_chunkers.
+/// Returns the DISTINCT values of the `chunks.chunker_version` column, sorted.
+pub fn fetch_distinct_chunker_versions(&self) -> anyhow::Result<Vec<String>> {
+    let conn = self.conn()?;
+    let mut stmt = conn.prepare(
+        "SELECT DISTINCT chunker_version FROM chunks
+         WHERE chunker_version IS NOT NULL AND chunker_version != ''
+         ORDER BY chunker_version",
+    )?;
+    let rows = stmt.query_map([], |row| row.get::<_, String>(0))?;
+    let mut out = Vec::new();
+    for r in rows {
+        out.push(r?);
+    }
+    Ok(out)
+}
+```
+
+note: if `self.conn()` is not the existing connection accessor on `SqliteStore`, reuse the connection-acquisition pattern of the existing methods in the same file (`code_lang_breakdown`, `repo_breakdown`, and `corpus_revision` are the reference models).
+
+**(b) `crates/kebab-app/src/schema.rs`: extend the `Models` struct**:
+
+```diff
+ pub struct Models {
+     pub parser_version: String,
+     pub chunker_version: String,
++    /// v0.20.1+ (Bug #13). All parser versions active in the corpus.
++    /// Empty corpus → empty Vec. Backward compat: the `parser_version` field is kept.
++    #[serde(default)]
++    pub active_parsers: Vec<String>,
++    /// v0.20.1+ (Bug #13). All chunker versions active in the corpus.
++    /// Empty corpus → empty Vec.
++    #[serde(default)]
++    pub active_chunkers: Vec<String>,
+     pub embedding_version: String,
+     pub prompt_template_version: String,
+     pub index_version: String,
+     pub corpus_revision: u64,
+ }
+```
+
+`#[serde(default)]` keeps deserialization backward compatible: a schema.v1 payload that lacks these fields (for example, one emitted by v0.20.0 or earlier) deserializes them as `Vec::new()`.
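As a rough illustration of what the two `fetch_distinct_*` queries in (a) are expected to return, the sketch below reproduces the same filtering in memory: drop NULL and empty values, deduplicate, and sort. The function `distinct_sorted_versions` and its input shape are hypothetical and exist only for this example; the byte-wise ordering assumes SQLite's default `BINARY` collation on the version columns.

```rust
use std::collections::BTreeSet;

/// Hypothetical in-memory equivalent of:
///   SELECT DISTINCT v FROM t WHERE v IS NOT NULL AND v != '' ORDER BY v
/// `None` models a SQL NULL.
fn distinct_sorted_versions(rows: &[Option<&str>]) -> Vec<String> {
    rows.iter()
        .flatten()                      // drop NULLs
        .filter(|v| !v.is_empty())      // drop empty strings
        .map(|v| v.to_string())
        .collect::<BTreeSet<String>>()  // deduplicate and sort byte-wise
        .into_iter()
        .collect()
}

fn main() {
    let rows = [
        Some("pdf-text-v1"),
        None,
        Some(""),
        Some("md-frontmatter-v2"),
        Some("pdf-text-v1"),
    ];
    let versions = distinct_sorted_versions(&rows);
    assert_eq!(versions, vec!["md-frontmatter-v2", "pdf-text-v1"]);
    // An empty corpus yields an empty list, matching the plan's contract.
    assert!(distinct_sorted_versions(&[]).is_empty());
    println!("{versions:?}");
}
```

Because the result is already deduplicated and sorted, consumers of `active_parsers` / `active_chunkers` can compare two schema reports with a plain equality check.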
+ +**(c) `crates/kebab-app/src/schema.rs:192-207` — `collect_models` 갱신**: + +```diff + fn collect_models(cfg: &Config, store: &kebab_store_sqlite::SqliteStore) -> Models { ++ let active_parsers = store.fetch_distinct_parser_versions().unwrap_or_default(); ++ let active_chunkers = store.fetch_distinct_chunker_versions().unwrap_or_default(); ++ + Models { + parser_version: kebab_parse_md::PARSER_VERSION.to_string(), + chunker_version: cfg.chunking.chunker_version.clone(), ++ active_parsers, ++ active_chunkers, + embedding_version: cfg.models.embedding.model.clone(), + prompt_template_version: cfg.rag.prompt_template_version.clone(), + index_version: kebab_store_vector::INDEX_VERSION_STR.to_string(), + corpus_revision: store.corpus_revision(), + } + } +``` + +R-3 (spec §6) 해결: `collect_models` 가 매 schema 호출마다 재계산 — cache 없음, stale 위험 없음. + +**markdown PARSER_VERSION 보존**: 기존 `parser_version` field 는 `kebab_parse_md::PARSER_VERSION` (markdown default) 그대로 — backward compat. spec §3.4 의 결정과 일치. + +**(d) `docs/wire-schema/v1/schema.schema.json` — `models.properties` 갱신**: + +```diff + "models": { + "type": "object", + "required": [ + "parser_version", "chunker_version", "embedding_version", + "prompt_template_version", "index_version", "corpus_revision" + ], + "properties": { + "parser_version": { "type": "string" }, + "chunker_version": { "type": "string" }, ++ "active_parsers": { ++ "type": "array", ++ "items": { "type": "string" }, ++ "description": "v0.20.1+ (Bug #13). 활성 parser version 전체 (DISTINCT, ORDER BY). 빈 corpus → []. backward-compat: optional, 기존 client 무영향." ++ }, ++ "active_chunkers": { ++ "type": "array", ++ "items": { "type": "string" }, ++ "description": "v0.20.1+ (Bug #13). 활성 chunker version 전체 (DISTINCT, ORDER BY). 빈 corpus → []." 
++ }, + "embedding_version": { "type": "string" }, + "prompt_template_version": { "type": "string" }, + "index_version": { "type": "string" }, + "corpus_revision": { "type": "integer", "minimum": 0 } + } + }, +``` + +`required` array 에는 추가하지 않음 — additive minor 의 정의. + +**(e) `integrations/claude-code/kebab/SKILL.md:155` — description 갱신**: + +```diff +-Returns `schema.v1`: `wire.schemas` (supported wire ids), `capabilities` (bool flags — e.g. `streaming_ask`, `rag_multi_turn`), `models` (version cascade 6-axis), `stats` (... ++Returns `schema.v1`: `wire.schemas` (supported wire ids), `capabilities` (bool flags — e.g. `streaming_ask`, `rag_multi_turn`), `models` (version cascade 6-axis + v0.20.1 `active_parsers` / `active_chunkers` arrays for multi-version corpora), `stats` (... +``` + +`description` frontmatter 는 generic 유지 — per-user trigger keyword 는 user 의 local copy 만. + +#### §2.4.3 New tests + +`crates/kebab-app/tests/schema_report.rs` (또는 신규 file `crates/kebab-app/tests/schema_active_versions.rs`): + +```rust +use kebab_app::schema_with_config; +use kebab_config::Config; + +#[test] +fn schema_models_active_arrays_empty_on_empty_corpus() { + let dir = tempfile::tempdir().unwrap(); + let mut cfg = Config::defaults(); + cfg.storage.data_dir = dir.path().to_string_lossy().into_owned(); + + let store = kebab_store_sqlite::SqliteStore::open(&cfg).unwrap(); + store.run_migrations().unwrap(); + drop(store); + + let s = schema_with_config(&cfg).unwrap(); + assert!(s.models.active_parsers.is_empty(), "empty corpus → no parsers"); + assert!(s.models.active_chunkers.is_empty(), "empty corpus → no chunkers"); + // backward compat: 기존 단일 field 는 markdown default 보존. + assert_eq!(s.models.parser_version, kebab_parse_md::PARSER_VERSION); +} + +#[test] +fn schema_emits_active_parsers_and_chunkers_array_after_mixed_ingest() { + // markdown + (선택적) code ingest 후 active_parsers/chunkers 가 비어있지 않음. 
+ // 본 test 는 kebab-app 의 ingest_with_config + schema_with_config 조합 — 기존 + // ingest_lexical.rs / code_ingest_smoke.rs 의 helper fixture 재활용 가능. + let dir = tempfile::tempdir().unwrap(); + let mut cfg = Config::defaults(); + cfg.storage.data_dir = dir.path().to_string_lossy().into_owned(); + cfg.workspace.root = { + let kb = dir.path().join("kb"); + std::fs::create_dir_all(&kb).unwrap(); + std::fs::write(kb.join("a.md"), "# A\nhello\n").unwrap(); + kb.to_string_lossy().into_owned() + }; + + // Minimal ingest — markdown only 면 active_parsers = ["md-frontmatter-v2"] + // (또는 PARSER_VERSION 의 string label) 1 entry. + kebab_app::ingest_with_config(&cfg, false).unwrap(); + + let s = schema_with_config(&cfg).unwrap(); + assert!(!s.models.active_parsers.is_empty(), "active_parsers populated after ingest"); + assert!(!s.models.active_chunkers.is_empty(), "active_chunkers populated after ingest"); + // ORDER BY → sorted (lex order). + let mut sorted = s.models.active_parsers.clone(); + sorted.sort(); + assert_eq!(s.models.active_parsers, sorted, "active_parsers must be sorted"); +} +``` + +note: `kebab_app::ingest_with_config` 정확한 시그니처 (`fn(cfg: &Config, summary_only: bool)` 또는 `fn(scope: SourceScope, summary_only: bool)`) 는 기존 `ingest_lexical.rs` 의 helper 와 동일 pattern 으로 — executor 가 in-tree resolution. 
+ +#### §2.4.4 Per-step acceptance + +```bash +cargo test -p kebab-store-sqlite fetch_distinct -j 4 # 신규 store method (있으면) +cargo test -p kebab-app schema_models_active_arrays_empty_on_empty_corpus -j 4 +cargo test -p kebab-app schema_emits_active_parsers_and_chunkers_array_after_mixed_ingest -j 4 +cargo test -p kebab-app schema -j 4 # 기존 schema_report.rs 전수 green (특히 stats_includes_*) +cargo clippy -p kebab-store-sqlite -p kebab-app --all-targets -- -D warnings + +# JSON schema lint (additive minor check) +python3 -c "import json; json.load(open('docs/wire-schema/v1/schema.schema.json'))" +``` + +`stats_includes_breakdowns_and_bytes_on_fresh_corpus` 가 `s.models` 를 assert 안 함 — regression 없음. backward compat: 기존 `parser_version` / `chunker_version` 값 보존. + +#### §2.4.5 Commit + +```bash +git add crates/kebab-store-sqlite/src/ \ + crates/kebab-app/src/schema.rs \ + crates/kebab-app/tests/ \ + docs/wire-schema/v1/schema.schema.json \ + integrations/claude-code/kebab/SKILL.md +git commit -m "$(cat <<'EOF' +feat(schema): add active_parsers + active_chunkers arrays to schema.v1.models (Bug #13) + +이전: schema.v1.models 가 parser_version / chunker_version 단일 값만 보고 → +multi-medium corpus (md + pdf + code Rust/Python + dockerfile + k8s + manifest) +의 version cascade audit 누락 risk. + +이후: additive minor — Models struct 에 active_parsers + active_chunkers Vec +추가. backward compat: 기존 단일 field 보존 (markdown default), 신규 array 는 +optional (#[serde(default)] + JSON schema required 미포함). + +source: +- kebab_store_sqlite::fetch_distinct_parser_versions() 가 + documents.parser_version DISTINCT + ORDER BY 반환. +- fetch_distinct_chunker_versions() 가 chunks.chunker_version 동일 pattern. +- collect_models 가 매 schema 호출마다 재계산 (cache 없음 — R-3 자동 해결). + +wire schema additive only — 메이저 bump 불필요. v0.20.1 minor 로 충분. +integrations/claude-code/kebab/SKILL.md 동기 갱신. 
+ +EOF +)" +``` + +--- + +### Step 5 — Bug #14 empty query (search + ask) + +#### §2.5.1 Files affected + +- `crates/kebab-cli/src/main.rs:818-826` (search arm 의 query_text 해석부). +- `crates/kebab-cli/src/main.rs:990` 부근 (ask arm 의 Config::load 직후 — query 변수가 이미 `&String` available). +- `crates/kebab-cli/tests/` — 신규 integration test 2개 (또는 `cli_error_wire.rs` 안 추가). + +#### §2.5.2 Action + +**(a) search arm** (line 821-826): + +```diff + // p9-fb-42: bulk mode requires no query; single-query mode requires query. + let query_text = match query.as_ref() { +- Some(q) => q.clone(), ++ Some(q) if q.trim().is_empty() => { ++ return Err(anyhow::Error::new(kebab_app::StructuredError( ++ kebab_app::ErrorV1 { ++ schema_version: kebab_app::ERROR_V1_ID.to_string(), ++ code: "invalid_input".to_string(), ++ message: "query is empty; provide a non-empty search term or use --bulk".into(), ++ details: serde_json::Value::Null, ++ hint: Some("e.g. `kebab search 'rust async'` or `kebab search --bulk < queries.ndjson`".into()), ++ }, ++ ))); ++ } ++ Some(q) => q.clone(), + None => { + return Err(anyhow::anyhow!("query is required unless --bulk is set")); + } + }; +``` + +`--bulk` mode 우선 — 기존 line 730 의 `if *bulk { ... return Ok(()); }` 가 먼저라 empty query check 가 영향 0. + +**(b) ask arm** (line 990 의 `let cfg = ...` 직후): + +```diff + } => { + let cfg = kebab_config::Config::load(cli.config.as_deref())?; ++ if query.trim().is_empty() { ++ return Err(anyhow::Error::new(kebab_app::StructuredError( ++ kebab_app::ErrorV1 { ++ schema_version: kebab_app::ERROR_V1_ID.to_string(), ++ code: "invalid_input".to_string(), ++ message: "query is empty; provide a non-empty prompt".into(), ++ details: serde_json::Value::Null, ++ hint: Some("e.g. `kebab ask \"explain this code\"`".into()), ++ }, ++ ))); ++ } + if *stream { +``` + +`query: String` (Some 강제 — line 206) 라 `.trim()` 직접 호출 가능. 
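The guard added to both arms in §2.5.2 boils down to one predicate, `trim().is_empty()`. A minimal sketch of that check as a pure function follows; `validate_query` and `QueryError` are hypothetical names used only here, since the plan inlines the check in `main.rs` and returns a `StructuredError` instead.

```rust
/// Hypothetical error type for this sketch; the plan itself emits
/// error.v1 with code `invalid_input`.
#[derive(Debug, PartialEq)]
enum QueryError {
    Empty,
}

/// Rejects queries that are empty or whitespace-only. A valid query is
/// returned unchanged (untrimmed), mirroring the plan's `q.clone()`.
fn validate_query(raw: &str) -> Result<&str, QueryError> {
    if raw.trim().is_empty() {
        Err(QueryError::Empty)
    } else {
        Ok(raw)
    }
}

fn main() {
    // `str::trim` strips Unicode whitespace, so tabs, newlines, and
    // the ideographic space U+3000 are all treated as empty input.
    for bad in ["", "   ", "\t\n", "\u{3000}"] {
        assert_eq!(validate_query(bad), Err(QueryError::Empty));
    }
    assert_eq!(validate_query(" rust async "), Ok(" rust async "));
    println!("ok");
}
```

One design consequence worth noting: because the check uses `trim()`, a query made only of full-width spaces (common with CJK input methods) is rejected the same way as an ASCII-blank one.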
+ +#### §2.5.3 New tests + +`crates/kebab-cli/tests/cli_empty_query.rs` (신규) 또는 `cli_error_wire.rs` 안: + +```rust +use std::process::Command; +use serde_json::Value; + +fn kebab_bin() -> String { + env!("CARGO_BIN_EXE_kebab").to_string() +} + +fn parse_error_v1(stderr: &str) -> Value { + let last = stderr.lines().last().expect("stderr ndjson"); + serde_json::from_str(last).unwrap_or_else(|e| panic!("expected ndjson: {e}\n{stderr}")) +} + +#[test] +fn search_empty_query_emits_invalid_input() { + for q in ["", " "] { + let out = Command::new(kebab_bin()) + .args(["search", q, "--json"]) + .output() + .expect("spawn"); + assert_ne!(out.status.code(), Some(0), "empty/whitespace query must fail: {q:?}"); + let stderr = String::from_utf8_lossy(&out.stderr); + let v = parse_error_v1(&stderr); + assert_eq!(v["schema_version"], "error.v1"); + assert_eq!(v["code"], "invalid_input", "stderr={stderr}"); + } +} + +#[test] +fn ask_empty_query_emits_invalid_input() { + let out = Command::new(kebab_bin()) + .args(["ask", "", "--json"]) + .output() + .expect("spawn"); + assert_ne!(out.status.code(), Some(0)); + let stderr = String::from_utf8_lossy(&out.stderr); + let v = parse_error_v1(&stderr); + assert_eq!(v["code"], "invalid_input"); +} +``` + +#### §2.5.4 Per-step acceptance + +```bash +cargo test -p kebab-cli search_empty_query_emits_invalid_input -j 4 +cargo test -p kebab-cli ask_empty_query_emits_invalid_input -j 4 +cargo test -p kebab-cli -j 4 # 기존 cli_error_wire / cli_help_smoke / ... 전수 green +cargo clippy -p kebab-cli --all-targets -- -D warnings +``` + +`--bulk < ndjson` 의 empty stdin path 는 spec §2.2 의 별도 case — 본 fix 범위 외 (`bulk` arm 이 query 무시). 
+ +#### §2.5.5 Commit + +```bash +git add crates/kebab-cli/src/main.rs crates/kebab-cli/tests/ +git commit -m "$(cat <<'EOF' +fix(cli): empty query emits error.v1 invalid_input for search + ask (Bug #14) + +이전: `kebab search "" --json` / `kebab search " " --json` / `kebab ask "" --json` +모두 exit=0 + silent 0 hit (search) 또는 LLM 빈 prompt round-trip (ask). user +mistake (typo, shell expansion 실수) 가 silent → debugging 비용. + +이후: 양쪽 arm 에서 `query.trim().is_empty()` → kebab_app::StructuredError +(ErrorV1, code=invalid_input, hint 포함). exit=2 (StructuredError → 기존 +exit_code() 의 generic non-zero path). + +--bulk mode 는 영향 0 (bulk arm 이 query 무시). + +EOF +)" +``` + +--- + +### Step 6 — HOTFIXES + parent spec cross-link (Bug #11 deviation) + +#### §2.6.1 Files affected + +- `tasks/HOTFIXES.md` — dated entry append. +- `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` — inline HTML 주석 1 줄 (PDF OCR config 또는 OQ-1 의 timeout 600 언급 위치 가까이). + +#### §2.6.2 Action + +**(a) `tasks/HOTFIXES.md` — append dated subsection**: + +기존 HOTFIXES.md "How to add an entry" 섹션 직전에 5-field 형식으로 추가: + +```markdown +## 2026-05-27 — PDF OCR `request_timeout_secs` default 600s → 60s (Bug #11) + +**Discovered**: v0.20.0 final dogfood (2026-05-27), metro-korea.pdf 의 page 8 + 13. + +**Symptom**: 두 page 모두 `kebab ingest` 가 600s 까지 완전 timeout (`ms: 600000, chars: 0, skipped: true`). 본문 indexed 안 됨, page 당 20분 cost 낭비, user 가 ingest 완료 signal 못 받음. + +**Root cause**: `default_pdf_ocr_request_timeout_secs() = 600` (parent spec `2026-04-27-kebab-final-form-design.md` §1000 + §1628 OQ-1 의 "CPU 환경 105s 의 5x 여유" 가정). 실측 cloud GPU Ollama 의 per-page throughput 는 6-32s — 600s 까지 가야 timeout 이라면 Ollama 다운 상태가 사실상 확실. 600s 가 fail-fast 신호로 작동 안 함. + +**Fix** (v0.20.0 bugfix3 round 3, branch `feat/pdf-scanned-ocr`): +- `crates/kebab-config/src/lib.rs:477` `default_pdf_ocr_request_timeout_secs() = 60`. +- Doc-comment 보강 — 6-32s 정상 throughput, 60s 초과는 Ollama 다운 / 매우 dense·고해상도 page 신호. 
+- User override path 보존 — `config.toml [pdf.ocr] request_timeout_secs = N` 로 늘릴 수 있음 (release notes 에 명문). + +**Amends**: `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` §1000 / §1628 OQ-1 (parent spec frozen — text 변경 없음, inline HTML 주석 cross-link 1 줄만 추가). 본 entry 가 live source of truth. +``` + +날짜 헤더의 위치는 기존 entries 의 시간순 (가장 최근이 file 위쪽 또는 아래쪽 — 본 file 의 `## 2026-05-01 —` 이후로 이어지는 자리). executor 가 file head + 가장 최근 dated subsection 의 위치 보고 정확한 anchor 삽입. + +**(b) parent spec inline 주석** — frozen text 보존: + +`docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` 의 PDF OCR config block (또는 OQ-1 timeout 언급 위치, brief 의 §1000/§1628 reference) 에 다음 1 줄 HTML 주석 inline 추가: + +```markdown + +``` + +위치: executor 가 `grep -n "request_timeout_secs\|600\|pdf.*timeout\|OQ-1" docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` 로 가장 가까운 anchor 식별 후 그 line 의 직후 또는 같은 paragraph 끝에 inline 으로 삽입. **frozen text 의 prose 자체는 변경 0** — HTML 주석 (``) 은 markdown render 시 invisible. + +#### §2.6.3 New tests + +없음 (docs-only commit). + +#### §2.6.4 Per-step acceptance + +```bash +# parent spec 의 prose diff 가 주석 1 줄 외에는 0 인지 확인: +git diff docs/superpowers/specs/2026-04-27-kebab-final-form-design.md | grep -E "^[+-][^<]" | head -20 +# 위 결과는 모두 "+" 만 보여야 함. + +# HOTFIXES entry markdown render 검증 (link sanity): +python3 -c "import pathlib; t = pathlib.Path('tasks/HOTFIXES.md').read_text(); assert '2026-05-27 — PDF OCR' in t and 'Bug #11' in t" + +# (optional) markdownlint 가 repo 에 wired 되어 있으면 양쪽 file 에 대해 실행. +``` + +#### §2.6.5 Commit + +```bash +git add tasks/HOTFIXES.md docs/superpowers/specs/2026-04-27-kebab-final-form-design.md +git commit -m "$(cat <<'EOF' +docs(spec): HOTFIXES entry + parent spec cross-link for Bug #11 timeout deviation + +Bug #11 (이전 commit `fix(config): pdf.ocr.request_timeout_secs default 600 → 60`) +의 frozen-spec deviation handoff. 
+ +- tasks/HOTFIXES.md: 2026-05-27 dated subsection — Discovered / Symptom / Root cause / + Fix / Amends 5-field 포맷 (기존 entries 와 일치). +- docs/superpowers/specs/2026-04-27-kebab-final-form-design.md: PDF OCR config block + (§1000 / §1628 OQ-1 부근) 에 inline HTML 주석 1 줄 cross-link. prose 변경 0 — + parent spec frozen contract 보존, HTML 주석은 markdown render 시 invisible. + +HOTFIXES entry 가 live source of truth (CLAUDE.md "Spec contract" 규칙). + +EOF +)" +``` + +--- + +### Step 7 — Final sanity (no commit) + +#### §2.7.1 Workspace-wide check + +```bash +# 전체 빌드 + clippy 한 번에: +cargo build --workspace --release -j 4 +cargo clippy --workspace --all-targets -- -D warnings + +# 전체 test (-j 1 — 18 integration-test binary 의 link OOM 방지): +cargo test --workspace --no-fail-fast -j 1 +``` + +기준: 기존 1350 test + 본 round 새 +7 test = **1357+ test, 모두 green**. fail 시 step-별로 어디서 regression 인지 isolate. + +#### §2.7.2 (Optional) Dogfood retest + +지금 fresh release binary 가 이미 bugfix2 round 까지 반영 (`/build/out/cargo-target/target/release/kebab`). bugfix3 commit 후 release rebuild + 5 bug 별 수동 smoke: + +```bash +# Bug #9: capabilities 둘 다 true. +kebab schema --json | jq '.capabilities | {streaming_ask, single_file_ingest}' + +# Bug #10: nonexistent --config → exit≠0 + error.v1 code=config_not_found. +kebab search rust --config /tmp/nope.toml --json; echo "exit=$?" + +# Bug #11: defaults 의 timeout 60. +kebab config dump 2>/dev/null | grep request_timeout_secs # (또는 init template 확인) + +# Bug #13: mixed corpus 에서 active_parsers/chunkers 둘 다 populate. +kebab schema --json | jq '.models | {active_parsers, active_chunkers}' + +# Bug #14: empty query 양쪽 모두 invalid_input. +kebab search "" --json; echo "exit=$?" +kebab ask "" --json; echo "exit=$?" +``` + +dogfood 는 optional — workspace test green + clippy clean 가 commit 의 충분 조건. dogfood 결과는 final round 의 review 단계에서 캡쳐. + +#### §2.7.3 Branch state + +```bash +git log --oneline -7 +# (예상) +#
docs(spec): HOTFIXES entry + parent spec cross-link for Bug #11 timeout deviation +# fix(cli): empty query emits error.v1 invalid_input for search + ask (Bug #14) +# feat(schema): add active_parsers + active_chunkers arrays to schema.v1.models (Bug #13) +# fix(config): pdf.ocr.request_timeout_secs default 600 → 60 per dogfood evidence (Bug #11) +# fix(config): emit error.v1 code=config_not_found for missing --config path (Bug #10) +#
fix(app): flip streaming_ask + single_file_ingest capabilities to actual surface (Bug #9) +# f763049 test(cli): assert 'code' in search --help output (Bug #7 regression pin) +``` + +6 commits (Step 6 = 1 doc commit). The PR is force-updated on the existing `feat/pdf-scanned-ocr` branch (base=main; see "PR #189 force-update" in spec §0). + +--- + +## §3 Verifier checklist + +Cumulative: the verifier runs these checks after Step 7 completes. + +| # | Criterion | Command | Expected | Spec AC | +|----|----------|---------|----------|---------| +| V-1 | both capabilities flags are true (`streaming_ask` + `single_file_ingest`) | `cargo test -p kebab-app -j 4 -- capabilities_streaming_ask_matches_cli_surface capabilities_single_file_ingest_matches_cli_surface` | green | AC-1 | +| V-2 | absolute missing `--config` → exit≠0 + error.v1 code=config_not_found | `cargo test -p kebab-cli invalid_config_path_emits_error_v1_with_nonzero_exit -j 4` | green | AC-2 | +| V-3 | relative missing `--config` → exit≠0 + error.v1 code=config_not_found | `cargo test -p kebab-cli invalid_relative_config_path_emits_config_not_found -j 4` | green | AC-7 | +| V-4 | OCR timeout default 60s | `cargo test -p kebab-config pdf_ocr_request_timeout_default_is_60s -j 4` | green | AC-3 | +| V-5 | active_parsers/chunkers populate on a mixed corpus + empty arrays on an empty corpus | `cargo test -p kebab-app -j 4 -- schema_emits_active_parsers_and_chunkers_array_after_mixed_ingest schema_models_active_arrays_empty_on_empty_corpus` | green | AC-4 | +| V-6 | empty query (search "" + " ") → invalid_input | `cargo test -p kebab-cli search_empty_query_emits_invalid_input -j 4` | green | AC-5 | +| V-7 | empty query (ask "") → invalid_input | `cargo test -p kebab-cli ask_empty_query_emits_invalid_input -j 4` | green | AC-6 | +| V-8 | all workspace tests green | `cargo test --workspace --no-fail-fast -j 1` | exit=0, 1357+ tests green | AC-8 | +| V-9 | clippy clean | `cargo clippy --workspace --all-targets -- -D warnings` | exit=0, 0 warnings | AC-7 spec layer | +| V-10 | wire schema additive 
minor valid (parses + `required` unchanged) | `python3 -c "import json; s=json.load(open('docs/wire-schema/v1/schema.schema.json')); assert 'active_parsers' in s['properties']['models']['properties']; assert 'active_parsers' not in s['properties']['models']['required']"` | exit=0 | AC-9 | +| V-11 | SKILL.md updated in sync (mentions active_*) | `grep -q "active_parsers\|active_chunkers" integrations/claude-code/kebab/SKILL.md` | exit=0 | AC-10 | +| V-12 | HOTFIXES entry + parent spec cross-link exist | `grep -q "2026-05-27 — PDF OCR" tasks/HOTFIXES.md && grep -q "HOTFIX 2026-05-27: pdf.ocr" docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` | exit=0 | AC-8 spec layer | +| V-13 | 0 prose changes in the parent spec (HTML comment only; the `---` / `+++` diff file headers are filtered out first, since they would otherwise match) | `git diff main -- docs/superpowers/specs/2026-04-27-kebab-final-form-design.md \| grep -v -e "^+++ " -e "^--- " \| grep -E "^[+-][^<]" \| wc -l` | 0 | constraint §7.2 | +| V-14 | (optional manual) no `kebab ask --stream` regression | `kebab ask --stream "what is rust" 2>&1 \| head -3` | answer_event.v1 ndjson | AC-10 manual | + +V-1 through V-13 are automated (V-13 expects a diff line count of 0); V-14 is manual and optional. + +--- + +## §4 Risks resolution (plan-level view of spec §6) + +| Risk | Resolution | Step | Verification | +|------|-----------|------|--------------| +| R-1 — config layer ordering (kebab-config error vs kebab-app classify) | Option (a) of spec §3.2: a `kebab-config` error type of its own + a `kebab-app` downcast. Mirrors the existing `ConfigInvalid` pattern as-is (re-export across 4 files: error_signal.rs, lib.rs, error_wire.rs, classify). | Step 2 | V-2 + V-3 + the existing error_wire test (config_invalid_classifies_to_config_invalid_code) is preserved. | +| R-2 — config relative path (cwd-relative) | `std::path::Path::exists()` resolves relative paths against the cwd, so no extra work is needed. The integration test covers both absolute and relative paths via `tempfile::tempdir() + current_dir(...)`. | Step 2 | V-3. | +| R-3 — active_* cache invalidation | `collect_models` queries the store directly on every schema call; there is no cache, so R-3 is N/A. | Step 4 | V-5 (both empty and mixed corpus). If caching is added later, document corpus_revision invalidation. 
| +| R-4 — corpus shrink 시 stale | 위와 동일 (every-call 재계산). | Step 4 | V-5. | +| R-5 — 60s 도 dense/고해상도 page timeout 가능 | mitigation: config.toml `[pdf.ocr] request_timeout_secs = N` override path 유지. 새 release notes 명문 (v0.20.1 minor bump 시). | Step 3 + Step 6 | V-4 (default value test). release notes 는 본 plan 범위 외 — gitea-release 시 cover. | + +추가 OQ — Step 2 의 ConfigNotFound 가 `Config::from_file` 의 read_failed (file 존재하나 IO error) 와 구분되는가? **결정**: 본 fix 는 `!p.exists()` path 만 처리. file 은 존재하나 permission denied 등 IO error 는 기존 `ConfigInvalid::read_failed` (line 729-733) path 그대로 — 두 error 가 명확히 disjoint. + +--- + +## §5 Open questions for executor + +1. **Step 2 (b)** — `ConfigNotFound` 추가 시 `crates/kebab-app/src/error_signal.rs` 의 정확한 export shape: 기존 `pub use kebab_config::ConfigInvalid;` 가 그 file 에 있는지 vs `lib.rs` 직접 인지 grep 으로 확인. 기존 pattern 그대로 mirror. +2. **Step 3** — 기존 unit test 중 `request_timeout_secs.*600` 가 있는지 `grep -rn "request_timeout_secs.*600\|600.*request_timeout_secs" crates/kebab-config/` 로 사전 확인. 발견 시 같은 commit 안에서 expect 값을 60 으로 갱신 (별도 commit 금지 — atomic). +3. **Step 4 (a)** — `SqliteStore::conn()` 의 정확한 method 이름이 다른 file 에 따라 (`get_conn`, `connection`, `with_conn` 등) 다를 수 있음. 기존 `code_lang_breakdown` / `repo_breakdown` impl 옆에 같은 pattern 으로 추가 — connection 획득 line 그대로 복붙. +4. **Step 4 (e)** — `integrations/claude-code/kebab/SKILL.md` 의 schema description 갱신 범위: brief §0.1 의 "6-axis → 8-axis 또는 '+ active arrays'" 중 후자 채택 (6-axis 라는 표현이 다른 곳에 인용될 수 있어 number 를 늘리는 대신 "+ array" 추가 형식 — backward compat 표현). +5. **Step 6 (b)** — parent spec inline 주석의 정확한 anchor: brief 의 "§1000 / §1628 OQ-1" 은 spec body 안 section number 가 아닌 line number 일 가능성. executor 가 `wc -l docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` + `grep -n "request_timeout_secs\|OQ-1\|600" docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` 결과로 가장 가까운 anchor 식별 후 inline insertion. **prose 변경 0** 이 핵심 invariant. +6. 
**Step 5 — `--bulk` precedence**: 기존 line 730 (`Some(bulk)` 분기) 이 query empty check 보다 먼저인지 확인. 본 plan 은 그렇다고 가정 (line 818 에서 cfg load 가 일어나기 전에 bulk return) — false 면 bulk path 가 empty check 의 영향을 받음. + +--- + +## §6 References + +- spec: `docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix3-spec.md` (410 line, ACCEPT 11/11) +- brief: `.omc/reviews/2026-05-27-v0.20-bugfix3-plan-drafter-brief.md` +- prior critic rounds: + - `.omc/reviews/2026-05-27-v0.20-bugfix3-spec-closure-result.md` + - `.omc/reviews/2026-05-27-v0.20-bugfix3-spec-closure-r2-result.md` +- source dogfood report: `.omc/reviews/2026-05-27-v0.20-final-dogfood-report.md` +- code anchors: + - `crates/kebab-app/src/schema.rs:137-151` (capabilities_snapshot) + - `crates/kebab-app/src/schema.rs:192-207` (collect_models) + - `crates/kebab-config/src/lib.rs:19-22` (ConfigInvalid pattern model) + - `crates/kebab-config/src/lib.rs:477` (default_pdf_ocr_request_timeout_secs) + - `crates/kebab-config/src/lib.rs:688-722` (Config::load) + - `crates/kebab-app/src/error_wire.rs:49-104` (classify) + - `crates/kebab-cli/src/main.rs:206` (Ask query: String) + - `crates/kebab-cli/src/main.rs:718` (Cmd::Search) + - `crates/kebab-cli/src/main.rs:818-826` (search arm Config::load + query_text) + - `crates/kebab-cli/src/main.rs:977-990` (Cmd::Ask) + - `docs/wire-schema/v1/schema.schema.json:30-44` (Models object) + - `integrations/claude-code/kebab/SKILL.md:155` (schema description) + - `tasks/HOTFIXES.md` (dated entries pattern) + +--- + +## §7 Constraints (spec §7 mirror) + +1. **branch 변경 0** — 모든 commit 이 `feat/pdf-scanned-ocr` 에 올라감. base=main 의 force-update 만 (Step 7 후 PR push). +2. **spec ACCEPT (frozen contract) 변경 0** — `docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix3-spec.md` 본 plan 안에서 read-only. +3. **regression 0** — 기존 workspace test (~1350) 전수 green + 본 round 새 +7 test 추가 (총 1357+). +4. 
**wire schema = additive minor** — Bug #13 의 `active_parsers` / `active_chunkers` 는 optional, JSON schema `required` 미포함, `#[serde(default)]`. v0.20.1 minor bump 로 충분. +5. **parent spec text 변경 = inline HTML 주석 1 줄만** — frozen prose 보존, HOTFIXES.md 가 live source of truth. +6. **subagent skip** — direct in-session 작성, nested worker spawn 금지 (worker protocol). +7. **commit message style**: 기존 commit log (`f763049 test(cli): ...`, `8cf73d1 docs(cli): ...`, `a58ee10 fix(parse-pdf): ...`) 의 `kind(scope): subject (Bug #N)` pattern 그대로. body 는 Why + What — 본 plan 의 commit block 그대로 사용. +8. **estimated time**: 60 min — Step 1 (5min) + Step 2 (15min) + Step 3 (5min) + Step 4 (20min) + Step 5 (10min) + Step 6 (5min) + Step 7 sanity. spec 의 30-45 min 보다 보수적. + +--- + +_End of plan._ diff --git a/docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md b/docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md new file mode 100644 index 0000000..f54af3b --- /dev/null +++ b/docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md @@ -0,0 +1,965 @@ +--- +title: v0.20.0 sub-item 1 bugfix — chunk_id collision + walker code limit + F4 fixture +created: 2026-05-27 +status: ACCEPT (round 2 closure — Phase A complete) +target_version: 0.20.0 (PR #189 force-update) +parent_spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md +dogfood_evidence: .omc/reviews/2026-05-27-v0.20-dogfood-report.md +review_history: + - "2026-05-27 spec round 1 critic (opus, thorough) — ACCEPT, HIGH 0 + MEDIUM 3 + LOW 2 + NIT 2" + - "2026-05-27 spec round 1c rewrite (opus, drafter) — MEDIUM/LOW/NIT all applied" + - "2026-05-27 spec round 2 closure critic (opus) — ACCEPT, 7/7 applied + 1 NIT (frontmatter status, applied here)" +--- + +# v0.20.0 sub-item 1 bugfix — chunk_id collision + walker code limit + F4 fixture + +본 spec 은 v0.20.0 sub-item 1 (scanned PDF OCR) 의 PR #189 dogfood 에서 발견된 +3 bug 의 root cause 분석 + fix design + acceptance criteria 를 명문화한다. +후속 plan + executor 단계의 source 다. 
+ +## §1 Background + dogfood evidence chain + +### §1.1 dogfood 환경 + +| 항목 | 값 | +|------|----| +| Binary | `kebab v0.20.0` (commit `b4d9e60`) | +| Ollama endpoint | `http://192.168.0.47:11434` (qwen2.5vl:3b) | +| Isolated KB | `/build/cache/tmp/v0.20-dogfood/` | +| Corpus | 9 PDF (PoC + sub-item fixture + 3 user PDF, 466 KB ~ 58 MB) | + +### §1.2 3 bug 의 reproducibility + +| Bug | Severity | Trigger | Reproducible | +|-----|----------|---------|--------------| +| #2 walker code limit | Important | 256 KB+ PDF/image/markdown ingest | 항상 (default config) | +| #3 chunk_id collision | **Critical** | scanned_page2.pdf (1580 OCR chars) ingest | force-reingest 마다 | +| #4 F4 Pages tree | Important | mojibake.pdf (F4 fixture) ingest | 항상 | + +### §1.3 dogfood report 인용 + +dogfood report (`.omc/reviews/2026-05-27-v0.20-dogfood-report.md`) 의 핵심 인용: + +- Bug #2: `scanned=3, skipped_size_exceeded=6` — workspace 9 PDF 중 3 만 통과, + 6 PDF (F1 466KB / F2 756KB / metro 57MB / thermal-pos 1.1MB / thermal-label + 2.7MB / internals 820KB) walker 단계 skip. +- Bug #3: `"DocumentStore::put_chunks (pdf): sqlite error: UNIQUE constraint + failed: chunks.chunk_id: ... Error code 1555: A PRIMARY KEY constraint + failed"` — scanned_page2.pdf chunk INSERT 단계에서 발생. +- Bug #4: `block_count: 0, chunk_count: 0` — F4 mojibake.pdf 의 ingest 결과 + 가 PdfTextExtractor 의 "1 paragraph per page" invariant 위반. + +## §2 Goals + non-goals + +### §2.1 Goals + +- 3 bug 모두 v0.20.0 안 fix (PR #189 force-update path — 새 commit 들이 같은 + branch 위에 stack). +- parent spec + (`docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md`) 의 invariant 보존: + - §1.4 PdfTextExtractor 의 "1 Block::Paragraph per page". + - §3.5 post-extract OCR enrichment 의 block_id 보존 (in-place mutate path). + - §4.6 wire schema additive 만 (V00X migration 불필요). +- parent plan + (`docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md`) round 1c ACCEPT + 의 design decisions 와 충돌 0. +- workspace test regression 0 (`cargo test --workspace -j 1`). 
+ +### §2.2 Non-goals + +- A new wire schema major bump (v1 → v2): these fixes introduce no additional schema changes. +- A new V00X sqlite migration: the `chunks` table DDL is unchanged; the fix is limited to the + chunk_id computation path. +- Changing the F4 fixture invariants (the absent-ToUnicode-CMap + valid 1-page PDF + requirements stay). +- Adding a new config knob (a per-media-type limit such as `[ingest.pdf].max_file_bytes` + is v0.21+ scope; this fix only separates the walker's code path). + +## §3 Bug #2 — walker code limit + +### §3.1 Root cause (file:line evidence) + +`crates/kebab-source-fs/src/connector.rs:42-72`: the `FsSourceConnector` constructor +(`FsSourceConnector::new`) reads both `max_file_bytes` and `max_file_lines` from the +single `config.ingest.code` namespace: + +```rust +Ok(Self { + default_root: root, + default_exclude: config.workspace.exclude.clone(), + copy_threshold_bytes, + skip_generated_header: config.ingest.code.skip_generated_header, + max_file_bytes: config.ingest.code.max_file_bytes, // <-- code-specific + max_file_lines: config.ingest.code.max_file_lines, // <-- code-specific +}) +``` + +`crates/kebab-source-fs/src/connector.rs:169-190`: the walker's size check calls +`is_oversized(...)` regardless of the path's media type: + +```rust +if crate::code_meta::is_oversized( + &abs_path, + self.max_file_bytes, // generic limit, applied to every file + self.max_file_lines, +).unwrap_or(false) { + fs_skips.skipped_size_exceeded = + fs_skips.skipped_size_exceeded.saturating_add(1); + // ... + continue; +} +``` + +`crates/kebab-source-fs/src/code_meta.rs:114-129`: `is_oversized(...)` itself is a +generic helper (extension-agnostic): + +```rust +pub(crate) fn is_oversized(path: &Path, max_bytes: u64, max_lines: u32) -> Result<bool> { + let meta = std::fs::metadata(path)?; + if meta.len() > max_bytes { + return Ok(true); + } + // line cap (streaming) + ... +} +``` + +`crates/kebab-config/src/lib.rs:535-547`: `IngestCodeCfg::default()` sets +`max_file_bytes = 262_144` (256 KB), which most PDFs and images exceed. 
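As a rough illustration of this root cause (a sketch, not the project's code), a byte cap checked without regard to file type rejects every PDF in the sample below. `exceeds_byte_cap` is a hypothetical stand-in for the byte branch of `is_oversized`, and the sizes are the approximate figures quoted from the dogfood report, assumed here to be multiples of 1024 bytes.

```rust
/// Default `[ingest.code].max_file_bytes` quoted above (256 KB).
const DEFAULT_MAX_FILE_BYTES: u64 = 262_144;

/// Hypothetical stand-in for the byte branch of `is_oversized`:
/// a strict greater-than comparison with no media-type check.
fn exceeds_byte_cap(len_bytes: u64, max_bytes: u64) -> bool {
    len_bytes > max_bytes
}

/// Approximate sizes of three PDFs from the dogfood report (assumed KB = 1024 bytes).
fn sample_pdf_sizes() -> Vec<(&'static str, u64)> {
    vec![
        ("F1", 466 * 1024),
        ("F2", 756 * 1024),
        ("internals", 820 * 1024),
    ]
}

/// Names of the sample PDFs that a type-blind cap would skip.
fn skipped_under_cap(max_bytes: u64) -> Vec<&'static str> {
    sample_pdf_sizes()
        .into_iter()
        .filter(|(_, len)| exceeds_byte_cap(*len, max_bytes))
        .map(|(name, _)| name)
        .collect()
}
```

With the default cap, `skipped_under_cap(DEFAULT_MAX_FILE_BYTES)` returns all three names, which matches the `skipped_size_exceeded` behavior observed for non-code files.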
+ +### §3.2 Decision matrix + +| Option | 설명 | 장점 | 단점 | +|--------|------|------|------| +| **A — code path only** | walker 의 size check 를 code file (extension recognized by `code_lang_for_path`) 에만 적용 | 간단 / 기존 default behavior 보존 / Bug #2 즉시 해결 | PDF/image/markdown 의 size limit 0 — 1 GB PDF 도 walker 통과 | +| B — per-type config | 새 `[ingest.pdf]`, `[ingest.image]`, `[ingest.markdown]` section 추가 + per-type limit | user-tunable | 새 config field × 3 + serde default + env override + tests — v0.20 hotfix scope 초과 | +| C — generic limit + docs note | 같은 generic limit 유지하지만 의도 명문화 | code 변경 0 | UX bug 미해결 (dogfood 의 workaround config 가 production 강제) → **거부** | + +### §3.3 Chosen path — Option A + +walker 의 size cap 은 code-specific 의도. PDF/image/markdown 의 size 는 +parser 단계에서 자체 검증 (PDF 의 lopdf load_mem 은 256 KB 이상도 정상 처리, +image 의 OCR 호출도 max_pixels 로 자체 cap). v0.21+ 에서 per-type config +필요 시 Option B 로 진화. + +`is_code_file(path: &Path) -> bool` helper 추가: +- `code_meta::code_lang_for_path(path).is_some()` = code file. 기존 helper + 재사용으로 매핑 일관성 보장 (Tier 1 + Tier 2 basename + extension list + 완전 동일). + +### §3.4 Implementation (Rust diff) + +`crates/kebab-source-fs/src/code_meta.rs` — `pub(crate)` helper 추가: + +```rust +/// Returns true when `path`'s filename/extension is recognised as a code file +/// (per `code_lang_for_path`). Used by the walker to apply +/// `[ingest.code].max_file_bytes` / `max_file_lines` only to code files, +/// not to PDF/image/markdown (which have their own size controls in their +/// respective parsers). +pub(crate) fn is_code_file(path: &Path) -> bool { + code_lang_for_path(path).is_some() +} +``` + +`crates/kebab-source-fs/src/connector.rs:168-190` — walker conditional 추가: + +```rust +// p10-1A-1: apply per-file generated-header + size-cap checks on files +// that passed the override (gitignore/builtin/kebabignore) matching. +// v0.20.0 sub-item 1 bugfix: size-cap (max_file_bytes / max_file_lines) +// applies ONLY to code files. 
PDF/image/markdown bypass — their parsers +// have their own size controls. +if crate::code_meta::is_code_file(&abs_path) + && crate::code_meta::is_oversized( + &abs_path, + self.max_file_bytes, + self.max_file_lines, + ) + .unwrap_or(false) +{ + fs_skips.skipped_size_exceeded = + fs_skips.skipped_size_exceeded.saturating_add(1); + push_sample( + &mut fs_skips.skip_examples.size_exceeded, + &abs_path, + &root, + ); + tracing::debug!( + path = %rel_path.display(), + max_bytes = self.max_file_bytes, + max_lines = self.max_file_lines, + "skip: code file exceeds size cap" + ); + continue; +} +``` + +Conditional application of `skip_generated_header` is a separate matter: the generated-header sniff +looks only at the ASCII content of the first 512 bytes, regardless of path extension. The first +512 binary bytes of a PDF/image never contain an ASCII string such as "do not edit", so there are +no false positives. **No walker conditional is needed for `is_generated_file`**; +existing behavior is kept. + +### §3.5 Test additions + +Add to the existing test module in `crates/kebab-source-fs/src/connector.rs`: + +```rust +#[test] +fn size_cap_skips_only_code_files() { + let dir = tempfile::tempdir().unwrap(); + let root = dir.path(); + // 300 KB PDF (binary), 300 KB markdown (text), 300 KB Rust (code). + let big_blob: Vec<u8> = vec![b'x'; 300_000]; + std::fs::write(root.join("paper.pdf"), &big_blob).unwrap(); + std::fs::write(root.join("notes.md"), &big_blob).unwrap(); + std::fs::write(root.join("big.rs"), &big_blob).unwrap(); + + let conn = FsSourceConnector::new( + &cfg_with_size_cap(root.to_str().unwrap(), 262_144, 5_000), + ) + .unwrap(); + let (assets, skips) = conn.scan_with_skips(&SourceScope::default()).unwrap(); + + let paths: Vec<_> = assets.iter().map(|a| a.workspace_path.0.clone()).collect(); + // PDF + Markdown pass through walker. + assert!(paths.contains(&"paper.pdf".to_string())); + assert!(paths.contains(&"notes.md".to_string())); + // Code file gets skipped. 
+ assert!(!paths.contains(&"big.rs".to_string())); + assert!( + skips.skip_examples.size_exceeded.iter().any(|p| p.contains("big.rs")), + "size_exceeded examples should contain only big.rs: {:?}", + skips.skip_examples.size_exceeded + ); + assert!( + !skips.skip_examples.size_exceeded.iter().any(|p| p.contains("paper.pdf")), + "PDF must NOT appear in size_exceeded examples: {:?}", + skips.skip_examples.size_exceeded + ); +} +``` + +추가로 기존 test `ingest_report_counts_oversized_files_by_bytes` 의 fixture +이름이 `huge.rs` 라서 invariant 보존됨. `ingest_report_size_cap_by_line_count` +도 `longfile.rs` 라서 동일. + +## §4 Bug #3 — chunk_id collision (Critical) + +### §4.1 Root cause investigation + +#### §4.1.1 chunker 의 collision-avoidance workaround + +`crates/kebab-chunk/src/pdf_page_v1.rs:47-60` 의 module doc 에 collision 회피 +설명: + +``` +Design §4.2's chunk_id = blake3(doc_id || chunker_version || sort(block_ids) +|| policy_hash) collides when one block (= one PDF page) is split into +multiple chunks: every chunk on that page has identical block_ids. + +Workaround: feed a per-chunk variant format!("{base_policy_hash}#c{char_start}") +into the recipe's policy_hash slot. +``` + +`crates/kebab-chunk/src/pdf_page_v1.rs:170-172` 의 actual call: + +```rust +let per_chunk_hash = format!("{base_policy_hash}#c{char_start}"); +let chunk_id = + id_for_chunk(&doc.doc_id, &chunker_version, &block_ids, &per_chunk_hash); +``` + +여기 `char_start` = `chunk_page(...)` 의 첫 번째 tuple field = **post-overlap +`actual_start`** (NOT 원본 segment boundary `start`). 
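How a post-overlap start can land on the previous chunk's start, and so produce the same `#c{N}` suffix, can be sketched as follows. This is a simplified model under stated assumptions: `overlap_start` and `suffix_for` are hypothetical helper names, and the back-walk is assumed to be bounded by an overlap byte budget and by the previous chunk's start.

```rust
/// Walk back from `start` by at most `overlap_bytes` UTF-8 bytes,
/// never moving before `prev_min` (the previous chunk's start).
fn overlap_start(chars: &[char], start: usize, prev_min: usize, overlap_bytes: usize) -> usize {
    let mut a = start;
    let mut walked_bytes = 0usize;
    while a > prev_min {
        let char_len = chars[a - 1].len_utf8();
        if walked_bytes + char_len > overlap_bytes {
            break; // overlap budget exhausted
        }
        walked_bytes += char_len;
        a -= 1;
    }
    a
}

/// Per-chunk suffix appended to the base policy hash (illustrative only).
fn suffix_for(offset: usize) -> String {
    format!("#c{offset}")
}
```

For a page of 3-byte Korean characters with a 240-byte overlap budget, a segment boundary at char 30 walks all the way back to 0, so a suffix derived from the post-overlap start equals the first chunk's `#c0`. A suffix derived from the boundary itself (`#c30`) stays distinct.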
+ +#### §4.1.2 overlap 의 actual_start 계산 + +`crates/kebab-chunk/src/pdf_page_v1.rs:266-281`: + +```rust +let actual_start = if let Some(prev) = chunks.last() { + let prev_min = prev.0; // previous chunk 의 actual_start + let mut a = start; + let mut acc_o: usize = 0; + while a > prev_min { + let cl = chars[a - 1].len_utf8(); + if acc_o + cl > overlap_bytes { + break; + } + acc_o += cl; + a -= 1; + } + a +} else { + start +}; +``` + +`while a > prev_min` — overlap walk 는 previous chunk 의 actual_start 까지만 +back-walk. overlap_bytes 가 충분히 크고 `start - prev_min` 이 작으면 +`actual_start = prev_min`. **두 chunk 가 같은 actual_start = 같은 `#c{N}`**. + +#### §4.1.3 가설 검증 — F2 (1580 chars OCR) + +가정: F2 의 OCR text 가 첫 ~80 chars 안에 sentence-end (`.` + whitespace) +또는 paragraph break (`\n\n`) 를 포함. + +- 기본 chunking policy: `target_tokens=500` → `target_bytes=1500`, + `overlap_tokens=80` → `overlap_bytes = min(240, 750) = 240`. +- 한국어 char = 3 byte UTF-8. overlap_bytes=240 → 80 char 까지 back-walk. +- 가정한 bounds = `[0, 30, ~n]` (첫 ~30 chars 안 sentence-end 1 개). +- segment 1: start=0, chunk_end=30 → chunks.push((0, 30, ...)). `#c0`. +- segment 2: start=30, byte_len(30, n) >> target_bytes → 단일 segment chunk. + - actual_start walk: a=30 → walk back while a > 0, acc_o ≤ 240. + - 30 chars * 3 byte = 90 byte ≤ 240. → a=0 (=prev_min) 에서 loop 종료. + - actual_start = 0 = prev_min. +- chunks.push((0, n, ...)). `#c0` — **collision with chunk 1**. + +같은 doc 안 두 chunk 의 chunk_id input: +- `{kind:"chunk", doc_id:doc_id_F2, chunker_version:"pdf-page-v1", + block_ids:[block_id_F2], policy_hash:"{base_hash}#c0"}` +- canonical JSON 동일 → blake3 동일 → chunk_id 동일. + +→ `put_chunks` 의 `INSERT INTO chunks` 에서 첫 row 성공, 두 번째 row 가 +PRIMARY KEY violation. + +#### §4.1.4 F1 (779 chars OCR) 가 collision 안 하는 이유 + +F1 OCR text 도 한국어이지만 character 분포가 다르거나 첫 ~80 char 안 sentence +boundary 부재. 
그 경우 bounds = `[0, n]` 또는 first boundary 가 80 char 이후 +→ chunk 2 의 actual_start 가 prev_min 이 아닌 다른 값 → distinct `#c{N}` +값 → distinct chunk_id. + +→ **F2 만 collision** 이라는 dogfood 의 observation 과 일치. + +#### §4.1.5 dogfood report 의 가설 평가 + +dogfood report 는 "scanned_page1 의 chunk_id 와 동일" 로 cross-doc collision +을 추정. 본 spec 의 investigation 결과 = **intra-doc (F2 내부) collision**. +근거: +- chunk_id input 에 `doc_id` 포함 → 서로 다른 doc 의 chunk_id 는 자동으로 다름. +- 같은 doc 안 두 chunk 가 같은 block_id + 같은 `#c{N}` policy_hash 면 + identical chunk_id. +- 가설 A (policy_hash default magic value) — 검증 안 됨, base_policy_hash 는 + policy 의 canonical JSON blake3 (deterministic). +- 가설 B (id_for_block 의 char_end 가 hash 의 일부) — 가능성 있지만 chunk_id + collision 자체와 무관 (block_id 변경은 chunk_id 변경을 produce; 다른 + collision pattern). +- 가설 C (chunker 의 block_ids ordering) — 가능성 있지만 single-block per + chunk 이므로 ordering N/A. +- 가설 D (OCR text 가 다른 doc 와 동일 inline) — chunk_id 의 input 에 text + 미포함, N/A. + +**Confirmed root cause** = 가설 C 의 variant — 단일 page 가 multi-chunk 일 +때 overlap 의 actual_start 가 prev chunk 의 actual_start 로 collapse, `#c{N}` +suffix 동일. + +### §4.2 Decision matrix + +| Option | 설명 | 장점 | 단점 | +|--------|------|------|------| +| **A — segment boundary `start`** | `per_chunk_hash` 의 suffix 를 post-overlap `actual_start` 대신 segment boundary `start` 로 변경 | minimal change / segment boundary 는 monotonically increasing (chunk_page 의 seg_idx loop invariant) → 항상 distinct / chunk_id 의 semantic 의도 보존 | chunk_page 의 return tuple shape 변경 필요 | +| B — chunk ordinal | `per_chunk_hash = "#c{ordinal}"` (page 안 chunk index 0, 1, 2, ...) | 가장 simple / segment boundary 무관 | chunk_id 의 "meaningful hash input" semantic 약화 | +| C — (`char_start`, `char_end`) pair | `per_chunk_hash = "#c{char_start}-{char_end}"` | 두 chunk 가 같은 char_start 라도 char_end 가 다르면 distinct | char_end 도 overlap clamp 에 의해 동일 가능 (e.g. 
last chunk 이 두 번 분할되면) — invariant 약함 | +| D — sequence number + char_start | `per_chunk_hash = "#c{ordinal}-{char_start}"` | invariant 완전 보장 | redundant info, hash input 가 더 길어짐 | + +### §4.3 Chosen path — Option A + +근거: +- chunk_page 의 main loop 는 `seg_idx` 가 strictly increasing, segment + boundary `bounds[seg_idx]` 도 strictly increasing (bounds 가 dedup 후 unique). + 따라서 segment boundary `start` 를 hash suffix 로 쓰면 같은 page 안 chunk + 들의 hash input 가 보장된 distinct. +- chunk_id 의 semantic: "어떤 segment 부터 시작한 chunk 인가" — overlap 이전 + 의 segment boundary 가 진짜 semantic origin. overlap 은 retrieval boundary + 를 위한 enrichment. +- chunk_page 의 return tuple 을 `(segment_start, actual_start, chunk_end, + slice)` 의 4-tuple 로 확장 (또는 segment_start 를 chunker loop 안에서 별도 + track) — minimal diff. + +### §4.4 Implementation + +`crates/kebab-chunk/src/pdf_page_v1.rs` 의 `chunk_page` return signature 확장: + +```rust +/// Split a single page's text into ordered chunks, each represented as +/// `(segment_start, actual_start, chunk_end, text_slice)`. +/// +/// - `segment_start` = pre-overlap segment boundary. Strictly increasing +/// across the returned vec. Use this for chunk_id uniqueness suffixes. +/// - `actual_start` = post-overlap start char index. May collapse to a +/// previous chunk's `actual_start` under aggressive overlap policy. +/// Use this for `SourceSpan::Page::char_start`. +/// - `chunk_end` = chunk's end char index (exclusive). +fn chunk_page( + text: &str, + target_bytes: usize, + overlap_bytes: usize, +) -> Vec<(usize, usize, usize, String)> { + // ... (existing logic, but each push uses (segment_start, actual_start, chunk_end, slice)) + chunks.push((start, actual_start, chunk_end, slice)); + // ... +} +``` + +caller 의 `chunk` method 도 동일하게 update: + +```rust +for (segment_start, char_start, char_end, slice) in + chunk_page(&p.text, target_bytes, overlap_bytes) +{ + // ... existing u32 conversion + span construction ... 
+    let span = SourceSpan::Page {
+        page: page_num,
+        char_start: Some(u32::try_from(char_start).expect("...")),
+        char_end: Some(u32::try_from(char_end).expect("...")),
+    };
+    let block_ids: Vec<_> = vec![p.common.block_id.clone()];
+    // segment_start (pre-overlap boundary) is strictly increasing across
+    // chunks, even when overlap walk collapses actual_start to prev_min.
+    let per_chunk_hash = format!("{base_policy_hash}#c{segment_start}");
+    let chunk_id =
+        id_for_chunk(&doc.doc_id, &chunker_version, &block_ids, &per_chunk_hash);
+    // ... rest unchanged ...
+}
+```
+
+Update the module doc at `crates/kebab-chunk/src/pdf_page_v1.rs:47-60` at the same time,
+because the `"#c{char_start}"` in the existing description becomes stale with the new fix:
+
+```rust
+//! Design §4.2's chunk_id = blake3(doc_id || chunker_version || sort(block_ids)
+//! || policy_hash) collides when one block (= one PDF page) is split into
+//! multiple chunks: every chunk on that page has identical block_ids.
+//!
+//! Workaround that doesn't change the §4.2 recipe: feed a per-chunk
+//! variant `format!("{base_policy_hash}#c{segment_start}")` into the
+//! recipe's `policy_hash` slot. `segment_start` is the pre-overlap segment
+//! boundary, strictly increasing across the returned chunks even when the
+//! overlap walk collapses `actual_start` to a previous chunk's `prev_min`.
+//! Logged in tasks/HOTFIXES.md (2026-05-27, Bug #3 second-iteration patch).
+```
+
+Also add a dated entry to `tasks/HOTFIXES.md` (documenting that this fix is the
+**second-iteration patch** of the chunk_id deviation, because the first iteration's
+`#c{char_start}` workaround left a collision in the aggressive-overlap case):
+
+```markdown
+## 2026-05-27: v0.20.0 sub-item 1: chunk_id `#c{char_start}` workaround
+collapses under aggressive overlap (Bug #3 second-iteration patch)
+
+**Symptom**: ingesting F2 (1580 chars OCR) fails with `DocumentStore::put_chunks (pdf):
+UNIQUE constraint failed: chunks.chunk_id`. ...
+
+**Root cause**: in `crates/kebab-chunk/src/pdf_page_v1.rs:170` ...
+the post-overlap `actual_start` collapses to the previous chunk's actual_start ...
+
+**Fix** (this spec, §4.4): add `segment_start` to the `chunk_page` return tuple and
+change the `per_chunk_hash` suffix to `segment_start` ...
+
+**chunker_version cascade**: bump `pdf-page-v1` → `pdf-page-v1.1`
+(see §4.4.1). The chunk_id of multi-chunk PDF pages changes, so this is an explicit
+invalidation through the design §9 cascade trigger.
+
+**Amends**: spec `docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md`
+§4.4. The parent design §4.2 chunk_id recipe itself is unchanged (only the internal
+computation of the workaround layer changes).
+```
+
+#### §4.4.1 Preserving chunk_id determinism
+
+Existing single-chunk-per-page case (e.g. small pages, `text.len() <= target_bytes`):
+- Early return of `chunk_page`: `vec![(0, n, text.to_string())]` → in the new shape,
+  `vec![(0, 0, n, text.to_string())]`. `segment_start = 0 = actual_start`.
+- Same `#c0` suffix → same chunk_id.
+
+First chunk of the multi-chunk case:
+- segment_start = bounds[0] = 0, actual_start = start = 0 (no previous chunk).
+- Same `#c0` suffix → same chunk_id.
+
+Second and later chunks of the multi-chunk case:
+- Before: `actual_start` (overlap-walked, may be == 0).
+- After: `segment_start` = bounds[seg_idx] > 0.
+- → chunk_id changes (intentional, to avoid the collision).
+
+→ The chunk_id of multi-chunk pages in an existing v0.19 (pre-OCR) PDF KB changes.
+The v0.20 force-reingest path refreshes them automatically.
+
+**Decision (round 1c, closes §7.2 Open Q1): chunker_version bump
+`pdf-page-v1` → `pdf-page-v1.1`** (adopts the critic round 1 M-1 recommendation).
+
+Rationale:
+- For a normal multi-chunk PDF page (e.g. metro-korea.pdf in dogfood report Scenario 1,
+  with 21 blocks / 34 chunks, a normal path that did not trigger Bug #3), the internal
+  computation change silently maps the chunk_id to a different value.
+  If chunker_version stays at `pdf-page-v1`, no cascade audit happens in the
+  store/embedding layer → unless the user explicitly calls `--force-reingest`,
+  chunk_id ↔ chunk_text in the vector store can silently mismatch.
+- The original intent of the design §9 cascade rule = an explicit version bump when the
+  chunker algorithm changes → automatic invalidation report from the store layer.
+  The `pdf-page-v1.1` bump is a direct application of that rule.
+- bump cost = zero: v0.20.0 itself is a force-update release (a cumulative bugfix stack on
+  top of the single release commit of PR #189), and enabling the OCR feature of the parent
+  spec (`2026-05-27-pdf-scanned-ocr-spec.md`) is a force-reingest-recommended path anyway.
+  For single-chunk PDF pages, if only chunker_version differs, the cascade recomputes them
+  identically within the new doc_id chain.
+- benefit = explicit user-facing audit trail. On the next ingest the cascade invalidation
+  is stated in the store layer report.
+
+The other version fields of the cascade (parser_version / embedding_version /
+prompt_template_version / index_version) are unchanged: the patch is limited to the
+chunker layer.
+
+Update the `chunker_version()` constant of `PdfPageV1Chunker`:
+```rust
+impl Chunker for PdfPageV1Chunker {
+    fn chunker_version(&self) -> ChunkerVersion {
+        ChunkerVersion("pdf-page-v1.1".to_string()) // was: "pdf-page-v1"
+    }
+    // ...
+}
+```
+
+Also update `PARSER_VERSION` or the const `CHUNKER_VERSION` in
+`crates/kebab-chunk/src/pdf_page_v1.rs` at the same time (matching the crate's actual constant name).
+
+### §4.5 Test additions
+
+Add to `#[cfg(test)] mod tests` in `crates/kebab-chunk/src/pdf_page_v1.rs`:
+
+```rust
+#[test]
+fn multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids() {
+    // Regression test for v0.20.0 sub-item 1 Bug #3: post-overlap actual_start
+    // can collapse to prev_min, producing identical `#c{char_start}` suffixes
+    // and identical chunk_ids → sqlite chunks.chunk_id PRIMARY KEY violation
+    // at put_chunks INSERT time.
+    //
+    // Synthesises Korean OCR text shape: dense Korean characters (3 bytes
+    // per char) with a single early sentence-end boundary at char ~10 +
+    // long tail.
+
+    // 10 Korean chars (= 30 UTF-8 bytes) + "." + " " + ~500 more Korean chars.
+    let early_seg: String = std::iter::repeat('가').take(10).collect();
+    let tail: String = std::iter::repeat('나').take(500).collect();
+    let page_text = format!("{early_seg}. {tail}");
+
+    let doc = make_pdf_doc(&[&page_text]);
+    let policy = default_policy(500, 80); // target=1500 byte, overlap=240 byte
+    let chunks = PdfPageV1Chunker.chunk(&doc, &policy).unwrap();
+
+    assert!(
+        chunks.len() >= 2,
+        "expected ≥2 chunks for {} byte page; got {}",
+        page_text.len(),
+        chunks.len()
+    );
+
+    // Hard invariant: all chunk_ids must be unique. Without the fix, the
+    // second chunk would have actual_start = 0 (== first chunk's
+    // actual_start) under the aggressive-overlap walk → identical `#c0`
+    // hash suffix → identical chunk_id → PRIMARY KEY violation.
+    let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
+    ids.sort_unstable();
+    let total = ids.len();
+    ids.dedup();
+    assert_eq!(
+        ids.len(),
+        total,
+        "all chunk_ids must be unique even when overlap walks actual_start back to prev_min"
+    );
+}
+```
+
+(round 1c L-1: the second test from round 0,
+`chunk_id_recipe_uses_segment_start_not_actual_start`, was removed. It is redundant with
+this test's uniqueness check, and its only real assertion was `assert!(chunks.len() >= 2)`,
+which did not match the intent of the test name.)
+
+In addition, add an integration-level regression test in `crates/kebab-app/tests/`:
+
+```rust
+// crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs (new)
+//
+// v0.20.0 sub-item 1 Bug #3 regression: ingesting multiple scanned PDFs (each with a
+// single page and a different OCR text length) passes without any chunk_id collision.
+//
+// The mock OCR engine (MockOcrEngine from kebab-parse-image, or an inline impl) is set up
+// to return a different text length per page (e.g. 30 chars, 200 chars, 800 chars).
+// This avoids real Ollama calls.
+
+#[test]
+fn multi_scanned_pdf_ingest_no_chunk_id_collision() {
+    // ... setup: 3 scanned PDF fixture, mock OCR engine, isolated KB
+    // ... assert: every ingest_report.items entry has kind != Error
+    // ... assert: store.get_chunks_count() = sum of per-PDF chunk_counts
+}
+```
+
+(round 1c NIT-1: the file name and the function name are unified as
+`multi_scanned_pdf_ingest_no_chunk_id_collision`; the round 0 file name
+`pdf_multi_scan_no_chunk_id_collision.rs` did not match the fn name.)
+
+#### §4.5.1 Pre-condition: MockOcrEngine availability (round 1c M-3)
+
+This integration test requires a mock impl of the `OcrEngine` trait. First step of the
+executor phase:
+
+1. Locate MockOcrEngine with
+   `grep -rn "impl OcrEngine" crates/kebab-parse-image/src/ crates/kebab-app/tests/`.
+2. **Current state** (2026-05-27 verifier probe):
+   - `crates/kebab-parse-image/src/ocr.rs:235`: production `impl OcrEngine for OllamaVisionOcr`.
+   - `crates/kebab-app/tests/pdf_ocr_apply.rs:25`: `impl OcrEngine for MockOcrEngine` (test-only).
+3. The new integration test (`multi_scanned_pdf_ingest_no_chunk_id_collision.rs`)
+   is a separate test binary in the same crate (`kebab-app`), so it cannot directly import
+   the private MockOcrEngine of `pdf_ocr_apply.rs`. Options for the executor:
+   - **Option A (recommended)**: lift MockOcrEngine into
+     `crates/kebab-app/tests/common/mock_ocr.rs` (a parameterised form that takes the
+     per-page text lengths as a ctor argument). Both tests (`pdf_ocr_apply.rs` and the
+     new test) share it via `mod common;`.
+   - **Option B**: define a duplicate inline `impl OcrEngine for LocalMock { ... }`
+     inside the new test (prioritises test isolation, avoids the sharing cost).
+4. If it is absent (or sharing is hard and Option B is also impractical), apply a
+   **conditional downgrade** to the acceptance in §6 row 7: the unit-level invariant of
+   `kebab-chunk` (§6 row 4) alone pins the core regression of Bug #3, and the
+   integration test is skipped.
+
+The decision path for the executor's dependency-check task is closed in §7.2 Open Q4.
+
+### §4.6 Acceptance (Bug #3 fix)
+
+- Ingesting F1 (779 chars) and F2 (1580 chars) together yields 0 chunk_id collisions.
+- 0 collisions on every `--force-reingest`.
+- 0 collisions in a KB of 5+ scanned PDFs (Korean OCR text distributed over 100~3000 chars).
+- The existing 1000-determinism test in `crates/kebab-chunk`
+  (`deterministic_chunk_ids_1000`) keeps passing.
+- 0 workspace test regressions.
+- Add the new test `multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids` and
+  the integration test `multi_scanned_pdf_ingest_no_chunk_id_collision`.
+
+## §5 Bug #4: F4 fixture Pages tree
+
+### §5.1 Root cause
+
+#### §5.1.1 Symptom
+
+```json
+{
+  "doc_path": "mojibake.pdf",
+  "kind": "new",
+  "byte_len": 22568,
+  "pdf_ocr_pages": 0,
+  "pdf_ocr_ms_total": 0,
+  "block_count": 0,
+  "chunk_count": 0,
+  "warnings": []
+}
+```
+
+This violates the PdfTextExtractor invariant (§1.4 "1 Block::Paragraph per page").
+
+#### §5.1.2 How lopdf get_pages() reacts
+
+dogfood probe:
+- `lopdf::Document::load_mem(F4_bytes)` → OK.
+- `pdf_doc.get_pages()` → empty `BTreeMap`.
+- In the PDF byte stream, the `/Type /Page` count = 1 and the `/Count` value = 1.
+
+→ Structurally one page exists, but lopdf's Pages tree traversal
+(`/Pages` → `/Kids` chain) is broken.
+
+#### §5.1.3 Analysis of the fixture generation path
+
+`tests/fixtures/_synth/mojibake.py`:
+
+```python
+c = canvas.Canvas(str(dst), pagesize=A4)
+c.setFont(FONT_NAME, 12)
+y = A4[1] - 30*mm
+for line in ["Mojibake fixture (no ToUnicode CMap)", "..."]:
+    c.drawString(30*mm, y, line)
+    y -= 16
+c.save()
+
+data = dst.read_bytes()
+# pattern: "/ToUnicode N M R" - strip the indirect object reference
+new_data = re.sub(rb"/ToUnicode\s+\d+\s+\d+\s+R\b", b"", data)
+dst.write_bytes(new_data)
+```
+
+**Step 2 analysis**: `re.sub` removes the `/ToUnicode N M R` byte sequence, but:
+- Byte offsets in the PDF shift by the length of the removed bytes.
+- The offset entries of the cross-reference table (`xref`) become stale.
+- The offset in the `startxref` value also becomes stale.
+
+**startxref fix in Step 3** (commit `c2cd3a7` in `tasks/HOTFIXES.md`):
+- A manual byte edit `22130 → 22114` updates startxref.
+- However, the individual offsets in the xref table itself are also stale: the actual byte
+  position of the indirect object referenced by the Pages tree's `/Kids` array does not
+  match its xref entry.
+- lopdf's strict load validates startxref + the xref table first; the load succeeds,
+  but indirect object resolution fails during Pages tree traversal → empty
+  Pages map.
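The stale-offset mechanism described in the Step 2 analysis can be reproduced on a toy buffer. The sketch below is illustrative only: the byte string is a made-up fragment rather than a real PDF, and `find` / `strip_all` are hypothetical helpers that do not exist in kebab or lopdf. It shows that removing bytes moves every later object, so an offset recorded before the edit no longer points at the object header.

```rust
/// Return the byte offset of the first occurrence of `needle`, if any.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Remove every occurrence of `needle`, similar in effect to a byte-level `re.sub(..., b"")`.
fn strip_all(haystack: &[u8], needle: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(haystack.len());
    let mut i = 0;
    while i < haystack.len() {
        if haystack[i..].starts_with(needle) {
            i += needle.len(); // skip the match; everything after it shifts left
        } else {
            out.push(haystack[i]);
            i += 1;
        }
    }
    out
}

fn main() {
    // Toy stand-in for a PDF body (assumption: not a valid PDF).
    let original: &[u8] =
        b"1 0 obj << /ToUnicode 9 0 R >> endobj\n2 0 obj << /Type /Page >> endobj\n";
    let needle: &[u8] = b"/ToUnicode 9 0 R";

    // Offset that an xref entry would have recorded for object 2 before the edit.
    let recorded = find(original, b"2 0 obj").unwrap();
    let stripped = strip_all(original, needle);
    let actual = find(&stripped, b"2 0 obj").unwrap();

    // Every object after the edit moves by the number of removed bytes.
    assert_eq!(recorded - actual, needle.len());
    // The recorded offset no longer points at the object header.
    assert!(!stripped[recorded..].starts_with(b"2 0 obj"));
    println!("recorded={recorded} actual={actual} shift={}", recorded - actual);
}
```

Patching only `startxref` corrects a single offset of this kind; each per-object entry in the xref table would need the same correction, which is why a tool that rewrites the whole table is the safer route.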
+
+### §5.2 Fixture re-generation strategy
+
+| Option | Description | Pros | Cons |
+|--------|-------------|------|------|
+| **A: pikepdf** | synthesize with reportlab, then open with pikepdf + remove ToUnicode + save (xref auto-regen) | proper xref regeneration / Pages tree intact / library available (pip install pikepdf) | new Python dependency (`pikepdf`) |
+| B: qpdf normalize | after the byte edit, run `qpdf --linearize input.pdf output.pdf` | external tool (the sub-item 1 acceptance criteria already contain a qpdf hint) | qpdf's normalize may reject the broken xref (or may inline the ToUnicode reference again) |
+| C: reportlab disable ToUnicode | disable ToUnicode CMap generation for the Type 0 font during reportlab synthesis | avoids the byte edit; clean | the reportlab API does not directly expose a way to disable ToUnicode (requires a font subclass or monkeypatch) |
+
+### §5.3 Chosen path: Option A (pikepdf)
+
+Rationale:
+- pikepdf is a proper PDF surgery library: the Python bindings for qpdf.
+- It auto-regenerates the xref table and preserves Pages tree integrity.
+- The dependency is added via `pip install pikepdf`; the Python venv used for fixture
+  generation already uses reportlab, so the extra install is trivial.
+
+#### §5.3.1 pikepdf approach to stripping ToUnicode
+
+In reportlab's Type 0 font, the ToUnicode CMap reference appears inside the font dictionary
+as `/ToUnicode N M R`. Use pikepdf to remove only the `/ToUnicode` entry of the font
+dictionary:
+
+```python
+import pikepdf
+with pikepdf.open(str(dst), allow_overwriting_input=True) as pdf:
+    # Walk all indirect objects, delete /ToUnicode entry whenever found.
+    # Per the PDF spec, /ToUnicode appears only as a child of a Font dictionary →
+    # false-positive risk is practically zero. The explicit font type check is omitted
+    # (same shape as the actual implementation in §5.4).
+    for obj in pdf.objects:
+        if isinstance(obj, pikepdf.Dictionary):
+            if "/ToUnicode" in obj:
+                del obj["/ToUnicode"]
+    pdf.save(str(dst))
+```
+
+pikepdf's save automatically preserves the integrity of the xref and the Pages tree.
+
+### §5.4 Implementation (mojibake.py revision)
+
+Complete rewrite of `tests/fixtures/_synth/mojibake.py`:
+
+```python
+"""Synthesize mojibake fixture -- Type 0 font PDF without ToUnicode CMap.
+
+Strategy:
+1. Synthesize a Korean PDF with reportlab using a Type 0 (CID) font (with a normal ToUnicode CMap).
+2. Open it with pikepdf, remove the /ToUnicode entry of the font dictionary, and save (xref auto-regen).
+
+Dependency: reportlab + pikepdf. Install via `pip install reportlab pikepdf`.
+
+Usage:
+    python3 tests/fixtures/_synth/mojibake.py \
+        crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf
+"""
+import sys
+from pathlib import Path
+from reportlab.lib.pagesizes import A4
+from reportlab.lib.units import mm
+from reportlab.pdfbase import pdfmetrics
+from reportlab.pdfbase.ttfonts import TTFont
+from reportlab.pdfgen import canvas
+import pikepdf
+
+DEJAVU_TTF = "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"
+FONT_NAME = "DejaVuSans"
+pdfmetrics.registerFont(TTFont(FONT_NAME, DEJAVU_TTF))
+
+dst = Path(sys.argv[1])
+
+# Step 1: synthesize a normal PDF.
+c = canvas.Canvas(str(dst), pagesize=A4)
+c.setFont(FONT_NAME, 12)
+y = A4[1] - 30 * mm
+for line in [
+    "Mojibake fixture (no ToUnicode CMap)",
+    "Text extraction yields garbage \x00\x01\x02",
+]:
+    c.drawString(30 * mm, y, line)
+    y -= 16
+c.save()
+
+# Step 2: strip the /ToUnicode reference with pikepdf + xref regeneration.
+removed = 0
+with pikepdf.open(str(dst), allow_overwriting_input=True) as pdf:
+    for obj in pdf.objects:
+        if isinstance(obj, pikepdf.Dictionary):
+            if "/ToUnicode" in obj:
+                del obj["/ToUnicode"]
+                removed += 1
+    pdf.save(str(dst))
+
+if removed == 0:
+    print("ERROR: no /ToUnicode entry found in any dictionary", file=sys.stderr)
+    sys.exit(2)
+
+# Step 3: invariant check (load + page count).
+with pikepdf.open(str(dst)) as pdf:
+    n_pages = len(pdf.pages)
+    if n_pages != 1:
+        print(f"ERROR: expected 1 page, got {n_pages}", file=sys.stderr)
+        sys.exit(3)
+    # Check the ToUnicode-absence invariant.
+    raw = Path(dst).read_bytes()
+    if b"/ToUnicode" in raw:
+        print("ERROR: /ToUnicode still present after strip", file=sys.stderr)
+        sys.exit(4)
+
+print(f"wrote {dst} ({dst.stat().st_size} bytes, ToUnicode stripped via pikepdf, 1 page)")
+```
+
+### §5.5 Test additions
+
+Add to `crates/kebab-parse-pdf/tests/text_extractor.rs` (or the relevant existing test
+file):
+
+```rust
+/// Pages tree invariant of F4 mojibake.pdf: guarantees that the fixture re-generation
+/// of Step 2 (pikepdf-based) makes lopdf's get_pages() return normally.
+///
+/// Bug #4 regression: the previous fixture (byte edit + manual startxref) passed
+/// lopdf's strict load, but Pages tree traversal hit broken
+/// indirect object resolution → empty pages map → block_count=0.
+#[test]
+fn mojibake_fixture_load_yields_one_page() {
+    let bytes = include_bytes!("../tests/fixtures/mojibake.pdf");
+    let doc = lopdf::Document::load_mem(bytes).expect("F4 fixture must lopdf-load");
+    let pages = doc.get_pages();
+    assert_eq!(
+        pages.len(),
+        1,
+        "F4 fixture must have exactly 1 page (Pages tree integrity)"
+    );
+}
+
+#[test]
+fn mojibake_fixture_has_no_tounicode_cmap() {
+    // ToUnicode-absence invariant of Step 2.
+    let bytes = std::fs::read("tests/fixtures/mojibake.pdf").unwrap();
+    let count = bytes.windows(b"/ToUnicode".len())
+        .filter(|w| *w == b"/ToUnicode")
+        .count();
+    assert_eq!(count, 0, "F4 fixture must not contain /ToUnicode marker");
+}
+
+#[test]
+fn pdf_text_extractor_on_mojibake_yields_one_block() {
+    // PdfTextExtractor invariant: 1 Block::Paragraph per page.
+    // The F4 fixture has no ToUnicode → text extraction yields garbage or
+    // empty → 1 empty Block::Paragraph + "scanned candidate" warning.
+    let bytes = include_bytes!("../tests/fixtures/mojibake.pdf");
+    // ... ExtractContext setup + extractor.extract(&ctx, bytes) ...
+    let canonical = extractor.extract(&ctx, bytes).unwrap();
+    assert_eq!(canonical.blocks.len(), 1, "expected 1 Block::Paragraph per page");
+    // The text is garbage or empty; the invariant is that the block itself exists.
+    let warning_present = canonical.provenance.events.iter().any(|e| {
+        matches!(e.kind, ProvenanceKind::Warning)
+            && e.note.as_ref().is_some_and(|n| n.contains("scanned candidate"))
+    });
+    assert!(warning_present || !canonical.blocks[0].text_is_empty(),
+        "the text-detect-first empty fallback must emit a scanned-candidate warning");
+}
+```
+
+### §5.6 Acceptance (Bug #4 fix)
+
+- After F4 fixture re-generation, `lopdf::Document::load_mem(...).get_pages().len() = 1`.
+- The ToUnicode CMap absence invariant of the F4 fixture is preserved
+  (`grep -c "/ToUnicode" mojibake.pdf` = 0).
+- When PdfTextExtractor ingests F4, `block_count = 1`, with the
+  warning `"page1 empty (scanned candidate)"` or garbage text.
+- On dogfood retest, mojibake.pdf reports `block_count: 1` and
+  `chunk_count: 0~1` (depending on text content).
+- Update the F4 baseline in the existing `text_extractor_regression.rs`: the old baseline
+  is itself a snapshot of the broken invariant, so it has to be updated.
+- 0 workspace test regressions.
+
+## §6 Acceptance criteria (consolidated)
+
+| # | Verifier | Bug | Command |
+|---|---------|-----|---------|
+| 1 | walker bypasses size cap for PDF | #2 | `cargo test -p kebab-source-fs size_cap_skips_only_code_files` |
+| 2 | walker still skips oversized code files | #2 | `cargo test -p kebab-source-fs ingest_report_counts_oversized_files_by_bytes` |
+| 3 | 256KB+ PDF/markdown ingest default config | #2 | dogfood retest: `kebab ingest` yields `skipped_size_exceeded = 0` for non-code |
+| 4 | chunker collision regression test | #3 | `cargo test -p kebab-chunk multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids` |
+| 5 | chunker determinism preserved | #3 | `cargo test -p kebab-chunk deterministic_chunk_ids_1000` |
+| 6 | chunker overlap clamp preserved | #3 | `cargo test -p kebab-chunk overlap_clamped_when_overlap_exceeds_target` |
+| 7 | integration: multi-scanned PDF ingest (conditional: when MockOcrEngine can be shared per §4.5.1) | #3 | `cargo test -p kebab-app multi_scanned_pdf_ingest_no_chunk_id_collision` |
+| 8 | dogfood: F1 + F2 force-reingest | #3 | dogfood retest: `kebab ingest --force-reingest` yields errors = 0 (excluding encrypted) |
+| 9 | F4 fixture lopdf 1-page invariant | #4 | `cargo test -p kebab-parse-pdf mojibake_fixture_load_yields_one_page` |
+| 10 | F4 fixture ToUnicode-absence invariant | #4 | `cargo test -p kebab-parse-pdf mojibake_fixture_has_no_tounicode_cmap` |
+| 11 | F4 PdfTextExtractor 1-block invariant | #4 | `cargo test -p kebab-parse-pdf pdf_text_extractor_on_mojibake_yields_one_block` |
+| 12 | dogfood: F4 ingest yields block_count=1 | #4 | dogfood retest: the mojibake.pdf ingest item has `block_count: 1` |
+| 13 | workspace clippy clean | all | `cargo clippy --workspace --all-targets -- -D warnings` |
+| 14 | workspace full test pass | all | `cargo test --workspace --no-fail-fast -j 1` |
+| 15 | dogfood end-to-end 9 PDF | all | dogfood retest: all 9 PDFs ingested, errors = 2 (encrypted only) |
+| 16 | chunker_version cascade: final value | #3 | the result of `grep -nE 'pdf-page-v[0-9.]+' crates/kebab-chunk/src/pdf_page_v1.rs` is `"pdf-page-v1.1"` (round 1c M-1 decision) |
+
+## §7 Risks + open questions
+
+### §7.1 Risks
+
+- **The Bug #3 fix changes chunk_id**: the chunk_id of multi-chunk PDF pages (pages that
+  exceed 1500 bytes at pre-OCR time) changes. The user needs one `--force-reingest`.
+  This is acceptable because v0.20.0 is a force-update path (the user does a fresh
+  ingest anyway). State this in the README or the release notes.
+- **Side effect of the Bug #2 fix**: PDFs of 1 GB or more now pass the walker → risk of a
+  memory blow-up in lopdf's load_mem. Out of v0.20 scope (a streaming parser is to be
+  considered from Phase 9; future scope in design §9.2). Acceptable for this fix.
+- **Fixture binary change from the Bug #4 fix**: the SHA256 of F4 mojibake.pdf changes
+  → noise in git LFS / binary diffs. The baseline in `text_extractor_regression.rs`
+  is also updated to the new fixture's output, handled in the same commit.
+- **pikepdf install requirement**: fixture re-generation requires `pip install
+  pikepdf`. This would add a Python dependency to the CI environment if fixture
+  regeneration were part of CI, but this spec's fix commits the fixture itself, so
+  generation is one-off and no CI dependency arises.
+
+### §7.2 Open questions
+
+1. **Cost-benefit of the chunker_version bump**: ✅ **CLOSED (round 1c M-1)**:
+   decided to bump `pdf-page-v1` → `pdf-page-v1.1`. The cascade audit trail is explicit,
+   and because v0.20 is a force-update path the cost is zero. Details: the "Decision
+   (round 1c, closes §7.2 Open Q1)" paragraph in §4.4.1.
+2. **Including Option B (per-type config) of Bug #2 in the v0.20 scope**: this spec
+   defers it to v0.21+. critic round 1 ACCEPT: no recommendation to include it in v0.20.
+3. **Invariant of the F4 fixture**: critic round 1 ACCEPT: the combination of no ToUnicode
+   and a valid Pages tree is exactly reproducible with pikepdf's proper PDF surgery.
+   The Step 2 design (mojibake.py rewrite) is sound.
+4. **Mock OCR for the integration test**: ✅ **CLOSED (round 1c M-3)**:
+   `impl OcrEngine for MockOcrEngine` already exists at
+   `crates/kebab-app/tests/pdf_ocr_apply.rs:25`. Only the executor's choice between the
+   share path (Option A, lifting into `tests/common/mock_ocr.rs`) and inline duplication
+   (Option B) remains. If sharing is impossible, §6 row 7 is conditionally downgraded.
+   Details: the pre-condition paragraph on MockOcrEngine availability in §4.5.1.
+5. **chunk_page tuple shape change**: does the Option A 4-tuple `(segment_start,
+   actual_start, chunk_end, slice)` affect external callers?
+   `chunk_page` is module-private (`fn chunk_page`), so there are 0 external callers and
+   the change is safe. critic round 1 ACCEPT.
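The tuple-shape change discussed in item 5 exists so that the id suffix can read `segment_start` instead of `actual_start`. A minimal sketch of why that matters, under stated assumptions: the tuples below are hand-written values shaped like the F2 case (the overlap walk of the second chunk collapsing back to 0), not output of the real `chunk_page`, and `distinct_suffixes` is a hypothetical helper that only mimics the `#c{N}` suffix, not the full blake3 recipe.

```rust
use std::collections::HashSet;

/// Count the distinct `#c{N}` suffixes obtained by picking one field of each chunk tuple.
fn distinct_suffixes(
    chunks: &[(usize, usize, usize)],
    pick: fn(&(usize, usize, usize)) -> usize,
) -> usize {
    chunks
        .iter()
        .map(|c| format!("#c{}", pick(c)))
        .collect::<HashSet<String>>()
        .len()
}

fn main() {
    // (segment_start, actual_start, chunk_end): illustrative values only.
    // The overlap walk of the second chunk has collapsed actual_start back to 0.
    let chunks = [(0, 0, 12), (12, 0, 512)];

    let by_actual = distinct_suffixes(&chunks, |c| c.1);
    let by_segment = distinct_suffixes(&chunks, |c| c.0);

    assert_eq!(by_actual, 1); // both chunks map to "#c0": colliding hash input
    assert_eq!(by_segment, 2); // "#c0" and "#c12": distinct hash input
    println!("actual_start suffixes: {by_actual}, segment_start suffixes: {by_segment}");
}
```

Because all other hash inputs (doc_id, chunker_version, block_ids) are identical for chunks of the same page, a distinct suffix is what keeps the resulting ids distinct.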
+
+## §8 References
+
+- dogfood report: `.omc/reviews/2026-05-27-v0.20-dogfood-report.md`
+- parent spec (frozen): `docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md`
+- parent plan (round 1c ACCEPT): `docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md`
+- source code (root cause evidence):
+  - `crates/kebab-source-fs/src/connector.rs` (Bug #2)
+  - `crates/kebab-source-fs/src/code_meta.rs` (is_oversized + code_lang_for_path)
+  - `crates/kebab-config/src/lib.rs` (IngestCodeCfg)
+  - `crates/kebab-core/src/ids.rs` (id_for_chunk / id_for_block recipes)
+  - `crates/kebab-chunk/src/pdf_page_v1.rs` (PdfPageV1Chunker + chunk_page)
+  - `crates/kebab-app/src/pdf_ocr_apply.rs` (post-extract OCR enrichment)
+  - `crates/kebab-app/src/lib.rs:1769-1968` (ingest_one_pdf_asset wiring)
+  - `crates/kebab-store-sqlite/src/documents.rs:103-155` (put_chunks DELETE+INSERT)
+  - `migrations/V001__init.sql:80-94` (chunks table DDL: chunk_id PRIMARY KEY)
+  - `tests/fixtures/_synth/mojibake.py` (Bug #4 fixture source)
+- design §3.4 (SourceSpan::Page), §3.5 (Chunk + chunk_id recipe),
+  §4.2 (id_from canonical JSON), §5.2 (walker builtin blacklist),
+  §9 (versioning cascade).
diff --git a/docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md b/docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
new file mode 100644
index 0000000..fdd7f20
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
@@ -0,0 +1,308 @@
+---
+title: "v0.20.0 sub-item 1 bugfix round 2: Identity-H mojibake marker + CLI --media help text"
+created: 2026-05-27
+status: "DRAFT round 1c"
+parent_spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
+contract_sections: ["§1.3 (text-detect threshold metric)", "§9 (version cascade)"]
+related_specs:
+  - docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
+  - docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
+related_dogfood:
+  - .omc/reviews/2026-05-27-v0.20-bugfix-dogfood-report.md (Bug #6 + #7)
+---
+
+# v0.20.0 sub-item 1 bugfix round 2: Identity-H mojibake marker + CLI --media help text
+
+## §1 Problem statement
+
+### §1.1 Bug #6: Identity-H Unimplemented marker bypasses mojibake detection
+
+**Symptom**: `metro-korea.pdf` (58 MB, Identity-H CID font without ToUnicode CMap) ingests with `pdf_ocr_pages=0`. The full text contains the `?Identity-H Unimplemented?` marker 1154 times. All 21 pages + 34 chunks are indexable, but the content is unusable garbage: the marker literal repeated instead of readable text.
+
+**Root cause**: `crates/kebab-parse-pdf/src/text_quality.rs` lines 9-37. The function `compute_valid_char_ratio()`, via `is_valid_text_char()`, treats the ASCII printable range `0x0020..=0x007E` as unconditionally valid. lopdf emits `?Identity-H Unimplemented?` (26 ASCII printable chars) when it cannot decode a CID font lacking a ToUnicode CMap. Result: valid_ratio = 1.0 → exceeds the OCR fallback threshold 0.5 → the text-detect first pass incorrectly classifies mojibake as valid text → `pdf_ocr_pages` stays 0, and no OCR fallback is triggered.
+
+**Design intent deviation**: Parent spec §1.3 (line 74) explicitly states "ratio metric judges mojibake page as scanned candidate." PoC example "֥ᬵᯝ₞e ࠦᯱᖝ░" (custom font, no ToUnicode) should trigger OCR. **Implementation gap**: the literal ASCII marker case (Identity-H font) was not anticipated.
+
+### §1.2 Bug #7: CLI `--media` help text omits `code` from valid values
+
+**Symptom**: `kebab search --help` lists the `--media` accepted values as "markdown, pdf, image, audio, other"; `code` is missing.
+
+**Actual behavior**: `kebab search "main" --media code --json -k 5` returns 5 hits (code/script.sh, code/rust_sample/src/main.rs, etc.). The schema `media_breakdown` includes `code: 6` as first-class. Functional correctness is complete; **only the CLI doc-comment is outdated**.
+
+**Root cause**: `crates/kebab-cli/src/main.rs:148-165`. The SearchArgs `--media` field clap doc-comment omits `code`. clap's `--help` renderer quotes this doc-comment directly.
+
+---
+
+## §2 Scope + non-scope
+
+### §2.1 Included in this spec
+
+- **Bug #6 fix**: Add known mojibake marker stripping to `compute_valid_char_ratio()`.
+- **Bug #6 test**: Two new unit tests covering the Identity-H marker (marker-dominant and marker-minority mixed-text cases).
+- **Bug #6 regression**: Verify existing 8 text_quality unit tests remain green.
+- **Bug #7 fix**: Update CLI `--media` doc-comment to include `code`.
+- **Bug #7 test**: Assert that `kebab search --help` output contains "`code`" substring.
+- **Traceability**: Link both fixes to parent spec §1.3 design intent.
+
+### §2.2 Explicitly out of scope
+
+**Bug #8 candidate (falsified)**: V007 trigram tokenizer already applied; the 2-character query limitation is a design-level constraint, not a bug. Handled in prior dogfood report §Bug #8.
+
+**Non-bug observations**:
+- `--readonly + ingest` exit=0: Graceful refusal per CLAUDE.md intent (exit codes 0/1/2/3 unchanged; `error.v1.code` handles agent branching).
+- Ask phrasing-sensitive refusal: RAG corner case; not a code defect.
+- Binary staleness: Environmental artifact, not applicable to spec.
+
+**Ancillary risks**:
+- Scan for other `--media` doc-comment locations (R-4): Plan drafter to use grep; not a blocker for this spec.
+- Other lopdf unimplemented markers (R-1): Plan drafter to inspect lopdf source; the marker array is extensible.
+
+---
+
+## §3 Decisions
+
+### §3.1 Bug #6: Known mojibake marker stripping
+
+Strip known mojibake marker substrings **before ratio calculation**, then force the ratio to 0.0 if the remainder is empty after marker removal. When stripped characters exceed remaining characters (marker dominance), cap the ratio at 0.3 to trigger OCR fallback on marker-heavy mixed pages.
+
+**Rationale**: lopdf's unimplemented CID font handling consistently emits specific ASCII marker strings. Hardcoding them is lightweight, deterministic, and covers the known failure mode without requiring expensive heuristics (e.g., ML-based gibberish detection). Pages like `metro-korea.pdf` may contain mostly mojibake body text with small valid headers; the marker-dominance check ensures such pages fall below the 0.5 OCR threshold.
+
+**Marker list**: `?Identity-H Unimplemented?` only. lopdf 0.32.0 emits exactly one marker (verified per critic round 1 probe). Extensible if future lopdf versions emit additional markers.
+
+### §3.2 Bug #7: CLI doc-comment update
+
+Add `code` to the comma-separated list of valid `--media` values in the SearchArgs field's clap doc-comment. Single-line edit; no functional or schema changes.
+
+### §3.3 Parent spec traceability
+
+Both fixes uphold parent spec §1.3:
+- Bug #6 ensures mojibake pages (fonts missing a ToUnicode CMap) trigger OCR fallback per design intent.
+- Bug #7 corrects CLI documentation to match the actual schema (first-class `code` media type supported since v0.18.0).
+
+No changes to parser_version, chunker_version, or wire schema.
+
+---
+
+## §4 Implementation specification
+
+### §4.1 Bug #6: text_quality.rs diff
+
+**File**: `crates/kebab-parse-pdf/src/text_quality.rs`
+
+**Change**:
+1. Add constant array of known mojibake markers (lines 8–10):
+   ```rust
+   // Source of truth: lopdf-0.32.0/src/document.rs:523 (Document::decode_text).
+   // Only one Unimplemented marker is emitted by lopdf 0.32.0; other CMap
+   // encodings fall through to `String::from_utf8_lossy(bytes)`, which yields
+   // PUA / replacement-char territory already covered by `pure_pua_zero`.
+   // Re-verify on lopdf dependency upgrade.
+   const MOJIBAKE_MARKERS: &[&str] = &[
+       "?Identity-H Unimplemented?",
+   ];
+   ```
+
+2. Refactor `compute_valid_char_ratio()` (lines 39–106):
+   ```rust
+   pub fn compute_valid_char_ratio(s: &str) -> f32 {
+       // 1) Strip known mojibake markers before counting valid chars.
+       //    Identity-H CID fonts without ToUnicode CMap emit ASCII-only marker
+       //    substrings (bypassing PUA detection).
+       let mut cleaned: String = s.to_string();
+       // `had_marker` guard preserves prior behavior for whitespace-only input
+       // (returns ratio of whitespace validity, not 0.0) when no markers found.
+       // With markers stripped, the guard enables the trim-empty check.
+       let mut had_marker = false;
+       for marker in MOJIBAKE_MARKERS {
+           if cleaned.contains(marker) {
+               had_marker = true;
+               cleaned = cleaned.replace(marker, "");
+           }
+       }
+       // 2) Whitespace-only cleaned text → 0.0 (marker-only page).
+       if had_marker && cleaned.trim().is_empty() {
+           return 0.0;
+       }
+       // 3) Marker-dominance heuristic: when the stripped byte count exceeds the
+       //    remaining byte count (i.e. marker > 50% of original), the page is "mostly
+       //    mojibake with some decodeable page-furniture" (e.g. metro-korea.pdf has
+       //    header text in a separate font + body that is Identity-H CID).
+       //    Force ratio downward to trigger OCR fallback (parent spec §1.3 intent).
+       if had_marker {
+           let stripped_chars = s.len().saturating_sub(cleaned.len());
+           if stripped_chars > cleaned.len() {
+               // Marker dominates: cap ratio at 0.3 (below 0.5 OCR threshold).
+               // The 0.3 cap (not 0.0) preserves a small signal that some text
+               // WAS decodeable, useful for downstream metrics if ever exposed.
+               let mut total = 0u32;
+               let mut valid = 0u32;
+               for c in cleaned.chars() {
+                   total += 1;
+                   if is_valid_text_char(c) {
+                       valid += 1;
+                   }
+               }
+               let raw_ratio = if total == 0 { 0.0 } else { valid as f32 / total as f32 };
+               return raw_ratio.min(0.3);
+           }
+       }
+       // 4) Otherwise compute ratio on cleaned text (existing logic).
+       let mut total = 0u32;
+       let mut valid = 0u32;
+       for c in cleaned.chars() {
+           total += 1;
+           if is_valid_text_char(c) {
+               valid += 1;
+           }
+       }
+       if total == 0 {
+           return 0.0;
+       }
+       valid as f32 / total as f32
+   }
+   ```
+
+**Invariants preserved**:
+- Function signature and return type unchanged (→ byte-identical caller surface).
+- Existing character category logic (hangul, CJK, Latin-1) unmodified.
+- Empty-string behavior (return 0.0) preserved.
+
+### §4.2 Bug #6: Unit tests
+
+Replace the existing Bug #6 test set with two new tests reflecting the marker-dominance heuristic:
+
+```rust
+#[test]
+fn identity_h_marker_dominance_caps_ratio_below_threshold() {
+    // metro-korea.pdf-class: 20× marker (520 char) + 12 char ASCII header
+    // (including the trailing space).
+    // Without dominance heuristic: ratio = 12/12 = 1.0 (bypasses OCR).
+    // With dominance heuristic: ratio ≤ 0.3 (triggers OCR fallback).
+    let s = format!("Page 1 of 5 {}", "?Identity-H Unimplemented?".repeat(20));
+    let r = compute_valid_char_ratio(&s);
+    assert!(r <= 0.3, "marker-dominant mixed page → ratio ≤ 0.3 (OCR fallback); got {r}");
+}
+
+#[test]
+fn identity_h_marker_minority_with_long_valid_text_keeps_high_ratio() {
+    // Inverse case: short marker noise + long valid text → ratio stays high
+    // (no false OCR trigger on otherwise-good pages).
+ let header = "x".repeat(200); // 200 char valid ASCII + let s = format!("{header} ?Identity-H Unimplemented?"); // 1× marker = 26 char + let r = compute_valid_char_ratio(&s); + assert!(r > 0.9, "marker-minority page keeps high ratio; got {r}"); +} +``` + +**Regression preservation**: Existing 8 tests (`empty_string_zero`, `pure_ascii_one`, `pure_hangul_syllables_one`, `pure_pua_zero`, `mixed_half`, `cjk_ideograph_valid`, `hangul_jamo_valid`, `f4_fixture_ratio_under_threshold`) must all remain green. + +### §4.3 Bug #7: CLI doc-comment diff + +**File**: `crates/kebab-cli/src/main.rs` (SearchArgs field, lines ~150–160) + +**Change**: +```diff +-/// p9-fb-36: filter by `assets.media_type` kind. Comma-separated. Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`, `image`, `audio`, `other`. Unknown values match nothing ++/// p9-fb-36: filter by `assets.media_type` kind. Comma-separated. Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`, `image`, `audio`, `code`, `other`. Unknown values match nothing +``` + +### §4.3a Bug #7 integration: `integrations/claude-code/kebab/SKILL.md:57` simultaneous update + +Per CLAUDE.md §Wire schema v1 invariant — in-tree integration docs must be synchronized when wire surface changes. 
This round has no wire schema change, but SKILL.md line 57 exhibits the same regression as §4.3 (Bug #7): + +**File**: `integrations/claude-code/kebab/SKILL.md` (line 57) + +**Change**: +```diff +-`media` (string array — IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"other"`; alias `"md"` → `"markdown"`) ++`media` (string array — IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"code"` | `"other"`; alias `"md"` → `"markdown"`) +``` + +### §4.4 Bug #7: CLI help assertion + +Add test to `crates/kebab-cli/tests/` (or extend existing help snapshot test): + +```rust +#[test] +fn search_help_lists_code_in_media_values() { + let out = std::process::Command::new(env!("CARGO_BIN_EXE_kebab")) + .args(["search", "--help"]) + .output() + .expect("kebab search --help"); + let stdout = String::from_utf8_lossy(&out.stdout); + assert!(stdout.contains("`code`"), "search --help must list 'code' as accepted --media value"); +} +``` + +### §4.5 Version cascade impact (CLAUDE.md §Versioning cascade) + +- **parser_version**: `"pdf-text-v1"` — unchanged. Text-detect threshold is internal metric, not surface. +- **chunker_version**: `"pdf-page-v1.1"` — unchanged (no chunker logic affected). +- **wire schema**: No new fields, no schema version bump. `compute_valid_char_ratio()` is internal to `PdfTextExtractor::extract()`. + +--- + +## §5 Acceptance criteria + +- [ ] Text_quality unit test: `identity_h_marker_dominance_caps_ratio_below_threshold` passes. +- [ ] Text_quality unit test: `identity_h_marker_minority_with_long_valid_text_keeps_high_ratio` passes. +- [ ] Regression: All 8 existing text_quality tests remain green (no ratio behavior changes for valid text). +- [ ] CLI help assertion: `cargo test search_help_lists_code_in_media_values` passes. +- [ ] SKILL.md integration: `grep -F '"code"' integrations/claude-code/kebab/SKILL.md` returns ≥1 line. 
+- [ ] Full workspace test suite: `cargo test --workspace --no-fail-fast -j 1` green (clippy + unit + integration). +- [ ] Fresh binary builds: `CARGO_TARGET_DIR=/build/out/cargo-target/target cargo build --release -p kebab-cli` succeeds. + +--- + +## §6 Risks + open questions + +### Identified risks + +**R-1 — Other lopdf unimplemented markers** (resolved per critic round 1 probe): lopdf 0.32.0 emits exactly one marker — `?Identity-H Unimplemented?` at `lopdf-0.32.0/src/document.rs:523` (`Document::decode_text`). Other CMap encoding arms (`UniCNS`, `UniJIS`, `UniKS`, `GBK-EUC`, `Adobe-*`) fall through to `String::from_utf8_lossy(bytes)` → PUA / replacement-char territory (already covered by `pure_pua_zero` test). Marker array adequacy = OK for current lopdf pin. **Re-verify on lopdf dependency upgrade.** + +**R-2 — Whitespace-only edge case after stripping**: Handled by `.trim().is_empty()` check; returns 0.0 as intended. + +**R-3 — Version/wire schema impact**: None. text_quality is internal threshold metric, not exposed to wire schema or version cascade. + +**R-4 — Other `--media` help locations** (revised per critic): `--media` value list is scattered across 3 surfaces — `crates/kebab-cli/src/main.rs:157–159` (CLI doc-comment, covered by §4.3), `integrations/claude-code/kebab/SKILL.md:57` (skill doc, covered by §4.3a), `crates/kebab-cli/tests/cli_help_smoke.rs` (test, covered by §4.4). Plan drafter to run `grep '\bmedia\b' integrations/ crates/kebab-cli/src docs/wire-schema/v1` to confirm no additional surfaces exist. + +**R-5 — Bulk mode media field parsing**: `crates/kebab-app/src/bulk.rs:161` handles media field parsing independently; string doc-comment update does not affect functional correctness. + +### Open questions + +**OQ-1 — Marker case sensitivity**: Does lopdf always emit markers in exact case `?Identity-H Unimplemented?`? Verify with lopdf source. If case variations exist, use case-insensitive matching or extend array. 
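If OQ-1 turns out to need case-insensitive matching, the stripping step could look roughly like the sketch below. The helper name `strip_marker_ignore_ascii_case` and its signature are hypothetical (they do not appear in the spec), and the sketch assumes the marker itself is pure ASCII, as `?Identity-H Unimplemented?` is.

```rust
/// Hypothetical helper (not part of the spec): removes every occurrence of
/// `marker` from `s`, ignoring ASCII case, and reports whether anything was
/// removed. Assumes `marker` is ASCII-only.
fn strip_marker_ignore_ascii_case(s: &str, marker: &str) -> (String, bool) {
    if marker.is_empty() {
        return (s.to_string(), false);
    }
    // ASCII lowercasing never changes byte lengths, so byte offsets found in
    // the lowered copy are also valid offsets into the original string.
    let haystack = s.to_ascii_lowercase();
    let needle = marker.to_ascii_lowercase();

    let mut out = String::with_capacity(s.len());
    let mut cursor = 0;
    let mut found = false;
    while let Some(pos) = haystack[cursor..].find(&needle) {
        let start = cursor + pos;
        // Copy the text between the previous match and this one.
        out.push_str(&s[cursor..start]);
        cursor = start + needle.len();
        found = true;
    }
    // Copy whatever follows the last match.
    out.push_str(&s[cursor..]);
    (out, found)
}
```

The returned flag plays the same role as `had_marker` in §4.1, so the whitespace-only and marker-dominance checks could stay unchanged. Whether the extra allocation is worth it depends on what the lopdf source check finds.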
+ +**OQ-2 — Marker stripping threshold policy** (resolved via §4.1 marker-dominance heuristic): When stripped characters exceed remaining characters, ratio is capped at 0.3 to trigger OCR fallback. This ensures marker-dominant mixed pages (e.g., 99% marker + 1% valid header) do not bypass OCR despite the header's high ratio. Design intent (parent spec §1.3) is upheld: all mojibake pages trigger OCR fallback. + +**OQ-3 — Alias expansion scope**: Bug #7 explicitly omits new aliases (e.g., `src` → `code`). Single additive fix to doc-comment, no enum variant changes. + +### UX consequence — pre-bugfix2 v0.20 user's `--force` re-ingest + +This round preserves version cascade (no `parser_version` bump). The `try_skip_unchanged` path will match files indexed pre-bugfix2 with same `parser_version="pdf-text-v1"` + hash. Pre-indexed `metro-korea.pdf`-class pages will NOT automatically re-route through the corrected text-detect → OCR fallback. + +**User action required**: Explicit `kebab ingest --force-reingest ` to purge cached skip decisions and re-process affected files. + +**Release notes** (v0.20.1 or whichever version ships this bugfix) **MUST include**: "If you indexed mojibake-heavy PDFs (esp. metro-korea.pdf class) on v0.20.0 pre-bugfix2, run `kebab ingest --force-reingest ` to apply the improved text detection. Otherwise, `ingest` will skip unchanged files and OCR fallback will not trigger." + link to design §9 cascade explanation. + +**Documentation updates** (same PR as code): README + HANDOFF + ARCHITECTURE per `feedback_readme_sync_rule` memory — mention the `--force-reingest` step in release highlights or migration notes. + +Deliberate design: automatic migration risks wedging stable v0.20.0 KBs. Manual `--force-reingest` is the correct escape hatch (parent spec §1.7 line 126–128 precedent). 
+ +--- + +## §7 References + +- **Parent spec**: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md §1.3 (line 74), §1.4, §9 +- **Dogfood evidence**: .omc/reviews/2026-05-27-v0.20-bugfix-dogfood-report.md §Bug #6, §Bug #7 +- **Critic result**: .omc/reviews/2026-05-27-v0.20-bugfix2-spec-critic-r1-result.md (findings H-1 through NIT-2, parent invariant audit) +- **External source**: lopdf-0.32.0/src/document.rs:523 (`Document::decode_text`, sole emitter of `?Identity-H Unimplemented?` marker) +- **Code locations**: + - text_quality.rs: `crates/kebab-parse-pdf/src/text_quality.rs:9-106` + - CLI help: `crates/kebab-cli/src/main.rs:157–159` + - Skill integration: `integrations/claude-code/kebab/SKILL.md:57` + - CLI test: `crates/kebab-cli/tests/` (search_help_lists_code_in_media_values) + +--- + +**Status**: Round 1c rewrite COMPLETE. All 9 critic findings (H-1 + M-1/M-2/M-3 + L-1/L-2 + NIT-1/NIT-2 + invariant audit) applied in-session. + +**Prior round reference**: Round 1 commits (d9acda5, 436fd01, 241ded5, e674ff4) are merged on branch; this round is independent (text_quality.rs vs. source-fs/connector.rs + chunk/pdf_page_v1.rs). diff --git a/docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix3-spec.md b/docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix3-spec.md new file mode 100644 index 0000000..cc75e77 --- /dev/null +++ b/docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix3-spec.md @@ -0,0 +1,410 @@ +--- +title: "v0.20.0 sub-item 1 bugfix round 3 — final-dogfood findings" +created: 2026-05-27 +status: DRAFT +round: 1c +parent_spec: 2026-04-27-kebab-final-form-design.md +contract_sections: + - "1.1 (ask streaming)" + - "2.2 (error handling)" + - "2.4 (JSON wire schema)" + - "3.1 (config XDG)" + - "4.1 (capabilities schema)" +source_report: .omc/reviews/2026-05-27-v0.20-final-dogfood-report.md +--- + +# v0.20.0 sub-item 1 bugfix round 3 — final-dogfood findings + +Fix design for the **5 bugs** found in the post-bugfix2 final dogfood (2026-05-27).
PR #189 force-update (base=main). Spec scope: root cause + fix decision + acceptance criteria + parent spec traceability. Bug #12 falsified (out of scope). All 5 fixes range from trivial to small refactors (existing 1350 tests + 5+ new tests). + +--- + +## §1 Problem statement + +### §1.1 Bug #9: capabilities false negative (Critical) + +`capabilities.streaming_ask` and `capabilities.single_file_ingest` in `kebab schema --json` are both hardcoded to `false`. The actual implementation, however, supports both: +- `kebab ask --stream` → emits `answer_event.v1` ndjson events correctly (191 events verified). +- `kebab ingest-file ` → emits `ingest_report.v1` correctly for new and updated files. +- `kebab ingest-stdin --title ` → works correctly. + +**Impact**: Agents such as an MCP host or the Claude Code skill that base routing decisions on `capabilities: { streaming_ask: false, single_file_ingest: false }` get a false negative. Users wrongly conclude that working features are unavailable. + +### §1.2 Bug #10: config fail-fast (UX) + +```bash +kebab search "rust" --config /tmp/nonexistent.toml --json +# exit=0, {"hits":[],"schema_version":"search_response.v1"} +``` + +When an explicit path is missing, the CLI silently falls back to the default config (XDG path). This is a debugging nightmare: a typo or wrong path surfaces only as 0 hits. + +### §1.3 Bug #11: OCR timeout 600s (Critical UX) + +`config.pdf.ocr.request_timeout_secs = 600` (default of 10 min/page). Evidence from the metro-korea.pdf dogfood: +- On page 8 and page 13, slow responses from the remote Ollama ran into the full 600s timeout. +- Result: `ms: 600000, chars: 0, skipped: true` emitted → body text not indexed + 20 min wasted. + +**Production impact**: The user gets no ingest-completion signal, and some pages are unsearchable. + +### §1.4 Bug #13: schema.models single value (UX) + +```json +{ + "chunker_version": "md-heading-v1", + "parser_version": "md-frontmatter-v2", + ... +} +``` + +However, multiple versions are active in the corpus: +- parsers: `md-frontmatter-v2`, `pdf-text-v1`, `code-rust-v1`, `code-python-v1`, `none-v1`. +- chunkers: `md-heading-v1`, `pdf-page-v1.1`, `code-rust-ast-v1`, `code-python-ast-v1`, `dockerfile-file-v1`, `k8s-manifest-resource-v1`, `manifest-file-v1`, `code-text-paragraph-v1`.
+ +**Impact**: Users reading `kebab schema` cannot fully identify the active versions, which risks omissions during a version cascade audit. + +### §1.5 Bug #14: empty query silent (Minor UX) + +```bash +kebab search "" --json +# exit=0, {"hits":[],"next_cursor":null,"schema_version":"search_response.v1"} +``` + +An empty (or whitespace-only) query silently returns 0 hits. This is a user mistake, so an explicit error is the consistent behavior. + +--- + +## §2 Scope + non-scope + +### §2.1 Included: 5 bug fixes + +| Bug | Category | Severity | Fix type | +|-----|----------|----------|----------| +| #9 | wire schema | critical | capability flag hardcoded boolean → actual feature check | +| #10 | config UX | medium | silent fallback → error.v1 with config_not_found | +| #11 | OCR config | critical | default 600s → 60s timeout | +| #13 | wire schema | medium | single field → additive array fields (backward compat) | +| #14 | input validation | minor | empty query silent → error.v1 with invalid_input | + +### §2.2 Out of scope + +- **Bug #12 (falsified)**: `inspect doc` blocks[].text shows a "?" placeholder for the code parser. Root cause: the content is not in `.text`; the `.code` field is emitted correctly. The user workflow can reach it via `.code` → outside this spec's scope. +- Other axes from dogfood report §12 (ranking bias, multi-root caveat) → separate phase. + +--- + +## §3 Decisions + +### §3.1 Bug #9: capabilities correction + +**Decision**: Update the two fields in `schema.rs::capabilities_snapshot()` to `true`. + +```rust +fn capabilities_snapshot() -> Capabilities { + Capabilities { + json_mode: true, + ingest_progress: true, + ingest_cancellation: true, + rag_multi_turn: true, + search_cache: true, + incremental_ingest: true, + streaming_ask: true, // ← WAS FALSE, actual TRUE + http_daemon: false, // ← preserved (not-impl, separate sub-item) + mcp_server: true, + single_file_ingest: true, // ← WAS FALSE, actual TRUE + bulk_search: true, + } +} +``` + +**Rationale**: The actual implementation supports production-grade streaming ask + single-file ingest. The schema report must match reality for agent routing to be accurate.
+ +### §3.2 Bug #10: config_not_found error + +**Decision**: `kebab-config` defines its own error type `ConfigNotFound`, and `kebab-app::error_wire` adds a classify arm for it. + +Pseudo-code: +```rust +// crates/kebab-config/src/lib.rs (or another suitable error module) +#[derive(Debug, thiserror::Error)] +#[error("config file does not exist: {path}")] +pub struct ConfigNotFound { + pub path: PathBuf, +} + +// Inside Config::load: +pub fn load(opt_path: Option<&Path>) -> anyhow::Result<Self> { + match opt_path { + Some(p) if !p.exists() => Err(anyhow::Error::new(ConfigNotFound { path: p.to_path_buf() })), + Some(p) => Self::from_file(p), + None => Self::from_xdg_default_or_defaults(), + } +} +``` + +Classify arm in `kebab-app/src/error_wire.rs`: +```rust +if let Some(e) = err.downcast_ref::<kebab_config::ConfigNotFound>() { + return ErrorV1 { + schema_version: ERROR_V1_ID.to_string(), + code: "config_not_found".to_string(), + message: format!("config file does not exist: {}", e.path.display()), + details: json!({ "path": e.path }), + hint: Some("verify --config argument; use --config to point to a writable toml file, or omit to use XDG default".to_string()), + }; +} +``` + +**Exit code**: 2 (config error, not 0 silent). + +### §3.3 Bug #11: OCR timeout 60s + +**Decision**: Reduce `default_pdf_ocr_request_timeout_secs()` from 600 to 60. + +```rust +fn default_pdf_ocr_request_timeout_secs() -> u64 { + 60 // 1 min, production-friendly per dogfood evidence +} +``` + +**Doc-comment to add**: +```rust +/// Default OCR request timeout in seconds. Most pages complete in 6-32s. +/// Set to upper-bound valid throughput; exceeding 60s may indicate +/// Ollama unavailability or very dense/high-res pages. +/// Override via [pdf.ocr] request_timeout_secs = N in config.toml. +``` + +### §3.4 Bug #13: active_parsers + active_chunkers (additive) + +**Decision**: Additive minor wire schema change: add two arrays to the `Models` struct and keep the existing single fields (backward compat). `kebab-store-sqlite` provides the fetch methods.
+ +**Store API** (crates/kebab-store-sqlite/src/lib.rs): +```rust +impl SqliteStore { + /// SELECT DISTINCT parser_version FROM documents WHERE parser_version IS NOT NULL ORDER BY parser_version + pub fn fetch_distinct_parser_versions(&self) -> anyhow::Result<Vec<String>> { + let conn = self.conn()?; + let mut stmt = conn.prepare( + "SELECT DISTINCT parser_version FROM documents + WHERE parser_version IS NOT NULL + ORDER BY parser_version" + )?; + let rows = stmt.query_map([], |row| row.get::<_, String>(0))?; + let mut out = Vec::new(); + for r in rows { out.push(r?); } + Ok(out) + } + + pub fn fetch_distinct_chunker_versions(&self) -> anyhow::Result<Vec<String>> { + let conn = self.conn()?; + let mut stmt = conn.prepare( + "SELECT DISTINCT chunker_version FROM chunks + WHERE chunker_version IS NOT NULL + ORDER BY chunker_version" + )?; + let rows = stmt.query_map([], |row| row.get::<_, String>(0))?; + let mut out = Vec::new(); + for r in rows { out.push(r?); } + Ok(out) + } +} +``` + +**Models struct** (crates/kebab-app/src/schema.rs): +```rust +pub struct Models { + /// Deprecated since v0.20.1. Use active_parsers for multi-parser corpus. + /// Reports default parser version (markdown path). + pub parser_version: String, + + /// Deprecated since v0.20.1. Use active_chunkers for multi-chunker corpus. + pub chunker_version: String, + + /// All parser versions active in corpus (v0.20.1+). May be empty if corpus is empty. + pub active_parsers: Vec<String>, + + /// All chunker versions active in corpus (v0.20.1+). May be empty if corpus is empty.
+ pub active_chunkers: Vec<String>, + + pub embedding_version: String, + pub prompt_template_version: String, + pub index_version: String, + pub corpus_revision: u64, +} +``` + +**Computation** (crates/kebab-app/src/schema.rs::collect_models): +```rust +let store = open_store_for_stats(cfg)?; +let active_parsers = store.fetch_distinct_parser_versions().unwrap_or_default(); +let active_chunkers = store.fetch_distinct_chunker_versions().unwrap_or_default(); + +Ok(Models { + parser_version: active_parsers.first().cloned().unwrap_or_else(|| kebab_parse_md::PARSER_VERSION.to_string()), + chunker_version: active_chunkers.first().cloned().unwrap_or_else(|| kebab_chunk::md_heading_v1::VERSION_LABEL.to_string()), + active_parsers, + active_chunkers, + ... +}) +``` + +**Fallback**: The markdown fallback is kept. The existing hardcoded `parser_version` + `chunker_version` values are preserved (backward compat). + +### §3.5 Bug #14: empty query validation + +**Decision**: Add an empty-query check that emits error.v1 to both the `search` and `ask` commands. + +**Search command** (crates/kebab-cli/src/main.rs::search arm): +```rust +if let Some(q) = query.as_ref() { + if q.trim().is_empty() { + return Err(anyhow::Error::new(kebab_app::StructuredError(ErrorV1 { + schema_version: ERROR_V1_ID.to_string(), + code: "invalid_input".to_string(), + message: "query is empty; provide a non-empty search term or use --bulk".into(), + details: Value::Null, + hint: Some("e.g. `kebab search 'rust async'` or `kebab search --bulk < queries.ndjson`".into()), + }))); + } +} +``` + +**Ask command** (crates/kebab-cli/src/main.rs::ask arm): +```rust +if query.trim().is_empty() { + return Err(anyhow::Error::new(kebab_app::StructuredError(ErrorV1 { + schema_version: ERROR_V1_ID.to_string(), + code: "invalid_input".to_string(), + message: "query is empty; provide a non-empty prompt".into(), + details: Value::Null, + hint: Some("e.g. `kebab ask 'explain this code'`".into()), + }))); +} +``` + +Both commands now validate; no silent fallback.
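The emptiness test shared by the two arms above can be isolated as in the following sketch. `require_non_empty_query` is a hypothetical name used only for illustration; the real arms wrap the failure in `ErrorV1`, which is omitted here so the snippet stays self-contained.

```rust
/// Hypothetical guard (illustrative only, not part of the spec): accepts a
/// query unless it is empty or consists solely of whitespace.
fn require_non_empty_query(query: &str) -> Result<(), String> {
    // `str::trim` strips Unicode whitespace, so tabs and the ideographic
    // space U+3000 are treated the same way as ASCII spaces.
    if query.trim().is_empty() {
        Err("query is empty; provide a non-empty query".to_string())
    } else {
        Ok(())
    }
}
```

Because `trim` is Unicode-aware, a query made up only of full-width spaces is rejected as well, which may matter for CJK input methods.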
+ +--- + +## §4 Implementation specification + +### §4.1 Files to modify + +1. **Bug #9 capability fix**: `crates/kebab-app/src/schema.rs` + - line 137–151: `capabilities_snapshot()`: flip `streaming_ask: false` → `true`, `single_file_ingest: false` → `true`. + - add test: `capabilities_streaming_ask_matches_cli_surface()`. + - add test: `capabilities_single_file_ingest_matches_cli_surface()`. + +2. **Bug #10 config_not_found**: Two files + - `crates/kebab-config/src/lib.rs`: + - Define `ConfigNotFound` error struct (with `#[derive(Debug, thiserror::Error)]`). + - Modify `Config::load(opt_path: Option<&Path>)`: path existence check, `return Err(anyhow::Error::new(ConfigNotFound { ... }))`. + - add test: `config_load_explicit_nonexistent_path_returns_error()`. + - `crates/kebab-app/src/error_wire.rs`: + - Add classify arm after existing `ConfigInvalid` case. + - Map `kebab_config::ConfigNotFound` → `ErrorV1 { code: "config_not_found", ... }`. + +3. **Bug #13 schema.models**: Three components + - `crates/kebab-store-sqlite/src/lib.rs`: + - Implement `fetch_distinct_parser_versions()`: SQL SELECT DISTINCT on documents.parser_version + ORDER BY. + - Implement `fetch_distinct_chunker_versions()`: SQL SELECT DISTINCT on chunks.chunker_version + ORDER BY. + - `crates/kebab-app/src/schema.rs`: + - Modify `Models` struct: add `active_parsers: Vec<String>`, `active_chunkers: Vec<String>` fields. + - Modify computation logic (`collect_models` or equiv): call store methods, populate arrays, fallback to markdown defaults for single fields. + - add test: `schema_models_active_arrays_empty_on_empty_corpus()`. + - add test: `schema_models_active_arrays_populated_after_mixed_ingest()`. + - `docs/wire-schema/v1/schema.schema.json`: + - `Models` object: add `"active_parsers": { "type": "array", "items": { "type": "string" } }`. + - add `"active_chunkers": { "type": "array", "items": { "type": "string" } }`.
+ - Mark deprecated in comment: `parser_version` + `chunker_version` (additive, backward compat). + +4. **Bug #14 empty query validation**: `crates/kebab-cli/src/main.rs` + - search command arm: add `if query.trim().is_empty()` check → error.v1 code=invalid_input. + - ask command arm: add identical `if query.trim().is_empty()` check → error.v1 code=invalid_input. + +5. **Wire schema v1 doc update**: `docs/wire-schema/v1/` + - Update schema doc to note `active_parsers` / `active_chunkers` optional (additive). + +6. **Integration**: `integrations/claude-code/kebab/SKILL.md` + - Update `schema.models` surface docs — reference new `active_*` arrays for multi-version corpora. + +7. **Tests** (new or extended): + - `crates/kebab-cli/tests/`: invalid --config path (absolute + relative) → error.v1 + exit≠0. + - `crates/kebab-cli/tests/`: empty query (search + ask) → error.v1 code=invalid_input + exit≠0. + - `crates/kebab-config/tests/`: config file not found → ConfigNotFound error. + - `crates/kebab-app/tests/`: mixed corpus schema — active_parsers/chunkers include all ingested versions. + +### §4.2 Regression checks + +- Existing 1350 workspace tests: `cargo test --workspace --no-fail-fast -j 1` must pass green. +- All non-bug capabilities (json_mode, ingest_progress, ingest_cancellation, rag_multi_turn, search_cache, incremental_ingest, mcp_server, bulk_search) stay true. +- Default config path resolution (no --config) unchanged — silent fallback to XDG only if `--config` not passed. +- Relative path behavior (cwd-relative, Rust std path::Path::exists()) preserved. +- Empty corpus → empty `active_parsers` / `active_chunkers` array (not null, not error). +- Existing hardcoded `parser_version` + `chunker_version` fields continue to report markdown defaults (backward compat). +- Schema version bump not required (wire schema additive minor, backward compat). 
+ +--- + +## §5 Acceptance criteria + +| # | Criterion | Evidence | +|----|-----------|----------| +| AC-1 | `kebab schema --json` emits `streaming_ask: true` + `single_file_ingest: true` | `cargo test -p kebab-app capabilities_ -j 4` green | +| AC-2 | `kebab search "x" --config /nonexistent.toml --json` emits exit≠0 + error.v1 code=config_not_found | `cargo test -p kebab-config config_load_explicit_nonexistent_path_returns_error -j 4` green | +| AC-3 | `cargo test -p kebab-config pdf_ocr_request_timeout_default_is_60s -j 4` → green | unit test confirms default = 60s (no manual timing) | +| AC-4 | After mixed ingest (MD + PDF + code), `kebab schema --json` emits both `active_parsers` + `active_chunkers` arrays containing all versions | integration test pass | +| AC-5 | `kebab search "" --json` and `kebab search " " --json` both emit exit≠0 + error.v1 code=invalid_input | integration test pass | +| AC-6 | `kebab ask "" --json` emits exit≠0 + error.v1 code=invalid_input (ask symmetry) | integration test pass | +| AC-7 | `kebab search "rust" --config nonexistent-relative.toml --json` (relative path) emits exit≠0 + error.v1 code=config_not_found | integration test pass | +| AC-8 | All 1350+ workspace tests pass; no new failures | `cargo test --workspace --no-fail-fast -j 1` exit=0 | +| AC-9 | Wire schema backward compat: old clients reading `parser_version` + `chunker_version` still work; `active_*` arrays optional per schema | JSON schema `additionalProperties: false` review | +| AC-10 | `kebab ask --stream` still works; streaming events emitted (no regression) | manual `kebab ask --stream "explain this" 2>&1 \| head -3` | + +--- + +## §6 Risks + resolutions + +### Risks + +- **R-1** (Bug #10): Relative path `./config.toml` must resolve from cwd, not from binary location. **Resolution**: Rust `std::path::Path::exists()` is cwd-relative; no workaround needed. +- **R-2** (Bug #13): Empty corpus → empty `active_parsers` / `active_chunkers` array.
**Resolution**: Unit test `schema_models_active_arrays_empty_on_empty_corpus()` mandated (AC-4). +- **R-3** (resolved): `collect_models` uses no cache (every-call re-computation). `active_parsers/chunkers` reflect corpus state at invocation time. If future caching is added, a `corpus_revision` increment signals invalidation; document it at that time. +- **R-4** (Bug #14): `ask` command validation is covered by the same fix (§3.5 mandates both search + ask). +- **R-5** (Bug #11): 60s may still time out on very dense/high-res pages. **Mitigation**: User can override via `config.toml [pdf.ocr] request_timeout_secs = N`. Release notes explicitly call this out. + +--- + +## §7 Parent spec deviation (HOTFIXES handoff) + +**F-11 MEDIUM finding**: parent spec `2026-04-27-kebab-final-form-design.md` (frozen) specifies PDF OCR request_timeout_secs = 600s (§1000 + §1628 OQ-1, rationale: "5x headroom over the 105s seen in CPU environments"). Bug #11 (dogfood evidence) contradicts this: 600s causes timeouts; 60s is production-optimal. + +**Deviation handling**: +1. Parent spec stays frozen (no edits). +2. **HOTFIXES entry (executor Step N)**: `tasks/HOTFIXES.md` receives dated entry: + ```markdown + 2026-05-27 — PDF OCR request_timeout_secs default 600s → 60s (v0.20.0 bugfix3 dogfood evidence). Bug #11. + ``` +3. **Parent spec cross-link (executor Step N)**: parent spec `2026-04-27-kebab-final-form-design.md` receives inline comment at §1000 (default value code block) or §1628 (OQ-1 paragraph): + ```markdown + + ``` + +**Parent spec invariant**: No changes to parent spec text; only cross-link comment + HOTFIXES.md entry. Frozen design contract preserved. + +--- + +## §8 References + +- [Dogfood report](../../../.omc/reviews/2026-05-27-v0.20-final-dogfood-report.md): 5 bugs discovered + decisions. +- [Parent spec (frozen contract)](2026-04-27-kebab-final-form-design.md): §1, §2, §4 (capabilities, error handling, JSON schema, config XDG). +- `crates/kebab-app/src/schema.rs:137–151` (capabilities_snapshot).
+- `crates/kebab-config/src/lib.rs` (Config::load, default_pdf_ocr_request_timeout_secs). +- `crates/kebab-app/src/error_wire.rs` (classify ConfigNotFound). +- `crates/kebab-store-sqlite/src/lib.rs` (fetch_distinct_parser_versions, fetch_distinct_chunker_versions). +- `crates/kebab-cli/src/main.rs` (search + ask query validation). +- `docs/wire-schema/v1/schema.schema.json` (Models + Capabilities objects). +- `tasks/HOTFIXES.md` (2026-05-27 entry, Bug #11 deviation record).