feat(pdf): scanned PDF OCR via qwen2.5vl:3b vision LLM (v0.20.0 sub-item 1) #189
Summary
v0.20.0 sub-item 1: OCR ingest of scanned PDFs that have no embedded text (book scans, receipts, camera-captured pages) via an Ollama vision LLM (qwen2.5vl:3b).
30 commits in total on feat/pdf-scanned-ocr (14 sub-item 1 base + 4 bugfix1 + 3 bugfix2 + 8 bugfix3 + 1 doc).
v0.20 sub-item 1 base (the 14 commits up to b4d9e60)
(Same as the existing PR description.)
Bugfix1 (4 commits after b4d9e60)
Fixes for 3 bugs found during dogfooding (qwen2.5vl:3b on a real Ollama instance, 9-PDF corpus):
- d9acda5: fix(source-fs): apply size limit only to code files; PDF/image/markdown bypass walker cap (Bug #2)
- 436fd01: fix(chunk): chunk_id collision under aggressive overlap; bump pdf-page-v1 → pdf-page-v1.1 (Bug #3, Critical)
- 241ded5: test(app): multi-scanned PDF chunk_id collision-free integration test (Bug #3 regression)
- e674ff4: fix(parse-pdf): F4 mojibake.pdf via pikepdf surgery; preserve 1-page invariant (Bug #4)
chunker_version cascade
pdf-page-v1 → pdf-page-v1.1 bump: an explicit invalidation under the design §9 cascade rule.
Bugfix2 (3 commits after e674ff4)
2 bugs found during the comprehensive post-bugfix1 dogfood:
- a58ee10: fix(parse-pdf): strip Identity-H Unimplemented marker + dominance heuristic in compute_valid_char_ratio (Bug #6, Critical)
- 8cf73d1: docs(cli): list 'code' in --media help string + SKILL.md (Bug #7)
- f763049: test(cli): assert 'code' in search --help output (Bug #7 regression pin)
Bug #6 (Critical) in brief
On pages where an Identity-H CID font has no ToUnicode CMap, lopdf emits the literal `?Identity-H Unimplemented?`, which is ASCII printable. The existing compute_valid_char_ratio therefore counted it as valid, so mojibake pages passed the OCR fallback threshold of 0.5 and garbage text was indexed. Fix: marker stripping plus a marker-dominance heuristic (Option B). When stripped_chars > cleaned.len(), the ratio is clamped with ratio.min(0.3), which triggers OCR.
Bugfix2 effect (dogfood retest)
metro-korea.pdf (58 MB, Identity-H): pdf_ocr_pages went from 0 to 19/21 triggered. The body changed from marker garbage to actual OCR text ("개척자 정주영", "한국은행장", etc.). Korean lexical search: "정주영" 1 hit, "한국은행장" 1 hit.
Bugfix3 (8 commits + 1 doc after f763049)
5 bugs found during the final comprehensive post-bugfix2 dogfood (the entire DOGFOOD.md §12 checklist):
- 760eee8: fix(app): flip streaming_ask + single_file_ingest capabilities to actual surface (Bug #9, Critical wire schema)
- 28f5137: fix(config): emit error.v1 code=config_not_found for missing --config path (Bug #10)
- 10b0e2f: fix(config): pdf.ocr.request_timeout_secs default 600 → 60 per dogfood evidence (Bug #11, Critical UX)
- d9c7aab: feat(schema): add active_parsers + active_chunkers arrays to schema.v1.models (Bug #13, additive minor)
- 2c7fa71: fix(cli): empty query emits error.v1 invalid_input for search + ask (Bug #14)
- 5bba95f: docs(spec): HOTFIXES entry + parent spec cross-link for Bug #11 timeout deviation
- 854a180: fix(cli): add active_parsers + active_chunkers to Models test fixture in wire.rs (Bug #13 follow-up)
- 9b44e27: test(app): update schema_report assertion for streaming_ask=true (Bug #9 follow-up)
- 46e9947: docs(superpowers): v0.20 sub-item 1 bugfix1/2/3 specs + plans + DOGFOOD.md (round artifacts)
Per-bug user-visible verification (post-bugfix3 dogfood retest)
- Bug #9. Before: `streaming_ask: false`, `single_file_ingest: false` (hardcoded false negatives, so the agent host mis-routes). After: `streaming_ask: true`, `single_file_ingest: true` ✓
- Bug #10. Before: silent fallback to the XDG default config with exit=0 (per the Bug #10 commit message). After: `error.v1 { code: config_not_found, details.path, hint }`, exit=2 ✓
- Bug #11. Before: `[pdf.ocr] request_timeout_secs = 600` (10 min/page), so 20 min were wasted on 2 pages that fail to index. After: `request_timeout_secs = 60` (1 min) ✓
- Bug #13. Before: `parser_version` + `chunker_version` (markdown only; pdf/code missing). After: `active_parsers: [code-python-v1, code-rust-v1, md-frontmatter-v2, none-v1, pdf-text-v1]` (5), `active_chunkers: [code-python-ast-v1, …, pdf-page-v1.1]` (8), with a deterministic ORDER BY ✓
- Bug #14. Before: `kebab search "" --json` silently returned `hits: []`, exit=0. After: `error.v1 { code: invalid_input, message: "query is empty…", hint }`, exit=2; ask behaves the same ✓
Bugs #8 and #12 were falsified: the 2-char query limit of the V007 trigram is a design constraint, and the body of a Code block on the wire lives in the `code` field, not `text`.
Bugfix3 workflow
- docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix3-spec.md (410 lines, round 2 closure ACCEPT, 11/11 findings)
- docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix3-plan.md (1043 lines, closure ACCEPT, 7/7 steps + 10/10 acceptance)
- .omc/reviews/2026-05-27-v0.20-bugfix3-executor-result.md (8 commits, 13/13 verifier rows green)
- .omc/reviews/2026-05-27-v0.20-final-dogfood-report.md
Workspace test + clippy
- cargo test --workspace --no-fail-fast -j 1 → all pass (1350 baseline + 8 new = 1358+)
- cargo clippy --workspace --all-targets -- -D warnings → exit 0
Wire schema additive minor (Bug #13)
Two array fields added to the models object in docs/wire-schema/v1/schema.schema.json:
- active_parsers: string[] (optional, NOT in required)
- active_chunkers: string[] (optional)
The existing parser_version + chunker_version are preserved (backward compat). integrations/claude-code/kebab/SKILL.md is updated in sync.
HOTFIXES handoff (Bug #11 parent spec deviation)
A 2026-05-27 dated entry in tasks/HOTFIXES.md, plus cross-link comments added next to §1000 / §1628 OQ-1 of docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md. The parent spec's frozen prose is unchanged (2 lines of HTML comment only).
🤖 Generated with Claude Code
Bugfix4: OCR timeout 180s + ingest log feature (commits 6a9551e~6850077)
Found by the user during the final post-bugfix3 dogfood (2026-05-28).
Commits
- 6a9551e: fix(config): pdf.ocr.request_timeout_secs default 60 → 180 (Bug #11 follow-up, sweet-spot gradual-reduction policy)
- f60304b: feat(config): add [logging] section (ingest_log_enabled + ingest_log_dir)
- f8a4c79: feat(app): IngestLogWriter + LogEvent enum (per-ingest-run ndjson log)
- bef0c98: feat(wire): PdfOcrProgress.Finished + ingest_progress.v1 additive 4 fields (image_byte_size, image_width, image_height, failure_reason)
- f9dc0f7: feat(app): wire IngestLogWriter into 5 ingest emit hooks (Arc sync)
- 415227b: test(app): ingest_log_smoke integration test (AC-9)
- 445b096: fix(test): clippy + fmt fixes for logging_roundtrip and ingest_log_smoke
- 6850077: style: cargo fmt --all (round 4 follow-up) + spec/plan/dogfood doc artifacts
Sweet-spot analysis (R5 dogfood log evidence)
Based on the summary record of {state_dir}/logs/ingest-{run_id}.ndjson: the 180s default is 5.5× the p90, which is a sufficient buffer. The default can be lowered gradually as p90 is measured on each future dogfood (e.g., 120s = p90×3.6, 90s = p90×2.7).
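As a rough illustration of the sweet-spot arithmetic above, the sketch below computes nearest-rank percentiles over sorted per-page OCR latencies and expresses a timeout as a multiple of p90. The function names, the nearest-rank method, and any latencies fed to it are assumptions for illustration; this is not the project's actual percentiles helper or its measured data.

```rust
/// Nearest-rank percentile over an ascending-sorted slice of latencies (ms).
/// `p` is a whole-number percentile in 1..=100. Returns None for an empty slice.
/// (Hypothetical helper; the real one in ingest_log.rs may differ.)
pub fn percentile(sorted_ms: &[u64], p: usize) -> Option<u64> {
    if sorted_ms.is_empty() {
        return None;
    }
    // rank = ceil(p * n / 100), computed with integers to avoid float rounding.
    let rank = (p * sorted_ms.len() + 99) / 100;
    let idx = rank.clamp(1, sorted_ms.len()) - 1;
    Some(sorted_ms[idx])
}

/// How many times the p90 latency fits into the configured timeout.
pub fn timeout_multiple(timeout_secs: u64, p90_ms: u64) -> f64 {
    (timeout_secs as f64 * 1000.0) / p90_ms as f64
}
```

With a measured p90 in hand, `timeout_multiple(180, p90_ms)` gives the headroom factor that the 5.5× figure refers to, and the same call with 120 or 90 shows how much buffer a lower default would leave.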
Wire schema additive minor (Bug #13 sibling)
4 optional fields added to the pdf_ocr_finished event in docs/wire-schema/v1/ingest_progress.schema.json:
- image_byte_size: u64, the raster image byte size.
- image_width: u32, the raster width (future).
- image_height: u32, the raster height (future).
- failure_reason: string, one of "timeout" / "ocr_error" / null.
Backward-compatible: old consumers can ignore them.
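A minimal sketch of why these four fields are additive: each is modeled as an `Option`, and absent values are left out of the serialized event, so a consumer that never reads them is unaffected. The struct name, the `FailureReason` enum, and the hand-rolled JSON fragment are hypothetical; the source only states the field names and types, and the real serializer may emit `null` rather than omit the key.

```rust
/// Hypothetical typed view of the two non-null failure_reason values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailureReason {
    Timeout,
    OcrError,
}

impl FailureReason {
    /// Wire string for this reason, as listed in the schema note.
    pub fn as_wire(self) -> &'static str {
        match self {
            FailureReason::Timeout => "timeout",
            FailureReason::OcrError => "ocr_error",
        }
    }
}

/// The four additive fields; every one is optional.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct PdfOcrFinishedExtras {
    pub image_byte_size: Option<u64>,
    pub image_width: Option<u32>,
    pub image_height: Option<u32>,
    pub failure_reason: Option<FailureReason>,
}

impl PdfOcrFinishedExtras {
    /// Renders only the fields that are present, as a comma-separated
    /// fragment of JSON members (assumption: absent fields are omitted).
    pub fn to_json_fields(&self) -> String {
        let mut parts: Vec<String> = Vec::new();
        if let Some(v) = self.image_byte_size {
            parts.push(format!("\"image_byte_size\":{v}"));
        }
        if let Some(v) = self.image_width {
            parts.push(format!("\"image_width\":{v}"));
        }
        if let Some(v) = self.image_height {
            parts.push(format!("\"image_height\":{v}"));
        }
        if let Some(r) = self.failure_reason {
            parts.push(format!("\"failure_reason\":\"{}\"", r.as_wire()));
        }
        parts.join(",")
    }
}
```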
Log feature surface
- [logging] ingest_log_enabled = true by default: the log file is created automatically.
- [logging] ingest_log_dir = "{state_dir}/logs" by default (XDG state dir).
- Per-run filename: ingest-{ISO8601}Z-{hex8}.ndjson.
- Event kinds: ocr, parse_error, skip, error, summary.
Users can analyze the log with `cat log | jq`. A SQLite mirror or a query CLI is a future enhancement (out of scope).

Step 1 (Group A) of v0.20.0 sub-item 1 (scanned PDF OCR) implementation plan.
A1: spec §4.2 line 740 prose pseudo-code fix: `app.pdf_ocr_engine.as_ref()` → local `pdf_ocr_engine: Option<OllamaVisionOcr>` built in `ingest_with_config_opts` (consistent with the §4.4 eager init; no new App field).
A2: Cargo.toml dep invariant verified (the image crate is not introduced, so the H-3 DCTDecode-only v1 invariant holds; kebab-parse-pdf and kebab-parse-image are existing deps of kebab-app). The description update is deferred to Step 3 (after the module is added).
A3: cargo tree baseline captured as the ground truth for K5 rows #9/#10 (.omc/state/pdf-ocr-{app-parse,parse-pdf}-deps.baseline.txt). It serves as the verifier's baseline for the invariant that the other steps of this sub-item make no dep graph change. Note: .omc/ is gitignored, so the baseline files exist only locally.
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (round 1c ACCEPT)
contract: §9 (additive minor wire bump, in a later step)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Step 2 (Group B) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.
B1: lopdf /Filter probe (Python re + shell grep on synthesized fixtures, result appended to docs/superpowers/poc/2026-05-27-pdf-ocr-engine-comparison.md). Key findings:
- reportlab default (useA85=1) yields /Filter [ /ASCII85Decode /DCTDecode ]; useA85=0 gives pure /Filter [ /DCTDecode ] with JPEG magic ffd8ffe0.
- Pillow RGB.save('.pdf','PDF') uses DCTDecode, so F6 FlateDecode requires manual PDF construction via zlib.compress.
- ghostscript pdfwrite rejects TIFF input (/undefined in II*), so ImageMagick `convert -compress Group4` is used for F7 CCITTFax.
B2: 5 fixtures synthesized and committed under crates/kebab-parse-pdf/tests/fixtures/:
- F1 scanned_page1.pdf: /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page1-clean.png, Korean).
- F2 scanned_page2.pdf: /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page2-clean.png, batchim / final consonants).
- F4 mojibake.pdf: DejaVu TTF + ToUnicode CMap stripped (count=0); Noto CJK TTC has PostScript outlines unsupported by reportlab.
- F6 flate_raw.pdf: /Filter /FlateDecode, DCTDecode absent (skip path input).
- F7 ccitt.pdf: /Filter [ /CCITTFaxDecode ], DCTDecode absent (skip path input).
Synth scripts under tests/fixtures/_synth/:
- scanned_pdf.py: F1/F2 reportlab drawImage + JPEG passthrough (useA85=0).
- mojibake.py: F4 reportlab DejaVu TTF + ToUnicode strip.
- flate_ccittfax.sh: F6 manual zlib PDF + F7 Pillow TIFF group4 + ImageMagick convert.
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§5.1)
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 2 B1+B2)
contract: §9 (additive minor wire bump, in a later step)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Step 3 (Group C) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.
C1: `page_image::extract_dctdecode_page_image(pdf_doc, page_num)` -> Result<Option<Vec<u8>>>. Traverses Resources/XObject via lopdf and inspects the /Filter of the first image XObject (covers both the single Name and the Array form, spec §4.1 line 642-664). If it is DCTDecode and the JPEG magic check passes, the raw bytes are returned. Any other encoding, or no image XObject, yields Ok(None). v1 scope = DCTDecode passthrough only (H-3 invariant, image crate not introduced).
Integration test (`tests/page_image.rs`, 2 tests):
- f1_fixture_yields_dctdecode_jpeg_bytes: F1 fixture happy path.
- flate_raw_fixture_yields_none: F6 fixture negative path.
C2: `text_quality::compute_valid_char_ratio(s) -> f32`. valid char = ASCII printable + Hangul (Jamo/Compatibility/Syllables) + CJK + Latin Extended + common Korean punctuation. Empty string → 0.0. The caller (`kebab-app::pdf_ocr_apply`) compares the result against the threshold (default 0.5).
Unit test (`mod tests`, 7 + F4 conditional):
- empty / pure ASCII / pure Hangul / pure PUA / mixed half / CJK / Hangul Jamo.
- f4_fixture_ratio_under_threshold: active (case A: lopdf extract_text returns an empty string when the ToUnicode CMap is absent → valid_ratio = 0.0000 < 0.3).
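The C2 ratio can be sketched roughly as follows, with the marker-dominance clamp from the later Bug #6 fix folded in. This is a sketch under stated assumptions: the exact Unicode ranges, the decision to count whitespace as valid, and the use of character counts (rather than byte length) for the dominance check are all illustrative choices, not the crate's confirmed implementation.

```rust
/// Literal that lopdf emits on Identity-H pages without a ToUnicode CMap
/// (taken from the Bug #6 description).
const MARKER: &str = "?Identity-H Unimplemented?";

/// Assumed "valid" ranges; the real list in text_quality may differ.
fn is_valid_char(c: char) -> bool {
    matches!(
        c,
        ' '..='~'                     // ASCII printable
        | '\u{00A0}'..='\u{024F}'     // Latin-1 Supplement + Latin Extended-A/B
        | '\u{1100}'..='\u{11FF}'     // Hangul Jamo
        | '\u{3000}'..='\u{303F}'     // CJK symbols and punctuation
        | '\u{3130}'..='\u{318F}'     // Hangul Compatibility Jamo
        | '\u{4E00}'..='\u{9FFF}'     // CJK Unified Ideographs
        | '\u{AC00}'..='\u{D7A3}'     // Hangul Syllables
    )
}

/// Ratio of valid characters in `s` after stripping the lopdf marker.
/// Empty (or marker-only) input yields 0.0.
pub fn compute_valid_char_ratio(s: &str) -> f32 {
    let cleaned = s.replace(MARKER, "");
    let total = cleaned.chars().count();
    let stripped_chars = s.chars().count() - total;

    let ratio = if total == 0 {
        0.0
    } else {
        let valid = cleaned
            .chars()
            .filter(|&c| is_valid_char(c) || c.is_whitespace())
            .count();
        valid as f32 / total as f32
    };

    // Marker dominance: if more text was stripped than remains, cap the
    // ratio so it falls below the default 0.5 OCR threshold.
    if stripped_chars > total {
        ratio.min(0.3)
    } else {
        ratio
    }
}
```

A caller would then treat a page as needing OCR when the returned ratio is below the configured threshold.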
Also: Cargo.toml description updated ("Text PDF extractor + scanned-page image extract helpers ...", deferred from Step 1 A2). Fixture fix: startxref in mojibake.pdf 22130 → 22114 (corrects a 16-byte offset error that kept lopdf's strict parser from finding the xref).
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.1 line 600-722)
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 3 C1+C2)
prior: aeeff36 (Step 2 fixtures) + fb3952d (Step 2 F7 record fix)
contract: §9 (additive minor wire bump, in a later step)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Step 4 (Group D) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.
D1: `apply_ocr_to_pdf_pages(&mut canonical, &dyn OcrEngine, &bytes, &opts, emit_progress)` in `kebab-app::pdf_ocr_apply`. Follows the spec §4.1 line 381-599 body as written, plus the PdfOcrOpts.cancel field and a per-page cancel check (verifier LOW L-1).
Post-extract enrichment pattern (H-1 resolution): kebab-parse-pdf does not import kebab-parse-image::OcrEngine (parser isolation preserved). The helper lives inside the kebab-app facade, which avoids a cross-import between the two parser crates.
Per-page decision matrix (spec §4.1 line 459-464):
- always_on=true → OCR every page (dual-block, ordinal = page-1 + page_count).
- always_on=false + needs_ocr → in-place OCR (text-detect block mutate).
- needs_ocr=false → skip.
DCTDecode-only v1 (H-3): for FlateDecode / CCITTFaxDecode pages, extract_dctdecode_page_image returns None → Warning event + skip + emit_progress(skipped=true). OcrEngine.recognize failure → Warning event + skip + emit_progress(skipped=true).
D3: per-page cancel handle (verifier LOW L-1 + spec §4.8 line 1159): PdfOcrOpts.cancel: Option<Arc<AtomicBool>>. When it is set to true, `anyhow::bail!("PDF OCR cancelled mid-PDF at page N")`.
lopdf = "0.32" added to [dependencies] (already transitive via kebab-parse-pdf; no new crate introduced, so the kebab-parse-* dep graph baseline is unchanged).
Integration test (`tests/pdf_ocr_apply.rs`, 10 tests):
- f1_input_with_ocr_enabled_replaces_empty_block: in-place mutate.
- f3_input_with_ocr_enabled_keeps_text_detect_blocks: vector PDF skip.
- f1_input_with_ocr_disabled_keeps_empty_block: disabled no-op.
- f4_input_with_ocr_enabled_replaces_mojibake_block: mojibake → in-place mutate.
- f3_input_with_always_on_pushes_dual_blocks: always_on dual-block.
- f6_flatedecode_skipped_with_warning: FlateDecode skip + Warning event.
- f7_ccittfax_skipped_with_warning: CCITTFax skip + Warning event (verifier M-4 split).
- ocr_engine_failure_surfaces_as_warning: OCR failure → Warning event.
- dual_block_ordinals_are_deterministic_and_unique: ordinal invariant.
- cancel_handle_aborts_mid_pdf: production source of the cancel handle (D3).
MockOcrEngine fixture: spec §5.5 line 1284-1299. No F3 fixture exists, so the tests use mock CanonicalDocument construction plus the F1-bytes reuse pattern (Option B: build the canonical document through the real production path via PdfTextExtractor::extract).
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.1 + §5.5)
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 4 D1+D2+D3)
prior: c2cd3a7 (Step 3) + 8d81bc1 (Step 3 clippy fix)
contract: §9 (additive minor wire bump, in a later step)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Step 5 (Group F) of v0.20.0 sub-item 1 (scanned PDF OCR) plan, bundled with the Step 4 reviewer Important I-1 fix (PdfOcrOpts field doc).
F1: `kebab-config::PdfCfg` + `PdfOcrCfg` + 4 default fn:
- PdfCfg { ocr: PdfOcrCfg }.
- PdfOcrCfg with 11 fields (enabled/always_on/engine/model/endpoint/languages/max_pixels/request_timeout_secs/valid_ratio_threshold/min_char_count/lang_hint).
- defaults: opt-in (enabled=false), qwen2.5vl:3b, 0.5 threshold, 20 chars.
- mirrors the image OCR cfg pattern (spec §4.5).
Config struct extension:
- `pdf: PdfCfg` field with `#[serde(default = "PdfCfg::defaults")]`.
11 env var overrides (parallel to KEBAB_IMAGE_OCR_*): KEBAB_PDF_OCR_{ENABLED,ALWAYS_ON,ENGINE,MODEL,ENDPOINT,LANGUAGES,MAX_PIXELS,REQUEST_TIMEOUT_SECS,VALID_RATIO_THRESHOLD,MIN_CHAR_COUNT,LANG_HINT}.
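The env override layer described in F1 might look roughly like this for a subset of the fields. It is a sketch only: the struct here carries 4 of the 11 fields, the lookup is injected as a closure so it can be exercised without touching the process environment, and ignoring unparsable values (keeping the previous value) is an assumed policy rather than confirmed behavior of kebab-config.

```rust
/// Subset of the PDF OCR config, with the defaults listed in F1.
#[derive(Debug, Clone, PartialEq)]
pub struct PdfOcrCfg {
    pub enabled: bool,
    pub model: String,
    pub valid_ratio_threshold: f32,
    pub min_char_count: usize,
}

impl Default for PdfOcrCfg {
    fn default() -> Self {
        Self {
            enabled: false, // opt-in
            model: "qwen2.5vl:3b".to_string(),
            valid_ratio_threshold: 0.5,
            min_char_count: 20,
        }
    }
}

/// Overwrites `slot` only when a value is present and parses cleanly.
/// (Assumption: malformed values are ignored, not reported.)
fn override_parsed<T: std::str::FromStr>(slot: &mut T, raw: Option<String>) {
    if let Some(v) = raw.and_then(|s| s.trim().parse::<T>().ok()) {
        *slot = v;
    }
}

/// Applies KEBAB_PDF_OCR_* overrides using the supplied lookup function.
pub fn apply_env_overrides(cfg: &mut PdfOcrCfg, get: impl Fn(&str) -> Option<String>) {
    override_parsed(&mut cfg.enabled, get("KEBAB_PDF_OCR_ENABLED"));
    override_parsed(&mut cfg.model, get("KEBAB_PDF_OCR_MODEL"));
    override_parsed(
        &mut cfg.valid_ratio_threshold,
        get("KEBAB_PDF_OCR_VALID_RATIO_THRESHOLD"),
    );
    override_parsed(&mut cfg.min_char_count, get("KEBAB_PDF_OCR_MIN_CHAR_COUNT"));
}
```

In a real binary the lookup would typically be `|k| std::env::var(k).ok()`; keys that are unset leave the defaults in place.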
F2: `crates/kebab-config/tests/pdf_ocr.rs` (new):
- toml roundtrip (11 fields).
- defaults (opt-in + qwen2.5vl:3b).
- env override (4-key sample + default preservation).
F3 (Step 4 I-1): doc comments for the 4 public items of `pdf_ocr_apply.rs`:
- PdfOcrOpts struct + 6 fields.
- PdfOcrSummary struct + 2 fields.
- apply_ocr_to_pdf_pages fn (including an Errors block).
- PdfOcrProgress enum + 2 variants + 5 fields.
No body changes; doc-only.
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.5)
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 5 F1+F2)
prior: 9f003ef (Step 4), code reviewer Important I-1 resolution
contract: §9 (additive minor wire bump, Step 7)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Step 7 (Group G) of v0.20.0 sub-item 1 (scanned PDF OCR) plan + fixes for Step 6 code reviewer Important M-4 (skipped field carry) and Minor M-2 (ordering invariant doc).
G3: JSON Schema sync (additive minor, schema_version preserved):
ingest_progress.schema.json:
- 2 kind enum values added: pdf_ocr_started + pdf_ocr_finished.
- new fields: page (1-based PDF page), ocr_engine (engine_name), skipped (bool).
- descriptions of the existing ms / chars fields updated (now also carried by pdf_ocr_finished).
ingest_report.schema.json:
- items.items.properties newly defined (previously only the stub ["array", "null"]).
- pdf_ocr_pages + pdf_ocr_ms_total (nullable integer).
- all existing IngestItem fields also made explicit (kind, doc_path, byte_len, ...).
Step 6 reviewer M-4 (Important), skipped field carry:
- skipped: bool added to IngestEvent::PdfOcrFinished.
- the emit closure in ingest_one_pdf_asset (lib.rs:~1864) now propagates the source PdfOcrProgress::Finished { skipped } instead of discarding it.
Step 6 reviewer M-2 (Minor), ordering invariant doc:
- ordering text in crates/kebab-app/src/ingest_progress.rs updated: ScanStarted < ScanCompleted < (AssetStarted [< (PdfOcrStarted < PdfOcrFinished)*] < AssetFinished)* < (Completed | Aborted).
No .md docs exist under docs/wire-schema/v1/*.md, so the .md deliverable of plan §3 Step 7 G3 is retroactively N/A (0 such files).
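The ordering invariant above reads as a small grammar, and one way to check an observed event stream against it is a state machine like the sketch below. The `Ev` enum and `is_valid_order` function are hypothetical names that carry only the event kind, not the real IngestEvent payloads, and the checker follows the grammar literally (for example, it accepts Aborted only between assets).

```rust
/// Event kinds from the ordering invariant (payloads omitted).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ev {
    ScanStarted,
    ScanCompleted,
    AssetStarted,
    PdfOcrStarted,
    PdfOcrFinished,
    AssetFinished,
    Completed,
    Aborted,
}

/// Returns true if `events` matches:
/// ScanStarted < ScanCompleted
///   < (AssetStarted [< (PdfOcrStarted < PdfOcrFinished)*] < AssetFinished)*
///   < (Completed | Aborted)
pub fn is_valid_order(events: &[Ev]) -> bool {
    #[derive(Clone, Copy)]
    enum St {
        Init,     // nothing seen yet
        Scanning, // after ScanStarted
        Idle,     // between assets
        InAsset,  // inside an asset, no OCR page in flight
        InOcr,    // inside an asset, one OCR page in flight
        Done,     // terminal event seen
    }

    let mut st = St::Init;
    for &e in events {
        st = match (st, e) {
            (St::Init, Ev::ScanStarted) => St::Scanning,
            (St::Scanning, Ev::ScanCompleted) => St::Idle,
            (St::Idle, Ev::AssetStarted) => St::InAsset,
            (St::InAsset, Ev::PdfOcrStarted) => St::InOcr,
            (St::InOcr, Ev::PdfOcrFinished) => St::InAsset,
            (St::InAsset, Ev::AssetFinished) => St::Idle,
            (St::Idle, Ev::Completed | Ev::Aborted) => St::Done,
            _ => return false, // any other transition violates the order
        };
    }
    matches!(st, St::Done)
}
```

A test could collect the kinds emitted during an ingest run and assert `is_valid_order(&kinds)`, which also implies that Started and Finished OCR events are paired within each asset.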
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 7 G3)
prior: b9ee09f (Step 6 wiring) + Step 6 reviewer M-4/M-2 recommendations
contract: §9 (additive minor wire bump, schema_version preserved)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Step 8 (Group H) of v0.20.0 sub-item 1 (scanned PDF OCR) plan + Step 7 reviewer concern fix (spec literal deviation).
H1: kebab-cli/src/progress.rs printer activation:
- The old no-op stub `IngestEvent::PdfOcr* { .. } => {}` (Step 6 placeholder) is activated as a human-friendly stderr line printer.
- Wording follows spec §4.6.1 line 1085-1086:
  - PdfOcrStarted → ` 📷 OCR page {page}...`
  - PdfOcrFinished (skipped=false) → ` ✓ OCR page {page} ({chars} chars, {ms}ms via {ocr_engine})`
  - PdfOcrFinished (skipped=true) → ` ⊘ OCR page {page} skipped (no DCTDecode or engine fail, {ms}ms)` (uses the skipped field carried by M-4)
- Consistent with the `!quiet` gate (mirrors the AssetStarted/Finished pattern).
H2: new test in crates/kebab-app/tests/ingest_progress.rs:
- pdf_ocr_progress_emits_started_finished_events (depends on real Ollama, `#[ignore]`).
- Verifies that ingesting the F1 fixture (scanned_page1.pdf) emits pdf_ocr_started + pdf_ocr_finished events. Invariant: Started count == Finished count.
- Manual invoke: `KEBAB_PDF_OCR_ENABLED=true cargo test -p kebab-app --test ingest_progress -- --ignored`.
- No mock OcrEngine injection path exists (Step 6 eager build), so this follows the same pattern as Step 9 I5 ocr_e2e (real Ollama + `#[ignore]`).
Step 7 reviewer concern fix, spec §4.6.1 literal:
- the `ocr_ms` / `ocr_chars` literals on line 1076-1077 updated to the actual wire schema field names `ms` / `chars` (option_A, consistent with Rust serde).
- the printer wording on line 1087 likewise changed from `{ocr_chars}` / `{ocr_ms}` to `{chars}` / `{ms}`.
- the rationale reference on line 1556 changed from `pdf_ocr_finished.ocr_ms` to `.ms`.
- the `skipped` field is now also stated explicitly (result of Step 6 reviewer M-4).
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.6.1)
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 8 H1+H2)
prior: 4c5ccd5 (Step 7 wire schema), fix for Step 7 reviewer concern 1
contract: §9 (additive minor wire bump, completed in the Step 7 commit)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Step 10 (Group J) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.
J0: release notes path decision: the commit body (no RELEASE_NOTES.md / docs/RELEASE_NOTES_*.md exists; follows the v0.17.x/v0.18.0 pattern of release notes in the commit body). Inlined in the Step 11 K1 commit body.
J1: README.md:
- `[pdf.ocr]` added to the toml table list in the Configuration section.
- new sub-section `### [pdf.ocr] — scanned PDF OCR (v0.20.0+)`: 11-field toml example + 11 `KEBAB_PDF_OCR_*` env overrides + force-reingest UX ("scanned PDFs indexed under v0.19 do not get OCR applied automatically after upgrading to v0.20; `kebab ingest --force` is required").
J2: HANDOFF.md:
- P7 row of the phase status extended: 3/3 components + post-extract OCR enrichment (v0.20.0 sub-item 1, qwen2.5vl:3b vision LLM).
- "머지 후 발견된 결정" (decisions discovered after merge) entry: design + scope of v0.20 sub-item 1 (H-1 post-extract pattern + DCTDecode-only v1 + parser_version preserved + H-4 UX).
J3: docs/ARCHITECTURE.md:
- OCR row split: `OCR (image)` (gemma4:e4b, unchanged) + `OCR (PDF, v0.20.0+)` (qwen2.5vl:3b, post-extract enrichment via kebab-app::pdf_ocr_apply, DCTDecode-only v1, family asymmetry: PoC alnum 94.79% vs gemma4 27%).
- PDF parser row extended: page_image::extract_dctdecode_page_image (v0.20.0) + parser_version "pdf-text-v1" preserved + differentiated provenance events.
J3: docs/SMOKE.md:
- isolated `[pdf.ocr]` config example (enabled=true, model=qwen2.5vl:3b).
- new dogfood section `### v0.20 force-reingest (scanned PDF OCR)`: explicit `kebab ingest --force` invocation for the v0.19 → v0.20 upgrade path.
J4: release notes draft (source for the Step 11 K1 commit body):
- recorded in the result file (4 topics: opt-in + force-reingest + DCTDecode-only + family asymmetry + dogfood/test results).
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§6.4)
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 10 J0-J4)
prior: 1d4e301 (Step 9 + Cargo.lock follow-up)
contract: §9
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bugfix2 round ready for review
feat/pdf-scanned-ocr force-updated e674ff4 → f763049 (3 commits added):
- a58ee10: Bug #6 (Critical) Identity-H mojibake marker strip + dominance heuristic
- 8cf73d1: Bug #7 (Minor doc) CLI --media help text
- f763049: Bug #7 regression pin test
Verification
pdf_ocr_pages: 0 → 19/21 (OCR triggered)
Round trail
- .omc/reviews/2026-05-27-v0.20-bugfix-dogfood-report.md
- docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md (308 lines)
- docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix2-plan.md (388 lines)
- .omc/reviews/2026-05-27-v0.20-bugfix2-executor-result.md
Ready for merge decision.
Before: `kebab search "rust" --config /tmp/nonexistent.toml --json` exited 0 with `{"hits":[]}`, silently falling back to the XDG default. A typo or wrong path surfaced only as 0 hits, which is a debugging nightmare.
After: added kebab_config::ConfigNotFound (thiserror::Error); the `Some(p) if !p.exists()` arm of Config::load returns anyhow::Error::new(ConfigNotFound { path }). kebab_app::error_wire::classify downcasts it and emits ErrorV1 with code=config_not_found, hint, and details.path as ndjson on stderr.
R-1 (relative path): std::path::Path::exists() resolves relative to cwd, so absolute and relative paths are both covered with no extra work. Verified by two integration tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bugfix3 round ready for review
feat/pdf-scanned-ocr force-updated f763049 → 46e9947 (9 commits added: 8 code + 1 doc). Fixes for the 5 bugs found in the final comprehensive post-bugfix2 dogfood (entire DOGFOOD.md §12).
Per-bug user-visible verification (post-fix dogfood retest)
Verification
`-D warnings` EXIT=0
Workflow trail
- .omc/reviews/2026-05-27-v0.20-final-dogfood-report.md
- docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix3-spec.md
- docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix3-plan.md
- .omc/reviews/2026-05-27-v0.20-bugfix3-executor-result.md
Wire schema additive minor
models.active_parsers + active_chunkers in docs/wire-schema/v1/schema.schema.json are optional (NOT required), so the change is backward-compatible. integrations/claude-code/kebab/SKILL.md is updated in sync.
HOTFIXES (Bug #11 deviation)
2026-05-27 entry in tasks/HOTFIXES.md + parent spec cross-link comment (no change to the parent spec prose).
30 commits in total on the branch. Ready for merge decision.
Config side of the v0.20.x ingest log surface. New LoggingCfg struct:
* ingest_log_enabled (bool, default true)
* ingest_log_dir (PathBuf, default "{state_dir}/logs")
With the #[serde(default)] tag, a pre-v0.20 config that lacks the [logging] section is initialized automatically with LoggingCfg::default() (AC-10 backward compat). The actual expansion of the {state_dir} placeholder is handled by the expand_log_dir helper in step 2 (IngestLogWriter), because kebab-config's expand_path_with_base does not support {state_dir} (spec §6 R-3).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Module side of the v0.20.x ingest log surface. New crates/kebab-app/src/ingest_log.rs:
* IngestLogWriter: open/write_event/write_summary/flush + flush on Drop
* LogEvent enum with 4 variants (ocr / parse_error / skip / error)
* IngestSummary struct (kind="summary" literal + 11 stat fields)
* generate_run_id (ISO 8601 prefix + last 8 hex digits of a uuid v7)
* expand_log_dir (hand-rolled expansion of the {state_dir} placeholder)
* now_ts (Rfc3339 UTC helper)
* percentiles helper (sorted Vec p50/p90/max)
uuid v7 is a workspace dep, which avoids a new dependency on rand (spec §6 R-5). This step is a self-contained writer + 5 unit tests. Wiring the 5 emit hooks of the ingest pipeline is step 4.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Wire side of the v0.20.x ingest log feature. Additive minor cascade:
* 4 fields on PdfOcrProgress::Finished + IngestEvent::PdfOcrFinished:
  - image_byte_size: Option<u64>
  - image_width: Option<u32>
  - image_height: Option<u32>
  - failure_reason: Option<String>
* docs/wire-schema/v1/ingest_progress.schema.json: 4 added properties (all optional, required unchanged = additive minor)
* integrations/claude-code/kebab/SKILL.md: wire schema description synced
Existing ingest_progress.v1 consumers (CLI wire dump, integration test fixtures, kebab-cli wire_search/wire_ask) stay backward-compatible because the 4 added fields are Option::None. No version bump (an additive minor is not a binary-version cascade trigger per CLAUDE.md §Versioning cascade).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New crates/kebab-app/tests/ingest_log_smoke.rs:
* ingest_log_smoke (AC-9): tempdir + 1 md + 1 scanned PDF → ingest → assert that the log file exists, each line is valid JSON, each kind ∈ {ocr,parse_error,skip,error,summary}, the last line has kind=summary, and scanned>0.
* ingest_log_disabled_emits_no_file (AC-6): verifies that with enabled=false there are 0 ingest-*.ndjson files in log_dir.
Fixture: reuses ../kebab-parse-pdf/tests/fixtures/scanned_page1.pdf (OCR disabled, so the smoke test runs without Ollama).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* kebab-config/tests/logging_roundtrip.rs: r#"..."# → plain string (clippy::unnecessary_hashes).
* kebab-app/tests/ingest_log_smoke.rs: |e| e.ok() → Result::ok, |s| s.as_u64() → Value::as_u64 (clippy::redundant_closure).
* cargo fmt --all applied to pre-existing formatting drift.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bugfix4 + ingest log feature ready for review
feat/pdf-scanned-ocr: 8 commits added, 6a9551e → 6850077 (HEAD).
Round 4 commits
- 6a9551e: fix(config): OCR timeout 60s → 180s (Bug #11 follow-up, sweet-spot gradual reduction)
- f60304b: feat(config): [logging] section (ingest_log_enabled + ingest_log_dir)
- f8a4c79: feat(app): IngestLogWriter + LogEvent enum
- bef0c98: feat(wire): PdfOcrProgress.Finished + ingest_progress.v1 additive 4 fields
- f9dc0f7: feat(app): wire IngestLogWriter into 5 ingest emit hooks (Arc sync)
- 415227b: test(app): ingest_log_smoke integration test
- 445b096: fix(test): clippy + fmt fixes
- 6850077: style: cargo fmt --all + spec/plan/dogfood artifacts
Verification (R5 dogfood)
`-D warnings` exit 0.
Log file auto-creation verified
Log file from the R5 dogfood run: ~/.local/state/kebab/logs/ingest-20260528T042332Z-b9352652.ndjson
Sweet-spot evidence (from log summary)
180s default = p90 × 5.5, a sufficient buffer. It can be lowered gradually in the future.
Wire schema additive minor
The 4 fields added to the ingest_progress.v1 pdf_ocr_finished event are optional (image_byte_size + image_width + image_height + failure_reason). Backward-compatible.
Configurable log path
Per-ingest-run filename = ingest-{ISO8601}Z-{hex8}.ndjson. Users can analyze it with jq.
Ready for merge decision.