feat(pdf): scanned PDF OCR via qwen2.5vl:3b vision LLM (v0.20.0 sub-item 1) #189

Merged
altair823 merged 39 commits from feat/pdf-scanned-ocr into main 2026-05-28 04:37:54 +00:00

Summary

v0.20.0 sub-item 1: OCR ingest of scanned PDFs with no embedded text (book scans, receipts, camera-captured pages) via an Ollama vision LLM (qwen2.5vl:3b).

30 commits in total on feat/pdf-scanned-ocr (14 sub-item 1 base + 4 bugfix1 + 3 bugfix2 + 8 bugfix3 + 1 doc).

v0.20 sub-item 1 base (14 commits up to b4d9e60)

(Original PR description, unchanged)

Bugfix1 (4 commits after b4d9e60)

Fixes for 3 bugs found while dogfooding (real Ollama with qwen2.5vl:3b, 9-PDF corpus):

  • d9acda5 — fix(source-fs): apply size limit only to code files; PDF/image/markdown bypass walker cap (Bug #2)
  • 436fd01 — fix(chunk): chunk_id collision under aggressive overlap; bump pdf-page-v1 → pdf-page-v1.1 (Bug #3, Critical)
  • 241ded5 — test(app): multi-scanned PDF chunk_id collision-free integration test (Bug #3 regression)
  • e674ff4 — fix(parse-pdf): F4 mojibake.pdf via pikepdf surgery; preserve 1-page invariant (Bug #4)

chunker_version cascade

pdf-page-v1 → pdf-page-v1.1 bump: explicit invalidation per the design §9 cascade rule.

Bugfix2 (3 commits after e674ff4)

2 bugs found in the full-coverage dogfood run after bugfix1:

  • a58ee10 — fix(parse-pdf): strip Identity-H Unimplemented marker + dominance heuristic in compute_valid_char_ratio (Bug #6, Critical)
  • 8cf73d1 — docs(cli): list 'code' in --media help string + SKILL.md (Bug #7)
  • f763049 — test(cli): assert 'code' in search --help output (Bug #7 regression pin)

Bug #6 (Critical): key points

On pages whose Identity-H CID font has no ToUnicode CMap defined, lopdf emits the literal ?Identity-H Unimplemented?. That literal is printable ASCII, so the old compute_valid_char_ratio counted it as valid → mojibake pages passed the OCR fallback threshold of 0.5 → garbage text was indexed. Fix: marker stripping + a marker-dominance heuristic (Option B). When stripped_chars > cleaned.len(), the ratio is capped via ratio.min(0.3) → OCR is triggered.
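
The heuristic above can be sketched roughly as follows. This is a minimal illustration, not the project's `compute_valid_char_ratio`: the function names and the simplified validity check (printable ASCII plus Hangul syllables only) are assumptions; only the marker literal, the `stripped_chars > cleaned.len()` condition, and the `0.3` cap come from the description.

```rust
/// Marker literal quoted in the description.
const MARKER: &str = "?Identity-H Unimplemented?";

/// Simplified validity check (assumption): printable ASCII or Hangul syllables.
fn is_valid_char(c: char) -> bool {
    matches!(c, ' '..='~' | '\u{AC00}'..='\u{D7A3}')
}

/// Valid-char ratio computed after stripping the marker. When the stripped
/// marker text outweighs what is left, the ratio is capped at 0.3 so that a
/// 0.5 threshold still sends the page to OCR.
fn valid_char_ratio_with_marker_guard(text: &str) -> f32 {
    let cleaned = text.replace(MARKER, "");
    // The marker is pure ASCII, so the byte difference equals the char count.
    let stripped_chars = text.len() - cleaned.len();
    let total = cleaned.chars().count();
    let ratio = if total == 0 {
        0.0
    } else {
        cleaned.chars().filter(|c| is_valid_char(*c)).count() as f32 / total as f32
    };
    if stripped_chars > cleaned.len() {
        ratio.min(0.3)
    } else {
        ratio
    }
}
```

With this shape, a page consisting almost entirely of marker text can never score above 0.3, while a page that merely contains one stray marker inside long valid text keeps its real ratio.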

Bugfix2 effect (dogfood retest)

metro-korea.pdf (58 MB, Identity-H): pdf_ocr_pages went from 0 to 19/21 pages triggered. The marker garbage in the body was replaced by actual OCR text ("개척자 정주영", "한국은행장", etc.). Korean lexical search: "정주영" 1 hit, "한국은행장" 1 hit.

Bugfix3 (8 commits + 1 doc commit after f763049)

5 bugs found in the final full-coverage dogfood run after bugfix2 (the entire DOGFOOD.md §12 checklist):

  • 760eee8 — fix(app): flip streaming_ask + single_file_ingest capabilities to actual surface (Bug #9, Critical wire schema)
  • 28f5137 — fix(config): emit error.v1 code=config_not_found for missing --config path (Bug #10)
  • 10b0e2f — fix(config): pdf.ocr.request_timeout_secs default 600 → 60 per dogfood evidence (Bug #11, Critical UX)
  • d9c7aab — feat(schema): add active_parsers + active_chunkers arrays to schema.v1.models (Bug #13, additive minor)
  • 2c7fa71 — fix(cli): empty query emits error.v1 invalid_input for search + ask (Bug #14)
  • 5bba95f — docs(spec): HOTFIXES entry + parent spec cross-link for Bug #11 timeout deviation
  • 854a180 — fix(cli): add active_parsers + active_chunkers to Models test fixture in wire.rs (Bug #13 follow-up)
  • 9b44e27 — test(app): update schema_report assertion for streaming_ask=true (Bug #9 follow-up)
  • 46e9947 — docs(superpowers): v0.20 sub-item 1 bugfix1/2/3 specs + plans + DOGFOOD.md (round artifacts)

Per-bug user-visible verification (post-bugfix3 dogfood retest)

| Bug | Pre-fix | Post-fix |
|-----|---------|----------|
| #9 capabilities | streaming_ask: false, single_file_ingest: false (hardcoded false negatives; the agent host mis-routes) | streaming_ask: true, single_file_ingest: true ✓ |
| #10 invalid config | silent fallback to default, exit=0, 0 hits (debugging nightmare) | error.v1 { code: config_not_found, details.path, hint }, exit=2 ✓ |
| #11 OCR timeout | [pdf.ocr] request_timeout_secs = 600 (10 min/page) → 20 min wasted on 2 pages that fail to index | request_timeout_secs = 60 (1 min) ✓ |
| #13 schema.models | single parser_version + chunker_version (markdown only; pdf/code missing) | active_parsers: [code-python-v1, code-rust-v1, md-frontmatter-v2, none-v1, pdf-text-v1] (5), active_chunkers: [code-python-ast-v1, …, pdf-page-v1.1] (8), deterministic ORDER BY ✓ |
| #14 empty query | kebab search "" --json silently returns hits: [], exit=0 | error.v1 { code: invalid_input, message: "query is empty…", hint }, exit=2; same for ask ✓ |
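
As an illustration of the Bug #14 row, an empty-query guard might look like the sketch below. The `ErrorV1` struct, the function names, the hint text, and the choice to treat whitespace-only queries as empty are all assumptions made for the example; only the `invalid_input` code and exit status 2 come from the table.

```rust
/// Hypothetical mirror of the error.v1 fields named in the table.
#[derive(Debug, PartialEq)]
struct ErrorV1 {
    code: &'static str,
    message: String,
    hint: &'static str,
}

/// Exit status the table reports for these errors.
const EXIT_INVALID: i32 = 2;

/// Rejects empty queries up front instead of returning an empty hit list.
/// Assumption: whitespace-only input counts as empty.
fn validate_query(query: &str) -> Result<&str, ErrorV1> {
    let trimmed = query.trim();
    if trimmed.is_empty() {
        return Err(ErrorV1 {
            code: "invalid_input",
            message: "query is empty".to_string(),
            // Placeholder hint text, not the CLI's actual wording.
            hint: "pass a non-empty query string",
        });
    }
    Ok(trimmed)
}

/// Maps the validation result to a process exit status.
fn exit_code_for(result: &Result<&str, ErrorV1>) -> i32 {
    if result.is_ok() { 0 } else { EXIT_INVALID }
}
```

Sharing one guard between search and ask is one way to keep the two commands behaving identically, as the Post-fix column requires.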

Bug #8 and #12 were falsified (the V007 trigram's 2-char query limit is a design constraint; the body of a Code block on the wire is in the code field, not in text).

Bugfix3 workflow

  • spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix3-spec.md (410 line, round 2 closure ACCEPT 11/11 finding)
  • plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix3-plan.md (1043 line, closure ACCEPT 7/7 step + 10/10 acceptance)
  • executor result: .omc/reviews/2026-05-27-v0.20-bugfix3-executor-result.md (8 commit, 13/13 verifier row green)
  • dogfood report: .omc/reviews/2026-05-27-v0.20-final-dogfood-report.md

Workspace test + clippy

  • cargo test --workspace --no-fail-fast -j 1 → all tests pass (1350 baseline + new 8 = 1358+)
  • cargo clippy --workspace --all-targets -- -D warnings → exit 0

Wire schema additive minor (Bug #13)

Two array fields added to the models object in docs/wire-schema/v1/schema.schema.json:

  • active_parsers: string[] (optional, NOT in required)
  • active_chunkers: string[] (optional)

The existing parser_version + chunker_version are preserved (backward compat). integrations/claude-code/kebab/SKILL.md is updated in sync.
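
Because the two arrays are optional, a consumer can fall back to the legacy single-version fields when they are absent. The sketch below shows that fallback; the `Models` struct and both helper functions are hypothetical consumer-side code, not part of kebab.

```rust
/// Hypothetical consumer-side view of the schema.v1 `models` object.
struct Models {
    parser_version: String,
    chunker_version: String,
    /// Optional additive fields: absent when talking to an older producer.
    active_parsers: Option<Vec<String>>,
    active_chunkers: Option<Vec<String>>,
}

/// Prefers the new array and falls back to the legacy single value.
fn parsers_in_use(models: &Models) -> Vec<String> {
    match &models.active_parsers {
        Some(list) => list.clone(),
        None => vec![models.parser_version.clone()],
    }
}

/// Same fallback for chunkers.
fn chunkers_in_use(models: &Models) -> Vec<String> {
    match &models.active_chunkers {
        Some(list) => list.clone(),
        None => vec![models.chunker_version.clone()],
    }
}
```

An old consumer that never reads the arrays keeps working unchanged, which is what makes the bump additive.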

HOTFIXES handoff (Bug #11 parent spec deviation)

Added a 2026-05-27 dated entry to tasks/HOTFIXES.md, plus cross-link comments next to §1000 / §1628 OQ-1 in docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md. The frozen prose of the parent spec is unchanged (2 lines of HTML comments only).

🤖 Generated with Claude Code

Bugfix4 — OCR timeout 180s + ingest log feature (commit 6a9551e ~ 6850077)

Found by the user in the final post-bugfix3 dogfood run (2026-05-28):

  • The 60s OCR default forces a timeout on dense Korean newspaper pages (page 8/9/13) → one more page fails to be indexed (2 in round 2 → 3 in round 3).
  • Explicit user requests: "60s → 180s, plus a policy of gradually tightening toward the sweet spot" and "detailed logs for OCR/ingest failures, plus a statistics surface".

Commits

  • 6a9551e - fix(config): pdf.ocr.request_timeout_secs default 60 → 180 (Bug #11 follow-up, sweet-spot gradual-reduction policy)
  • f60304b — feat(config): add [logging] section (ingest_log_enabled + ingest_log_dir)
  • f8a4c79 — feat(app): IngestLogWriter + LogEvent enum (per-ingest-run ndjson log)
  • bef0c98 — feat(wire): PdfOcrProgress.Finished + ingest_progress.v1 additive 4 fields (image_byte_size, image_width, image_height, failure_reason)
  • f9dc0f7 - feat(app): wire IngestLogWriter into 5 ingest emit hooks (Arc<Mutex> sync)
  • 415227b — test(app): ingest_log_smoke integration test (AC-9)
  • 445b096 — fix(test): clippy + fmt fixes for logging_roundtrip and ingest_log_smoke
  • 6850077 — style: cargo fmt --all (round 4 follow-up) + spec/plan/dogfood doc artifacts

Sweet-spot analysis (R5 dogfood log evidence)

The summary record from {state_dir}/logs/ingest-{run_id}.ndjson:

{"kind":"summary","run_id":"20260528T042332Z-b9352652","scanned":11,"new":11,"errors":0,
 "ocr_pages":23,"ocr_failures":3,"ocr_p50_ms":1528,"ocr_p90_ms":33362,"ocr_max_ms":61914,
 "duration_ms":653658}
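
| metric | value | interpretation |
|--------|------:|----------------|
| p50 | 1.5s | most pages (ads / sparse text) |
| p90 | 33s | dense articles (10%) |
| max (normal) | 62s | dense Hanja + Korean (page 9 / 15) |
| timeout (180s) | 180001ms / 180007ms | page 8 / 13: dense newspaper articles (4 columns + multiple images) |

The 180s default is about 5.5× the p90, which is a sufficient buffer. After measuring p90 in each future dogfood run, the default can be reduced gradually (e.g., 120s = p90×3.6, 90s = p90×2.7).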
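
The figures in this analysis can be reproduced with a small helper like the one below. It is only a sketch: the nearest-rank percentile method and both function names are assumptions, since the description does not say how ocr_p50_ms / ocr_p90_ms are computed.

```rust
/// Nearest-rank percentile over millisecond samples (assumed method).
/// Returns None for an empty slice or a percentile outside 1..=100.
fn percentile_ms(samples: &[u64], pct: u32) -> Option<u64> {
    if samples.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // 1-based nearest rank: ceil(pct / 100 * n).
    let rank = (pct as usize * sorted.len() + 99) / 100;
    Some(sorted[rank - 1])
}

/// How many times larger a candidate timeout is than the measured p90.
/// A p90 of 0 ms yields infinity; callers should guard against that.
fn timeout_multiple(timeout_secs: u64, p90_ms: u64) -> f64 {
    (timeout_secs * 1000) as f64 / p90_ms as f64
}
```

Fed with the summary's ocr_p90_ms of 33362, `timeout_multiple(120, 33_362)` comes out at roughly 3.6, in line with the 120s example above.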

Wire schema additive minor (Bug #13 sibling)

4 optional fields added to the pdf_ocr_finished event in docs/wire-schema/v1/ingest_progress.schema.json:

  • image_byte_size: u64 — raster image byte size.
  • image_width: u32 — raster width (future).
  • image_height: u32 — raster height (future).
  • failure_reason: string — "timeout" / "ocr_error" / null.

Backward compatible: old consumers can ignore these fields.

Log feature surface

  • [logging] ingest_log_enabled = true default - the log file is created automatically.
  • [logging] ingest_log_dir = "{state_dir}/logs" default — XDG state dir.
  • per-ingest-run filename = ingest-{ISO8601}Z-{hex8}.ndjson.
  • record kinds: ocr, parse_error, skip, error, summary.

Users can analyze the log with cat log | jq. A SQLite mirror or a query CLI is a future enhancement (out of scope).
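
A per-run ndjson log shared by several emit hooks could be structured along these lines. This is a sketch under assumptions, not the actual IngestLogWriter: the `NdjsonLog` type, its methods, and the `page` field in the usage example are hypothetical; the record kinds, the filename pattern, and the Arc<Mutex> sharing come from the description.

```rust
use std::io::Write;
use std::sync::{Arc, Mutex};

/// Record kinds listed in the description.
#[derive(Clone, Copy)]
enum RecordKind { Ocr, ParseError, Skip, Error, Summary }

impl RecordKind {
    fn as_str(self) -> &'static str {
        match self {
            RecordKind::Ocr => "ocr",
            RecordKind::ParseError => "parse_error",
            RecordKind::Skip => "skip",
            RecordKind::Error => "error",
            RecordKind::Summary => "summary",
        }
    }
}

/// Builds the per-run file name from a run id such as "20260528T042332Z-b9352652".
fn ingest_log_filename(run_id: &str) -> String {
    format!("ingest-{run_id}.ndjson")
}

/// Hypothetical writer handle: clones share one sink behind Arc<Mutex<_>>.
struct NdjsonLog<W: Write> {
    sink: Arc<Mutex<W>>,
}

impl<W: Write> Clone for NdjsonLog<W> {
    fn clone(&self) -> Self {
        Self { sink: Arc::clone(&self.sink) }
    }
}

impl<W: Write> NdjsonLog<W> {
    fn new(sink: W) -> Self {
        Self { sink: Arc::new(Mutex::new(sink)) }
    }

    /// Appends one line. `fields` must already be valid JSON members
    /// (for example "\"page\":8"); no escaping is performed here.
    fn append(&self, kind: RecordKind, fields: &str) -> std::io::Result<()> {
        let mut sink = self.sink.lock().expect("log mutex poisoned");
        if fields.is_empty() {
            writeln!(sink, "{{\"kind\":\"{}\"}}", kind.as_str())
        } else {
            writeln!(sink, "{{\"kind\":\"{}\",{}}}", kind.as_str(), fields)
        }
    }
}
```

Each hook holds its own clone and calls `append`; because every record is a single line, the resulting file can be filtered by kind with jq.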

altair823 added 15 commits 2026-05-27 11:04:38 +00:00
Step 1 (Group A) of v0.20.0 sub-item 1 (scanned PDF OCR) implementation plan.

A1 — spec §4.2 line 740 prose pseudo-code fix: `app.pdf_ocr_engine.as_ref()`
→ local `pdf_ocr_engine: Option<OllamaVisionOcr>` built in
`ingest_with_config_opts` (consistent with the §4.4 eager init; no App field introduced).

A2 - Cargo.toml dep invariant verified (image crate not introduced, so the H-3
DCTDecode-only v1 invariant is preserved; kebab-parse-pdf + kebab-parse-image are
already deps of kebab-app). The description update is deferred to Step 3 (after
the module is added).

A3 - cargo tree baseline captured as the ground truth for K5 rows #9/#10
(.omc/state/pdf-ocr-{app-parse,parse-pdf}-deps.baseline.txt). It is the verifier's
baseline for the invariant that the other steps of this sub-item make zero
dep-graph changes.
Note: .omc/ is covered by .gitignore, so the baseline files exist only locally.

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (round 1c ACCEPT)
contract: §9 (additive minor wire bump, in a later step)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 2 (Group B) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

B1 — lopdf /Filter probe (Python re + shell grep on synthesized fixtures,
result appended to docs/superpowers/poc/2026-05-27-pdf-ocr-engine-comparison.md).

Key findings:
- reportlab default (useA85=1) yields /Filter [ /ASCII85Decode /DCTDecode ];
  useA85=0 gives pure /Filter [ /DCTDecode ] with JPEG magic ffd8ffe0.
- Pillow RGB.save('.pdf','PDF') uses DCTDecode — F6 FlateDecode requires
  manual PDF construction via zlib.compress.
- ghostscript pdfwrite rejects TIFF input (/undefined in II*) —
  ImageMagick `convert -compress Group4` used for F7 CCITTFax.

B2 - 5 fixtures synthesized and committed under crates/kebab-parse-pdf/tests/fixtures/:
- F1 scanned_page1.pdf - /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page1-clean.png, Korean).
- F2 scanned_page2.pdf - /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page2-clean.png, batchim, i.e. final consonants).
- F4 mojibake.pdf       — DejaVu TTF + ToUnicode CMap stripped (count=0);
                          Noto CJK TTC has PostScript outlines unsupported by reportlab.
- F6 flate_raw.pdf      — /Filter /FlateDecode, DCTDecode absent (skip path input).
- F7 ccitt.pdf          — /Filter [ /CCITTFaxDecode ], DCTDecode absent (skip path input).

Synth scripts under tests/fixtures/_synth/:
- scanned_pdf.py    — F1/F2 reportlab drawImage + JPEG passthrough (useA85=0).
- mojibake.py       — F4 reportlab DejaVu TTF + ToUnicode strip.
- flate_ccittfax.sh — F6 manual zlib PDF + F7 Pillow TIFF group4 + ImageMagick convert.

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§5.1)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 2 B1+B2)
contract: §9 (additive minor wire bump, in a later step)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The PoC doc append in aeeff36 (engine-comparison.md L134, L141) recorded the
conversion engine for F7 (`ccitt.pdf`) as "ImageMagick `convert -compress Group4`",
but tests/fixtures/_synth/flate_ccittfax.sh:77-83 actually uses the
`gs -sDEVICE=pdfwrite -dMonoImageFilter=/CCITTFaxEncode -dEncodeMonoImages=true`
flags (ImageMagick `convert` is never invoked).

The fixture binary (`/Filter [ /CCITTFaxDecode ]`, 2060 bytes) satisfies the invariant
(verified by the Step 2 spec compliance + code quality reviews). This is a factual
correction of the historical record only.

review trail:
- impl result: .omc/reviews/2026-05-27-pdf-ocr-step-02-impl-result.md
- spec review: .omc/reviews/2026-05-27-pdf-ocr-step-02-spec-review-result.md
- code review: .omc/reviews/2026-05-27-pdf-ocr-step-02-code-review-result.md (I1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 3 (Group C) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

C1 — `page_image::extract_dctdecode_page_image(pdf_doc, page_num)` ->
Result<Option<Vec<u8>>>. Traverses Resources/XObject via lopdf and inspects the
/Filter of the first image XObject (covers both the single Name and the Array
form, spec §4.1 line 642-664); returns the raw bytes when the DCTDecode + JPEG
magic checks pass. Returns Ok(None) for any other encoding or when no image
XObject exists. v1 scope = DCTDecode passthrough only (H-3 invariant, image
crate not introduced).
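
The filter and magic checks described in C1 can be sketched without lopdf as below. The `Filter` enum and the three function names are hypothetical stand-ins for the values read from the PDF object, and treating only a sole DCTDecode filter as eligible is an assumption about how the Array form is handled.

```rust
/// The two shapes a PDF /Filter entry can take (names without the leading slash).
enum Filter<'a> {
    Name(&'a str),
    Array(&'a [&'a str]),
}

/// True only when DCTDecode is the stream's sole filter (assumption:
/// chains such as [ASCII85Decode, DCTDecode] are not passed through).
fn is_pure_dctdecode(filter: &Filter) -> bool {
    match filter {
        Filter::Name(name) => *name == "DCTDecode",
        Filter::Array(names) => names.len() == 1 && names[0] == "DCTDecode",
    }
}

/// JPEG data starts with the SOI marker FF D8 followed by another FF marker byte.
fn has_jpeg_magic(bytes: &[u8]) -> bool {
    bytes.starts_with(&[0xFF, 0xD8, 0xFF])
}

/// Returns the stream bytes untouched when both checks pass, otherwise None.
fn passthrough_jpeg(filter: &Filter, stream: &[u8]) -> Option<Vec<u8>> {
    if is_pure_dctdecode(filter) && has_jpeg_magic(stream) {
        Some(stream.to_vec())
    } else {
        None
    }
}
```

Returning None rather than an error for other encodings lets the caller treat FlateDecode and CCITTFaxDecode pages as a skip instead of a failure.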

Integration test (`tests/page_image.rs`, 2 test):
- f1_fixture_yields_dctdecode_jpeg_bytes — F1 fixture happy path.
- flate_raw_fixture_yields_none — F6 fixture negative path.

C2 — `text_quality::compute_valid_char_ratio(s) -> f32`. valid char =
ASCII printable + Hangul (Jamo/Compatibility/Syllables) + CJK + Latin
Extended + common Korean punctuation. An empty string yields 0.0. The caller
(`kebab-app::pdf_ocr_apply`) compares the ratio against the threshold (default 0.5).

Unit test (`mod tests`, 7 + F4 conditional):
- empty / pure ASCII / pure Hangul / pure PUA / mixed half / CJK / Hangul Jamo.
- f4_fixture_ratio_under_threshold: active (case A: lopdf extract_text returns
  an empty string when the ToUnicode CMap is absent → valid_ratio = 0.0000 < 0.3).

Also: Cargo.toml description updated ("Text PDF extractor + scanned-page
image extract helpers ...", deferred from Step 1 A2).

fixture fix: mojibake.pdf startxref 22130 → 22114 (corrects a 16-byte offset
error that kept lopdf's strict parser from finding the xref).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.1 line 600-722)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 3 C1+C2)
prior: aeeff36 (Step 2 fixtures) + fb3952d (Step 2 F7 record fix)
contract: §9 (additive minor wire bump, in a later step)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two workspace clippy pedantic violations remained in c2cd3a7's
`extract_dctdecode_page_image` → the CI gate (cargo clippy --workspace --all-targets -- -D warnings) failed.
Both lints are 1-line edits with identical semantics and no logic change.

- L20  uninlined_format_args: format!("page {} not in get_pages()", page_num)
                              → format!("page {page_num} not in get_pages()")
- L48-52 map_unwrap_or:        .map(|n| n == b"Image").unwrap_or(false)
                              → .is_some_and(|n| n == b"Image")

cargo clippy --workspace --all-targets -j 4 -- -D warnings → exit 0.
cargo test -p kebab-parse-pdf -j 4 → 21 passed (regression 0).

review trail:
- spec review: .omc/reviews/2026-05-27-pdf-ocr-step-03-spec-review-result.md (SPEC_COMPLIANT)
- code review: .omc/reviews/2026-05-27-pdf-ocr-step-03-code-review-result.md (Critical C-1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 4 (Group D) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

D1 — `apply_ocr_to_pdf_pages(&mut canonical, &dyn OcrEngine, &bytes, &opts, emit_progress)`
in `kebab-app::pdf_ocr_apply`. Body taken as-is from spec §4.1 line 381-599, plus
the PdfOcrOpts.cancel field and a per-page cancel check (verifier LOW L-1).

post-extract enrichment pattern (H-1 resolution): kebab-parse-pdf does not
import kebab-parse-image::OcrEngine (parser isolation preserved).
The helper lives inside the kebab-app facade, which avoids a cross-import between the two parser crates.

Per-page decision matrix (spec §4.1 line 459-464):
- always_on=true → OCR every page (dual-block, ordinal = page-1 + page_count).
- always_on=false + needs_ocr → in-place OCR (mutates the text-detect block).
- needs_ocr=false → skip.
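
The decision matrix above maps directly onto a small pure function. The sketch below is illustrative only: `PageDecision` and `decide` are hypothetical names, and `page` is assumed to be 1-based.

```rust
/// Outcome of the per-page decision (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum PageDecision {
    /// always_on: keep the text block and push an extra OCR block.
    DualBlock { ordinal: u32 },
    /// Overwrite the text-detect block with the OCR result.
    InPlace,
    /// Leave the page as extracted.
    Skip,
}

/// `page` is assumed to be 1-based; `needs_ocr` is computed by the caller.
fn decide(always_on: bool, needs_ocr: bool, page: u32, page_count: u32) -> PageDecision {
    if always_on {
        // ordinal = page-1 + page_count keeps OCR ordinals after all text ordinals.
        PageDecision::DualBlock { ordinal: page.saturating_sub(1) + page_count }
    } else if needs_ocr {
        PageDecision::InPlace
    } else {
        PageDecision::Skip
    }
}
```

For a 3-page PDF with always_on set, the OCR blocks get ordinals 3, 4 and 5, which cannot collide with text ordinals in the range 0 to 2 (assuming text blocks use page-1).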

DCTDecode-only v1 (H-3): for FlateDecode / CCITTFaxDecode pages,
extract_dctdecode_page_image returns None → Warning event + skip + emit_progress(skipped=true).

OcrEngine.recognize failure → Warning event + skip + emit_progress(skipped=true).

D3 — per-page cancel handle (verifier LOW L-1 + spec §4.8 line 1159):
PdfOcrOpts.cancel: Option<Arc<AtomicBool>>. Once the flag is set to true, the helper
aborts with `anyhow::bail!("PDF OCR cancelled mid-PDF at page N")`.
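
A per-page cancel check of this kind can be sketched with the standard library alone. The function below is a hypothetical stand-in for the real helper: it returns a `String` error instead of using anyhow, takes the OCR step as a closure, and assumes the flag is checked before each page.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Runs `ocr_page` for pages 1..=page_count, checking the cancel flag
/// before each page. Returns the number of pages processed.
fn ocr_pages_with_cancel(
    page_count: u32,
    cancel: Option<&Arc<AtomicBool>>,
    mut ocr_page: impl FnMut(u32),
) -> Result<u32, String> {
    let mut done = 0;
    for page in 1..=page_count {
        // A missing handle means the run cannot be cancelled.
        if cancel.is_some_and(|flag| flag.load(Ordering::Relaxed)) {
            return Err(format!("PDF OCR cancelled mid-PDF at page {page}"));
        }
        ocr_page(page);
        done += 1;
    }
    Ok(done)
}
```

Checking once per page bounds the cancel latency to one OCR request, which is why the request timeout matters for responsiveness as well as throughput.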

lopdf = "0.32" added to [dependencies] (already transitive via kebab-parse-pdf;
no new crate introduced — dep graph kebab-parse-* baseline unchanged).

Integration test (`tests/pdf_ocr_apply.rs`, 10 test):
- f1_input_with_ocr_enabled_replaces_empty_block — in-place mutate.
- f3_input_with_ocr_enabled_keeps_text_detect_blocks — vector PDF skip.
- f1_input_with_ocr_disabled_keeps_empty_block — disabled no-op.
- f4_input_with_ocr_enabled_replaces_mojibake_block — mojibake → in-place mutate.
- f3_input_with_always_on_pushes_dual_blocks — always_on dual-block.
- f6_flatedecode_skipped_with_warning — FlateDecode skip + Warning event.
- f7_ccittfax_skipped_with_warning — CCITTFax skip + Warning event (verifier M-4 split).
- ocr_engine_failure_surfaces_as_warning — OCR failure → Warning event.
- dual_block_ordinals_are_deterministic_and_unique — ordinal invariant.
- cancel_handle_aborts_mid_pdf - production source of the cancel handle (D3).

MockOcrEngine fixture: spec §5.5 line 1284-1299. No F3 fixture exists →
mock CanonicalDocument construction + F1 bytes reuse pattern (Option B:
the canonical document is generated through the real production path via PdfTextExtractor::extract).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.1 + §5.5)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 4 D1+D2+D3)
prior: c2cd3a7 (Step 3) + 8d81bc1 (Step 3 clippy fix)
contract: §9 (additive minor wire bump, in a later step)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Step 5 (Group F) of v0.20.0 sub-item 1 (scanned PDF OCR) plan, bundled with the
Step 4 reviewer Important I-1 fix (PdfOcrOpts field doc).

F1 — `kebab-config::PdfCfg` + `PdfOcrCfg` + 4 default fn:
- PdfCfg { ocr: PdfOcrCfg }.
- PdfOcrCfg with 11 field (enabled/always_on/engine/model/endpoint/
  languages/max_pixels/request_timeout_secs/valid_ratio_threshold/
  min_char_count/lang_hint).
- defaults: opt-in (enabled=false), qwen2.5vl:3b, 0.5 threshold, 20 char.
- mirror of image OCR cfg pattern (spec §4.5).

Config struct extension:
- `pdf: PdfCfg` field with `#[serde(default = "PdfCfg::defaults")]`.

11 env var override (parallel to KEBAB_IMAGE_OCR_*):
KEBAB_PDF_OCR_{ENABLED,ALWAYS_ON,ENGINE,MODEL,ENDPOINT,LANGUAGES,
MAX_PIXELS,REQUEST_TIMEOUT_SECS,VALID_RATIO_THRESHOLD,MIN_CHAR_COUNT,
LANG_HINT}.
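
Env overrides of this kind can be layered over the defaults as in the sketch below. `PdfOcrSettings` is a hypothetical four-field subset of PdfOcrCfg, the lookup closure stands in for `std::env::var`, and silently ignoring unparsable values is an assumption, not documented behaviour.

```rust
/// Hypothetical subset of the PDF OCR config, with the defaults listed above.
#[derive(Debug, Clone, PartialEq)]
struct PdfOcrSettings {
    enabled: bool,
    model: String,
    valid_ratio_threshold: f32,
    min_char_count: usize,
}

impl Default for PdfOcrSettings {
    fn default() -> Self {
        Self {
            enabled: false, // opt-in
            model: "qwen2.5vl:3b".to_string(),
            valid_ratio_threshold: 0.5,
            min_char_count: 20,
        }
    }
}

/// Applies KEBAB_PDF_OCR_* overrides. `lookup` abstracts the environment so
/// the function can be exercised without touching process-wide state.
fn apply_env_overrides(
    mut cfg: PdfOcrSettings,
    lookup: impl Fn(&str) -> Option<String>,
) -> PdfOcrSettings {
    if let Some(v) = lookup("KEBAB_PDF_OCR_ENABLED").and_then(|s| s.parse::<bool>().ok()) {
        cfg.enabled = v;
    }
    if let Some(v) = lookup("KEBAB_PDF_OCR_MODEL") {
        cfg.model = v;
    }
    if let Some(v) =
        lookup("KEBAB_PDF_OCR_VALID_RATIO_THRESHOLD").and_then(|s| s.parse::<f32>().ok())
    {
        cfg.valid_ratio_threshold = v;
    }
    if let Some(v) = lookup("KEBAB_PDF_OCR_MIN_CHAR_COUNT").and_then(|s| s.parse::<usize>().ok()) {
        cfg.min_char_count = v;
    }
    cfg
}
```

In production code the call would be `apply_env_overrides(PdfOcrSettings::default(), |k| std::env::var(k).ok())`; keys that are unset leave the defaults in place.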

F2 - `crates/kebab-config/tests/pdf_ocr.rs` (new):
- toml roundtrip (11 field).
- defaults (opt-in + qwen2.5vl:3b).
- env override (4 key sample + default preservation).

F3 (Step 4 I-1) - doc comments for 4 public items in `pdf_ocr_apply.rs`:
- PdfOcrOpts struct + 6 field.
- PdfOcrSummary struct + 2 field.
- apply_ocr_to_pdf_pages fn (including an Errors block).
- PdfOcrProgress enum + 2 variant + 5 field.
No body changes; doc-only.

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.5)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 5 F1+F2)
prior: 9f003ef (Step 4) — code reviewer Important I-1 resolution
contract: §9 (additive minor wire bump — Step 7)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In fd918a6's F2 test file (crates/kebab-config/tests/pdf_ocr.rs), 4 lines of
`assert_eq!(bool_field, true|false)` violate the workspace clippy pedantic
lint `bool_assert_comparison` → the CI gate
`cargo clippy --workspace --all-targets -- -D warnings` exits 1.

Applied the canonical form of each assertion:
- assert_eq!(x, false) → assert!(!x)
- assert_eq!(x, true)  → assert!(x)

Same semantics and behavior; 4-line edit, no logic change.

review trail:
- impl result: .omc/reviews/2026-05-27-pdf-ocr-step-05-impl-result.md
- spec review: .omc/reviews/2026-05-27-pdf-ocr-step-05-spec-review-result.md (I-1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 6 (Group E) of v0.20.0 sub-item 1 (scanned PDF OCR) plan +
Step 7 spillover (IngestEvent variant + IngestItem field for compile
boundary) + Step 4 reviewer Minor M-4 fix.

E1 — eager PDF OCR engine build at `ingest_with_config_opts` entry,
mirror of image OCR pattern (lib.rs:338-347). When `pdf.ocr.enabled ||
always_on` holds, `OllamaVisionOcr::from_parts(endpoint, model, ...)` is called
with a fail-fast `?`. No App field added (local var only, consistent with
spec L-1 / the Step 1 A1 cosmetic fix).

E2 — `ingest_one_pdf_asset` signature extension: +3 param
(`pdf_ocr_engine: Option<&OllamaVisionOcr>`, `progress: Option<&
mpsc::Sender<IngestEvent>>`, `cancel: Option<&Arc<AtomicBool>>`).
`ingest_one_asset` dispatch wrapper + caller (dispatch loop) update.

E3 - post-extract enrichment block right after `extract_for` (line 1779).
When `pdf.ocr.enabled || always_on` holds, `apply_ocr_to_pdf_pages` is called,
PdfOcrProgress is emitted as IngestEvent (PdfOcrStarted / PdfOcrFinished
with ocr_engine), and the summary's pages_ocrd/ms_total are carried into
IngestItem fields. The PR #187 registry dispatch invariant is preserved
(`extract_for(&asset.media_type, ...)` unchanged).

E4 — cancel handle propagation: ingest_with_config_cancellable →
IngestOpts.cancel → ingest_with_config_opts → ingest_one_asset →
ingest_one_pdf_asset (new `cancel` param) → PdfOcrOpts.cancel chain.
spec §4.8 line 1159 production wiring.

Step 7 spillover (compile boundary):
- `kebab_app::ingest_progress::IngestEvent`: PdfOcrStarted { page } +
  PdfOcrFinished { page, ms, chars, ocr_engine }. serde discriminant
  `pdf_ocr_started` / `pdf_ocr_finished` (Step 7 G3 wire schema 와 일치).
- `kebab_core::IngestItem`: pdf_ocr_pages: Option<u32> +
  pdf_ocr_ms_total: Option<u64> (between warnings and error). The 11 non-PDF
  IngestItem construction sites fill in `None`.
- `kebab-cli/src/progress.rs` + `kebab-tui/src/ingest_progress.rs`:
  no-op handlers for the new variants (per-page progress is not exposed in v1;
  it can be enabled in a future refinement).
- `kebab-store-sqlite/tests/ingest_report_snapshot.rs` + snapshot
  `ingest_report.snapshot.json`: new fields added to the 2 IngestItem fixtures.
- The Step 7 JSON Schema update + CLI printer activation + snapshot
  regeneration land in separate commits (G3/H1/H2 deliverables).

M-4 (Step 4 reviewer Minor) - lopdf consolidated as a workspace dep:
- workspace `Cargo.toml [workspace.dependencies] lopdf = "0.32"`.
- direct deps in kebab-app + kebab-parse-pdf → `{ workspace = true }`.

Verifier evidence:
- workspace test (`cargo test --workspace --no-fail-fast -j 1`):
  175 test result summary lines, 0 failures, 0 FAILED.
- workspace clippy (`-D warnings`): exit 0, 0 warning.
- dep graph baseline (`.omc/state/pdf-ocr-{parse-pdf,app-parse}-deps.baseline.txt`):
  empty diff for both.

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.4 + §4.6 + §4.8)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 6 E1-E4 + Step 7 partial G1+G2)
prior: 4672cba (Step 5 fix) + fd918a6 (Step 5) + 9f003ef (Step 4 helper)
contract: §9 (additive minor wire bump, once the Step 7 JSON Schema is complete)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 7 (Group G) of v0.20.0 sub-item 1 (scanned PDF OCR) plan +
Step 6 code reviewer Important M-4 (skipped field carry) + Minor M-2
(ordering invariant doc) fix.

G3 - JSON Schema sync (additive minor, schema_version preserved):

ingest_progress.schema.json:
- 2 values added to the kind enum: pdf_ocr_started + pdf_ocr_finished.
- New fields: page (1-based PDF page), ocr_engine (engine_name), skipped (bool).
- Descriptions of the existing ms / chars fields updated (now also carried by pdf_ocr_finished).

ingest_report.schema.json:
- items.items.properties newly defined (previously only a stub ["array", "null"]).
- pdf_ocr_pages + pdf_ocr_ms_total (nullable integer).
- All existing IngestItem fields are now spelled out as well (kind, doc_path, byte_len, ...).

Step 6 reviewer M-4 (Important) — skipped field carry:
- skipped: bool added to IngestEvent::PdfOcrFinished.
- The emit closure in ingest_one_pdf_asset (lib.rs:~1864) now propagates the source
  PdfOcrProgress::Finished { skipped } instead of discarding it.

Step 6 reviewer M-2 (Minor) — ordering invariant doc:
- ordering text updated in crates/kebab-app/src/ingest_progress.rs:
  ScanStarted < ScanCompleted < (AssetStarted [< (PdfOcrStarted <
  PdfOcrFinished)*] < AssetFinished)* < (Completed | Aborted).
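
The ordering invariant can be checked mechanically with a small state machine. The sketch below is illustrative: the `Event` enum drops the payloads of the real IngestEvent, and it follows the grammar literally, so Completed or Aborted is only accepted between assets.

```rust
/// Payload-free stand-ins for the IngestEvent variants named in the ordering text.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Event {
    ScanStarted,
    ScanCompleted,
    AssetStarted,
    PdfOcrStarted,
    PdfOcrFinished,
    AssetFinished,
    Completed,
    Aborted,
}

#[derive(Clone, Copy, PartialEq)]
enum State { Init, Scanning, BetweenAssets, InAsset, InOcrPage, Done }

/// Returns true when `events` is a complete sequence matching
/// ScanStarted < ScanCompleted < (AssetStarted [< (PdfOcrStarted <
/// PdfOcrFinished)*] < AssetFinished)* < (Completed | Aborted).
fn follows_ordering(events: &[Event]) -> bool {
    use Event::*;
    let mut state = State::Init;
    for &event in events {
        state = match (state, event) {
            (State::Init, ScanStarted) => State::Scanning,
            (State::Scanning, ScanCompleted) => State::BetweenAssets,
            (State::BetweenAssets, AssetStarted) => State::InAsset,
            (State::InAsset, PdfOcrStarted) => State::InOcrPage,
            (State::InOcrPage, PdfOcrFinished) => State::InAsset,
            (State::InAsset, AssetFinished) => State::BetweenAssets,
            (State::BetweenAssets, Completed | Aborted) => State::Done,
            // Any other transition violates the documented ordering.
            _ => return false,
        };
    }
    state == State::Done
}
```

A checker like this could back a test that replays the events collected from a progress channel.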

No .md docs (docs/wire-schema/v1/*.md) exist, so the .md deliverable of plan §3
Step 7 G3 is retroactively N/A (0 such files).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 7 G3)
prior: b9ee09f (Step 6 wiring) + Step 6 reviewer M-4/M-2 recommendations
contract: §9 (additive minor wire bump, schema_version preserved)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Step 8 (Group H) of v0.20.0 sub-item 1 (scanned PDF OCR) plan +
Step 7 reviewer concern fix (spec literal deviation).

H1: kebab-cli/src/progress.rs printer activation:
- the old no-op stub `IngestEvent::PdfOcr* { .. } => {}` (Step 6 placeholder)
  becomes a human-friendly stderr line printer.
- wording follows spec §4.6.1 lines 1085-1086 verbatim:
  - PdfOcrStarted → `  📷 OCR page {page}...`
  - PdfOcrFinished (skipped=false) → `  ✓ OCR page {page} ({chars} chars, {ms}ms via {ocr_engine})`
  - PdfOcrFinished (skipped=true)  → `  ⊘ OCR page {page} skipped (no DCTDecode or engine fail, {ms}ms)` (uses the skipped field carried by M-4)
- consistent with the `!quiet` gate (mirrors the AssetStarted/Finished pattern).
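
As a rough illustration of the printer described above, here is a minimal sketch. `OcrEvent` and `render_line` are hypothetical stand-ins (the real `IngestEvent` has more variants and fields, and the real printer writes to stderr rather than returning a string); only the three line formats are taken from the text.

```rust
/// Hypothetical mirror of the two OCR progress events.
pub enum OcrEvent {
    Started { page: u32 },
    Finished { page: u32, chars: u64, ms: u64, ocr_engine: String, skipped: bool },
}

/// Renders the human-readable line for one event, or `None` when `quiet` is set.
pub fn render_line(event: &OcrEvent, quiet: bool) -> Option<String> {
    if quiet {
        return None;
    }
    Some(match event {
        OcrEvent::Started { page } => format!("  📷 OCR page {page}..."),
        // Skipped pages report only the elapsed time.
        OcrEvent::Finished { page, ms, skipped: true, .. } => {
            format!("  ⊘ OCR page {page} skipped (no DCTDecode or engine fail, {ms}ms)")
        }
        OcrEvent::Finished { page, chars, ms, ocr_engine, skipped: false } => {
            format!("  ✓ OCR page {page} ({chars} chars, {ms}ms via {ocr_engine})")
        }
    })
}
```

Returning `Option<String>` keeps the `!quiet` gate and the formatting in one testable function; the caller would pass the result to `eprintln!`.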

H2: new test in crates/kebab-app/tests/ingest_progress.rs:
- pdf_ocr_progress_emits_started_finished_events (depends on a real Ollama, `#[ignore]`).
- verifies that ingesting the F1 fixture (scanned_page1.pdf) emits pdf_ocr_started +
  pdf_ocr_finished events. Invariant: Started count == Finished count.
- Manual invoke: `KEBAB_PDF_OCR_ENABLED=true cargo test -p kebab-app --test
  ingest_progress -- --ignored`.
- no mock OcrEngine injection path exists (Step 6 builds the engine eagerly), so this follows
  the same pattern as Step 9 I5's ocr_e2e (real Ollama + `#[ignore]`).

Step 7 reviewer concern fix: spec §4.6.1 literals:
- the `ocr_ms` / `ocr_chars` literals on lines 1076-1077 are updated to the wire schema's
  actual field names `ms` / `chars` (option_A, consistent with Rust serde).
- printer wording on line 1087 likewise: `{ocr_chars}` / `{ocr_ms}` → `{chars}` / `{ms}`.
- rationale reference on line 1556: `pdf_ocr_finished.ocr_ms` → `.ms`.
- the `skipped` field is now documented too (result of Step 6 reviewer M-4).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.6.1)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 8 H1+H2)
prior: 4c5ccd5 (Step 7 wire schema); this fixes Step 7 reviewer concern 1
contract: §9 (additive minor wire bump, completed in the Step 7 commit)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Step 9 (Group I) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

I3: crates/kebab-app/tests/ingest_pdf_ocr_smoke.rs (new):
- ingest_with_mock_ocr_yields_pdf_ocr_summary: `#[ignore]` real Ollama,
  ingest_with_config production path + IngestItem.pdf_ocr_pages verify.
- ocr_text_indexed_and_searchable: `#[ignore]` real Ollama, verifies via app.search
  that the OCR text is indexed (§ Acceptance #2).
- ingest_with_cancel_aborts_mid_pdf: production cancel chain (pre-set
  cancel=true + dummy endpoint, no panic/deadlock verify).

I4: crates/kebab-parse-pdf/tests/text_extractor_regression.rs (new):
- vector_pdf_extract_byte_identical_to_baseline: the canonical output of the vector
  PDF path for F4 mojibake.pdf stays byte-identical (invariant before and after all Step 1-8 changes).
- new baseline = tests/snapshots/vector_pdf_canonical.json (created on first run).
- normalize_provenance_timestamps inline helper (R-3 mitigation; no such helper exists
  anywhere in the workspace, so a new 12-line one is added).

I5: crates/kebab-parse-pdf/tests/ocr_e2e.rs (new):
- f1_alnum_accuracy_ge_85 / f2_alnum_accuracy_ge_70: `#[ignore]` real
  Ollama qwen2.5vl:3b, implementing § Acceptance §9 #3.
- alnum metric = strsim::levenshtein (added as a dev-dep).
- truth file copy from PoC scratch (page1.txt + page2-batchim.txt) →
  scanned_page1_truth.txt + scanned_page2_truth.txt.
- kebab-parse-image added as a dev-dep (needed to call OllamaVisionOcr::from_parts).
  This is a dev-dep exception to the parser isolation invariant (spec §3.1; the dep graph
  baseline -e normal is preserved).
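
The exact formula behind the alnum metric is not spelled out above. The sketch below assumes one plausible definition: filter both strings to alphanumeric characters, then score `1 - levenshtein / truth_len`. The Levenshtein routine is hand-written here to stay dependency-free (the test suite itself uses `strsim::levenshtein`), and `alnum_accuracy` is a hypothetical name.

```rust
/// Keeps only alphanumeric characters (Unicode-aware, so Hangul syllables survive).
fn alnum_only(text: &str) -> Vec<char> {
    text.chars().filter(|c| c.is_alphanumeric()).collect()
}

/// Classic two-row Levenshtein distance over chars.
fn levenshtein(a: &[char], b: &[char]) -> usize {
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0usize; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            curr[j + 1] = substitution.min(prev[j + 1] + 1).min(curr[j] + 1);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Assumed metric: 1 - distance / truth length, clamped at 0.
pub fn alnum_accuracy(ocr: &str, truth: &str) -> f64 {
    let (ocr, truth) = (alnum_only(ocr), alnum_only(truth));
    if truth.is_empty() {
        return if ocr.is_empty() { 1.0 } else { 0.0 };
    }
    let distance = levenshtein(&ocr, &truth) as f64;
    (1.0 - distance / truth.len() as f64).max(0.0)
}
```

Under this definition, punctuation and whitespace differences between the OCR output and the truth file do not count against the score.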

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 9 I3+I4+I5)
prior: c9e0594 (Step 8 CLI printer)
contract: §9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cargo.lock cascade for the dev-deps added in Step 9 (commit 4819768): strsim 0.11 +
kebab-parse-image (path). The worker did not include the lock file in that commit, so this
follow-up commit syncs it.

No impact on the dep graph baseline (-e normal); only dev-deps were added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 10 (Group J) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

J0: release notes path decision: commit body (no RELEASE_NOTES.md /
docs/RELEASE_NOTES_*.md exists; follows the v0.17.x/v0.18.0 pattern of putting
release notes in the commit body). Inlined in the Step 11 K1 commit body.

J1: README.md:
- add `[pdf.ocr]` to the toml table list in the Configuration section.
- new sub-section `### [pdf.ocr] — scanned PDF OCR (v0.20.0+)`: 11-field
  toml example + 11 `KEBAB_PDF_OCR_*` env overrides + force-reingest UX
  ("scanned PDFs indexed under v0.19 do not get OCR automatically after upgrading
  to v0.20; `kebab ingest --force` is required").

J2: HANDOFF.md:
- phase status P7 row extended: 3/3 component + post-extract OCR enrichment
  (v0.20.0 sub-item 1, qwen2.5vl:3b vision LLM).
- "머지 후 발견된 결정" (decisions discovered after merge) entry: design + scope of
  v0.20 sub-item 1 (H-1 post-extract pattern + DCTDecode-only v1 + parser_version preserved + H-4 UX).

J3: docs/ARCHITECTURE.md:
- OCR row split: `OCR (image)` (gemma4:e4b, unchanged) + `OCR (PDF, v0.20.0+)`
  (qwen2.5vl:3b, post-extract enrichment via kebab-app::pdf_ocr_apply,
  DCTDecode-only v1, family asymmetry: PoC alnum 94.79% vs gemma4 27%).
- PDF parser row extended: page_image::extract_dctdecode_page_image (v0.20.0) +
  parser_version "pdf-text-v1" preserved + differentiated provenance events.

J3: docs/SMOKE.md:
- isolated `[pdf.ocr]` config example (enabled=true, model=qwen2.5vl:3b).
- new dogfood section `### v0.20 force-reingest (scanned PDF OCR)`:
  explicit `kebab ingest --force` invocation for the v0.19 → v0.20 upgrade path.

J4: release notes draft (source for the Step 11 K1 commit body):
- recorded in the result file (4 topics: opt-in, force-reingest, DCTDecode-only,
  family asymmetry; plus dogfood/test results).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§6.4)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 10 J0-J4)
prior: 1d4e301 (Step 9 + Cargo.lock follow-up)
contract: §9

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# v0.20.0 — scanned PDF OCR via Ollama vision LLM

The core change in v0.20.0 is OCR ingest of scanned PDFs with no embedded text
(book scans, receipts, camera pages). In the PoC comparison of 5 engines (Tesseract /
EasyOCR / PaddleOCR / gemma4:e4b / qwen2.5vl:3b), qwen2.5vl:3b scored alnum 94.79%
(page1) / 81.56% (batchim page), beating every other engine, so it is the default
vision OCR for this release.

## 1. OCR opt-in usage

Enable with `enabled = true` in the `[pdf.ocr]` config or the `KEBAB_PDF_OCR_ENABLED=true`
env variable. Off by default: at 45-100s per OCR page (qwen2.5vl:3b on CPU, remote Ollama),
the cost is a poor fit for KBs that do not need OCR (anything other than book archives).

```toml
[pdf.ocr]
enabled = true
model = "qwen2.5vl:3b"
# see README for the other defaults
```

Pulling qwen2.5vl:3b with Ollama:

```bash
ollama pull qwen2.5vl:3b   # 3GB Ollama image
```

## 2. Force-reingest of scanned PDFs indexed under v0.19

A KB whose scanned PDFs were ingested with the v0.19 binary does not enter the OCR
path automatically: parser_version "pdf-text-v1" is preserved (decision H-4, which avoids
triggering the CLAUDE.md §Versioning cascade). So after upgrading to the v0.20 binary and
only setting `pdf.ocr.enabled = true`, the Unchanged path of try_skip_unchanged
skips OCR. To reprocess explicitly:

```bash
kebab ingest --root /path/to/kb --force
```

## 3. DCTDecode-only v1 scope (handling of FlateDecode / CCITTFax pages)

PDF page image extraction in v0.20.0 covers only lopdf image XObjects with /Filter ==
DCTDecode (JPEG passthrough). Other encodings (FlateDecode raw pixels,
CCITTFaxDecode bilevel, JPXDecode JPEG2000) emit a warning event and the
page is skipped.

When some pages of a scanned PDF are encoded with FlateDecode or CCITTFax:

```bash
qpdf --object-streams=disable --recompress-flate input.pdf normalized.pdf
```

The intent of v1 is the single-binary principle (no image crate introduced). Multi-filter
support will be considered in v1.1+ or a separate sub-item.

## 4. Family asymmetry (image OCR gemma4:e4b vs PDF OCR qwen2.5vl:3b)

The image OCR (P6) default stays gemma4:e4b (unchanged). Only PDF OCR (v0.20)
uses qwen2.5vl:3b. Users can unify the two by setting [image.ocr] model = "qwen2.5vl:3b",
but the defaults remain family-asymmetric.

## Dogfood + test results

- workspace test: 178 result lines, 0 failure.
- workspace clippy (-D warnings): exit 0.
- alnum e2e (real Ollama, manual invoke):
  - F1 (Korean page1): 94.79% (≥ 0.85 threshold).
  - F2 (batchim-intensive): 81.56% (≥ 0.70 threshold).
- integration smoke + vector PDF regression: pass.

## Changed surface

- new config: [pdf.ocr] (11 field) + 11 env override KEBAB_PDF_OCR_*.
- new wire: IngestEvent::PdfOcrStarted/Finished (additive minor).
- new wire: IngestItem.pdf_ocr_pages/ms_total (additive minor).
- new CLI line: "📷 OCR page N..." / "✓ OCR page N (chars chars, msms via ollama-vision)".
- new module: kebab-parse-pdf::{page_image, text_quality} + kebab-app::pdf_ocr_apply.
- dep: lopdf = "0.32" consolidated as a workspace dependency.
- fixture: 5 PDF (F1/F2/F4/F6/F7) under crates/kebab-parse-pdf/tests/fixtures/.

## Unchanged surface (invariants)

- Extractor::extract trait body byte-identical (PR #187).
- PdfTextExtractor body unchanged; OCR is separated out as a post-extract enrichment pattern.
- parser_version "pdf-text-v1" preserved.
- chunker_version "pdf-page-v1" preserved.
- no change to the production dep graph of workspace.dependencies (-e normal baseline preserved).

## Commit history of the sub-item (14 commits)

9d7faab Step 1: foundation + cargo tree baselines
aeeff36 Step 2: lopdf /Filter probe + 5 fixture commit (F1/F2/F4/F6/F7)
fb3952d Step 2 fix: F7 conversion engine record correction
c2cd3a7 Step 3: page_image + text_quality modules (10 test)
8d81bc1 Step 3 fix: clippy pedantic in page_image
9f003ef Step 4: pdf_ocr_apply helper (10 test, F7 split + cancel)
fd918a6 Step 5: [pdf.ocr] config section + PdfOcrOpts doc
4672cba Step 5 fix: clippy::bool_assert_comparison in pdf_ocr tests
b9ee09f Step 6: wire PDF OCR enrichment + cancel propagation
4c5ccd5 Step 7: wire schema additive — IngestEvent + IngestItem + skipped
c9e0594 Step 8: CLI printer activation + ingest_progress test + spec literal
4819768 Step 9: integration smoke + vector regression + alnum e2e
1d4e301 Step 9 follow-up: Cargo.lock for dev-dep additions
90726ab Step 10: docs sync (README + HANDOFF + ARCHITECTURE + SMOKE)

## § Acceptance §9 verifier evidence

All 15 rows of K5's scriptable verifier are green (or, for the manual real-Ollama rows, results are reported):
- Row #4 (vector PDF byte-identical): pass.
- Row #5 (Extractor::extract trait byte-identical): 0 line diff.
- Row #6 (wire schema additive): jq + diff exit 0.
- Row #7-#8 (clippy / workspace test): exit 0.
- Row #9-#10 (dep graph baseline -e normal): empty diff.
- Row #11 (docs sync): grep evidence.
- Row #12 (version bump): "0.20.0" + Cargo.lock cascade ≥ 22.
- Row #14 (PR #187 invariant): extract_for(&asset.media_type) ≥ 1.
- Row #15 (DCTDecode-only v1, F6/F7 skip): test green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
altair823 added 4 commits 2026-05-27 14:27:16 +00:00
Bug #2 from the v0.20.0 sub-item 1 dogfood report: `[ingest.code].max_file_bytes`
was applied uniformly to every file at the walker stage, so most PDF/image/markdown
files (256 KB+) were skipped by the walker before extraction. Fix:

- `crates/kebab-source-fs/src/code_meta.rs`: add helper `pub(crate) fn is_code_file(path)
  -> bool` (= `code_lang_for_path(path).is_some()`).
- `crates/kebab-source-fs/src/connector.rs:168-190`: the walker size-cap check now
  short-circuits as `is_code_file(&abs_path) && is_oversized(...)`. PDF/image/
  markdown bypass the walker cap; the parsers' own size controls (lopdf load_mem,
  image OCR max_pixels) cover them.
- `crates/kebab-source-fs/src/connector.rs`, added inside the existing mod tests:
  `size_cap_skips_only_code_files`, which verifies the walker result for a 300 KB PDF + MD + .rs.
  No regression in the existing sibling tests (huge.rs / longfile.rs, fixtures named `.rs`).
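
A minimal sketch of the short-circuit described above. The extension table inside `is_code_file` is an illustrative assumption (the real helper delegates to `code_lang_for_path`), and `walker_skips` is a hypothetical name for the combined check.

```rust
use std::path::Path;

/// Hypothetical extension table; the real helper is
/// `code_lang_for_path(path).is_some()`.
fn is_code_file(path: &Path) -> bool {
    matches!(
        path.extension().and_then(|ext| ext.to_str()),
        Some("rs" | "py" | "go" | "ts")
    )
}

/// True when the walker should skip the file before extraction.
/// Non-code files never hit the size cap because `&&` short-circuits.
pub fn walker_skips(path: &Path, byte_len: u64, max_file_bytes: u64) -> bool {
    is_code_file(path) && byte_len > max_file_bytes
}
```

With a 256 KB cap, a 300 KB `.rs` file is skipped while a 300 KB PDF or markdown file passes through to its parser.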

spec:  docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§3)
plan:  docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 1)
prior: b4d9e60 (PR #189)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug #3 (Critical) from the v0.20.0 sub-item 1 dogfood report. Ingesting scanned_page2.pdf
(1580 chars of OCR text) violated the `chunks.chunk_id` PRIMARY KEY:
`per_chunk_hash = #c{char_start}` used the post-overlap `actual_start`, and the
overlap walk floor collapsed to `prev_min`, so segments 1 and 2 both got `#c0`.

- `crates/kebab-chunk/src/pdf_page_v1.rs`: `chunk_page` returns 4-tuple
  (segment_start, actual_start, chunk_end, slice); caller `per_chunk_hash`
  suffix uses `segment_start` (pre-overlap boundary, strictly increasing)
  instead of `char_start` (post-overlap, may collapse to prev_min).
- VERSION_LABEL `"pdf-page-v1"` → `"pdf-page-v1.1"` (design §9 cascade,
  explicit user-facing audit trail). The hardcoded literals at
  `crates/kebab-app/tests/pdf_pipeline.rs:168, 368` are updated to v1.1 as well.
- module docs (`pdf_page_v1.rs:47-60`): the workaround description now references
  `#c{segment_start}` instead of `#c{char_start}`, states the segment_start
  invariant explicitly, and cross-references HOTFIXES.md.
- `pdf_page_v1.rs::tests`: `multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids`
  regression pin (10 char "가" + ". " + 500 char "나": triggers multi-chunk +
  overlap walk collapse).
- `tasks/HOTFIXES.md`: 2026-05-27 entry (symptom F2 1580 char OCR,
  intra-doc collision root cause, second-iteration patch rationale).
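
To make the collision concrete, here is a deliberately simplified model of it, not the real `chunk_page` logic. It assumes the overlap walk steps back `overlap` chars from each segment boundary and floors at `prev_min` (taken as 0 here); `suffixes` and `all_unique` are hypothetical helpers.

```rust
use std::collections::HashSet;

/// Simplified stand-in for the overlap walk: step back `overlap` chars from the
/// segment boundary, but never below `prev_min`.
fn actual_start(segment_start: usize, overlap: usize, prev_min: usize) -> usize {
    segment_start.saturating_sub(overlap).max(prev_min)
}

/// Builds `#c{n}` suffixes for each segment, keyed either by the pre-overlap
/// boundary (the fix) or by the post-overlap start (the old behaviour).
pub fn suffixes(segment_starts: &[usize], overlap: usize, use_segment_start: bool) -> Vec<String> {
    segment_starts
        .iter()
        .map(|&start| {
            let key = if use_segment_start {
                start
            } else {
                actual_start(start, overlap, 0)
            };
            format!("#c{key}")
        })
        .collect()
}

/// True when no two suffixes are equal.
pub fn all_unique(ids: &[String]) -> bool {
    let mut seen = HashSet::new();
    ids.iter().all(|id| seen.insert(id.as_str()))
}
```

With segment boundaries `[0, 12, 512]` and an overlap of 600, every post-overlap start collapses to 0, so the old key yields three `#c0` suffixes, while the strictly increasing `segment_start` keeps them distinct.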

spec:  docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§4)
plan:  docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 2)
prior: d9acda5 (Step 1 Bug #2 walker fix)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v0.20.0 sub-item 1 bugfix Step 3 (Group C) — integration-level regression
for Bug #3 (intra-doc chunk_id collision under aggressive overlap).

- `crates/kebab-app/tests/common/mod.rs`: `pub mod mock_ocr;` 1 line append.
- `crates/kebab-app/tests/common/mock_ocr.rs` (new): MockOcrEngine lift +
  `single` / `per_page` ctor (backward-compat single + per-page cursor).
- `crates/kebab-app/tests/pdf_ocr_apply.rs`: remove the inline MockOcrEngine and
  import it via `mod common; use common::mock_ocr::MockOcrEngine;`. 10 ctor call
  sites migrated (`MockOcrEngine { .. }` → `MockOcrEngine::single(...)`).
- `crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs`
  (new): F1 + F2 scanned PDF + Bug #3 trigger shape (10 char "가" + ". " +
  500 char "나") via mock OCR. assertion: chunk_id global uniqueness (HashSet
  dedup) across F1 + F2; F2 trigger text produces ≥2 chunks (collision shape).
- C1 decision: Option A (share via tests/common/mock_ocr.rs). Facade mock
  injection unavailable (OllamaVisionOcr hardcoded) — helper-level chain test
  (apply_ocr_to_pdf_pages → PdfPageV1Chunker) adds value beyond unit B5.

spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§4.5)
plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 3)
prior: 436fd01 (Step 2 Bug #3 chunk_id fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bug #4 from the v0.20.0 sub-item 1 dogfood report: lopdf `get_pages()` returned
count = 0 for F4 mojibake.pdf (broken Pages tree). Root cause: the previous byte-level
`re.sub` + manual startxref edit let the file pass lopdf's strict load but
broke the `/Kids` reference in the Pages dict.

- `tests/fixtures/_synth/mojibake.py`: full rewrite. Replace byte-level
  `re.sub` + manual startxref with pikepdf open+inject-dummy-ToUnicode+
  del+save (auto xref regen). HYSMyeongJo-Medium CID font: because the CID font
  does not generate a ToUnicode itself, a dummy stream is injected and then stripped
  (removed=1 invariant). Exit codes 2/3/4 for invariant fail.
- `crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf`: regenerate via
  pikepdf: 1 valid page, no /ToUnicode marker, reproducible byte-for-byte on regeneration.
- `crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json`:
  regen via 2-run cargo test pattern (hand-rolled unwrap_or_else baseline
  bootstrap, no insta crate).
- `crates/kebab-parse-pdf/tests/text_extractor_regression.rs`: append 3
  invariant tests: (1) lopdf 1-page, (2) /ToUnicode marker absent,
  (3) PdfTextExtractor 1-block invariant.
- `crates/kebab-parse-pdf/src/text_quality.rs`: f4_fixture_ratio_under_threshold
  threshold 0.3 → 0.5 (the production default of valid_ratio_threshold). The old broken
  fixture (pages=0) gave extract_text="" → ratio=0.0; the new fixed fixture gives
  CID 2-byte fallback decode → ratio≈0.375, which still satisfies the OCR trigger condition.

spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§5)
plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 4)
prior: 241ded5 (Step 3 integration test)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
altair823 added 3 commits 2026-05-27 21:28:22 +00:00
Why: ingest of metro-korea.pdf (Identity-H CID font without ToUnicode CMap)
wrongly finished with pdf_ocr_pages=0. The 28 ASCII chars of the
`?Identity-H Unimplemented?` marker emitted by lopdf 0.32.0 pass the
0x0020..=0x007E range in is_valid_text_char() → ratio=1.0 → bypasses the
0.5 OCR fallback threshold.

Change: MOJIBAKE_MARKERS const + a 4-stage compute_valid_char_ratio()
(strip → zero if empty after trim → dominance cap at 0.3 → existing ratio). The marker
list is extensible. The body of is_valid_text_char() is unchanged.
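
The 4-stage flow can be sketched as follows. This is an illustration under assumptions, not the crate's code: the validity rule in `is_valid_text_char` (printable ASCII or a Hangul syllable) is a guess at a plausible definition, the marker list holds only the literal quoted above, and dominance is compared on char counts.

```rust
/// Marker list (extensible); only the lopdf literal quoted above is known.
const MOJIBAKE_MARKERS: &[&str] = &["?Identity-H Unimplemented?"];

/// Assumed validity rule: printable ASCII or a Hangul syllable.
fn is_valid_text_char(c: char) -> bool {
    matches!(c, '\u{0020}'..='\u{007E}' | '\u{AC00}'..='\u{D7A3}')
}

pub fn compute_valid_char_ratio(text: &str) -> f64 {
    // Stage 1: strip every known marker and count how many chars that removed.
    let mut cleaned = text.to_string();
    for marker in MOJIBAKE_MARKERS {
        cleaned = cleaned.replace(marker, "");
    }
    let cleaned_chars = cleaned.chars().count();
    let stripped_chars = text.chars().count() - cleaned_chars;

    // Stage 2: nothing but markers or whitespace left means no usable text.
    if cleaned.trim().is_empty() {
        return 0.0;
    }

    // Stage 4 input: the plain valid-char ratio over the cleaned text.
    let valid = cleaned.chars().filter(|c| is_valid_text_char(*c)).count();
    let ratio = valid as f64 / cleaned_chars as f64;

    // Stage 3: when markers dominate the page, cap the ratio below the
    // 0.5 OCR fallback threshold so OCR still triggers.
    if stripped_chars > cleaned_chars {
        ratio.min(0.3)
    } else {
        ratio
    }
}
```

The cap rather than a hard zero keeps a distinction between "only markers" (0.0) and "mostly markers with a little real text" (at most 0.3); both fall under the 0.5 threshold.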

Tests: +2 unit (dominance + minority) on top of the existing 8. No parser_version
/ wire schema change.

Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
§4.1 / §4.2 / §6 R-1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: kebab search --media code has been functionally supported since v0.18.0
(handled first-class via a path outside MEDIA_KINDS; schema.v1.media_breakdown.code
exists). However, the value list in the SearchArgs clap doc-comment and in SKILL.md
line 57 was stale: `code` was missing. A user reading only --help could conclude
that code is unsupported.

Change: sync 2 surfaces: the multi-line clap doc-comment at main.rs lines 158-160
+ integrations/claude-code/kebab/SKILL.md line 57.
No Rust binary surface / wire schema change.

Out of scope (follow-up): crates/kebab-mcp/tools/search.rs:44,
crates/kebab-core/src/search.rs:32+52, crates/kebab-app/src/
ingest_progress.rs:69, crates/kebab-cli/tests/wire_schema_breakdowns.rs:35
hold the same stale list. They fall outside the grep boundary of spec ACCEPT
(round 1c), so they are not included in this round.

Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
§4.3 / §4.3a.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: the Step 2 doc-comment edit risks silently disappearing when someone later
reorders the value list or splits it into an alias section. clap renders --help
from the doc-comment's free-form text, so a grep-only smoke test is the only
means of detection.

Change: new test file (follows the kebab-cli `cli_*` prefix convention).
Runs the fresh binary via CARGO_BIN_EXE_kebab and asserts the `code` substring
in stdout. Maps 1:1 to the acceptance row in spec §4.4.

Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md §4.4
/ §5 (acceptance row 4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author
Owner

Bugfix2 round ready for review

`feat/pdf-scanned-ocr` force-updated from e674ff4 → f763049 (3 commits added):

  • a58ee10 — Bug #6 (Critical) Identity-H mojibake marker strip + dominance heuristic
  • 8cf73d1 — Bug #7 (Minor doc) CLI --media help text
  • f763049 — Bug #7 regression pin test

Verification

  • workspace: 1350 passed; 0 failed (clippy -D warnings clean)
  • dogfood retest (metro-korea.pdf, 58 MB Identity-H CID font):
    • pdf_ocr_pages: 0 → 19/21 (OCR triggered)
    • body: marker garbage → actual OCR text ("개척자 정주영", "한국은행장", etc.)
    • Korean lexical search: 0 hit → "정주영" 1 hit, "한국은행장" 1 hit ✓

Round trail

| Phase | Artifact | Result |
|-------|----------|--------|
| Dogfood | `.omc/reviews/2026-05-27-v0.20-bugfix-dogfood-report.md` | Bug #6 (Critical) + #7 (Minor doc) catalog; #8 falsified |
| Spec | `docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md` (308 line) | round 2 ACCEPT (9/9 critic finding) |
| Plan | `docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix2-plan.md` (388 line) | closure ACCEPT (4/4 step + 8/8 verifier) |
| Executor | `.omc/reviews/2026-05-27-v0.20-bugfix2-executor-result.md` | DONE (3 commit, 7/8 verifier row green) |

Ready for merge decision.

🤖 Generated with Claude Code

altair823 added 8 commits 2026-05-28 01:21:13 +00:00
capabilities_snapshot() reported streaming_ask + single_file_ingest as hardcoded false,
but the v0.20 final dogfood showed the actual implementations are production-grade:
- kebab ask --stream → answer_event.v1 ndjson, 191 events emitted correctly
- kebab ingest-file <path> / kebab ingest-stdin --title <T> → ingest_report.v1 correct

When agents such as an MCP host or the Claude Code skill route on schema.capabilities,
this is a false negative: users conclude that a working feature is unavailable.

http_daemon stays false (not implemented; it belongs to a separate sub-item).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before: `kebab search "rust" --config /tmp/nonexistent.toml --json` exited 0 with
`{"hits":[]}`, silently falling back to the XDG default. A typo or wrong path surfaced
only as 0 hits, which made debugging painful.

After: adds kebab_config::ConfigNotFound (thiserror::Error); the
`Some(p) if !p.exists()` arm of Config::load returns anyhow::Error::new(ConfigNotFound { path }).
kebab_app::error_wire::classify downcasts it → ErrorV1 with code=config_not_found,
hint, and details.path filled in, emitted to stderr as ndjson.

R-1 (relative path): std::path::Path::exists() resolves relative to cwd, so both
absolute and relative paths are covered with no extra work. Verified by two integration tests.
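
The fail-fast flow can be sketched with the standard library alone. The real code uses thiserror and anyhow; here `Box<dyn Error>` stands in for `anyhow::Error`, `check_explicit_config` is a hypothetical reduction of `Config::load`, and the `"internal"` fallback code is an assumption.

```rust
use std::fmt;
use std::path::{Path, PathBuf};

/// Typed error carrying the path that was passed via --config.
#[derive(Debug)]
pub struct ConfigNotFound {
    pub path: PathBuf,
}

impl fmt::Display for ConfigNotFound {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "config file not found: {}", self.path.display())
    }
}

impl std::error::Error for ConfigNotFound {}

/// Fails fast when an explicit config path does not exist; `None` means
/// "use the default location" and is not checked here.
pub fn check_explicit_config(path: Option<&Path>) -> Result<(), Box<dyn std::error::Error>> {
    match path {
        Some(p) if !p.exists() => Err(Box::new(ConfigNotFound { path: p.to_path_buf() })),
        _ => Ok(()),
    }
}

/// Maps an error to a wire code by downcasting to the typed error.
/// "internal" is a placeholder for whatever the real fallback code is.
pub fn classify(err: &(dyn std::error::Error + 'static)) -> &'static str {
    if err.downcast_ref::<ConfigNotFound>().is_some() {
        "config_not_found"
    } else {
        "internal"
    }
}
```

Keeping the error typed is what lets the classifier fill `details.path` without parsing a message string.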

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
metro-korea.pdf v0.20 final-dogfood (2026-05-27):
- page 8 and page 13 both ran all the way to the 600s default timeout
  (`ms: 600000, chars: 0, skipped: true`)
- result: body text not indexed, and 10 minutes wasted per page (20 minutes across the two pages)

Measured per-page throughput on cloud GPU Ollama is 6-32s (far faster than the 105s
assumed by the parent spec), so 60s is a production-friendly upper bound. For
dense/high-resolution pages the user can raise it via a config.toml override
(`[pdf.ocr] request_timeout_secs = N`). The HOTFIXES entry + parent spec cross-link follow in Step 6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before: schema.v1.models reported only a single parser_version / chunker_version value →
risk of missing versions in a cascade audit of a multi-medium corpus
(md + pdf + code Rust/Python + dockerfile + k8s + manifest).

After: additive minor. The Models struct gains active_parsers + active_chunkers (Vec<String>).
Backward compat: the existing single fields are kept (markdown default), and the new arrays
are optional (#[serde(default)] + not listed in the JSON schema's required).

source:
- kebab_store_sqlite::fetch_distinct_parser_versions() returns
  documents.parser_version with DISTINCT + ORDER BY.
- fetch_distinct_chunker_versions() follows the same pattern for chunks.chunker_version.
- collect_models recomputes on every schema call (no cache, so R-3 is resolved automatically).

The wire schema change is additive only, so no major bump is needed; a v0.20.1 minor is sufficient.
integrations/claude-code/kebab/SKILL.md updated in sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before: `kebab search "" --json` / `kebab search "  " --json` / `kebab ask "" --json`
all exited 0 with a silent 0 hits (search) or an LLM round-trip on an empty prompt (ask).
User mistakes (typos, shell expansion slips) went silent → debugging cost.

After: in both arms, `query.trim().is_empty()` → kebab_app::StructuredError
(ErrorV1, code=invalid_input, with hint). exit=2 (StructuredError → the existing
generic non-zero path of exit_code()).

--bulk mode is unaffected (the bulk arm ignores query).
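
A minimal sketch of the guard, assuming a simplified error type. `InvalidInput` and `validate_query` are hypothetical stand-ins for the real `StructuredError` / `ErrorV1` plumbing, and the hint text is invented for illustration.

```rust
/// Hypothetical reduction of the structured error to the two fields named above.
#[derive(Debug, PartialEq, Eq)]
pub struct InvalidInput {
    pub code: &'static str,
    pub hint: &'static str,
}

/// Rejects empty or whitespace-only queries; bulk mode ignores the query,
/// so it is never rejected here.
pub fn validate_query(query: &str, bulk: bool) -> Result<(), InvalidInput> {
    if !bulk && query.trim().is_empty() {
        return Err(InvalidInput {
            code: "invalid_input",
            // Illustrative hint text, not the real message.
            hint: "query must contain at least one non-whitespace character",
        });
    }
    Ok(())
}
```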

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Frozen-spec deviation handoff for Bug #11 (previous commit
`fix(config): pdf.ocr.request_timeout_secs default 600 → 60`).

- tasks/HOTFIXES.md: 2026-05-27 dated subsection in the 5-field format
  Discovered / Symptom / Root cause / Fix / Amends (matches existing entries).
- docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md: 2 inline HTML comments as
  cross-links at PDF OCR config block line 1000 (default value) + OQ-1 line 1628.
  No prose change: the parent spec's frozen contract is preserved, and HTML comments are
  invisible when the markdown is rendered.

The HOTFIXES entry is the live source of truth (CLAUDE.md "Spec contract" rule).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Models struct extension in Step 4 (active_parsers / active_chunkers added) missed the
test fixture initialization in crates/kebab-cli/src/wire.rs → E0063 compile error.
#[serde(default)] applies only to serde deserialization; a struct literal must still initialize every field.
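
One way to keep fixtures compiling when a struct grows is struct update syntax. The sketch below assumes a reduced `Models` that derives `Default`; whether the real struct does, and whether the actual fix used this approach or simply listed the new fields, is not stated in the commit. The version strings are placeholders.

```rust
/// Reduced, hypothetical version of the Models struct with only the four
/// fields discussed above.
#[derive(Debug, Default, PartialEq)]
pub struct Models {
    pub parser_version: String,
    pub chunker_version: String,
    pub active_parsers: Vec<String>,
    pub active_chunkers: Vec<String>,
}

/// A fixture written with `..Default::default()` keeps compiling when new
/// fields are added, which avoids E0063 in test code.
pub fn fixture_models() -> Models {
    Models {
        parser_version: "example-parser-v1".to_string(), // placeholder value
        chunker_version: "example-chunker-v1".to_string(), // placeholder value
        ..Default::default()
    }
}
```

The trade-off is that a forgotten field silently takes its default, so fixtures that must exercise the new arrays should still set them explicitly.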

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
schema_report_reflects_freshly_ingested_kb asserted `!streaming_ask`, but the
Bug #9 fix (760eee8) corrected streaming_ask to true. The assertion is inverted accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
altair823 added 1 commit 2026-05-28 01:21:40 +00:00
Artifacts of the 3-round dogfood-driven fix cycle:

- bugfix1 (Bug #2/#3/#4): spec 964 lines + plan 848 lines
- bugfix2 (Bug #6/#7, #8 falsified): spec 308 lines + plan 388 lines
- bugfix3 (Bug #9/#10/#11/#13/#14, #12 falsified): spec 410 lines + plan 1043 lines
- docs/DOGFOOD.md: the full end-to-end dogfood checklist (§0 environment through §13 reference corpus)

Each round's spec/plan was frozen after critic + verifier round 2 closure ACCEPT, grounded in dogfood-driven evidence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author
Owner

Bugfix3 round ready for review

`feat/pdf-scanned-ocr` force-updated from f763049 → 46e9947 (9 commits added: 8 code + 1 doc).

Fixes 5 bugs found in the final end-to-end dogfood after bugfix2 (all of DOGFOOD.md §12).

Per-bug user-visible verification (post-fix dogfood retest)

| Bug | Pre-fix | Post-fix |
|-----|---------|----------|
| #9 (Critical wire schema) | capabilities.streaming_ask + single_file_ingest = false (agent host mis-route) | true ✓ |
| #10 (UX fail-fast) | invalid --config silent 0 hit | exit=2 + error.v1 config_not_found + details.path ✓ |
| #11 (Critical UX OCR) | request_timeout_secs = 600 (10 min/page) | 60 (1 min) ✓ |
| #13 (UX partial schema) | parser_version + chunker_version single (md only) | + active_parsers (5) + active_chunkers (8), deterministic ORDER BY ✓ |
| #14 (Minor UX) | empty query silent | exit=2 + error.v1 invalid_input + hint (search + ask) ✓ |

Verification

  • workspace: all tests pass (1350 baseline + 8 new = 1358+), clippy -D warnings EXIT=0
  • spec ACCEPT (410 line, 11/11 critic finding)
  • plan ACCEPT (1043 line, 7/7 step + 10/10 acceptance)
  • executor: 8 commit, 13/13 verifier green

Workflow trail

| Round | Artifact | Result |
|-------|----------|--------|
| Round 3 dogfood | `.omc/reviews/2026-05-27-v0.20-final-dogfood-report.md` | Bug #9/#10/#11/#13/#14 catalog (Bug #8/#12 falsified) |
| Spec | `docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix3-spec.md` | round 2 ACCEPT |
| Plan | `docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix3-plan.md` | closure ACCEPT |
| Executor | `.omc/reviews/2026-05-27-v0.20-bugfix3-executor-result.md` | DONE (8 commit, 0 concerns) |

Wire schema additive minor

In `docs/wire-schema/v1/schema.schema.json`, `models.active_parsers` + `active_chunkers` are optional (NOT required), so the change is backward-compatible. `integrations/claude-code/kebab/SKILL.md` updated in sync.

HOTFIXES (Bug #11 deviation)

tasks/HOTFIXES.md 2026-05-27 entry + parent spec cross-link comment (no change to the parent spec prose).

30 commits on the branch in total. Ready for merge decision.

🤖 Generated with Claude Code

altair823 added 1 commit 2026-05-28 01:40:28 +00:00
In the round 3 final dogfood (2026-05-28), the 60s default forced OCR timeouts on dense
Korean pages (metro-korea.pdf pages 8/9/13): one more page lost from the index compared
with round 2. From the user's perspective, at 60s the cost vs coverage trade-off cuts
too far into coverage.

Adopt a gradual-reduction policy to find the sweet spot: start from a conservative 180s
and reduce step by step based on dogfood evidence (distribution of OCR ms). Do not jump
straight to a short default such as 60s.

- crates/kebab-config/src/lib.rs::default_pdf_ocr_request_timeout_secs() = 180
- unit test rename (_is_60s → _is_180s) + assertion 180
- crates/kebab-config/tests/pdf_ocr.rs assert_eq 180
- tasks/HOTFIXES.md: 2026-05-28 follow-up entry added

The user override path is preserved: users tune it directly via
config.toml [pdf.ocr] request_timeout_secs = N.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
altair823 added 7 commits 2026-05-28 04:18:45 +00:00
Config side of the v0.20.x ingest log surface. New LoggingCfg struct:
  * ingest_log_enabled (bool, default true)
  * ingest_log_dir (PathBuf, default "{state_dir}/logs")

With the #[serde(default)] tag, a pre-v0.20 config with no [logging] section
initializes to LoggingCfg::default() automatically (AC-10 backward compat).

Actual expansion of the {state_dir} placeholder is handled by the expand_log_dir
helper in step 2 (IngestLogWriter); kebab-config's expand_path_with_base
does not support {state_dir} (spec §6 R-3).
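
A hand-rolled placeholder expansion can be as small as a string replace. The sketch below assumes this signature for `expand_log_dir`; the real helper's signature and edge-case handling are not shown in the commit.

```rust
use std::path::{Path, PathBuf};

/// Substitutes the `{state_dir}` placeholder in a configured log directory.
/// Templates without the placeholder are returned unchanged.
pub fn expand_log_dir(template: &str, state_dir: &Path) -> PathBuf {
    PathBuf::from(template.replace("{state_dir}", &state_dir.to_string_lossy()))
}
```

`to_string_lossy` replaces non-UTF-8 path bytes with U+FFFD, which is acceptable for a sketch but may need stricter handling in production code.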

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Module side of the v0.20.x ingest log surface. New crates/kebab-app/src/
ingest_log.rs:
  * IngestLogWriter: open/write_event/write_summary/flush + flush on Drop
  * LogEvent enum, 4 variants (ocr / parse_error / skip / error)
  * IngestSummary struct (kind="summary" literal + 11 stat fields)
  * generate_run_id (ISO 8601 prefix + last 8 hex of a uuid v7)
  * expand_log_dir (hand-rolled expansion of the {state_dir} placeholder)
  * now_ts (Rfc3339 UTC helper)
  * percentiles helper (sorted Vec p50/p90/max)

uuid v7 is already a workspace dep, which avoids a new rand dependency (spec §6 R-5).

This step is a self-contained writer + 5 unit tests. Wiring the 5 emit hooks into the
ingest pipeline is step 4.
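
The percentile rule used by the helper is not specified above. The sketch below assumes the nearest-rank method (rank = ceil(p/100 × n), 1-based) and a hypothetical signature returning `(p50, p90, max)`.

```rust
/// Returns (p50, p90, max) of the samples, or `None` when there are none.
/// Percentiles use the nearest-rank method, which is an assumption here.
pub fn percentiles(samples: &[u64]) -> Option<(u64, u64, u64)> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let pick = |p: usize| {
        // ceil(p * n / 100), clamped to at least rank 1.
        let rank = ((p * sorted.len() + 99) / 100).max(1);
        sorted[rank - 1]
    };
    Some((pick(50), pick(90), sorted[sorted.len() - 1]))
}
```

Nearest-rank always returns an observed sample, which keeps the summary fields integral and easy to compare against individual `ocr` log lines.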

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire side of the v0.20.x ingest log feature. Additive minor cascade:

  * 4 new fields on PdfOcrProgress::Finished + IngestEvent::PdfOcrFinished:
      - image_byte_size: Option<u64>
      - image_width:     Option<u32>
      - image_height:    Option<u32>
      - failure_reason:  Option<String>
  * docs/wire-schema/v1/ingest_progress.schema.json: 4 added properties
    (all optional, required unchanged = additive minor)
  * integrations/claude-code/kebab/SKILL.md: wire schema description synced

Existing ingest_progress.v1 consumers (CLI wire dump, integration test
fixtures, kebab-cli wire_search/wire_ask) stay backward-compatible because the 4
added fields default to Option::None. No version bump (additive minor is not a
binary-version cascade trigger per CLAUDE.md §Versioning cascade).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ingest pipeline wiring for the v0.20.x ingest log feature. 5 emit hooks:

  Hook 1: ingest_with_config_opts entry/exit (writer init + summary write + flush)
  Hook 2: apply_ocr_to_pdf_pages closure (PdfOcrProgress::Finished → LogEvent::Ocr)
  Hook 3: ingest_one_*_asset Err arm (LogEvent::Error)
  Hook 4: enumerate fs_skips.events right after scan (LogEvent::Skip)
  Hook 5: (merged into Hook 3) per-asset fatal error → LogEvent::Error

To carry skip events for Hook 4, an events: Vec<FsSkipEvent> field is added
to FsScanSkips in kebab-source-fs (kebab-source-fs does not call back into
kebab-app, which avoids a dependency cycle).

Ownership: a single Option<Arc<Mutex<IngestLogWriter>>> binding; the 5 hooks
clone+lock+write. ocr_ms_samples (Vec<u64>, successes only) is shared via
Arc<Mutex>, and the summary stage sorts it and computes p50/p90/max. The
per-asset loop is single-threaded, so there is no deadlock/contention risk.

A writer failure does not fail the ingest itself (tracing::warn, then continue).
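
The ownership pattern above (one optional shared writer, hooks that clone+lock+write, failures that never abort the ingest) can be sketched as follows. `LineWriter`, `SharedWriter`, and `emit` are hypothetical stand-ins for illustration, and `eprintln!` stands in for the `tracing::warn` call mentioned in the commit.

```rust
use std::io::Write;
use std::sync::{Arc, Mutex};

/// Stand-in for the real IngestLogWriter: appends one line per event.
struct LineWriter<W: Write> {
    out: W,
}

impl<W: Write> LineWriter<W> {
    fn write_line(&mut self, line: &str) -> std::io::Result<()> {
        writeln!(self.out, "{line}")
    }
}

/// `None` models `ingest_log_enabled = false`: hooks become no-ops.
type SharedWriter<W> = Option<Arc<Mutex<LineWriter<W>>>>;

/// Called from each hook. Returns whether the line was written. Any failure
/// (disabled writer, poisoned lock, I/O error) is reported and swallowed so
/// that the ingest run itself keeps going.
fn emit<W: Write>(writer: &SharedWriter<W>, line: &str) -> bool {
    let Some(shared) = writer else { return false };
    let Ok(mut guard) = shared.lock() else { return false };
    match guard.write_line(line) {
        Ok(()) => true,
        Err(err) => {
            // Assumption: the real code logs via tracing::warn here.
            eprintln!("ingest log write failed: {err}");
            false
        }
    }
}
```

Each hook would hold its own `writer.clone()` (a cheap `Arc` clone) and call `emit` on it. Because the per-asset loop is single-threaded, the lock is never contended, and the `Mutex` mainly provides the interior mutability needed to share the writer across closures.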

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New file crates/kebab-app/tests/ingest_log_smoke.rs:

  * ingest_log_smoke (AC-9): tempdir + 1 md + 1 scanned PDF →
    ingest → assert the log file exists + every line is valid JSON +
    every kind ∈ {ocr,parse_error,skip,error,summary} + the last
    line has kind=summary + scanned>0.

  * ingest_log_disabled_emits_no_file (AC-6): with enabled=false,
    verify that log_dir contains 0 ingest-*.ndjson files.

fixture: reuses ../kebab-parse-pdf/tests/fixtures/scanned_page1.pdf
(OCR disabled, so the smoke test runs without Ollama).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  * kebab-config/tests/logging_roundtrip.rs: r#"..."# → plain string
    (clippy::unnecessary_hashes).
  * kebab-app/tests/ingest_log_smoke.rs: |e| e.ok() → Result::ok,
    |s| s.as_u64() → Value::as_u64 (clippy::redundant_closure).
  * cargo fmt --all applied to pre-existing formatting drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The final `fix(test): clippy + fmt fixes` commit from the Phase C4 executor
applied fmt only to the test files. Found that fmt had been skipped for the
rest of the workspace → applied cargo fmt --all. All imports reordered
alphabetically + line wrapping normalized.

Untracked artifacts committed at the same time:
- docs/superpowers/specs/2026-05-28-v0.20-ingest-log-spec.md (491 lines, ACCEPT)
- docs/superpowers/plans/2026-05-28-v0.20-ingest-log-plan.md (616 lines, ACCEPT)

workspace test: 1370 passed / 0 failed / 50 ignored, ingest_log_smoke green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author
Owner

Bugfix4 + ingest log feature ready for review

feat/pdf-scanned-ocr: 6a9551e → 6850077, 8 commits added (HEAD).

Round 4 commits

  1. 6a9551e — fix(config): OCR timeout 60s → 180s (Bug #11 follow-up, sweet spot, gradual reduction)
  2. f60304b — feat(config): [logging] section (ingest_log_enabled + ingest_log_dir)
  3. f8a4c79 — feat(app): IngestLogWriter + LogEvent enum
  4. bef0c98 — feat(wire): PdfOcrProgress.Finished + ingest_progress.v1 additive 4 fields
  5. f9dc0f7 — feat(app): wire IngestLogWriter into 5 ingest emit hooks (Arc<Mutex> sync)
  6. 415227b — test(app): ingest_log_smoke integration test
  7. 445b096 — fix(test): clippy + fmt fixes
  8. 6850077 — style: cargo fmt --all + spec/plan/dogfood artifacts

Verification (R5 dogfood)

  • workspace test: 1370 passed / 0 failed / 50 ignored + ingest_log_smoke green.
  • clippy -D warnings exit 0.
  • Effect of the Bug #11 fix: ingest time 23 min → 5.2 min → 9.3 min → 10.9 min (180s = same coverage as round 2, 19/21 OCR, + 50% cost saved).

Log file auto-creation verified

From the R5 dogfood run, ~/.local/state/kebab/logs/ingest-20260528T042332Z-b9352652.ndjson:

  • 23 OCR records + 1 summary
  • 3 OCR failures: page 8/13 (180s timeout), mojibake.pdf (text-detect skip)

Sweet-spot evidence (from log summary)

| metric | value |
|--------|------:|
| p50 | 1.5s |
| p90 | 33s |
| max | 62s |
| timeout cap | 180s (×2 = 6 min cap) |

180s default ≈ p90 × 5.5, which leaves ample buffer. It can be tightened gradually later.
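
The buffer figure is a plain ratio of the timeout to an observed latency. A minimal sketch, where `headroom` is a hypothetical helper name used only for this illustration and the inputs are the p90 and max values reported above:

```rust
/// Ratio of a timeout to an observed latency, both in seconds.
/// `headroom` is a hypothetical helper, not part of the PR.
fn headroom(timeout_s: f64, observed_s: f64) -> f64 {
    timeout_s / observed_s
}
```

`headroom(180.0, 33.0)` is about 5.45, which matches the rounded 5.5 quoted for p90. `headroom(180.0, 62.0)` is about 2.9, so the slowest recorded sample still finished in roughly a third of the timeout.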

Wire schema additive minor

The 4 fields added to the ingest_progress.v1.pdf_ocr_finished event are optional (image_byte_size + image_width + image_height + failure_reason). Backward-compatible.

Configurable log path

[logging]
ingest_log_enabled = true   # default
ingest_log_dir = "{state_dir}/logs"   # XDG state dir

per-ingest-run filename = ingest-{ISO8601}Z-{hex8}.ndjson. Users can analyze it with jq.
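
The filename pattern can be illustrated with a small sketch. The function names are hypothetical, the timestamp is passed in as an already formatted compact UTC string, and the UUID is supplied as text so the example stays free of external crates. The real generate_run_id may be structured differently.

```rust
/// Last 8 hex digits of a UUID string, ignoring hyphens. Returns None if the
/// input holds fewer than 8 hex digits. Hypothetical helper for illustration.
fn last_hex8(uuid: &str) -> Option<String> {
    let hex: String = uuid.chars().filter(|c| c.is_ascii_hexdigit()).collect();
    if hex.len() < 8 {
        return None;
    }
    // Only ASCII hex digits remain, so byte slicing is safe here.
    Some(hex[hex.len() - 8..].to_ascii_lowercase())
}

/// Build `ingest-{ISO8601}Z-{hex8}.ndjson` from a compact UTC timestamp
/// such as "20260528T042332" and an 8-digit hex suffix.
fn ingest_log_file_name(compact_utc: &str, hex8: &str) -> String {
    format!("ingest-{compact_utc}Z-{hex8}.ndjson")
}
```

For example, the timestamp `20260528T042332` combined with the suffix `b9352652` reproduces the log file name observed in the R5 dogfood run.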

Ready for merge decision.

🤖 Generated with Claude Code

altair823 merged commit 09333d0b05 into main 2026-05-28 04:37:54 +00:00
altair823 deleted branch feat/pdf-scanned-ocr 2026-05-28 04:37:59 +00:00
Reference: altair823-org/kebab#189