docs(v0.20): sync README + HANDOFF + ARCHITECTURE + SMOKE for scanned PDF OCR (post-extract enrichment, qwen2.5vl:3b, DCTDecode-only v1)

Step 10 (Group J) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

J0 — release notes path decision: commit body (RELEASE_NOTES.md /
docs/RELEASE_NOTES_*.md 부재, v0.17.x/v0.18.0 patterns 의 commit body
release notes 형식 따름). Step 11 K1 commit body 안 inline.

J1 — README.md:
- Configuration section 의 toml table list 에 `[pdf.ocr]` 추가.
- 새 sub-section `### [pdf.ocr] — scanned PDF OCR (v0.20.0+)`: 11 field
  toml example + `KEBAB_PDF_OCR_*` 11 env override + force-reingest UX
  ("v0.19 indexed scanned PDF 가 v0.20 upgrade 후 자동 OCR 미적용,
  `kebab ingest --force` 필요").

J2 — HANDOFF.md:
- phase status P7 row 확장: 3/3 component + post-extract OCR enrichment
  (v0.20.0 sub-item 1, qwen2.5vl:3b vision LLM).
- "머지 후 발견된 결정" entry: v0.20 sub-item 1 의 design + scope
  (H-1 post-extract pattern + DCTDecode-only v1 + parser_version 보존 + H-4 UX).

J3 — docs/ARCHITECTURE.md:
- OCR row 분리: `OCR (image)` (gemma4:e4b 그대로) + `OCR (PDF, v0.20.0+)`
  (qwen2.5vl:3b, post-extract enrichment via kebab-app::pdf_ocr_apply,
  DCTDecode-only v1, family asymmetry — PoC alnum 94.79% vs gemma4 27%).
- PDF parser row 확장: page_image::extract_dctdecode_page_image (v0.20.0) +
  parser_version "pdf-text-v1" 보존 + provenance event 차별화.

J3 — docs/SMOKE.md:
- `[pdf.ocr]` 격리 config example (enabled=true, model=qwen2.5vl:3b).
- 새 dogfood section `### v0.20 force-reingest (scanned PDF OCR)`:
  v0.19 → v0.20 upgrade path 의 명시적 `kebab ingest --force` invoke.

J4 — release notes draft (Step 11 K1 commit body 의 source):
- result file 안 record (4 topic: opt-in + force-reingest + DCTDecode-only +
  family asymmetry + dogfood/test 결과).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§6.4)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 10 J0-J4)
prior: 1d4e301 (Step 9 + Cargo.lock follow-up)
contract: §9

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-27 10:34:24 +00:00
parent 1d4e301e5e
commit 90726ab283
4 changed files with 58 additions and 4 deletions

View File

@@ -17,7 +17,7 @@ P0P5 + P6 + P7 + P9-1/2/3/4 (Library / Search / Ask / Inspect) + P10 전체
| **P4** | Local LLM + RAG + grounded answer | `kebab-llm`, `kebab-llm-local`, `kebab-rag` | P3 | ✅ 완료 |
| **P5** | Golden query / regression eval | `kebab-eval` | P4 | ✅ 완료 |
| **P6** | 이미지 ingestion (OCR + caption) | `kebab-parse-image` | P5 | ✅ 완료 (4/4 component, OCR/caption Ollama-vision) |
| **P7** | PDF text + page citation | `kebab-parse-pdf` | P5 | ✅ 완료 (3/3 component, page-level chunker + ingest wiring) |
| **P7** | PDF text + page citation + scanned OCR (v0.20.0 sub-item 1) | `kebab-parse-pdf` + `kebab-app::pdf_ocr_apply` | P5 + P6 | ✅ 완료 (3/3 component, page-level chunker + ingest wiring + post-extract OCR enrichment via qwen2.5vl:3b vision LLM) |
| **P8** | 음성 transcription + timestamp citation | `kebab-parse-audio` | P5 | ⏸ 보류 (whisper-rs 시스템 dep brainstorm 필요) |
| **P9** | TUI + desktop app | `kebab-tui`, `kebab-desktop` | P5 | 🟡 진행 (4/5 component — P9-1/2/3/4 완료 [Library / Search / Ask / Inspect], P9-5 desktop 예정 · 도그푸딩 피드백 **20/20 ✅**) |
| **P10** | code ingest framework | `kebab-parse-code` | P5 | 🟡 진행 중 — 1A-1 ✅ (wire schema + parse-code skeleton + filter flags), 1A-2 ✅ (Rust AST chunker, `code-rust-ast-v1` — v0.7.0), 1B ✅ (Python/TS/JS AST chunkers — v0.8.0 이후), **1C-Go ✅ (Go AST chunker, `code-go-ast-v1` — v0.12.0)**, **1C-JavaKotlin ✅ (Java + Kotlin AST chunkers, `code-java-ast-v1` / `code-kotlin-ast-v1` — v0.13.0)**, **2 ✅ (Tier 2 resource-aware: yaml/k8s + dockerfile + manifest, `k8s-manifest-resource-v1` / `dockerfile-file-v1` / `manifest-file-v1` — v0.14.0)**, **3 ✅ (Tier 3 paragraph fallback: code-text-paragraph-v1 — v0.15.0)**, **1D ✅ (C + C++ AST chunkers, code-c-ast-v1 + code-cpp-ast-v1 — v0.16.0)** |
@@ -32,6 +32,7 @@ P0~P5 직렬. P6~P9 P5 이후 병렬 가능.
머지 후 발견된 모든 deviation / hotfix 의 dated 로그는 [tasks/HOTFIXES.md](tasks/HOTFIXES.md). 본 요약은 \"누군가가 인수받을 때 알아두면 시간을 많이 절약하는\" 항목만:
- **v0.20 sub-item 1 (scanned PDF OCR via qwen2.5vl:3b)**: post-extract enrichment pattern (`kebab-app::pdf_ocr_apply`, H-1 resolution), DCTDecode-only v1 scope (FlateDecode/CCITTFax page 는 warning + skip), parser_version `"pdf-text-v1"` 보존 + force-reingest UX 명문 (H-4).
- **2026-05-26 kebab-normalize + kebab-parse-types 흡수 (24 → 22 crates, design §3.7b 재작성)** — v0.19.0 cut. 4 parser 중 markdown 한 갈래만 lift 를 경유하는 reality 가 design §3.7b 의 fan-in ≥ 2 가정과 diverge → thin layer (`kebab-parse-types`) + `kebab-normalize` 두 crate 가 `kebab-parse-md` 로 흡수. 5 사용 type + 3 forward-declared struct 모두 `kebab-parse-md::{types,normalize}` module 의 `pub` re-export 로 보존. wire / surface impact = 0 (CLI / TUI / MCP / `--json` / config / XDG / parser_version 모두 unchanged). 자세한 내용: `tasks/HOTFIXES.md` (2026-05-26 design deviation entry).
- **2026-05-26 v0.18.0 fb-41 multi-hop RAG + NLI verification ship (PR #176-180) + post-PR9 cleanup (PR #181)** — pre-v0.18.0 dogfood (`/build/cache/dogfood-v018/`, 33 assets / 205 chunks, gemma3:4b CPU only / 16 GB RAM) 에서 발견된 S7 caffeine hallucination 의 root cause = LLM-self-judge ceiling (synthesize 가 chunks 와 무관한 Adam optimizer gradient 식을 silent emit, self-judge 가 reject 못함). 학계 표준 (Self-RAG, CRAG, Auto-GDA, MedTrust-RAG) 결론 = deterministic post-synthesis verification. mDeBERTa-v3 XNLI ONNX (280 MB, Xenova HF) 가 `(packed_chunks, answer)` entailment 검사 — `[rag] nli_threshold > 0` (default 0.0 = disabled, production 권장 0.5) 일 때 활성. dogfood retest 측정 — S7 PR-8 baseline `grounded=true + Adam hallucination` → PR-9 `nli_verification_failed, nli_score 0.0035`. wire additive minor — `answer.v1.verification` field + `refusal_reason``nli_verification_failed` / `nli_model_unavailable` 추가, pre-v0.18 reader 무영향. 5 sub-PR 시퀀스 + cleanup PR (clippy::pedantic baseline + 의도적 30+ allow + H1 `[models.nli].model` config wiring + 9 new tests). post-refactor retest = PR-9d byte-identical (deterministic 확인). 자세한 내용: `tasks/HOTFIXES.md` (2026-05-25 fb-41 PR-9 closure entry + S3 follow-up).
- **2026-05-25 v0.17.2 post-v0.17.1 polish (PR #164 + #165)** — v0.17.1 의 두 follow-up closure. (1) `[image.ocr] request_timeout_secs` 별 노브 — `crates/kebab-parse-image/src/ocr.rs::REQUEST_TIMEOUT` hard 300s 제거, LLM 쪽 패턴 (PR #162) 을 OCR 어댑터에 동일 적용. 사용자 결정으로 별 노브 분리 (OCR vs LLM 의 cold start 패턴이 달라 독립 조절). v0.17.1 미진행 항목 closure. (2) `chunks_fts``heading_path` 컬럼이 JSON 표기 + path 세그먼트 까지 trigram 색인 → query false positive 가능 문제 closure. `lexical.rs::build_match_string` 가 non-raw 분기 결과를 `text : (<expr>)` 로 wrap — heading 색인 V007 verbatim 유지, 매칭만 text 한정. 사용자가 명시 heading 검색 하려면 raw mode `'heading_path : <token>'` escape hatch (SKILL.md 갱신). 둘 다 additive (옛 config 호환) / re-ingest 불필요. 자세한 내용: `tasks/HOTFIXES.md` (2026-05-25 v0.17.2 두 entry).