docs(v0.20): sync README + HANDOFF + ARCHITECTURE + SMOKE for scanned PDF OCR (post-extract enrichment, qwen2.5vl:3b, DCTDecode-only v1)

Step 10 (Group J) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

J0 — release notes path decision: commit body (RELEASE_NOTES.md /
docs/RELEASE_NOTES_*.md do not exist, so this follows the v0.17.x/v0.18.0
pattern of commit-body release notes). Inlined in the Step 11 K1 commit body.

J1 — README.md:
- Add `[pdf.ocr]` to the toml table list in the Configuration section.
- New sub-section `### [pdf.ocr] — scanned PDF OCR (v0.20.0+)`: 11-field
  toml example + 11 `KEBAB_PDF_OCR_*` env overrides + force-reingest UX
  ("scanned PDFs indexed under v0.19 are not OCR'd automatically after the
  v0.20 upgrade; `kebab ingest --force` is required").

J2 — HANDOFF.md:
- Extend the phase status P7 row: 3/3 components + post-extract OCR enrichment
  (v0.20.0 sub-item 1, qwen2.5vl:3b vision LLM).
- New entry under "머지 후 발견된 결정" (decisions discovered after merge):
  design + scope of v0.20 sub-item 1
  (H-1 post-extract pattern + DCTDecode-only v1 + parser_version preserved + H-4 UX).

J3 — docs/ARCHITECTURE.md:
- Split the OCR row: `OCR (image)` (gemma4:e4b, unchanged) + `OCR (PDF, v0.20.0+)`
  (qwen2.5vl:3b, post-extract enrichment via kebab-app::pdf_ocr_apply,
  DCTDecode-only v1, family asymmetry — PoC alnum 94.79% vs gemma4 27%).
- Extend the PDF parser row: page_image::extract_dctdecode_page_image (v0.20.0) +
  parser_version "pdf-text-v1" preserved + OCR use distinguished via a provenance event.

J3 — docs/SMOKE.md:
- Isolated `[pdf.ocr]` config example (enabled=true, model=qwen2.5vl:3b).
- New dogfood section `### v0.20 force-reingest (scanned PDF OCR)`:
  explicit `kebab ingest --force` invocation for the v0.19 → v0.20 upgrade path.
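
As a rough illustration of the isolated config described above, the two keys
this commit names might look like the fragment below. Only `enabled` and
`model` come from this commit message; quoting the model name as a string is
an assumption, and the other nine fields of the 11-field `[pdf.ocr]` table are
not listed here, so they are omitted rather than guessed.

```toml
# Sketch only: the two [pdf.ocr] keys named in this commit message.
# The remaining nine fields are not shown in this commit and are left out.
[pdf.ocr]
enabled = true            # OCR is opt-in
model = "qwen2.5vl:3b"    # vision LM used for scanned PDF pages
```

Scanned PDFs already indexed under v0.19 would still need an explicit
`kebab ingest --force` after this config is enabled.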

J4 — release notes draft (source for the Step 11 K1 commit body):
- Recorded in the result file (4 topics: opt-in + force-reingest + DCTDecode-only +
  family asymmetry, plus dogfood/test results).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§6.4)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 10 J0-J4)
prior: 1d4e301 (Step 9 + Cargo.lock follow-up)
contract: §9

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 10:34:24 +00:00
parent 1d4e301e5e
commit 90726ab283
4 changed files with 58 additions and 4 deletions


@@ -19,9 +19,10 @@ Cargo workspace, function-call-based modular monolith. UI binary (`kebab-
| embedding | `fastembed-rs` (`multilingual-e5-small`, 384d) |
| LLM | Ollama HTTP (default `gemma4:e4b`, same family as OCR / caption; users can override with a larger variant such as `gemma4:26b`) |
| Speech ASR | `whisper.cpp` (via `whisper-rs`), P8 deferred until after a system-dep brainstorm |
-| OCR | Ollama vision LM (default `gemma4:e4b`); the `OcrEngine` trait allows a future swap to Tesseract / Apple Vision, etc. (HOTFIXES P6-2) |
+| OCR (image) | Ollama vision LM (default `gemma4:e4b`); the `OcrEngine` trait allows a future swap to Tesseract / Apple Vision, etc. (HOTFIXES P6-2) |
+| OCR (PDF, v0.20.0+) | Ollama vision LM (default `qwen2.5vl:3b`), post-extract enrichment via `kebab-app::pdf_ocr_apply` (H-1 resolution). DCTDecode-only v1 (FlateDecode/CCITTFax skip + warning). Family asymmetry vs image OCR: PoC alnum 94.79% (qwen2.5vl) >> 27% (gemma4:e4b, batchim/final-consonant); at this stage only PDF OCR uses qwen2.5vl. |
| Image caption | Ollama vision LM, runtime gate `image.caption.enabled` (default OFF) |
-| PDF parser | `lopdf` per-page text; `chunker_version = "pdf-page-v1"` is hardcoded for PDF assets (HOTFIXES P7-3) |
+| PDF parser | `lopdf` per-page text + scanned-page image extract (`page_image::extract_dctdecode_page_image`, v0.20.0). `chunker_version = "pdf-page-v1"` is hardcoded (HOTFIXES P7-3). `parser_version = "pdf-text-v1"` is preserved (even after v0.20 OCR); OCR use is distinguished via a provenance event. A force-reingest is needed to reprocess scanned PDFs indexed under v0.19. |
| code parser | `tree-sitter` + `tree-sitter-rust` / `tree-sitter-python` / `tree-sitter-typescript` / `tree-sitter-javascript` / `tree-sitter-go` / `tree-sitter-java` / `tree-sitter-kotlin-ng`, **parser-side** (`kebab-parse-code`), not chunker-side (design §6.3). chunker versions: Rust = `code-rust-ast-v1`, Python = `code-python-ast-v1`, TypeScript = `code-ts-ast-v1`, JavaScript = `code-js-ast-v1`, Go = `code-go-ast-v1`, Java = `code-java-ast-v1`, Kotlin = `code-kotlin-ast-v1`. `ast_chunk_max_lines = 200` is fixed as a constant (HOTFIXES 2026-05-19: the Chunker trait does not expose per-medium config). The Kotlin grammar uses `tree-sitter-kotlin-ng`; bare `tree-sitter-kotlin` is pinned to tree-sitter 0.210.23 and cannot be used. **Tier 2 (p10-2)**: YAML/k8s → `serde_yaml` + `k8s-manifest-resource-v1` (apiVersion+kind per resource), Dockerfile → `dockerfile-file-v1` (whole-file), Cargo.toml/go.mod/.json/.xml/.groovy → `manifest-file-v1` (whole-file). Tier 2 chunkers live in `kebab-chunk`; no tree-sitter grammar needed (structure from file type, not AST). **Tier 3 (p10-3)**: shell scripts (`.sh`/`.bash`/`.zsh`) direct → `code-text-paragraph-v1` (blank-line paragraph segmentation + 80-line / 20-overlap line-window for oversize). The same chunker also serves as the fallback when Tier 1/2 emit 0 chunks or Err, so non-k8s YAML, invalid YAML, and AST extractor failures are all picked up. symbol = None; lang preserved from input doc. **Tier 1 family complete (p10-1D)**: C (`tree-sitter-c`, `code-c-ast-v1`, `.c`/`.h`) + C++ (`tree-sitter-cpp`, `code-cpp-ast-v1`, `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx`). C symbol = function name only; C++ symbol = `namespace::Class::method` (recursive nesting). If a `.h` file contains C++ syntax, the tree-sitter-c parse fails → Tier 3 fallback. |
| 1B symbol path | workspace path → module path: Python = dotted prefix (`kebab_eval.metrics.compute_mrr`), TypeScript/JavaScript = slash-style prefix (`src/Foo.Foo.search`). Rust 1A-2 uses file-scope nesting only (no workspace prefix; inconsistency accepted, HOTFIXES 2026-05-20). |
| TUI | Ratatui + crossterm: P9-1 Library panel, P9-2/3/4 planned |