docs(v0.20): sync README + HANDOFF + ARCHITECTURE + SMOKE for scanned PDF OCR (post-extract enrichment, qwen2.5vl:3b, DCTDecode-only v1)
Step 10 (Group J) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.
J0 - release notes path decision: commit body (no RELEASE_NOTES.md or
docs/RELEASE_NOTES_*.md exists, so this follows the v0.17.x/v0.18.0 pattern
of release notes in the commit body). Inlined in the Step 11 K1 commit body.
J1 - README.md:
- Add `[pdf.ocr]` to the toml table list in the Configuration section.
- New sub-section `### [pdf.ocr] — scanned PDF OCR (v0.20.0+)`: 11-field
toml example + 11 `KEBAB_PDF_OCR_*` env overrides + force-reingest UX
("scanned PDFs indexed under v0.19 do not get OCR automatically after
upgrading to v0.20; `kebab ingest --force` is required").
J2 - HANDOFF.md:
- Expand the phase status P7 row: 3/3 components + post-extract OCR enrichment
(v0.20.0 sub-item 1, qwen2.5vl:3b vision LLM).
- "머지 후 발견된 결정" (decisions discovered after merge) entry: design + scope of
v0.20 sub-item 1 (H-1 post-extract pattern + DCTDecode-only v1 +
parser_version preserved + H-4 UX).
J3 - docs/ARCHITECTURE.md:
- Split the OCR row: `OCR (image)` (gemma4:e4b, unchanged) + `OCR (PDF, v0.20.0+)`
(qwen2.5vl:3b, post-extract enrichment via kebab-app::pdf_ocr_apply,
DCTDecode-only v1, family asymmetry: PoC alnum 94.79% vs gemma4 27%).
- Expand the PDF parser row: page_image::extract_dctdecode_page_image (v0.20.0) +
parser_version "pdf-text-v1" preserved + differentiated provenance events.
J3 - docs/SMOKE.md:
- Isolated `[pdf.ocr]` config example (enabled=true, model=qwen2.5vl:3b).
- New dogfood section `### v0.20 force-reingest (scanned PDF OCR)`:
explicit `kebab ingest --force` invocation for the v0.19 → v0.20 upgrade path.
J4 - release notes draft (source for the Step 11 K1 commit body):
- Recorded in the result file (4 topics: opt-in + force-reingest + DCTDecode-only +
family asymmetry + dogfood/test results).
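
As a rough illustration of the `KEBAB_PDF_OCR_*` overrides mentioned under J1, the sketch below derives one env name per field. It assumes the 11 fields are the ones in the docs/SMOKE.md `[pdf.ocr]` example and that each override is `KEBAB_PDF_OCR_` plus the upper-cased field name. Only `KEBAB_PDF_OCR_ENABLED` is confirmed by this commit, and `pdf_ocr_env_name` is a hypothetical helper, not part of kebab.

```shell
#!/usr/bin/env bash
# Sketch only: map a [pdf.ocr] field name to its presumed env override.
# ASSUMPTION: override name = KEBAB_PDF_OCR_ + upper-cased field name.
# Only KEBAB_PDF_OCR_ENABLED appears in this commit; the rest are inferred.
pdf_ocr_env_name() {
  printf 'KEBAB_PDF_OCR_%s\n' "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')"
}

# The 11 fields shown in the docs/SMOKE.md [pdf.ocr] example.
for field in enabled always_on engine model endpoint languages max_pixels \
             request_timeout_secs valid_ratio_threshold min_char_count lang_hint; do
  pdf_ocr_env_name "$field"
done
```

The README sub-section remains the authoritative list; check the inferred names against it before relying on them.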
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§6.4)
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 10 J0-J4)
prior: 1d4e301 (Step 9 + Cargo.lock follow-up)
contract: §9
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -326,6 +326,19 @@ max_pixels = 1600 # long-edge cap
enabled = true # generates a one-sentence objective description via the vision LM
max_pixels = 768
prompt_template_version = "caption-v1"

[pdf.ocr]
enabled = true # enables the OCR path for the smoke test (manual invocation)
always_on = false
engine = "ollama-vision"
model = "qwen2.5vl:3b"
# endpoint = "http://192.168.0.47:11434" # user's dogfood host
languages = ["eng", "kor"]
max_pixels = 2048
request_timeout_secs = 600
valid_ratio_threshold = 0.5
min_char_count = 20
lang_hint = "kor"
```

Each image asset costs 1 OCR call + 1 caption call, roughly 3-6 s (with `gemma4:e4b`). Recommended for workspaces dominated by diagrams, camera photos, and screenshots. Books and scans go through the P7 PDF line.
@@ -716,4 +729,20 @@ kebab --config /tmp/kebab-smoke/config.toml ingest
kebab --config /tmp/kebab-smoke/config.toml eval run
```

### v0.20 force-reingest (scanned PDF OCR)

Scanned PDFs (book scans, etc.) indexed by the v0.19 binary do not enter the OCR path after upgrading to v0.20: `parser_version = "pdf-text-v1"` is preserved, so `try_skip_unchanged` returns Unchanged. Force it explicitly:

```bash
# Under v0.19, a scanned PDF is indexed as empty blocks + a "scanned candidate" warning:
KEBAB_PDF_OCR_ENABLED=false kebab --config /tmp/kebab-smoke/config.toml ingest

# After upgrading to the v0.20 binary, enable OCR (config update or env) + force-reingest:
KEBAB_PDF_OCR_ENABLED=true kebab --config /tmp/kebab-smoke/config.toml ingest --force-reingest

# Result: the previously empty blocks are replaced with OCR text blocks, an
# OcrApplied event is added to provenance.events for each page, and ingest_progress
# emits pdf_ocr_started/finished to stderr.
```

See [tasks/HOTFIXES.md](../tasks/HOTFIXES.md) for the detailed history and the bugs found.
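
One way to sanity-check the force-reingest step is to capture stderr and count the `pdf_ocr_started` / `pdf_ocr_finished` progress events named in the SMOKE.md comments. The sketch below assumes each event is printed on its own line containing its literal name and that every started event is eventually matched by a finished one. The actual stderr format is not shown in this commit, and `count_event` and `ocr_events_balanced` are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Sketch only: count OCR progress events in captured ingest stderr.
# ASSUMPTION: one event per line, each line containing the literal event name.

# $1 = event name; reads captured stderr on stdin and prints the match count.
count_event() {
  grep -c -- "$1" || true   # grep -c prints 0 but exits 1 when nothing matches
}

# $1 = path to a captured stderr file.
# Succeeds when at least one OCR started and started == finished.
ocr_events_balanced() {
  local log="$1" started finished
  started=$(count_event pdf_ocr_started < "$log")
  finished=$(count_event pdf_ocr_finished < "$log")
  [ "$started" -gt 0 ] && [ "$started" -eq "$finished" ]
}

# Usage (not executed here, needs the kebab binary and the smoke config):
#   KEBAB_PDF_OCR_ENABLED=true kebab --config /tmp/kebab-smoke/config.toml \
#     ingest --force-reingest 2> /tmp/kebab-smoke/ingest.stderr
#   ocr_events_balanced /tmp/kebab-smoke/ingest.stderr && echo "OCR ran"
```

A zero count after a forced reingest would suggest the OCR path was not entered, for example because `[pdf.ocr]` is still disabled in the config in use.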