chore(release): bump version 0.19.0 → 0.20.0 (v0.20.0 sub-item 1: scanned PDF OCR)

# v0.20.0: scanned PDF OCR via Ollama vision LLM

The core change in v0.20.0 is OCR ingest for scanned PDFs that have no embedded text (book scans, receipts, camera-captured pages). In the PoC comparison of five engines (Tesseract / EasyOCR / PaddleOCR / gemma4:e4b / qwen2.5vl:3b), qwen2.5vl:3b reached alnum scores of 94.79% (page1) and 81.56% (받침, i.e. final-consonant-heavy text), ahead of every other engine, so it is the default vision OCR model for this release.

## 1. OCR opt-in usage

Enable OCR with `enabled = true` in the `[pdf.ocr]` config section, or with the `KEBAB_PDF_OCR_ENABLED=true` env variable. It is off by default: at 45-100 s per OCR page (qwen2.5vl:3b on CPU, remote Ollama), the cost is unsuitable for non-OCR KBs, i.e. anything other than a book archive.

```toml
[pdf.ocr]
enabled = true
model = "qwen2.5vl:3b"  # see README for the other defaults
```

Pulling qwen2.5vl:3b with Ollama:

```bash
ollama pull qwen2.5vl:3b  # 3GB Ollama image
```

## 2. Force-reingest of scanned PDFs indexed under v0.19

A KB whose scanned PDFs were ingested with the v0.19 binary does not enter the OCR path automatically, because parser_version "pdf-text-v1" is preserved (decision H-4, which avoids triggering the CLAUDE.md §Versioning cascade). Upgrading to the v0.20 binary and setting config `pdf.ocr.enabled = true` is therefore not enough on its own: the Unchanged path of try_skip_unchanged skips OCR execution. To reprocess explicitly:

```bash
kebab ingest --root /path/to/kb --force
```

## 3. DCTDecode-only v1 scope (handling FlateDecode / CCITTFax pages)

PDF page image extraction in v0.20.0 covers only lopdf image XObjects whose /Filter == DCTDecode (JPEG passthrough). For other encodings (FlateDecode raw pixels, CCITTFaxDecode bilevel, JPXDecode JPEG 2000) a warning event is emitted and the page is skipped. If some pages of a scanned PDF are encoded with FlateDecode or CCITTFax:

```bash
qpdf --object-streams=disable --recompress-flate input.pdf normalized.pdf
```

The intent of v1 is to keep the single-binary principle (no image crate introduced). Multi-filter support will be considered in v1.1+ or in a separate sub-item.

## 4. Family asymmetry (image OCR gemma4:e4b vs PDF OCR qwen2.5vl:3b)

The image OCR (P6) default stays gemma4:e4b (no change). Only PDF OCR (v0.20) uses qwen2.5vl:3b. Users can unify the two by setting [image.ocr] model = "qwen2.5vl:3b", but the defaults remain family-asymmetric.

## Dogfood + test results

- workspace test: 178 result lines, 0 failures.
- workspace clippy (-D warnings): exit 0.
- alnum e2e (real Ollama, manual invoke):
  - F1 (Korean page1): 94.79% (≥ 0.85 threshold).
  - F2 (받침-intensive): 81.56% (≥ 0.70 threshold).
- integration smoke + vector PDF regression: pass.

## Changed surface

- new config: [pdf.ocr] (11 fields) + 11 env overrides KEBAB_PDF_OCR_*.
- new wire: IngestEvent::PdfOcrStarted/Finished (additive minor).
- new wire: IngestItem.pdf_ocr_pages/ms_total (additive minor).
- new CLI line: "📷 OCR page N..." / "✓ OCR page N (chars chars, msms via ollama-vision)".
- new module: kebab-parse-pdf::{page_image, text_quality} + kebab-app::pdf_ocr_apply.
- dep: workspace lopdf = "0.32" unified.
- fixture: 5 PDFs (F1/F2/F4/F6/F7) under crates/kebab-parse-pdf/tests/fixtures/.

## Unchanged surface (invariants)

- Extractor::extract trait body byte-identical (PR #187).
- PdfTextExtractor body: no change; OCR is separated out via the post-extract enrichment pattern.
- parser_version "pdf-text-v1" preserved.
- chunker_version "pdf-page-v1" preserved.
- production dep graph of workspace.dependencies: no change (-e normal baseline preserved).

## 11 commit history of the sub-item

- 9d7faab Step 1: foundation + cargo tree baselines
- aeeff36 Step 2: lopdf /Filter probe + 5 fixture commit (F1/F2/F4/F6/F7)
- fb3952d Step 2 fix: F7 conversion engine record correction
- c2cd3a7 Step 3: page_image + text_quality modules (10 test)
- 8d81bc1 Step 3 fix: clippy pedantic in page_image
- 9f003ef Step 4: pdf_ocr_apply helper (10 test, F7 split + cancel)
- fd918a6 Step 5: [pdf.ocr] config section + PdfOcrOpts doc
- 4672cba Step 5 fix: clippy::bool_assert_comparison in pdf_ocr tests
- b9ee09f Step 6: wire PDF OCR enrichment + cancel propagation
- 4c5ccd5 Step 7: wire schema additive (IngestEvent + IngestItem + skipped)
- c9e0594 Step 8: CLI printer activation + ingest_progress test + spec literal
- 4819768 Step 9: integration smoke + vector regression + alnum e2e
- 1d4e301 Step 9 follow-up: Cargo.lock for dev-dep additions
- 90726ab Step 10: docs sync (README + HANDOFF + ARCHITECTURE + SMOKE)

## § Acceptance §9 verifier evidence

All 15 scriptable verifier rows of K5 are green (or, for the manual real-Ollama rows, the result is reported):

- Row #4 (vector PDF byte-identical): pass.
- Row #5 (Extractor::extract trait byte-identical): 0 line diff.
- Row #6 (wire schema additive): jq + diff exit 0.
- Row #7-#8 (clippy / workspace test): exit 0.
- Row #9-#10 (dep graph baseline -e normal): empty diff.
- Row #11 (docs sync): grep evidence.
- Row #12 (version bump): "0.20.0" + Cargo.lock cascade ≥ 22.
- Row #14 (PR #187 invariant): extract_for(&asset.media_type) ≥ 1.
- Row #15 (DCTDecode-only v1, F6/F7 skip): test green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
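
The DCTDecode-only scope described in §3 comes down to a per-image decision on the /Filter name. The following is a minimal sketch of that decision, not the actual kebab-parse-pdf::page_image code: `PageImageDecision` and `classify_filter` are hypothetical names chosen for illustration, and the filter name is assumed to have already been read out of the PDF object as a string.

```rust
/// Hypothetical outcome of probing one page image's /Filter entry.
#[derive(Debug, PartialEq, Eq)]
pub enum PageImageDecision {
    /// DCTDecode: the stream bytes are already a JPEG, so pass them through.
    JpegPassthrough,
    /// Any other filter: out of v1 scope, so warn and skip the page.
    Skip { filter: String },
}

/// Classify a PDF /Filter name (assumed to be extracted already).
/// Only the exact name "DCTDecode" is treated as passthrough.
pub fn classify_filter(filter: &str) -> PageImageDecision {
    match filter {
        "DCTDecode" => PageImageDecision::JpegPassthrough,
        other => PageImageDecision::Skip {
            filter: other.to_string(),
        },
    }
}
```

Under these assumptions, `classify_filter("DCTDecode")` yields `JpegPassthrough`, while FlateDecode, CCITTFaxDecode and JPXDecode all yield `Skip` carrying the filter name, which a caller could put into the warning event.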
Cargo.lock (generated): 44 lines changed
@@ -4127,7 +4127,7 @@ dependencies = [
 
 [[package]]
 name = "kebab-app"
-version = "0.19.0"
+version = "0.20.0"
 dependencies = [
  "anyhow",
  "base64 0.22.1",
@@ -4171,7 +4171,7 @@ dependencies = [
 
 [[package]]
 name = "kebab-chunk"
-version = "0.19.0"
+version = "0.20.0"
 dependencies = [
  "anyhow",
  "blake3",
@@ -4187,7 +4187,7 @@ dependencies = [
 
 [[package]]
 name = "kebab-cli"
-version = "0.19.0"
+version = "0.20.0"
 dependencies = [
  "anyhow",
  "clap",
@@ -4208,7 +4208,7 @@ dependencies = [
 
 [[package]]
 name = "kebab-config"
-version = "0.19.0"
+version = "0.20.0"
 dependencies = [
  "anyhow",
  "dirs 5.0.1",
@@ -4223,7 +4223,7 @@ dependencies = [
 
 [[package]]
 name = "kebab-core"
-version = "0.19.0"
+version = "0.20.0"
 dependencies = [
  "anyhow",
  "blake3",
@@ -4237,7 +4237,7 @@ dependencies = [
 
 [[package]]
 name = "kebab-embed"
-version = "0.19.0"
+version = "0.20.0"
 dependencies = [
  "anyhow",
  "blake3",
@@ -4251,7 +4251,7 @@ dependencies = [
 
 [[package]]
 name = "kebab-embed-local"
-version = "0.19.0"
+version = "0.20.0"
 dependencies = [
  "anyhow",
  "fastembed",
@@ -4264,7 +4264,7 @@ dependencies = [
 
 [[package]]
 name = "kebab-eval"
-version = "0.19.0"
+version = "0.20.0"
 dependencies = [
  "anyhow",
  "kebab-app",
@@ -4283,7 +4283,7 @@ dependencies = [
 
 [[package]]
 name = "kebab-llm"
-version = "0.19.0"
+version = "0.20.0"
 dependencies = [
  "anyhow",
  "kebab-core",
@@ -4292,7 +4292,7 @@ dependencies = [
 
 [[package]]
 name = "kebab-llm-local"
-version = "0.19.0"
+version = "0.20.0"
 dependencies = [
  "anyhow",
  "kebab-config",
@@ -4309,7 +4309,7 @@ dependencies = [
 
 [[package]]
 name = "kebab-mcp"
-version = "0.19.0"
+version = "0.20.0"
 dependencies = [
  "anyhow",
  "kebab-app",
@@ -4327,7 +4327,7 @@ dependencies = [
 
 [[package]]
 name = "kebab-nli"
-version = "0.19.0"
+version = "0.20.0"
 dependencies = [
  "anyhow",
  "hf-hub",
@@ -4342,7 +4342,7 @@ dependencies = [
 
 [[package]]
 name = "kebab-parse-code"
-version = "0.19.0"
+version = "0.20.0"
 dependencies = [
  "anyhow",
  "gix",
@@ -4365,7 +4365,7 @@ dependencies = [
 
 [[package]]
 name = "kebab-parse-image"
-version = "0.19.0"
+version = "0.20.0"
 dependencies = [
  "ab_glyph",
  "anyhow",
@@ -4389,7 +4389,7 @@ dependencies = [
 
 [[package]]
 name = "kebab-parse-md"
-version = "0.19.0"
+version = "0.20.0"
 dependencies = [
  "anyhow",
  "kebab-core",
@@ -4406,7 +4406,7 @@ dependencies = [
 
 [[package]]
 name = "kebab-parse-pdf"
-version = "0.19.0"
+version = "0.20.0"
 dependencies = [
  "anyhow",
  "blake3",
@@ -4421,7 +4421,7 @@ dependencies = [
 
 [[package]]
 name = "kebab-rag"
-version = "0.19.0"
+version = "0.20.0"
 dependencies = [
  "anyhow",
  "blake3",
@@ -4443,7 +4443,7 @@ dependencies = [
 
 [[package]]
 name = "kebab-search"
-version = "0.19.0"
+version = "0.20.0"
 dependencies = [
  "anyhow",
  "globset",
@@ -4462,7 +4462,7 @@ dependencies = [
 
 [[package]]
 name = "kebab-source-fs"
-version = "0.19.0"
+version = "0.20.0"
 dependencies = [
  "anyhow",
  "blake3",
@@ -4480,7 +4480,7 @@ dependencies = [
 
 [[package]]
 name = "kebab-store-sqlite"
-version = "0.19.0"
+version = "0.20.0"
 dependencies = [
  "anyhow",
  "blake3",
@@ -4500,7 +4500,7 @@ dependencies = [
 
 [[package]]
 name = "kebab-store-vector"
-version = "0.19.0"
+version = "0.20.0"
 dependencies = [
  "anyhow",
  "arrow",
@@ -4524,7 +4524,7 @@ dependencies = [
 
 [[package]]
 name = "kebab-tui"
-version = "0.19.0"
+version = "0.20.0"
 dependencies = [
  "anyhow",
  "crossterm",