kebab/Cargo.toml
altair823 b4d9e60816 chore(release): bump version 0.19.0 → 0.20.0 — v0.20.0 sub-item 1 scanned PDF OCR
# v0.20.0 — scanned PDF OCR via Ollama vision LLM

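The core change in v0.20.0 is OCR ingest for scanned PDFs that have no
embedded text (book scans, receipts, camera-captured pages). In the PoC
comparison of five engines (Tesseract / EasyOCR / PaddleOCR / gemma4:e4b /
qwen2.5vl:3b), qwen2.5vl:3b scored 94.79% alnum accuracy on page1 and 81.56%
on the 받침 (final-consonant) fixture, beating every other engine, so it is
the default vision OCR model in this release.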

## 1. OCR opt-in usage

Enable it with `enabled = true` under `[pdf.ocr]` in the config, or with the
`KEBAB_PDF_OCR_ENABLED=true` env variable. It is off by default: OCR costs
45-100s per page (qwen2.5vl:3b on CPU, remote Ollama), which does not suit
non-OCR KBs outside book archives.

```toml
[pdf.ocr]
enabled = true
model = "qwen2.5vl:3b"
# see README for the other defaults
```

Pull qwen2.5vl:3b with Ollama:

```bash
ollama pull qwen2.5vl:3b   # 3GB Ollama image
```
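The two opt-in routes above (config file and env variable) can be sketched as a single resolution step. This is a minimal sketch, not kebab's actual code: the helper name `resolve_ocr_enabled`, the accepted spellings (`true`/`1`, `false`/`0`), and the rule that a parseable env value overrides the config file are all assumptions.

```rust
/// Hypothetical helper: decide whether PDF OCR is enabled.
/// `config_enabled` mirrors `[pdf.ocr] enabled`; `env_value` is the raw
/// value of `KEBAB_PDF_OCR_ENABLED`, if the variable is set.
/// Assumption: a parseable env value takes precedence over the config file.
fn resolve_ocr_enabled(config_enabled: bool, env_value: Option<&str>) -> bool {
    let normalized = env_value.map(|v| v.trim().to_ascii_lowercase());
    match normalized.as_deref() {
        Some("true" | "1") => true,
        Some("false" | "0") => false,
        // Unset or unparseable: fall back to the config file value.
        _ => config_enabled,
    }
}

fn main() {
    // Config default is off; the env variable may switch it on.
    let env = std::env::var("KEBAB_PDF_OCR_ENABLED").ok();
    let enabled = resolve_ocr_enabled(false, env.as_deref());
    println!("pdf.ocr.enabled = {enabled}");
}
```

Keeping the resolution in one pure function makes the default-off behavior easy to test without touching the process environment.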

## 2. Force-reingest of scanned PDFs indexed under v0.19

A KB whose scanned PDFs were ingested with the v0.19 binary does not enter
the OCR path automatically, because parser_version "pdf-text-v1" is preserved
(a decision to avoid triggering the CLAUDE.md §Versioning cascade, H-4). As a
result, upgrading to the v0.20 binary and setting `pdf.ocr.enabled = true` is
not enough on its own: the Unchanged path of try_skip_unchanged skips OCR.
To reprocess explicitly:

```bash
kebab ingest --root /path/to/kb --force
```
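Why `--force` is needed can be illustrated with a simplified model of the skip decision. The real `try_skip_unchanged` is not shown here, so the types `IndexedRecord` and `IngestDecision`, the function `decide`, and the idea that the check compares a content hash plus parser_version are assumptions for illustration only. The point is that the OCR flag is not an input to the check.

```rust
#[derive(Debug, PartialEq, Eq)]
enum IngestDecision {
    Unchanged,
    Reprocess,
}

/// Hypothetical stand-in for the stored record of an already indexed asset.
struct IndexedRecord<'a> {
    content_hash: &'a str,
    parser_version: &'a str,
}

/// Simplified skip check. Note that `pdf.ocr.enabled` does not appear:
/// flipping it changes neither the file hash nor the parser version.
fn decide(
    record: Option<&IndexedRecord<'_>>,
    current_hash: &str,
    current_parser_version: &str,
    force: bool,
) -> IngestDecision {
    if force {
        return IngestDecision::Reprocess;
    }
    match record {
        Some(r)
            if r.content_hash == current_hash
                && r.parser_version == current_parser_version =>
        {
            IngestDecision::Unchanged
        }
        _ => IngestDecision::Reprocess,
    }
}

fn main() {
    // A scanned PDF indexed by v0.19: same bytes, same "pdf-text-v1".
    let rec = IndexedRecord { content_hash: "abc123", parser_version: "pdf-text-v1" };
    println!("{:?}", decide(Some(&rec), "abc123", "pdf-text-v1", false));
    println!("{:?}", decide(Some(&rec), "abc123", "pdf-text-v1", true));
}
```

Under these assumptions the first call takes the Unchanged path and never reaches OCR, while the forced call reprocesses the file.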

## 3. DCTDecode-only v1 scope (handling FlateDecode / CCITTFax pages)

PDF page image extraction in v0.20.0 covers only lopdf image XObjects whose
/Filter == DCTDecode (JPEG passthrough). For other encodings (FlateDecode raw
pixels, CCITTFaxDecode bilevel, JPXDecode JPEG2000) it emits a warning event
and skips the page.

If some pages of a scanned PDF are encoded with FlateDecode or CCITTFax:

```bash
qpdf --object-streams=disable --recompress-flate input.pdf normalized.pdf
```

The intent of v1 is to keep the single-binary principle (no image crate
introduced). Multi-filter support will be considered in v1.1+ or a separate
sub-item.
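The DCTDecode-only rule amounts to a dispatch on the image XObject's /Filter name. The sketch below models that dispatch with plain strings and byte slices instead of lopdf types; `PageImage` and `extract_page_image` are hypothetical names, and the warning is printed to stderr here rather than emitted as an ingest event.

```rust
#[derive(Debug, PartialEq, Eq)]
enum PageImage<'a> {
    /// DCTDecode stream bytes are already a JPEG, so they pass through as-is.
    Jpeg(&'a [u8]),
    /// Any other filter is out of v1 scope; the page is skipped.
    Skipped { filter: String },
}

/// Hypothetical dispatch on the /Filter name of a page's image XObject.
fn extract_page_image<'a>(filter: &str, stream: &'a [u8]) -> PageImage<'a> {
    match filter {
        "DCTDecode" => PageImage::Jpeg(stream),
        other => PageImage::Skipped { filter: other.to_string() },
    }
}

fn main() {
    let jpeg_bytes: &[u8] = &[0xFF, 0xD8, 0xFF, 0xE0]; // JPEG SOI + APP0 marker
    for filter in ["DCTDecode", "FlateDecode", "CCITTFaxDecode", "JPXDecode"] {
        match extract_page_image(filter, jpeg_bytes) {
            PageImage::Jpeg(bytes) => println!("{filter}: passthrough, {} bytes", bytes.len()),
            PageImage::Skipped { filter } => eprintln!("warning: {filter} unsupported, page skipped"),
        }
    }
}
```

Because the JPEG bytes are forwarded untouched, no decoder is needed, which is consistent with the stated goal of not adding an image crate.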

## 4. Family asymmetry (image OCR gemma4:e4b vs PDF OCR qwen2.5vl:3b)

The image OCR (P6) default stays gemma4:e4b (no change). Only PDF OCR (v0.20)
uses qwen2.5vl:3b. Users can unify the two by setting
`[image.ocr] model = "qwen2.5vl:3b"`, but the defaults keep the family
asymmetry.
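The asymmetric defaults and the per-section override can be modeled as below. `OcrFamily`, `default_model`, and `effective_model` are hypothetical names; only the two default model strings come from this release note.

```rust
#[derive(Clone, Copy, Debug)]
enum OcrFamily {
    Image, // [image.ocr]
    Pdf,   // [pdf.ocr]
}

/// Per-family defaults as described in this release note.
fn default_model(family: OcrFamily) -> &'static str {
    match family {
        OcrFamily::Image => "gemma4:e4b",
        OcrFamily::Pdf => "qwen2.5vl:3b",
    }
}

/// A configured `model` wins; otherwise the family default applies.
fn effective_model<'a>(family: OcrFamily, configured: Option<&'a str>) -> &'a str {
    configured.unwrap_or(default_model(family))
}

fn main() {
    // No overrides: the two families differ.
    println!("image: {}", effective_model(OcrFamily::Image, None));
    println!("pdf:   {}", effective_model(OcrFamily::Pdf, None));
    // User unifies both on qwen2.5vl:3b via [image.ocr] model.
    println!("image: {}", effective_model(OcrFamily::Image, Some("qwen2.5vl:3b")));
}
```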

## Dogfood + test results

- workspace test: 178 result lines, 0 failure.
- workspace clippy (-D warnings): exit 0.
- alnum e2e (real Ollama, manual invoke):
  - F1 (Korean page1): 94.79% (≥ 0.85 threshold).
  - F2 (받침 / final-consonant intensive): 81.56% (≥ 0.70 threshold).
- integration smoke + vector PDF regression: pass.

## Changed surface

- new config: [pdf.ocr] (11 fields) + 11 env overrides KEBAB_PDF_OCR_*.
- new wire: IngestEvent::PdfOcrStarted/Finished (additive minor).
- new wire: IngestItem.pdf_ocr_pages/ms_total (additive minor).
- new CLI line: "📷 OCR page N..." / "✓ OCR page N (chars chars, msms via ollama-vision)".
- new module: kebab-parse-pdf::{page_image, text_quality} + kebab-app::pdf_ocr_apply.
- dep: lopdf = "0.32" consolidated into the workspace dependencies.
- fixture: 5 PDFs (F1/F2/F4/F6/F7) under crates/kebab-parse-pdf/tests/fixtures/.

## Unchanged surface (invariants)

- Extractor::extract trait body byte-identical (PR #187).
- PdfTextExtractor body: no change. OCR is split out as a post-extract enrichment pattern.
- parser_version "pdf-text-v1" preserved.
- chunker_version "pdf-page-v1" preserved.
- production dep graph of workspace.dependencies: no change (-e normal baseline preserved).

## Sub-item commit history (14 commits)

9d7faab Step 1: foundation + cargo tree baselines
aeeff36 Step 2: lopdf /Filter probe + 5 fixture commit (F1/F2/F4/F6/F7)
fb3952d Step 2 fix: F7 conversion engine record correction
c2cd3a7 Step 3: page_image + text_quality modules (10 test)
8d81bc1 Step 3 fix: clippy pedantic in page_image
9f003ef Step 4: pdf_ocr_apply helper (10 test, F7 split + cancel)
fd918a6 Step 5: [pdf.ocr] config section + PdfOcrOpts doc
4672cba Step 5 fix: clippy::bool_assert_comparison in pdf_ocr tests
b9ee09f Step 6: wire PDF OCR enrichment + cancel propagation
4c5ccd5 Step 7: wire schema additive — IngestEvent + IngestItem + skipped
c9e0594 Step 8: CLI printer activation + ingest_progress test + spec literal
4819768 Step 9: integration smoke + vector regression + alnum e2e
1d4e301 Step 9 follow-up: Cargo.lock for dev-dep additions
90726ab Step 10: docs sync (README + HANDOFF + ARCHITECTURE + SMOKE)

## § Acceptance §9 verifier evidence

All 15 rows of the K5 scriptable verifier are green (or, for the manual real-Ollama rows, the results are reported):
- Row #4 (vector PDF byte-identical): pass.
- Row #5 (Extractor::extract trait byte-identical): 0 line diff.
- Row #6 (wire schema additive): jq + diff exit 0.
- Row #7-#8 (clippy / workspace test): exit 0.
- Row #9-#10 (dep graph baseline -e normal): empty diff.
- Row #11 (docs sync): grep evidence.
- Row #12 (version bump): "0.20.0" + Cargo.lock cascade ≥ 22.
- Row #14 (PR #187 invariant): extract_for(&asset.media_type) ≥ 1.
- Row #15 (DCTDecode-only v1, F6/F7 skip): test green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 11:03:44 +00:00


[workspace]
resolver = "3"
members = [
"crates/kebab-core",
"crates/kebab-config",
"crates/kebab-source-fs",
"crates/kebab-parse-md",
"crates/kebab-chunk",
"crates/kebab-store-sqlite",
"crates/kebab-store-vector",
"crates/kebab-search",
"crates/kebab-embed",
"crates/kebab-embed-local",
"crates/kebab-llm",
"crates/kebab-llm-local",
"crates/kebab-rag",
"crates/kebab-app",
"crates/kebab-cli",
"crates/kebab-eval",
"crates/kebab-parse-image",
"crates/kebab-parse-pdf",
"crates/kebab-tui",
"crates/kebab-mcp",
"crates/kebab-parse-code",
"crates/kebab-nli",
]
[workspace.package]
edition = "2024"
rust-version = "1.85"
license = "MIT OR Apache-2.0"
repository = "https://github.com/altair823/kebab"
version = "0.20.0" # v0.20.0 sub-item 1 (scanned PDF OCR via qwen2.5vl:3b); user dogfooding trigger per CLAUDE.md §Release
# pre-v0.18 workspace-wide cleanup: enable clippy::pedantic group with
# intentional allow-list. The allowed lints are either cosmetic (doc style),
# informational (function size), or carry intentional truncation we accept
# (numeric casts in tokenizer/ONNX inputs, hash modular reduction, etc).
[workspace.lints.clippy]
pedantic = { level = "warn", priority = -1 }
# Intentional u32 ↔ i64 casts in kebab-nli (ONNX i64 inputs from tokenizer u32 ids).
# u64 ↔ usize across kebab-store-sqlite row counts. Wide truncation is auditable
# at use site, not lint-wide.
cast_possible_truncation = "allow"
cast_possible_wrap = "allow"
cast_sign_loss = "allow"
cast_precision_loss = "allow"
# Doc markdown style is cosmetic; we run rustdoc on demand.
doc_markdown = "allow"
missing_errors_doc = "allow"
missing_panics_doc = "allow"
# Informational only — splitting a long pipeline function isn't always cleaner.
too_many_lines = "allow"
# `Foo::default()` is concise and idiomatic here; `<Foo as Default>::default()`
# adds noise without surfacing intent.
default_trait_access = "allow"
# Module name prefix on public items keeps the wire/log surface readable
# (`refusal_reason::no_chunks` etc).
module_name_repetitions = "allow"
# We use `#[must_use]` deliberately on public results, not blanket.
must_use_candidate = "allow"
# `String` arg sometimes signals "I'll consume this" — let signature decide.
needless_pass_by_value = "allow"
# Idiomatic single-line bindings stay; let-else expansion isn't always clearer.
manual_let_else = "allow"
# `use` after `let` is a common kebab pattern (scoped imports next to use site).
items_after_statements = "allow"
# Naming pairs like `chunk_id` / `chunks_id` are intentional domain terms.
similar_names = "allow"
# `iter.map(format!).collect::<String>()` is idiomatic when the per-element
# string is genuinely independent — `fold` only wins on accumulation patterns.
format_collect = "allow"
# Exhaustive `match` with explicit variant arms (vs `_`) catches future
# variant additions at compile time (kebab core's `RefusalReason` pattern).
match_wildcard_for_single_variants = "allow"
# Copy types under `&self` keep call-site discipline; auto-deref noise > tiny perf gain.
trivially_copy_pass_by_ref = "allow"
# `unnecessary_wraps` flags helpers that could drop `Result`, but keeping the
# Result allows future error variants without churning callers.
unnecessary_wraps = "allow"
# NLI score / RRF fusion / similarity threshold comparisons are intentional —
# floats live in the `[0, 1]` band and are compared with explicit thresholds.
float_cmp = "allow"
# File-extension dispatch is keyed on ASCII conventions; case sensitivity
# is part of the spec for `.md`, `.pdf`, etc.
case_sensitive_file_extension_comparisons = "allow"
# Config / opts structs intentionally bundle boolean flags (ingest options,
# search modes, etc) — splitting them into enums would obscure the wire shape.
struct_excessive_bools = "allow"
# `bytecount` crate would be a new dep just for one-off ASCII counts.
naive_bytecount = "allow"
# `#[ignore]` annotations on tests document via the test name + nearby comment.
ignore_without_reason = "allow"
# `format!` push patterns are a hot path for kebab-tui's progressive rendering;
# `write!` rewrite needs a verified-equal benchmark before swapping.
format_push_string = "allow"
# Builder-style `with_*` methods return `Self`; the existing `#[must_use]`
# discipline lives on aggregate constructors, not every chainable setter.
return_self_not_must_use = "allow"
# Match arms grouped by side-effect over body equality (e.g. snake_case wire
# label tables) — fanning them out keeps adding a new variant trivial.
match_same_arms = "allow"
# Remaining style-only warnings: trailing `continue` is sometimes clearer than
# rewriting, `_x` underscored bindings document intent at the use site, and
# `!(a == b)` reads better than `a != b` when paired with a complementary check.
needless_continue = "allow"
used_underscore_binding = "allow"
nonminimal_bool = "allow"
# Other one-off cosmetic items: large literal formatting, doc link quoting,
# `Clone::clone_from` swap, `str::replace` chaining, `Iterator::any` ergonomics.
unreadable_literal = "allow"
many_single_char_names = "allow"
doc_link_with_quotes = "allow"
assigning_clones = "allow"
collapsible_str_replace = "allow"
trivial_regex = "allow"
elidable_lifetime_names = "allow"
range_plus_one = "allow"
explicit_iter_loop = "allow"
implicit_hasher = "allow"
ref_option = "allow"
[workspace.dependencies]
anyhow = "1"
thiserror = "2"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
# Golden-fixture loader (P5-1, kebab-eval) parses YAML; pinned in the
# workspace so future eval-adjacent crates share the same major.
serde_yaml = "0.9"
time = { version = "0.3", features = ["serde", "macros", "formatting", "parsing"] }
uuid = { version = "1", features = ["v7", "serde"] }
blake3 = "1"
tracing = "0.1"
# `bundled` ships SQLite source so the workspace doesn't depend on a
# system libsqlite3 (matches the kebab-store-sqlite feature set).
rusqlite = { version = "0.32", features = ["bundled"] }
globset = "0.4"
tempfile = "3"
proptest = "1"
# p9-fb-19: LRU cache for `App::search` results. Bounded capacity
# from `config.search.cache_capacity` (default 256, ~1.3 MB cap).
lru = "0.12"
lopdf = "0.32"
# fastembed-rs ships ONNX runtime via the `ort-download-binaries` feature
# in its default set (which also pulls `hf-hub` for first-run model
# downloads). Pinned to the 4.x line per task p3-2 (current 5.x release
# remains untested for this workspace).
fastembed = "4.9"
# LanceDB embedded vector store (P3-3). 0.23.x pulls arrow / arrow-array /
# arrow-schema 56.x transitively (via lance 1.0); the kebab-store-vector
# crate matches that major to share the same Arrow types without a
# re-export adapter.
lancedb = { version = "0.23", default-features = false }
arrow = "56"
arrow-array = "56"
arrow-schema = "56"
tokio = { version = "1", features = ["rt", "macros"] }
futures = "0.3"
# Strict citation-marker extraction in kebab-rag (P4-3) needs a single regex
# pass; pulled into the workspace deps so future crates can share the
# same major.
regex = "1"
# MCP (Model Context Protocol) SDK. server + macros + transport-io provide
# stdio JSON-RPC transport for `kebab-mcp` (p9-fb-30). schemars feature
# exposes the derive macro used by tool input schemas.
rmcp = { version = "1.6", default-features = false, features = ["server", "macros", "transport-io", "schemars"] }
# Dev-only HTTP mock server for kebab-llm-local Ollama adapter tests. Requires
# a tokio runtime to host its mock server (the runtime adapter crate stays
# sync via reqwest::blocking — wiremock is dev-only there).
wiremock = "0.6"
base64 = "0.22"
# Pure-Rust git library for repo metadata detection (kebab-parse-code).
# No `git` binary required. Default features include thread-safety + most
# object-reading capabilities needed for HEAD name + commit SHA queries.
gix = { version = "0.70", default-features = false, features = ["revision"] }
# Rust source parsing for code ingest (kebab-parse-code, p10-1A-2). The
# chunker stays tree-sitter-free — AST work is parser-side per design §6.3.
tree-sitter = "0.26"
tree-sitter-rust = "0.24"
# Python / TS / JS grammars for code ingest (kebab-parse-code, p10-1B).
tree-sitter-python = "0.25.0"
tree-sitter-typescript = "0.23.2"
tree-sitter-javascript = "0.25.0"
# Go grammar for code ingest (kebab-parse-code, p10-1C-Go).
tree-sitter-go = "0.25.0"
# JVM family grammars for code ingest (kebab-parse-code, p10-1C-JK).
tree-sitter-java = "0.23.5"
tree-sitter-kotlin-ng = "1.1.0" # bare tree-sitter-kotlin requires ts <0.23; -ng uses tree-sitter-language 0.1 (ts 0.26 compat)
# C/C++ family grammars for code ingest (kebab-parse-code, p10-1D).
tree-sitter-c = "0.24.2"
tree-sitter-cpp = "0.23.4"
# fb-41 PR-9 (kebab-nli): mDeBERTa-v3 XNLI verifier deps. Versions match
# the fastembed 4.9 transitive set so the ONNX Runtime + tokenizer stack
# stays single-versioned across the workspace. ort `default-features=false`
# drops the bundled binary downloader (fastembed already provides one);
# tokenizers `default-features=false, onig` swaps the default `esaxx` regex
# backend for `onig` so the build doesn't need libstdc++ headers (verified
# via PR-9a pre-flight: SentencePiece tokenizer.json loads + KR/EN encode).
# hf-hub uses `ureq + rustls-tls` to stay aligned with kebab-embed-local's
# pure-Rust TLS stack.
ort = { version = "=2.0.0-rc.9", default-features = false, features = ["ndarray"] }
tokenizers = { version = "0.21", default-features = false, features = ["onig"] }
hf-hub = { version = "0.4", default-features = false, features = ["ureq", "rustls-tls"] }
ndarray = "0.16"
# Disk-footprint trim for dev / test builds. Codegen, opt-level, and
# behavior are unchanged — only DWARF debug info is reduced (line
# numbers kept, column numbers dropped) and split into separate
# `.dwo` files. backtrace stays useful (function + line). release
# profile is untouched, so CI / `--release` runs are byte-identical
# to upstream defaults.
[profile.dev]
debug = "line-tables-only"
split-debuginfo = "unpacked"
[profile.test]
debug = "line-tables-only"
split-debuginfo = "unpacked"