Step 9 (Group I) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.
I3 — crates/kebab-app/tests/ingest_pdf_ocr_smoke.rs (신규):
- ingest_with_mock_ocr_yields_pdf_ocr_summary — `#[ignore]` real Ollama,
ingest_with_config production path + IngestItem.pdf_ocr_pages verify.
- ocr_text_indexed_and_searchable — `#[ignore]` real Ollama, app.search
의 OCR text indexed verify (§ Acceptance #2).
- ingest_with_cancel_aborts_mid_pdf — production cancel chain (pre-set
cancel=true + dummy endpoint, no panic/deadlock verify).
I4 — crates/kebab-parse-pdf/tests/text_extractor_regression.rs (신규):
- vector_pdf_extract_byte_identical_to_baseline — F4 mojibake.pdf 의 vector
PDF path canonical 의 byte-identical 보존 (Step 1-8 모든 변경 전후 invariant).
- baseline 신규 = tests/snapshots/vector_pdf_canonical.json (first run create).
- normalize_provenance_timestamps inline helper (R-3 mitigation, workspace
전체 부재 — 신규 12-line).
I5 — crates/kebab-parse-pdf/tests/ocr_e2e.rs (신규):
- f1_alnum_accuracy_ge_85 / f2_alnum_accuracy_ge_70 — `#[ignore]` real
Ollama qwen2.5vl:3b, § Acceptance §9 #3 의 implementation.
- alnum metric = strsim::levenshtein (dev-dep 추가).
- truth file copy from PoC scratch (page1.txt + page2-batchim.txt) →
scanned_page1_truth.txt + scanned_page2_truth.txt.
- kebab-parse-image dev-dep 추가 (OllamaVisionOcr::from_parts 호출용).
parser isolation invariant 의 dev-dep exception (spec §3.1, dep graph
baseline -e normal 보존).
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 9 I3+I4+I5)
prior: c9e0594 (Step 8 CLI printer)
contract: §9
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
33 lines
1.2 KiB
TOML
33 lines
1.2 KiB
TOML
[package]
|
|
name = "kebab-parse-pdf"
|
|
version = { workspace = true }
|
|
edition = { workspace = true }
|
|
rust-version = { workspace = true }
|
|
license = { workspace = true }
|
|
repository = { workspace = true }
|
|
description = "Text PDF extractor + scanned-page image extract helpers for the kebab pipeline (P7-1 + v0.20.0 sub-item 1)"
|
|
|
|
[dependencies]
|
|
kebab-core = { path = "../kebab-core" }
|
|
anyhow = { workspace = true }
|
|
serde_json = { workspace = true }
|
|
time = { workspace = true }
|
|
tracing = { workspace = true }
|
|
# Per-page text extraction. `lopdf::Document::extract_text(&[page])`
|
|
# is the only stable per-page API across the pdf-extract / lopdf
|
|
# pair (pdf-extract 0.7 still exposes only whole-document calls).
|
|
# pdf-extract is intentionally NOT pulled in here — its ~150 transitive
|
|
# crates (pom, postscript, type1-encoding-parser, …) buy us nothing
|
|
# at v1 (we don't call its whole-doc API), and the future scanned-PDF
|
|
# OCR fallback can re-add it when it actually needs it.
|
|
lopdf = { workspace = true }
|
|
|
|
[dev-dependencies]
|
|
anyhow = { workspace = true }
|
|
blake3 = { workspace = true }
|
|
kebab-parse-image = { path = "../kebab-parse-image" }
|
|
strsim = "0.11"
|
|
|
|
[lints]
|
|
workspace = true
|