Files
kebab/crates/kebab-parse-pdf/Cargo.toml
altair823 c2cd3a7ab7 feat(parse-pdf): add page_image (DCTDecode passthrough, 2 test) + text_quality (valid char ratio, 8 unit test) modules
Step 3 (Group C) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

C1 — `page_image::extract_dctdecode_page_image(pdf_doc, page_num)` ->
Result<Option<Vec<u8>>>. lopdf 의 Resources/XObject traverse, 첫 image
XObject 의 /Filter 검사 (single Name OR Array form 모두 cover, spec §4.1
line 642-664), DCTDecode + JPEG magic 검증 통과 시 raw bytes 반환. 다른
encoding 또는 image XObject 부재 시 Ok(None). v1 scope = DCTDecode
passthrough only (H-3 invariant, image crate 도입 0).

Integration test (`tests/page_image.rs`, 2 test):
- f1_fixture_yields_dctdecode_jpeg_bytes — F1 fixture happy path.
- flate_raw_fixture_yields_none — F6 fixture negative path.

C2 — `text_quality::compute_valid_char_ratio(s) -> f32`. valid char =
ASCII printable + Hangul (Jamo/Compatibility/Syllables) + CJK + Latin
Extended + common Korean punctuation. 빈 string → 0.0. caller
(`kebab-app::pdf_ocr_apply`) 가 threshold 와 비교 (default 0.5).

Unit test (`mod tests`, 7 + F4 conditional):
- empty / pure ASCII / pure Hangul / pure PUA / mixed half / CJK / Hangul Jamo.
- f4_fixture_ratio_under_threshold: active (case A — lopdf extract_text 가
  ToUnicode CMap 부재 시 빈 string 반환 → valid_ratio = 0.0000 < 0.3).

Also: Cargo.toml description 갱신 ("Text PDF extractor + scanned-page
image extract helpers ...", Step 1 A2 이연분).

fixture fix: mojibake.pdf 의 startxref 22130 → 22114 (16-byte offset 오차
수정 — lopdf strict parser 가 xref 를 찾지 못하는 버그 해결).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.1 line 600-722)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 3 C1+C2)
prior: aeeff36 (Step 2 fixtures) + fb3952d (Step 2 F7 record fix)
contract: §9 (additive minor wire bump — 후속 step)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 05:59:10 +00:00

30 lines
1.1 KiB
TOML

[package]
name = "kebab-parse-pdf"
version = { workspace = true }
edition = { workspace = true }
rust-version = { workspace = true }
license = { workspace = true }
repository = { workspace = true }
description = "Text PDF extractor + scanned-page image extract helpers for the kebab pipeline (P7-1 + v0.20.0 sub-item 1)"
[dependencies]
kebab-core = { path = "../kebab-core" }
anyhow = { workspace = true }
serde_json = { workspace = true }
time = { workspace = true }
tracing = { workspace = true }
# Per-page text extraction. `lopdf::Document::extract_text(&[page])`
# is the only stable per-page API across the pdf-extract / lopdf
# pair (pdf-extract 0.7 still exposes only whole-document calls).
# pdf-extract is intentionally NOT pulled in here — its ~150 transitive
# crates (pom, postscript, type1-encoding-parser, …) buy us nothing
# at v1 (we don't call its whole-doc API), and the future scanned-PDF
# OCR fallback can re-add it when it actually needs it.
lopdf = "0.32"
[dev-dependencies]
blake3 = { workspace = true }
[lints]
workspace = true