kebab/crates/kebab-parse-pdf/tests/page_image.rs
altair823 c2cd3a7ab7 feat(parse-pdf): add page_image (DCTDecode passthrough, 2 test) + text_quality (valid char ratio, 8 unit test) modules
Step 3 (Group C) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

C1: `page_image::extract_dctdecode_page_image(pdf_doc, page_num)` ->
Result<Option<Vec<u8>>>. Traverses Resources/XObject via lopdf and checks
the /Filter of the first image XObject (both the single Name form and the
Array form are covered, spec §4.1 line 642-664). Returns the raw bytes when
the DCTDecode + JPEG magic check passes. Returns Ok(None) for any other
encoding or when no image XObject is present. v1 scope = DCTDecode
passthrough only (H-3 invariant, no image crate introduced).
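
A minimal sketch of the C1 decision logic, written without lopdf so it
stands alone. The `Filter` enum and the helper names are hypothetical
stand-ins, not the crate's actual types, and rejecting multi-filter arrays
is an assumption rather than something the spec excerpt here confirms:

```rust
/// Hypothetical stand-in for a PDF /Filter entry (not lopdf's real type).
#[derive(Debug, Clone, PartialEq)]
enum Filter {
    /// Single Name form, e.g. `/Filter /DCTDecode`.
    Name(String),
    /// Array form, e.g. `/Filter [/DCTDecode]`.
    Array(Vec<String>),
}

/// True only when the filter is exactly DCTDecode, in either form.
fn is_dctdecode(filter: &Filter) -> bool {
    match filter {
        Filter::Name(name) => name == "DCTDecode",
        // Assumption: a chain of several filters is not a plain passthrough.
        Filter::Array(names) => names.len() == 1 && names[0] == "DCTDecode",
    }
}

/// JPEG data begins with the SOI marker 0xFF 0xD8.
fn has_jpeg_magic(bytes: &[u8]) -> bool {
    bytes.starts_with(&[0xFF, 0xD8])
}

/// Returns the stream bytes unchanged when they can be handed on as a JPEG,
/// and None for every other encoding.
fn passthrough_jpeg(filter: &Filter, stream: &[u8]) -> Option<Vec<u8>> {
    if is_dctdecode(filter) && has_jpeg_magic(stream) {
        Some(stream.to_vec())
    } else {
        None
    }
}
```

In this sketch a FlateDecode stream maps to None, which mirrors the
Ok(None) outcome described above.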

Integration test (`tests/page_image.rs`, 2 test):
- f1_fixture_yields_dctdecode_jpeg_bytes — F1 fixture happy path.
- flate_raw_fixture_yields_none — F6 fixture negative path.

C2: `text_quality::compute_valid_char_ratio(s) -> f32`. valid char =
ASCII printable + Hangul (Jamo/Compatibility/Syllables) + CJK + Latin
Extended + common Korean punctuation. Empty string → 0.0. The caller
(`kebab-app::pdf_ocr_apply`) compares the ratio against a threshold
(default 0.5).
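
A rough sketch of how such a ratio could be computed. The function names
and the exact Unicode ranges below are assumptions chosen for illustration;
the real module may classify characters differently (for example, this
sketch does not count newlines as valid):

```rust
/// Hypothetical classifier; every range here is an illustrative assumption.
fn is_valid_char(c: char) -> bool {
    matches!(
        c,
        ' '..='~'                   // ASCII printable
        | '\u{00A0}'..='\u{024F}'   // Latin-1 Supplement, Latin Extended-A/B
        | '\u{1100}'..='\u{11FF}'   // Hangul Jamo
        | '\u{3000}'..='\u{303F}'   // CJK Symbols and Punctuation
        | '\u{3130}'..='\u{318F}'   // Hangul Compatibility Jamo
        | '\u{4E00}'..='\u{9FFF}'   // CJK Unified Ideographs
        | '\u{AC00}'..='\u{D7A3}'   // Hangul Syllables
    )
}

/// Fraction of characters considered valid; an empty string yields 0.0.
fn valid_char_ratio(s: &str) -> f32 {
    let total = s.chars().count();
    if total == 0 {
        return 0.0;
    }
    let valid = s.chars().filter(|c| is_valid_char(*c)).count();
    valid as f32 / total as f32
}
```

Under these assumed ranges, text made only of Private Use Area code points
scores 0.0 and would fall below any positive threshold.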

Unit test (`mod tests`, 7 + F4 conditional):
- empty / pure ASCII / pure Hangul / pure PUA / mixed half / CJK / Hangul Jamo.
- f4_fixture_ratio_under_threshold: active (case A: lopdf extract_text
  returns an empty string when the ToUnicode CMap is absent →
  valid_ratio = 0.0000 < 0.3).

Also: Cargo.toml description updated ("Text PDF extractor + scanned-page
image extract helpers ...", deferred from Step 1 A2).

fixture fix: mojibake.pdf startxref 22130 → 22114 (corrects a 16-byte
offset error; fixes the bug where lopdf's strict parser could not find the xref).
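
One way to check a fixture for this kind of startxref drift, as a sketch
only. Both helper names are hypothetical, and the approach assumes a
classic cross-reference table with a literal `xref` keyword (it does not
handle cross-reference streams):

```rust
/// Byte offset of the last standalone `xref` keyword, skipping the
/// `xref` that is part of `startxref`.
fn find_xref_offset(pdf: &[u8]) -> Option<usize> {
    let needle = b"xref";
    let mut last = None;
    for (i, window) in pdf.windows(needle.len()).enumerate() {
        if window == needle && !pdf[..i].ends_with(b"start") {
            last = Some(i);
        }
    }
    last
}

/// The number written after the last `startxref` keyword.
fn declared_startxref(pdf: &[u8]) -> Option<usize> {
    let key = b"startxref";
    let pos = pdf.windows(key.len()).rposition(|w| w == key)?;
    let digits: String = pdf[pos + key.len()..]
        .iter()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .map(|&b| b as char)
        .collect();
    digits.parse().ok()
}
```

When the two values differ, the declared offset does not point at the
table, which is the situation a strict parser rejects.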

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.1 line 600-722)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 3 C1+C2)
prior: aeeff36 (Step 2 fixtures) + fb3952d (Step 2 F7 record fix)
contract: §9 (additive minor wire bump, deferred to a later step)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 05:59:10 +00:00


// crates/kebab-parse-pdf/tests/page_image.rs (new)
use lopdf::Document;
use kebab_parse_pdf::extract_dctdecode_page_image;

// Happy path: F1 fixture (DCTDecode JPEG passthrough).
#[test]
fn f1_fixture_yields_dctdecode_jpeg_bytes() {
    let bytes = include_bytes!("fixtures/scanned_page1.pdf");
    let doc = Document::load_mem(bytes).unwrap();
    let result = extract_dctdecode_page_image(&doc, 1).unwrap();
    let jpeg = result.expect("F1 page 1 should contain a DCTDecode image");
    // JPEG data starts with the SOI marker 0xFF 0xD8.
    assert!(jpeg.starts_with(b"\xFF\xD8"), "JPEG magic missing");
    assert!(jpeg.len() > 1000, "JPEG bytes too small (got {})", jpeg.len());
}

// Negative path: F6 fixture (FlateDecode raw pixels yield Ok(None)).
#[test]
fn flate_raw_fixture_yields_none() {
    let bytes = include_bytes!("fixtures/flate_raw.pdf");
    let doc = Document::load_mem(bytes).unwrap();
    let result = extract_dctdecode_page_image(&doc, 1).unwrap();
    assert!(result.is_none(), "FlateDecode page should return Ok(None) (DCTDecode-only v1 invariant)");
}