Step 9 (Group I) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.
I3 - crates/kebab-app/tests/ingest_pdf_ocr_smoke.rs (new):
- ingest_with_mock_ocr_yields_pdf_ocr_summary: `#[ignore]`, needs a real Ollama;
exercises the ingest_with_config production path and verifies IngestItem.pdf_ocr_pages.
- ocr_text_indexed_and_searchable: `#[ignore]`, needs a real Ollama; verifies via
app.search that the OCR text is indexed (§ Acceptance #2).
- ingest_with_cancel_aborts_mid_pdf: production cancel chain (pre-set
cancel=true plus a dummy endpoint; verifies no panic or deadlock).
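The pre-set cancel case above can be pictured with a minimal sketch. It assumes the cancel signal is a shared AtomicBool checked before each page; `Cancelled` and `ocr_pages` are hypothetical names for illustration, not kebab's actual API.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical error type for this sketch; not kebab's actual API.
#[derive(Debug, PartialEq)]
pub struct Cancelled {
    /// Number of pages fully processed before the flag was observed.
    pub pages_done: usize,
}

/// Hypothetical per-page loop: the shared flag is checked before each page,
/// so a cancel request stops the run without starting more OCR work.
pub fn ocr_pages<F>(
    page_count: usize,
    cancel: &AtomicBool,
    mut ocr_one: F,
) -> Result<usize, Cancelled>
where
    F: FnMut(usize),
{
    for page in 0..page_count {
        if cancel.load(Ordering::Relaxed) {
            return Err(Cancelled { pages_done: page });
        }
        ocr_one(page);
    }
    Ok(page_count)
}
```

With the flag already true on entry, the loop returns before the first `ocr_one` call, which is why a dummy endpoint would never be contacted in such a test.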
I4 - crates/kebab-parse-pdf/tests/text_extractor_regression.rs (new):
- vector_pdf_extract_byte_identical_to_baseline: the canonical output of the vector
PDF path for F4 mojibake.pdf stays byte-identical (an invariant before and after all Step 1-8 changes).
- new baseline = tests/snapshots/vector_pdf_canonical.json (created on first run).
- normalize_provenance_timestamps inline helper (R-3 mitigation; no such helper
exists anywhere in the workspace, so this adds a new 12-line one).
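A rough sketch of the snapshot approach described above, using only std. The real helper and JSON layout are not shown in this commit, so the key name, the `<normalized>` placeholder, and both function names are assumptions for illustration only.

```rust
use std::path::Path;

/// Hypothetical normalizer: replaces the string value of every
/// `"<key>": "<value>"` pair with a fixed placeholder so that two runs
/// compare byte-identical. Assumes the value is a JSON string without
/// escaped quotes, which holds for typical timestamps.
pub fn normalize_timestamps(json: &str, key: &str) -> String {
    let needle = format!("\"{key}\"");
    let mut out = String::with_capacity(json.len());
    let mut rest = json;
    while let Some(pos) = rest.find(&needle) {
        let after_key = pos + needle.len();
        // Locate the opening and closing quotes of the value.
        let Some(open) = rest[after_key..].find('"').map(|i| after_key + i) else {
            break;
        };
        let Some(close) = rest[open + 1..].find('"').map(|i| open + 1 + i) else {
            break;
        };
        out.push_str(&rest[..=open]);
        out.push_str("<normalized>");
        rest = &rest[close..];
    }
    out.push_str(rest);
    out
}

/// Hypothetical baseline check: the first run writes the snapshot and passes;
/// later runs require a byte-identical match.
pub fn matches_baseline(path: &Path, actual: &str) -> std::io::Result<bool> {
    if !path.exists() {
        std::fs::write(path, actual)?;
        return Ok(true);
    }
    Ok(std::fs::read(path)?.as_slice() == actual.as_bytes())
}
```

Normalizing before comparing is what lets the comparison be strictly byte-for-byte; without it, every run would differ in its timestamp fields.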
I5 - crates/kebab-parse-pdf/tests/ocr_e2e.rs (new):
- f1_alnum_accuracy_ge_85 / f2_alnum_accuracy_ge_70: `#[ignore]`, need a real
Ollama qwen2.5vl:3b; implement § Acceptance §9 #3.
- alnum metric = strsim::levenshtein (dev-dep added).
- truth files copied from the PoC scratch directory (page1.txt + page2-batchim.txt) →
scanned_page1_truth.txt + scanned_page2_truth.txt.
- kebab-parse-image dev-dep added (for calling OllamaVisionOcr::from_parts).
This is a dev-dep exception to the parser isolation invariant (spec §3.1; the dep graph
baseline with -e normal is preserved).
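To make the alnum metric concrete, here is a dependency-free sketch. It swaps in a hand-written two-row Levenshtein for strsim::levenshtein (an assumption made only so the example needs no crates); the accuracy formula mirrors the one in ocr_e2e.rs.

```rust
/// Plain two-row Levenshtein distance over Unicode scalar values.
/// Stand-in for `strsim::levenshtein` so the sketch has no dependencies.
pub fn levenshtein(a: &str, b: &str) -> usize {
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != *cb);
            let delete = prev[j + 1] + 1;
            let insert = cur[j] + 1;
            cur.push(substitute.min(delete).min(insert));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// (len(expected) - distance) / len(expected) over alphanumeric characters
/// only, clamped at 0; returns 0.0 when `expected` has none.
pub fn alnum_accuracy(actual: &str, expected: &str) -> f32 {
    let a: String = actual.chars().filter(|c| c.is_alphanumeric()).collect();
    let e: String = expected.chars().filter(|c| c.is_alphanumeric()).collect();
    let len = e.chars().count();
    if len == 0 {
        return 0.0;
    }
    ((len as f32 - levenshtein(&a, &e) as f32) / len as f32).max(0.0)
}
```

For example, comparing "Hello, World!" against "hello world" drops the punctuation and spaces, leaving two case substitutions over 10 expected characters, so the score is 0.8. A precomposed Hangul syllable counts as one character, so a single wrong batchim costs a whole syllable.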
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 9 I3+I4+I5)
prior: c9e0594 (Step 8 CLI printer)
contract: §9
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
63 lines
2.3 KiB
Rust
// § Acceptance §9 #3: alnum accuracy of a real Ollama qwen2.5vl:3b.
// F1 ≥ 0.85, F2 ≥ 0.70. Depends on a real Ollama, so `#[ignore]` by default.
//
// Manual invoke (`--ignored` is a test-harness flag, so it goes after `--`):
//   KEBAB_PDF_OCR_ENDPOINT=http://192.168.0.47:11434 \
//     cargo test -p kebab-parse-pdf --test ocr_e2e -j 4 -- --ignored

use kebab_core::Lang;
use kebab_parse_image::{OcrEngine, OllamaVisionOcr};
use kebab_parse_pdf::extract_dctdecode_page_image;
use lopdf::Document;

/// Runs OCR on one page of `pdf` against a real Ollama server and returns the
/// recognized text. The endpoint comes from `KEBAB_PDF_OCR_ENDPOINT`, falling
/// back to a local Ollama.
fn run_real_ollama_ocr(pdf: &[u8], page: u32) -> anyhow::Result<String> {
    let endpoint = std::env::var("KEBAB_PDF_OCR_ENDPOINT")
        .unwrap_or_else(|_| "http://localhost:11434".to_string());
    let doc = Document::load_mem(pdf)?;
    // DCTDecode is the PDF filter for JPEG data; the page image is handed to OCR as-is.
    let jpeg = extract_dctdecode_page_image(&doc, page)?
        .ok_or_else(|| anyhow::anyhow!("page {page}: no DCTDecode image XObject found"))?;

    let engine = OllamaVisionOcr::from_parts(
        endpoint,
        "qwen2.5vl:3b".to_string(),
        vec!["eng".to_string(), "kor".to_string()],
        2048,
        600,
    )?;

    let result = engine.recognize(&jpeg, Some(&Lang("kor".into())))?;
    Ok(result.joined)
}

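/// Share of the expected alphanumeric characters that OCR reproduced:
/// `(len(expected) - levenshtein(actual, expected)) / len(expected)`, computed
/// after dropping every non-alphanumeric character and clamped at 0.
/// Returns 0.0 when `expected` has no alphanumeric characters.
fn alnum_accuracy(actual: &str, expected: &str) -> f32 {
    let a: String = actual.chars().filter(|c| c.is_alphanumeric()).collect();
    let e: String = expected.chars().filter(|c| c.is_alphanumeric()).collect();
    let expected_len = e.chars().count();
    if expected_len == 0 {
        return 0.0;
    }
    let dist = strsim::levenshtein(&a, &e) as f32;
    ((expected_len as f32 - dist) / expected_len as f32).max(0.0)
}
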
#[test]
#[ignore = "real Ollama qwen2.5vl:3b dependency"]
fn f1_alnum_accuracy_ge_85() {
    let pdf = include_bytes!("fixtures/scanned_page1.pdf");
    let ocr = run_real_ollama_ocr(pdf, 1).expect("OCR");
    let expected = include_str!("fixtures/scanned_page1_truth.txt");
    let accuracy = alnum_accuracy(&ocr, expected);
    println!("F1 alnum accuracy = {accuracy:.4}");
    assert!(accuracy >= 0.85, "F1 alnum accuracy {accuracy:.4} < 0.85");
}

#[test]
#[ignore = "real Ollama qwen2.5vl:3b dependency"]
fn f2_alnum_accuracy_ge_70() {
    let pdf = include_bytes!("fixtures/scanned_page2.pdf");
    let ocr = run_real_ollama_ocr(pdf, 1).expect("OCR");
    let expected = include_str!("fixtures/scanned_page2_truth.txt");
    let accuracy = alnum_accuracy(&ocr, expected);
    println!("F2 alnum accuracy = {accuracy:.4}");
    assert!(accuracy >= 0.70, "F2 alnum accuracy {accuracy:.4} < 0.70");
}