test(pdf): integration smoke (w/ search + cancel) + vector regression + alnum e2e (#[ignore]) for v0.20 sub-item 1

Step 9 (Group I) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

I3: crates/kebab-app/tests/ingest_pdf_ocr_smoke.rs (new):
- ingest_with_mock_ocr_yields_pdf_ocr_summary: `#[ignore]`, real Ollama;
  exercises the ingest_with_config production path and verifies
  IngestItem.pdf_ocr_pages.
- ocr_text_indexed_and_searchable: `#[ignore]`, real Ollama; verifies via
  app.search that the OCR text is indexed (§ Acceptance #2).
- ingest_with_cancel_aborts_mid_pdf: production cancel chain (pre-set
  cancel=true + dummy endpoint; verifies no panic/deadlock).
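
For illustration only, the pre-set cancel pattern checked by the third test
can be sketched as below. `process_pages` is a hypothetical stand-in, not
the kebab-app API: it assumes the pipeline polls a shared flag before each
page, which is why a flag set to true up front never reaches the endpoint.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical per-page loop (not the real kebab-app code): polls the
/// cancel flag before each page and reports how many pages were processed.
fn process_pages(pages: &[&str], cancel: Option<&Arc<AtomicBool>>) -> Result<usize, String> {
    let mut done = 0;
    for _page in pages {
        // Check the flag first, so a pre-set cancel stops before any work.
        if cancel.map_or(false, |c| c.load(Ordering::Relaxed)) {
            return Err(format!("cancelled after {done} page(s)"));
        }
        // A real pipeline would call the OCR endpoint here.
        done += 1;
    }
    Ok(done)
}
```

With the flag pre-set, the sketch returns an error after 0 pages; with no
flag, it processes every page.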

I4: crates/kebab-parse-pdf/tests/text_extractor_regression.rs (new):
- vector_pdf_extract_byte_identical_to_baseline: the vector PDF path's
  canonical output for F4 mojibake.pdf stays byte-identical (invariant
  before and after all Step 1-8 changes).
- new baseline = tests/snapshots/vector_pdf_canonical.json (created on
  first run).
- normalize_provenance_timestamps inline helper (R-3 mitigation; no such
  helper exists anywhere in the workspace, so a new 12-line one is added).
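
A minimal sketch of the create-on-first-run baseline check with timestamp
normalization, under stated assumptions: `check_baseline`, the `"timestamp"`
key, and the `<normalized>` placeholder are hypothetical, and the actual
helper in the test may work on parsed JSON rather than on lines.

```rust
use std::fs;
use std::path::Path;

/// Hypothetical normalizer: replaces the value on any pretty-printed JSON
/// line whose key is "timestamp", so snapshots do not differ across runs.
fn normalize_timestamps(pretty_json: &str) -> String {
    pretty_json
        .lines()
        .map(|line| {
            let trimmed = line.trim_start();
            if trimmed.starts_with("\"timestamp\":") {
                let indent = &line[..line.len() - trimmed.len()];
                let comma = if trimmed.ends_with(',') { "," } else { "" };
                format!("{indent}\"timestamp\": \"<normalized>\"{comma}")
            } else {
                line.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join("\n")
}

/// Returns Ok(true) if the baseline was created on this run, Ok(false) if
/// it already existed and matched byte for byte, and Err on a mismatch.
fn check_baseline(path: &Path, actual: &str) -> Result<bool, String> {
    let actual = normalize_timestamps(actual);
    if !path.exists() {
        fs::write(path, &actual).map_err(|e| e.to_string())?;
        return Ok(true);
    }
    let expected = fs::read_to_string(path).map_err(|e| e.to_string())?;
    if expected == actual {
        Ok(false)
    } else {
        Err(format!("snapshot mismatch for {}", path.display()))
    }
}
```

One consequence of this design is that the first run always passes, so the
created baseline file has to be reviewed and committed for later runs to
guard anything.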

I5: crates/kebab-parse-pdf/tests/ocr_e2e.rs (new):
- f1_alnum_accuracy_ge_85 / f2_alnum_accuracy_ge_70: `#[ignore]`, real
  Ollama qwen2.5vl:3b; implements § Acceptance §9 #3.
- alnum metric = strsim::levenshtein (dev-dep added).
- truth files copied from PoC scratch (page1.txt + page2-batchim.txt) →
  scanned_page1_truth.txt + scanned_page2_truth.txt.
- kebab-parse-image dev-dep added (to call OllamaVisionOcr::from_parts).
  This is a dev-dep exception to the parser isolation invariant (spec §3.1;
  the dep graph baseline with -e normal is preserved).
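
One way an alnum accuracy score could be derived from an edit distance is
sketched below. The filtering rule (`char::is_alphanumeric`) and the
denominator (the longer of the two filtered strings) are assumptions, not
taken from ocr_e2e.rs. The test itself uses strsim::levenshtein; the
distance is hand-rolled here only to keep the sketch dependency-free.

```rust
/// Classic two-row Levenshtein distance over chars.
fn levenshtein(a: &[char], b: &[char]) -> usize {
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = usize::from(ca != cb);
            // substitution, deletion, insertion
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Hypothetical metric: keep only alphanumeric chars (Hangul syllables
/// count as alphabetic), then score 1 - distance / longer_length.
fn alnum_accuracy(truth: &str, ocr: &str) -> f64 {
    let t: Vec<char> = truth.chars().filter(|c| c.is_alphanumeric()).collect();
    let o: Vec<char> = ocr.chars().filter(|c| c.is_alphanumeric()).collect();
    let denom = t.len().max(o.len());
    if denom == 0 {
        return 1.0; // both empty after filtering: treat as a perfect match
    }
    1.0 - levenshtein(&t, &o) as f64 / denom as f64
}
```

Under these assumptions a threshold such as `ge_85` would read as
`alnum_accuracy(truth, ocr) >= 0.85`, and whitespace or punctuation
differences in the OCR output do not lower the score.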

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 9 I3+I4+I5)
prior: c9e0594 (Step 8 CLI printer)
contract: §9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 10:10:58 +00:00
parent c9e05941c5
commit 48197687b7
7 changed files with 384 additions and 1 deletions


@@ -0,0 +1,120 @@
//! Integration smoke tests for the PDF OCR pipeline (§ Acceptance §9 #1 + #2).
//!
//! Tests 1 and 2 require a live Ollama endpoint — `#[ignore]` by default.
//! Manual invoke:
//! KEBAB_PDF_OCR_ENDPOINT=http://192.168.0.47:11434 \
//! cargo test -p kebab-app --test ingest_pdf_ocr_smoke --ignored -j 4
//!
//! Test 3 (cancel) uses a dummy endpoint + pre-set cancel — runs by default
//! to verify the cancel wiring doesn't panic/deadlock.

mod common;

use std::path::PathBuf;
use std::sync::Arc;
use std::sync::atomic::AtomicBool;

use common::TestEnv;

fn ollama_endpoint() -> String {
    std::env::var("KEBAB_PDF_OCR_ENDPOINT")
        .unwrap_or_else(|_| "http://localhost:11434".to_string())
}

fn make_ocr_env_real() -> TestEnv {
    let mut env = TestEnv::lexical_only();
    env.config.pdf.ocr.enabled = true;
    env.config.pdf.ocr.endpoint = Some(ollama_endpoint());
    env.config.models.embedding.provider = "none".to_string();

    let src = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .parent()
        .unwrap()
        .join("kebab-parse-pdf/tests/fixtures/scanned_page1.pdf");
    let dest = env.workspace_root.join("scanned_page1.pdf");
    std::fs::copy(&src, &dest).expect("copy scanned_page1.pdf to workspace");
    env
}

/// § Acceptance §9 #1: real Ollama OCR + IngestItem.pdf_ocr_pages = Some(1).
#[test]
#[ignore = "real Ollama qwen2.5vl:3b dependency"]
fn ingest_with_mock_ocr_yields_pdf_ocr_summary() {
    let env = make_ocr_env_real();
    let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
        .expect("ingest");
    assert!(report.new >= 1, "at least one PDF ingested: {report:?}");

    let items = report.items.unwrap_or_default();
    let pdf_item = items.iter().find(|i| i.doc_path.0.ends_with(".pdf"));
    assert!(
        pdf_item.is_some(),
        "PDF item must appear in ingest report items: {items:?}"
    );
    let pdf_item = pdf_item.unwrap();
    assert!(
        pdf_item.pdf_ocr_pages.is_some(),
        "pdf_ocr_pages must be set for scanned PDF: {pdf_item:?}"
    );
    assert_eq!(
        pdf_item.pdf_ocr_pages.unwrap(),
        1,
        "scanned_page1.pdf has exactly 1 page"
    );
}

/// § Acceptance §9 #2: OCR text indexed and retrievable via lexical search.
#[test]
#[ignore = "real Ollama qwen2.5vl:3b dependency"]
fn ocr_text_indexed_and_searchable() {
    let env = make_ocr_env_real();
    kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
        .expect("ingest");

    // Search for a Korean morpheme expected to appear in qwen2.5vl:3b OCR
    // output of the PoC ground-truth page. "다음" is a high-frequency token
    // in page1.txt truth file.
    let query = common::lexical_query("다음");
    let hits =
        kebab_app::search_with_config(env.config.clone(), query).expect("search");
    assert!(
        !hits.is_empty(),
        "OCR-indexed text must surface in lexical search results"
    );
}

/// Production cancel wiring smoke: pre-set cancel exits before any OCR call.
/// Dummy endpoint (port 1 = connection-refused) means OCR HTTP calls would
/// fail, but cancel=true prevents the loop from reaching OCR at all.
/// Verifies no panic/deadlock regardless of Ok/Err outcome.
#[test]
fn ingest_with_cancel_aborts_mid_pdf() {
    let mut env = TestEnv::lexical_only();
    env.config.pdf.ocr.enabled = true;
    env.config.pdf.ocr.endpoint = Some("http://127.0.0.1:1".to_string());

    let src = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .parent()
        .unwrap()
        .join("kebab-parse-pdf/tests/fixtures/scanned_page1.pdf");
    let dest = env.workspace_root.join("scanned_page1.pdf");
    std::fs::copy(&src, &dest).expect("copy scanned_page1.pdf to workspace");

    let cancel = Arc::new(AtomicBool::new(true)); // pre-set: abort immediately
    let result = kebab_app::ingest_with_config_cancellable(
        env.config.clone(),
        env.scope(),
        false,
        None,
        Some(cancel),
    );
    // Both Ok (pre-cancel exit) and Err (eager OCR engine fail) are acceptable;
    // the key assertion is no panic/deadlock.
    let _ = result;
}