test(pdf): integration smoke (w/ search + cancel) + vector regression + alnum e2e (#[ignore]) for v0.20 sub-item 1

Step 9 (Group I) of the v0.20.0 sub-item 1 (scanned PDF OCR) plan.

I3: crates/kebab-app/tests/ingest_pdf_ocr_smoke.rs (new):
- ingest_with_mock_ocr_yields_pdf_ocr_summary: `#[ignore]`, real Ollama. Exercises the ingest_with_config production path and verifies IngestItem.pdf_ocr_pages.
- ocr_text_indexed_and_searchable: `#[ignore]`, real Ollama. Verifies that OCR text is indexed and returned by app.search (§ Acceptance #2).
- ingest_with_cancel_aborts_mid_pdf: production cancel chain (cancel=true set up front plus a dummy endpoint; verifies no panic and no deadlock).

I4: crates/kebab-parse-pdf/tests/text_extractor_regression.rs (new):
- vector_pdf_extract_byte_identical_to_baseline: the canonical output of the vector PDF path for F4 (mojibake.pdf) stays byte-identical (an invariant that must hold before and after every Step 1-8 change).
- New baseline: tests/snapshots/vector_pdf_canonical.json (created on first run).
- normalize_provenance_timestamps inline helper (R-3 mitigation; no such helper exists anywhere in the workspace, so this is a new 12-line addition).

I5: crates/kebab-parse-pdf/tests/ocr_e2e.rs (new):
- f1_alnum_accuracy_ge_85 / f2_alnum_accuracy_ge_70: `#[ignore]`, real Ollama qwen2.5vl:3b. Implements § Acceptance §9 #3.
- alnum metric = strsim::levenshtein (added as a dev-dep).
- Truth files copied from the PoC scratch directory (page1.txt + page2-batchim.txt) to scanned_page1_truth.txt + scanned_page2_truth.txt.
- kebab-parse-image added as a dev-dep (needed to call OllamaVisionOcr::from_parts). This is a dev-dep exception to the parser isolation invariant (spec §3.1; the dep graph baseline with -e normal is preserved).

spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 9 I3+I4+I5)
prior: c9e0594 (Step 8 CLI printer)
contract: §9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
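
Note on the alnum metric: the message only says that the metric is built on strsim::levenshtein; the exact formula lives in ocr_e2e.rs, which is not shown here. The sketch below is one plausible reading, not the code in this commit. It assumes that accuracy means 1 minus the edit distance divided by the longer of the two alphanumeric-filtered strings. The names `alnum_only` and `alnum_accuracy` are hypothetical, and a hand-written Levenshtein stands in for the strsim crate so that the snippet has no dependencies.

```rust
/// Keep only alphanumeric characters. `char::is_alphanumeric` is
/// Unicode-aware, so Hangul syllables are kept as well as ASCII.
fn alnum_only(s: &str) -> Vec<char> {
    s.chars().filter(|c| c.is_alphanumeric()).collect()
}

/// Two-row Levenshtein distance over chars (stand-in for strsim::levenshtein).
fn levenshtein(a: &[char], b: &[char]) -> usize {
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut cur = vec![0; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        cur[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur[j + 1] = (prev[j] + cost) // substitution (or match)
                .min(prev[j + 1] + 1)     // deletion
                .min(cur[j] + 1);         // insertion
        }
        std::mem::swap(&mut prev, &mut cur);
    }
    prev[b.len()]
}

/// ASSUMED definition: 1 - distance / max(len), computed on the
/// alphanumeric-only text. Two empty inputs count as a perfect match.
fn alnum_accuracy(ocr: &str, truth: &str) -> f64 {
    let o = alnum_only(ocr);
    let t = alnum_only(truth);
    let denom = o.len().max(t.len());
    if denom == 0 {
        return 1.0;
    }
    1.0 - levenshtein(&o, &t) as f64 / denom as f64
}
```

Under this reading, a test such as f1_alnum_accuracy_ge_85 would assert `alnum_accuracy(&ocr_text, &truth_text) >= 0.85`. Filtering to alphanumerics first keeps whitespace and punctuation differences in the OCR output from counting against the score.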
2026-05-27 10:10:58 +00:00
parent c9e05941c5
commit 48197687b7
7 changed files with 384 additions and 1 deletions
--- a/crates/kebab-parse-pdf/tests/text_extractor_regression.rs
+++ b/crates/kebab-parse-pdf/tests/text_extractor_regression.rs
@@ -0,0 +1,70 @@
+//! Byte-identical regression for the vector PDF extraction path (spec §5.4).
+//! Uses F4 (mojibake.pdf) — the only fixture with extractable text content.
+//! First invocation creates the baseline snapshot; subsequent runs verify
+//! identity to detect silent regressions across all Step 1-8 changes.
+
+use std::path::Path;
+
+use kebab_core::{
+    AssetStorage, Checksum, ExtractConfig, ExtractContext, Extractor, MediaType, RawAsset,
+    SourceUri, WorkspacePath, id_for_asset,
+};
+use kebab_parse_pdf::PdfTextExtractor;
+use time::OffsetDateTime;
+
+/// Normalize all provenance timestamps to UNIX_EPOCH so the snapshot is
+/// byte-stable across runs (R-3 mitigation — no workspace helper exists).
+fn normalize_provenance_timestamps(doc: &mut kebab_core::CanonicalDocument) {
+    for event in &mut doc.provenance.events {
+        event.at = OffsetDateTime::UNIX_EPOCH;
+    }
+}
+
+fn make_raw_asset(path: &str) -> RawAsset {
+    let fake_hash = "0".repeat(64);
+    let asset_id = id_for_asset(&fake_hash);
+    RawAsset {
+        asset_id,
+        source_uri: SourceUri::File(std::path::PathBuf::from(path)),
+        workspace_path: WorkspacePath::new(path.to_string()).unwrap(),
+        media_type: MediaType::Pdf,
+        byte_len: 0,
+        checksum: Checksum(fake_hash),
+        discovered_at: OffsetDateTime::UNIX_EPOCH,
+        stored: AssetStorage::Copied {
+            path: std::path::PathBuf::from(path),
+        },
+    }
+}
+
+#[test]
+fn vector_pdf_extract_byte_identical_to_baseline() {
+    let bytes = include_bytes!("fixtures/mojibake.pdf");
+    let asset = make_raw_asset("mojibake.pdf");
+    let workspace_root = Path::new("/");
+    let config = ExtractConfig::default();
+    let ctx = ExtractContext {
+        asset: &asset,
+        workspace_root,
+        config: &config,
+    };
+
+    let mut canonical = PdfTextExtractor::new()
+        .extract(&ctx, bytes)
+        .expect("PdfTextExtractor::extract");
+    normalize_provenance_timestamps(&mut canonical);
+
+    let actual = serde_json::to_string_pretty(&canonical).expect("serialize canonical");
+
+    let baseline_path = "tests/snapshots/vector_pdf_canonical.json";
+    let baseline = std::fs::read_to_string(baseline_path).unwrap_or_else(|_| {
+        std::fs::create_dir_all("tests/snapshots").ok();
+        std::fs::write(baseline_path, &actual).expect("write baseline snapshot");
+        actual.clone()
+    });
+
+    assert_eq!(
+        actual, baseline,
+        "vector PDF canonical must be byte-identical to baseline (Step 1-8 regression)"
+    );
+}
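
Review note: the test above writes a fresh baseline whenever read_to_string fails for any reason, so a permissions error or a corrupted file would also regenerate the snapshot and pass. A possible variant, sketched here with std only, bootstraps the baseline only when the file is missing and propagates every other I/O error. The `Snapshot` enum and `check_snapshot` function are hypothetical names, not part of this commit.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Outcome of comparing `actual` against the stored baseline.
#[derive(Debug, PartialEq)]
enum Snapshot {
    Created,  // no baseline existed; one was written from `actual`
    Matched,  // baseline exists and is byte-identical
    Mismatch, // baseline exists and differs
}

/// Create the baseline on first run, compare on later runs.
/// Only `NotFound` triggers creation; other read errors are returned.
fn check_snapshot(path: &Path, actual: &str) -> io::Result<Snapshot> {
    match fs::read_to_string(path) {
        Ok(baseline) if baseline == actual => Ok(Snapshot::Matched),
        Ok(_) => Ok(Snapshot::Mismatch),
        Err(e) if e.kind() == io::ErrorKind::NotFound => {
            if let Some(dir) = path.parent() {
                fs::create_dir_all(dir)?;
            }
            fs::write(path, actual)?;
            Ok(Snapshot::Created)
        }
        Err(e) => Err(e),
    }
}
```

A test built on this could assert that the result is not `Snapshot::Mismatch`, and could additionally fail on `Snapshot::Created` in CI so that a missing committed baseline is noticed instead of silently recreated.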