test(pdf): integration smoke (w/ search + cancel) + vector regression + alnum e2e (#[ignore]) for v0.20 sub-item 1
Step 9 (Group I) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.
I3: crates/kebab-app/tests/ingest_pdf_ocr_smoke.rs (new):
- ingest_with_mock_ocr_yields_pdf_ocr_summary: `#[ignore]`, real Ollama;
verifies the ingest_with_config production path and IngestItem.pdf_ocr_pages.
- ocr_text_indexed_and_searchable: `#[ignore]`, real Ollama; verifies that
OCR text is indexed and found by app.search (§ Acceptance #2).
- ingest_with_cancel_aborts_mid_pdf: production cancel chain (pre-set
cancel=true + dummy endpoint; verifies no panic or deadlock).
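The pre-set cancel case above can be illustrated with a minimal sketch. It assumes the cancel signal is a shared `AtomicBool` checked before each page; the actual kebab cancel chain is not shown in this commit, so `OcrRun` and `ocr_pages_with_cancel` are hypothetical names used only for illustration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical outcome type for this sketch (not a kebab type).
#[derive(Debug, PartialEq)]
enum OcrRun {
    Completed { pages_done: usize },
    Cancelled { pages_done: usize },
}

/// Calls `ocr_page` for each page index, checking the cancel flag before
/// every page so that a pre-set flag aborts before any work is done.
fn ocr_pages_with_cancel<F>(page_count: usize, cancel: &AtomicBool, mut ocr_page: F) -> OcrRun
where
    F: FnMut(usize),
{
    for page in 0..page_count {
        if cancel.load(Ordering::SeqCst) {
            return OcrRun::Cancelled { pages_done: page };
        }
        ocr_page(page);
    }
    OcrRun::Completed { pages_done: page_count }
}
```

With the flag already set, the loop returns before the first page, so a dummy endpoint is never contacted; that is presumably why the smoke test can run without a real Ollama instance.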
I4: crates/kebab-parse-pdf/tests/text_extractor_regression.rs (new):
- vector_pdf_extract_byte_identical_to_baseline: keeps the canonical output
of the vector PDF path for F4 mojibake.pdf byte-identical (an invariant
before and after all Step 1-8 changes).
- new baseline = tests/snapshots/vector_pdf_canonical.json (created on first run).
- normalize_provenance_timestamps inline helper (R-3 mitigation; no such
helper exists anywhere in the workspace, so a new 12-line one is added).
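The create-on-first-run baseline described above can be sketched with the standard library alone. This is an illustrative variant, not the helper used in the test: `SnapshotOutcome` and `check_snapshot` are hypothetical names, and unlike the test in this commit, the sketch only creates the baseline when the file is missing rather than on any read error.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical result type for this sketch.
#[derive(Debug, PartialEq)]
enum SnapshotOutcome {
    Created,
    Matched,
    Mismatched,
}

/// Compares `actual` with the snapshot at `path`.
/// A missing snapshot is written and reported as `Created`;
/// any other I/O error is returned to the caller.
fn check_snapshot(path: &Path, actual: &str) -> io::Result<SnapshotOutcome> {
    match fs::read_to_string(path) {
        Ok(baseline) if baseline == actual => Ok(SnapshotOutcome::Matched),
        Ok(_) => Ok(SnapshotOutcome::Mismatched),
        Err(e) if e.kind() == io::ErrorKind::NotFound => {
            if let Some(parent) = path.parent() {
                fs::create_dir_all(parent)?;
            }
            fs::write(path, actual)?;
            Ok(SnapshotOutcome::Created)
        }
        Err(e) => Err(e),
    }
}
```

Distinguishing `NotFound` from other errors keeps a permission or encoding problem from silently overwriting an existing baseline, at the cost of a slightly longer helper.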
I5: crates/kebab-parse-pdf/tests/ocr_e2e.rs (new):
- f1_alnum_accuracy_ge_85 / f2_alnum_accuracy_ge_70: `#[ignore]`, real
Ollama qwen2.5vl:3b; implements § Acceptance §9 #3.
- alnum metric = strsim::levenshtein (added as a dev-dep).
- truth files copied from PoC scratch (page1.txt + page2-batchim.txt) →
scanned_page1_truth.txt + scanned_page2_truth.txt.
- kebab-parse-image added as a dev-dep (needed to call
OllamaVisionOcr::from_parts). This is a dev-dep exception to the parser
isolation invariant (spec §3.1; the dep graph baseline with -e normal is
preserved).
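One way an alphanumeric accuracy metric based on Levenshtein distance could be computed is sketched below. The exact formula used in ocr_e2e.rs is not shown in this commit, so the definition here (1 minus distance over truth length, after dropping non-alphanumeric characters) is an assumption, and the hand-written `levenshtein` is a std-only stand-in for `strsim::levenshtein`.

```rust
/// Keeps only alphanumeric characters; `char::is_alphanumeric` is
/// Unicode-aware, so Hangul syllables are kept as well.
fn alnum_only(s: &str) -> Vec<char> {
    s.chars().filter(|c| c.is_alphanumeric()).collect()
}

/// Levenshtein distance over chars (std-only stand-in for strsim::levenshtein).
fn levenshtein(a: &[char], b: &[char]) -> usize {
    // prev[j] holds the distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            let best = (prev[j] + cost).min(prev[j + 1] + 1).min(curr[j] + 1);
            curr.push(best);
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Assumed metric: 1 - distance / truth length, clamped to [0, 1].
fn alnum_accuracy(ocr: &str, truth: &str) -> f64 {
    let (o, t) = (alnum_only(ocr), alnum_only(truth));
    if t.is_empty() {
        return if o.is_empty() { 1.0 } else { 0.0 };
    }
    let dist = levenshtein(&o, &t) as f64;
    (1.0 - dist / t.len() as f64).max(0.0)
}
```

Under this definition a test named like f1_alnum_accuracy_ge_85 would assert `alnum_accuracy(ocr, truth) >= 0.85`, assuming the suffix denotes a percentage.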
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 9 I3+I4+I5)
prior: c9e0594 (Step 8 CLI printer)
contract: §9
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
crates/kebab-parse-pdf/tests/text_extractor_regression.rs
Normal file
@@ -0,0 +1,70 @@
//! Byte-identical regression for the vector PDF extraction path (spec §5.4).
//! Uses F4 (mojibake.pdf), the only fixture with extractable text content.
//! First invocation creates the baseline snapshot; subsequent runs verify
//! identity to detect silent regressions across all Step 1-8 changes.

use std::path::Path;

use kebab_core::{
    AssetStorage, Checksum, ExtractConfig, ExtractContext, Extractor, MediaType, RawAsset,
    SourceUri, WorkspacePath, id_for_asset,
};
use kebab_parse_pdf::PdfTextExtractor;
use time::OffsetDateTime;

/// Normalize all provenance timestamps to UNIX_EPOCH so the snapshot is
/// byte-stable across runs (R-3 mitigation; no workspace helper exists).
fn normalize_provenance_timestamps(doc: &mut kebab_core::CanonicalDocument) {
    for event in &mut doc.provenance.events {
        event.at = OffsetDateTime::UNIX_EPOCH;
    }
}

fn make_raw_asset(path: &str) -> RawAsset {
    let fake_hash = "0".repeat(64);
    let asset_id = id_for_asset(&fake_hash);
    RawAsset {
        asset_id,
        source_uri: SourceUri::File(std::path::PathBuf::from(path)),
        workspace_path: WorkspacePath::new(path.to_string()).unwrap(),
        media_type: MediaType::Pdf,
        byte_len: 0,
        checksum: Checksum(fake_hash),
        discovered_at: OffsetDateTime::UNIX_EPOCH,
        stored: AssetStorage::Copied {
            path: std::path::PathBuf::from(path),
        },
    }
}

#[test]
fn vector_pdf_extract_byte_identical_to_baseline() {
    let bytes = include_bytes!("fixtures/mojibake.pdf");
    let asset = make_raw_asset("mojibake.pdf");
    let workspace_root = Path::new("/");
    let config = ExtractConfig::default();
    let ctx = ExtractContext {
        asset: &asset,
        workspace_root,
        config: &config,
    };

    let mut canonical = PdfTextExtractor::new()
        .extract(&ctx, bytes)
        .expect("PdfTextExtractor::extract");
    normalize_provenance_timestamps(&mut canonical);

    let actual = serde_json::to_string_pretty(&canonical).expect("serialize canonical");

    let baseline_path = "tests/snapshots/vector_pdf_canonical.json";
    let baseline = std::fs::read_to_string(baseline_path).unwrap_or_else(|_| {
        std::fs::create_dir_all("tests/snapshots").ok();
        std::fs::write(baseline_path, &actual).expect("write baseline snapshot");
        actual.clone()
    });

    assert_eq!(
        actual, baseline,
        "vector PDF canonical must be byte-identical to baseline (Step 1-8 regression)"
    );
}