test(pdf): integration smoke (w/ search + cancel) + vector regression + alnum e2e (#[ignore]) for v0.20 sub-item 1

Step 9 (Group I) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

I3 - crates/kebab-app/tests/ingest_pdf_ocr_smoke.rs (new):
- ingest_with_mock_ocr_yields_pdf_ocr_summary: `#[ignore]`, real Ollama;
  runs the ingest_with_config production path and verifies
  IngestItem.pdf_ocr_pages.
- ocr_text_indexed_and_searchable: `#[ignore]`, real Ollama; verifies that
  OCR text is indexed and returned by app.search (§ Acceptance #2).
- ingest_with_cancel_aborts_mid_pdf: exercises the production cancel chain
  (cancel=true pre-set, dummy endpoint) and verifies no panic or deadlock.
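
A minimal sketch of the pre-set-cancel idea behind the third test, using only
the standard library. The names `ingest_pages` and `IngestOutcome` are
hypothetical and are not the kebab-app API; the assumption is that the
production chain checks a shared flag before each page, so a flag that is
already true means the OCR endpoint is never contacted.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical outcome type, for illustration only.
#[derive(Debug, PartialEq)]
enum IngestOutcome {
    Completed { pages: usize },
    Cancelled { pages_done: usize },
}

/// Hypothetical page loop: checks the cancel flag before every page and
/// returns early instead of calling the OCR backend once the flag is set.
fn ingest_pages<F>(page_count: usize, cancel: &AtomicBool, mut ocr_page: F) -> IngestOutcome
where
    F: FnMut(usize),
{
    for page in 0..page_count {
        if cancel.load(Ordering::Relaxed) {
            return IngestOutcome::Cancelled { pages_done: page };
        }
        ocr_page(page);
    }
    IngestOutcome::Completed { pages: page_count }
}
```

With the flag set before the call, the closure standing in for the endpoint
is never invoked, which is why a dummy endpoint suffices in such a test.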

I4 - crates/kebab-parse-pdf/tests/text_extractor_regression.rs (new):
- vector_pdf_extract_byte_identical_to_baseline: the canonical output of the
  vector PDF path for F4 mojibake.pdf stays byte-identical (an invariant
  before and after all Step 1-8 changes).
- New baseline: tests/snapshots/vector_pdf_canonical.json (created on first
  run).
- normalize_provenance_timestamps inline helper (R-3 mitigation; no such
  helper exists anywhere in the workspace, so a new 12-line one is added).
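
A rough, std-only sketch of the two ideas in this group: pinning volatile
timestamps to the epoch before comparing, and creating the baseline on first
run. It is not the helper from this commit. The function names, the key list
and the line-based approach (which assumes pretty-printed JSON with one key
per line) are assumptions; the real helper may well operate on a parsed JSON
value instead.

```rust
use std::{fs, io, path::Path};

const EPOCH: &str = "1970-01-01T00:00:00Z";
// Assumed set of timestamp keys, taken from the fields visible in the snapshot.
const TIMESTAMP_KEYS: [&str; 3] = ["at", "created_at", "updated_at"];

/// Replace the string value of every timestamp key with the epoch.
/// Note: the trailing newline of the input, if any, is not preserved.
fn normalize_timestamps(pretty_json: &str) -> String {
    pretty_json
        .lines()
        .map(|line| {
            let trimmed = line.trim_start();
            let indent = &line[..line.len() - trimmed.len()];
            for key in TIMESTAMP_KEYS {
                let prefix = format!("\"{}\": \"", key);
                if trimmed.starts_with(&prefix) {
                    let comma = if trimmed.ends_with(',') { "," } else { "" };
                    return format!("{}\"{}\": \"{}\"{}", indent, key, EPOCH, comma);
                }
            }
            line.to_string()
        })
        .collect::<Vec<_>>()
        .join("\n")
}

/// Returns Ok(true) when `actual` equals the baseline byte for byte.
/// If no baseline exists yet, it is written and the check passes.
fn matches_or_creates_baseline(path: &Path, actual: &str) -> io::Result<bool> {
    if !path.exists() {
        if let Some(dir) = path.parent() {
            fs::create_dir_all(dir)?;
        }
        fs::write(path, actual)?;
        return Ok(true);
    }
    Ok(fs::read(path)? == actual.as_bytes())
}
```

Normalizing before the comparison keeps the check byte-exact for everything
except the fields that legitimately change between runs.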

I5 - crates/kebab-parse-pdf/tests/ocr_e2e.rs (new):
- f1_alnum_accuracy_ge_85 / f2_alnum_accuracy_ge_70: `#[ignore]`, real
  Ollama qwen2.5vl:3b; implements § Acceptance §9 #3.
- alnum metric = strsim::levenshtein (dev-dep added).
- Truth files copied from PoC scratch (page1.txt + page2-batchim.txt) to
  scanned_page1_truth.txt + scanned_page2_truth.txt.
- kebab-parse-image dev-dep added (to call OllamaVisionOcr::from_parts).
  This is a dev-dep exception to the parser isolation invariant (spec §3.1;
  the `-e normal` dep graph baseline is preserved).
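
One plausible shape for the alnum metric, sketched without external crates.
The `levenshtein` function below is a hand-rolled stand-in for
strsim::levenshtein so that the block is self-contained. The formula
(1 - distance / longer length, over alphanumeric characters only) and the
name `alnum_accuracy` are assumptions, not taken from the test file.

```rust
/// Keep only alphanumeric characters (Unicode-aware, so Hangul counts).
fn alnum_only(s: &str) -> Vec<char> {
    s.chars().filter(|c| c.is_alphanumeric()).collect()
}

/// Edit distance over chars; stand-in for strsim::levenshtein.
fn levenshtein(a: &[char], b: &[char]) -> usize {
    // prev[j] holds the distance between the first i chars of a
    // and the first j chars of b.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            let best = (prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1);
            cur.push(best);
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Assumed metric: 1.0 means the alphanumeric content is identical.
fn alnum_accuracy(ocr: &str, truth: &str) -> f64 {
    let (o, t) = (alnum_only(ocr), alnum_only(truth));
    let longer = o.len().max(t.len());
    if longer == 0 {
        return 1.0;
    }
    1.0 - levenshtein(&o, &t) as f64 / longer as f64
}
```

Under this reading, the test names would correspond to asserting
`alnum_accuracy(ocr, truth) >= 0.85` for F1 and `>= 0.70` for F2.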

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 9 I3+I4+I5)
prior: c9e0594 (Step 8 CLI printer)
contract: §9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 10:10:58 +00:00
parent c9e05941c5
commit 48197687b7
7 changed files with 384 additions and 1 deletions

@@ -0,0 +1,43 @@
{
  "doc_id": "c90fae7576fe514fb08190cb29d1ef5d",
  "source_asset_id": "babe9824b6b28237c0898575a40ba48d",
  "workspace_path": "mojibake.pdf",
  "title": "mojibake",
  "lang": "und",
  "blocks": [],
  "metadata": {
    "aliases": [],
    "tags": [],
    "created_at": "1970-01-01T00:00:00Z",
    "updated_at": "1970-01-01T00:00:00Z",
    "source_type": "paper",
    "trust_level": "primary",
    "user_id_alias": null,
    "user": {
      "pdf": {
        "page_count": 0
      }
    }
  },
  "provenance": {
    "events": [
      {
        "at": "1970-01-01T00:00:00Z",
        "agent": "kb-source-fs",
        "kind": "discovered",
        "note": null
      },
      {
        "at": "1970-01-01T00:00:00Z",
        "agent": "kb-parse-pdf",
        "kind": "parsed",
        "note": "parser_version=pdf-text-v1; page_count=0"
      }
    ]
  },
  "parser_version": "pdf-text-v1",
  "schema_version": 1,
  "doc_version": 1,
  "last_chunker_version": null,
  "last_embedding_version": null
}