Commit Graph

3 Commits

Author SHA1 Message Date
685007789a style: cargo fmt --all (round 4 ingest log feature follow-up)
Phase C4 executor 의 마지막 `fix(test): clippy + fmt fixes` commit 이
test file 부분만 fmt 적용. workspace 전체 fmt 누락 발견 → cargo fmt --all
적용. 모든 import alphabetical reorder + line wrapping 정합.

추가 untracked artifact 동시 commit:
- docs/superpowers/specs/2026-05-28-v0.20-ingest-log-spec.md (491 line, ACCEPT)
- docs/superpowers/plans/2026-05-28-v0.20-ingest-log-plan.md (616 line, ACCEPT)

workspace test: 1370 passed / 0 failed / 50 ignored, ingest_log_smoke green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 04:18:40 +00:00
241ded59df test(app): multi-scanned PDF chunk_id collision-free integration test (Bug #3 regression)
v0.20.0 sub-item 1 bugfix Step 3 (Group C) — integration-level regression
for Bug #3 (intra-doc chunk_id collision under aggressive overlap).

- `crates/kebab-app/tests/common/mod.rs`: `pub mod mock_ocr;` 1 line append.
- `crates/kebab-app/tests/common/mock_ocr.rs` (new): MockOcrEngine lift +
  `single` / `per_page` ctor (backward-compat single + per-page cursor).
- `crates/kebab-app/tests/pdf_ocr_apply.rs`: inline MockOcrEngine 제거 +
  `mod common; use common::mock_ocr::MockOcrEngine;` import. 10 ctor call
  site migration (`MockOcrEngine { .. }` → `MockOcrEngine::single(...)`).
- `crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs`
  (new): F1 + F2 scanned PDF + Bug #3 trigger shape (10 char "가" + ". " +
  500 char "나") via mock OCR. assertion: chunk_id global uniqueness (HashSet
  dedup) across F1 + F2; F2 trigger text produces ≥2 chunks (collision shape).
- C1 decision: Option A (share via tests/common/mock_ocr.rs). Facade mock
  injection unavailable (OllamaVisionOcr hardcoded) — helper-level chain test
  (apply_ocr_to_pdf_pages → PdfPageV1Chunker) adds value beyond unit B5.

spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§4.5)
plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 3)
prior: 436fd01 (Step 2 Bug #3 chunk_id fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:45:38 +00:00
9f003ef1cd feat(app): add pdf_ocr_apply helper (10 test, F7 split + cancel) — post-extract OCR enrichment for PDF (H-1 resolution)
Step 4 (Group D) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

D1 — `apply_ocr_to_pdf_pages(&mut canonical, &dyn OcrEngine, &bytes, &opts, emit_progress)`
in `kebab-app::pdf_ocr_apply`. spec §4.1 line 381-599 body 그대로 +
PdfOcrOpts.cancel field + per-page cancel check (verifier LOW L-1).

post-extract enrichment pattern (H-1 resolution): kebab-parse-pdf 가
kebab-parse-image::OcrEngine 을 import 하지 않음 (parser isolation 보존).
helper 가 kebab-app 의 facade 안 — both parser crate 의 cross-import 회피.

Per-page decision matrix (spec §4.1 line 459-464):
- always_on=true → 모든 page OCR (dual-block, ordinal = page-1 + page_count).
- always_on=false + needs_ocr → in-place OCR (text-detect block mutate).
- needs_ocr=false → skip.

DCTDecode-only v1 (H-3): FlateDecode / CCITTFaxDecode page 는
extract_dctdecode_page_image=None → Warning event + skip + emit_progress(skipped=true).

OcrEngine.recognize 실패 → Warning event + skip + emit_progress(skipped=true).

D3 — per-page cancel handle (verifier LOW L-1 + spec §4.8 line 1159):
PdfOcrOpts.cancel: Option<Arc<AtomicBool>>. set→true 시
`anyhow::bail!("PDF OCR cancelled mid-PDF at page N")`.

lopdf = "0.32" added to [dependencies] (already transitive via kebab-parse-pdf;
no new crate introduced — dep graph kebab-parse-* baseline unchanged).

Integration test (`tests/pdf_ocr_apply.rs`, 10 test):
- f1_input_with_ocr_enabled_replaces_empty_block — in-place mutate.
- f3_input_with_ocr_enabled_keeps_text_detect_blocks — vector PDF skip.
- f1_input_with_ocr_disabled_keeps_empty_block — disabled no-op.
- f4_input_with_ocr_enabled_replaces_mojibake_block — mojibake → in-place mutate.
- f3_input_with_always_on_pushes_dual_blocks — always_on dual-block.
- f6_flatedecode_skipped_with_warning — FlateDecode skip + Warning event.
- f7_ccittfax_skipped_with_warning — CCITTFax skip + Warning event (verifier M-4 split).
- ocr_engine_failure_surfaces_as_warning — OCR failure → Warning event.
- dual_block_ordinals_are_deterministic_and_unique — ordinal invariant.
- cancel_handle_aborts_mid_pdf — cancel handle 의 production source (D3).

MockOcrEngine fixture: spec §5.5 line 1284-1299. F3 fixture 부재 →
mock CanonicalDocument construction + F1 bytes reuse pattern (Option B:
PdfTextExtractor::extract 를 통한 실제 production path canonical 생성).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.1 + §5.5)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 4 D1+D2+D3)
prior: c2cd3a7 (Step 3) + 8d81bc1 (Step 3 clippy fix)
contract: §9 (additive minor wire bump — 후속 step)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 06:42:01 +00:00