feat(app): wire PDF OCR enrichment + cancel propagation into ingest_one_pdf_asset (H-5 eager init + post-extract hook + per-page cancel) + workspace lopdf dep (Step 4 M-4)

Step 6 (Group E) of v0.20.0 sub-item 1 (scanned PDF OCR) plan +
Step 7 spillover (IngestEvent variant + IngestItem field for compile
boundary) + Step 4 reviewer Minor M-4 fix.

E1: build the PDF OCR engine eagerly at `ingest_with_config_opts` entry,
mirroring the image OCR pattern (lib.rs:338-347). When `pdf.ocr.enabled ||
always_on` holds, call `OllamaVisionOcr::from_parts(endpoint, model, ...)`
and fail fast via `?`. No new App fields (local var only, consistent with
spec L-1 / Step 1 A1 cosmetic fix).
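
A minimal sketch of the E1 eager-init shape, under stated assumptions: `PdfOcrConfig`, `SketchOcrEngine`, and `build_pdf_ocr_engine` are hypothetical stand-ins for illustration, not the real `kebab` types, and the constructor's validation rule is invented. It only shows the "build once, fall back to the LLM endpoint, fail fast" flow.

```rust
// Hypothetical stand-in for the `pdf.ocr` config section.
#[derive(Debug, Clone, PartialEq)]
struct PdfOcrConfig {
    enabled: bool,
    always_on: bool,
    endpoint: Option<String>,
    model: String,
}

// Hypothetical stand-in for the OCR engine type.
#[derive(Debug, PartialEq)]
struct SketchOcrEngine {
    endpoint: String,
    model: String,
}

impl SketchOcrEngine {
    // Fallible constructor; the empty-model check is an invented failure mode.
    fn from_parts(endpoint: String, model: String) -> Result<Self, String> {
        if model.is_empty() {
            return Err("model name must not be empty".to_string());
        }
        Ok(Self { endpoint, model })
    }
}

/// Build the engine once per ingest run, or return `Ok(None)` when OCR is off.
fn build_pdf_ocr_engine(
    cfg: &PdfOcrConfig,
    llm_endpoint: &str,
) -> Result<Option<SketchOcrEngine>, String> {
    if !(cfg.enabled || cfg.always_on) {
        return Ok(None);
    }
    // An unset or empty PDF-specific endpoint falls back to the LLM endpoint.
    let endpoint = match cfg.endpoint.as_deref() {
        Some(s) if !s.is_empty() => s.to_string(),
        _ => llm_endpoint.to_string(),
    };
    // `?` surfaces a construction error before any asset is processed.
    Ok(Some(SketchOcrEngine::from_parts(endpoint, cfg.model.clone())?))
}
```

Because the result is a local `Option`, callers can pass it down as `engine.as_ref()` without adding state to a long-lived struct.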

E2: `ingest_one_pdf_asset` signature extension, three new parameters
(`pdf_ocr_engine: Option<&OllamaVisionOcr>`,
`progress: Option<&mpsc::Sender<IngestEvent>>`,
`cancel: Option<&Arc<AtomicBool>>`). The `ingest_one_asset` dispatch
wrapper and its caller (the dispatch loop) are updated to match.

E3: post-extract enrichment block immediately after `extract_for` (line 1779).
When `pdf.ocr.enabled || always_on` holds, call `apply_ocr_to_pdf_pages`,
map PdfOcrProgress to IngestEvent emits (PdfOcrStarted / PdfOcrFinished
with ocr_engine), and carry the summary's pages_ocrd/ms_total into the
IngestItem fields. Preserves the PR #187 registry dispatch invariant
(`extract_for(&asset.media_type, ...)` unchanged).
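
The E3 callback-to-event mapping can be sketched roughly as follows. Everything here is a simplified assumption: `PageProgress`, `Event`, `Summary`, `ocr_pages`, and `run_with_progress` are hypothetical stand-ins, and the per-page duration is a fixed fake value rather than a measurement.

```rust
use std::sync::mpsc;

// Hypothetical stand-in for the per-page progress reported by the OCR helper.
#[derive(Debug, PartialEq)]
enum PageProgress {
    Started { page: u32 },
    Finished { page: u32, ms: u64, chars: u32 },
}

// Hypothetical stand-in for the ingest-level event enum.
#[derive(Debug, PartialEq)]
enum Event {
    PdfOcrStarted { page: u32 },
    PdfOcrFinished { page: u32, ms: u64, chars: u32, ocr_engine: String },
}

struct Summary {
    pages_ocrd: u32,
    ms_total: u64,
}

// Fake OCR pass: reports Started/Finished for every page through the callback.
fn ocr_pages(page_texts: &[&str], mut on_progress: impl FnMut(PageProgress)) -> Summary {
    let mut summary = Summary { pages_ocrd: 0, ms_total: 0 };
    for (idx, text) in page_texts.iter().enumerate() {
        let page = idx as u32 + 1;
        on_progress(PageProgress::Started { page });
        let ms: u64 = 5; // fixed fake duration, not a real measurement
        let chars = text.chars().count() as u32;
        on_progress(PageProgress::Finished { page, ms, chars });
        summary.pages_ocrd += 1;
        summary.ms_total += ms;
    }
    summary
}

/// Translate helper-level progress into events and return the per-item totals.
fn run_with_progress(
    page_texts: &[&str],
    engine_name: &str,
    progress: Option<&mpsc::Sender<Event>>,
) -> (Option<u32>, Option<u64>) {
    let summary = ocr_pages(page_texts, |p| {
        let event = match p {
            PageProgress::Started { page } => Event::PdfOcrStarted { page },
            PageProgress::Finished { page, ms, chars } => Event::PdfOcrFinished {
                page,
                ms,
                chars,
                ocr_engine: engine_name.to_string(),
            },
        };
        if let Some(sender) = progress {
            // A dropped receiver should not fail the ingest, so the error is ignored.
            let _ = sender.send(event);
        }
    });
    (Some(summary.pages_ocrd), Some(summary.ms_total))
}
```

Passing `None` for `progress` runs the same path silently, which keeps callers that do not display progress unchanged.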

E4: cancel handle propagation along the chain
ingest_with_config_cancellable → IngestOpts.cancel →
ingest_with_config_opts → ingest_one_asset → ingest_one_pdf_asset
(new `cancel` param) → PdfOcrOpts.cancel. This is the production
wiring for spec §4.8 line 1159.
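
A rough sketch of the E4 per-page cancel check, assuming the flag is polled before each page. `OcrOpts`, `ocr_pages_cancellable`, and `ingest_pdf` are hypothetical names, and `Ordering::Relaxed` is an assumption made for the sketch, not a statement about the real code.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Hypothetical stand-in for the options struct that owns a clone of the flag.
struct OcrOpts {
    cancel: Option<Arc<AtomicBool>>,
}

/// Process pages in order; returns how many pages ran before cancel was seen.
fn ocr_pages_cancellable(
    page_count: u32,
    opts: &OcrOpts,
    mut ocr_one_page: impl FnMut(u32),
) -> u32 {
    let mut done = 0;
    for page in 1..=page_count {
        // Poll the shared flag before starting each page.
        let cancelled = opts
            .cancel
            .as_ref()
            .is_some_and(|flag| flag.load(Ordering::Relaxed));
        if cancelled {
            break;
        }
        ocr_one_page(page);
        done += 1;
    }
    done
}

/// Outer layer: receives a borrowed handle and clones the `Arc` into the opts.
fn ingest_pdf(
    page_count: u32,
    cancel: Option<&Arc<AtomicBool>>,
    ocr_one_page: impl FnMut(u32),
) -> u32 {
    let opts = OcrOpts { cancel: cancel.cloned() };
    ocr_pages_cancellable(page_count, &opts, ocr_one_page)
}
```

Cloning the `Arc` only bumps a reference count, so every layer observes the same flag that the caller sets.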

Step 7 spillover (compile boundary):
- `kebab_app::ingest_progress::IngestEvent`: PdfOcrStarted { page } +
  PdfOcrFinished { page, ms, chars, ocr_engine }. serde discriminant
  `pdf_ocr_started` / `pdf_ocr_finished` (matches the Step 7 G3 wire schema).
- `kebab_core::IngestItem`: pdf_ocr_pages: Option<u32> +
  pdf_ocr_ms_total: Option<u64> (placed between warnings and error). The 11
  non-PDF IngestItem construct sites fill in `None`.
- `kebab-cli/src/progress.rs` + `kebab-tui/src/ingest_progress.rs`:
  no-op handlers for the new variants (v1 does not surface per-page
  progress; can be enabled in a future refinement).
- `kebab-store-sqlite/tests/ingest_report_snapshot.rs` + snapshot
  `ingest_report.snapshot.json`: new fields added to the 2 IngestItem fixtures.
- The Step 7 JSON Schema update + CLI printer activation + snapshot
  regeneration land in a separate commit (G3/H1/H2 deliverable).
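
Why the CLI and TUI handlers had to change in the same commit can be illustrated with a small sketch: an exhaustive `match` without a wildcard arm stops compiling when the enum gains variants, so each consumer needs at least a no-op arm. `Event`, `State`, and `apply_event` below are hypothetical, trimmed-down stand-ins.

```rust
// Hypothetical, trimmed-down event enum with the two new OCR variants.
#[allow(dead_code)]
#[derive(Debug)]
enum Event {
    AssetFinished,
    PdfOcrStarted { page: u32 },
    PdfOcrFinished { page: u32, ms: u64 },
}

// Hypothetical consumer state, e.g. a progress counter.
#[derive(Debug, Default, PartialEq)]
struct State {
    assets_done: u32,
}

/// Exhaustive match with no `_` arm: adding a variant to `Event`
/// without handling it here is a compile error.
fn apply_event(state: &mut State, event: Event) {
    match event {
        Event::AssetFinished => state.assets_done += 1,
        // Per-page OCR events are accepted but deliberately ignored.
        Event::PdfOcrStarted { .. } | Event::PdfOcrFinished { .. } => {}
    }
}
```

Listing the ignored variants explicitly, rather than using `_ => {}`, keeps the compiler check active for any variants added later.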

M-4 (Step 4 reviewer Minor): consolidate lopdf as a workspace dep:
- workspace `Cargo.toml [workspace.dependencies] lopdf = "0.32"`.
- direct deps in kebab-app + kebab-parse-pdf → `{ workspace = true }`.

Verifier evidence:
- workspace test (`cargo test --workspace --no-fail-fast -j 1`):
  175 test result summary lines, 0 failures, 0 FAILED.
- workspace clippy (`-D warnings`): exit 0, 0 warnings.
- dep graph baseline (`.omc/state/pdf-ocr-{parse-pdf,app-parse}-deps.baseline.txt`):
  empty diff for both.

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.4 + §4.6 + §4.8)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 6 E1-E4 + Step 7 partial G1+G2)
prior: 4672cba (Step 5 fix) + fd918a6 (Step 5) + 9f003ef (Step 4 helper)
contract: §9 (additive minor wire bump, once the Step 7 JSON Schema is complete)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 08:18:34 +00:00
parent 4672cba6c6
commit b9ee09f176
10 changed files with 149 additions and 4 deletions

@@ -141,6 +141,7 @@ proptest = "1"
# p9-fb-19: LRU cache for `App::search` results. Bounded capacity
# from `config.search.cache_capacity` (default 256, ~1.3 MB cap).
lru = "0.12"
lopdf = "0.32"
# fastembed-rs ships ONNX runtime via the `ort-download-binaries` feature
# in its default set (which also pulls `hf-hub` for first-run model
# downloads). Pinned to the 4.x line per task p3-2 (current 5.x release

@@ -35,7 +35,7 @@ kebab-parse-image = { path = "../kebab-parse-image" }
# per-asset dispatch (see `ingest_one_asset` PDF branch) and runs the
# resulting `CanonicalDocument` through `kebab-chunk::PdfPageV1Chunker`.
kebab-parse-pdf = { path = "../kebab-parse-pdf" }
-lopdf = "0.32"
+lopdf = { workspace = true }
# p10-1A-2: Rust AST extractor lives here. App threads it into the
# per-asset dispatch (see `ingest_one_asset` Code branch) and runs the
# resulting `CanonicalDocument` through `kebab-chunk::CodeRustAstV1Chunker`.
@@ -76,7 +76,7 @@ image = { version = "0.25", default-features = false, features =
# lopdf builder pattern `kebab-parse-pdf::tests::common` uses; pinned
# to the same major (0.32) so byte output is identical between the two
# fixture surfaces.
-lopdf = "0.32"
+lopdf = { workspace = true }
# error_wire::tests::llm_unreachable_classifies_to_model_unreachable needs a real
# reqwest::Error (private constructor) — built from a connect-refused call.
reqwest = { version = "0.12", default-features = false, features = ["blocking", "rustls-tls"] }

@@ -85,6 +85,15 @@ pub enum IngestEvent {
/// aggregate at the cancel boundary. Emitted by `p9-fb-04`; this
/// task never produces `Aborted`.
Aborted { counts: AggregateCounts },
/// Emitted when OCR starts for a PDF page. v0.20.0 sub-item 1.
PdfOcrStarted { page: u32 },
/// Emitted when OCR finishes for a PDF page. v0.20.0 sub-item 1.
PdfOcrFinished {
page: u32,
ms: u64,
chars: u32,
ocr_engine: String,
},
}
/// Map a `MediaType` to the short label used by `IngestEvent::AssetStarted`.

@@ -48,7 +48,7 @@ use kebab_core::{
SourceUri, VectorRecord, VectorStore,
};
use kebab_llm_local::OllamaLanguageModel;
-use kebab_parse_image::{OllamaVisionOcr, apply_caption, apply_ocr};
+use kebab_parse_image::{OcrEngine, OllamaVisionOcr, apply_caption, apply_ocr};
use kebab_parse_md::{BodyHints, build_canonical_document, parse_blocks, parse_frontmatter};
use kebab_source_fs::FsSourceConnector;
@@ -357,6 +357,29 @@ pub fn ingest_with_config_opts(
caption_llm: caption_llm.as_deref(),
};
// p10 / v0.20 sub-item 1: PDF OCR engine eager init (H-5 resolution).
// Mirrors the image OCR pattern: built once per ingest; a build failure fails fast.
let pdf_ocr_engine: Option<OllamaVisionOcr> =
if app.config.pdf.ocr.enabled || app.config.pdf.ocr.always_on {
let cfg = &app.config.pdf.ocr;
let endpoint = match cfg.endpoint.as_deref() {
Some(s) if !s.is_empty() => s.to_string(),
_ => app.config.models.llm.endpoint.clone(),
};
Some(
OllamaVisionOcr::from_parts(
endpoint,
cfg.model.clone(),
cfg.languages.clone(),
cfg.max_pixels,
cfg.request_timeout_secs,
)
.context("kb-app::ingest: build OllamaVisionOcr (pdf)")?,
)
} else {
None
};
// Pre-load every existing doc_id so we can label `IngestItem.kind`
// as `New` vs `Updated` correctly. `list_documents` returns one
// row per `(workspace_path, asset_id)` — index by the deterministic
@@ -448,6 +471,9 @@ pub fn ingest_with_config_opts(
&existing_doc_ids,
&image_pipeline,
force_reingest,
pdf_ocr_engine.as_ref(),
progress,
opts.cancel.as_ref(),
);
let item = match item {
@@ -476,6 +502,8 @@ pub fn ingest_with_config_opts(
parser_version: None,
chunker_version: None,
warnings: Vec::new(),
pdf_ocr_pages: None,
pdf_ocr_ms_total: None,
error: Some(format!("{e:#}")),
}
}
@@ -864,6 +892,8 @@ fn try_skip_unchanged(
parser_version: Some(existing_doc.parser_version.clone()),
chunker_version: existing_doc.last_chunker_version.clone(),
warnings: Vec::new(),
pdf_ocr_pages: None,
pdf_ocr_ms_total: None,
error: None,
}));
}
@@ -922,6 +952,8 @@ fn try_skip_unchanged(
parser_version: Some(existing_doc.parser_version.clone()),
chunker_version: existing_doc.last_chunker_version.clone(),
warnings: Vec::new(),
pdf_ocr_pages: None,
pdf_ocr_ms_total: None,
error: None,
}))
}
@@ -964,6 +996,9 @@ fn ingest_one_asset(
existing_doc_ids: &std::collections::HashSet<String>,
image_pipeline: &ImagePipeline<'_>,
force_reingest: bool,
pdf_ocr_engine: Option<&OllamaVisionOcr>,
progress: Option<&std::sync::mpsc::Sender<crate::ingest_progress::IngestEvent>>,
cancel: Option<&std::sync::Arc<std::sync::atomic::AtomicBool>>,
) -> anyhow::Result<kebab_core::IngestItem> {
tracing::debug!(
target: "kebab-app::ingest",
@@ -999,6 +1034,9 @@ fn ingest_one_asset(
vector_store,
existing_doc_ids,
force_reingest,
pdf_ocr_engine,
progress,
cancel,
);
}
// p10-1A-2 / 1B: code ingest dispatch. p10-2: Tier 2 langs added. p10-3: shell added. p10-1D: c/cpp added.
@@ -1033,6 +1071,8 @@ fn ingest_one_asset(
parser_version: None,
chunker_version: None,
warnings: vec![unsupported_media_warning(&asset.workspace_path.0)],
pdf_ocr_pages: None,
pdf_ocr_ms_total: None,
error: None,
});
}
@@ -1052,6 +1092,8 @@ fn ingest_one_asset(
parser_version: None,
chunker_version: None,
warnings: vec!["kb:// URI not yet supported".to_string()],
pdf_ocr_pages: None,
pdf_ocr_ms_total: None,
error: None,
});
}
@@ -1201,6 +1243,8 @@ fn ingest_one_asset(
parser_version: Some(parser_version.clone()),
chunker_version: Some(MdHeadingV1Chunker.chunker_version()),
warnings: warning_notes,
pdf_ocr_pages: None,
pdf_ocr_ms_total: None,
error: None,
})
}
@@ -1246,6 +1290,8 @@ fn ingest_one_image_asset(
warnings: vec![
"kb:// URI not yet supported".to_string(),
],
pdf_ocr_pages: None,
pdf_ocr_ms_total: None,
error: None,
});
}
@@ -1456,6 +1502,8 @@ fn ingest_one_image_asset(
parser_version: Some(canonical.parser_version.clone()),
chunker_version: Some(MdHeadingV1Chunker.chunker_version()),
warnings: warning_notes,
pdf_ocr_pages: None,
pdf_ocr_ms_total: None,
error: None,
})
}
@@ -1726,6 +1774,9 @@ fn ingest_one_pdf_asset(
vector_store: Option<&Arc<kebab_store_vector::LanceVectorStore>>,
existing_doc_ids: &std::collections::HashSet<String>,
force_reingest: bool,
pdf_ocr_engine: Option<&OllamaVisionOcr>,
progress: Option<&std::sync::mpsc::Sender<crate::ingest_progress::IngestEvent>>,
cancel: Option<&std::sync::Arc<std::sync::atomic::AtomicBool>>,
) -> anyhow::Result<kebab_core::IngestItem> {
let path = match &asset.source_uri {
SourceUri::File(p) => p.clone(),
@@ -1743,6 +1794,8 @@ fn ingest_one_pdf_asset(
warnings: vec![
"kb:// URI not yet supported".to_string(),
],
pdf_ocr_pages: None,
pdf_ocr_ms_total: None,
error: None,
});
}
@@ -1779,6 +1832,62 @@ fn ingest_one_pdf_asset(
.extract_for(&asset.media_type, &ctx, &bytes)
.context("kb-app::extract_for (pdf)")?;
// v0.20 sub-item 1: post-extract OCR enrichment (PR #187 registry
// dispatch invariant preserved: extract_for stays the normal entry point).
let (pdf_ocr_pages, pdf_ocr_ms_total): (Option<u32>, Option<u64>) =
if app.config.pdf.ocr.enabled || app.config.pdf.ocr.always_on {
match pdf_ocr_engine {
Some(engine) => {
let ocr_opts = crate::pdf_ocr_apply::PdfOcrOpts {
enabled: app.config.pdf.ocr.enabled || app.config.pdf.ocr.always_on,
always_on: app.config.pdf.ocr.always_on,
valid_ratio_threshold: app.config.pdf.ocr.valid_ratio_threshold,
min_char_count: app.config.pdf.ocr.min_char_count,
lang_hint: app.config.pdf.ocr.lang_hint.clone().map(kebab_core::Lang),
cancel: cancel.cloned(),
};
let summary = crate::pdf_ocr_apply::apply_ocr_to_pdf_pages(
&mut canonical,
engine,
&bytes,
&ocr_opts,
|p| match p {
crate::pdf_ocr_apply::PdfOcrProgress::Started { page } => {
if let Some(sender) = progress {
let _ = sender.send(
crate::ingest_progress::IngestEvent::PdfOcrStarted {
page,
},
);
}
}
crate::pdf_ocr_apply::PdfOcrProgress::Finished {
page,
ms,
chars,
skipped: _,
} => {
if let Some(sender) = progress {
let _ = sender.send(
crate::ingest_progress::IngestEvent::PdfOcrFinished {
page,
ms,
chars,
ocr_engine: engine.engine_name().to_string(),
},
);
}
}
},
)?;
(Some(summary.pages_ocrd), Some(summary.ms_total))
}
None => (Some(0), Some(0)),
}
} else {
(None, None)
};
// Per-medium chunker selection: PDF docs always use pdf-page-v1
// regardless of `config.chunking.chunker_version`. The chunker
// validates every block carries `SourceSpan::Page`; failure here
@@ -1880,6 +1989,8 @@ fn ingest_one_pdf_asset(
parser_version: Some(canonical.parser_version.clone()),
chunker_version: Some(chunker.chunker_version()),
warnings,
pdf_ocr_pages,
pdf_ocr_ms_total,
error: None,
})
}
@@ -1921,6 +2032,8 @@ fn ingest_one_code_asset(
warnings: vec![
"kb:// URI not yet supported".to_string(),
],
pdf_ocr_pages: None,
pdf_ocr_ms_total: None,
error: None,
});
}
@@ -2227,6 +2340,8 @@ fn ingest_one_code_asset(
parser_version: Some(canonical.parser_version.clone()),
chunker_version: Some(chunker_version),
warnings,
pdf_ocr_pages: None,
pdf_ocr_ms_total: None,
error: None,
})
}

@@ -201,6 +201,9 @@ impl ProgressDisplay {
);
}
}
// v0.20.0 sub-item 1: per-page PDF OCR events — not surfaced in
// human-readable progress output (no TTY bar update needed).
IngestEvent::PdfOcrStarted { .. } | IngestEvent::PdfOcrFinished { .. } => {}
}
Ok(())
}

@@ -83,6 +83,12 @@ pub struct IngestItem {
pub parser_version: Option<ParserVersion>,
pub chunker_version: Option<ChunkerVersion>,
pub warnings: Vec<String>,
/// v0.20.0 sub-item 1: number of PDF pages that went through the OCR pipeline.
/// `None` = OCR disabled or non-PDF asset.
pub pdf_ocr_pages: Option<u32>,
/// v0.20.0 sub-item 1: cumulative OCR engine wall-clock duration (ms).
/// `None` = OCR disabled or non-PDF asset.
pub pdf_ocr_ms_total: Option<u64>,
pub error: Option<String>,
}

@@ -20,7 +20,7 @@ tracing = { workspace = true }
# crates (pom, postscript, type1-encoding-parser, …) buy us nothing
# at v1 (we don't call its whole-doc API), and the future scanned-PDF
# OCR fallback can re-add it when it actually needs it.
-lopdf = "0.32"
+lopdf = { workspace = true }
[dev-dependencies]
blake3 = { workspace = true }

@@ -13,6 +13,8 @@
"error": null,
"kind": "new",
"parser_version": "md-frontmatter-v2",
"pdf_ocr_ms_total": null,
"pdf_ocr_pages": null,
"warnings": []
},
{
@@ -26,6 +28,8 @@
"error": null,
"kind": "updated",
"parser_version": "md-frontmatter-v2",
"pdf_ocr_ms_total": null,
"pdf_ocr_pages": null,
"warnings": [
"malformed frontmatter"
]

@@ -54,6 +54,8 @@ fn fixture_report() -> IngestReport {
parser_version: Some(ParserVersion("md-frontmatter-v2".into())),
chunker_version: Some(ChunkerVersion("md-heading-v1".into())),
warnings: vec![],
pdf_ocr_pages: None,
pdf_ocr_ms_total: None,
error: None,
},
IngestItem {
@@ -67,6 +69,8 @@ fn fixture_report() -> IngestReport {
parser_version: Some(ParserVersion("md-frontmatter-v2".into())),
chunker_version: Some(ChunkerVersion("md-heading-v1".into())),
warnings: vec!["malformed frontmatter".into()],
pdf_ocr_pages: None,
pdf_ocr_ms_total: None,
error: None,
},
]),

@@ -154,6 +154,9 @@ fn apply_event(state: &mut IngestState, event: IngestEvent) {
state.terminal_at = Some(std::time::Instant::now());
state.aborted = true;
}
// v0.20.0 sub-item 1: per-page PDF OCR events — TUI does not
// surface per-page OCR progress in v1; no counter to update.
IngestEvent::PdfOcrStarted { .. } | IngestEvent::PdfOcrFinished { .. } => {}
}
}