Define OcrEngine trait + a Tesseract-backed default implementation. Populate ImageRefBlock.ocr with OcrText { joined, regions, engine, engine_version }. Provide an apple-vision feature gate that switches to a sidecar binary on macOS.
Why now / why this size
Strict separation of OCR (observed text) from caption (model-generated). Confining engine choice to a single trait + adapter lets us swap to Apple Vision or PaddleOCR without touching the extractor or chunker.
Languages from config.ocr.languages (default ["eng", "kor"]).
Recognition produces OcrRegion { bbox: (x, y, w, h), text, confidence } for each "word" or "line" (configurable; default "line").
Drop regions with confidence < config.ocr.min_confidence (default 60.0). If all dropped, return OcrText { joined: "", regions: vec![], engine, engine_version }.
joined = regions.iter().map(|r| r.text).join(" ") (no smart layout reconstruction in v1).
engine = "tesseract", engine_version = <tesseract version string>. The tesseract crate (0.13+) does NOT expose a stable Rust version() accessor. Use one of: (a) call libtesseract's TessVersion() via the bundled FFI surface, OR (b) at adapter construction, shell-out tesseract --version once and cache the parsed "5.3.4"-style string. Both are deterministic for a fixed install. Pin the chosen approach in the implementation PR.
Apple Vision sidecar (feature apple-vision):
Spawn a small Swift binary kebab-vision-ocr (path from config.ocr.apple_vision_binary) feeding the image via stdin and reading JSON { regions: [{x,y,w,h,text,confidence}, ...] } from stdout.
Same threshold and joined rules as Tesseract. engine = "apple-vision", engine_version = sidecar's --version.
This subagent task does NOT write the Swift sidecar; it only wires the Rust side. Document the expected sidecar interface in docs/spec/sidecar-vision.md (separate doc spec stub, optional).
apply_ocr calls engine.recognize, sets block.ocr = Some(text), and appends a Provenance::OcrApplied event in the caller's CanonicalDocument (caller responsibility — this task exposes a helper).
Streaming / large images: cap decoded image size at 8192×8192 before passing to OCR; downscale with image::imageops::resize if larger.
Trust: OcrText is observed text (high trust). Captions (ModelCaption) are NOT generated here.
Determinism: Tesseract is deterministic for a fixed input + fixed page-segmentation mode; apply_ocr asserts this by calling twice in dev tests. Apple Vision is also deterministic in practice but may vary across macOS versions; document this and accept.
Storage / wire effects
None.
Test plan
kind
description
fixture / data
unit
Tesseract recognizes English on fixtures/image/hello-world.png (joined contains "hello world")
fixture
unit
confidence threshold drops noise regions
fixture with low-quality text
unit
Korean text recognized when kor language enabled
fixtures/image/안녕.png
unit
empty result returns OcrText { joined: "", regions: [], .. } not error
fixtures/image/no-text.png
unit
apply_ocr mutates block.ocr from None → Some
inline
determinism
two runs of recognize on same input → identical OcrText