kebab/tasks/p6/p6-2-ocr-adapter.md
altair823 4ed5536c92 feat(kebab-parse-image): P6-2 OCR adapter — Ollama-vision default
- Add new module `crates/kebab-parse-image/src/ocr.rs`: the spec's `OcrEngine`
  trait as written, plus the default `OllamaVisionOcr` implementation and the
  `apply_ocr` helper.
- `OllamaVisionOcr`: makes a non-streaming call to `<endpoint>/api/generate`,
  passes the image in the `images: [base64]` field, and builds a prompt that
  includes the language hint and the whitelisted language list. The response
  prose becomes `OcrText.joined`, wrapped as a single region (confidence 1.0)
  covering the whole prepared image. Default model is `gemma4:e4b`. If the
  endpoint is empty, it falls back to `models.llm.endpoint`.
- Image preprocessing: when the long edge exceeds
  `config.image.ocr.max_pixels` (default 1600, clamped to 256-4096), the image
  is resized (image::imageops::resize, Triangle filter) and re-encoded as PNG.
  PNG input within the limit is passed through zero-copy.
- `apply_ocr`: on OCR success, sets block.ocr to Some and appends a
  ProvenanceKind::OcrApplied event. On failure, block.ocr stays None and no
  provenance is recorded (no partial state leaks).
- `kebab-config`: new `ImageCfg.ocr: OcrCfg` block (enabled/engine/model
  /endpoint/languages/max_pixels). `#[serde(default)]` keeps pre-P6 TOML
  compatible. Adds five `KEBAB_IMAGE_OCR_*` environment variables.
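
The request shape described in the `OllamaVisionOcr` bullet can be sketched with the standard library alone. This is a minimal illustration, not the adapter's code: `base64_encode`, `json_escape`, and `build_generate_body` are hypothetical helper names, and the prompt text is whatever the caller passes in. The `model`, `prompt`, `images`, and `stream` fields are the ones Ollama's `/api/generate` endpoint accepts.

```rust
/// Standard base64 (RFC 4648 alphabet, `=` padding), written by hand so the
/// sketch needs no external crate.
fn base64_encode(input: &[u8]) -> String {
    const TABLE: &[u8; 64] =
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(TABLE[(n >> 18) as usize & 63] as char);
        out.push(TABLE[(n >> 12) as usize & 63] as char);
        // Pad with '=' when the final chunk has fewer than three bytes.
        out.push(if chunk.len() > 1 { TABLE[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { TABLE[n as usize & 63] as char } else { '=' });
    }
    out
}

/// Escapes a string for embedding inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds a non-streaming `/api/generate` body carrying one base64 image.
fn build_generate_body(model: &str, prompt: &str, image_bytes: &[u8]) -> String {
    format!(
        "{{\"model\":\"{}\",\"prompt\":\"{}\",\"images\":[\"{}\"],\"stream\":false}}",
        json_escape(model),
        json_escape(prompt),
        base64_encode(image_bytes)
    )
}
```

In a real adapter a JSON library such as `serde_json` (already on the allowed dependency list) would replace the hand-written escaping; it is spelled out here only to keep the sketch dependency-free.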

## Spec deviation

The original P6-2 spec designated Tesseract as the default OCR engine, but
the default was switched to Ollama-vision to avoid installing the
`libtesseract-dev` system package on dev / CI hosts. The `OcrEngine` trait
abstraction is preserved exactly as in the spec, so Tesseract / Apple Vision /
PaddleOCR adapters can be added later behind feature gates using the same
trait. See the 2026-05-02 entry in `tasks/HOTFIXES.md` for details.

Trust: a vision LM can hallucinate. The `OcrText.engine = "ollama-vision"`
field lets consumers branch on trust per engine.
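
A consumer-side branch on the engine name might look like the sketch below. The `OcrTrust` enum and `trust_for_engine` function are hypothetical and not part of `kebab-core`; only the engine strings `"tesseract"`, `"apple-vision"`, and `"ollama-vision"` come from this task. Unknown engines are treated conservatively here, which is a design assumption rather than documented behavior.

```rust
/// Hypothetical trust levels a consumer could attach to OCR output.
#[derive(Debug, PartialEq, Eq)]
enum OcrTrust {
    /// Text read by a classical OCR engine.
    Observed,
    /// Text transcribed by a vision language model, which may hallucinate.
    ModelTranscribed,
}

/// Maps `OcrText.engine` to a trust level. Anything that is not a known
/// classical engine falls into the lower-trust bucket.
fn trust_for_engine(engine: &str) -> OcrTrust {
    match engine {
        "tesseract" | "apple-vision" => OcrTrust::Observed,
        _ => OcrTrust::ModelTranscribed,
    }
}
```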

## Tests

- New (`tests/ocr.rs`, 8 + 1 ignored):
  - 200 happy path → OcrText decoded (joined / engine / engine_version /
    region count / bbox / confidence)
  - empty response → empty regions
  - 5xx → Err including status + body
  - 200 error envelope → Err
  - apply_ocr → block.ocr Some + one Provenance OcrApplied event
  - apply_ocr error → block.ocr stays None + no events recorded
  - 4000×3000 PNG → downscaled to max_pixels=1024, aspect ratio preserved
  - from_parts clamps max_pixels
  - opt-in `KEBAB_OCR_INTEGRATION=1` integration test (transcription of
    "Hello World 2026" verified against the real Ollama at 192.168.0.47
    with `gemma4:e4b`)
- New (`src/ocr.rs` unit): truncate, build_prompt language/hint handling
- `kebab-config` tests +3: defaults, env override, pre-P6 TOML compatibility

Overall: `cargo test -p kebab-parse-image` 28 pass + 1 ignored,
`cargo test -p kebab-config` 20 pass,
`cargo clippy --workspace --all-targets -- -D warnings` pass.
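
The clamp and downscale cases exercised above reduce to a small amount of arithmetic. The following is a sketch under assumptions: `clamp_max_pixels` and `downscaled_dims` are hypothetical names, and the round-to-nearest rule is a choice made here, not necessarily what the crate does before handing the target size to `image::imageops::resize`.

```rust
/// Bounds taken from the commit message: max_pixels is clamped to 256-4096.
const MIN_MAX_PIXELS: u32 = 256;
const MAX_MAX_PIXELS: u32 = 4096;

/// Clamps a requested long-edge limit into the supported range.
fn clamp_max_pixels(requested: u32) -> u32 {
    requested.clamp(MIN_MAX_PIXELS, MAX_MAX_PIXELS)
}

/// Returns `None` when the image already fits (passthrough case), otherwise
/// the target size whose long edge equals `max_long_edge`, with the aspect
/// ratio preserved. Integer math with round-to-nearest keeps it deterministic.
fn downscaled_dims(width: u32, height: u32, max_long_edge: u32) -> Option<(u32, u32)> {
    let long = width.max(height) as u64;
    let max = max_long_edge as u64;
    if long <= max {
        return None;
    }
    let scale = |edge: u32| -> u32 {
        let scaled = (edge as u64 * max + long / 2) / long;
        // Never collapse an edge to zero pixels.
        scaled.max(1) as u32
    };
    Some((scale(width), scale(height)))
}
```

With these definitions, a 4000×3000 input and a limit of 1024 yields a 1024×768 target, and an 800×600 input under the default 1600 limit is left untouched.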

contract: docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
sections: §3.4 ImageRefBlock.ocr, §3.7a OcrText / OcrRegion, §9.1 OCR
vs caption provenance.
2026-05-02 05:38:24 +00:00


phase: P6
component: kebab-parse-image (OCR adapter)
task_id: p6-2
title: OcrEngine trait + Tesseract adapter (Apple Vision feature-gated)
status: completed
depends_on: p6-1
unblocks: p6-3
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
contract_sections: §3.4 ImageRefBlock.ocr; §3.7a OcrText/OcrRegion; §9.1 OCR vs caption provenance

p6-2 — OCR adapter

Goal

Define OcrEngine trait + a Tesseract-backed default implementation. Populate ImageRefBlock.ocr with OcrText { joined, regions, engine, engine_version }. Provide an apple-vision feature gate that switches to a sidecar binary on macOS.

Why now / why this size

Strict separation of OCR (observed text) from caption (model-generated). Confining engine choice to a single trait + adapter lets us swap to Apple Vision or PaddleOCR without touching the extractor or chunker.

Allowed dependencies

  • kebab-core
  • kebab-config
  • kebab-parse-image (consumes its types)
  • tesseract = "0.13" (feature tesseract, default ON)
  • For feature apple-vision: std::process::Command only (sidecar binary, not a Rust dep)
  • serde, serde_json
  • image
  • tracing
  • thiserror

Forbidden dependencies

  • kebab-source-fs, kebab-parse-md, kebab-normalize, kebab-chunk, kebab-store-*, kebab-embed*, kebab-search, kebab-llm*, kebab-rag, kebab-tui, kebab-desktop

Inputs

| input | type | source |
|---|---|---|
| image bytes | &[u8] | from extractor |
| optional language hint | kebab_core::Lang | metadata |
| kebab-config OCR settings | engine name, languages | runtime |

Outputs

| output | type | downstream |
|---|---|---|
| OcrText | kebab_core::OcrText | merged into ImageRefBlock.ocr |

Public surface (signatures only — no new types)

pub trait OcrEngine: Send + Sync {
    fn engine_name(&self) -> &'static str;
    fn engine_version(&self) -> String;
    fn recognize(&self, image_bytes: &[u8], lang_hint: Option<&kebab_core::Lang>) -> anyhow::Result<kebab_core::OcrText>;
}

pub struct TesseractOcr { /* internal: lazy api handle */ }
impl TesseractOcr { pub fn new(config: &kebab_config::Config) -> anyhow::Result<Self>; }
impl OcrEngine for TesseractOcr { /* per trait */ }

#[cfg(feature = "apple-vision")]
pub struct AppleVisionOcr { /* sidecar path */ }
#[cfg(feature = "apple-vision")]
impl OcrEngine for AppleVisionOcr { /* per trait */ }

pub fn apply_ocr(
    engine: &dyn OcrEngine,
    image_bytes: &[u8],
    block: &mut kebab_core::ImageRefBlock,
    lang_hint: Option<&kebab_core::Lang>,
) -> anyhow::Result<()>;
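
To illustrate how the trait and helper fit together, here is a simplified, dependency-free sketch. The types are stand-ins, not the `kebab_core` definitions: `OcrText` is trimmed to three fields, the language hint is a plain `&str`, errors are `String` instead of `anyhow::Error`, and `FixedOcr` is a hypothetical test double. Provenance recording is left to the caller.

```rust
/// Stand-in for kebab_core::OcrText (regions omitted for brevity).
#[derive(Debug, Clone, PartialEq)]
struct OcrText {
    joined: String,
    engine: String,
    engine_version: String,
}

/// Stand-in for kebab_core::ImageRefBlock.
#[derive(Debug, Default)]
struct ImageRefBlock {
    ocr: Option<OcrText>,
}

trait OcrEngine: Send + Sync {
    fn engine_name(&self) -> &'static str;
    fn engine_version(&self) -> String;
    fn recognize(&self, image_bytes: &[u8], lang_hint: Option<&str>) -> Result<OcrText, String>;
}

/// Hypothetical test double that returns a canned reply or a canned error.
struct FixedOcr {
    reply: Result<String, String>,
}

impl OcrEngine for FixedOcr {
    fn engine_name(&self) -> &'static str {
        "fixed"
    }
    fn engine_version(&self) -> String {
        "0".to_string()
    }
    fn recognize(&self, _image_bytes: &[u8], _lang_hint: Option<&str>) -> Result<OcrText, String> {
        let joined = self.reply.clone()?;
        Ok(OcrText {
            joined,
            engine: self.engine_name().to_string(),
            engine_version: self.engine_version(),
        })
    }
}

/// Runs recognition first and touches the block only on success, so a
/// failed call leaves `block.ocr` exactly as it was.
fn apply_ocr(
    engine: &dyn OcrEngine,
    image_bytes: &[u8],
    block: &mut ImageRefBlock,
    lang_hint: Option<&str>,
) -> Result<(), String> {
    let text = engine.recognize(image_bytes, lang_hint)?;
    block.ocr = Some(text);
    Ok(())
}
```

Because `apply_ocr` takes `&dyn OcrEngine`, swapping the engine does not change the call site, which is the point of confining engine choice to the trait.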

Behavior contract

  • Tesseract:
    • Languages from config.ocr.languages (default ["eng", "kor"]).
    • Recognition produces OcrRegion { bbox: (x, y, w, h), text, confidence } for each "word" or "line" (configurable; default "line").
    • Drop regions with confidence < config.ocr.min_confidence (default 60.0). If all dropped, return OcrText { joined: "", regions: vec![], engine, engine_version }.
    • joined = regions.iter().map(|r| r.text.as_str()).collect::<Vec<_>>().join(" ") (no smart layout reconstruction in v1).
    • engine = "tesseract", engine_version = <tesseract version string>. The tesseract crate (0.13+) does NOT expose a stable Rust version() accessor. Use one of: (a) call libtesseract's TessVersion() via the bundled FFI surface, OR (b) at adapter construction, shell-out tesseract --version once and cache the parsed "5.3.4"-style string. Both are deterministic for a fixed install. Pin the chosen approach in the implementation PR.
  • Apple Vision sidecar (feature apple-vision):
    • Spawn a small Swift binary kebab-vision-ocr (path from config.ocr.apple_vision_binary) feeding the image via stdin and reading JSON { regions: [{x,y,w,h,text,confidence}, ...] } from stdout.
    • Same threshold and joined rules as Tesseract. engine = "apple-vision", engine_version = sidecar's --version.
    • This subagent task does NOT write the Swift sidecar; it only wires the Rust side. Document the expected sidecar interface in docs/spec/sidecar-vision.md (separate doc spec stub, optional).
  • apply_ocr calls engine.recognize, sets block.ocr = Some(text), and appends a Provenance::OcrApplied event in the caller's CanonicalDocument (caller responsibility — this task exposes a helper).
  • Streaming / large images: cap decoded image size at 8192×8192 before passing to OCR; downscale with image::imageops::resize if larger.
  • Trust: OcrText is observed text (high trust). Captions (ModelCaption) are NOT generated here.
  • Determinism: Tesseract is deterministic for a fixed input + fixed page-segmentation mode; apply_ocr asserts this by calling twice in dev tests. Apple Vision is also deterministic in practice but may vary across macOS versions; document this and accept.
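
Option (b) for `engine_version` in the Tesseract bullet, shelling out once and parsing the result, could be sketched as follows. `parse_tesseract_version` and `detect_tesseract_version` are hypothetical names, and the parser assumes the first output line has the form `tesseract <version>`. The stderr fallback is a defensive assumption for builds that do not print the version to stdout.

```rust
use std::process::Command;

/// Extracts a "5.3.4"-style string from the first line of
/// `tesseract --version` output. Returns `None` if the line does not start
/// with the word "tesseract" followed by a version token.
fn parse_tesseract_version(output: &str) -> Option<String> {
    let first_line = output.lines().next()?;
    let mut parts = first_line.split_whitespace();
    if parts.next()? != "tesseract" {
        return None;
    }
    // Tolerate a leading 'v' on the version token.
    let version = parts.next()?.trim_start_matches('v');
    if version.is_empty() {
        None
    } else {
        Some(version.to_string())
    }
}

/// Runs `tesseract --version` once. A caller would cache the result at
/// adapter construction rather than invoking this per image.
fn detect_tesseract_version() -> Option<String> {
    let out = Command::new("tesseract").arg("--version").output().ok()?;
    // Fall back to stderr in case this build prints the version there.
    let raw = if out.stdout.is_empty() { out.stderr } else { out.stdout };
    parse_tesseract_version(&String::from_utf8_lossy(&raw))
}
```

Keeping the parser separate from the process call makes it testable on hosts without Tesseract installed.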

Storage / wire effects

  • None.

Test plan

| kind | description | fixture / data |
|---|---|---|
| unit | Tesseract recognizes English on fixtures/image/hello-world.png (joined contains "hello world") | fixture |
| unit | confidence threshold drops noise regions | fixture with low-quality text |
| unit | Korean text recognized when kor language enabled | fixtures/image/안녕.png |
| unit | empty result returns OcrText { joined: "", regions: [], .. } not error | fixtures/image/no-text.png |
| unit | apply_ocr mutates block.ocr from None → Some | inline |
| determinism | two runs of recognize on same input → identical OcrText | fixture |
| #[cfg(feature = "apple-vision")] smoke | sidecar invocation captured (mock binary echoes fixed JSON) | inline mock |

All tests under cargo test -p kebab-parse-image ocr. Tesseract install required on CI host.
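
The threshold, empty-result, and determinism cases in the test plan can be exercised without Tesseract against a pure post-processing step. The sketch below uses stand-in `OcrRegion` / `OcrText` structs shaped after the behavior contract, and a hypothetical `finalize_regions` function. It assumes confidence is on Tesseract's 0-100 scale, matching the 60.0 default.

```rust
/// Stand-in for kebab_core::OcrRegion; bbox is (x, y, w, h).
#[derive(Debug, Clone, PartialEq)]
struct OcrRegion {
    bbox: (u32, u32, u32, u32),
    text: String,
    confidence: f32,
}

/// Stand-in for kebab_core::OcrText.
#[derive(Debug, Clone, PartialEq)]
struct OcrText {
    joined: String,
    regions: Vec<OcrRegion>,
    engine: String,
    engine_version: String,
}

/// Drops regions below `min_confidence`, then joins the surviving texts
/// with single spaces in their original order. If nothing survives, the
/// result has an empty `joined` and no regions; it is not an error.
fn finalize_regions(
    regions: Vec<OcrRegion>,
    min_confidence: f32,
    engine: &str,
    engine_version: &str,
) -> OcrText {
    let kept: Vec<OcrRegion> = regions
        .into_iter()
        .filter(|r| r.confidence >= min_confidence)
        .collect();
    let joined = kept
        .iter()
        .map(|r| r.text.as_str())
        .collect::<Vec<_>>()
        .join(" ");
    OcrText {
        joined,
        regions: kept,
        engine: engine.to_string(),
        engine_version: engine_version.to_string(),
    }
}
```

Since the function is pure, calling it twice on the same input gives identical output, which is the property the determinism row checks at the engine level.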

Definition of Done

  • cargo check -p kebab-parse-image --features tesseract passes
  • cargo test -p kebab-parse-image ocr passes
  • apple-vision feature compiles on macOS and gracefully no-ops on Linux
  • No imports outside Allowed dependencies
  • PR links design §3.4, §3.7a, §9.1

Out of scope

  • Caption (p6-3).
  • Visual embedding (P+).
  • Layout-aware reading order (P+).
  • PaddleOCR / EasyOCR adapters.

Risks / notes

  • Tesseract performance varies wildly with image quality; document min_confidence and default page-segmentation mode.
  • Apple Vision sidecar requires code signing for distribution; for v1 dev builds, accept unsigned binary from ~/.local/bin/kebab-vision-ocr.
  • Large image downscale loses small-text recognition; expose config.ocr.max_pixels so power users can tune.
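
For the sidecar noted above, the Rust side only needs `std::process::Command`: feed the image on stdin and collect stdout. This is a sketch of that plumbing, with `run_sidecar` as a hypothetical name. It returns the raw stdout bytes and leaves JSON decoding to the caller. Writing happens on a separate thread so that a large image cannot deadlock against a child that is already producing output.

```rust
use std::io::{self, Write};
use std::process::{Command, Stdio};
use std::thread;

/// Spawns `binary`, streams `image_bytes` to its stdin, and returns whatever
/// it wrote to stdout. A non-zero exit status is reported as an error.
fn run_sidecar(binary: &str, image_bytes: &[u8]) -> io::Result<Vec<u8>> {
    let mut child = Command::new(binary)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;

    let mut stdin = child.stdin.take().expect("stdin was requested as piped");
    let payload = image_bytes.to_vec();
    // The handle is dropped when the closure returns, which closes the pipe
    // and signals end-of-input to the child.
    let writer = thread::spawn(move || stdin.write_all(&payload));

    let output = child.wait_with_output()?;
    writer.join().expect("stdin writer thread panicked")?;

    if !output.status.success() {
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!("sidecar exited with {}", output.status),
        ));
    }
    Ok(output.stdout)
}
```

On a Unix host, passing `"cat"` as the binary echoes the input back, which is a convenient way to smoke-test the plumbing in the spirit of the mock binary in the test plan.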