Files
kebab/tasks/p6/p6-2-ocr-adapter.md
altair823 4ed5536c92 feat(kebab-parse-image): P6-2 OCR adapter — Ollama-vision default
- 새 모듈 `crates/kebab-parse-image/src/ocr.rs` 추가. spec 의 `OcrEngine`
  trait 그대로 + `OllamaVisionOcr` default 구현 + `apply_ocr` 헬퍼.
- `OllamaVisionOcr`: `<endpoint>/api/generate` 비스트리밍 호출,
  `images: [base64]` 필드로 이미지 전달, 프롬프트는 언어 힌트
  + 화이트리스트 언어 목록 포함. 응답 prose 를 `OcrText.joined` 로,
  prepared image 전체 영역 단일 region (confidence 1.0) 으로 wrap.
  기본 모델 `gemma4:e4b`. endpoint 비어 있으면 `models.llm.endpoint`
  로 fallback.
- 이미지 전처리: long-edge `config.image.ocr.max_pixels` (기본 1600,
  256~4096 클램프) 초과 시 PNG 로 재인코딩 (image::imageops::resize,
  Triangle filter). PNG 입력이 max 이내면 zero-copy passthrough.
- `apply_ocr` 는 OCR 성공 시 block.ocr 를 Some 으로 채우고
  ProvenanceKind::OcrApplied 이벤트 추가. 실패 시 block.ocr 는
  None 그대로 + provenance 미기록 (부분 상태 누출 금지).
- `kebab-config`: 새 `ImageCfg.ocr: OcrCfg` 블록 (enabled/engine/model
  /endpoint/languages/max_pixels). `#[serde(default)]` 로 pre-P6
  TOML 호환. `KEBAB_IMAGE_OCR_*` 환경변수 5종 추가.

## Spec deviation

원래 P6-2 spec 은 Tesseract 를 default OCR 엔진으로 지정했으나, dev /
CI 호스트에서 `libtesseract-dev` 시스템 패키지 설치를 피하려고
Ollama-vision 으로 default 를 교체. `OcrEngine` trait 추상화는 spec
그대로 보존 — Tesseract / Apple Vision / PaddleOCR 어댑터는 같은
trait 으로 추후 feature-gate 추가 가능. 자세한 내역은
`tasks/HOTFIXES.md` 2026-05-02 항목 참조.

Trust 측면: vision LM 은 hallucinate 가능. `OcrText.engine = "ollama-vision"`
필드로 consumer 가 엔진 별 신뢰 분기 가능.

## 테스트

- 신규 (`tests/ocr.rs`, 8 + 1 ignored):
  - 200 happy → OcrText 디코딩 (joined / engine / engine_version /
    region count / bbox / confidence)
  - 빈 응답 → 빈 regions
  - 5xx → Err with status + body 포함
  - 200 error envelope → Err
  - apply_ocr → block.ocr Some + Provenance OcrApplied 1건
  - apply_ocr error → block.ocr None 유지 + events 미기록
  - 4000×3000 PNG → max_pixels=1024 까지 다운스케일, aspect ratio 보존
  - from_parts max_pixels 클램프
  - opt-in `KEBAB_OCR_INTEGRATION=1` 통합 (실제 192.168.0.47 Ollama
    `gemma4:e4b` 로 \"Hello World 2026\" 전사 검증 완료)
- 신규 (`src/ocr.rs` unit): truncate, build_prompt 언어/힌트 처리
- `kebab-config` 테스트 +3: defaults, env override, pre-P6 TOML 호환

전체: `cargo test -p kebab-parse-image` 28 pass + 1 ignored,
`cargo test -p kebab-config` 20 pass,
`cargo clippy --workspace --all-targets -- -D warnings` pass.

contract: docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
sections: §3.4 ImageRefBlock.ocr, §3.7a OcrText / OcrRegion, §9.1 OCR
vs caption provenance.
2026-05-02 05:38:24 +00:00

134 lines
6.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
phase: P6
component: kebab-parse-image (OCR adapter)
task_id: p6-2
title: "OcrEngine trait + Tesseract adapter (Apple Vision feature-gated)"
status: completed
depends_on: [p6-1]
unblocks: [p6-3]
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
contract_sections: [§3.4 ImageRefBlock.ocr, §3.7a OcrText/OcrRegion, §9.1 OCR vs caption provenance]
---
# p6-2 — OCR adapter
## Goal
Define `OcrEngine` trait + a Tesseract-backed default implementation. Populate `ImageRefBlock.ocr` with `OcrText { joined, regions, engine, engine_version }`. Provide an `apple-vision` feature gate that switches to a sidecar binary on macOS.
## Why now / why this size
Strict separation of OCR (observed text) from caption (model-generated). Confining engine choice to a single trait + adapter lets us swap to Apple Vision or PaddleOCR without touching the extractor or chunker.
## Allowed dependencies
- `kebab-core`
- `kebab-config`
- `kebab-parse-image` (consumes its types)
- `tesseract = "0.13"` (feature `tesseract`, default ON)
- For feature `apple-vision`: `std::process::Command` only (sidecar binary, not a Rust dep)
- `serde`, `serde_json`
- `image`
- `tracing`
- `thiserror`
## Forbidden dependencies
- `kebab-source-fs`, `kebab-parse-md`, `kebab-normalize`, `kebab-chunk`, `kebab-store-*`, `kebab-embed*`, `kebab-search`, `kebab-llm*`, `kebab-rag`, `kebab-tui`, `kebab-desktop`
## Inputs
| input | type | source |
|-------|------|--------|
| image bytes | `&[u8]` | from extractor |
| optional language hint | `kebab_core::Lang` | metadata |
| `kebab-config` OCR settings | engine name, languages | runtime |
## Outputs
| output | type | downstream |
|--------|------|------------|
| `OcrText` | `kebab_core::OcrText` | merged into `ImageRefBlock.ocr` |
## Public surface (signatures only — no new types)
```rust
pub trait OcrEngine: Send + Sync {
fn engine_name(&self) -> &'static str;
fn engine_version(&self) -> String;
fn recognize(&self, image_bytes: &[u8], lang_hint: Option<&kebab_core::Lang>) -> anyhow::Result<kebab_core::OcrText>;
}
pub struct TesseractOcr { /* internal: lazy api handle */ }
impl TesseractOcr { pub fn new(config: &kebab_config::Config) -> anyhow::Result<Self>; }
impl OcrEngine for TesseractOcr { /* per trait */ }
#[cfg(feature = "apple-vision")]
pub struct AppleVisionOcr { /* sidecar path */ }
#[cfg(feature = "apple-vision")]
impl OcrEngine for AppleVisionOcr { /* per trait */ }
pub fn apply_ocr(
engine: &dyn OcrEngine,
image_bytes: &[u8],
block: &mut kebab_core::ImageRefBlock,
lang_hint: Option<&kebab_core::Lang>,
) -> anyhow::Result<()>;
```
## Behavior contract
- Tesseract:
- Languages from `config.ocr.languages` (default `["eng", "kor"]`).
- Recognition produces `OcrRegion { bbox: (x, y, w, h), text, confidence }` for each "word" or "line" (configurable; default "line").
- Drop regions with `confidence < config.ocr.min_confidence` (default 60.0). If all dropped, return `OcrText { joined: "", regions: vec![], engine, engine_version }`.
- `joined` = `regions.iter().map(|r| r.text).join(" ")` (no smart layout reconstruction in v1).
- `engine = "tesseract"`, `engine_version = <tesseract version string>`. The `tesseract` crate (0.13+) does NOT expose a stable Rust `version()` accessor. Use one of: (a) call libtesseract's `TessVersion()` via the bundled FFI surface, OR (b) at adapter construction, shell-out `tesseract --version` once and cache the parsed `"5.3.4"`-style string. Both are deterministic for a fixed install. Pin the chosen approach in the implementation PR.
- Apple Vision sidecar (feature `apple-vision`):
- Spawn a small Swift binary `kebab-vision-ocr` (path from `config.ocr.apple_vision_binary`) feeding the image via stdin and reading JSON `{ regions: [{x,y,w,h,text,confidence}, ...] }` from stdout.
- Same threshold and `joined` rules as Tesseract. `engine = "apple-vision"`, `engine_version = sidecar's --version`.
- This subagent task does NOT write the Swift sidecar; it only wires the Rust side. Document the expected sidecar interface in `docs/spec/sidecar-vision.md` (separate doc spec stub, optional).
- `apply_ocr` calls `engine.recognize`, sets `block.ocr = Some(text)`, and appends a `Provenance::OcrApplied` event in the caller's CanonicalDocument (caller responsibility — this task exposes a helper).
- Streaming / large images: cap decoded image size at 8192×8192 before passing to OCR; downscale with `image::imageops::resize` if larger.
- Trust: `OcrText` is **observed text** (high trust). Captions (`ModelCaption`) are NOT generated here.
- Determinism: Tesseract is deterministic for a fixed input + fixed page-segmentation mode; apply_ocr asserts this by calling twice in dev tests. Apple Vision is also deterministic in practice but may vary across macOS versions; document this and accept.
## Storage / wire effects
- None.
## Test plan
| kind | description | fixture / data |
|------|-------------|----------------|
| unit | Tesseract recognizes English on `fixtures/image/hello-world.png` (joined contains "hello world") | fixture |
| unit | confidence threshold drops noise regions | fixture with low-quality text |
| unit | Korean text recognized when `kor` language enabled | `fixtures/image/안녕.png` |
| unit | empty result returns `OcrText { joined: "", regions: [], .. }` not error | `fixtures/image/no-text.png` |
| unit | `apply_ocr` mutates block.ocr from None → Some | inline |
| determinism | two runs of recognize on same input → identical OcrText | fixture |
| `#[cfg(feature = "apple-vision")]` smoke | sidecar invocation captured (mock binary echoes fixed JSON) | inline mock |
All tests under `cargo test -p kebab-parse-image ocr`. Tesseract install required on CI host.
## Definition of Done
- [ ] `cargo check -p kebab-parse-image --features tesseract` passes
- [ ] `cargo test -p kebab-parse-image ocr` passes
- [ ] `apple-vision` feature compiles on macOS and gracefully no-ops on Linux
- [ ] No imports outside Allowed dependencies
- [ ] PR links design §3.4, §3.7a, §9.1
## Out of scope
- Caption (p6-3).
- Visual embedding (P+).
- Layout-aware reading order (P+).
- PaddleOCR / EasyOCR adapters.
## Risks / notes
- Tesseract performance varies wildly with image quality; document `min_confidence` and default page-segmentation mode.
- Apple Vision sidecar requires code signing for distribution; for v1 dev builds, accept unsigned binary from `~/.local/bin/kebab-vision-ocr`.
- Large image downscale loses small-text recognition; expose `config.ocr.max_pixels` so power users can tune.