tasks: add P6 component specs (image-exif, ocr, caption)

2026-04-27 12:06:20 +00:00
parent 597a848af9
commit c84ab03404
3 changed files with 369 additions and 0 deletions
--- a/tasks/p6/p6-1-image-extractor-exif.md
+++ b/tasks/p6/p6-1-image-extractor-exif.md
@@ -0,0 +1,114 @@
+---
+phase: P6
+component: kb-parse-image (image extractor + EXIF)
+task_id: p6-1
+title: "Image Extractor producing single-block CanonicalDocument + EXIF metadata"
+status: planned
+depends_on: [p0-1, p1-6]
+unblocks: [p6-2, p6-3]
+contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
+contract_sections: [§3.4 Block::ImageRef + ImageRefBlock, §3.7a OcrText/ModelCaption stubs, §9.1 image extraction policy, §9 versioning]
+---
+
+# p6-1 — Image extractor (EXIF + structure)
+
+## Goal
+
+Implement `Extractor` for `MediaType::Image(_)` that produces a `CanonicalDocument` whose body is exactly one `ImageRefBlock`. EXIF is captured into `metadata.user.exif`. OCR and caption are intentionally left `None`; later tasks (p6-2, p6-3) populate them.
+
+## Why now / why this size
+
+Establishes the image-as-document contract and decouples extraction (asset → ImageRefBlock) from analysis (OCR / caption). Keeps the multimodal merge surface small.
+
+## Allowed dependencies
+
+- `kb-core`
+- `kb-config`
+- `image = "0.25"` (decoding for size + format detect)
+- `kamadak-exif` for EXIF
+- `serde`, `serde_json`
+- `time`
+- `tracing`
+- `thiserror`, `anyhow` (the public surface returns `anyhow::Result`)
+
+## Forbidden dependencies
+
+- `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-store-*`, `kb-embed*`, `kb-search`, `kb-llm*`, `kb-rag`, `kb-tui`, `kb-desktop`, OCR libs, LLM libs
+
+## Inputs
+
+| input | type | source |
+|-------|------|--------|
+| `RawAsset` | `kb_core::RawAsset` | from `kb-source-fs` |
+| image bytes | `&[u8]` | filesystem |
+| `parser_version` | `kb_core::ParserVersion` | constant in this crate (`"image-meta-v1"`) |
+
+## Outputs
+
+| output | type | downstream |
+|--------|------|------------|
+| `CanonicalDocument` | `kb_core::CanonicalDocument` | `kb-chunk` (image-region chunker) → `kb-store-sqlite` |
+
+## Public surface (signatures only — no new types)
+
+```rust
+pub struct ImageExtractor;
+
+impl kb_core::Extractor for ImageExtractor {
+    fn supports(&self, m: &kb_core::MediaType) -> bool { matches!(m, kb_core::MediaType::Image(_)) }
+    fn parser_version(&self) -> kb_core::ParserVersion { kb_core::ParserVersion("image-meta-v1".into()) }
+    fn extract(&self, ctx: &kb_core::ExtractContext, bytes: &[u8]) -> anyhow::Result<kb_core::CanonicalDocument>;
+}
+```
+
+## Behavior contract
+
+- One asset → one document. `title` = filename without extension; `lang = Lang("und")`.
+- `blocks` contains exactly one entry: `Block::ImageRef(ImageRefBlock { common, asset_id: Some(asset.asset_id), src: workspace_path, alt: filename, ocr: None, caption: None })`.
+- `common.source_span` = `SourceSpan::Region { x:0, y:0, w: width, h: height }` covering the entire image (width/height obtained from `image::ImageReader::new(std::io::Cursor::new(bytes)).with_guessed_format()?.into_dimensions()`, which reads only the header and does not decode pixel data).
+- `metadata.source_type = SourceType::Reference` (per design enum); `trust_level = TrustLevel::Primary`; `tags`/`aliases` empty.
+- `metadata.user["exif"]` = JSON object with whitelisted EXIF tags (DateTimeOriginal, GPS lat/lon, Make, Model, Orientation, Software). Missing tags omitted.
+- `metadata.user["dimensions"] = { "w": <u32>, "h": <u32>, "format": "<png|jpeg|...>" }`.
+- `provenance` includes `Discovered` and `Parsed` events, but no `Normalized` event. ID assignment could either happen here directly (per the §3.4 stub logic from p1-4) or be piped through `kb-normalize`; this task chooses the former and emits a fully formed CanonicalDocument with deterministic IDs by calling `kb_core::id_for_doc` and `kb_core::id_for_block` directly.
+- Failure modes:
+  - Truncated/corrupt image → still emits a CanonicalDocument with `dimensions = null`, EXIF empty, `Provenance` warning event with the decoder error message.
+  - Unsupported format → `anyhow::Error` (caller skips).
+- Determinism: identical bytes + identical parser_version → identical `doc_id` and `block_id`.
+
+## Storage / wire effects
+
+- None directly (the caller persists via `kb-store-sqlite`).
+
+## Test plan
+
+| kind | description | fixture / data |
+|------|-------------|----------------|
+| unit | PNG decode produces correct dimensions in `metadata.user.dimensions` | `fixtures/image/red-100x50.png` |
+| unit | JPEG with EXIF GPS captured into `metadata.user.exif` | `fixtures/image/exif-with-gps.jpg` |
+| unit | image with no EXIF produces `metadata.user.exif = {}` | `fixtures/image/no-exif.png` |
+| unit | corrupt image: warning provenance, no panic | `fixtures/image/corrupt.png` |
+| determinism | identical bytes → identical `doc_id`, `block_id` across two runs | inline |
+| snapshot | `CanonicalDocument` JSON stable for fixture | `fixtures/image/red-100x50.png` |
+
+All tests under `cargo test -p kb-parse-image`.
+
+## Definition of Done
+
+- [ ] `cargo check -p kb-parse-image` passes
+- [ ] `cargo test -p kb-parse-image` passes
+- [ ] No OCR/caption/embedding code present
+- [ ] No imports outside Allowed dependencies
+- [ ] PR links design §3.4, §9.1
+
+## Out of scope
+
+- OCR text (p6-2).
+- Captioning (p6-3).
+- CLIP / visual embedding (P+).
+- HEIC / RAW formats (out of scope; record as Other and accept failure for v1).
+
+## Risks / notes
+
+- `image` crate doesn't decode HEIC; document and accept skip. Apple Vision sidecar (P+) can fill this gap.
+- EXIF whitelist keeps PII surface small (no thumbnails, no maker notes). Document the list in the spec section.
+- Cap decode dimensions to ~16k×16k; oversized → warning + null dimensions instead of attempted decode.
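A minimal sketch of two p6-1 rules from the behavior contract and risk notes above: the title rule (filename without extension) and the full-image region with the oversize cap. It sits outside the patch and is illustration only. `Region`, `MAX_EDGE`, `title_for`, and `full_image_region` are hypothetical stand-ins rather than `kb_core` items, and the exact cap value of 16384 is an assumption read from the "~16k×16k" note.

```rust
use std::path::Path;

/// Hypothetical stand-in for `kb_core::SourceSpan::Region`.
#[derive(Debug, PartialEq)]
struct Region {
    x: u32,
    y: u32,
    w: u32,
    h: u32,
}

/// Assumed per-edge cap, taken from the "~16k x 16k" risk note.
const MAX_EDGE: u32 = 16_384;

/// Title rule: the filename without its final extension.
fn title_for(path: &str) -> String {
    Path::new(path)
        .file_stem()
        .map(|stem| stem.to_string_lossy().into_owned())
        .unwrap_or_default()
}

/// Returns the region covering the whole image, or a warning message when
/// the reported size exceeds the cap (the caller would then record
/// `dimensions = null` and attach the message to a provenance warning).
fn full_image_region(w: u32, h: u32) -> Result<Region, String> {
    if w > MAX_EDGE || h > MAX_EDGE {
        return Err(format!(
            "image {w}x{h} exceeds the {MAX_EDGE}px cap; dimensions left null"
        ));
    }
    Ok(Region { x: 0, y: 0, w, h })
}
```

In the real extractor the width and height would come from the image header, and the `Err` branch would feed the warning event rather than abort extraction.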
--- a/tasks/p6/p6-2-ocr-adapter.md
+++ b/tasks/p6/p6-2-ocr-adapter.md
@@ -0,0 +1,133 @@
+---
+phase: P6
+component: kb-parse-image (OCR adapter)
+task_id: p6-2
+title: "OcrEngine trait + Tesseract adapter (Apple Vision feature-gated)"
+status: planned
+depends_on: [p6-1]
+unblocks: [p6-3]
+contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
+contract_sections: [§3.4 ImageRefBlock.ocr, §3.7a OcrText/OcrRegion, §9.1 OCR vs caption provenance]
+---
+
+# p6-2 — OCR adapter
+
+## Goal
+
+Define `OcrEngine` trait + a Tesseract-backed default implementation. Populate `ImageRefBlock.ocr` with `OcrText { joined, regions, engine, engine_version }`. Provide an `apple-vision` feature gate that switches to a sidecar binary on macOS.
+
+## Why now / why this size
+
+Strict separation of OCR (observed text) from caption (model-generated). Confining engine choice to a single trait + adapter lets us swap to Apple Vision or PaddleOCR without touching the extractor or chunker.
+
+## Allowed dependencies
+
+- `kb-core`
+- `kb-config`
+- `kb-parse-image` (consumes its types)
+- `tesseract = "0.13"` (feature `tesseract`, default ON)
+- For feature `apple-vision`: `std::process::Command` only (sidecar binary, not a Rust dep)
+- `serde`, `serde_json`
+- `image`
+- `tracing`
+- `thiserror`, `anyhow` (the public surface returns `anyhow::Result`)
+
+## Forbidden dependencies
+
+- `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-store-*`, `kb-embed*`, `kb-search`, `kb-llm*`, `kb-rag`, `kb-tui`, `kb-desktop`
+
+## Inputs
+
+| input | type | source |
+|-------|------|--------|
+| image bytes | `&[u8]` | from extractor |
+| optional language hint | `kb_core::Lang` | metadata |
+| `kb-config` OCR settings | engine name, languages | runtime |
+
+## Outputs
+
+| output | type | downstream |
+|--------|------|------------|
+| `OcrText` | `kb_core::OcrText` | merged into `ImageRefBlock.ocr` |
+
+## Public surface (signatures only — no new types)
+
+```rust
+pub trait OcrEngine: Send + Sync {
+    fn engine_name(&self) -> &'static str;
+    fn engine_version(&self) -> String;
+    fn recognize(&self, image_bytes: &[u8], lang_hint: Option<&kb_core::Lang>) -> anyhow::Result<kb_core::OcrText>;
+}
+
+pub struct TesseractOcr { /* internal: lazy api handle */ }
+impl TesseractOcr { pub fn new(config: &kb_config::Config) -> anyhow::Result<Self>; }
+impl OcrEngine for TesseractOcr { /* per trait */ }
+
+#[cfg(feature = "apple-vision")]
+pub struct AppleVisionOcr { /* sidecar path */ }
+#[cfg(feature = "apple-vision")]
+impl OcrEngine for AppleVisionOcr { /* per trait */ }
+
+pub fn apply_ocr(
+    engine: &dyn OcrEngine,
+    image_bytes: &[u8],
+    block: &mut kb_core::ImageRefBlock,
+    lang_hint: Option<&kb_core::Lang>,
+) -> anyhow::Result<()>;
+```
+
+## Behavior contract
+
+- Tesseract:
+  - Languages from `config.ocr.languages` (default `["eng", "kor"]`).
+  - Recognition produces `OcrRegion { bbox: (x, y, w, h), text, confidence }` for each "word" or "line" (configurable; default "line").
+  - Drop regions with `confidence < config.ocr.min_confidence` (default 60.0). If all dropped, return `OcrText { joined: "", regions: vec![], engine, engine_version }`.
+  - `joined` = `regions.iter().map(|r| r.text.as_str()).collect::<Vec<_>>().join(" ")` (no smart layout reconstruction in v1).
+  - `engine = "tesseract"`, `engine_version = tesseract::version()`.
+- Apple Vision sidecar (feature `apple-vision`):
+  - Spawn a small Swift binary `kb-vision-ocr` (path from `config.ocr.apple_vision_binary`) feeding the image via stdin and reading JSON `{ regions: [{x,y,w,h,text,confidence}, ...] }` from stdout.
+  - Same threshold and `joined` rules as Tesseract. `engine = "apple-vision"`, `engine_version = sidecar's --version`.
+  - This subagent task does NOT write the Swift sidecar; it only wires the Rust side. Document the expected sidecar interface in `docs/spec/sidecar-vision.md` (separate doc spec stub, optional).
+- `apply_ocr` calls `engine.recognize` and sets `block.ocr = Some(text)`. Because `apply_ocr` only receives the block, appending the `Provenance::OcrApplied` event to the CanonicalDocument is the caller's responsibility; this task only exposes the helper.
+- Streaming / large images: cap the decoded image at 8192×8192 by default (tunable via `config.ocr.max_pixels`, see Risks / notes) before passing it to OCR; downscale with `image::imageops::resize` if larger.
+- Trust: `OcrText` is **observed text** (high trust). Captions (`ModelCaption`) are NOT generated here.
+- Determinism: Tesseract is deterministic for a fixed input + fixed page-segmentation mode; the dev tests assert this by calling `recognize` twice on the same input and comparing the results. Apple Vision is also deterministic in practice but may vary across macOS versions; document this and accept.
+
+## Storage / wire effects
+
+- None.
+
+## Test plan
+
+| kind | description | fixture / data |
+|------|-------------|----------------|
+| unit | Tesseract recognizes English on `fixtures/image/hello-world.png` (joined contains "hello world") | fixture |
+| unit | confidence threshold drops noise regions | fixture with low-quality text |
+| unit | Korean text recognized when `kor` language enabled | `fixtures/image/안녕.png` |
+| unit | empty result returns `OcrText { joined: "", regions: [], .. }` not error | `fixtures/image/no-text.png` |
+| unit | `apply_ocr` mutates block.ocr from None → Some | inline |
+| determinism | two runs of recognize on same input → identical OcrText | fixture |
+| `#[cfg(feature = "apple-vision")]` smoke | sidecar invocation captured (mock binary echoes fixed JSON) | inline mock |
+
+All tests under `cargo test -p kb-parse-image ocr`. Tesseract install required on CI host.
+
+## Definition of Done
+
+- [ ] `cargo check -p kb-parse-image --features tesseract` passes
+- [ ] `cargo test -p kb-parse-image ocr` passes
+- [ ] `apple-vision` feature compiles on macOS and gracefully no-ops on Linux
+- [ ] No imports outside Allowed dependencies
+- [ ] PR links design §3.4, §3.7a, §9.1
+
+## Out of scope
+
+- Caption (p6-3).
+- Visual embedding (P+).
+- Layout-aware reading order (P+).
+- PaddleOCR / EasyOCR adapters.
+
+## Risks / notes
+
+- Tesseract performance varies wildly with image quality; document `min_confidence` and default page-segmentation mode.
+- Apple Vision sidecar requires code signing for distribution; for v1 dev builds, accept unsigned binary from `~/.local/bin/kb-vision-ocr`.
+- Large image downscale loses small-text recognition; expose `config.ocr.max_pixels` so power users can tune.
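A minimal sketch of the p6-2 threshold and `joined` rules described above, outside the patch and for illustration only. `OcrRegion` and `OcrText` here are local stand-ins that mirror the field names in the spec; the real types live in `kb_core` and may differ. `build_ocr_text` is a hypothetical helper name.

```rust
/// Local stand-in mirroring the spec's `OcrRegion` fields.
#[derive(Debug, Clone, PartialEq)]
struct OcrRegion {
    bbox: (u32, u32, u32, u32),
    text: String,
    confidence: f32,
}

/// Local stand-in mirroring the spec's `OcrText` fields.
#[derive(Debug, PartialEq)]
struct OcrText {
    joined: String,
    regions: Vec<OcrRegion>,
    engine: String,
    engine_version: String,
}

/// Drops regions below `min_confidence`, then joins the surviving texts
/// with single spaces in engine order (no layout reconstruction).
fn build_ocr_text(
    raw: Vec<OcrRegion>,
    min_confidence: f32,
    engine: &str,
    engine_version: &str,
) -> OcrText {
    let regions: Vec<OcrRegion> = raw
        .into_iter()
        .filter(|r| r.confidence >= min_confidence)
        .collect();
    let joined = regions
        .iter()
        .map(|r| r.text.as_str())
        .collect::<Vec<_>>()
        .join(" ");
    OcrText {
        joined,
        regions,
        engine: engine.to_string(),
        engine_version: engine_version.to_string(),
    }
}
```

When every region falls below the threshold, the helper returns an `OcrText` with an empty `joined` string and no regions rather than an error, which matches the empty-result test case in the plan.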
--- a/tasks/p6/p6-3-caption-adapter.md
+++ b/tasks/p6/p6-3-caption-adapter.md
@@ -0,0 +1,122 @@
+---
+phase: P6
+component: kb-parse-image (caption adapter)
+task_id: p6-3
+title: "ModelCaption adapter (LanguageModel-driven, feature-gated)"
+status: planned
+depends_on: [p6-1, p4-2]
+unblocks: []
+contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
+contract_sections: [§3.4 ImageRefBlock.caption, §3.7a ModelCaption, §9.1 caption (model-generated, low trust)]
+---
+
+# p6-3 — Caption adapter
+
+## Goal
+
+Optionally populate `ImageRefBlock.caption` with `ModelCaption { text, model, model_version }` produced by a vision-capable LM (e.g., `qwen2.5-vl:7b` via Ollama). Feature-gated; default OFF.
+
+## Why now / why this size
+
+Captioning closes the multimodal loop. Strict separation from OCR keeps trust levels distinct: captions are generated, OCR is observed. Adapter is small — single trait method + one prompt.
+
+## Allowed dependencies
+
+- `kb-core`
+- `kb-config`
+- `kb-parse-image`
+- `kb-llm` (LanguageModel trait)
+- `base64`
+- `serde`, `serde_json`
+- `image` (resize for prompt cost control)
+- `tracing`
+- `thiserror`, `anyhow` (the public surface returns `anyhow::Result`)
+
+## Forbidden dependencies
+
+- `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-store-*`, `kb-embed*`, `kb-search`, `kb-rag`, `kb-llm-local` (only via trait), `kb-tui`, `kb-desktop`
+
+## Inputs
+
+| input | type | source |
+|-------|------|--------|
+| image bytes | `&[u8]` | extractor |
+| `dyn LanguageModel` (vision-capable) | runtime | injected |
+| `kb-config.image.caption` | `{ enabled, max_pixels, prompt_template_version }` | runtime |
+
+## Outputs
+
+| output | type | downstream |
+|--------|------|------------|
+| `ModelCaption` | `kb_core::ModelCaption` | merged into `ImageRefBlock.caption` |
+
+## Public surface (signatures only — no new types)
+
+```rust
+pub fn caption_image(
+    llm: &dyn kb_core::LanguageModel,
+    image_bytes: &[u8],
+    cfg: &kb_config::Config,
+) -> anyhow::Result<kb_core::ModelCaption>;
+
+pub fn apply_caption(
+    llm: &dyn kb_core::LanguageModel,
+    image_bytes: &[u8],
+    block: &mut kb_core::ImageRefBlock,
+    cfg: &kb_config::Config,
+) -> anyhow::Result<()>;
+```
+
+## Behavior contract
+
+- Feature gate: if `config.image.caption.enabled = false` (default), `apply_caption` is a no-op (returns `Ok(())` without invoking LM).
+- Pre-process: downscale the image so that its long edge is at most `config.image.caption.max_pixels` (default 768, i.e. the result fits within 768×768), preserving aspect ratio; encode as PNG.
+- Build prompt:
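+  - `system = "이미지를 한 문장으로 객관적으로 설명한다. 추측은 피하고, 보이는 것만 적는다."` (in English: "Describe the image objectively in one sentence. Avoid guessing and write only what is visible.")
+  - `user` = `[image_base64]\n\n위 이미지를 한국어로 한 문장으로 설명하라.` (in English: "Describe the image above in one sentence, in Korean.") if the `lang` hint == "ko"; otherwise an English variant.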
+  - The base64 wrapper assumes the LM adapter routes vision inputs via Ollama's `images: [base64]` field (this is provider-specific; the adapter is responsible for rendering the prompt to wire). For non-vision LMs, return an error and skip.
+- Call `llm.generate_stream(GenerateRequest { system, user, stop: vec!["\n\n"], max_tokens: 96, temperature: 0.0, seed: Some(0) })`. Collect tokens until `Done`.
+- `ModelCaption { text: collected, model: llm.model_ref().id, model_version: llm.model_ref().provider }` (use provider as a coarse "version" proxy; if a vision model exposes a stable revision, prefer that).
+- `apply_caption` sets `block.caption = Some(...)` and appends `Provenance::CaptionApplied` event.
+- Trust: a caption is **model-generated**. This task only emits the `ModelCaption`; a caller that propagates trust into chunk-level UI should label it `trust_level = TrustLevel::Generated`.
+- Failure modes:
+  - LM error → return `anyhow::Error`; caller may decide to skip (do not fail the entire ingest).
+  - Empty LM output → still set `block.caption = Some(ModelCaption { text: "", .. })` (with `model` and `model_version` filled in as usual) so downstream code can distinguish "captioning attempted, no result" from "captioning never attempted".
+- Determinism: `temperature=0` + `seed=0`. Tests use `MockLanguageModel` to assert deterministic captions.
+
+## Storage / wire effects
+
+- None directly. Caller persists via `kb-store-sqlite`.
+
+## Test plan
+
+| kind | description | fixture / data |
+|------|-------------|----------------|
+| unit | feature disabled → `apply_caption` no-op | inline (config.enabled = false) |
+| unit | mock LM emits "사진 한 장" → `block.caption.text = "사진 한 장"` | inline |
+| unit | mock LM emits empty token stream → `block.caption = Some(ModelCaption { text: "", .. })` | inline |
+| unit | Korean lang hint produces Korean prompt; English hint → English prompt | inline |
+| unit | downscale honors `max_pixels` (long edge of the re-encoded image ≤ `max_pixels`, aspect ratio preserved) | fixture large image |
+| determinism | identical input + temperature=0 + seed=0 → identical caption (mock) | inline |
+
+All tests under `cargo test -p kb-parse-image caption` with mock LM only.
+
+## Definition of Done
+
+- [ ] `cargo check -p kb-parse-image --features caption` passes
+- [ ] `cargo test -p kb-parse-image caption` passes
+- [ ] No imports outside Allowed dependencies
+- [ ] Feature default OFF; only on when user opts in via config
+- [ ] PR links design §3.4 ImageRefBlock.caption, §9.1
+
+## Out of scope
+
+- Multimodal RAG that uses caption text in answer (P+).
+- CLIP / image embedding for cross-modal search (P+).
+- Caption translation (P+).
+
+## Risks / notes
+
+- Vision LMs hallucinate. The system prompt explicitly forbids guessing, but expect false captions; UI and RAG must always label captions as model-generated.
+- Ollama `qwen2.5-vl` accepts base64 images via `images:[]` — this is provider-specific; documenting the wire shape in the spec keeps adapter swaps cheap.
+- Large images bloat prompt costs; cap aggressively (long edge of 768 px by default).
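A minimal sketch of the p6-3 pre-processing and prompt-selection rules above, outside the patch and for illustration only. It computes target dimensions rather than resizing pixels, so it needs no image crate. `fit_long_edge` and `user_instruction` are hypothetical helper names, and the English instruction text is an assumption, since the spec only says "English variant".

```rust
/// Computes target dimensions so that the long edge is at most `max_edge`,
/// preserving aspect ratio. Images already within the limit are unchanged.
fn fit_long_edge(w: u32, h: u32, max_edge: u32) -> (u32, u32) {
    let long = w.max(h);
    if long == 0 || long <= max_edge {
        return (w, h);
    }
    // Rounded integer scaling in u64 to avoid overflow; the short edge is
    // clamped to at least 1 so extreme aspect ratios stay encodable.
    let scale = |v: u32| -> u32 {
        let scaled = (v as u64 * max_edge as u64 + long as u64 / 2) / long as u64;
        scaled.max(1) as u32
    };
    (scale(w), scale(h))
}

/// Picks the user instruction from the language hint. The Korean text is
/// quoted from the spec; the English text is an assumed placeholder.
fn user_instruction(lang_hint: Option<&str>) -> &'static str {
    match lang_hint {
        Some("ko") => "위 이미지를 한국어로 한 문장으로 설명하라.",
        _ => "Describe the image above in one sentence.",
    }
}
```

For example, a 4000×3000 input with the default limit of 768 maps to 768×576, while a 100×50 input is passed through untouched.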