tasks: add P6 component specs (image-exif, ocr, caption)

This commit is contained in:
kb
2026-04-27 12:06:20 +00:00
parent 597a848af9
commit c84ab03404
3 changed files with 369 additions and 0 deletions

View File

@@ -0,0 +1,114 @@
---
phase: P6
component: kb-parse-image (image extractor + EXIF)
task_id: p6-1
title: "Image Extractor producing single-block CanonicalDocument + EXIF metadata"
status: planned
depends_on: [p0-1, p1-6]
unblocks: [p6-2, p6-3]
contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
contract_sections: [§3.4 Block::ImageRef + ImageRefBlock, §3.7a OcrText/ModelCaption stubs, §9.1 image extraction policy, §9 versioning]
---
# p6-1 — Image extractor (EXIF + structure)
## Goal
Implement `Extractor` for `MediaType::Image(_)` that produces a `CanonicalDocument` whose body is exactly one `ImageRefBlock`. EXIF is captured into `metadata.user.exif`. OCR and caption are intentionally left `None`; later tasks (p6-2, p6-3) populate them.
## Why now / why this size
Establishes the image-as-document contract and decouples extraction (asset → ImageRefBlock) from analysis (OCR / caption). Keeps the multimodal merge surface small.
## Allowed dependencies
- `kb-core`
- `kb-config`
- `image = "0.25"` (decoding for size + format detect)
- `kamadak-exif` for EXIF
- `serde`, `serde_json`
- `time`
- `tracing`
- `thiserror`
## Forbidden dependencies
- `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-store-*`, `kb-embed*`, `kb-search`, `kb-llm*`, `kb-rag`, `kb-tui`, `kb-desktop`, OCR libs, LLM libs
## Inputs
| input | type | source |
|-------|------|--------|
| `RawAsset` | `kb_core::RawAsset` | from `kb-source-fs` |
| image bytes | `&[u8]` | filesystem |
| `parser_version` | `kb_core::ParserVersion` | constant in this crate (`"image-meta-v1"`) |
## Outputs
| output | type | downstream |
|--------|------|------------|
| `CanonicalDocument` | `kb_core::CanonicalDocument` | `kb-chunk` (image-region chunker) → `kb-store-sqlite` |
## Public surface (signatures only — no new types)
```rust
pub struct ImageExtractor;
impl kb_core::Extractor for ImageExtractor {
fn supports(&self, m: &kb_core::MediaType) -> bool { matches!(m, kb_core::MediaType::Image(_)) }
fn parser_version(&self) -> kb_core::ParserVersion { kb_core::ParserVersion("image-meta-v1".into()) }
fn extract(&self, ctx: &kb_core::ExtractContext, bytes: &[u8]) -> anyhow::Result<kb_core::CanonicalDocument>;
}
```
## Behavior contract
- One asset → one document. `title` = filename without extension; `lang = Lang("und")`.
- `blocks` contains exactly one entry: `Block::ImageRef(ImageRefBlock { common, asset_id: Some(asset.asset_id), src: workspace_path, alt: filename, ocr: None, caption: None })`.
- `common.source_span` = `SourceSpan::Region { x:0, y:0, w: width, h: height }` covering the entire image (width/height obtained from `image::ImageReader::without_guessed_format().with_guessed_format()?.into_dimensions()`).
- `metadata.source_type = SourceType::Reference` (per design enum); `trust_level = TrustLevel::Primary`; `tags`/`aliases` empty.
- `metadata.user["exif"]` = JSON object with whitelisted EXIF tags (DateTimeOriginal, GPS lat/lon, Make, Model, Orientation, Software). Missing tags omitted.
- `metadata.user["dimensions"] = { "w": <u32>, "h": <u32>, "format": "<png|jpeg|...>" }`.
- `provenance` includes `Discovered`, `Parsed` events (no Normalized — ID assignment happens here directly per §3.4 stub from p1-4 logic, OR pipe through `kb-normalize` if available; this task's choice: emit a fully formed CanonicalDocument with deterministic IDs by calling `kb_core::id_for_doc` and `kb_core::id_for_block` directly).
- Failure modes:
- Truncated/corrupt image → still emits a CanonicalDocument with `dimensions = null`, EXIF empty, `Provenance` warning event with the decoder error message.
- Unsupported format → `anyhow::Error` (caller skips).
- Determinism: identical bytes + identical parser_version → identical `doc_id` and `block_id`.
## Storage / wire effects
- None directly (the caller persists via `kb-store-sqlite`).
## Test plan
| kind | description | fixture / data |
|------|-------------|----------------|
| unit | PNG decode produces correct dimensions in `metadata.user.dimensions` | `fixtures/image/red-100x50.png` |
| unit | JPEG with EXIF GPS captured into `metadata.user.exif` | `fixtures/image/exif-with-gps.jpg` |
| unit | image with no EXIF produces `metadata.user.exif = {}` | `fixtures/image/no-exif.png` |
| unit | corrupt image: warning provenance, no panic | `fixtures/image/corrupt.png` |
| determinism | identical bytes → identical `doc_id`, `block_id` across two runs | inline |
| snapshot | `CanonicalDocument` JSON stable for fixture | `fixtures/image/red-100x50.png` |
All tests under `cargo test -p kb-parse-image`.
## Definition of Done
- [ ] `cargo check -p kb-parse-image` passes
- [ ] `cargo test -p kb-parse-image` passes
- [ ] No OCR/caption/embedding code present
- [ ] No imports outside Allowed dependencies
- [ ] PR links design §3.4, §9.1
## Out of scope
- OCR text (p6-2).
- Captioning (p6-3).
- CLIP / visual embedding (P+).
- HEIC / RAW formats (out of scope; record as Other and accept failure for v1).
## Risks / notes
- `image` crate doesn't decode HEIC; document and accept skip. Apple Vision sidecar (P+) can fill this gap.
- EXIF whitelist keeps PII surface small (no thumbnails, no maker notes). Document the list in the spec section.
- Cap decode dimensions to ~16k×16k; oversized → warning + null dimensions instead of attempted decode.

View File

@@ -0,0 +1,133 @@
---
phase: P6
component: kb-parse-image (OCR adapter)
task_id: p6-2
title: "OcrEngine trait + Tesseract adapter (Apple Vision feature-gated)"
status: planned
depends_on: [p6-1]
unblocks: [p6-3]
contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
contract_sections: [§3.4 ImageRefBlock.ocr, §3.7a OcrText/OcrRegion, §9.1 OCR vs caption provenance]
---
# p6-2 — OCR adapter
## Goal
Define `OcrEngine` trait + a Tesseract-backed default implementation. Populate `ImageRefBlock.ocr` with `OcrText { joined, regions, engine, engine_version }`. Provide an `apple-vision` feature gate that switches to a sidecar binary on macOS.
## Why now / why this size
Strict separation of OCR (observed text) from caption (model-generated). Confining engine choice to a single trait + adapter lets us swap to Apple Vision or PaddleOCR without touching the extractor or chunker.
## Allowed dependencies
- `kb-core`
- `kb-config`
- `kb-parse-image` (consumes its types)
- `tesseract = "0.13"` (feature `tesseract`, default ON)
- For feature `apple-vision`: `std::process::Command` only (sidecar binary, not a Rust dep)
- `serde`, `serde_json`
- `image`
- `tracing`
- `thiserror`
## Forbidden dependencies
- `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-store-*`, `kb-embed*`, `kb-search`, `kb-llm*`, `kb-rag`, `kb-tui`, `kb-desktop`
## Inputs
| input | type | source |
|-------|------|--------|
| image bytes | `&[u8]` | from extractor |
| optional language hint | `kb_core::Lang` | metadata |
| `kb-config` OCR settings | engine name, languages | runtime |
## Outputs
| output | type | downstream |
|--------|------|------------|
| `OcrText` | `kb_core::OcrText` | merged into `ImageRefBlock.ocr` |
## Public surface (signatures only — no new types)
```rust
pub trait OcrEngine: Send + Sync {
fn engine_name(&self) -> &'static str;
fn engine_version(&self) -> String;
fn recognize(&self, image_bytes: &[u8], lang_hint: Option<&kb_core::Lang>) -> anyhow::Result<kb_core::OcrText>;
}
pub struct TesseractOcr { /* internal: lazy api handle */ }
impl TesseractOcr { pub fn new(config: &kb_config::Config) -> anyhow::Result<Self>; }
impl OcrEngine for TesseractOcr { /* per trait */ }
#[cfg(feature = "apple-vision")]
pub struct AppleVisionOcr { /* sidecar path */ }
#[cfg(feature = "apple-vision")]
impl OcrEngine for AppleVisionOcr { /* per trait */ }
pub fn apply_ocr(
engine: &dyn OcrEngine,
image_bytes: &[u8],
block: &mut kb_core::ImageRefBlock,
lang_hint: Option<&kb_core::Lang>,
) -> anyhow::Result<()>;
```
## Behavior contract
- Tesseract:
- Languages from `config.ocr.languages` (default `["eng", "kor"]`).
- Recognition produces `OcrRegion { bbox: (x, y, w, h), text, confidence }` for each "word" or "line" (configurable; default "line").
- Drop regions with `confidence < config.ocr.min_confidence` (default 60.0). If all dropped, return `OcrText { joined: "", regions: vec![], engine, engine_version }`.
- `joined` = `regions.iter().map(|r| r.text).join(" ")` (no smart layout reconstruction in v1).
- `engine = "tesseract"`, `engine_version = tesseract::version()`.
- Apple Vision sidecar (feature `apple-vision`):
- Spawn a small Swift binary `kb-vision-ocr` (path from `config.ocr.apple_vision_binary`) feeding the image via stdin and reading JSON `{ regions: [{x,y,w,h,text,confidence}, ...] }` from stdout.
- Same threshold and `joined` rules as Tesseract. `engine = "apple-vision"`, `engine_version = sidecar's --version`.
- This subagent task does NOT write the Swift sidecar; it only wires the Rust side. Document the expected sidecar interface in `docs/spec/sidecar-vision.md` (separate doc spec stub, optional).
- `apply_ocr` calls `engine.recognize`, sets `block.ocr = Some(text)`, and appends a `Provenance::OcrApplied` event in the caller's CanonicalDocument (caller responsibility — this task exposes a helper).
- Streaming / large images: cap decoded image size at 8192×8192 before passing to OCR; downscale with `image::imageops::resize` if larger.
- Trust: `OcrText` is **observed text** (high trust). Captions (`ModelCaption`) are NOT generated here.
- Determinism: Tesseract is deterministic for a fixed input + fixed page-segmentation mode; apply_ocr asserts this by calling twice in dev tests. Apple Vision is also deterministic in practice but may vary across macOS versions; document this and accept.
## Storage / wire effects
- None.
## Test plan
| kind | description | fixture / data |
|------|-------------|----------------|
| unit | Tesseract recognizes English on `fixtures/image/hello-world.png` (joined contains "hello world") | fixture |
| unit | confidence threshold drops noise regions | fixture with low-quality text |
| unit | Korean text recognized when `kor` language enabled | `fixtures/image/안녕.png` |
| unit | empty result returns `OcrText { joined: "", regions: [], .. }` not error | `fixtures/image/no-text.png` |
| unit | `apply_ocr` mutates block.ocr from None → Some | inline |
| determinism | two runs of recognize on same input → identical OcrText | fixture |
| `#[cfg(feature = "apple-vision")]` smoke | sidecar invocation captured (mock binary echoes fixed JSON) | inline mock |
All tests under `cargo test -p kb-parse-image ocr`. Tesseract install required on CI host.
## Definition of Done
- [ ] `cargo check -p kb-parse-image --features tesseract` passes
- [ ] `cargo test -p kb-parse-image ocr` passes
- [ ] `apple-vision` feature compiles on macOS and gracefully no-ops on Linux
- [ ] No imports outside Allowed dependencies
- [ ] PR links design §3.4, §3.7a, §9.1
## Out of scope
- Caption (p6-3).
- Visual embedding (P+).
- Layout-aware reading order (P+).
- PaddleOCR / EasyOCR adapters.
## Risks / notes
- Tesseract performance varies wildly with image quality; document `min_confidence` and default page-segmentation mode.
- Apple Vision sidecar requires code signing for distribution; for v1 dev builds, accept unsigned binary from `~/.local/bin/kb-vision-ocr`.
- Large image downscale loses small-text recognition; expose `config.ocr.max_pixels` so power users can tune.

View File

@@ -0,0 +1,122 @@
---
phase: P6
component: kb-parse-image (caption adapter)
task_id: p6-3
title: "ModelCaption adapter (LanguageModel-driven, feature-gated)"
status: planned
depends_on: [p6-1, p4-2]
unblocks: []
contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
contract_sections: [§3.4 ImageRefBlock.caption, §3.7a ModelCaption, §9.1 caption (model-generated, low trust)]
---
# p6-3 — Caption adapter
## Goal
Optionally populate `ImageRefBlock.caption` with `ModelCaption { text, model, model_version }` produced by a vision-capable LM (e.g., `qwen2.5-vl:7b` via Ollama). Feature-gated; default OFF.
## Why now / why this size
Captioning closes the multimodal loop. Strict separation from OCR keeps trust levels distinct: captions are generated, OCR is observed. Adapter is small — single trait method + one prompt.
## Allowed dependencies
- `kb-core`
- `kb-config`
- `kb-parse-image`
- `kb-llm` (LanguageModel trait)
- `base64`
- `serde`, `serde_json`
- `image` (resize for prompt cost control)
- `tracing`
- `thiserror`
## Forbidden dependencies
- `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-store-*`, `kb-embed*`, `kb-search`, `kb-rag`, `kb-llm-local` (only via trait), `kb-tui`, `kb-desktop`
## Inputs
| input | type | source |
|-------|------|--------|
| image bytes | `&[u8]` | extractor |
| `dyn LanguageModel` (vision-capable) | runtime | injected |
| `kb-config.image.caption` | `{ enabled, max_pixels, prompt_template_version }` | runtime |
## Outputs
| output | type | downstream |
|--------|------|------------|
| `ModelCaption` | `kb_core::ModelCaption` | merged into `ImageRefBlock.caption` |
## Public surface (signatures only — no new types)
```rust
pub fn caption_image(
llm: &dyn kb_core::LanguageModel,
image_bytes: &[u8],
cfg: &kb_config::Config,
) -> anyhow::Result<kb_core::ModelCaption>;
pub fn apply_caption(
llm: &dyn kb_core::LanguageModel,
image_bytes: &[u8],
block: &mut kb_core::ImageRefBlock,
cfg: &kb_config::Config,
) -> anyhow::Result<()>;
```
## Behavior contract
- Feature gate: if `config.image.caption.enabled = false` (default), `apply_caption` is a no-op (returns `Ok(())` without invoking LM).
- Pre-process: downscale image to `config.image.caption.max_pixels` (default 768×768 long edge) preserving aspect; encode as PNG.
- Build prompt:
- `system = "이미지를 한 문장으로 객관적으로 설명한다. 추측은 피하고, 보이는 것만 적는다."`
- `user` = `[image_base64]\n\n위 이미지를 한국어로 한 문장으로 설명하라.` (if `lang` hint == "ko") or English variant otherwise.
- The base64 wrapper assumes the LM adapter routes vision inputs via Ollama's `images: [base64]` field (this is provider-specific; the adapter is responsible for rendering the prompt to wire). For non-vision LMs, return an error and skip.
- Call `llm.generate_stream(GenerateRequest { system, user, stop: vec!["\n\n"], max_tokens: 96, temperature: 0.0, seed: Some(0) })`. Collect tokens until `Done`.
- `ModelCaption { text: collected, model: llm.model_ref().id, model_version: llm.model_ref().provider }` (use provider as a coarse "version" proxy; if a vision model exposes a stable revision, prefer that).
- `apply_caption` sets `block.caption = Some(...)` and appends `Provenance::CaptionApplied` event.
- Trust: caption is **model-generated** and labeled `trust_level = TrustLevel::Generated` if the caller propagates trust into chunk-level UI; this task only emits the `ModelCaption`.
- Failure modes:
- LM error → return `anyhow::Error`; caller may decide to skip (do not fail the entire ingest).
- Empty LM output → still set `block.caption = Some(ModelCaption { text: "" })` so downstream code can distinguish "captioning attempted, no result" from "captioning never attempted".
- Determinism: `temperature=0` + `seed=0`. Tests use `MockLanguageModel` to assert deterministic captions.
## Storage / wire effects
- None directly. Caller persists via `kb-store-sqlite`.
## Test plan
| kind | description | fixture / data |
|------|-------------|----------------|
| unit | feature disabled → `apply_caption` no-op | inline (config.enabled = false) |
| unit | mock LM emits "사진 한 장" → `block.caption.text = "사진 한 장"` | inline |
| unit | mock LM emits empty token stream → `block.caption = Some(ModelCaption { text: "" })` | inline |
| unit | Korean lang hint produces Korean prompt; English hint → English prompt | inline |
| unit | downscale honors `max_pixels` (resulting bytes < some threshold) | fixture large image |
| determinism | identical input + temperature=0 + seed=0 → identical caption (mock) | inline |
All tests under `cargo test -p kb-parse-image caption` with mock LM only.
## Definition of Done
- [ ] `cargo check -p kb-parse-image --features caption` passes
- [ ] `cargo test -p kb-parse-image caption` passes
- [ ] No imports outside Allowed dependencies
- [ ] Feature default OFF; only on when user opts in via config
- [ ] PR links design §3.4 ImageRefBlock.caption, §9.1
## Out of scope
- Multimodal RAG that uses caption text in answer (P+).
- CLIP / image embedding for cross-modal search (P+).
- Caption translation (P+).
## Risks / notes
- Vision LMs hallucinate. The system prompt explicitly forbids guessing, but expect false captions; UI and RAG must always label captions as model-generated.
- Ollama `qwen2.5-vl` accepts base64 images via `images:[]` — this is provider-specific; documenting the wire shape in the spec keeps adapter swaps cheap.
- Large images bloat prompt costs; cap aggressively (768×768 long edge default).