From c84ab03404c470b10fa1942cf7c56d0712f05df5 Mon Sep 17 00:00:00 2001
From: kb
Date: Mon, 27 Apr 2026 12:06:20 +0000
Subject: [PATCH] tasks: add P6 component specs (image-exif, ocr, caption)

---
 tasks/p6/p6-1-image-extractor-exif.md | 114 ++++++++++++++++++++++
 tasks/p6/p6-2-ocr-adapter.md          | 133 ++++++++++++++++++++++++++
 tasks/p6/p6-3-caption-adapter.md      | 122 +++++++++++++++++++++++
 3 files changed, 369 insertions(+)
 create mode 100644 tasks/p6/p6-1-image-extractor-exif.md
 create mode 100644 tasks/p6/p6-2-ocr-adapter.md
 create mode 100644 tasks/p6/p6-3-caption-adapter.md

diff --git a/tasks/p6/p6-1-image-extractor-exif.md b/tasks/p6/p6-1-image-extractor-exif.md
new file mode 100644
index 0000000..390f3aa
--- /dev/null
+++ b/tasks/p6/p6-1-image-extractor-exif.md
@@ -0,0 +1,114 @@
+---
+phase: P6
+component: kb-parse-image (image extractor + EXIF)
+task_id: p6-1
+title: "Image Extractor producing single-block CanonicalDocument + EXIF metadata"
+status: planned
+depends_on: [p0-1, p1-6]
+unblocks: [p6-2, p6-3]
+contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
+contract_sections: [§3.4 Block::ImageRef + ImageRefBlock, §3.7a OcrText/ModelCaption stubs, §9.1 image extraction policy, §9 versioning]
+---
+
+# p6-1: Image extractor (EXIF + structure)
+
+## Goal
+
+Implement `Extractor` for `MediaType::Image(_)` that produces a `CanonicalDocument` whose body is exactly one `ImageRefBlock`. EXIF is captured into `metadata.user.exif`. OCR and caption are intentionally left `None`; later tasks (p6-2, p6-3) populate them.
+
+## Why now / why this size
+
+Establishes the image-as-document contract and decouples extraction (asset → ImageRefBlock) from analysis (OCR / caption). Keeps the multimodal merge surface small.
+
+## Allowed dependencies
+
+- `kb-core`
+- `kb-config`
+- `image = "0.25"` (decoding for size + format detect)
+- `kamadak-exif` for EXIF
+- `serde`, `serde_json`
+- `time`
+- `tracing`
+- `thiserror`
+
+## Forbidden dependencies
+
+- `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-store-*`, `kb-embed*`, `kb-search`, `kb-llm*`, `kb-rag`, `kb-tui`, `kb-desktop`, OCR libs, LLM libs
+
+## Inputs
+
+| input | type | source |
+|-------|------|--------|
+| `RawAsset` | `kb_core::RawAsset` | from `kb-source-fs` |
+| image bytes | `&[u8]` | filesystem |
+| `parser_version` | `kb_core::ParserVersion` | constant in this crate (`"image-meta-v1"`) |
+
+## Outputs
+
+| output | type | downstream |
+|--------|------|------------|
+| `CanonicalDocument` | `kb_core::CanonicalDocument` | `kb-chunk` (image-region chunker) → `kb-store-sqlite` |
+
+## Public surface (signatures only; no new types)
+
+```rust
+pub struct ImageExtractor;
+
+impl kb_core::Extractor for ImageExtractor {
+    fn supports(&self, m: &kb_core::MediaType) -> bool { matches!(m, kb_core::MediaType::Image(_)) }
+    fn parser_version(&self) -> kb_core::ParserVersion { kb_core::ParserVersion("image-meta-v1".into()) }
+    fn extract(&self, ctx: &kb_core::ExtractContext, bytes: &[u8]) -> anyhow::Result<kb_core::CanonicalDocument>;
+}
+```
+
+## Behavior contract
+
+- One asset → one document. `title` = filename without extension; `lang = Lang("und")`.
+- `blocks` contains exactly one entry: `Block::ImageRef(ImageRefBlock { common, asset_id: Some(asset.asset_id), src: workspace_path, alt: filename, ocr: None, caption: None })`.
+- `common.source_span` = `SourceSpan::Region { x:0, y:0, w: width, h: height }` covering the entire image (width/height obtained from `image::ImageReader::new(std::io::Cursor::new(bytes)).with_guessed_format()?.into_dimensions()?`, which reads the header without decoding the full image).
+- `metadata.source_type = SourceType::Reference` (per design enum); `trust_level = TrustLevel::Primary`; `tags`/`aliases` empty.
+- `metadata.user["exif"]` = JSON object with whitelisted EXIF tags (DateTimeOriginal, GPS lat/lon, Make, Model, Orientation, Software). Missing tags omitted.
+- `metadata.user["dimensions"] = { "w": <width>, "h": <height>, "format": "<format>" }`.
+- `provenance` includes `Discovered` and `Parsed` events, but no `Normalized` event. ID assignment happens here directly, per the §3.4 stub from p1-4 logic. Piping through `kb-normalize` when available was the alternative; this task's choice is to emit a fully formed CanonicalDocument with deterministic IDs by calling `kb_core::id_for_doc` and `kb_core::id_for_block` directly.
+- Failure modes:
+  - Truncated/corrupt image → still emits a CanonicalDocument with `dimensions = null`, EXIF empty, `Provenance` warning event with the decoder error message.
+  - Unsupported format → `anyhow::Error` (caller skips).
+- Determinism: identical bytes + identical parser_version → identical `doc_id` and `block_id`.
+
+## Storage / wire effects
+
+- None directly (the caller persists via `kb-store-sqlite`).
+
+## Test plan
+
+| kind | description | fixture / data |
+|------|-------------|----------------|
+| unit | PNG decode produces correct dimensions in `metadata.user.dimensions` | `fixtures/image/red-100x50.png` |
+| unit | JPEG with EXIF GPS captured into `metadata.user.exif` | `fixtures/image/exif-with-gps.jpg` |
+| unit | image with no EXIF produces `metadata.user.exif = {}` | `fixtures/image/no-exif.png` |
+| unit | corrupt image: warning provenance, no panic | `fixtures/image/corrupt.png` |
+| determinism | identical bytes → identical `doc_id`, `block_id` across two runs | inline |
+| snapshot | `CanonicalDocument` JSON stable for fixture | `fixtures/image/red-100x50.png` |
+
+All tests under `cargo test -p kb-parse-image`.
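The EXIF whitelist and the dimension cap described in this spec can be sketched with the standard library alone. This is an illustrative sketch, not the crate's implementation: `EXIF_WHITELIST`, `filter_exif`, `MAX_EDGE`, and `capped_dimensions` are hypothetical names, the tag spellings `GPSLatitude`/`GPSLongitude` are an assumed reading of "GPS lat/lon", and 16384 is an assumed reading of "~16k". Real code would read tags through `kamadak-exif` rather than from string pairs.

```rust
use std::collections::BTreeMap;

/// Hypothetical whitelist; tag spellings are assumptions based on the spec's list.
const EXIF_WHITELIST: [&str; 7] = [
    "DateTimeOriginal",
    "GPSLatitude",
    "GPSLongitude",
    "Make",
    "Model",
    "Orientation",
    "Software",
];

/// Keep only whitelisted tags. Tags missing from the input are simply omitted,
/// and a BTreeMap keeps key order stable so snapshot output does not shuffle.
pub fn filter_exif(raw: &[(&str, &str)]) -> BTreeMap<String, String> {
    raw.iter()
        .filter(|(tag, _)| EXIF_WHITELIST.contains(tag))
        .map(|(tag, value)| (tag.to_string(), value.to_string()))
        .collect()
}

/// Assumed cap for "~16k x 16k"; the exact value is not fixed by the spec.
const MAX_EDGE: u32 = 16_384;

/// Return `None` (serialized as null dimensions) when either edge exceeds the cap.
pub fn capped_dimensions(width: u32, height: u32) -> Option<(u32, u32)> {
    if width > MAX_EDGE || height > MAX_EDGE {
        None
    } else {
        Some((width, height))
    }
}
```

An allow-list means a tag such as a maker note never reaches `metadata.user["exif"]` unless it is named explicitly, which is what keeps the PII surface small.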
+
+## Definition of Done
+
+- [ ] `cargo check -p kb-parse-image` passes
+- [ ] `cargo test -p kb-parse-image` passes
+- [ ] No OCR/caption/embedding code present
+- [ ] No imports outside Allowed dependencies
+- [ ] PR links design §3.4, §9.1
+
+## Out of scope
+
+- OCR text (p6-2).
+- Captioning (p6-3).
+- CLIP / visual embedding (P+).
+- HEIC / RAW formats (out of scope; record as Other and accept failure for v1).
+
+## Risks / notes
+
+- `image` crate doesn't decode HEIC; document and accept skip. Apple Vision sidecar (P+) can fill this gap.
+- EXIF whitelist keeps PII surface small (no thumbnails, no maker notes). Document the list in the spec section.
+- Cap decode dimensions to ~16k×16k; oversized → warning + null dimensions instead of attempted decode.
diff --git a/tasks/p6/p6-2-ocr-adapter.md b/tasks/p6/p6-2-ocr-adapter.md
new file mode 100644
index 0000000..f03f7ce
--- /dev/null
+++ b/tasks/p6/p6-2-ocr-adapter.md
@@ -0,0 +1,133 @@
+---
+phase: P6
+component: kb-parse-image (OCR adapter)
+task_id: p6-2
+title: "OcrEngine trait + Tesseract adapter (Apple Vision feature-gated)"
+status: planned
+depends_on: [p6-1]
+unblocks: [p6-3]
+contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
+contract_sections: [§3.4 ImageRefBlock.ocr, §3.7a OcrText/OcrRegion, §9.1 OCR vs caption provenance]
+---
+
+# p6-2: OCR adapter
+
+## Goal
+
+Define `OcrEngine` trait + a Tesseract-backed default implementation. Populate `ImageRefBlock.ocr` with `OcrText { joined, regions, engine, engine_version }`. Provide an `apple-vision` feature gate that switches to a sidecar binary on macOS.
+
+## Why now / why this size
+
+Strict separation of OCR (observed text) from caption (model-generated). Confining engine choice to a single trait + adapter lets us swap to Apple Vision or PaddleOCR without touching the extractor or chunker.
+
+## Allowed dependencies
+
+- `kb-core`
+- `kb-config`
+- `kb-parse-image` (consumes its types)
+- `tesseract = "0.13"` (feature `tesseract`, default ON)
+- For feature `apple-vision`: `std::process::Command` only (sidecar binary, not a Rust dep)
+- `serde`, `serde_json`
+- `image`
+- `tracing`
+- `thiserror`
+
+## Forbidden dependencies
+
+- `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-store-*`, `kb-embed*`, `kb-search`, `kb-llm*`, `kb-rag`, `kb-tui`, `kb-desktop`
+
+## Inputs
+
+| input | type | source |
+|-------|------|--------|
+| image bytes | `&[u8]` | from extractor |
+| optional language hint | `kb_core::Lang` | metadata |
+| `kb-config` OCR settings | engine name, languages | runtime |
+
+## Outputs
+
+| output | type | downstream |
+|--------|------|------------|
+| `OcrText` | `kb_core::OcrText` | merged into `ImageRefBlock.ocr` |
+
+## Public surface (signatures only; no new types)
+
+```rust
+pub trait OcrEngine: Send + Sync {
+    fn engine_name(&self) -> &'static str;
+    fn engine_version(&self) -> String;
+    fn recognize(&self, image_bytes: &[u8], lang_hint: Option<&kb_core::Lang>) -> anyhow::Result<kb_core::OcrText>;
+}
+
+pub struct TesseractOcr { /* internal: lazy api handle */ }
+impl TesseractOcr { pub fn new(config: &kb_config::Config) -> anyhow::Result<Self>; }
+impl OcrEngine for TesseractOcr { /* per trait */ }
+
+#[cfg(feature = "apple-vision")]
+pub struct AppleVisionOcr { /* sidecar path */ }
+#[cfg(feature = "apple-vision")]
+impl OcrEngine for AppleVisionOcr { /* per trait */ }
+
+pub fn apply_ocr(
+    engine: &dyn OcrEngine,
+    image_bytes: &[u8],
+    block: &mut kb_core::ImageRefBlock,
+    lang_hint: Option<&kb_core::Lang>,
+) -> anyhow::Result<()>;
+```
+
+## Behavior contract
+
+- Tesseract:
+  - Languages from `config.ocr.languages` (default `["eng", "kor"]`).
+  - Recognition produces `OcrRegion { bbox: (x, y, w, h), text, confidence }` for each "word" or "line" (configurable; default "line").
+  - Drop regions with `confidence < config.ocr.min_confidence` (default 60.0). If all dropped, return `OcrText { joined: "", regions: vec![], engine, engine_version }`.
+  - `joined` = `regions.iter().map(|r| r.text.as_str()).collect::<Vec<_>>().join(" ")` (no smart layout reconstruction in v1).
+  - `engine = "tesseract"`, `engine_version = tesseract::version()`.
+- Apple Vision sidecar (feature `apple-vision`):
+  - Spawn a small Swift binary `kb-vision-ocr` (path from `config.ocr.apple_vision_binary`) feeding the image via stdin and reading JSON `{ regions: [{x,y,w,h,text,confidence}, ...] }` from stdout.
+  - Same threshold and `joined` rules as Tesseract. `engine = "apple-vision"`, `engine_version = sidecar's --version`.
+  - This subagent task does NOT write the Swift sidecar; it only wires the Rust side. Document the expected sidecar interface in `docs/spec/sidecar-vision.md` (separate doc spec stub, optional).
+- `apply_ocr` calls `engine.recognize` and sets `block.ocr = Some(text)`. Appending the `Provenance::OcrApplied` event to the CanonicalDocument is the caller's responsibility; this task only exposes the helper.
+- Streaming / large images: cap decoded image size at 8192×8192 before passing to OCR; downscale with `image::imageops::resize` if larger.
+- Trust: `OcrText` is **observed text** (high trust). Captions (`ModelCaption`) are NOT generated here.
+- Determinism: Tesseract is deterministic for a fixed input + fixed page-segmentation mode; the dev tests verify this by running recognition twice on the same input. Apple Vision is also deterministic in practice but may vary across macOS versions; document this and accept.
+
+## Storage / wire effects
+
+- None.
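The confidence threshold and `joined` rules from the behavior contract can be sketched as below. The `OcrRegion` and `OcrText` structs here are local stand-ins that mirror the field names in this spec; the real types live in `kb_core` and may differ in field types. `build_ocr_text` is a hypothetical helper name.

```rust
/// Local stand-in for kb_core::OcrRegion (field types are assumptions).
#[derive(Debug, Clone, PartialEq)]
pub struct OcrRegion {
    pub bbox: (u32, u32, u32, u32),
    pub text: String,
    pub confidence: f32,
}

/// Local stand-in for kb_core::OcrText (field types are assumptions).
#[derive(Debug, Clone, PartialEq)]
pub struct OcrText {
    pub joined: String,
    pub regions: Vec<OcrRegion>,
    pub engine: String,
    pub engine_version: String,
}

/// Drop regions below `min_confidence`, then join the surviving texts with a
/// single space. An empty input, or one where every region is dropped, yields
/// an empty `joined` string and an empty region list rather than an error.
pub fn build_ocr_text(
    raw: Vec<OcrRegion>,
    min_confidence: f32,
    engine: &str,
    engine_version: &str,
) -> OcrText {
    let regions: Vec<OcrRegion> = raw
        .into_iter()
        .filter(|r| r.confidence >= min_confidence)
        .collect();
    let joined = regions
        .iter()
        .map(|r| r.text.as_str())
        .collect::<Vec<_>>()
        .join(" ");
    OcrText {
        joined,
        regions,
        engine: engine.to_string(),
        engine_version: engine_version.to_string(),
    }
}
```

Because the rule is "drop when `confidence < min_confidence`", a region whose confidence equals the threshold exactly is kept.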
+
+## Test plan
+
+| kind | description | fixture / data |
+|------|-------------|----------------|
+| unit | Tesseract recognizes English on `fixtures/image/hello-world.png` (joined contains "hello world") | fixture |
+| unit | confidence threshold drops noise regions | fixture with low-quality text |
+| unit | Korean text recognized when `kor` language enabled | `fixtures/image/안녕.png` |
+| unit | empty result returns `OcrText { joined: "", regions: [], .. }` not error | `fixtures/image/no-text.png` |
+| unit | `apply_ocr` mutates block.ocr from None → Some | inline |
+| determinism | two runs of recognize on same input → identical OcrText | fixture |
+| `#[cfg(feature = "apple-vision")]` smoke | sidecar invocation captured (mock binary echoes fixed JSON) | inline mock |
+
+All tests under `cargo test -p kb-parse-image ocr`. Tesseract install required on CI host.
+
+## Definition of Done
+
+- [ ] `cargo check -p kb-parse-image --features tesseract` passes
+- [ ] `cargo test -p kb-parse-image ocr` passes
+- [ ] `apple-vision` feature compiles on macOS and gracefully no-ops on Linux
+- [ ] No imports outside Allowed dependencies
+- [ ] PR links design §3.4, §3.7a, §9.1
+
+## Out of scope
+
+- Caption (p6-3).
+- Visual embedding (P+).
+- Layout-aware reading order (P+).
+- PaddleOCR / EasyOCR adapters.
+
+## Risks / notes
+
+- Tesseract performance varies wildly with image quality; document `min_confidence` and default page-segmentation mode.
+- Apple Vision sidecar requires code signing for distribution; for v1 dev builds, accept unsigned binary from `~/.local/bin/kb-vision-ocr`.
+- Large image downscale loses small-text recognition; expose `config.ocr.max_pixels` so power users can tune.
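The size cap before OCR (8192×8192 in this spec) needs target dimensions that preserve aspect ratio, since `image::imageops::resize` takes an explicit new width and height. A minimal sketch of that arithmetic follows; `fit_within` is a hypothetical helper name, and treating the cap as a bound on the longer edge is an assumption about how `config.ocr.max_pixels` would be interpreted.

```rust
/// Compute dimensions that fit inside a `max_edge` x `max_edge` box while
/// preserving aspect ratio. Images already within the box are left unchanged.
pub fn fit_within(width: u32, height: u32, max_edge: u32) -> (u32, u32) {
    let long_edge = width.max(height);
    if long_edge <= max_edge || long_edge == 0 {
        return (width, height);
    }
    // Scale in u64 so the intermediate product cannot overflow u32,
    // and clamp to 1 so a very thin image never collapses to a zero edge.
    let scale = |v: u32| -> u32 {
        let scaled = (v as u64 * max_edge as u64) / long_edge as u64;
        scaled.max(1) as u32
    };
    (scale(width), scale(height))
}
```

For example, a 16384×4096 input with a cap of 8192 maps to 8192×2048. The result would then be passed as the new width and height to the resize call.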
diff --git a/tasks/p6/p6-3-caption-adapter.md b/tasks/p6/p6-3-caption-adapter.md
new file mode 100644
index 0000000..5616307
--- /dev/null
+++ b/tasks/p6/p6-3-caption-adapter.md
@@ -0,0 +1,122 @@
+---
+phase: P6
+component: kb-parse-image (caption adapter)
+task_id: p6-3
+title: "ModelCaption adapter (LanguageModel-driven, feature-gated)"
+status: planned
+depends_on: [p6-1, p4-2]
+unblocks: []
+contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
+contract_sections: [§3.4 ImageRefBlock.caption, §3.7a ModelCaption, §9.1 caption (model-generated, low trust)]
+---
+
+# p6-3: Caption adapter
+
+## Goal
+
+Optionally populate `ImageRefBlock.caption` with `ModelCaption { text, model, model_version }` produced by a vision-capable LM (e.g., `qwen2.5-vl:7b` via Ollama). Feature-gated; default OFF.
+
+## Why now / why this size
+
+Captioning closes the multimodal loop. Strict separation from OCR keeps trust levels distinct: captions are generated, OCR is observed. The adapter is small: a single trait method plus one prompt.
+
+## Allowed dependencies
+
+- `kb-core`
+- `kb-config`
+- `kb-parse-image`
+- `kb-llm` (LanguageModel trait)
+- `base64`
+- `serde`, `serde_json`
+- `image` (resize for prompt cost control)
+- `tracing`
+- `thiserror`
+
+## Forbidden dependencies
+
+- `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-store-*`, `kb-embed*`, `kb-search`, `kb-rag`, `kb-llm-local` (only via trait), `kb-tui`, `kb-desktop`
+
+## Inputs
+
+| input | type | source |
+|-------|------|--------|
+| image bytes | `&[u8]` | extractor |
+| `dyn LanguageModel` (vision-capable) | runtime | injected |
+| `kb-config.image.caption` | `{ enabled, max_pixels, prompt_template_version }` | runtime |
+
+## Outputs
+
+| output | type | downstream |
+|--------|------|------------|
+| `ModelCaption` | `kb_core::ModelCaption` | merged into `ImageRefBlock.caption` |
+
+## Public surface (signatures only; no new types)
+
+```rust
+pub fn caption_image(
+    llm: &dyn kb_core::LanguageModel,
+    image_bytes: &[u8],
+    cfg: &kb_config::Config,
+) -> anyhow::Result<kb_core::ModelCaption>;
+
+pub fn apply_caption(
+    llm: &dyn kb_core::LanguageModel,
+    image_bytes: &[u8],
+    block: &mut kb_core::ImageRefBlock,
+    cfg: &kb_config::Config,
+) -> anyhow::Result<()>;
+```
+
+## Behavior contract
+
+- Feature gate: if `config.image.caption.enabled = false` (default), `apply_caption` is a no-op (returns `Ok(())` without invoking LM).
+- Pre-process: downscale image to `config.image.caption.max_pixels` (default 768×768 long edge) preserving aspect; encode as PNG.
+- Build prompt:
+  - `system = "이미지를 한 문장으로 객관적으로 설명한다. 추측은 피하고, 보이는 것만 적는다."` (English: "Describe the image objectively in one sentence. Avoid speculation and write only what is visible.")
+  - `user` = `[image_base64]\n\n위 이미지를 한국어로 한 문장으로 설명하라.` (English: "Describe the image above in one sentence in Korean.") if the `lang` hint == "ko", or the English variant otherwise.
+  - The base64 wrapper assumes the LM adapter routes vision inputs via Ollama's `images: [base64]` field (this is provider-specific; the adapter is responsible for rendering the prompt to wire). For non-vision LMs, return an error and skip.
+- Call `llm.generate_stream(GenerateRequest { system, user, stop: vec!["\n\n"], max_tokens: 96, temperature: 0.0, seed: Some(0) })`. Collect tokens until `Done`.
+- `ModelCaption { text: collected, model: llm.model_ref().id, model_version: llm.model_ref().provider }` (use provider as a coarse "version" proxy; if a vision model exposes a stable revision, prefer that).
+- `apply_caption` sets `block.caption = Some(...)` and appends `Provenance::CaptionApplied` event.
+- Trust: a caption is **model-generated**. If the caller propagates trust into chunk-level UI, it should label the caption `trust_level = TrustLevel::Generated`; this task only emits the `ModelCaption`.
+- Failure modes:
+  - LM error → return `anyhow::Error`; caller may decide to skip (do not fail the entire ingest).
+  - Empty LM output → still set `block.caption = Some(ModelCaption { text: "", .. })` so downstream code can distinguish "captioning attempted, no result" from "captioning never attempted".
+- Determinism: `temperature=0` + `seed=0`. Tests use `MockLanguageModel` to assert deterministic captions.
+
+## Storage / wire effects
+
+- None directly. Caller persists via `kb-store-sqlite`.
+
+## Test plan
+
+| kind | description | fixture / data |
+|------|-------------|----------------|
+| unit | feature disabled → `apply_caption` no-op | inline (config.enabled = false) |
+| unit | mock LM emits "사진 한 장" → `block.caption.text = "사진 한 장"` | inline |
+| unit | mock LM emits empty token stream → `block.caption = Some(ModelCaption { text: "", .. })` | inline |
+| unit | Korean lang hint produces Korean prompt; English hint → English prompt | inline |
+| unit | downscale honors `max_pixels` (resulting bytes < some threshold) | fixture large image |
+| determinism | identical input + temperature=0 + seed=0 → identical caption (mock) | inline |
+
+All tests under `cargo test -p kb-parse-image caption` with mock LM only.
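The config gate, the "collect tokens until `Done`" step, and the empty-output rule can be sketched without any LM dependency. Everything here is a stand-in under stated assumptions: `StreamEvent` is a hypothetical shape for whatever `generate_stream` yields, `ModelCaption` mirrors only the field names in this spec, and `apply_caption_sketch` takes a plain `enabled` flag and an `Option` slot instead of `kb_config::Config` and `ImageRefBlock`.

```rust
/// Hypothetical stream item; the real `generate_stream` output type is not
/// specified in this document.
#[derive(Debug, Clone, PartialEq)]
pub enum StreamEvent {
    Token(String),
    Done,
}

/// Local stand-in for kb_core::ModelCaption (field types are assumptions).
#[derive(Debug, Clone, PartialEq)]
pub struct ModelCaption {
    pub text: String,
    pub model: String,
    pub model_version: String,
}

/// Concatenate tokens until the first `Done`; anything after it is ignored.
pub fn collect_caption<I>(events: I) -> String
where
    I: IntoIterator<Item = StreamEvent>,
{
    let mut text = String::new();
    for event in events {
        match event {
            StreamEvent::Token(t) => text.push_str(&t),
            StreamEvent::Done => break,
        }
    }
    text
}

/// When disabled, leave the slot untouched (captioning never attempted).
/// When enabled, always write `Some`, even for empty text, so downstream code
/// can tell "attempted, no result" apart from "never attempted".
pub fn apply_caption_sketch<I>(enabled: bool, events: I, slot: &mut Option<ModelCaption>)
where
    I: IntoIterator<Item = StreamEvent>,
{
    if !enabled {
        return;
    }
    *slot = Some(ModelCaption {
        text: collect_caption(events),
        model: "mock-model".to_string(),   // placeholder for llm.model_ref().id
        model_version: "mock".to_string(), // placeholder for llm.model_ref().provider
    });
}
```

Keeping `None` and `Some` with empty text as separate states is what lets a later re-ingest decide whether an image still needs a captioning pass.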
+
+## Definition of Done
+
+- [ ] `cargo check -p kb-parse-image --features caption` passes
+- [ ] `cargo test -p kb-parse-image caption` passes
+- [ ] No imports outside Allowed dependencies
+- [ ] Feature default OFF; only on when user opts in via config
+- [ ] PR links design §3.4 ImageRefBlock.caption, §9.1
+
+## Out of scope
+
+- Multimodal RAG that uses caption text in answer (P+).
+- CLIP / image embedding for cross-modal search (P+).
+- Caption translation (P+).
+
+## Risks / notes
+
+- Vision LMs hallucinate. The system prompt explicitly forbids guessing, but expect false captions; UI and RAG must always label captions as model-generated.
+- Ollama `qwen2.5-vl` accepts base64 images via `images:[]`. This is provider-specific; documenting the wire shape in the spec keeps adapter swaps cheap.
+- Large images bloat prompt costs; cap aggressively (768×768 long edge default).