kebab/tasks/p6/p6-3-caption-adapter.md
altair823 cd2213e48d feat(kebab-parse-image): P6-3 caption adapter — vision LM via trait
- Add new module `crates/kebab-parse-image/src/caption.rs`:
  • `caption_image(llm, bytes, lang_hint, cfg)`: operates on
    `&dyn LanguageModel`. A vision LM (e.g. gemma4:e4b) outputs a
    one-sentence objective description. Deterministic via
    temperature=0 / seed=0.
  • `apply_caption(llm, bytes, block, lang_hint, cfg, events)`:
    fills `block.caption = Some(...)` and appends one
    ProvenanceKind::CaptionApplied event. If
    `image.caption.enabled = false`, it is a clean no-op (Ok(())).
    On LM failure, block.caption stays None and no event is recorded.
  • Downscale long edge clamped to `[128, 1536]`. The PNG passthrough
    hot path is preserved; everything else gets a single decode + PNG
    re-encode.
  • Korean / English prompt branch (lang_hint="ko"/"kor" → Korean).
  • `ModelCaption.model_version = "<provider>/<prompt_template_version>"`
    (e.g. "ollama/caption-v1"), so prompt or model regressions can be
    audited.
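
The language branch and the `model_version` tag described in the list above can be sketched as follows. This is a minimal illustration, not the code in `caption.rs`: `PromptLang`, `select_prompt_lang`, and `model_version` are hypothetical names, and exact string matching on the hint is an assumption.

```rust
/// Prompt language chosen from the caller's hint.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PromptLang {
    Korean,
    English,
}

/// Hypothetical helper: "ko" / "kor" select Korean, anything else English.
/// Assumption: matching is exact (no case folding, no region subtags).
fn select_prompt_lang(lang_hint: Option<&str>) -> PromptLang {
    match lang_hint {
        Some("ko") | Some("kor") => PromptLang::Korean,
        _ => PromptLang::English,
    }
}

/// Hypothetical helper: compose "<provider>/<prompt_template_version>".
fn model_version(provider: &str, prompt_template_version: &str) -> String {
    format!("{provider}/{prompt_template_version}")
}
```

With these helpers, `model_version("ollama", "caption-v1")` yields the `"ollama/caption-v1"` form quoted in the commit message.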

## kebab-core / kebab-llm-local changes

- Add an `images: Vec<String>` field to `kebab_core::GenerateRequest`.
  `#[serde(default)]` keeps existing wire payloads / snapshots compatible.
- `kebab-llm-local::OllamaLanguageModel` routes req.images to Ollama's
  `images: [base64, ...]` wire field.
  With `#[serde(skip_serializing_if = is_empty)]`, the wire shape is
  byte-identical to pre-P6-3 when the list is empty.

## kebab-config

- New `ImageCfg.caption: CaptionCfg`:
  - `enabled: bool` (default false)
  - `max_pixels: u32` (default 768, clamped to [128, 1536])
  - `prompt_template_version: String` (default "caption-v1")
- Add three environment variables:
  `KEBAB_IMAGE_CAPTION_{ENABLED,MAX_PIXELS,PROMPT_TEMPLATE_VERSION}`.
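
The `max_pixels` default and clamp can be illustrated with a small helper. This is a sketch under stated assumptions: `resolve_max_pixels` is a hypothetical name, not a function in `kebab-config`, and the fallback to the default on a missing or unparseable value is assumed rather than taken from the source.

```rust
// Bounds and default taken from the config description above.
const MIN_LONG_EDGE: u32 = 128;
const MAX_LONG_EDGE: u32 = 1536;
const DEFAULT_LONG_EDGE: u32 = 768;

/// Hypothetical helper: turn the raw value of
/// `KEBAB_IMAGE_CAPTION_MAX_PIXELS` into an effective long-edge cap.
/// Assumption: a missing or unparseable value falls back to the default.
fn resolve_max_pixels(raw: Option<&str>) -> u32 {
    raw.and_then(|s| s.trim().parse::<u32>().ok())
        .unwrap_or(DEFAULT_LONG_EDGE)
        .clamp(MIN_LONG_EDGE, MAX_LONG_EDGE)
}
```

A caller could feed it the environment value with `resolve_max_pixels(std::env::var("KEBAB_IMAGE_CAPTION_MAX_PIXELS").ok().as_deref())`; an out-of-range value such as `99999` is pulled back to `1536`.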

## Spec deviations

Added a 2026-05-02 entry to `tasks/HOTFIXES.md`:
- Symptom 1: the p6-3 spec signature takes `&dyn LanguageModel`, but the
  frozen trait + GenerateRequest do not support vision. → Trait extended.
- Symptom 2: the spec's cargo feature `caption` (default OFF at compile
  time) → consolidated into a single runtime gate. There are no extra
  deps beyond base64/image/kebab-llm, so the binary-size savings of a
  cargo feature would be negligible.

Amendments to the p4-1 / p4-2 / p6-3 specs are stated explicitly.

## Tests

`cargo test -p kebab-parse-image --test caption` runs 9 tests + 1 ignored:
- feature gate (disabled → no-op / Err on direct call)
- happy path (block.caption Some + Provenance CaptionApplied)
- empty token stream → empty text + caption.is_some()
- req.images routing verified with CapturingMock (one base64 entry, decodable)
- Korean / English prompt branch (via the system prompt captured by CapturingMock)
- LM Err → block.caption stays None + no events recorded
- determinism (same mock input → same caption)
- max_pixels clamp (99999 → 1536, verified by downscaling a 4000×3000 PNG)
- opt-in integration (real Ollama at 192.168.0.47 / gemma4:e4b → "The image is
  a solid red color." verified, 4.3 s)

`cargo test --workspace --no-fail-fast -j 1` passes in full.
`cargo clippy --workspace --all-targets -- -D warnings` passes.

## Dependency boundaries

- Added deps: `kebab-llm` (trait only), `base64` (already added in P6-2).
- dev-deps: `MockLanguageModel` via `kebab-llm/mock`, and
  `kebab-llm-local` (integration tests only; not a runtime dep).
- No forbidden dependencies touched: `kebab-source-fs / parse-md / normalize /
  chunk / store-* / embed* / search / rag / UI` are not referenced.

contract: docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
sections: §3.4 ImageRefBlock.caption, §3.7a ModelCaption, §9.1
caption (model-generated, low trust).
2026-05-02 06:05:39 +00:00

---
phase: P6
component: kebab-parse-image (caption adapter)
task_id: p6-3
title: "ModelCaption adapter (LanguageModel-driven, feature-gated)"
status: completed
depends_on: [p6-1, p4-2]
unblocks: []
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
contract_sections: [§3.4 ImageRefBlock.caption, §3.7a ModelCaption, §9.1 caption (model-generated, low trust)]
---
# p6-3 — Caption adapter
## Goal
Optionally populate `ImageRefBlock.caption` with `ModelCaption { text, model, model_version }` produced by a vision-capable LM (e.g., `qwen2.5-vl:7b` via Ollama). Feature-gated; default OFF.
## Why now / why this size
Captioning closes the multimodal loop. Strict separation from OCR keeps trust levels distinct: captions are generated, OCR is observed. Adapter is small — single trait method + one prompt.
## Allowed dependencies
- `kebab-core`
- `kebab-config`
- `kebab-parse-image`
- `kebab-llm` (LanguageModel trait)
- `base64`
- `serde`, `serde_json`
- `image` (resize for prompt cost control)
- `tracing`
- `thiserror`
## Forbidden dependencies
- `kebab-source-fs`, `kebab-parse-md`, `kebab-normalize`, `kebab-chunk`, `kebab-store-*`, `kebab-embed*`, `kebab-search`, `kebab-rag`, `kebab-llm-local` (only via trait), `kebab-tui`, `kebab-desktop`
## Inputs
| input | type | source |
|-------|------|--------|
| image bytes | `&[u8]` | extractor |
| `dyn LanguageModel` (vision-capable) | runtime | injected |
| `kebab-config.image.caption` | `{ enabled, max_pixels, prompt_template_version }` | runtime |
## Outputs
| output | type | downstream |
|--------|------|------------|
| `ModelCaption` | `kebab_core::ModelCaption` | merged into `ImageRefBlock.caption` |
## Public surface (signatures only — no new types)
```rust
pub fn caption_image(
    llm: &dyn kebab_core::LanguageModel,
    image_bytes: &[u8],
    cfg: &kebab_config::Config,
) -> anyhow::Result<kebab_core::ModelCaption>;

pub fn apply_caption(
    llm: &dyn kebab_core::LanguageModel,
    image_bytes: &[u8],
    block: &mut kebab_core::ImageRefBlock,
    cfg: &kebab_config::Config,
) -> anyhow::Result<()>;
```
## Behavior contract
- Feature gate: if `config.image.caption.enabled = false` (default), `apply_caption` is a no-op (returns `Ok(())` without invoking LM).
- Pre-process: downscale the image so its long edge is at most `config.image.caption.max_pixels` (default 768), preserving aspect ratio; encode as PNG.
- Build prompt:
  - `system = "이미지를 한 문장으로 객관적으로 설명한다. 추측은 피하고, 보이는 것만 적는다."`
  - `user` = `[image_base64]\n\n위 이미지를 한국어로 한 문장으로 설명하라.` (if `lang` hint == "ko") or English variant otherwise.
  - The base64 wrapper assumes the LM adapter routes vision inputs via Ollama's `images: [base64]` field (this is provider-specific; the adapter is responsible for rendering the prompt to wire). For non-vision LMs, return an error and skip.
- Call `llm.generate_stream(GenerateRequest { system, user, stop: vec!["\n\n"], max_tokens: 96, temperature: 0.0, seed: Some(0) })`. Collect tokens until `Done`.
- `ModelCaption { text: collected, model: llm.model_ref().id, model_version: llm.model_ref().provider }` (use provider as a coarse "version" proxy; if a vision model exposes a stable revision, prefer that).
- `apply_caption` sets `block.caption = Some(...)` and appends `Provenance::CaptionApplied` event.
- Trust: caption is **model-generated** and labeled `trust_level = TrustLevel::Generated` if the caller propagates trust into chunk-level UI; this task only emits the `ModelCaption`.
- Failure modes:
  - LM error → return `anyhow::Error`; caller may decide to skip (do not fail the entire ingest).
  - Empty LM output → still set `block.caption = Some(ModelCaption { text: "" })` so downstream code can distinguish "captioning attempted, no result" from "captioning never attempted".
- Determinism: `temperature=0` + `seed=0`. Tests use `MockLanguageModel` to assert deterministic captions.
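
The gate, token collection, and failure semantics in the contract above can be sketched with stand-in types. Everything here is hypothetical and simplified for illustration: the real `LanguageModel`, `ModelCaption`, and `ImageRefBlock` live in the kebab crates and have different shapes, the stream is modeled as a plain `Vec`, and the provenance event is omitted.

```rust
// Hypothetical stand-ins for illustration; the real kebab types differ.
#[derive(Debug, Clone, PartialEq)]
struct ModelCaption {
    text: String,
    model: String,
    model_version: String,
}

#[derive(Debug, Default)]
struct ImageRefBlock {
    caption: Option<ModelCaption>,
}

/// Simplified stream item: text tokens followed by a terminal `Done`.
enum StreamItem {
    Token(String),
    Done,
}

/// Minimal stand-in for the LanguageModel trait (assumed shape).
trait LanguageModel {
    fn generate_stream(&self, system: &str, user: &str) -> Result<Vec<StreamItem>, String>;
}

/// Mock that replays fixed tokens, or fails when `fail` is set.
struct MockLanguageModel {
    tokens: Vec<&'static str>,
    fail: bool,
}

impl LanguageModel for MockLanguageModel {
    fn generate_stream(&self, _system: &str, _user: &str) -> Result<Vec<StreamItem>, String> {
        if self.fail {
            return Err("mock LM failure".to_string());
        }
        let mut items: Vec<StreamItem> = self
            .tokens
            .iter()
            .map(|t| StreamItem::Token(t.to_string()))
            .collect();
        items.push(StreamItem::Done);
        Ok(items)
    }
}

/// Sketch of the gate, collect, and assign steps.
fn apply_caption(
    llm: &dyn LanguageModel,
    enabled: bool,
    block: &mut ImageRefBlock,
) -> Result<(), String> {
    if !enabled {
        return Ok(()); // feature gate: no LM call, block untouched
    }
    // `?` returns before the block is modified, so an LM error leaves
    // `block.caption` exactly as it was.
    let items = llm.generate_stream("system prompt", "user prompt")?;
    let mut text = String::new();
    for item in items {
        match item {
            StreamItem::Token(t) => text.push_str(&t),
            StreamItem::Done => break,
        }
    }
    // An empty `text` is still stored: "attempted, no result".
    block.caption = Some(ModelCaption {
        text,
        model: "mock-model".to_string(),
        model_version: "mock".to_string(),
    });
    Ok(())
}
```

Because the mock replays the same tokens on every call, two runs over the same input produce the same caption, which is the property the determinism test relies on.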
## Storage / wire effects
- None directly. Caller persists via `kebab-store-sqlite`.
## Test plan
| kind | description | fixture / data |
|------|-------------|----------------|
| unit | feature disabled → `apply_caption` no-op | inline (config.enabled = false) |
| unit | mock LM emits "사진 한 장" → `block.caption.text = "사진 한 장"` | inline |
| unit | mock LM emits empty token stream → `block.caption = Some(ModelCaption { text: "" })` | inline |
| unit | Korean lang hint produces Korean prompt; English hint → English prompt | inline |
| unit | downscale honors `max_pixels` (resulting bytes < some threshold) | fixture large image |
| determinism | identical input + temperature=0 + seed=0 → identical caption (mock) | inline |
All tests under `cargo test -p kebab-parse-image caption` with mock LM only.
## Definition of Done
- [ ] `cargo check -p kebab-parse-image --features caption` passes
- [ ] `cargo test -p kebab-parse-image caption` passes
- [ ] No imports outside Allowed dependencies
- [ ] Feature default OFF; only on when user opts in via config
- [ ] PR links design §3.4 ImageRefBlock.caption, §9.1
## Out of scope
- Multimodal RAG that uses caption text in answer (P+).
- CLIP / image embedding for cross-modal search (P+).
- Caption translation (P+).
## Risks / notes
- Vision LMs hallucinate. The system prompt explicitly forbids guessing, but expect false captions; UI and RAG must always label captions as model-generated.
- Ollama `qwen2.5-vl` accepts base64 images via `images:[]` — this is provider-specific; documenting the wire shape in the spec keeps adapter swaps cheap.
- Large images bloat prompt costs; cap aggressively (default long edge: 768 px).
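
The long-edge cap amounts to a small dimension calculation. The sketch below shows only the arithmetic; `fit_long_edge` is a hypothetical helper, the no-upscaling rule is an assumption, and the actual resampling (which the spec delegates to the `image` crate) is not shown.

```rust
/// Hypothetical helper: shrink (width, height) so the long edge is at most
/// `max_long_edge`, preserving aspect ratio.
/// Assumption: images already within the cap are left unchanged (no upscaling).
fn fit_long_edge(width: u32, height: u32, max_long_edge: u32) -> (u32, u32) {
    let long = width.max(height);
    if long == 0 || long <= max_long_edge {
        return (width, height);
    }
    // Scale in u64 so `side * max_long_edge` cannot overflow.
    let scale = |side: u32| -> u32 {
        let scaled = u64::from(side) * u64::from(max_long_edge) / u64::from(long);
        scaled.max(1) as u32 // keep at least one pixel on the short edge
    };
    (scale(width), scale(height))
}
```

For a 4000×3000 input and a cap of 1536, this gives 1536×1152, so the aspect ratio of 4:3 is kept.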