kebab/tasks/p6/p6-3-caption-adapter.md
altair823 f9714aa5cb docs(rename): kb → kebab — README, tasks/, docs/, design doc, report
Final commit. Bulk-updates every occurrence of the word `kb` in all .md files.

- 19 crate names (`kb-core`, `kb-app`, …) → `kebab-*` (including the
  Rust module path notation `kb_*` → `kebab_*`).
- Future components (`kb-tui`, `kb-desktop`, `kb-asr-whisper`, `kb-ocr`,
  `kb-mcp`, `kb-vlm`, `kb-rerank`, `kb-vision-ocr`, `kb-index`,
  `kb-smoke`, `kb-architecture`) → `kebab-*` (the same prefix will be
  used when P6+ starts).
- CLI command examples: `kb ingest` / `kb search` / `kb ask` / `kb init` /
  `kb doctor` / `kb inspect` / `kb list` / `kb eval` →
  `kebab <verb>`. Both fenced code blocks and inline backticks.
- XDG paths + env vars + binary path (`target/release/kb` →
  `target/release/kebab`) synchronized.
- All references unified across the design doc / initial report /
  SMOKE / HOTFIXES / phase epics / task specs.
- `git -c user.name=kb` in task-decomposition.md is kept as-is because
  it records author information from past git history (the author in
  the actual git history cannot be changed).
- `status: planned` → `completed` in `tasks/phase-5-evaluation.md` is
  included as well (left unreflected after the P5-1 + P5-2 PRs merged).

## Verification

- `grep -rEn "\bkb-[a-z]|\bkb_[a-z]|\.config/kb\b|kb\.sqlite|\bKB_[A-Z]"
   --include="*.md"` returns 0 hits (excluding the git author in
  task-decomposition.md).
- All file path references are live (every renamed file is updated to
  its new path).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 04:01:55 +00:00

---
phase: P6
component: kebab-parse-image (caption adapter)
task_id: p6-3
title: "ModelCaption adapter (LanguageModel-driven, feature-gated)"
status: planned
depends_on: [p6-1, p4-2]
unblocks: []
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
contract_sections: [§3.4 ImageRefBlock.caption, §3.7a ModelCaption, §9.1 caption (model-generated, low trust)]
---
# p6-3 — Caption adapter
## Goal
Optionally populate `ImageRefBlock.caption` with `ModelCaption { text, model, model_version }` produced by a vision-capable LM (e.g., `qwen2.5-vl:7b` via Ollama). Feature-gated; default OFF.
## Why now / why this size
Captioning closes the multimodal loop. Strict separation from OCR keeps trust levels distinct: captions are generated, OCR is observed. Adapter is small — single trait method + one prompt.
## Allowed dependencies
- `kebab-core`
- `kebab-config`
- `kebab-parse-image`
- `kebab-llm` (LanguageModel trait)
- `base64`
- `serde`, `serde_json`
- `image` (resize for prompt cost control)
- `tracing`
- `thiserror`
## Forbidden dependencies
- `kebab-source-fs`, `kebab-parse-md`, `kebab-normalize`, `kebab-chunk`, `kebab-store-*`, `kebab-embed*`, `kebab-search`, `kebab-rag`, `kebab-llm-local` (only via trait), `kebab-tui`, `kebab-desktop`
## Inputs
| input | type | source |
|-------|------|--------|
| image bytes | `&[u8]` | extractor |
| `dyn LanguageModel` (vision-capable) | runtime | injected |
| `kebab-config.image.caption` | `{ enabled, max_pixels, prompt_template_version }` | runtime |
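
As a rough illustration of the config row above, the `image.caption` section might map to a struct like the following. This is only a sketch: the field names come from the table, while the struct name `CaptionConfig`, the field types, and the default `prompt_template_version` value are assumptions, and `max_pixels` is modeled here as a long-edge limit.

```rust
/// Hypothetical shape of the `image.caption` config section.
/// Field names come from the Inputs table; the types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct CaptionConfig {
    /// Feature gate; captioning stays off unless the user opts in.
    pub enabled: bool,
    /// Size cap applied before the image is sent to the LM
    /// (modeled here as a long-edge limit in pixels).
    pub max_pixels: u32,
    /// Version tag of the prompt template (type and value assumed).
    pub prompt_template_version: u32,
}

impl Default for CaptionConfig {
    fn default() -> Self {
        Self {
            enabled: false, // default OFF, per the Goal section
            max_pixels: 768,
            prompt_template_version: 1,
        }
    }
}
```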
## Outputs
| output | type | downstream |
|--------|------|------------|
| `ModelCaption` | `kebab_core::ModelCaption` | merged into `ImageRefBlock.caption` |
## Public surface (signatures only — no new types)
```rust
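/// Produces a `ModelCaption` for `image_bytes` using a vision-capable LM.
pub fn caption_image(
    llm: &dyn kebab_core::LanguageModel,
    image_bytes: &[u8],
    cfg: &kebab_config::Config,
) -> anyhow::Result<kebab_core::ModelCaption>;

/// Sets `block.caption` from the LM output; a no-op when
/// `config.image.caption.enabled` is false.
pub fn apply_caption(
    llm: &dyn kebab_core::LanguageModel,
    image_bytes: &[u8],
    block: &mut kebab_core::ImageRefBlock,
    cfg: &kebab_config::Config,
) -> anyhow::Result<()>;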
```
## Behavior contract
- Feature gate: if `config.image.caption.enabled = false` (default), `apply_caption` is a no-op (returns `Ok(())` without invoking LM).
- Pre-process: downscale the image to fit `config.image.caption.max_pixels` (default: 768×768, i.e. long edge ≤ 768 px), preserving aspect ratio; encode as PNG.
- Build prompt:
  - `system = "이미지를 한 문장으로 객관적으로 설명한다. 추측은 피하고, 보이는 것만 적는다."` (in English: "Describe the image objectively in one sentence. Avoid guessing; write only what is visible.")
  - `user` = `[image_base64]\n\n위 이미지를 한국어로 한 문장으로 설명하라.` (in English: "Describe the image above in one sentence in Korean.") if the `lang` hint == "ko"; otherwise the English variant.
  - The base64 wrapper assumes the LM adapter routes vision inputs via Ollama's `images: [base64]` field (this is provider-specific; the adapter is responsible for rendering the prompt to wire). For non-vision LMs, return an error and skip.
- Call `llm.generate_stream(GenerateRequest { system, user, stop: vec!["\n\n"], max_tokens: 96, temperature: 0.0, seed: Some(0) })`. Collect tokens until `Done`.
- `ModelCaption { text: collected, model: llm.model_ref().id, model_version: llm.model_ref().provider }` (use provider as a coarse "version" proxy; if a vision model exposes a stable revision, prefer that).
- `apply_caption` sets `block.caption = Some(...)` and appends `Provenance::CaptionApplied` event.
- Trust: the caption is **model-generated**. A caller that propagates trust into chunk-level UI should label it `trust_level = TrustLevel::Generated`; this task only emits the `ModelCaption`.
- Failure modes:
  - LM error → return `anyhow::Error`; caller may decide to skip (do not fail the entire ingest).
  - Empty LM output → still set `block.caption = Some(ModelCaption { text: "" })` so downstream code can distinguish "captioning attempted, no result" from "captioning never attempted".
- Determinism: `temperature=0` + `seed=0`. Tests use `MockLanguageModel` to assert deterministic captions.
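
The gate, token collection, and empty-output rules above can be sketched as follows. This is a minimal illustration, not the crate's implementation: every type here (`StreamEvent`, `ModelCaption`, `ImageRefBlock`, `LanguageModel`, `MockLanguageModel`) is a simplified stand-in for the real `kebab_core` items, the trait methods and the `String` error type are assumptions, and image pre-processing, prompt building, and the `Provenance::CaptionApplied` event are left out.

```rust
/// Stand-in for one item of the LM token stream (assumed shape).
pub enum StreamEvent {
    Token(String),
    Done,
}

/// Stand-in for `kebab_core::ModelCaption`.
#[derive(Debug, Clone, PartialEq)]
pub struct ModelCaption {
    pub text: String,
    pub model: String,
    pub model_version: String,
}

/// Stand-in for `kebab_core::ImageRefBlock`; only the caption slot is modeled.
#[derive(Debug, Default)]
pub struct ImageRefBlock {
    pub caption: Option<ModelCaption>,
}

/// Stand-in for the LanguageModel trait, reduced to what this sketch needs.
pub trait LanguageModel {
    fn model_id(&self) -> String;
    fn provider(&self) -> String;
    fn generate_stream(&self, system: &str, user: &str) -> Result<Vec<StreamEvent>, String>;
}

/// Mock LM that replays a fixed token list, so its output is deterministic.
pub struct MockLanguageModel {
    pub tokens: Vec<&'static str>,
}

impl LanguageModel for MockLanguageModel {
    fn model_id(&self) -> String {
        "mock-vl".to_string()
    }
    fn provider(&self) -> String {
        "mock".to_string()
    }
    fn generate_stream(&self, _system: &str, _user: &str) -> Result<Vec<StreamEvent>, String> {
        let mut events: Vec<StreamEvent> = self
            .tokens
            .iter()
            .map(|t| StreamEvent::Token(t.to_string()))
            .collect();
        events.push(StreamEvent::Done);
        Ok(events)
    }
}

pub fn apply_caption(
    llm: &dyn LanguageModel,
    block: &mut ImageRefBlock,
    enabled: bool,
) -> Result<(), String> {
    // Feature gate: disabled means no LM call and no change to the block.
    if !enabled {
        return Ok(());
    }
    // LM errors propagate; the caller decides whether to skip the image.
    let events = llm.generate_stream("system prompt", "user prompt")?;
    let mut text = String::new();
    for event in events {
        match event {
            StreamEvent::Token(t) => text.push_str(&t),
            StreamEvent::Done => break,
        }
    }
    // Set the caption even when `text` is empty, so "attempted, no result"
    // stays distinguishable from "never attempted" (None).
    block.caption = Some(ModelCaption {
        text,
        model: llm.model_id(),
        model_version: llm.provider(),
    });
    Ok(())
}
```

Returning the error instead of swallowing it keeps the skip-or-fail decision with the caller, which matches the failure-mode rule that one bad image should not abort the whole ingest.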
## Storage / wire effects
- None directly. Caller persists via `kebab-store-sqlite`.
## Test plan
| kind | description | fixture / data |
|------|-------------|----------------|
| unit | feature disabled → `apply_caption` no-op | inline (config.enabled = false) |
| unit | mock LM emits "사진 한 장" → `block.caption.text = "사진 한 장"` | inline |
| unit | mock LM emits empty token stream → `block.caption = Some(ModelCaption { text: "" })` | inline |
| unit | Korean lang hint produces Korean prompt; English hint → English prompt | inline |
| unit | downscale honors `max_pixels` (decoded output fits within the limit, aspect ratio preserved) | fixture large image |
| determinism | identical input + temperature=0 + seed=0 → identical caption (mock) | inline |
All tests under `cargo test -p kebab-parse-image caption` with mock LM only.
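
The language-hint row in the test plan could be exercised against a small prompt builder along these lines. This is a sketch under assumptions: `build_user_prompt` is a hypothetical helper name, the Korean instruction is quoted from the behavior contract, and the English wording is invented for illustration because the spec only says "English variant otherwise".

```rust
/// Builds the user prompt for a language hint.
/// The Korean text is quoted from the behavior contract; the English
/// wording is an assumption (the spec does not give it).
pub fn build_user_prompt(image_base64: &str, lang: Option<&str>) -> String {
    let instruction = match lang {
        Some("ko") => "위 이미지를 한국어로 한 문장으로 설명하라.",
        _ => "Describe the image above in one sentence in English.",
    };
    // Image payload first, then a blank line, then the instruction.
    format!("{}\n\n{}", image_base64, instruction)
}
```

A unit test can then assert on the instruction text alone, without touching an LM at all.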
## Definition of Done
- [ ] `cargo check -p kebab-parse-image --features caption` passes
- [ ] `cargo test -p kebab-parse-image caption` passes
- [ ] No imports outside Allowed dependencies
- [ ] Feature default OFF; only on when user opts in via config
- [ ] PR links design §3.4 ImageRefBlock.caption, §9.1
## Out of scope
- Multimodal RAG that uses caption text in answer (P+).
- CLIP / image embedding for cross-modal search (P+).
- Caption translation (P+).
## Risks / notes
- Vision LMs hallucinate. The system prompt explicitly forbids guessing, but expect false captions; UI and RAG must always label captions as model-generated.
- Ollama `qwen2.5-vl` accepts base64 images via `images:[]` — this is provider-specific; documenting the wire shape in the spec keeps adapter swaps cheap.
- Large images bloat prompt costs; cap aggressively (default: long edge ≤ 768 px).
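
To make the size cap concrete, the target dimensions for an aspect-preserving downscale can be computed as below. This is a sketch that treats the cap as a long-edge limit (an assumption about how `max_pixels` is interpreted); `fit_long_edge` is a hypothetical helper, and the actual resize and PNG encoding would still be done with the `image` crate.

```rust
/// Returns (width, height) scaled so the longer edge is at most
/// `max_long_edge`, preserving aspect ratio. Images already within the
/// cap are returned unchanged (no upscaling).
pub fn fit_long_edge(width: u32, height: u32, max_long_edge: u32) -> (u32, u32) {
    let long = width.max(height);
    if long <= max_long_edge || long == 0 {
        return (width, height);
    }
    // u64 intermediates avoid overflow; adding long/2 rounds to nearest.
    let scale = |edge: u32| -> u32 {
        let scaled = (edge as u64 * max_long_edge as u64 + long as u64 / 2) / long as u64;
        scaled.max(1) as u32 // never collapse an edge to zero
    };
    (scale(width), scale(height))
}
```

For example, a 4000×3000 input with a 768 px cap maps to 768×576, while a 640×480 input passes through untouched.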