feat(kebab-parse-image): P6-3 caption adapter — vision LM via trait

- Add new module `crates/kebab-parse-image/src/caption.rs`:
  • `caption_image(llm, bytes, lang_hint, cfg)`: runs on top of `&dyn LanguageModel`. A vision LM (e.g. gemma4:e4b) outputs a one-sentence objective description. temperature=0 / seed=0 for determinism.
  • `apply_caption(llm, bytes, block, lang_hint, cfg, events)`: fills `block.caption = Some(...)` and appends one ProvenanceKind::CaptionApplied event. When `image.caption.enabled = false` it is a clean no-op (Ok(())). On LM failure, block.caption stays None and no event is recorded.
  • Downscale long edge clamped to `[128, 1536]`. The PNG passthrough hot path is preserved; everything else goes through a single decode + PNG re-encode.
  • Korean / English prompt branch (lang_hint="ko"/"kor" → Korean).
  • `ModelCaption.model_version = "<provider>/<prompt_template_version>"` (e.g. "ollama/caption-v1"), so prompt or model regressions can be audited.

## kebab-core / kebab-llm-local changes

- Add an `images: Vec<String>` field to `kebab_core::GenerateRequest`. `#[serde(default)]` keeps existing wire payloads / snapshots compatible.
- `kebab-llm-local::OllamaLanguageModel` routes req.images to Ollama's `images: [base64, ...]` wire field. With `#[serde(skip_serializing_if = is_empty)]`, the wire shape is byte-identical to pre-P6-3 when the list is empty.

## kebab-config

- New `ImageCfg.caption: CaptionCfg`:
  - `enabled: bool` (default false)
  - `max_pixels: u32` (default 768, clamped to [128, 1536])
  - `prompt_template_version: String` (default "caption-v1")
- Add three environment variables: `KEBAB_IMAGE_CAPTION_{ENABLED,MAX_PIXELS,PROMPT_TEMPLATE_VERSION}`.

## Spec deviations

Added a 2026-05-02 entry to `tasks/HOTFIXES.md`:

- Symptom 1: the spec p6-3 signature takes `&dyn LanguageModel`, but the frozen trait + GenerateRequest did not support vision. → Extended the trait.
- Symptom 2: the spec's cargo feature `caption` (default OFF at compile time) → merged into a single runtime gate. No deps are added beyond base64/image/kebab-llm, so the binary-size saving from a cargo feature is negligible.

The amendments to the p4-1 / p4-2 / p6-3 specs are stated explicitly.

## Tests

`cargo test -p kebab-parse-image --test caption`: 9 tests + 1 ignored:

- feature gate (disabled → no-op / Err on direct call)
- happy path (block.caption Some + Provenance CaptionApplied)
- empty token stream → empty text + caption.is_some()
- req.images routing verified with CapturingMock (one base64 entry, decodable)
- Korean / English prompt branch (via the system prompt captured by CapturingMock)
- LM Err → block.caption stays None + no events recorded
- determinism (identical mock input → identical caption)
- max_pixels clamp (99999 → 1536, verified by downscaling a 4000×3000 PNG)
- opt-in integration (real Ollama at 192.168.0.47 / gemma4:e4b → verified "The image is a solid red color.", 4.3 s)

`cargo test --workspace --no-fail-fast -j 1`: all pass. `cargo clippy --workspace --all-targets -- -D warnings`: pass.

## Dependency boundaries

- Added deps: `kebab-llm` (trait only), `base64` (already added in P6-2).
- dev-deps: `kebab-llm/mock` for `MockLanguageModel`, and `kebab-llm-local` (integration tests only; not a runtime dep).
- No forbidden-dependency violations: `kebab-source-fs / parse-md / normalize / chunk / store-* / embed* / search / rag / UI` are not referenced.

contract: docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
sections: §3.4 ImageRefBlock.caption, §3.7a ModelCaption, §9.1 caption (model-generated, low trust).
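
The long-edge clamp and the downscale arithmetic mentioned in the message can be illustrated with a small standalone sketch. It uses only the standard library and does not touch the `image` crate; the names `effective_cap` and `scaled_dims` are hypothetical helpers introduced here for illustration, and the rounding is a simplified mirror of what `downscale_to_png` does in this commit.

```rust
/// Long-edge bounds quoted in the commit message.
const MIN_LONG_EDGE: u32 = 128;
const MAX_LONG_EDGE: u32 = 1536;

/// Clamp a configured `max_pixels` into the legal range.
/// (Hypothetical helper; the commit does this inline in `caption_image`.)
fn effective_cap(configured: u32) -> u32 {
    configured.clamp(MIN_LONG_EDGE, MAX_LONG_EDGE)
}

/// Target dimensions so that the long edge fits within `cap`.
/// Images already within the cap keep their size.
fn scaled_dims(w: u32, h: u32, cap: u32) -> (u32, u32) {
    let long = w.max(h);
    if long <= cap {
        return (w, h);
    }
    let scale = cap as f32 / long as f32;
    // Round to the nearest pixel, never below 1, never above the cap.
    let new_w = ((w as f32 * scale).round().max(1.0) as u32).min(cap);
    let new_h = ((h as f32 * scale).round().max(1.0) as u32).min(cap);
    (new_w, new_h)
}
```

Under these assumptions, a configured value of 99999 resolves to a cap of 1536, and a 4000×3000 input is resized to 1536×1152, while the 100×50 test fixture with a cap of 512 is left at its original size.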
2026-05-02 06:05:39 +00:00
parent e43d3bc697
commit cd2213e48d
15 changed files with 806 additions and 3 deletions
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -3576,6 +3576,8 @@ dependencies = [
 "kamadak-exif",
 "kebab-config",
 "kebab-core",
+ "kebab-llm",
+ "kebab-llm-local",
 "reqwest",
 "serde",
 "serde_json",
--- a/crates/kebab-config/src/lib.rs
+++ b/crates/kebab-config/src/lib.rs
@@ -105,18 +105,20 @@ pub struct RagCfg {
 }

 /// Settings for the image ingest pipeline (P6). `ocr` controls OCR
-/// behaviour; future fields (e.g. `caption`) will join here as P6-3
-/// lands.
+/// behaviour (P6-2); `caption` controls vision-LM captioning (P6-3).
 #[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
 pub struct ImageCfg {
    #[serde(default = "OcrCfg::defaults")]
    pub ocr: OcrCfg,
+    #[serde(default = "CaptionCfg::defaults")]
+    pub caption: CaptionCfg,
 }

 impl ImageCfg {
    pub fn defaults() -> Self {
        Self {
            ocr: OcrCfg::defaults(),
+            caption: CaptionCfg::defaults(),
        }
    }
 }
@@ -162,6 +164,36 @@ impl OcrCfg {
    }
 }

+/// Caption settings (P6-3). Caption uses the same Ollama-vision /
+/// `LanguageModel` pipeline as the rest of the workspace; the trait
+/// abstraction is the part the spec demands. `enabled` defaults to
+/// `false` because captioning costs one model call per asset and the
+/// output is model-generated (low trust).
+#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
+pub struct CaptionCfg {
+    /// Run captioning on every image during ingest. Default `false`.
+    pub enabled: bool,
+    /// Cap the long edge of the image (in pixels) before sending. The
+    /// spec recommends an aggressive 768×768 cap because larger
+    /// vision-LM inputs translate directly into prompt cost. Default
+    /// `768`.
+    pub max_pixels: u32,
+    /// Caption prompt template version pinned into wire output via
+    /// `ModelCaption.model_version`. Bump when the prompt changes so
+    /// downstream eval can detect regressions.
+    pub prompt_template_version: String,
+}
+
+impl CaptionCfg {
+    pub fn defaults() -> Self {
+        Self {
+            enabled: false,
+            max_pixels: 768,
+            prompt_template_version: "caption-v1".to_string(),
+        }
+    }
+}
+
 impl Config {
    /// Defaults per design §6.4.
    pub fn defaults() -> Self {
@@ -417,6 +449,19 @@ impl Config {
                    }
                }

+                // image.caption (P6-3)
+                "KEBAB_IMAGE_CAPTION_ENABLED" => {
+                    self.image.caption.enabled = parse_bool(v);
+                }
+                "KEBAB_IMAGE_CAPTION_MAX_PIXELS" => {
+                    if let Ok(n) = v.parse::<u32>() {
+                        self.image.caption.max_pixels = n;
+                    }
+                }
+                "KEBAB_IMAGE_CAPTION_PROMPT_TEMPLATE_VERSION" => {
+                    self.image.caption.prompt_template_version = v.clone();
+                }
+
                // Unknown KEBAB_* keys are silently ignored — see
                // `env_unknown_key_is_ignored` test.
                _ => {}
@@ -608,6 +653,35 @@ mod tests {
    /// Pre-P6 config files don't have an `[image]` section. The
    /// `#[serde(default)]` attribute on `Config::image` must let those
    /// files load with `ImageCfg::defaults()` instead of erroring.
+    #[test]
+    fn image_caption_defaults_disabled() {
+        let c = Config::defaults();
+        assert!(!c.image.caption.enabled);
+        assert_eq!(c.image.caption.max_pixels, 768);
+        assert_eq!(c.image.caption.prompt_template_version, "caption-v1");
+    }
+
+    #[test]
+    fn image_caption_env_overrides() {
+        let mut env = HashMap::new();
+        env.insert(
+            "KEBAB_IMAGE_CAPTION_ENABLED".to_string(),
+            "true".to_string(),
+        );
+        env.insert(
+            "KEBAB_IMAGE_CAPTION_MAX_PIXELS".to_string(),
+            "1024".to_string(),
+        );
+        env.insert(
+            "KEBAB_IMAGE_CAPTION_PROMPT_TEMPLATE_VERSION".to_string(),
+            "caption-v2".to_string(),
+        );
+        let c = Config::defaults().apply_env(&env);
+        assert!(c.image.caption.enabled);
+        assert_eq!(c.image.caption.max_pixels, 1024);
+        assert_eq!(c.image.caption.prompt_template_version, "caption-v2");
+    }
+
    /// `KEBAB_IMAGE_OCR_ENDPOINT=""` (empty value) should map to `None`
    /// rather than to `Some("")` so the fallback to `models.llm.endpoint`
    /// kicks in. Covers the env-equivalent of a missing TOML key.
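
One detail of the env overrides above is easy to miss: an unparseable `KEBAB_IMAGE_CAPTION_MAX_PIXELS` is ignored rather than rejected, so the previous value survives. The following standalone sketch reproduces that behaviour with a cut-down struct. `apply_caption_env` and `env_of` are hypothetical names, and the `parse_bool` here is a stand-in whose accepted spellings are an assumption (the real helper is not shown in this diff).

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
struct CaptionCfg {
    enabled: bool,
    max_pixels: u32,
    prompt_template_version: String,
}

impl CaptionCfg {
    fn defaults() -> Self {
        Self {
            enabled: false,
            max_pixels: 768,
            prompt_template_version: "caption-v1".to_string(),
        }
    }
}

/// Stand-in for the crate's `parse_bool`; the accepted spellings are assumed.
fn parse_bool(v: &str) -> bool {
    matches!(v.trim().to_ascii_lowercase().as_str(), "1" | "true" | "yes" | "on")
}

/// Apply the three caption env overrides (hypothetical, simplified helper).
fn apply_caption_env(mut cfg: CaptionCfg, env: &HashMap<String, String>) -> CaptionCfg {
    for (k, v) in env {
        match k.as_str() {
            "KEBAB_IMAGE_CAPTION_ENABLED" => cfg.enabled = parse_bool(v),
            "KEBAB_IMAGE_CAPTION_MAX_PIXELS" => {
                // A value that is not a valid u32 leaves the field untouched.
                if let Ok(n) = v.parse::<u32>() {
                    cfg.max_pixels = n;
                }
            }
            "KEBAB_IMAGE_CAPTION_PROMPT_TEMPLATE_VERSION" => {
                cfg.prompt_template_version = v.clone();
            }
            // Unknown keys are ignored, as in the real `apply_env`.
            _ => {}
        }
    }
    cfg
}

/// Build an env map from string pairs (test convenience only).
fn env_of(pairs: &[(&str, &str)]) -> HashMap<String, String> {
    pairs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect()
}
```

Note that the value stored here is not clamped; in this commit the `[128, 1536]` clamp is applied later, inside `caption_image`.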
--- a/crates/kebab-core/src/traits.rs
+++ b/crates/kebab-core/src/traits.rs
@@ -69,6 +69,17 @@ pub struct GenerateRequest {
    pub max_tokens: usize,
    pub temperature: f32,
    pub seed: Option<u64>,
+    /// Vision inputs (base64-encoded, one per image). Empty for the
+    /// text-only path that P4-2 / P4-3 / RAG uses; non-empty when a
+    /// vision-capable adapter (P6-3 caption, future multimodal RAG)
+    /// drives the call. The LM adapter is responsible for routing
+    /// these onto the wire — Ollama uses `images: [base64, ...]`,
+    /// other backends may differ.
+    ///
+    /// Defaulted on deserialization so older `*.json` payloads /
+    /// snapshots that predate the field still parse.
+    #[serde(default)]
+    pub images: Vec<String>,
 }

 #[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
--- a/crates/kebab-llm-local/src/ollama.rs
+++ b/crates/kebab-llm-local/src/ollama.rs
@@ -140,9 +140,15 @@ impl LanguageModel for OllamaLanguageModel {
            format!("{}\n\n{}", req.system, req.user)
        };

+        // Vision inputs (P6-3) flow through the request via Ollama's
+        // `images: [base64, ...]` field. Empty for the text-only RAG
+        // path so older snapshots and JSON dumps stay byte-identical
+        // (`OllamaRequest::images` is `skip_serializing_if` empty, so
+        // the key is omitted from the wire when there are no images).
        let body = OllamaRequest {
            model: &self.model_id,
            prompt,
+            images: &req.images,
            stream: true,
            options: OllamaOptions {
                temperature: effective_temperature,
@@ -188,6 +194,13 @@ impl LanguageModel for OllamaLanguageModel {
 struct OllamaRequest<'a> {
    model: &'a str,
    prompt: String,
+    /// Skipped from the JSON when empty so the text-only path keeps
+    /// the same on-the-wire shape it had pre-P6-3 (`{"model": ...,
+    /// "prompt": ..., "stream": ..., "options": ...}` — no `images`
+    /// key). Vision-capable callers populate this with one or more
+    /// base64-encoded images.
+    #[serde(skip_serializing_if = "<[String]>::is_empty")]
+    images: &'a [String],
    stream: bool,
    options: OllamaOptions<'a>,
 }
--- a/crates/kebab-llm-local/tests/integration.rs
+++ b/crates/kebab-llm-local/tests/integration.rs
@@ -31,6 +31,7 @@ fn real_ollama_streams_non_empty_response() {
        max_tokens: 8,
        temperature: 0.0,
        seed: Some(0),
+        images: Vec::new(),
    };

    let stream = llm.generate_stream(req).expect("stream should start");
--- a/crates/kebab-llm-local/tests/streaming.rs
+++ b/crates/kebab-llm-local/tests/streaming.rs
@@ -35,6 +35,7 @@ fn sample_request() -> GenerateRequest {
        max_tokens: 64,
        temperature: 0.0,
        seed: Some(0),
+        images: Vec::new(),
    }
 }

--- a/crates/kebab-llm/tests/mock.rs
+++ b/crates/kebab-llm/tests/mock.rs
@@ -26,6 +26,7 @@ fn req_with_stop(stop: Vec<&str>) -> GenerateRequest {
        max_tokens: 64,
        temperature: 0.0,
        seed: None,
+        images: Vec::new(),
    }
 }

--- a/crates/kebab-llm/tests/reexports.rs
+++ b/crates/kebab-llm/tests/reexports.rs
@@ -55,6 +55,7 @@ fn dyn_dispatch_via_box_works() {
        max_tokens: 16,
        temperature: 0.0,
        seed: None,
+        images: Vec::new(),
    };
    let stream = m.generate_stream(req).expect("stream");
    let chunks: Vec<TokenChunk> = stream.map(|r| r.expect("ok chunk")).collect();
--- a/crates/kebab-parse-image/Cargo.toml
+++ b/crates/kebab-parse-image/Cargo.toml
@@ -10,6 +10,12 @@ description   = "Image extractor + EXIF + OCR (Ollama-vision) for the kebab pipe
 [dependencies]
 kebab-core   = { path = "../kebab-core" }
 kebab-config = { path = "../kebab-config" }
+# `kebab-llm` re-exports the trait crate (`kebab-core::LanguageModel`)
+# under a stable surface; the caption adapter consumes any
+# `dyn LanguageModel`. We do NOT depend on `kebab-llm-local` (forbidden
+# by p6-3 design §8) — the trait abstraction is exactly what spec
+# requires.
+kebab-llm    = { path = "../kebab-llm" }
 anyhow       = { workspace = true }
 serde        = { workspace = true }
 serde_json   = { workspace = true }
@@ -42,3 +48,10 @@ tokio        = { workspace = true, features = ["rt-multi-thread"] }
 # font rendering.
 ab_glyph     = "0.2"
 base64       = "0.22"
+# `kebab-llm/mock` exposes `MockLanguageModel` for hermetic caption
+# tests. Real adapters (Ollama) live in `kebab-llm-local`, which is
+# only allowed at the dev-dep level here — the runtime crate stays
+# trait-only, so the §8 forbidden-deps rule (no `kebab-llm-local`
+# at runtime) is preserved.
+kebab-llm        = { path = "../kebab-llm", features = ["mock"] }
+kebab-llm-local  = { path = "../kebab-llm-local" }
--- a/crates/kebab-parse-image/src/caption.rs
+++ b/crates/kebab-parse-image/src/caption.rs
@@ -0,0 +1,281 @@
+//! Caption adapter (P6-3).
+//!
+//! [`caption_image`] runs a vision-capable [`LanguageModel`] over an
+//! image and produces a [`ModelCaption`]. [`apply_caption`] is the
+//! helper that mutates an [`ImageRefBlock`] in place and emits a
+//! [`ProvenanceKind::CaptionApplied`] event.
+//!
+//! ## Trust note
+//!
+//! Captions are **model-generated** (`TrustLevel::Generated`), not
+//! observed text. Vision LMs hallucinate; the system prompt forbids
+//! guessing, but false captions should still be expected. Downstream
+//! UI / RAG must label captions as model-generated and surface the
+//! model id (`ModelCaption.model`) plus the provider and prompt template
+//! version (`ModelCaption.model_version`) so regressions are auditable.
+//!
+//! ## Spec deviation (cargo `caption` feature dropped)
+//!
+//! The original P6-3 spec asked for a cargo feature `caption` (default
+//! OFF at compile time). We collapse this into a single runtime gate
+//! (`config.image.caption.enabled = false`, default OFF). Reasoning:
+//! the captioning module's only extra deps are `base64` + `image` +
+//! `kebab-llm` trait — all already pulled in by the rest of the
+//! crate. A cargo feature would only complicate the build matrix
+//! without saving meaningful binary weight. See `tasks/HOTFIXES.md`
+//! (2026-05-02) for the deviation log.
+
+use std::io::Cursor;
+
+use anyhow::{Context, Result};
+use base64::Engine as _;
+use base64::engine::general_purpose::STANDARD as BASE64_STANDARD;
+use image::{ImageFormat, ImageReader};
+use kebab_core::{
+    FinishReason, GenerateRequest, ImageRefBlock, Lang, LanguageModel, ModelCaption,
+    ProvenanceEvent, ProvenanceKind, TokenChunk,
+};
+use time::OffsetDateTime;
+
+/// Long-edge clamp range for caption inputs. Smaller than OCR's
+/// `[256, 4096]` because vision LMs charge proportionally to input
+/// dimension — captions tolerate aggressive downscale better than
+/// OCR.
+const MIN_CAPTION_LONG_EDGE: u32 = 128;
+const MAX_CAPTION_LONG_EDGE: u32 = 1536;
+
+/// Token budget for captions. Captions are one-sentence by spec — 96
+/// tokens covers a 50-word English sentence or a 30-token Korean one
+/// with headroom for the LM's preamble before the stop sequence.
+const CAPTION_MAX_TOKENS: usize = 96;
+
+/// Run a caption pass and return the resulting `ModelCaption`. Honours
+/// `config.image.caption.enabled`: when disabled the function does no
+/// work and returns an `Err`. Callers that want a silent skip should
+/// use `apply_caption` instead, which short-circuits with `Ok(())`.
+///
+/// Direct callers should prefer [`apply_caption`] for end-to-end
+/// pipeline integration; this lower-level entry exists so tests can
+/// pin the produced `ModelCaption` independent of block mutation.
+pub fn caption_image(
+    llm: &dyn LanguageModel,
+    image_bytes: &[u8],
+    lang_hint: Option<&Lang>,
+    cfg: &kebab_config::Config,
+) -> Result<ModelCaption> {
+    if !cfg.image.caption.enabled {
+        anyhow::bail!(
+            "captioning is disabled (set image.caption.enabled = true in config to enable)"
+        );
+    }
+
+    let max_pixels = cfg
+        .image
+        .caption
+        .max_pixels
+        .clamp(MIN_CAPTION_LONG_EDGE, MAX_CAPTION_LONG_EDGE);
+    if max_pixels != cfg.image.caption.max_pixels {
+        tracing::warn!(
+            target: "kebab-parse-image",
+            "image.caption.max_pixels = {} clamped to {} (legal range [{}, {}])",
+            cfg.image.caption.max_pixels,
+            max_pixels,
+            MIN_CAPTION_LONG_EDGE,
+            MAX_CAPTION_LONG_EDGE
+        );
+    }
+
+    let prepared = downscale_to_png(image_bytes, max_pixels)
+        .context("preparing image for caption")?;
+    let b64 = BASE64_STANDARD.encode(&prepared);
+
+    let lang = lang_hint
+        .map(|l| l.0.as_str())
+        .filter(|s| !s.is_empty() && *s != "und");
+    let (system, user) = build_prompt(lang);
+
+    // Determinism — temperature 0.0 + seed 0, same convention as RAG
+    // and OCR. The LM adapter routes the base64 image via its
+    // provider-specific channel (Ollama: `images: [base64]`).
+    let req = GenerateRequest {
+        system,
+        user,
+        stop: vec!["\n\n".to_string()],
+        max_tokens: CAPTION_MAX_TOKENS,
+        temperature: 0.0,
+        seed: Some(0),
+        images: vec![b64],
+    };
+
+    let stream = llm
+        .generate_stream(req)
+        .context("captioning LM call failed")?;
+
+    let mut text = String::new();
+    let mut saw_done = false;
+    for chunk in stream {
+        match chunk? {
+            TokenChunk::Token(t) => {
+                text.push_str(&t);
+            }
+            TokenChunk::Done { finish_reason, .. } => {
+                saw_done = true;
+                if let FinishReason::Error(e) = finish_reason {
+                    anyhow::bail!("captioning LM ended with error: {e}");
+                }
+                break;
+            }
+        }
+    }
+    if !saw_done {
+        anyhow::bail!("captioning LM stream ended without a Done frame");
+    }
+
+    let caption_text = text.trim().to_string();
+
+    let model_ref = llm.model_ref();
+    let prompt_v = &cfg.image.caption.prompt_template_version;
+    let model_version = format!(
+        "{provider}/{prompt}",
+        provider = model_ref.provider,
+        prompt = prompt_v
+    );
+
+    tracing::debug!(
+        target: "kebab-parse-image",
+        "caption ok (model={}, prompt={}, chars={})",
+        model_ref.id,
+        prompt_v,
+        caption_text.chars().count()
+    );
+
+    Ok(ModelCaption {
+        text: caption_text,
+        model: model_ref.id,
+        model_version,
+    })
+}
+
+/// Mutate `block.caption` in place by running `caption_image` over
+/// `image_bytes`. When `config.image.caption.enabled = false` the
+/// function is a clean no-op (returns `Ok(())` without invoking the
+/// LM and without writing a Provenance event).
+///
+/// On LM failure, `block.caption` stays `None` — partial state is
+/// never written. The caller decides whether to skip the asset or
+/// surface the error.
+pub fn apply_caption(
+    llm: &dyn LanguageModel,
+    image_bytes: &[u8],
+    block: &mut ImageRefBlock,
+    lang_hint: Option<&Lang>,
+    cfg: &kebab_config::Config,
+    events: &mut Vec<ProvenanceEvent>,
+) -> Result<()> {
+    if !cfg.image.caption.enabled {
+        tracing::debug!(
+            target: "kebab-parse-image",
+            "captioning skipped — image.caption.enabled = false"
+        );
+        return Ok(());
+    }
+    let caption = caption_image(llm, image_bytes, lang_hint, cfg)?;
+    let model_label = caption.model.clone();
+    let model_version_label = caption.model_version.clone();
+    block.caption = Some(caption);
+    events.push(ProvenanceEvent {
+        at: OffsetDateTime::now_utc(),
+        agent: "kb-parse-image".to_string(),
+        kind: ProvenanceKind::CaptionApplied,
+        note: Some(format!(
+            "model={model_label} model_version={model_version_label}"
+        )),
+    });
+    Ok(())
+}
+
+/// Compose the `(system, user)` prompt pair for the caption call.
+/// Korean / English split keeps the model on the requested output
+/// language; everything else falls through to English.
+fn build_prompt(lang_hint: Option<&str>) -> (String, String) {
+    match lang_hint {
+        Some("ko") | Some("kor") => (
+            "이미지를 한 문장으로 객관적으로 설명한다. 추측은 피하고, \
+             보이는 것만 적는다. 마크다운 / 따옴표 / 부가 설명 없이 \
+             한 문장만 출력."
+                .to_string(),
+            "위 이미지를 한국어로 한 문장으로 설명하라.".to_string(),
+        ),
+        _ => (
+            "Describe the image in one objective sentence. Do not \
+             speculate; describe only what is visible. No markdown, \
+             no quotes, no commentary — output a single sentence."
+                .to_string(),
+            "Describe the image above in one English sentence.".to_string(),
+        ),
+    }
+}
+
+/// Decode `bytes`, downscale long-edge to `max_long_edge`, re-encode as
+/// PNG. Mirrors the OCR pipeline's pattern but with the caption-side
+/// long-edge bounds. PNG sources within the cap pass through without
+/// re-encode.
+fn downscale_to_png(bytes: &[u8], max_long_edge: u32) -> Result<Vec<u8>> {
+    let reader = ImageReader::new(Cursor::new(bytes))
+        .with_guessed_format()
+        .context("reading image header for caption")?;
+    let format = reader.format();
+    let (w, h) = reader
+        .into_dimensions()
+        .context("reading image dimensions for caption")?;
+
+    let long = w.max(h);
+    if long <= max_long_edge && format == Some(ImageFormat::Png) {
+        return Ok(bytes.to_vec());
+    }
+
+    let img = ImageReader::new(Cursor::new(bytes))
+        .with_guessed_format()
+        .context("re-reading image for caption decode")?
+        .decode()
+        .context("decoding image for caption")?;
+    let final_img = if long <= max_long_edge {
+        img
+    } else {
+        let scale = max_long_edge as f32 / long as f32;
+        let mut new_w = ((w as f32) * scale).round().max(1.0) as u32;
+        let mut new_h = ((h as f32) * scale).round().max(1.0) as u32;
+        if w >= h {
+            new_w = new_w.min(max_long_edge);
+        } else {
+            new_h = new_h.min(max_long_edge);
+        }
+        img.resize_exact(new_w, new_h, image::imageops::FilterType::Triangle)
+    };
+
+    let mut out = Cursor::new(Vec::new());
+    final_img
+        .write_to(&mut out, ImageFormat::Png)
+        .context("encoding image as PNG for caption")?;
+    Ok(out.into_inner())
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn build_prompt_korean_for_ko_hint() {
+        let (sys, user) = build_prompt(Some("ko"));
+        assert!(sys.contains("이미지를 한 문장으로"));
+        assert!(user.contains("한국어로"));
+    }
+
+    #[test]
+    fn build_prompt_english_for_no_hint_or_und() {
+        let (sys, _) = build_prompt(None);
+        assert!(sys.contains("Describe the image"));
+        let (sys2, _) = build_prompt(Some("en"));
+        assert!(sys2.contains("Describe the image"));
+    }
+}
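
The stream-draining contract in `caption_image` (accumulate tokens, require a terminating `Done` frame, treat an error finish as failure, trim the result) can be exercised in isolation. The sketch below is a simplified, standalone version: the `TokenChunk` and `FinishReason` enums are cut-down stand-ins for the `kebab_core` types, errors are plain `String`s instead of `anyhow::Error`, and `collect_caption` is a hypothetical name.

```rust
/// Cut-down stand-in for `kebab_core::FinishReason`.
#[derive(Clone, Debug, PartialEq)]
enum FinishReason {
    Stop,
    Error(String),
}

/// Cut-down stand-in for `kebab_core::TokenChunk` (usage stats omitted).
#[derive(Clone, Debug, PartialEq)]
enum TokenChunk {
    Token(String),
    Done { finish_reason: FinishReason },
}

/// Drain a token stream into trimmed caption text.
/// Fails if any chunk is an error, if the model finishes with an error,
/// or if the stream ends before a `Done` frame arrives.
fn collect_caption<I>(stream: I) -> Result<String, String>
where
    I: IntoIterator<Item = Result<TokenChunk, String>>,
{
    let mut text = String::new();
    for chunk in stream {
        match chunk? {
            TokenChunk::Token(t) => text.push_str(&t),
            TokenChunk::Done { finish_reason } => {
                if let FinishReason::Error(e) = finish_reason {
                    return Err(format!("captioning LM ended with error: {e}"));
                }
                // A clean Done frame ends the caption; later chunks are ignored.
                return Ok(text.trim().to_string());
            }
        }
    }
    Err("captioning LM stream ended without a Done frame".to_string())
}
```

With this shape, a stream containing only a `Done { Stop }` frame yields an empty caption rather than an error, which matches the "empty token stream → empty text" case listed in the commit message.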
--- a/crates/kebab-parse-image/src/lib.rs
+++ b/crates/kebab-parse-image/src/lib.rs
@@ -13,14 +13,24 @@
 //! consumers can branch trust by engine (Tesseract / Apple Vision
 //! adapters, when added, will write a different `engine` string).
 //!
+//! P6-3 adds the [`caption`] module: [`caption_image`] /
+//! [`apply_caption`] route an image through any vision-capable
+//! [`kebab_core::LanguageModel`] (text-only LMs are not vision-aware
+//! and will surface a model-side error). Captions are explicitly
+//! marked **model-generated** — the trust gap between OCR (observed,
+//! engine-tagged) and caption (generated, prompt-tagged) is the
+//! workspace's central trust contract.
+//!
 //! Per design §3.4 (Block::ImageRef + ImageRefBlock), §3.7a (OcrText /
 //! ModelCaption stubs), §9.1 (image extraction policy / OCR vs caption
 //! provenance), §9 (versioning).

 mod dims;
 mod exif_extract;
+pub mod caption;
 pub mod ocr;

+pub use caption::{apply_caption, caption_image};
 pub use ocr::{OcrEngine, OllamaVisionOcr, apply_ocr};

 use anyhow::{Context, Result};
--- a/crates/kebab-parse-image/tests/caption.rs
+++ b/crates/kebab-parse-image/tests/caption.rs
@@ -0,0 +1,368 @@
+//! Integration tests for the caption adapter (P6-3).
+//!
+//! Hermetic tests use `MockLanguageModel` from `kebab-llm/mock` for
+//! canned responses, plus a local `CapturingMock` that records
+//! `req.system` / `req.images`. A single opt-in test (`#[ignore]`)
+//! wires the real `kebab-llm-local::OllamaLanguageModel` against the
+//! workspace's Ollama daemon to verify the `images: [base64]` round-trip.
+
+mod common;
+
+use std::sync::{Arc, Mutex};
+
+use kebab_config::Config;
+use kebab_core::{
+    AssetId, BlockId, CommonBlock, FinishReason, GenerateRequest, ImageRefBlock, Lang,
+    LanguageModel, ModelRef, ProvenanceEvent, ProvenanceKind, SourceSpan, TokenChunk,
+    TokenUsage,
+};
+use kebab_llm::MockLanguageModel;
+use kebab_parse_image::{apply_caption, caption_image};
+
+use crate::common::red_100x50_png;
+
+fn cfg_with_caption_enabled() -> Config {
+    let mut cfg = Config::defaults();
+    cfg.image.caption.enabled = true;
+    cfg.image.caption.max_pixels = 512;
+    cfg
+}
+
+fn empty_image_block() -> ImageRefBlock {
+    ImageRefBlock {
+        common: CommonBlock {
+            block_id: BlockId("0".repeat(32)),
+            heading_path: Vec::new(),
+            source_span: SourceSpan::Region {
+                x: 0,
+                y: 0,
+                w: 100,
+                h: 50,
+            },
+        },
+        asset_id: Some(AssetId("a".repeat(32))),
+        src: "img/x.png".to_string(),
+        alt: "x.png".to_string(),
+        ocr: None,
+        caption: None,
+    }
+}
+
+fn mk_mock(canned: &str) -> MockLanguageModel {
+    MockLanguageModel {
+        model_id: "vision-mock:1b".to_string(),
+        provider: "mock".to_string(),
+        context_tokens: 4096,
+        canned_response: canned.to_string(),
+        canned_finish: FinishReason::Stop,
+        canned_usage: TokenUsage {
+            prompt_tokens: 0,
+            completion_tokens: 0,
+            latency_ms: 0,
+        },
+    }
+}
+
+// ── Disabled feature gate ─────────────────────────────────────────────────
+
+#[test]
+fn apply_caption_no_op_when_feature_disabled() {
+    let mut cfg = Config::defaults();
+    cfg.image.caption.enabled = false;
+    let mock = mk_mock("ignored");
+    let mut block = empty_image_block();
+    let mut events: Vec<ProvenanceEvent> = Vec::new();
+    let bytes = red_100x50_png();
+    apply_caption(&mock, &bytes, &mut block, None, &cfg, &mut events)
+        .expect("disabled apply_caption must return Ok(())");
+    assert!(
+        block.caption.is_none(),
+        "disabled apply_caption must not write caption"
+    );
+    assert!(
+        events.is_empty(),
+        "disabled apply_caption must not append a Provenance event"
+    );
+}
+
+#[test]
+fn caption_image_errors_when_feature_disabled() {
+    let cfg = Config::defaults(); // enabled = false
+    let mock = mk_mock("ignored");
+    let bytes = red_100x50_png();
+    let r = caption_image(&mock, &bytes, None, &cfg);
+    assert!(
+        r.is_err(),
+        "caption_image must Err when image.caption.enabled = false"
+    );
+    let msg = format!("{:#}", r.unwrap_err());
+    assert!(
+        msg.contains("disabled"),
+        "error must mention disabled state: {msg}"
+    );
+}
+
+// ── Happy path ────────────────────────────────────────────────────────────
+
+#[test]
+fn apply_caption_sets_block_caption_and_appends_provenance() {
+    let cfg = cfg_with_caption_enabled();
+    let mock = mk_mock("사진 한 장");
+    let mut block = empty_image_block();
+    let mut events: Vec<ProvenanceEvent> = Vec::new();
+    let bytes = red_100x50_png();
+    apply_caption(
+        &mock,
+        &bytes,
+        &mut block,
+        Some(&Lang("ko".to_string())),
+        &cfg,
+        &mut events,
+    )
+    .expect("apply_caption must succeed");
+
+    let cap = block.caption.as_ref().expect("caption Some");
+    assert_eq!(cap.text, "사진 한 장");
+    assert_eq!(cap.model, "vision-mock:1b");
+    assert_eq!(cap.model_version, "mock/caption-v1");
+
+    assert_eq!(events.len(), 1);
+    assert_eq!(events[0].kind, ProvenanceKind::CaptionApplied);
+    assert_eq!(events[0].agent, "kb-parse-image");
+    let note = events[0].note.as_deref().unwrap_or("");
+    assert!(note.contains("vision-mock:1b") && note.contains("caption-v1"), "{note}");
+}
+
+// ── Empty token stream → empty caption text ──────────────────────────────
+
+#[test]
+fn caption_image_empty_stream_yields_empty_text() {
+    let cfg = cfg_with_caption_enabled();
+    let mock = mk_mock("");
+    let bytes = red_100x50_png();
+    let cap = caption_image(&mock, &bytes, None, &cfg).expect("empty stream must succeed");
+    assert_eq!(cap.text, "");
+    // Spec contract: caller can distinguish "captioning attempted, no
+    // result" from "captioning never attempted" by `caption.is_some()`.
+    // The text being empty does not erase the attempt.
+    assert!(!cap.model.is_empty());
+}
+
+// ── Korean vs English prompt selection ───────────────────────────────────
+
+/// `LanguageModel` impl that captures the `system` prompt sent to it
+/// so tests can verify the language branch picked by `build_prompt`
+/// (the function is private; this is the cleanest observable signal).
+struct CapturingMock {
+    captured_system: Arc<Mutex<Option<String>>>,
+    captured_images: Arc<Mutex<Vec<String>>>,
+}
+
+impl LanguageModel for CapturingMock {
+    fn model_ref(&self) -> ModelRef {
+        ModelRef {
+            id: "capture:1".to_string(),
+            provider: "mock".to_string(),
+            dimensions: None,
+        }
+    }
+    fn context_tokens(&self) -> usize {
+        4096
+    }
+    fn generate_stream(
+        &self,
+        req: GenerateRequest,
+    ) -> anyhow::Result<Box<dyn Iterator<Item = anyhow::Result<TokenChunk>> + Send>> {
+        *self.captured_system.lock().unwrap() = Some(req.system);
+        *self.captured_images.lock().unwrap() = req.images;
+        let chunks: Vec<TokenChunk> = vec![
+            TokenChunk::Token("ok".to_string()),
+            TokenChunk::Done {
+                finish_reason: FinishReason::Stop,
+                usage: TokenUsage {
+                    prompt_tokens: 0,
+                    completion_tokens: 0,
+                    latency_ms: 0,
+                },
+            },
+        ];
+        Ok(Box::new(chunks.into_iter().map(Ok)))
+    }
+}
+
+#[test]
+fn caption_image_routes_image_into_request_images_field() {
+    let cfg = cfg_with_caption_enabled();
+    let captured_system: Arc<Mutex<Option<String>>> = Arc::new(Mutex::new(None));
+    let captured_images: Arc<Mutex<Vec<String>>> = Arc::new(Mutex::new(Vec::new()));
+    let mock = CapturingMock {
+        captured_system: captured_system.clone(),
+        captured_images: captured_images.clone(),
+    };
+    let bytes = red_100x50_png();
+    let _ = caption_image(&mock, &bytes, Some(&Lang("ko".to_string())), &cfg)
+        .expect("caption succeeds");
+
+    let imgs = captured_images.lock().unwrap();
+    assert_eq!(imgs.len(), 1, "exactly one base64 image routed");
+    use base64::Engine as _;
+    let decoded = base64::engine::general_purpose::STANDARD
+        .decode(&imgs[0])
+        .expect("base64 decodes");
+    assert!(
+        !decoded.is_empty(),
+        "decoded image bytes must be non-empty"
+    );
+
+    let sys = captured_system.lock().unwrap().clone().unwrap();
+    assert!(
+        sys.contains("이미지를 한 문장으로"),
+        "Korean hint must produce Korean system prompt: {sys}"
+    );
+}
+
+#[test]
+fn caption_image_uses_english_prompt_for_undetermined_lang() {
+    let cfg = cfg_with_caption_enabled();
+    let captured_system: Arc<Mutex<Option<String>>> = Arc::new(Mutex::new(None));
+    let mock = CapturingMock {
+        captured_system: captured_system.clone(),
+        captured_images: Arc::new(Mutex::new(Vec::new())),
+    };
+    let bytes = red_100x50_png();
+    let _ = caption_image(&mock, &bytes, Some(&Lang("und".to_string())), &cfg)
+        .expect("caption succeeds");
+    let sys = captured_system.lock().unwrap().clone().unwrap();
+    assert!(sys.contains("Describe the image"), "{sys}");
+}
+
+// ── LM error propagates ──────────────────────────────────────────────────
+
+/// LM that returns Err immediately from `generate_stream` (before any
+/// token).
+struct FailingLm;
+impl LanguageModel for FailingLm {
+    fn model_ref(&self) -> ModelRef {
+        ModelRef {
+            id: "fail".into(),
+            provider: "mock".into(),
+            dimensions: None,
+        }
+    }
+    fn context_tokens(&self) -> usize {
+        0
+    }
+    fn generate_stream(
+        &self,
+        _req: GenerateRequest,
+    ) -> anyhow::Result<Box<dyn Iterator<Item = anyhow::Result<TokenChunk>> + Send>> {
+        Err(anyhow::anyhow!("simulated LM connection refused"))
+    }
+}
+
+#[test]
+fn apply_caption_lm_error_leaves_block_untouched() {
+    let cfg = cfg_with_caption_enabled();
+    let mut block = empty_image_block();
+    let mut events: Vec<ProvenanceEvent> = Vec::new();
+    let bytes = red_100x50_png();
+    let r = apply_caption(&FailingLm, &bytes, &mut block, None, &cfg, &mut events);
+    assert!(r.is_err());
+    assert!(
+        block.caption.is_none(),
+        "caption stays None when LM fails — partial state must not leak"
+    );
+    assert!(events.is_empty(), "no provenance event when LM fails");
+}
+
+// ── Determinism — identical mock input → identical caption ───────────────
+
+#[test]
+fn caption_image_deterministic_with_identical_inputs() {
+    let cfg = cfg_with_caption_enabled();
+    let bytes = red_100x50_png();
+    let mock1 = mk_mock("a deterministic caption");
+    let mock2 = mk_mock("a deterministic caption");
+    let cap1 = caption_image(&mock1, &bytes, None, &cfg).unwrap();
+    let cap2 = caption_image(&mock2, &bytes, None, &cfg).unwrap();
+    assert_eq!(cap1, cap2);
+}
+
+// ── max_pixels clamp ─────────────────────────────────────────────────────
+
+/// Out-of-range `max_pixels` is silently clamped at construction so a
+/// bad config can't kill ingest. Decoding the captured `images` entry
+/// and checking its long edge confirms the clamp engaged.
+#[test]
+fn caption_image_clamps_oversized_max_pixels() {
+    let mut cfg = Config::defaults();
+    cfg.image.caption.enabled = true;
+    cfg.image.caption.max_pixels = 99_999; // way over MAX_CAPTION_LONG_EDGE
+    let captured_images: Arc<Mutex<Vec<String>>> = Arc::new(Mutex::new(Vec::new()));
+    let mock = CapturingMock {
+        captured_system: Arc::new(Mutex::new(None)),
+        captured_images: captured_images.clone(),
+    };
+    // 4000×3000 PNG well above the 1536 cap.
+    let bytes = common::large_blue_4000x3000_png();
+    let _ = caption_image(&mock, &bytes, None, &cfg).expect("caption succeeds");
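+    let imgs = captured_images.lock().unwrap();
+    // Guard the index below so a routing failure reports clearly.
+    assert_eq!(imgs.len(), 1, "exactly one base64 image routed");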
+    use base64::Engine as _;
+    let decoded = base64::engine::general_purpose::STANDARD
+        .decode(&imgs[0])
+        .unwrap();
+    let reader = image::ImageReader::new(std::io::Cursor::new(decoded))
+        .with_guessed_format()
+        .unwrap();
+    let (w, h) = reader.into_dimensions().unwrap();
+    let long = w.max(h);
+    assert!(
+        long <= 1536,
+        "max_pixels must clamp to MAX_CAPTION_LONG_EDGE=1536, got {long}"
+    );
+}
+
+// ── Real Ollama integration (opt-in) ─────────────────────────────────────
+
+/// End-to-end captioning against the workspace's real Ollama daemon
+/// via `kebab-llm-local::OllamaLanguageModel` (dev-dep). Skipped by
+/// default via `#[ignore]`; opt in with `--ignored`.
+///
+/// Run with:
+///
+/// ```sh
+/// KEBAB_MODELS_LLM_ENDPOINT=http://192.168.0.47:11434 \
+/// KEBAB_MODELS_LLM_MODEL=gemma4:e4b \
+/// cargo test -p kebab-parse-image --test caption \
+///   caption_integration -- --ignored --nocapture
+/// ```
+#[test]
+#[ignore = "hits a real Ollama daemon; opt in via `cargo test -- --ignored`"]
+fn caption_integration_real_ollama_describes_image() {
+    use kebab_llm_local::OllamaLanguageModel;
+
+    let mut cfg = Config::defaults();
+    cfg.image.caption.enabled = true;
+    cfg.image.caption.max_pixels = 768;
+    // Env vars override the defaults so the test can target any daemon/model.
+    cfg.models.llm.endpoint = std::env::var("KEBAB_MODELS_LLM_ENDPOINT")
+        .unwrap_or_else(|_| "http://192.168.0.47:11434".to_string());
+    cfg.models.llm.model = std::env::var("KEBAB_MODELS_LLM_MODEL")
+        .unwrap_or_else(|_| "gemma4:e4b".to_string());
+    cfg.models.llm.provider = "ollama".to_string();
+
+    let llm = OllamaLanguageModel::new(&cfg).expect("OllamaLanguageModel::new");
+    let bytes = red_100x50_png();
+    let cap = caption_image(&llm, &bytes, Some(&Lang("en".to_string())), &cfg)
+        .expect("real-Ollama caption_image must succeed");
+    eprintln!("integration caption: {}", cap.text);
+    assert!(!cap.text.is_empty(), "caption must be non-empty");
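+    // Compare against the configured model so a KEBAB_MODELS_LLM_MODEL
+    // override does not fail the test.
+    assert_eq!(cap.model, cfg.models.llm.model);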
+    assert!(cap.model_version.contains("ollama"));
+    assert!(cap.model_version.contains("caption-v1"));
+}
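
Review note: the clamp test above relies on the long-edge bound `[128, 1536]` described in the commit message. A minimal sketch of that arithmetic follows, assuming the adapter clamps `max_pixels` first and never upscales. `target_dims` and `MIN_CAPTION_LONG_EDGE` are hypothetical names introduced here for illustration; only `MAX_CAPTION_LONG_EDGE` appears in the patch, and the real implementation may round differently.

```rust
// Hypothetical bounds mirroring the [128, 1536] range in the commit message.
const MIN_CAPTION_LONG_EDGE: u32 = 128;
const MAX_CAPTION_LONG_EDGE: u32 = 1536;

/// Returns the dimensions an image would be scaled to.
/// Assumption: images already within the cap are passed through unchanged.
fn target_dims(width: u32, height: u32, max_pixels: u32) -> (u32, u32) {
    // An out-of-range config value is pulled back into the allowed range.
    let cap = max_pixels.clamp(MIN_CAPTION_LONG_EDGE, MAX_CAPTION_LONG_EDGE);
    let long = width.max(height);
    if long <= cap {
        return (width, height);
    }
    // Scale both edges by cap/long in u64 to avoid overflow; keep at least 1 px.
    let scale = |edge: u32| -> u32 {
        ((u64::from(edge) * u64::from(cap)) / u64::from(long)).max(1) as u32
    };
    (scale(width), scale(height))
}

fn main() {
    // Same inputs as the clamp test: 4000x3000 with max_pixels = 99_999.
    let (w, h) = target_dims(4000, 3000, 99_999);
    assert_eq!((w, h), (1536, 1152));
    // The 100x50 fixture is already under the cap and stays as-is.
    assert_eq!(target_dims(100, 50, 768), (100, 50));
    println!("{w}x{h}");
}
```

Under these assumptions the 4000×3000 fixture lands on a long edge of exactly 1536, which is consistent with the test's `long <= 1536` assertion.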
--- a/crates/kebab-rag/src/pipeline.rs
+++ b/crates/kebab-rag/src/pipeline.rs
@@ -195,6 +195,9 @@ impl RagPipeline {
            max_tokens: max_completion,
            temperature,
            seed,
+            // RAG is text-only — vision inputs only flow when a
+            // future multimodal pipeline injects images here.
+            images: Vec::new(),
        };

        let mut acc = String::new();
--- a/tasks/HOTFIXES.md
+++ b/tasks/HOTFIXES.md
@@ -14,6 +14,30 @@ historical contract that was implemented; this file accumulates the
 deltas so phase 5+ readers can find the live behavior without diffing
 git history.

+## 2026-05-02 — P6-3 caption: GenerateRequest.images + cargo feature dropped
+
+**Discovered**: P6-3 implementation start.
+
+**Symptom 1**: `tasks/p6/p6-3-caption-adapter.md` § Public surface declares `caption_image(llm: &dyn kebab_core::LanguageModel, ...)`, but the frozen `LanguageModel` trait + `GenerateRequest` from p4-1 carry no vision input. The spec's behavior contract ("the adapter is responsible for rendering the prompt to wire") implicitly relied on a trait extension that p4-1 never specced.
+
+**Symptom 2**: Spec § Definition of Done asks for `cargo check -p kebab-parse-image --features caption`, i.e. a cargo feature gate. The captioning module's only deps are `base64` + `image` + the `kebab-llm` trait crate; `base64` was already pulled in by P6-2, and `kebab-llm` contributes only the trait. A cargo feature would only complicate the build matrix without saving meaningful binary weight.
+
+**Root cause**: Two small spec gaps that resolve cleanly together — extend the `LanguageModel` trait once for vision routing, and collapse compile-time + runtime gating into a single runtime gate.
+
+**Fix** (PR #34, feat/p6-3-caption-adapter):
+- `kebab-core::GenerateRequest` gains an `images: Vec<String>` field (`#[serde(default)]` for backward compat with pre-P6 wire payloads / snapshots). Empty for the text-only RAG path; populated with one or more base64 strings by vision-aware callers.
+- `kebab-llm-local::OllamaLanguageModel` routes `req.images` onto the wire as `images: [base64, ...]` (Ollama's vision channel). The wire shape stays byte-identical for empty `images` because the field uses `#[serde(skip_serializing_if = "<[String]>::is_empty")]`.
+- `kebab-parse-image::caption` module: `caption_image` / `apply_caption` build `GenerateRequest { images: vec![b64], temperature: 0.0, seed: 0, ... }` and accept any `&dyn LanguageModel`. Korean / English prompt branch picked from `lang_hint`.
+- Cargo feature `caption` is **not** introduced; the runtime gate `config.image.caption.enabled` (default `false`) suffices.
+- All existing `GenerateRequest { ... }` literals (kebab-rag, kebab-llm tests, kebab-llm-local tests) gained `images: Vec::new()` to satisfy the new field.
+
+**Trust note**: Captions stay explicitly model-generated. `ModelCaption.model_version` carries `"<provider>/<prompt_template_version>"` (e.g. `"ollama/caption-v1"`) so a regression in either prompt or model is auditable from the wire.
+
+**Amends**:
+- tasks/p4/p4-1-llm-trait.md (`GenerateRequest` schema gained `images: Vec<String>`).
+- tasks/p4/p4-2-ollama-adapter.md (request body now optionally includes `images: [...]`).
+- tasks/p6/p6-3-caption-adapter.md ("Definition of Done" cargo feature `caption` dropped; runtime gate is the only feature gate).
+
 ## 2026-05-02 — P6-2 default OCR engine: Tesseract → Ollama-vision

 **Discovered**: P6-2 implementation start.
--- a/tasks/p6/p6-3-caption-adapter.md
+++ b/tasks/p6/p6-3-caption-adapter.md
@@ -3,7 +3,7 @@ phase: P6
 component: kebab-parse-image (caption adapter)
 task_id: p6-3
 title: "ModelCaption adapter (LanguageModel-driven, feature-gated)"
-status: planned
+status: completed
 depends_on: [p6-1, p4-2]
 unblocks: []
 contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md