kebab/tasks/p6/p6-4-image-ingest-wiring.md
altair823 ca0567c72b feat(kebab-app): P6-4 image ingest wiring: kebab ingest now also handles PNG/JPEG assets
The P6-1/P6-2/P6-3 libraries (`ImageExtractor`, `OllamaVisionOcr`,
`apply_caption`) were not reachable from the CLI until now; this commit
completes that missing stretch. `kebab ingest` now indexes image assets
end-to-end in addition to markdown, and `kebab search` / `kebab ask`
match and cite images via OCR text + caption.

## kebab-app

- Add `kebab-parse-image` to `[dependencies]`.
- On entry to `ingest_with_config`, build `OllamaVisionOcr` /
  `OllamaLanguageModel` **once per ingest session** according to the
  `image.ocr.enabled` / `image.caption.enabled` flags, and share them
  across the asset loop as trait objects. Thanks to the internal Arc in
  reqwest::blocking::Client, the allocation cost is independent of the
  number of assets.
- Bundle the two adapters + ImageExtractor into an `ImagePipeline`
  struct to keep the `ingest_one_asset` parameter list from exploding
  (addresses clippy::too_many_arguments).
- Replace the markdown-only guard in `ingest_one_asset` with
  `match media_type`: Markdown takes the existing path, Image(_)
  branches to the new `ingest_one_image_asset`, and PDF/Audio/Other are
  skipped as before.
- New `ingest_one_image_asset`:
  - read bytes → `ImageExtractor::extract` (on failure the caller does
    errors+=1)
  - `apply_ocr` (Lenient: on failure, a ProvenanceKind::Warning event +
    "ocr_failed: ..." in `IngestItem.warnings`; `block.ocr` stays None)
  - `apply_caption` (same Lenient policy)
  - call the existing `MdHeadingV1Chunker`; the chunker already emits
    `Block::ImageRef` as a single chunk
  - existing persist + embed sequence unchanged (byte-identical to
    markdown)
- `lang_hint_from_doc`: maps `Lang("und")` or an empty string to None
  (done up front on the caller side so that the image-pipeline adapter's
  build_prompt does not silently drop "und").

## kebab-chunk

- Replace the `Block::ImageRef` branch of `render_block_text` with the
  P6-4 (β) plain concat policy: join `[alt, ocr.joined, caption.text]`
  with `\n\n`, dropping empty parts. If alt is empty, fall back to the
  basename of `src` (defensive guard for the P6-1 contract).
- New unit test `image_ref_p6_4_plain_concat_drops_empty_parts`:
  verifies all five cases (alt-only / alt+ocr / alt+caption /
  alt+ocr+caption / empty alt → src fallback).
- The existing `image_ref_emits_own_chunk_zero_tokens` passes
  unchanged: the chunker's per-block dispatch is untouched, only text
  rendering was updated.

## Integration tests (kebab-app/tests/image_pipeline.rs)

Ollama is stubbed with wiremock. 5 cases:

1. OCR-only happy path: 1 PNG + ocr.enabled → 1 doc + 1 chunk emitted,
   `block.ocr.joined` is the mock's "Hello World 2026".
2. OCR + caption both enabled: both fields are populated and the chunk
   text contains all three parts (alt + ocr + caption).
3. Lenient failure check: on OCR 503 the asset is still indexed
   (kind=New), `errors=0`, ProvenanceKind::Warning attributed to
   "kb-app", and an "ocr_failed:" note in `IngestItem.warnings`.
4. Both disabled: even with
   `image.ocr.enabled=false && image.caption.enabled=false` the asset is
   indexed as 1 chunk (chunk text=filename), with EXIF + dimensions
   still populated.
5. Determinism (re-ingest): ingesting the same PNG twice yields
   `Updated` + the same `doc_id` on the second run.

## SMOKE.md

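Added a `kebab search --mode lexical "Hello World"` step to the command
sequence. Added example `[image.ocr]` / `[image.caption]` config
sections + an ingest time estimate (~5-10 s per asset). The "books go
through the P7 PDF line" guidance is pinned in both the verification
checklist and "Known behaviour".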

## Real-Ollama integration verification

Against 192.168.0.47 + gemma4:e4b:

```
$ kebab --config /tmp/kebab-smoke/config.toml ingest
scanned 2  new 2  updated 0  skipped 0  errors 0  (18395 ms)

$ kebab inspect doc <image_doc_id>
parser_version: image-meta-v1
blocks: [{
  alt: "hello.png",
  ocr: "Hello World 2026",
  caption: "The image displays the text \"Hello World 2026\" in a large, black, sans-serif font."
}]

$ kebab --json ask "Hello World 텍스트가 어디에 있나?" --mode hybrid
grounded: true
citations: [{marker: "[1]", doc_path: "hello.png"}]
```

## Verification

- `cargo test --workspace --no-fail-fast -j 1`: all pass
- `cargo clippy --workspace --all-targets -- -D warnings`: pass
- `cargo test -p kebab-chunk image_ref`: 2 pass (P1-5 regression + P6-4
  new unit)
- `cargo test -p kebab-app --test image_pipeline`: 5 pass

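## Dependency boundaries

- `kebab-app` adds `kebab-parse-image`, as permitted by the spec's
  Allowed dependencies.
- No new forbidden-dependency violations (the existing `kebab-tui` /
  `kebab-desktop` / `kebab-eval` remain unreferenced).
- This task introduces 0 lines of image-specific business logic; all of
  it is delegated to `kebab-parse-image`.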

`tasks/p6/p6-4-image-ingest-wiring.md` status: planned → completed.

contract: docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
sections: §3.4 ImageRefBlock, §6.1 ingest pipeline, §7.2
Extractor/Chunker traits, §9.1 image extraction policy.
2026-05-02 07:37:56 +00:00

---
phase: P6
component: kebab-app (image ingest dispatch + chunking)
task_id: p6-4
title: "Wire ImageExtractor + OCR + caption into kebab-app::ingest end-to-end"
status: completed
depends_on: [p6-1, p6-2, p6-3, p1-6, p3-5]
unblocks: []
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
contract_sections: [§3.4 ImageRefBlock, §6.1 ingest pipeline, §7.2 Extractor/Chunker traits, §9.1 image extraction policy]
---
# p6-4 — Image ingest wiring (kebab-app + chunker)
## Goal
Make `kebab ingest` end-to-end functional for image assets. The library-level pipeline (`ImageExtractor`, `OllamaVisionOcr`, `apply_caption`) is complete and tested in isolation — this task connects the wires from `kebab-source-fs` (which already classifies `.png` / `.jpg` / `.webp` / `.gif` / `.tiff` as `MediaType::Image(_)`) through `kebab-app::ingest_with_config` to `kebab-store-sqlite` + `kebab-store-vector`, so a user running `kebab ingest` against a workspace containing diagrams / screenshots / camera photos sees those assets indexed and searchable.
## Why now / why this size
P6-1 / P6-2 / P6-3 each shipped a focused library and a passing test suite, but all three deliver value only after `kebab-app::ingest` learns how to dispatch on `MediaType::Image(_)`. The wiring is small (one new dispatch arm, one chunking branch, one LM-construction call site) but materially user-facing — without it, the entire P6 phase is invisible from the CLI. Pulling this into its own task keeps the P6-1/2/3 specs frozen as written while letting the integration evolve under its own contract.
## Allowed dependencies
The surface is `kebab-app`'s current Cargo.toml as-is; this task adds a single line, `kebab-parse-image`, on top of it.
- `kebab-core`
- `kebab-config`
- `kebab-source-fs`
- `kebab-parse-md`, `kebab-parse-types`
- `kebab-normalize`
- `kebab-chunk` (image-document branch in `md-heading-v1` — this task extends it)
- `kebab-store-sqlite`, `kebab-store-vector`
- `kebab-search`
- `kebab-embed`, `kebab-embed-local`
- `kebab-llm`, `kebab-llm-local` (constructs `OllamaLanguageModel` for caption)
- `kebab-rag`
- **`kebab-parse-image` (NEW — added by this task)**
- `anyhow`, `serde_json`, `tracing`
## Forbidden dependencies
- `kebab-tui`, `kebab-desktop` (P9 not started; a UI crate calling ingest would be a layering violation)
- `kebab-eval` (cycle risk: eval calls ingest, so the reverse direction is forbidden)
- Any image extractor / OCR / caption business logic newly written by this task: all of it already exists in `kebab-parse-image`. Adding image-specific logic inside `kebab-app` is forbidden (only thin dispatch + glue is allowed).
## Inputs
| input | type | source |
|-------|------|--------|
| workspace assets | `RawAsset` stream | `kebab-source-fs::SourceConnector::scan` (already classifies image types) |
| image bytes | `&[u8]` | filesystem read in `kebab-app` |
| `Config` | `kebab_config::Config` | CLI `--config` flag → `Config::load` |
| OCR config | `config.image.ocr` | P6-2 |
| Caption config | `config.image.caption` | P6-3 |
## Outputs
| output | type | downstream |
|--------|------|------------|
| `CanonicalDocument` per image | written via `DocumentStore::put_document` | `kebab-store-sqlite` |
| One synthesized chunk per image | `Vec<Chunk>` (length 1) | `kebab-store-sqlite::put_chunks` + `kebab-store-vector::upsert` |
| Updated `IngestReport` counters | `scanned / new / updated / skipped / errors` | wire output (`ingest_report.v1`) |
## Public surface
No new public types. The wiring exists inside `kebab-app::ingest_with_config` (and its private helpers). `kebab-chunk` gains one image-only branch in `md_heading_v1`:
```rust
// crates/kebab-chunk/src/md_heading_v1.rs (additions only — sketch)
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
    if is_image_only_document(doc) {
        return Ok(vec![image_chunk(doc, policy)?]);
    }
    // ... existing markdown heading logic untouched ...
}

/// Returns true iff `doc` is an image-only document (single `ImageRef`
/// block). P6-1's `ImageExtractor` already guarantees this shape today;
/// the predicate exists as a defensive guard against (a) a future
/// task that introduces multi-block image documents, and (b) accidental
/// route-through of a `[Block::Heading, Block::ImageRef, ...]` shape
/// that would look image-ish but should still flow through the
/// markdown chunker.
fn is_image_only_document(doc: &CanonicalDocument) -> bool {
    doc.blocks.len() == 1
        && matches!(doc.blocks.first(), Some(Block::ImageRef(_)))
}

fn image_chunk(doc: &CanonicalDocument, policy: &ChunkPolicy) -> Result<Chunk> {
    let block = match &doc.blocks[0] {
        Block::ImageRef(b) => b,
        _ => unreachable!("guarded by is_image_only_document"),
    };
    let text = compose_image_chunk_text(block);
    // chunk_id derived from doc_id + chunker_version + [block_id] + policy_hash
    // (existing §4.2 recipe, no new ID kind).
    Ok(Chunk { /* ... */ })
}
```
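The shape check can be tried in isolation. The sketch below is a minimal stand-alone version under stated assumptions: `Block` is a two-variant stand-in for the real `kebab-core` enum, and `is_image_only` takes a slice instead of a `CanonicalDocument`. A one-element slice pattern expresses the same "exactly one block, and it is an `ImageRef`" condition as the `len() == 1 && matches!(...)` form.

```rust
// Stand-in for the real block enum (assumption): only two variants, with a
// String payload in place of the real block structs.
enum Block {
    Heading(String),
    ImageRef(String),
}

/// True only for a slice holding exactly one `ImageRef` block.
fn is_image_only(blocks: &[Block]) -> bool {
    // A one-element slice pattern rejects empty and multi-block inputs.
    matches!(blocks, [Block::ImageRef(_)])
}
```

A `[Heading, ImageRef]` document (an image embedded in markdown) fails the pattern and so stays on the markdown path.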
`compose_image_chunk_text` is the canonical place where the (β) plain-concatenation policy lives:
```rust
fn compose_image_chunk_text(block: &ImageRefBlock) -> String {
    let alt = if block.alt.is_empty() {
        // alt should never be empty at this stage because P6-1 falls
        // back to the filename when it is, but the chunker stays
        // defensive: an empty first line would degrade lexical
        // search hits on filenames.
        block.src.rsplit('/').next().unwrap_or("[image]").to_string()
    } else {
        block.alt.clone()
    };
    let ocr = block.ocr.as_ref().map(|o| o.joined.as_str()).unwrap_or("");
    let cap = block.caption.as_ref().map(|c| c.text.as_str()).unwrap_or("");
    [alt, ocr.to_string(), cap.to_string()]
        .into_iter()
        .filter(|s| !s.is_empty())
        .collect::<Vec<_>>()
        .join("\n\n")
}
```
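To see what the policy produces for each combination of fields, the function can be run against stand-in types. In the sketch below, `ImageRefBlock`, `OcrText`, and `Caption` are pared-down assumptions that carry only the fields the function reads (the real `kebab-core` types are not reproduced here), and `block` is a hypothetical fixture helper. This variant borrows `&str` rather than cloning, which should not change the output.

```rust
// Stand-in types (assumption): only the fields the function reads.
struct OcrText { joined: String }
struct Caption { text: String }
struct ImageRefBlock {
    src: String,
    alt: String,
    ocr: Option<OcrText>,
    caption: Option<Caption>,
}

/// Joins alt, OCR text, and caption with blank lines, dropping empty parts.
fn compose_image_chunk_text(block: &ImageRefBlock) -> String {
    let alt = if block.alt.is_empty() {
        // Fall back to the basename of `src` when alt is empty.
        block.src.rsplit('/').next().unwrap_or("[image]")
    } else {
        block.alt.as_str()
    };
    let ocr = block.ocr.as_ref().map(|o| o.joined.as_str()).unwrap_or("");
    let cap = block.caption.as_ref().map(|c| c.text.as_str()).unwrap_or("");
    [alt, ocr, cap]
        .iter()
        .copied()
        .filter(|s| !s.is_empty())
        .collect::<Vec<&str>>()
        .join("\n\n")
}

/// Hypothetical fixture helper, not part of any kebab crate.
fn block(src: &str, alt: &str, ocr: Option<&str>, caption: Option<&str>) -> ImageRefBlock {
    ImageRefBlock {
        src: src.to_string(),
        alt: alt.to_string(),
        ocr: ocr.map(|s| OcrText { joined: s.to_string() }),
        caption: caption.map(|s| Caption { text: s.to_string() }),
    }
}
```

For example, `compose_image_chunk_text(&block("assets/diagram.png", "", None, None))` yields `"diagram.png"`, which is the filename-only chunk text expected when both OCR and caption are disabled.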
## Behavior contract
### ingest dispatch (kebab-app)
- `kebab-app::ingest_with_config` reads the asset stream from `kebab-source-fs`. For each `RawAsset`:
  - `MediaType::Markdown` → existing markdown extractor path (unchanged).
  - `MediaType::Image(_)` → new image branch (this task).
  - `MediaType::Pdf | MediaType::Audio(_) | MediaType::Other(_)` → `skipped += 1` (existing behaviour).
- Image branch:
  1. Read bytes via `kebab-source-fs` (existing helper).
  2. Build a `kebab_core::ExtractContext { asset, workspace_root, config: &ExtractConfig::default() }`.
  3. Call `ImageExtractor::new().extract(&ctx, &bytes)`. Failure (`Err(_)`) → `errors += 1`, log, continue to next asset (do not abort the whole ingest).
  4. Take a mutable borrow of `doc.blocks[0]` (must be `Block::ImageRef`); take a mutable borrow of `doc.provenance.events`.
  5. If `config.image.ocr.enabled`:
     - Build `OllamaVisionOcr::new(&cfg)` **once per ingest session** (not per asset); see "LM / OCR engine construction" below.
     - Call `apply_ocr(&engine, &bytes, block, lang_hint, &mut events)`.
     - Failure → log a `tracing::warn!`, append `ProvenanceKind::Warning` event with the error message via the helper, continue.
     - `block.ocr` stays `None` on failure (P6-2 contract).
  6. If `config.image.caption.enabled`:
     - Build the LM **once per ingest session** (not per asset); see "LM / OCR engine construction" below.
     - Call `apply_caption(&*llm, &bytes, block, lang_hint, &cfg, &mut events)`.
     - Failure → same lenient policy as OCR: warn, log, continue. `block.caption` stays `None`.
  7. Pass the (possibly partially-populated) `CanonicalDocument` to the existing chunker → embedder → store path, identical to markdown.
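The hard-failure versus lenient-failure split in the image branch can be sketched with stand-in types. Everything below is an assumption made for illustration: `TextEngine` stands in for the OCR / caption adapters, `Canned` is a test double, `extract` treats empty input as a decode failure, and the `caption_failed:` prefix is a hypothetical label (only `ocr_failed:` appears in the commit notes). It is not the `kebab-app` implementation.

```rust
#[derive(Debug, Default)]
struct ImageBlock {
    ocr: Option<String>,
    caption: Option<String>,
}

#[derive(Debug, Default)]
struct Report {
    new: u32,
    errors: u32,
    warnings: Vec<String>,
}

/// Stand-in for the OCR / caption adapters (assumption, not the real trait).
trait TextEngine {
    fn run(&self, bytes: &[u8]) -> Result<String, String>;
}

/// Test double that always returns its canned result.
struct Canned(Result<String, String>);

impl TextEngine for Canned {
    fn run(&self, _bytes: &[u8]) -> Result<String, String> {
        self.0.clone()
    }
}

/// Hypothetical extractor: empty input counts as a decode failure.
fn extract(bytes: &[u8]) -> Result<ImageBlock, String> {
    if bytes.is_empty() {
        Err("decode failed: empty input".to_string())
    } else {
        Ok(ImageBlock::default())
    }
}

/// Extract failure is a hard error (nothing stored). OCR / caption failure
/// only records a warning and leaves the matching field as `None`.
fn ingest_image(
    bytes: &[u8],
    ocr: Option<&dyn TextEngine>,
    caption: Option<&dyn TextEngine>,
    report: &mut Report,
) -> Option<ImageBlock> {
    let mut block = match extract(bytes) {
        Ok(block) => block,
        Err(_) => {
            report.errors += 1;
            return None;
        }
    };
    if let Some(engine) = ocr {
        match engine.run(bytes) {
            Ok(text) => block.ocr = Some(text),
            Err(e) => report.warnings.push(format!("ocr_failed: {e}")),
        }
    }
    if let Some(engine) = caption {
        match engine.run(bytes) {
            Ok(text) => block.caption = Some(text),
            Err(e) => report.warnings.push(format!("caption_failed: {e}")),
        }
    }
    report.new += 1;
    Some(block)
}
```

With an OCR engine that returns an error and a working caption engine, the call still returns `Some(block)` with `block.ocr == None`, `errors` stays at 0, and one warning is recorded.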
### Parallelism
- The image branch shares the existing markdown worker pool — one asset dispatch unit (markdown OR image) per worker — so `config.indexing.max_parallel_extractors` keeps its current meaning. The current `kebab-app` ingest is sequential per-asset (single worker irrespective of the knob value); image branch adds zero new concurrency. A future P+ task may parallelise both branches, at which point the OCR / caption HTTP calls naturally become the throughput ceiling (one in-flight request per worker — `reqwest::blocking::Client` is shared but each call blocks its worker thread until response).
- Implication for sizing: a 5000-asset ingest with OCR enabled runs as roughly `5000 × (per-asset OCR latency)` end-to-end. With `gemma4:e4b` at ~3-5s per call this is the 4-7 hour range the brainstorming flagged. Books-as-PDF route bypasses this entirely.
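
The sizing figure follows from plain multiplication, since the sequential path issues one blocking call per asset. A small sketch of that arithmetic (the function name is illustrative, and the latencies are the ones quoted above, not new measurements):

```rust
/// End-to-end hours for a sequential ingest with one blocking OCR call per asset.
fn sequential_hours(assets: u64, secs_per_call: f64) -> f64 {
    assets as f64 * secs_per_call / 3600.0
}

// sequential_hours(5000, 3.0) is about 4.2 hours
// sequential_hours(5000, 5.0) is about 6.9 hours
```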
### Lang hint
- `lang_hint: Option<&Lang>` passed to `apply_ocr` / `apply_caption` reads from `doc.lang` (set to `Lang("und")` by P6-1).
- `kebab-parse-image` already special-cases `"und"` so the prompt does not embed a misleading hint.
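
If the caller wants to normalise the hint itself rather than rely on the adapter, one possible shape is a small mapping from the document language to an optional hint. The sketch assumes `Lang` is a newtype over a language-tag string (a stand-in for the `kebab-core` type), and the function name `lang_hint` is illustrative.

```rust
/// Stand-in for the real language type (assumption: newtype over a tag string).
#[derive(Debug, PartialEq)]
struct Lang(String);

/// Returns `None` for the undetermined tag "und" or an empty tag,
/// otherwise passes the language through as the hint.
fn lang_hint(lang: &Lang) -> Option<&Lang> {
    match lang.0.as_str() {
        "" | "und" => None,
        _ => Some(lang),
    }
}
```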
### LM / OCR engine construction
Both the caption LM and the OCR adapter wrap `reqwest::blocking::Client`, whose internal `Arc` makes a single instance cheap to share across all assets. Both are constructed **once per ingest invocation**, before the asset loop, gated on the matching `enabled` flag.
- **Caption** — when `config.image.caption.enabled = true`, build `OllamaLanguageModel::new(&cfg)` once. Stored as `Box<dyn LanguageModel>` (or trait object behind `&`) and passed to every `apply_caption` call.
- **OCR** — when `config.image.ocr.enabled = true`, build `OllamaVisionOcr::new(&cfg)` once. Same `Arc`-share property; passed by `&` to every `apply_ocr` call.
- Endpoints — `OllamaVisionOcr::new` already falls back to `models.llm.endpoint` when `image.ocr.endpoint` is `None`, so a single host typically serves both LLM and OCR. The two adapters can therefore share an Ollama host or run against separate hosts independently.
- **Construction failure** (e.g. invalid endpoint string, empty model id) → ingest aborts with the constructor's error before any asset is scanned. Never silently disables OCR / caption.
- Per-asset cost — only the HTTP call to Ollama. Adapter struct is reused.
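
The build-once, fail-fast rule can be sketched as a constructor step that runs before the asset loop. All names below are assumptions for illustration: `ImageConfig` flattens the two `enabled` flags and a single endpoint into one struct, and `Adapter` stands in for both `OllamaVisionOcr` and `OllamaLanguageModel` without any HTTP client.

```rust
/// Hypothetical flattened view of the `image.ocr` / `image.caption` settings.
struct ImageConfig {
    ocr_enabled: bool,
    caption_enabled: bool,
    endpoint: String,
}

/// Stand-in adapter; the real ones wrap an HTTP client.
#[derive(Debug)]
struct Adapter {
    endpoint: String,
}

impl Adapter {
    /// Rejects an empty endpoint, mimicking a constructor-time validation error.
    fn new(endpoint: &str) -> Result<Self, String> {
        if endpoint.is_empty() {
            return Err("empty endpoint".to_string());
        }
        Ok(Adapter { endpoint: endpoint.to_string() })
    }
}

#[derive(Debug)]
struct Pipeline {
    ocr: Option<Adapter>,
    caption: Option<Adapter>,
}

/// Builds each enabled adapter exactly once. Any constructor error is
/// returned to the caller, so ingest stops before the asset loop starts.
fn build_pipeline(cfg: &ImageConfig) -> Result<Pipeline, String> {
    let ocr = if cfg.ocr_enabled {
        Some(Adapter::new(&cfg.endpoint)?)
    } else {
        None
    };
    let caption = if cfg.caption_enabled {
        Some(Adapter::new(&cfg.endpoint)?)
    } else {
        None
    };
    Ok(Pipeline { ocr, caption })
}
```

A disabled feature never calls its constructor, so a bad endpoint only aborts ingest when the feature that needs it is switched on.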
### Chunking (kebab-chunk md-heading-v1)
- The chunker gains an image-only branch as sketched above. The branch:
  - Returns exactly one `Chunk` per image document.
  - `chunk.text` = (β) plain concatenation of `[alt, ocr_joined, caption_text]` joined by `\n\n`, dropping empty parts.
  - `chunk.block_ids = vec![block.common.block_id.clone()]`.
  - `chunk.heading_path = vec![]` (image documents have no heading hierarchy).
  - `chunk.source_spans = vec![block.common.source_span.clone()]` (`Vec<SourceSpan>` per the `kebab_core::Chunk` definition); the image branch contributes one element holding the `Region { x, y, w, h }` from P6-1.
  - `chunk.token_estimate` follows the existing `md-heading-v1` token-count convention (whitespace-segmented words clamped to `policy.target_tokens`).
  - `chunk.policy_hash` is the existing `ChunkPolicy` hex digest the chunker already computes for markdown; image branch reuses the same value to keep policy edits invalidating image chunks alongside markdown chunks.
- Determinism: the existing `id_for_chunk(doc_id, chunker_version, &[block_id], policy_hash)` recipe applies unchanged.
- Oversized text: an image whose `ocr.joined` exceeds `policy.target_tokens` produces an oversized chunk. Acceptable in v1 — the user-facing scenario (diagrams / screenshots / camera photos) typically yields ≤ 1 page of OCR. Books are routed through P7 PDF instead (see "Out of scope"). A `tracing::warn!` fires when this happens so a future P+ task can quantify how often the boundary is hit.
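
One reading of the token-count convention and the oversize signal, as described in this section, is sketched below. It is an assumption based only on the one-line description ("whitespace-segmented words clamped to `policy.target_tokens`"); the real `md-heading-v1` counting code may differ, and the function name is illustrative.

```rust
/// Returns (estimate, oversized): the whitespace word count clamped to
/// `target_tokens`, plus a flag telling the caller whether the text
/// exceeded the target (the case where a warning would be logged).
fn token_estimate(text: &str, target_tokens: usize) -> (usize, bool) {
    let words = text.split_whitespace().count();
    (words.min(target_tokens), words > target_tokens)
}
```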
### `enabled = false` for both
- When `config.image.ocr.enabled = false` AND `config.image.caption.enabled = false`, the image is still extracted, stored, and chunked — the user gets EXIF + dimensions + filename indexed, with empty OCR / caption.
- The synthesized chunk text falls back to just the filename. Lexical search on filenames still works; vector search produces a best-effort embedding from a one-line input.
### Failure semantics summary
| failure | counter | doc stored? | provenance |
|---|---|---|---|
| `ImageExtractor::extract` Err (decode fail / unrecognised) | `errors+=1` | no | n/a |
| `MediaType::Image(_)` but `supports()` returns false (won't happen with current trait — defensive) | `skipped+=1` | no | n/a |
| `apply_ocr` Err | unchanged | yes (block.ocr = None) | `ProvenanceKind::Warning`, agent `kb-app` |
| `apply_caption` Err | unchanged | yes (block.caption = None) | `ProvenanceKind::Warning`, agent `kb-app` |
| HEIC / RAW → `MediaType::Other(_)` | `skipped+=1` (existing) | no | n/a |
## Storage / wire effects
- `kebab-store-sqlite::documents` table gains rows whose `parser_version = "image-meta-v1"`.
- `kebab-store-sqlite::blocks` table gains rows of `block_kind = "imageref"`.
- `kebab-store-sqlite::chunks` table gains rows whose `chunker_version = "md-heading-v1"` AND whose `block_ids` reference a single `imageref` block.
- `kebab-store-vector::chunk_embeddings_<model>_<dim>` gains one vector per image chunk.
- `IngestReport` (wire `ingest_report.v1`) counters update naturally: `scanned` includes images, `new` / `updated` track image docs, `errors` counts decode failures, `skipped` counts unsupported formats.
- No new wire schemas or `kebab-core` types.
## Test plan
| kind | description | fixture / data |
|------|-------------|----------------|
| integration | TempDir KB + 1 PNG (`hello-world.png` with text) + `image.ocr.enabled = true` + `image.caption.enabled = false` (mock LM unused) → `kebab-app::ingest` produces 1 doc + 1 chunk; chunk text contains the filename + OCR text | `kebab-app/tests/image_pipeline.rs` (new), wiremock for OCR Ollama call |
| integration | Same fixture but `caption.enabled = true` with mock LM returning `"a red square with text"` → chunk text contains alt + OCR + caption joined by `\n\n` | wiremock for Ollama (both `/api/generate` calls) |
| integration | Determinism: ingest the same PNG twice → identical `doc_id`, `chunk_id` (P1 idempotency contract holds) | inline |
| integration | OCR Ollama returns 503 → asset still indexed; `block.ocr = None`; provenance has Warning event; `errors` counter NOT incremented; ingest returns Ok | wiremock |
| integration | Caption Ollama returns 503 → asset still indexed; `block.caption = None`; provenance Warning; `errors` not incremented | wiremock |
| integration | `image.ocr.enabled = false` AND `image.caption.enabled = false` → image still indexed; chunk text = filename only | inline |
| integration | Hybrid search across mixed corpus (1 markdown + 1 PNG) returns image chunk for an OCR-text query | inline (real `multilingual-e5-small` embedding) |
| integration | `kebab inspect doc <image_doc_id>` returns the image `CanonicalDocument` with `block.ocr` / `block.caption` populated | inline |
| unit | `kebab-chunk::md_heading_v1::is_image_only_document` returns true for `[ImageRef]`, false for `[Heading, ImageRef]` (image embedded in markdown — currently a P+ case but the predicate must not misfire) | unit |
| unit | `compose_image_chunk_text` drops empty parts: alt-only, alt+ocr, alt+caption, alt+ocr+caption all formatted correctly | unit |
| smoke | Update `docs/SMOKE.md` so the runbook ingests at least one image fixture and verifies search-by-OCR-text works | docs change |
The opt-in real-Ollama integration test from P6-2 / P6-3 stays inside `kebab-parse-image`; this task's integration tests use wiremock so `cargo test --workspace` stays hermetic.
## Definition of Done
### Spec PR (this PR — `spec/p6-4-image-ingest-wiring`)
- [x] `tasks/p6/p6-4-image-ingest-wiring.md` written + self-reviewed (placeholders / contradictions / ambiguity / scope)
- [x] `tasks/INDEX.md` updated to "P6 — 4 components"
- [x] Links to design §3.4, §6.1, §9.1 in the PR body
- [x] All brainstorming decisions (option A chunking / β chunk text / Lenient failure policy / LM built once / books moved to P7 / P6-5 not started) reflected in the body
### Implementation PR (follow-up — `feat/p6-4-image-ingest-wiring`)
- [ ] `cargo check --workspace` passes
- [ ] `cargo test --workspace --no-fail-fast -j 1` passes (all new integration tests green)
- [ ] `cargo clippy --workspace --all-targets -- -D warnings` passes
- [ ] `kebab ingest` against a TempDir KB containing 1 markdown + 1 PNG produces `scanned 2 / new 2 / errors 0`
- [ ] `kebab search --mode lexical "<OCR text>"` returns the image chunk
- [ ] `kebab inspect doc <image_doc_id>` shows non-empty `block.ocr` / `block.caption`
- [ ] `docs/SMOKE.md` includes an image-fixture step
## Out of scope
- **Books / scanned PDFs** — routed through P7 PDF pipeline. P7's PDF text extractor handles text-embed PDFs natively; scanned PDFs are P7's internal concern (page render → OCR call to the same `kebab-parse-image::OllamaVisionOcr` adapter).
- **Image-region chunker** — current `OllamaVisionOcr` returns a single full-image region, so per-region chunking has no signal. When a region-aware engine (Tesseract / Apple Vision sidecar) lands, a separate `image-region-chunker` task can split on `OcrText.regions[]`.
- **Long-OCR splitting** — image documents whose OCR exceeds `target_tokens` produce oversized chunks. Acceptable for diagrams / screenshots / photos (the v1 user scenario per the brainstorming with the user). Books deliberately use the PDF path instead.
- **Retry mechanism for partial OCR / caption failures** — current escape hatch is "delete the asset, re-ingest". A `kebab ingest --retry-image-analysis` flag is a P+ enhancement once operational data shows it's needed.
- **Parallelism beyond `config.indexing.max_parallel_extractors`** — the existing knob applies. Per-image OCR-bound parallelism (multiple in-flight Ollama calls) is a P+ scale-hardening task that has been explicitly de-scoped because the user routes books through PDF instead of image.
- **Progress reporting** (`kebab ingest --progress`) — same reasoning as parallelism.
- **W3C Media Fragments citation form** (`path#xywh=...`) — `Citation` already carries `Region` source span; fragment URI rendering can be a wire-only follow-up when downstream UI needs it.
- **Wire schema bump** — no new wire types; `ingest_report.v1` counters absorb image events naturally.
## Risks / notes
- **OCR / caption failures are silent in `IngestReport` counters.** The user only sees them via `--debug` traces or `kebab inspect doc <id>` (Provenance Warning events). This is the intentional Lenient policy from the brainstorming; flag it in the PR description so users know how to detect partial failures. A future spec extension could introduce an `image_ocr_failed` / `image_caption_failed` counter alongside `errors`.
- **LM endpoint validation runs once per ingest.** A misconfigured `models.llm.endpoint` aborts ingest before any asset is processed. This is correct fail-fast behaviour but means a single broken endpoint takes the whole ingest down — even markdown assets that don't need the LM. The user fix is `image.caption.enabled = false` or correcting the endpoint.
- **`unreachable!` in the chunker branch is guarded by `is_image_only_document`.** If a future task introduces multi-block image documents (e.g. embedded markdown caption alongside the image), the predicate must change in lockstep. Keep both functions in the same module so the invariant is local.
- **Determinism stress.** The existing markdown path's `kb-normalize::build_canonical_document` shares one `now_utc()` reading across the per-document Provenance events. The image path needs the same: P6-1 already does `let now = OffsetDateTime::now_utc();` once and reuses it for both events; this task's wiring must not introduce a second `now()` call between extract and apply_ocr/caption that would break per-document timestamp parity.
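
The timestamp-parity requirement amounts to reading the clock once per document and reusing that value for every event. A minimal sketch, assuming a simplified `Event` type and using `std::time::SystemTime` in place of the `OffsetDateTime` the real code uses:

```rust
use std::time::SystemTime;

/// Simplified provenance event (assumption, not the real type).
#[derive(Debug)]
struct Event {
    kind: &'static str,
    at: SystemTime,
}

/// Reads the clock once and stamps every event for the document with it.
fn events_for_document(kinds: &[&'static str]) -> Vec<Event> {
    let now = SystemTime::now(); // single clock read per document
    kinds.iter().map(|&kind| Event { kind, at: now }).collect()
}
```

Calling the clock again inside the `map` closure would compile just as well, which is why the invariant is easy to break without noticing.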