Files

altair823 ca0567c72b feat(kebab-app): P6-4 image ingest wiring — kebab ingest 가 PNG/JPEG 자산도 처리

P6-1/P6-2/P6-3 의 라이브러리 (`ImageExtractor`, `OllamaVisionOcr`,
`apply_caption`) 가 그동안 CLI 에서 보이지 않던 미완 구간을 완성.
이제 `kebab ingest` 가 markdown 외에 이미지 자산을 end-to-end 로
색인하고, `kebab search` / `kebab ask` 가 OCR 텍스트 + caption 으로
이미지를 매칭/인용한다.

## kebab-app

- `[dependencies]` 에 `kebab-parse-image` 추가.
- `ingest_with_config` 진입 시 `image.ocr.enabled` / `image.caption.enabled`
  플래그에 따라 `OllamaVisionOcr` / `OllamaLanguageModel` 을 **ingest
  세션당 1회** 빌드. 자산 루프에서 trait object 로 공유.
  reqwest::blocking::Client 의 내부 Arc 덕분에 알로케이션 비용은
  자산 수와 무관.
- 두 어댑터 + ImageExtractor 를 한 묶음으로 `ImagePipeline` 구조체에
  담아 `ingest_one_asset` 매개변수 폭증 차단 (clippy::too_many_arguments
  대응).
- `ingest_one_asset` 의 markdown-only 가드를 `match media_type` 으로
  교체 — Markdown 은 기존 경로, Image(_) 는 새 `ingest_one_image_asset`
  로 분기, PDF/Audio/Other 는 종전대로 skipped.
- 신규 `ingest_one_image_asset`:
  - bytes 읽기 → `ImageExtractor::extract` (실패 시 caller 가 errors+=1)
  - `apply_ocr` (Lenient — 실패 시 ProvenanceKind::Warning 이벤트 +
    `IngestItem.warnings` 에 \"ocr_failed: ...\", `block.ocr` 는 None
    유지)
  - `apply_caption` (동일 Lenient 정책)
  - 기존 `MdHeadingV1Chunker` 호출 — 청커는 이미 `Block::ImageRef` 를
    단일 청크로 emit
  - 기존 persist + embed 시퀀스 그대로 (markdown 과 byte-identical)
- `lang_hint_from_doc` — `Lang(\"und\")` 또는 빈 문자열을 None 으로
  매핑 (image-pipeline 어댑터의 build_prompt 가 \"und\" 를 silent drop
  하지 않도록 caller 측에서 미리).

## kebab-chunk

- `render_block_text` 의 `Block::ImageRef` 분기를 P6-4 (β) plain
  concat 정책으로 교체 — `[alt, ocr.joined, caption.text]` 를 `\\n\\n`
  로 join, 빈 부분은 drop. alt 가 비면 `src` 의 basename 으로 fallback
  (P6-1 contract 의 defensive guard).
- 신규 unit 테스트 `image_ref_p6_4_plain_concat_drops_empty_parts` —
  alt-only / alt+ocr / alt+caption / alt+ocr+caption / 빈 alt → src
  fallback 다섯 케이스 모두 검증.
- 기존 `image_ref_emits_own_chunk_zero_tokens` 그대로 통과 — 청커의
  per-block dispatch 는 변경 없음, text 렌더링만 갱신.

## 통합 테스트 (kebab-app/tests/image_pipeline.rs)

wiremock 으로 Ollama 를 stub. 5건:

1. OCR-only happy path — 1 PNG + ocr.enabled → 1 doc + 1 chunk emit,
   `block.ocr.joined` 가 mock 의 \"Hello World 2026\".
2. OCR + caption 동시 활성 — 두 필드 모두 채워지고 chunk text 에
   alt + ocr + caption 세 부분 모두 포함.
3. Lenient 실패 검증 — OCR 503 시 자산은 indexed (kind=New),
   `errors=0`, ProvenanceKind::Warning attributed to \"kb-app\",
   `IngestItem.warnings` 에 \"ocr_failed:\" 노트.
4. 양쪽 비활성 — `image.ocr.enabled=false && image.caption.enabled=false`
   여도 자산은 chunk 1개로 indexed (chunk text=filename), EXIF +
   dimensions 그대로 채워짐.
5. 결정성 (re-ingest) — 동일 PNG 두 번 ingest 시 두 번째는
   `Updated` + 동일 `doc_id`.

## SMOKE.md

`kebab search --mode lexical \"Hello World\"` 단계를 명령 시퀀스에
추가. `[image.ocr]` / `[image.caption]` config 절 예시 + ingest 시간
추정 (자산당 ~5-10초) 추가. \"책은 P7 PDF 라인으로\" 가이드를 검증
체크리스트 와 \"알려진 동작\" 양쪽에 박음.

## 실 Ollama 통합 검증

192.168.0.47 + gemma4:e4b 기준:

```
$ kebab --config /tmp/kebab-smoke/config.toml ingest
scanned 2  new 2  updated 0  skipped 0  errors 0  (18395 ms)

$ kebab inspect doc <image_doc_id>
parser_version: image-meta-v1
blocks: [{
  alt: \"hello.png\",
  ocr: \"Hello World 2026\",
  caption: \"The image displays the text \\\"Hello World 2026\\\" in a large, black, sans-serif font.\"
}]

$ kebab --json ask \"Hello World 텍스트가 어디에 있나?\" --mode hybrid
grounded: true
citations: [{marker: \"[1]\", doc_path: \"hello.png\"}]
```

## 검증

- `cargo test --workspace --no-fail-fast -j 1` — 전부 pass
- `cargo clippy --workspace --all-targets -- -D warnings` — pass
- `cargo test -p kebab-chunk image_ref` — 2 pass (P1-5 회귀 + P6-4
  신규 unit)
- `cargo test -p kebab-app --test image_pipeline` — 5 pass

## 의존성 경계

- `kebab-app` 이 `kebab-parse-image` 추가 — spec Allowed dep 그대로.
- 새 forbidden 침범 없음 (기존 `kebab-tui` / `kebab-desktop` /
  `kebab-eval` 미참조 유지).
- 본 task 가 신설하는 image-specific 비즈니스 로직 0줄 — 모두
  `kebab-parse-image` 에 위임.

`tasks/p6/p6-4-image-ingest-wiring.md` status: planned → completed.

contract: docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
sections: §3.4 ImageRefBlock, §6.1 ingest pipeline, §7.2
Extractor/Chunker traits, §9.1 image extraction policy.

2026-05-02 07:37:56 +00:00

19 KiB

Raw Permalink Blame History

phase, component, task_id, title, status, depends_on, unblocks, contract_source, contract_sections

phase

component

task_id

title

status

depends_on

unblocks

contract_source

contract_sections

kebab-app (image ingest dispatch + chunking)

p6-4

Wire ImageExtractor + OCR + caption into kebab-app::ingest end-to-end

completed

p6-1

p6-2

p6-3

p1-6

p3-5

../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md

§3.4 ImageRefBlock

§6.1 ingest pipeline

§7.2 Extractor/Chunker traits

§9.1 image extraction policy

p6-4 — Image ingest wiring (kebab-app + chunker)

Goal

Make kebab ingest end-to-end functional for image assets. The library-level pipeline (ImageExtractor, OllamaVisionOcr, apply_caption) is complete and tested in isolation — this task connects the wires from kebab-source-fs (which already classifies .png / .jpg / .webp / .gif / .tiff as MediaType::Image(_)) through kebab-app::ingest_with_config to kebab-store-sqlite + kebab-store-vector, so a user running kebab ingest against a workspace containing diagrams / screenshots / camera photos sees those assets indexed and searchable.

Why now / why this size

P6-1 / P6-2 / P6-3 each shipped a focused library and a passing test suite, but all three deliver value only after kebab-app::ingest learns how to dispatch on MediaType::Image(_). The wiring is small (one new dispatch arm, one chunking branch, one LM-construction call site) but materially user-facing — without it, the entire P6 phase is invisible from the CLI. Pulling this into its own task keeps the P6-1/2/3 specs frozen as written while letting the integration evolve under its own contract.

Allowed dependencies

kebab-app 의 현재 Cargo.toml 그대로의 surface — 본 task 는 그 위에 kebab-parse-image 한 줄을 추가합니다.

kebab-core
kebab-config
kebab-source-fs
kebab-parse-md, kebab-parse-types
kebab-normalize
kebab-chunk (image-document branch in md-heading-v1 — this task extends it)
kebab-store-sqlite, kebab-store-vector
kebab-search
kebab-embed, kebab-embed-local
kebab-llm, kebab-llm-local (constructs OllamaLanguageModel for caption)
kebab-rag
kebab-parse-image (NEW — added by this task)
anyhow, serde_json, tracing

Forbidden dependencies

kebab-tui, kebab-desktop (P9 미시작 — UI crate 가 ingest 를 호출하면 layering 위반)
kebab-eval (cycle 위험 — eval 이 ingest 를 호출하므로 그 반대는 금지)
본 task 가 신설하는 자체 image extractor / OCR / caption 비즈니스 로직 — 모두 kebab-parse-image 에 이미 존재. kebab-app 안에 image-specific 로직 추가 금지 (얇은 dispatch + glue 만 허용).

Inputs

input	type	source
workspace assets	`RawAsset` stream	`kebab-source-fs::SourceConnector::scan` (already classifies image types)
image bytes	`&[u8]`	filesystem read in `kebab-app`
`Config`	`kebab_config::Config`	CLI `--config` flag → `Config::load`
OCR config	`config.image.ocr`	P6-2
Caption config	`config.image.caption`	P6-3

Outputs

output	type	downstream
`CanonicalDocument` per image	written via `DocumentStore::put_document`	`kebab-store-sqlite`
One synthesized chunk per image	`Vec<Chunk>` (length 1)	`kebab-store-sqlite::put_chunks` + `kebab-store-vector::upsert`
Updated `IngestReport` counters	`scanned / new / updated / skipped / errors`	wire output (`ingest_report.v1`)

Public surface

No new public types. The wiring exists inside kebab-app::ingest_with_config (and its private helpers). kebab-chunk gains one image-only branch in md_heading_v1:

// crates/kebab-chunk/src/md_heading_v1.rs (additions only — sketch)
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
    if is_image_only_document(doc) {
        return Ok(vec![image_chunk(doc, policy)?]);
    }
    // ... existing markdown heading logic untouched ...
}

/// Returns true iff `doc` is an image-only document (single `ImageRef`
/// block). P6-1's `ImageExtractor` already guarantees this shape today
/// — the predicate exists as a defensive guard against (a) a future
/// task that introduces multi-block image documents, and (b) accidental
/// route-through of a `[Block::Heading, Block::ImageRef, ...]` shape
/// that would look image-ish but should still flow through the
/// markdown chunker.
fn is_image_only_document(doc: &CanonicalDocument) -> bool {
    doc.blocks.len() == 1
        && matches!(doc.blocks.first(), Some(Block::ImageRef(_)))
}

fn image_chunk(doc: &CanonicalDocument, policy: &ChunkPolicy) -> Result<Chunk> {
    let block = match &doc.blocks[0] {
        Block::ImageRef(b) => b,
        _ => unreachable!("guarded by is_image_only_document"),
    };
    let text = compose_image_chunk_text(block);
    // chunk_id derived from doc_id + chunker_version + [block_id] + policy_hash
    // (existing §4.2 recipe — no new ID kind).
    Ok(Chunk { /* ... */ })
}

compose_image_chunk_text is the canonical place where the (β) plain-concatenation policy lives:

fn compose_image_chunk_text(block: &ImageRefBlock) -> String {
    let alt = if block.alt.is_empty() {
        // alt should never be empty at this stage because P6-1 falls
        // back to the filename when it is, but the chunker stays
        // defensive — an empty first line would degrade lexical
        // search hits on filenames.
        block.src.rsplit('/').next().unwrap_or("[image]").to_string()
    } else {
        block.alt.clone()
    };
    let ocr = block.ocr.as_ref().map(|o| o.joined.as_str()).unwrap_or("");
    let cap = block.caption.as_ref().map(|c| c.text.as_str()).unwrap_or("");
    [alt, ocr.to_string(), cap.to_string()]
        .into_iter()
        .filter(|s| !s.is_empty())
        .collect::<Vec<_>>()
        .join("\n\n")
}

Behavior contract

ingest dispatch (kebab-app)

kebab-app::ingest_with_config reads the asset stream from kebab-source-fs. For each RawAsset:
- MediaType::Markdown → existing markdown extractor path (unchanged).
- MediaType::Image(_) → new image branch (this task).
- MediaType::Pdf | MediaType::Audio(_) | MediaType::Other(_) → skipped += 1 (existing behaviour).
Image branch:
1. Read bytes via kebab-source-fs (existing helper).
2. Build a kebab_core::ExtractContext { asset, workspace_root, config: &ExtractConfig::default() }.
3. Call ImageExtractor::new().extract(&ctx, &bytes). Failure (Err(_)) → errors += 1, log, continue to next asset (do not abort the whole ingest).
4. Take a mutable borrow of doc.blocks[0] (must be Block::ImageRef); take a mutable borrow of doc.provenance.events.
5. If config.image.ocr.enabled:
  - Build OllamaVisionOcr::new(&cfg).
  - Call apply_ocr(&engine, &bytes, block, lang_hint, &mut events).
  - Failure → log a tracing::warn!, append ProvenanceKind::Warning event with the error message via the helper, continue.
  - block.ocr stays None on failure (P6-2 contract).
6. If config.image.caption.enabled:
  - Build the LM once per ingest session (not per asset) — see "LM construction" below.
  - Call apply_caption(&*llm, &bytes, block, lang_hint, &cfg, &mut events).
  - Failure → same lenient policy as OCR: warn, log, continue. block.caption stays None.
7. Pass the (possibly partially-populated) CanonicalDocument to the existing chunker → embedder → store path, identical to markdown.

Parallelism

The image branch shares the existing markdown worker pool — one asset dispatch unit (markdown OR image) per worker — so config.indexing.max_parallel_extractors keeps its current meaning. The current kebab-app ingest is sequential per-asset (single worker irrespective of the knob value); image branch adds zero new concurrency. A future P+ task may parallelise both branches, at which point the OCR / caption HTTP calls naturally become the throughput ceiling (one in-flight request per worker — reqwest::blocking::Client is shared but each call blocks its worker thread until response).
Implication for sizing: a 5000-asset ingest with OCR enabled runs as roughly 5000 × (per-asset OCR latency) end-to-end. With gemma4:e4b at ~3-5s per call this is the 4-7 hour range the brainstorming flagged. Books-as-PDF route bypasses this entirely.

Lang hint

lang_hint: Option<&Lang> passed to apply_ocr / apply_caption reads from doc.lang (set to Lang("und") by P6-1).
kebab-parse-image already special-cases "und" so the prompt does not embed a misleading hint.

LM / OCR engine construction

Both the caption LM and the OCR adapter wrap reqwest::blocking::Client, whose internal Arc makes a single instance cheap to share across all assets. Both are constructed once per ingest invocation, before the asset loop, gated on the matching enabled flag.

Caption — when config.image.caption.enabled = true, build OllamaLanguageModel::new(&cfg) once. Stored as Box<dyn LanguageModel> (or trait object behind &) and passed to every apply_caption call.
OCR — when config.image.ocr.enabled = true, build OllamaVisionOcr::new(&cfg) once. Same Arc-share property; passed by & to every apply_ocr call.
Endpoints — OllamaVisionOcr::new already falls back to models.llm.endpoint when image.ocr.endpoint is None, so a single host typically serves both LLM and OCR. The two adapters can therefore share an Ollama host or run against separate hosts independently.
Construction failure (e.g. invalid endpoint string, empty model id) → ingest aborts with the constructor's error before any asset is scanned. Never silently disables OCR / caption.
Per-asset cost — only the HTTP call to Ollama. Adapter struct is reused.

Chunking (kebab-chunk md-heading-v1)

The chunker gains an image-only branch as sketched above. The branch:
- Returns exactly one Chunk per image document.
- chunk.text = (β) plain concatenation of [alt, ocr_joined, caption_text] joined by \n\n, dropping empty parts.
- chunk.block_ids = vec![block.common.block_id.clone()].
- chunk.heading_path = vec![] (image documents have no heading hierarchy).
- chunk.source_spans = vec![block.common.source_span.clone()] — Vec<SourceSpan> per kebab_core::Chunk definition; image branch contributes one element holding the Region { x, y, w, h } from P6-1.
- chunk.token_estimate follows the existing md-heading-v1 token-count convention (whitespace-segmented words clamped to policy.target_tokens).
- chunk.policy_hash is the existing ChunkPolicy hex digest the chunker already computes for markdown; image branch reuses the same value to keep policy edits invalidating image chunks alongside markdown chunks.
Determinism: the existing id_for_chunk(doc_id, chunker_version, &[block_id], policy_hash) recipe applies unchanged.
Oversized text: an image whose ocr.joined exceeds policy.target_tokens produces an oversized chunk. Acceptable in v1 — the user-facing scenario (diagrams / screenshots / camera photos) typically yields ≤ 1 page of OCR. Books are routed through P7 PDF instead (see "Out of scope"). A tracing::warn! fires when this happens so a future P+ task can quantify how often the boundary is hit.

`enabled = false` for both

When config.image.ocr.enabled = false AND config.image.caption.enabled = false, the image is still extracted, stored, and chunked — the user gets EXIF + dimensions + filename indexed, with empty OCR / caption.
The synthesized chunk text falls back to just the filename. Lexical search on filenames still works; vector search produces a best-effort embedding from a one-line input.

Failure semantics summary

failure	counter	doc stored?	provenance
`ImageExtractor::extract` Err (decode fail / unrecognised)	`errors+=1`	no	n/a
`MediaType::Image(_)` but `supports()` returns false (won't happen with current trait — defensive)	`skipped+=1`	no	n/a
`apply_ocr` Err	unchanged	yes (block.ocr = None)	`ProvenanceKind::Warning`, agent `kb-app`
`apply_caption` Err	unchanged	yes (block.caption = None)	`ProvenanceKind::Warning`, agent `kb-app`
HEIC / RAW → `MediaType::Other(_)`	`skipped+=1` (existing)	no	n/a

Storage / wire effects

kebab-store-sqlite::documents table gains rows whose parser_version = "image-meta-v1".
kebab-store-sqlite::blocks table gains rows of block_kind = "imageref".
kebab-store-sqlite::chunks table gains rows whose chunker_version = "md-heading-v1" AND whose block_ids reference a single imageref block.
kebab-store-vector::chunk_embeddings_<model>_<dim> gains one vector per image chunk.
IngestReport (wire ingest_report.v1) counters update naturally: scanned includes images, new / updated track image docs, errors counts decode failures, skipped counts unsupported formats.
No new wire schemas or kebab-core types.

Test plan

kind	description	fixture / data
integration	TempDir KB + 1 PNG (`hello-world.png` with text) + `image.ocr.enabled = true` + `image.caption.enabled = false` (mock LM unused) → `kebab-app::ingest` produces 1 doc + 1 chunk; chunk text contains the filename + OCR text	`kebab-app/tests/image_pipeline.rs` (new), wiremock for OCR Ollama call
integration	Same fixture but `caption.enabled = true` with mock LM returning `"a red square with text"` → chunk text contains alt + OCR + caption joined by `\n\n`	wiremock for Ollama (both `/api/generate` calls)
integration	Determinism: ingest the same PNG twice → identical `doc_id`, `chunk_id` (P1 idempotency contract holds)	inline
integration	OCR Ollama returns 503 → asset still indexed; `block.ocr = None`; provenance has Warning event; `errors` counter NOT incremented; ingest returns Ok	wiremock
integration	Caption Ollama returns 503 → asset still indexed; `block.caption = None`; provenance Warning; `errors` not incremented	wiremock
integration	`image.ocr.enabled = false` AND `image.caption.enabled = false` → image still indexed; chunk text = filename only	inline
integration	Hybrid search across mixed corpus (1 markdown + 1 PNG) returns image chunk for an OCR-text query	inline (real `multilingual-e5-small` embedding)
integration	`kebab inspect doc <image_doc_id>` returns the image `CanonicalDocument` with `block.ocr` / `block.caption` populated	inline
unit	`kebab-chunk::md_heading_v1::is_image_only_document` returns true for `[ImageRef]`, false for `[Heading, ImageRef]` (image embedded in markdown — currently a P+ case but the predicate must not misfire)	unit
unit	`compose_image_chunk_text` drops empty parts: alt-only, alt+ocr, alt+caption, alt+ocr+caption all formatted correctly	unit
smoke	Update `docs/SMOKE.md` so the runbook ingests at least one image fixture and verifies search-by-OCR-text works	docs change

The opt-in real-Ollama integration test from P6-2 / P6-3 stays inside kebab-parse-image; this task's integration tests use wiremock so cargo test --workspace stays hermetic.

Definition of Done

Spec PR (this PR — `spec/p6-4-image-ingest-wiring`)

tasks/p6/p6-4-image-ingest-wiring.md 작성 + self-review (placeholder / 모순 / 모호성 / scope)
tasks/INDEX.md "P6 — 4 components" 반영
PR 본문에 design §3.4, §6.1, §9.1 링크
brainstorming 의 모든 결정 (옵션 A 청킹 / β 청크 텍스트 / Lenient 실패 정책 / LM 1회 빌드 / 책 P7 이관 / P6-5 미시작) 본문 반영

Implementation PR (follow-up — `feat/p6-4-image-ingest-wiring`)

cargo check --workspace passes
cargo test --workspace --no-fail-fast -j 1 passes (all new integration tests green)
cargo clippy --workspace --all-targets -- -D warnings passes
kebab ingest against a TempDir KB containing 1 markdown + 1 PNG produces scanned 2 / new 2 / errors 0
kebab search --mode lexical "<OCR text>" returns the image chunk
kebab inspect doc <image_doc_id> shows non-empty block.ocr / block.caption
docs/SMOKE.md includes an image-fixture step

Out of scope

Books / scanned PDFs — routed through P7 PDF pipeline. P7's PDF text extractor handles text-embed PDFs natively; scanned PDFs are P7's internal concern (page render → OCR call to the same kebab-parse-image::OllamaVisionOcr adapter).
Image-region chunker — current OllamaVisionOcr returns a single full-image region, so per-region chunking has no signal. When a region-aware engine (Tesseract / Apple Vision sidecar) lands, a separate image-region-chunker task can split on OcrText.regions[].
Long-OCR splitting — image documents whose OCR exceeds target_tokens produce oversized chunks. Acceptable for diagrams / screenshots / photos (the v1 user scenario per the brainstorming with the user). Books deliberately use the PDF path instead.
Retry mechanism for partial OCR / caption failures — current escape hatch is "delete the asset, re-ingest". A kebab ingest --retry-image-analysis flag is a P+ enhancement once operational data shows it's needed.
Parallelism beyond config.indexing.max_parallel_extractors — the existing knob applies. Per-image OCR-bound parallelism (multiple in-flight Ollama calls) is a P+ scale-hardening task that has been explicitly de-scoped because the user routes books through PDF instead of image.
Progress reporting (kebab ingest --progress) — same reasoning as parallelism.
W3C Media Fragments citation form (path#xywh=...) — Citation already carries Region source span; fragment URI rendering can be a wire-only follow-up when downstream UI needs it.
Wire schema bump — no new wire types; ingest_report.v1 counters absorb image events naturally.

Risks / notes

OCR / caption failures are silent in IngestReport counters. The user only sees them via --debug traces or kebab inspect doc <id> (Provenance Warning events). This is the intentional Lenient policy from the brainstorming; flag in the PR description so users know how to detect partial failures. A future spec extension could introduce a image_ocr_failed / image_caption_failed counter alongside errors.
LM endpoint validation runs once per ingest. A misconfigured models.llm.endpoint aborts ingest before any asset is processed. This is correct fail-fast behaviour but means a single broken endpoint takes the whole ingest down — even markdown assets that don't need the LM. The user fix is image.caption.enabled = false or correcting the endpoint.
unreachable! in the chunker branch is guarded by is_image_only_document. If a future task introduces multi-block image documents (e.g. embedded markdown caption alongside the image), the predicate must change in lockstep. Keep both functions in the same module so the invariant is local.
Determinism stress. The existing markdown path's kb-normalize::build_canonical_document shares one now_utc() reading across the per-document Provenance events. The image path needs the same: P6-1 already does let now = OffsetDateTime::now_utc(); once and reuses it for both events; this task's wiring must not introduce a second now() call between extract and apply_ocr/caption that would break per-document timestamp parity.

19 KiB Raw Permalink Blame History Unescape Escape