kebab/crates/kebab-parse-image/src/lib.rs
altair823 d11a810119 feat(kebab-parse-image): P6-1 image extractor + EXIF whitelist
- Add new crate kebab-parse-image (the 19th in the workspace). Implements
  ImageExtractor, which converts MediaType::Image(_) assets into a
  single-block CanonicalDocument.
- parser_version "image-meta-v1" (§9 versioning).
- The body contains exactly one Block::ImageRef; the OCR / caption fields
  are left as None and will be filled in by P6-2 / P6-3.
- EXIF whitelist (§9.1, minimizes the PII surface):
  Make / Model / Software / DateTimeOriginal / Orientation /
  GPSLatitude(+Ref) / GPSLongitude(+Ref). MakerNote / Thumbnail / all
  other tags are discarded. DateTime is converted from EXIF
  "YYYY:MM:DD HH:MM:SS" to ISO-8601.
  GPS DMS triple + N/S/E/W ref → signed decimal degrees.
- Dimensions: only the header is read via image::ImageReader to obtain
  (w, h, format). Exceeding the 16k×16k cap or a decode failure →
  metadata.user.dimensions = null + a Provenance Warning event (not an
  Err). Failure to recognize the format at all → anyhow::Error
  (caller skips).
- SourceSpan::Region { 0, 0, w, h } marks the whole image area.
  Determinism: same bytes + same parser_version → same doc_id + block_id
  (the §4.2 ID recipe is used as is).
- metadata.source_type = Reference, trust_level = Primary, lang = "und".
  title = filename without extension, alt = filename.
- Dependency boundary (§8): kebab-core only, plus image 0.25 (default
  features off, only png/jpeg/webp/gif/tiff), kamadak-exif 0.6, anyhow /
  serde / serde_json / time / tracing / thiserror. No references to
  kebab-source-fs · parse-md · store-* · embed* · llm* · rag · UI crates.
- 14 tests (4 unit + 10 integration):
  • PNG dimension extraction, JPEG EXIF GPS extraction (DMS → decimal
    conversion accurate to 1e-6), PNG without EXIF → empty map,
    corrupted PNG → warning + null dims (no panic), unrecognizable
    bytes → Err, determinism, snapshot, supports() matching, rejection
    of a mismatched media_type.
  • Fixtures are generated in memory (PNG via the image crate; EXIF JPEG
    by building an EXIF blob with the kamadak Writer and splicing it in
    as APP1 right after SOI); no binary fixtures are committed.
- HEIC / RAW are out of scope for v1 per the spec (not supported by the
  image crate; an Apple Vision sidecar will cover them later in P+).
- tasks/p6/p6-1-image-extractor-exif.md status: planned → completed.

contract: docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
sections: §3.4 Block::ImageRef + ImageRefBlock, §3.7a OcrText /
ModelCaption stubs, §9.1 image extraction policy, §9 versioning.
2026-05-02 05:05:47 +00:00
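
The EXIF date conversion named in the commit message can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the crate's actual code: the function name `exif_datetime_to_iso8601` is hypothetical, and the output carries no UTC offset because the EXIF `DateTimeOriginal` string does not include one.

```rust
/// Hypothetical helper: convert an EXIF "YYYY:MM:DD HH:MM:SS" string into
/// an ISO-8601 local date-time "YYYY-MM-DDTHH:MM:SS".
/// Returns `None` when the layout or the field ranges do not match.
fn exif_datetime_to_iso8601(raw: &str) -> Option<String> {
    // Some writers pad the value with NULs or spaces; drop them first.
    let s = raw.trim_matches(|c: char| c == '\0' || c.is_whitespace());
    let b = s.as_bytes();
    if b.len() != 19 {
        return None;
    }
    // Fixed separator positions in "YYYY:MM:DD HH:MM:SS".
    const SEPARATORS: [(usize, u8); 5] =
        [(4, b':'), (7, b':'), (10, b' '), (13, b':'), (16, b':')];
    for (i, &byte) in b.iter().enumerate() {
        let ok = match SEPARATORS.iter().find(|&&(pos, _)| pos == i) {
            Some(&(_, sep)) => byte == sep,
            None => byte.is_ascii_digit(),
        };
        if !ok {
            return None;
        }
    }
    // All 19 bytes are ASCII at this point, so byte-index slicing is safe.
    let field = |r: std::ops::Range<usize>| s[r].parse::<u32>().ok();
    let (month, day) = (field(5..7)?, field(8..10)?);
    let (hour, minute, second) = (field(11..13)?, field(14..16)?, field(17..19)?);
    // Coarse range check only; days per month are not validated.
    if !(1..=12).contains(&month)
        || !(1..=31).contains(&day)
        || hour > 23
        || minute > 59
        || second > 59
    {
        return None;
    }
    Some(format!(
        "{}-{}-{}T{}:{}:{}",
        &s[0..4], &s[5..7], &s[8..10], &s[11..13], &s[14..16], &s[17..19]
    ))
}
```

An all-zero value such as "0000:00:00 00:00:00", which some cameras write for an unknown date, is rejected by the month check rather than converted.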

206 lines
7.0 KiB
Rust
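
The "GPS DMS triple + N/S/E/W ref → signed decimal degrees" rule from the commit message comes down to one formula: degrees + minutes/60 + seconds/3600, negated for the southern and western hemispheres. The sketch below illustrates it with plain `f64` inputs; the name `dms_to_decimal` and the use of `f64` instead of EXIF rationals are assumptions, not the crate's API.

```rust
/// Hypothetical helper: convert a degrees/minutes/seconds triple plus a
/// hemisphere reference ("N", "S", "E", "W") into signed decimal degrees.
/// Returns `None` for an unknown reference or a non-finite component.
fn dms_to_decimal(deg: f64, min: f64, sec: f64, hemisphere: &str) -> Option<f64> {
    if !(deg.is_finite() && min.is_finite() && sec.is_finite()) {
        return None;
    }
    let magnitude = deg + min / 60.0 + sec / 3600.0;
    match hemisphere.trim() {
        // North and east are positive by convention.
        "N" | "E" => Some(magnitude),
        // South and west are negative.
        "S" | "W" => Some(-magnitude),
        _ => None,
    }
}
```

For example, 37° 33' 59.4" N yields 37.5665, and the same triple with "S" yields -37.5665.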

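
The test fixtures described in the commit message splice an EXIF blob into a JPEG as an APP1 segment right after the SOI marker. The sketch below shows that byte layout using only the standard library. It is an illustration rather than the repository's test helper: `splice_exif_app1` is a hypothetical name, and `tiff_blob` stands for the TIFF-structured EXIF payload produced by whatever writer is used.

```rust
/// Hypothetical helper: insert an EXIF APP1 segment directly after the
/// SOI marker of a JPEG byte stream.
/// Returns `None` if the input does not start with SOI or if the payload
/// is too large for a single segment.
fn splice_exif_app1(jpeg: &[u8], tiff_blob: &[u8]) -> Option<Vec<u8>> {
    const SOI: [u8; 2] = [0xFF, 0xD8];
    const APP1: [u8; 2] = [0xFF, 0xE1];
    const EXIF_HEADER: &[u8] = b"Exif\0\0";

    if !jpeg.starts_with(&SOI) {
        return None;
    }
    // The 16-bit segment length covers the two length bytes, the Exif
    // header, and the payload, but not the marker itself.
    let seg_len = 2 + EXIF_HEADER.len() + tiff_blob.len();
    if seg_len > u16::MAX as usize {
        return None;
    }
    let mut out = Vec::with_capacity(jpeg.len() + 2 + seg_len);
    out.extend_from_slice(&SOI);
    out.extend_from_slice(&APP1);
    out.extend_from_slice(&(seg_len as u16).to_be_bytes());
    out.extend_from_slice(EXIF_HEADER);
    out.extend_from_slice(tiff_blob);
    // Everything after the original SOI follows unchanged.
    out.extend_from_slice(&jpeg[2..]);
    Some(out)
}
```

Building fixtures this way keeps binary files out of the repository, which matches the "no binary fixtures are committed" note in the commit message.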
//! `kebab-parse-image` — image extractor (P6-1).
//!
//! Implements [`kebab_core::Extractor`] for `MediaType::Image(_)`. One asset
//! produces one [`CanonicalDocument`] with a single
//! [`Block::ImageRef`](kebab_core::Block::ImageRef). EXIF is captured into
//! `metadata.user["exif"]`, dimensions into `metadata.user["dimensions"]`.
//! OCR / caption fields stay `None`; later tasks (P6-2 / P6-3) populate
//! them.
//!
//! Per design §3.4 (Block::ImageRef + ImageRefBlock), §3.7a (OcrText /
//! ModelCaption stubs), §9.1 (image extraction policy), §9 (versioning).

mod dims;
mod exif_extract;

use anyhow::{Context, Result};
use kebab_core::{
    Block, CanonicalDocument, CommonBlock, Extractor, ImageRefBlock, Lang, MediaType, Metadata,
    ParserVersion, Provenance, ProvenanceEvent, ProvenanceKind, SourceSpan, SourceType,
    TrustLevel, id_for_block, id_for_doc,
};
use serde_json::{Map, Value};
use time::OffsetDateTime;

/// Parser version label for the image extractor (§9 versioning).
pub const PARSER_VERSION: &str = "image-meta-v1";

/// Maximum decode dimension (per axis) before we refuse to read the image.
/// Matches the §9.1 "cap decode at ~16k" policy in the design doc.
pub const MAX_DECODE_DIM: u32 = 16_384;

/// Image extractor: produces a single-block `CanonicalDocument` whose body
/// is exactly one [`ImageRefBlock`].
pub struct ImageExtractor;

impl ImageExtractor {
    pub fn new() -> Self {
        Self
    }
}

impl Default for ImageExtractor {
    fn default() -> Self {
        Self::new()
    }
}

impl Extractor for ImageExtractor {
    fn supports(&self, m: &MediaType) -> bool {
        matches!(m, MediaType::Image(_))
    }

    fn parser_version(&self) -> ParserVersion {
        ParserVersion(PARSER_VERSION.to_string())
    }

    fn extract(
        &self,
        ctx: &kebab_core::ExtractContext<'_>,
        bytes: &[u8],
    ) -> Result<CanonicalDocument> {
        let asset = ctx.asset;
        if !self.supports(&asset.media_type) {
            anyhow::bail!(
                "kebab-parse-image: unsupported media_type for ImageExtractor: {:?}",
                asset.media_type
            );
        }

        let parser_version = self.parser_version();
        let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, &parser_version);

        // Dimensions / format. `Err` here means the bytes don't even resolve
        // to a known image format — we propagate so the caller can skip the
        // asset (per spec failure modes: "Unsupported format → anyhow::Error").
        let dim_outcome = dims::probe(bytes).context("guessing image format")?;

        // EXIF is best-effort regardless of dimension outcome. A corrupt
        // pixel stream may still carry a readable EXIF block (and vice
        // versa), so the two probes are independent.
        let exif_map = exif_extract::extract_whitelisted(bytes);

        let (span, dims_value, decode_warning) = match &dim_outcome {
            dims::DimOutcome::Ok { width, height, format } => {
                let mut dims = Map::new();
                dims.insert("w".into(), Value::Number((*width).into()));
                dims.insert("h".into(), Value::Number((*height).into()));
                dims.insert("format".into(), Value::String((*format).to_string()));
                (
                    SourceSpan::Region {
                        x: 0,
                        y: 0,
                        w: *width,
                        h: *height,
                    },
                    Value::Object(dims),
                    None,
                )
            }
            dims::DimOutcome::Failed { reason } => (
                SourceSpan::Region {
                    x: 0,
                    y: 0,
                    w: 0,
                    h: 0,
                },
                Value::Null,
                Some(reason.clone()),
            ),
        };

        let block_id = id_for_block(&doc_id, "imageref", &[], 0, &span);
        let workspace_path_str = asset.workspace_path.0.clone();
        let filename = filename_from_workspace_path(&workspace_path_str);
        let title = strip_extension(&filename);

        let block = Block::ImageRef(ImageRefBlock {
            common: CommonBlock {
                block_id,
                heading_path: Vec::new(),
                source_span: span,
            },
            asset_id: Some(asset.asset_id.clone()),
            src: workspace_path_str,
            alt: filename,
            ocr: None,
            caption: None,
        });

        let now = OffsetDateTime::now_utc();
        let mut events: Vec<ProvenanceEvent> = Vec::with_capacity(3);
        events.push(ProvenanceEvent {
            at: asset.discovered_at,
            agent: "kb-source-fs".to_string(),
            kind: ProvenanceKind::Discovered,
            note: None,
        });
        events.push(ProvenanceEvent {
            at: now,
            agent: "kb-parse-image".to_string(),
            kind: ProvenanceKind::Parsed,
            note: Some(format!("parser_version={}", parser_version.0)),
        });
        if let Some(reason) = decode_warning {
            events.push(ProvenanceEvent {
                at: now,
                agent: "kb-parse-image".to_string(),
                kind: ProvenanceKind::Warning,
                note: Some(reason),
            });
        }

        // Metadata. `created_at` / `updated_at` are sourced from the asset's
        // `discovered_at` so the wire form does not embed a fresh timestamp
        // for every extract call (which would break determinism).
        let mut user = Map::new();
        user.insert("exif".into(), Value::Object(exif_map));
        user.insert("dimensions".into(), dims_value);

        let metadata = Metadata {
            aliases: Vec::new(),
            tags: Vec::new(),
            created_at: asset.discovered_at,
            updated_at: asset.discovered_at,
            source_type: SourceType::Reference,
            trust_level: TrustLevel::Primary,
            user_id_alias: None,
            user,
        };

        tracing::debug!(
            target: "kebab-parse-image",
            "extracted image doc_id={} workspace_path={} dim_ok={}",
            doc_id.0,
            asset.workspace_path.0,
            matches!(dim_outcome, dims::DimOutcome::Ok { .. })
        );

        Ok(CanonicalDocument {
            doc_id,
            source_asset_id: asset.asset_id.clone(),
            workspace_path: asset.workspace_path.clone(),
            title,
            lang: Lang("und".to_string()),
            blocks: vec![block],
            metadata,
            provenance: Provenance { events },
            parser_version,
            schema_version: 1,
            doc_version: 1,
        })
    }
}
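
/// Returns the last `/`-separated component of a workspace path.
/// A path without `/` is returned unchanged.
fn filename_from_workspace_path(p: &str) -> String {
    p.rsplit('/').next().unwrap_or(p).to_string()
}

/// Drops the final `.ext` suffix from a filename. Names with no dot, or
/// with the only dot in first position (dotfiles), are returned unchanged.
fn strip_extension(filename: &str) -> String {
    match filename.rfind('.') {
        Some(0) => filename.to_string(),
        Some(idx) => filename[..idx].to_string(),
        None => filename.to_string(),
    }
}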
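
The two private helpers at the end of the file decide what ends up in `title` and `alt`, so their edge cases are worth spelling out. The block below repeats both functions verbatim so that it compiles on its own; the sample paths in the usage note are made up for illustration.

```rust
// Copies of the two helpers from lib.rs, repeated so this block is standalone.
fn filename_from_workspace_path(p: &str) -> String {
    // `rsplit` always yields at least one item, so the fallback is defensive.
    p.rsplit('/').next().unwrap_or(p).to_string()
}

fn strip_extension(filename: &str) -> String {
    match filename.rfind('.') {
        // A leading dot marks a dotfile, not an extension.
        Some(0) => filename.to_string(),
        // Only the last extension is removed.
        Some(idx) => filename[..idx].to_string(),
        None => filename.to_string(),
    }
}
```

With these definitions, "notes/img/photo.png" gives the filename "photo.png" and the title "photo"; "archive.tar.gz" keeps "archive.tar"; ".hidden" and "noext" are returned unchanged. A path that ends in `/` produces an empty filename, which callers may want to guard against.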