kebab/docs/superpowers/specs/2026-05-28-v0.20.x-logging-r2-spec.md
altair823 7c24734cc7 docs(superpowers): v0.20.x logging r2 spec + plan artifacts
Round artifacts following ACCEPT of the spec/plan for logging round 2 (4 enhancements: image_w/h + V008 SQLite mirror + CLI inspect + retention).

- spec: 751 line (ACCEPT, 7/7 critic round 1 finding + 7/7 closure r2 traceability)
- plan: 576 line (ACCEPT, 6/6 step + 13/13 AC + G1/G2/G3 plan-level resolve)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 08:04:32 +00:00


title: v0.20.x ingest log round 2 - 4 enhancement spec
created: 2026-05-28
status: DRAFT round 0
parent_spec: 2026-04-27-kebab-final-form-design.md
target_version: v0.20.x
branch: feat/ingest-log-round2-enhancements

v0.20.x ingest log round 2

§1 Motivation

§1.1 Sweet-spot analysis — progressive dogfood tuning

The round 1 ingest log from v0.20.0 sub-item 1 (PR #189, merged) shipped as file-only logging with one ndjson file per run. Over three months of real-world use (dogfooding), the OCR engine's timeout / performance sweet spot needs to be tuned incrementally.

  • Current state: each ingest run's OCR samples (ms, success/fail) are recorded only in the ndjson file, so historical aggregate queries are not possible (data is per-run only).
  • Requirement: query p90 / p99 / extreme values from a cumulative database so that lowering the timeout default (e.g. 300s → 180s) can be decided from data.
  • Solution: a SQLite pdf_ocr_events mirror table, the main enhancement of v0.20.x round 2.

§1.2 Image dimension defect: null emit problem

The 6 emit points in crates/kebab-app/src/pdf_ocr_apply.rs (lines 155, 188, 265, etc.) hardcode image_width: None, image_height: None.

  • Current: the raster JPEG is in memory, but its dimensions are never measured.
  • Impact: the fields are optional in the wire schema, but real use cases (e.g. "adjust the timeout only for 100MB+ images") need this data.
  • fix: raster decode via image crate (transitive dep via kebab-parse-image).

§1.3 CLI inspect subcommands: operational visibility

The ndjson log is human-readable, but corpus-wide statistics for scripts and automation are missing.

  • kebab inspect ocr-stats --json: overall OCR success rate, p50/p90/p99 latency, per-engine breakdown.
  • kebab inspect ocr-failures --doc-id <id> --json: failure history of a specific doc.
  • kebab inspect ocr-failures --json: lists recent failures (corpus-wide).

§1.4 Log retention: preventing unbounded growth

As ingest runs repeat, ~/.local/state/kebab/logs/ingest-*.ndjson files accumulate.

  • Current: manual cleanup is required.
  • Requirement: automatic cleanup via keep_recent_runs (e.g. 100) + retention_days (e.g. 30).
  • Application: the same retention policy also applies to the SQLite pdf_ocr_events table.

§2 Scope + non-scope

§2.1 Included

Enhancement 1: image_width + image_height capture (trivial)

  • raster JPEG dimension decode at the 6 emit points in pdf_ocr_apply.rs.
  • image crate import (already transitive).
  • fill the fields with Some(u32).

Enhancement 2: SQLite mirror — pdf_ocr_events table (medium)

  • V008 migration: new pdf_ocr_events table (run_id, ts, doc_id, doc_path, page, image_byte_size, image_width, image_height, ms, chars, success, reason, ocr_engine).
  • index: run_id, doc_id, ts.
  • insert path: IngestLogWriter writes the ndjson file and SQLite together (dual-write).
  • SqliteStore::record_pdf_ocr_event(…) API.

Enhancement 3: CLI inspect commands (medium)

  • kebab inspect ocr-stats --json — corpus-wide aggregate (total_events, success_count, failure_count, p50/p90/p99_ms, by_engine, top 10 docs by failure).
  • kebab inspect ocr-failures --doc-id <id> --json — single doc failure list.
  • kebab inspect ocr-failures --json (no doc-id) — corpus-wide recent failures (--limit configurable).
  • wire schemas: ocr_stats.v1 + ocr_failures.v1 (additive minor to schema.v1).

Enhancement 4: log retention + rotation (low)

  • [logging] keep_recent_runs: u32 (default 100) + retention_days: u32 (default 30).
  • file cleanup: IngestLogWriter::open calls the prune helper.
  • SQLite cleanup: SqliteStore::prune_pdf_ocr_events(retention_days).
  • backward compat: old config (no [logging] fields) parses with default.

§2.2 Out of scope

  • wire schema public API (pdf_ocr_events is an internal SQLite table and is not exposed on the wire).
  • Korean phrasing-sensitive refusal in the ask command (outside this round's scope).
  • migration rollback automation (follow the standard CLAUDE.md protocol).
  • concurrent ingest lock manager (single-process ingest is currently assumed; future spec).

§3 Design decisions

§3.1 image_width / image_height — raster decode path

Choice: decode the raster JPEG with ImageReader and extract (width, height).

  • Reason: the bytes are already in memory when OCR is invoked (extract_dctdecode_page_image), and decode latency << OCR latency (negligible, <1ms).
  • Rejected alternative: using the PDF MediaBox. The page size can differ from the actual raster (less accurate).
  • Implementation: the image crate's ImageReader::new(Cursor::new(&bytes)).with_guessed_format()?.into_dimensions()?.
  • error handling: decode fail → (None, None) fallback. The OCR result remains valid.
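
To illustrate why reading dimensions is cheap compared with OCR, the sketch below pulls width and height out of a JPEG's Start-Of-Frame segment using only the standard library. It is not the spec's implementation (the spec uses the image crate's ImageReader). The name jpeg_dimensions is hypothetical, and the parser assumes the plain marker-segment layout of a JPEG stream.

```rust
/// Scan JPEG marker segments for a Start-Of-Frame header and return (width, height).
/// Returns None when the bytes are not a JPEG, are truncated, or carry no SOF segment.
fn jpeg_dimensions(bytes: &[u8]) -> Option<(u32, u32)> {
    // Every JPEG stream starts with the SOI marker 0xFFD8.
    if bytes.len() < 4 || bytes[0] != 0xFF || bytes[1] != 0xD8 {
        return None;
    }
    let mut pos = 2;
    while pos + 4 <= bytes.len() {
        if bytes[pos] != 0xFF {
            return None; // lost marker sync: treat as corrupt
        }
        let marker = bytes[pos + 1];
        // 0xFF fill bytes may precede a marker.
        if marker == 0xFF {
            pos += 1;
            continue;
        }
        // TEM and RSTn markers stand alone and carry no length field.
        if marker == 0x01 || (0xD0..=0xD7).contains(&marker) {
            pos += 2;
            continue;
        }
        // EOI or start of scan data before any SOF: no frame header to read.
        if marker == 0xD9 || marker == 0xDA {
            return None;
        }
        let len = u16::from_be_bytes([bytes[pos + 2], bytes[pos + 3]]) as usize;
        if len < 2 {
            return None;
        }
        // SOF0..SOF15, excluding DHT (0xC4), JPG (0xC8) and DAC (0xCC).
        let is_sof = (0xC0..=0xCF).contains(&marker) && !matches!(marker, 0xC4 | 0xC8 | 0xCC);
        if is_sof {
            // Payload layout after the length field: precision(1) height(2) width(2).
            let seg = bytes.get(pos + 4..pos + 9)?;
            let height = u16::from_be_bytes([seg[1], seg[2]]) as u32;
            let width = u16::from_be_bytes([seg[3], seg[4]]) as u32;
            return Some((width, height));
        }
        // Skip the marker (2 bytes) plus the segment (length includes its own 2 bytes).
        pos += 2 + len;
    }
    None
}
```

Only the header segments are walked and no pixel data is touched, which is consistent with the "decode latency << OCR latency" argument above. A corrupt or truncated input yields None, matching the (None, None) fallback.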

§3.2 SQLite mirror — V008 migration + dual-write

Choice: a SQLite mirror that complements the file-only ndjson of v0.20.0 round 1 (non-breaking).

  • Reason: historical queries require structured storage. Files alone cannot support corpus-wide aggregates.
  • doc_id wiring: the doc_id field of LogEvent::Ocr must be captured ahead of time in closure scope. Bind canonical.doc_id to a local variable before calling apply_ocr_to_pdf_pages, then use that same doc_id inside the closure for both the ndjson file write and the SQLite insert. This keeps the dual-write consistent.
  • dual-write structure:
    1. IngestLogWriter::write_event(&LogEvent::Ocr) performs the ndjson file write + SQLite insert.
    2. The insert is invoked directly by the emit_progress closure through an Arc<SqliteStore> clone.
    3. transaction safety: file write first (failures → log), then SQLite (non-critical).
  • non-breaking: logging works normally (file only) even when no old config is present. The SQLite table is created automatically on upgrade.
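
The ordering in the dual-write structure above can be sketched with the standard library alone. Everything here is an illustrative assumption: OcrEventMirror stands in for the SQLite store and is not the real SqliteStore API, and dual_write and AlwaysFails are hypothetical names.

```rust
use std::io::Write;

/// Hypothetical stand-in for the SQLite mirror (not the real SqliteStore API).
trait OcrEventMirror {
    fn record(&self, json_line: &str) -> Result<(), String>;
}

/// Writes one ndjson line, then mirrors it.
/// Returns Ok(true) only when the mirror insert also succeeded.
fn dual_write<W: Write, M: OcrEventMirror>(
    file: &mut W,
    mirror: Option<&M>,
    json_line: &str,
) -> std::io::Result<bool> {
    // 1. File first: a failure here propagates to the caller.
    writeln!(file, "{json_line}")?;
    // 2. Mirror second: a failure is reported but never fails the ingest.
    match mirror {
        None => Ok(false),
        Some(m) => match m.record(json_line) {
            Ok(()) => Ok(true),
            Err(e) => {
                eprintln!("sqlite ocr event insert failed: {e}");
                Ok(false)
            }
        },
    }
}

/// Test double that simulates a failing SQLite insert.
struct AlwaysFails;

impl OcrEventMirror for AlwaysFails {
    fn record(&self, _json_line: &str) -> Result<(), String> {
        Err("database is locked".to_string())
    }
}
```

With AlwaysFails as the mirror, the line still lands in the file and the call returns Ok(false). That is the ndjson/DB divergence the spec accepts as non-critical.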

§3.3 CLI inspect commands — ocr-stats + ocr-failures

wire schema: add ocr_stats.v1 + ocr_failures.v1 to the existing schema.v1 wire.schemas list as an additive minor change.

  • Reason: the new wire shapes are not public API (only the inspect command emits them). They are additive as an extension of wire.v1.
  • Implementation: in kebab-cli/src/main.rs, add OcrStats / OcrFailures arms under Subcommand::Inspect → InspectCommand.

§3.4 retention — keep_recent_runs + retention_days

Choice: keep a file only when both conditions are satisfied (OR-on-stale = AND-on-fresh semantics).

  • Reason:
    • keep_recent_runs=100: deterministic "keep the most recent N runs".
    • retention_days=30: time-based cleanup (obsolete logs are deleted automatically after dogfooding stops).
    • Delete if (idx >= keep_recent) OR (modified <= cutoff): delete when either one is stale. Equivalently, keep iff (idx < keep_recent) AND (modified > cutoff): only when both are fresh.
  • Implementation: a cleanup helper invoked at IngestLogWriter::open(), plus SqliteStore::prune_pdf_ocr_events(retention_days) as a separate routine.
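
The OR-on-stale rule can be isolated as a pure predicate, which makes it testable without touching the filesystem. This is a minimal sketch. The names should_delete and plan_deletions are hypothetical and do not appear in the spec.

```rust
use std::time::{Duration, SystemTime};

/// A file is deleted when it is stale by rank OR stale by age.
/// `idx` is its 0-based rank with the newest file at 0.
fn should_delete(idx: usize, keep_recent: usize, modified: SystemTime, cutoff: SystemTime) -> bool {
    idx >= keep_recent || modified <= cutoff
}

/// Given modification times sorted newest first, return the ranks to delete.
fn plan_deletions(
    mtimes_newest_first: &[SystemTime],
    keep_recent: usize,
    cutoff: SystemTime,
) -> Vec<usize> {
    mtimes_newest_first
        .iter()
        .enumerate()
        .filter(|(idx, modified)| should_delete(*idx, keep_recent, **modified, cutoff))
        .map(|(idx, _)| idx)
        .collect()
}
```

For example, with keep_recent = 3 and a 30-day cutoff, files aged 1, 2, 40 and 50 days give deletions [2, 3]. The 40-day file is inside the newest three but too old, and the 50-day file fails both checks.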

§4 Implementation specification

§4.1 image_width / image_height decode helper

File: crates/kebab-app/src/pdf_ocr_apply.rs

Changes:

  1. crates/kebab-app/Cargo.toml dependency update:
    -image = { version = "0.25", default-features = false, features = ["png"] }
    +image = { version = "0.25", default-features = false, features = ["png", "jpeg"] }
    
  2. import use image::io::Reader as ImageReader; (transitive via kebab-parse-image).
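  3. new helper function:
    /// Returns None when the format cannot be guessed or the dimensions
    /// cannot be read (e.g. corrupt JPEG).
    fn extract_image_dimensions(jpeg_bytes: &[u8]) -> Option<(u32, u32)> {
        // ImageReader::new is infallible, so only the format guess and the
        // dimension read need .ok().
        ImageReader::new(std::io::Cursor::new(jpeg_bytes))
            .with_guessed_format()
            .ok()?
            .into_dimensions()
            .ok()
    }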
    
  4. 6 emit points (lines 155, 188, 265 and the 3 other PdfOcrProgress::Finished sites):
    let (w, h) = extract_image_dimensions(&page_image_bytes).map(|(w, h)| (Some(w), Some(h)))
        .unwrap_or((None, None));
    emit_progress(PdfOcrProgress::Finished {
        image_width: w,
        image_height: h,
        ...
    });
    

test: existing pdf_ocr_roundtrip + new pdf_ocr_image_dimensions integration test.

§4.2 V008 migration SQL + SqliteStore API

File: migrations/V008__pdf_ocr_events.sql (new)

CREATE TABLE pdf_ocr_events (
  id          INTEGER PRIMARY KEY,
  run_id      TEXT NOT NULL,
  ts          TEXT NOT NULL,             -- ISO 8601 UTC (RFC 3339)
  doc_id      TEXT,                       -- nullable (file detect skip)
  doc_path    TEXT NOT NULL,
  page        INTEGER NOT NULL,
  image_byte_size  INTEGER,
  image_width      INTEGER,
  image_height     INTEGER,
  ms          INTEGER NOT NULL,
  chars       INTEGER NOT NULL,
  success     INTEGER NOT NULL,           -- 0 = fail, 1 = success
  reason      TEXT,                       -- "timeout" / "ocr_error" / NULL
  ocr_engine  TEXT NOT NULL
);
CREATE INDEX idx_pdf_ocr_events_doc_id ON pdf_ocr_events(doc_id);
CREATE INDEX idx_pdf_ocr_events_run_id ON pdf_ocr_events(run_id);
CREATE INDEX idx_pdf_ocr_events_ts ON pdf_ocr_events(ts);

File: crates/kebab-store-sqlite/src/lib.rs (SqliteStore extension)

New methods (Mutex lock made explicit):

impl SqliteStore {
    pub fn record_pdf_ocr_event(
        &self,
        run_id: &str,
        ts: &str,
        doc_id: Option<&str>,
        doc_path: &str,
        page: u32,
        image_byte_size: Option<u64>,
        image_width: Option<u32>,
        image_height: Option<u32>,
        ms: u64,
        chars: u32,
        success: bool,
        reason: Option<&str>,
        ocr_engine: &str,
    ) -> anyhow::Result<()> {
        let conn = self.conn.lock().expect("sqlite lock poisoned");
        conn.execute(
            "INSERT INTO pdf_ocr_events
             (run_id, ts, doc_id, doc_path, page, image_byte_size, image_width, image_height, ms, chars, success, reason, ocr_engine)
             VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
            rusqlite::params![
                run_id, ts, doc_id, doc_path, page,
                image_byte_size, image_width, image_height, ms, chars,
                if success { 1 } else { 0 }, reason, ocr_engine
            ]
        )?;
        Ok(())
    }

    pub fn prune_pdf_ocr_events(&self, retention_days: u32) -> anyhow::Result<u64> {
        let conn = self.conn.lock().expect("sqlite lock poisoned");
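        // On overflow or a formatting error the cutoff falls back to "",
        // which matches no row (no ts sorts below the empty string).
        let cutoff_ts = time::OffsetDateTime::now_utc()
            .checked_sub(time::Duration::days(i64::from(retention_days)))
            .and_then(|dt| dt.format(&time::format_description::well_known::Rfc3339).ok())
            .unwrap_or_default();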
        let n = conn.execute(
            "DELETE FROM pdf_ocr_events WHERE ts < ?",
            rusqlite::params![cutoff_ts],
        )?;
        Ok(n as u64)
    }
}
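
The DELETE in prune_pdf_ocr_events compares ts as TEXT, so it depends on RFC 3339 strings sorting in chronological order. The sketch below uses a hypothetical is_before_cutoff helper that mirrors a byte-wise string comparison. It shows where that ordering holds and one case where it does not, assuming every ts is written in UTC with a trailing Z.

```rust
/// True when `ts` sorts strictly before `cutoff` under byte-wise string
/// comparison, which is how TEXT values compare under SQLite's default
/// BINARY collation.
fn is_before_cutoff(ts: &str, cutoff: &str) -> bool {
    ts < cutoff
}

/// Whole-second UTC timestamps of equal length: string order matches time order.
fn whole_second_case() -> bool {
    is_before_cutoff("2026-02-27T08:04:32Z", "2026-04-28T08:04:32Z")
}

/// Mixed precision: half a second AFTER the cutoff, yet it sorts before it,
/// because '.' (0x2E) is smaller than 'Z' (0x5A).
fn mixed_precision_case() -> bool {
    is_before_cutoff("2026-04-28T08:04:32.5Z", "2026-04-28T08:04:32Z")
}

/// An empty cutoff string (the unwrap_or_default fallback) matches nothing.
fn empty_cutoff_case() -> bool {
    is_before_cutoff("2026-04-28T08:04:32Z", "")
}
```

The mixed-precision case is off by less than a second, which is unlikely to matter for a day-granular retention window. Timestamps carrying a non-UTC offset would break the ordering more seriously, so the writer should keep emitting UTC.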

§4.3 wire schema — ocr_stats.v1 + ocr_failures.v1

File: docs/wire-schema/v1/ocr_stats.schema.json (new)

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "ocr_stats.v1",
  "type": "object",
  "properties": {
    "schema_version": { "const": "ocr_stats.v1" },
    "total_events": { "type": "integer" },
    "total_runs": { "type": "integer" },
    "success_count": { "type": "integer" },
    "failure_count": { "type": "integer" },
    "success_rate": { "type": "number" },
    "p50_ms": { "type": "integer" },
    "p90_ms": { "type": "integer" },
    "p99_ms": { "type": "integer" },
    "max_ms": { "type": "integer" },
    "by_engine": { "type": "object", "additionalProperties": { "type": "integer" } },
    "by_doc": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "doc_id": { "type": "string" },
          "failure_count": { "type": "integer" },
          "success_count": { "type": "integer" },
          "p90_ms": { "type": ["integer", "null"] }
        }
      }
    }
  },
  "required": ["schema_version", "total_events", "total_runs", "success_count", "failure_count", "success_rate"]
}
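
The schema requires success_rate but does not say how it is derived or what it should be when no events exist. A minimal sketch, assuming total_events = success_count + failure_count and assuming 0.0 for an empty table (both are assumptions, not stated in this spec):

```rust
/// Fraction of successful OCR events in [0.0, 1.0].
/// Assumption: an empty table reports 0.0 rather than NaN, so the emitted
/// JSON always carries a valid number.
fn success_rate(success_count: u64, failure_count: u64) -> f64 {
    let total = success_count + failure_count;
    if total == 0 {
        0.0
    } else {
        success_count as f64 / total as f64
    }
}
```

Guarding the zero case matters because NaN has no JSON representation, so an unguarded 0/0 could not be emitted as the "number" the schema expects.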

File: docs/wire-schema/v1/ocr_failures.schema.json (new)

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "ocr_failures.v1",
  "type": "object",
  "properties": {
    "schema_version": { "const": "ocr_failures.v1" },
    "doc_id": { "type": ["string", "null"] },
    "failure_count": { "type": "integer" },
    "failures": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "ts": { "type": "string" },
          "page": { "type": "integer" },
          "ms": { "type": "integer" },
          "reason": { "type": "string" },
          "image_byte_size": { "type": ["integer", "null"] }
        }
      }
    }
  },
  "required": ["schema_version", "failure_count", "failures"]
}

File: docs/wire-schema/v1/schema.schema.json (update)

Add ocr_stats.v1 + ocr_failures.v1 to the schema.v1.wire.schemas list (additive).

§4.4 CLI inspect ocr-stats + ocr-failures

File: crates/kebab-cli/src/main.rs

New subcommands:

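// clap::Subcommand is spelled out in the derive so that it cannot clash with
// the local enum that is also named Subcommand.
#[derive(clap::Subcommand)]
pub enum Subcommand {
    // Nested subcommand enum: needs the subcommand attribute.
    #[command(subcommand)]
    Inspect(InspectCommand),
    // ...
}

#[derive(clap::Subcommand)]
pub enum InspectCommand {
    OcrStats {
        // A bool field is a flag; it defaults to false when absent.
        #[arg(long)]
        json: bool,
    },
    OcrFailures {
        #[arg(long)]
        doc_id: Option<String>,
        #[arg(long, default_value_t = 10)]
        limit: usize,
        #[arg(long)]
        json: bool,
    },
}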

File: crates/kebab-app/src/lib.rs (extension)

impl App {
    pub fn inspect_ocr_stats(&self) -> anyhow::Result<OcrStatsV1> {
        // SELECT queries: aggregate over pdf_ocr_events.
        // 1. Compute total_events, success_count, failure_count, success_rate.
        // 2. percentile via in-memory sort: SELECT ms FROM pdf_ocr_events WHERE success=1 ORDER BY ms.
        //    Fetch into a Vec, then compute the index (p50 = idx 50%, p90 = idx 90%, p99 = idx 99%).
        // 3. by_engine group-by (success count per engine).
        // 4. by_doc top 10 (failure_count DESC).
    }

    pub fn inspect_ocr_failures(
        &self,
        doc_id: Option<&str>,
        limit: usize,
    ) -> anyhow::Result<OcrFailuresV1> {
        // SELECT failure records WHERE success=0.
        // With doc_id: WHERE doc_id=?; without: ORDER BY ts DESC LIMIT limit.
    }
}
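
The "idx 50% / 90% / 99%" step in inspect_ocr_stats leaves the rounding rule open. One common choice is the nearest-rank method, sketched below. Choosing nearest-rank is an assumption, and percentile_nearest_rank is a hypothetical helper name.

```rust
/// Nearest-rank percentile over an ascending-sorted slice.
/// `p` is a whole percentage in 1..=100. Returns None for an empty slice or
/// an out-of-range `p`.
fn percentile_nearest_rank(sorted_ms: &[u64], p: u32) -> Option<u64> {
    if sorted_ms.is_empty() || p == 0 || p > 100 {
        return None;
    }
    // 1-based rank = ceil(p / 100 * n); always within 1..=n for p in 1..=100.
    let rank = (sorted_ms.len() * p as usize).div_ceil(100);
    Some(sorted_ms[rank - 1])
}
```

With ten samples of 100, 200, ..., 1000 ms this gives p50 = 500, p90 = 900 and p99 = 1000. Nearest-rank always returns an observed sample, so the result fits the integer p50_ms / p90_ms / p99_ms fields of ocr_stats.v1 without interpolation.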

§4.5 retention cleanup helper — file + SQLite

File: crates/kebab-app/src/ingest_log.rs (extension)

impl IngestLogWriter {
    pub fn open(cfg: &kebab_config::LoggingCfg) -> anyhow::Result<Option<Self>> {
        if !cfg.ingest_log_enabled {
            return Ok(None);
        }
        let run_id = generate_run_id();
        let log_dir = expand_log_dir(&cfg.ingest_log_dir);
        std::fs::create_dir_all(&log_dir)?;

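        // Cleanup file logs before creating the new log. The new file is therefore
        // not counted against keep_recent_runs in this pass, so up to
        // keep_recent_runs + 1 files can exist once it is created.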
        if let Err(e) = Self::cleanup_old_logs(&log_dir, cfg.keep_recent_runs, cfg.retention_days) {
            tracing::warn!(target: "kebab-app", "ingest log cleanup failed: {e}");
            // non-critical — continue without failing ingest.
        }

        let path = log_dir.join(format!("ingest-{run_id}.ndjson"));
        let file = BufWriter::new(File::create(&path)?);
        Ok(Some(Self { file, path, run_id, started_at: SystemTime::now() }))
    }

    fn cleanup_old_logs(log_dir: &Path, keep_recent: u32, retention_days: u32) -> anyhow::Result<()> {
        let mut entries: Vec<_> = std::fs::read_dir(log_dir)?
            .filter_map(|e| e.ok())
            .filter(|e| e.path().file_name()
                .and_then(|n| n.to_str())
                .map(|s| s.starts_with("ingest-") && s.ends_with(".ndjson"))
                .unwrap_or(false))
            .collect();

        // Sort by modified time descending (newest first).
        entries.sort_by_key(|e| std::cmp::Reverse(e.metadata().ok().and_then(|m| m.modified().ok())));

        let cutoff_time = SystemTime::now() - std::time::Duration::from_secs(retention_days as u64 * 86400);

        for (idx, entry) in entries.into_iter().enumerate() {
            let path = entry.path();
            let metadata = entry.metadata()?;
            let modified = metadata.modified()?;

            // Delete if (idx >= keep_recent) OR (modified <= cutoff).
            // Equivalent: keep iff (idx < keep_recent) AND (modified > cutoff) — both fresh.
            // Per §3.4 OR-on-stale semantics.
            if idx < keep_recent as usize && modified > cutoff_time {
                continue;
            }

            std::fs::remove_file(&path)
                .map_err(|e| anyhow::anyhow!("failed to remove {}: {}", path.display(), e))?;
        }

        Ok(())
    }
}

§4.6 Config extension — LoggingCfg

File: crates/kebab-config/src/lib.rs (LoggingCfg extension)

#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct LoggingCfg {
    pub ingest_log_enabled: bool,
    pub ingest_log_dir: PathBuf,
    #[serde(default = "default_keep_recent_runs")]
    pub keep_recent_runs: u32,
    #[serde(default = "default_retention_days")]
    pub retention_days: u32,
}

fn default_keep_recent_runs() -> u32 { 100 }
fn default_retention_days() -> u32 { 30 }

impl Default for LoggingCfg {
    fn default() -> Self {
        Self {
            ingest_log_enabled: true,
            ingest_log_dir: PathBuf::from("{state_dir}/logs"),
            keep_recent_runs: 100,
            retention_days: 30,
        }
    }
}

File: docs/SMOKE.md (config example update)

[logging]
ingest_log_enabled = true
ingest_log_dir = "{state_dir}/logs"
keep_recent_runs = 100
retention_days = 30

§4.7 IngestLogWriter dual-write integration — canonical.doc_id closure capture (F1)

File: crates/kebab-app/src/ingest_log.rs (write_event extension)

impl IngestLogWriter {
    pub fn write_event_with_db(
        &mut self,
        event: &LogEvent<'_>,
        store: Option<&SqliteStore>,
    ) -> anyhow::Result<()> {
        // Write to file.
        serde_json::to_writer(&mut self.file, event)?;
        writeln!(self.file)?;

        // Write to SQLite if store provided and event is Ocr.
        if let (Some(store), LogEvent::Ocr {
            ts, doc_id, doc_path, page, image_byte_size, image_width, image_height,
            ms, chars, success, reason, ocr_engine,
        }) = (store, event) {
            // Non-critical: log a warning but don't fail ingest.
            if let Err(e) = store.record_pdf_ocr_event(
                self.run_id(),
                ts,
                doc_id.as_deref(), // doc_id must be captured in closure scope (see below)
                doc_path,
                *page,
                *image_byte_size,
                *image_width,
                *image_height,
                *ms,
                *chars,
                *success,
                *reason,
                ocr_engine,
            ) {
                tracing::warn!(target: "kebab-app", "sqlite ocr event insert failed: {e}");
            }
        }

        Ok(())
    }
}

Caller example (the emit_progress closure in kebab-app/src/ingest_one_pdf_asset or apply_ocr_to_pdf_pages):

// Pre-capture canonical.doc_id before apply_ocr_to_pdf_pages closure:
let doc_id_for_log: String = canonical.doc_id.0.clone();
let doc_path_for_log = asset.workspace_path.0.clone();

let store = Arc::new(SqliteStore::open(&cfg.storage.sqlite)?);
let mut log_writer = IngestLogWriter::open(&cfg.logging)?;

let emit_progress = move |progress: PdfOcrProgress| {
    if let Some(writer) = &mut log_writer {
        let event = LogEvent::Ocr {
            ts: now_ts(),
            doc_id: Some(doc_id_for_log.clone()),  // ← captured in closure scope
            doc_path: doc_path_for_log.clone(),
            page: page_n,
            // ... other fields ...
        };
        let _ = writer.write_event_with_db(&event, Some(&store));
    }
};

§5 Acceptance criteria

AC-1: image_width + image_height non-null after PDF OCR.

  • Integration test: run ingest_one_pdf_asset on a scanned PDF → check IngestReport pdf_ocr_summary.pages_ocrd > 0, and verify image_width: Some(_), image_height: Some(_) in the log file.

AC-2: V008 migration successful + pdf_ocr_events table exists.

  • test: create a fresh DB → apply migrations → verify SELECT name FROM sqlite_master WHERE type='table' AND name='pdf_ocr_events';.

AC-3: on ingest, SQLite rows match the OCR records in the ndjson file 1:1.

  • Integration test: after ingest, SELECT COUNT(*) FROM pdf_ocr_events WHERE success=1 equals the number of success=true OCR lines in the ndjson.

AC-4: kebab inspect ocr-stats --json emits correctly + ocr_stats.v1 schema_version.

  • CLI test: kebab inspect ocr-stats --json | jq '.schema_version' = "ocr_stats.v1", total_events, success_rate present.

AC-5: kebab inspect ocr-failures --doc-id <id> --json emits correctly + ocr_failures.v1.

  • CLI test: query a doc_id that has failures → failures[] array non-empty, reason field present.

AC-6: kebab inspect ocr-failures --json (no doc-id) corpus-wide.

  • CLI test: --limit 5 returns the 5 most recent failures, failure_count >= 5.

AC-7: log retention: with keep_recent_runs=2, the oldest file is deleted after the 3rd ingest.

  • Integration test: temp log dir, 3 ingest runs with keep_recent_runs=2 → only the 2 newest files remain.

AC-8: SQLite retention: with retention_days=0, old rows are deleted.

  • test: insert old row (ts = 90 days ago) → prune_pdf_ocr_events(0) → row deleted.

AC-9: backward compat — old config (no [logging] retention_* field) parses with default.

  • test: pre-v0.20.x config (no [logging] section) → load → logging.keep_recent_runs == 100 (default).

AC-10: workspace test + clippy green.

  • cargo test --workspace -j 1, cargo clippy --all-targets.

AC-11: integration test (ocr_inspect_smoke + pdf_ocr_events_insert_smoke).

  • new test binary: scanned PDF ingest → verify kebab inspect ocr-stats / ocr-failures.
  • crates/kebab-store-sqlite/tests/pdf_ocr_events_insert_smoke.rs.

AC-12: [logging] retention_* default emit in kebab init config.

  • test: kebab init --config /tmp/test-cfg.toml → [logging] keep_recent_runs = 100 + retention_days = 30 present.

AC-13: wire schema additive list sync in integrations/claude-code/kebab/SKILL.md.

  • test: grep -c 'ocr_stats\.v1' integrations/claude-code/kebab/SKILL.md returns ≥1, same for ocr_failures.v1.

§6 Risks + open questions

R-1 dual-write transaction safety (file vs SQLite race)

Issue: if the SQLite insert fails after the emit_progress closure has written the file, the ndjson and the DB diverge. Mitigation:

  • file write first (durable, may fail).
  • SQLite write second (non-critical, warn on fail, don't propagate error).
  • per-run reconciliation tool (future enhancement, not in scope).

R-2 V008 migration rollback (F6)

Issue: when a user downgrades from v0.20.x to an older version, is V008 rolled back? Mitigation: follow the CLAUDE.md migration policy. V008 is an additive table, so things keep working as long as the old version ignores it. Manual rollback (for dogfooding that alternates between v0.19.x and v0.20.x):

DELETE FROM refinery_schema_history WHERE version=8;
DROP TABLE IF EXISTS pdf_ocr_events;

Out-of-scope: no automatic rollback path is provided for alternating v0.19.x ↔ v0.20.x runs.

R-3 prune helper vs. concurrent ingest (stale lock)

Issue: what if another process is writing while cleanup_old_logs is deleting files? Mitigation:

  • cleanup runs only at IngestLogWriter::open (before ingest starts).
  • one ingest per process is assumed (current design).
  • concurrent ingest support is left to a future phase.

R-4 image decode failure handling (corrupt JPEG fallback)

Issue: a corrupt JPEG makes extract_image_dimensions fail. Mitigation: the helper returns Option<(u32, u32)> → (None, None) fallback. The OCR result itself remains valid. Pushing a warning event is optional (future enhancement).

R-5 wire schema additive minor: old consumers do not recognize the new schemas

Issue: what if an old kebab binary (v0.19.x) consumes the output of v0.20.x kebab inspect ocr-stats? Mitigation:

  • schema_version = ocr_stats.v1 is explicit (an old consumer does not recognize the schema → skipping is OK).
  • additive entries in the wire.v1 list → backward compatible (an old consumer simply ignores the new list entries).
  • only new consumers recognize ocr_stats.v1 / ocr_failures.v1.

R-6 Log file loss caused by concurrent cleanup

Issue: what if a user tries to tail a file while the keep_recent_runs / retention_days policy is deleting files? Mitigation: cleanup runs only at ingest start. A file the user is tailing is usually recent (within the top N) → safe.

R-7 doc_id NULL wiring in LogEvent::Ocr closure (F1)

Issue: if doc_id is None or mismatched inside the emit_progress closure, the doc_id in the ndjson file and in the SQLite record diverge. Verification (spec §6 R-7 add):

# Check where canonical.doc_id is set
grep -n "canonical.doc_id\|\.doc_id\s*=" crates/kebab-app/src/lib.rs | head -10

Mitigation:

  • pre-capture doc_id in closure scope (let doc_id_for_log: String = canonical.doc_id.0.clone()).
  • use the captured value when constructing LogEvent::Ocr.
  • in the per-run integration test, verify that the doc_id in the ndjson file matches the doc_id from the SQLite SELECT.

§7 References

  • Parent spec: docs/superpowers/specs/2026-04-27-kebab-final-form-design.md (design contract).
  • Round 1 spec: v0.20.0 sub-item 1 ingest log (PR #189).
  • Code ranges:
    • crates/kebab-app/src/pdf_ocr_apply.rs lines 155, 188, 265 (6 emit point).
    • crates/kebab-app/src/ingest_log.rs (LogEvent, IngestLogWriter).
    • crates/kebab-config/src/lib.rs (Config, LoggingCfg).
    • crates/kebab-store-sqlite/src/lib.rs (SqliteStore).
    • crates/kebab-cli/src/main.rs (Subcommand).
  • Dependencies:
    • image crate (transitive via kebab-parse-image).
    • time crate (RFC 3339 timestamp, already in workspace).
    • rusqlite (already in kebab-store-sqlite).
  • Config sections:
    • [logging]: ingest_log_enabled, ingest_log_dir, keep_recent_runs, retention_days.
    • [pdf.ocr]: (unchanged from v0.20.0).
  • Wire schemas:
    • docs/wire-schema/v1/ocr_stats.schema.json (new).
    • docs/wire-schema/v1/ocr_failures.schema.json (new).
    • docs/wire-schema/v1/schema.schema.json (additive: adds the two schemas to the wire.schemas list).
  • HOTFIXES contract: new deviations get a dated entry in tasks/HOTFIXES.md + cross-link to this spec.
  • Version cascade: adding image_width/height and the SQLite table schema does not trigger an index_version cascade (chunks/embeddings are unaffected).
  • Backward compat: old config parses with [logging] defaults, wire schema additive minor.

§8 Dependencies + imports

Allowed dependencies

  • image crate (for ImageReader::new, into_dimensions).
  • time crate (RFC 3339 formatting, already in workspace).
  • rusqlite (for SQL execute / query, already in kebab-store-sqlite).
  • serde_json (for wire schema export, already in kebab-app).

Forbidden dependencies

  • None new introduced. All uses are transitive or workspace-existing.

§9 Testing strategy

Unit tests

  • extract_image_dimensions helper: valid JPEG → Some((w, h)), corrupt JPEG → None.
  • cleanup_old_logs: keep_recent_runs / retention_days logic, file deletion.
  • LoggingCfg defaults: serde round-trip, backward compat.

Integration tests

New test files:

  • crates/kebab-app/tests/ocr_inspect_smoke.rs: scanned PDF ingest → inspect ocr-stats / ocr-failures validation.
  • crates/kebab-store-sqlite/tests/pdf_ocr_events_insert_smoke.rs: V008 migration, dual-write, prune logic.

Existing test updates:

  • pdf_ocr_roundtrip → verify image_width/height non-null.
  • ingest_report_snapshot → verify ocr_stats output shape.

Smoke test (docs/SMOKE.md)

  • kebab ingest with scanned PDF → ~/.local/state/kebab/logs/ingest-*.ndjson + pdf_ocr_events table check.
  • kebab inspect ocr-stats --json | jq '.schema_version' = "ocr_stats.v1".

Regression tests

  • Existing 1370-test workspace suite: 0 regressions expected (cleanup is non-critical, file-only logging is unchanged).

§10 Rollout + dogfood

v0.20.x milestone

  1. spec approval (this round 0).
  2. implementation (A6 round 1, estimate 3-4 days for 4 enhancement).
  3. review + merge (pull request via gitea-ops).
  4. dogfood (user runs v0.20.x binary, accumulates OCR stats over 2-4 weeks).
  5. data-driven tuning (inspect ocr-stats → timeout default adjust, release note v0.20.y).

Backward compat notes

  • Old binary (v0.19.x) + new config (v0.20.x with [logging] retention_*): config parses, logging ignores new fields.
  • New binary (v0.20.x) + old config (v0.19.x without [logging]): defaults apply, logging works.
  • wire schema: additive, consumers ignore unknown fields.

Release notes

  • wire schema additive minor (adds ocr_stats.v1, ocr_failures.v1) → not a release trigger (CLAUDE.md §Versioning cascade).
  • Once data has accumulated during user dogfooding, tuning in earnest (e.g. timeout adjustment) may land as a separate chore: bump 0.20.x → 0.20.y commit (when dogfood results are incorporated).

§11 Contract stability

Locked sections (part of the design contract; future changes require spec §N update):

  • wire schema ocr_stats.v1 field list.
  • wire schema ocr_failures.v1 field list.
  • [logging] config fields.
  • CLI inspect ocr-stats / ocr-failures output format.

Flexible sections (implementation detail, future refactor OK):

  • extract_image_dimensions helper location (may move within the crate).
  • cleanup_old_logs policy (a more sophisticated algorithm is possible).
  • SQLite index strategy (additional indexes are possible).

Summary

v0.20.x ingest log round 2 extends the round 1 file-only ndjson in 4 ways:

  1. image dimension capture — raster JPEG decode (trivial).
  2. SQLite mirror — V008 migration + pdf_ocr_events table (medium).
  3. CLI inspect — corpus-wide OCR statistics API (medium).
  4. log retention — automatic cleanup (low).

All are non-breaking additive changes. Backward compat is guaranteed, the wire schema gets a minor bump, and the spec targets 500-800 lines (this document ≈ 700 lines).