---
title: "v0.20.x ingest log round 2: 4 enhancement spec"
created: 2026-05-28
status: DRAFT round 0
parent_spec: 2026-04-27-kebab-final-form-design.md
target_version: v0.20.x
branch: feat/ingest-log-round2-enhancements
---

# v0.20.x ingest log round 2

## §1 Motivation

### §1.1 Sweet-spot analysis: progressive dogfood tuning

The round 1 ingest log from v0.20.0 sub-item 1 (PR #189, merged) shipped with file-only logging to per-run ndjson. During 3 months of real-world use (dogfood), the OCR engine's timeout / performance sweet spot needs to be tuned incrementally.

- **Current state**: the OCR samples (ms, success/fail) of each ingest run are recorded only in the ndjson file → historical aggregate queries are not possible (data is per-run).
- **Requirement**: query p90 / p99 / extreme values from a cumulative database → makes it possible to decide on shrinking the timeout default (e.g. 300s → 180s).
- **Solution**: SQLite `pdf_ocr_events` mirror table, the main enhancement of v0.20.x round 2.

### §1.2 Image dimension defect: null emit problem

`crates/kebab-app/src/pdf_ocr_apply.rs` hardcodes `image_width: None, image_height: None` at its 6 emit points (lines 155, 188, 265, etc.).

- **Current**: the raster JPEG is in memory, but its dimensions are never measured.
- **Impact**: the fields are optional in the wire schema, but the data is required for real use cases (e.g. "adjust the timeout only for 100MB+ images").
- **Fix**: raster decode via the `image` crate (transitive dep via `kebab-parse-image`).

### §1.3 CLI inspect subcommands: operational visibility

The ndjson log is human-readable, but corpus-wide statistics for scripts/automation are missing.

- `kebab inspect ocr-stats --json`: overall OCR success rate, p50/p90/p99 latency, per-engine breakdown.
- `kebab inspect ocr-failures --doc-id <DOC_ID> --json`: failure history of a specific doc.
- `kebab inspect ocr-failures --json`: list of recent failures (corpus-wide).

### §1.4 Log retention: preventing unbounded growth

As ingest runs repeat, `~/.local/state/kebab/logs/ingest-*.ndjson` files accumulate.

- **Current**: manual cleanup is required.
- **Requirement**: automatic cleanup via `keep_recent_runs` (e.g. 100) + `retention_days` (e.g. 30).
- **Application**: the same retention policy also applies to the SQLite `pdf_ocr_events` table.

---

## §2 Scope + non-scope

### §2.1 Included

**Enhancement 1: image_width + image_height capture (trivial)**

- Decode the raster JPEG dimensions at the 6 emit points in `pdf_ocr_apply.rs`.
- Import the `image` crate (already transitive).
- Fill the fields with `Some(u32)`.

**Enhancement 2: SQLite mirror, pdf_ocr_events table (medium)**

- V008 migration: new `pdf_ocr_events` table (run_id, ts, doc_id, doc_path, page, image_byte_size, image_width, image_height, ms, chars, success, reason, ocr_engine).
- Indexes: run_id, doc_id, ts.
- Insert path: `IngestLogWriter` writes the file ndjson and SQLite together (dual-write).
- `SqliteStore::record_pdf_ocr_event(…)` API.

**Enhancement 3: CLI inspect commands (medium)**

- `kebab inspect ocr-stats --json`: corpus-wide aggregate (total_events, success_count, failure_count, p50/p90/p99_ms, by_engine, top 10 docs by failure).
- `kebab inspect ocr-failures --doc-id <DOC_ID> --json`: failure list of a single doc.
- `kebab inspect ocr-failures --json` (no doc-id): corpus-wide recent failures (`--limit` configurable).
- Wire schemas: `ocr_stats.v1` + `ocr_failures.v1` (additive minor to schema.v1).

**Enhancement 4: log retention + rotation (low)**

- `[logging] keep_recent_runs: u32` (default 100) + `retention_days: u32` (default 30).
- File cleanup: `IngestLogWriter::open` calls a prune helper.
- SQLite cleanup: `SqliteStore::prune_pdf_ocr_events(retention_days)`.
- Backward compat: an old config (no `[logging]` fields) parses with defaults.

### §2.2 Out of scope

- Wire schema public API (`pdf_ocr_events` is an internal SQLite table and is not exposed on the wire).
- Korean phrasing-sensitive refusal in the `ask` command (outside this round).
- Migration rollback automation (follows the standard CLAUDE.md protocol).
- Concurrent ingest lock manager (single-process ingest is assumed for now; future spec).

---

## §3 Design decisions

### §3.1 image_width / image_height: raster decode path

**Choice**: decode the raster JPEG with ImageReader → extract (width, height).

- **Rationale**: the bytes are already in memory when OCR is called (`extract_dctdecode_page_image`), and decode latency << OCR latency (negligible, <1ms).
- **Rejected alternative**: use the PDF MediaBox → the page size can differ from the actual raster (less accurate).
- **Implementation**: the `image` crate's `ImageReader::new(Cursor::new(&bytes)).with_guessed_format()?.into_dimensions()?`.
- **Error handling**: decode failure → (None, None) fallback. The OCR result remains valid.

### §3.2 SQLite mirror: V008 migration + dual-write

**Choice**: a SQLite mirror that complements the file-only ndjson of v0.20.0 round 1 (non-breaking).

- **Rationale**: historical queries require structured storage. Files alone cannot serve corpus-wide aggregates.
- **doc_id wiring**: the `doc_id` field of `LogEvent::Ocr` must be captured in closure scope ahead of time. Bind `canonical.doc_id` to a local variable before calling `apply_ocr_to_pdf_pages`, then perform both the file ndjson write and the SQLite insert inside the closure with that same doc_id. This keeps the dual-write consistent.
- **Dual-write structure**:
  1. `IngestLogWriter::write_event(&LogEvent::Ocr)` performs the file ndjson write + SQLite insert.
  2. The insert is called directly by the emit_progress closure through an `Arc<SqliteStore>` clone.
  3. Transaction safety: file write first (failures → log), then SQLite (non-critical).
- **Non-breaking**: logging works even without the new config fields (file only). The SQLite table is created automatically on upgrade.

### §3.3 CLI inspect commands: ocr-stats + ocr-failures

**Wire schema**: add `ocr_stats.v1` + `ocr_failures.v1` to the `wire.schemas` list of the existing `schema.v1` as an additive minor change.

- **Rationale**: the new wire shapes are not public API (only the inspect commands emit them). As an extension of wire.v1 they are additive.
- **Implementation**: add `InspectCommand::OcrStats / OcrFailures` arms to `Subcommand::Inspect` in `kebab-cli/src/main.rs`.

### §3.4 retention: keep_recent_runs + retention_days

**Choice**: keep a log only when both conditions hold (OR-on-stale = AND-on-fresh semantics).

- **Rationale**:
  - `keep_recent_runs=100`: deterministic "keep the most recent N runs".
  - `retention_days=30`: time-based cleanup (obsolete logs are deleted automatically after dogfooding stops).
- **Delete if** (idx >= keep_recent) **OR** (modified <= cutoff): delete when either condition is stale. Equivalently, **keep iff** (idx < keep_recent) **AND** (modified > cutoff): keep only when both are fresh.
- **Implementation**: cleanup helper called from `IngestLogWriter::open()`; `SqliteStore::prune_pdf_ocr_events(retention_days)` as a separate routine.
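
The keep/delete equivalence in §3.4 can be checked with a small std-only sketch. The function names (`should_keep`, `should_delete`, `stale_indices`) and the epoch fallback on underflow are illustrative assumptions, not part of the spec; the real helper operates on directory entries rather than a slice of timestamps.

```rust
use std::time::{Duration, SystemTime};

/// Keep rule: the file must be fresh on BOTH axes (recency rank and age).
fn should_keep(idx: usize, modified: SystemTime, keep_recent: u32, cutoff: SystemTime) -> bool {
    idx < keep_recent as usize && modified > cutoff
}

/// Delete rule: the file is stale on EITHER axis.
/// By De Morgan's law this is exactly `!should_keep(..)`.
fn should_delete(idx: usize, modified: SystemTime, keep_recent: u32, cutoff: SystemTime) -> bool {
    idx >= keep_recent as usize || modified <= cutoff
}

/// `modified_newest_first` must already be sorted newest first, so the
/// slice index doubles as the recency rank. Returns the indices to delete.
fn stale_indices(
    modified_newest_first: &[SystemTime],
    keep_recent: u32,
    retention_days: u32,
    now: SystemTime,
) -> Vec<usize> {
    // Assumption: fall back to the Unix epoch if the subtraction underflows.
    let cutoff = now
        .checked_sub(Duration::from_secs(u64::from(retention_days) * 86_400))
        .unwrap_or(SystemTime::UNIX_EPOCH);
    modified_newest_first
        .iter()
        .enumerate()
        .filter(|(idx, modified)| should_delete(*idx, **modified, keep_recent, cutoff))
        .map(|(idx, _)| idx)
        .collect()
}
```

For files modified 1, 5, and 40 days ago, `keep_recent = 2` with `retention_days = 30` deletes only index 2, while `keep_recent = 10` with `retention_days = 3` deletes indices 1 and 2, because the age rule alone is enough to mark a file stale.
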

---

## §4 Implementation specification

### §4.1 image_width / image_height decode helper

**File**: `crates/kebab-app/src/pdf_ocr_apply.rs`

**Changes**:

1. `crates/kebab-app/Cargo.toml` dependency update:

```diff
-image = { version = "0.25", default-features = false, features = ["png"] }
+image = { version = "0.25", default-features = false, features = ["png", "jpeg"] }
```

2. Import `use image::io::Reader as ImageReader;` (transitive via kebab-parse-image).

3. New helper function:

```rust
fn extract_image_dimensions(jpeg_bytes: &[u8]) -> Option<(u32, u32)> {
    // `ImageReader::new` is infallible; only format guessing and the
    // header read can fail, and both map to `None`.
    let reader = ImageReader::new(std::io::Cursor::new(jpeg_bytes))
        .with_guessed_format()
        .ok()?;
    reader.into_dimensions().ok()
}
```

4. The 6 emit points (lines 155, 188, 265, and the 3 other `PdfOcrProgress::Finished` sites):

```rust
// Split Option<(w, h)> into two independent optional fields.
let (w, h) = extract_image_dimensions(&page_image_bytes)
    .map(|(w, h)| (Some(w), Some(h)))
    .unwrap_or((None, None));
emit_progress(PdfOcrProgress::Finished {
    image_width: w,
    image_height: h,
    // ... other fields ...
});
```

**Test**: existing `pdf_ocr_roundtrip` + new `pdf_ocr_image_dimensions` integration test.
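
For tests of `extract_image_dimensions`, it can help to have an independent way to read the expected dimensions of a fixture. The sketch below is not the approach this spec selects (the spec uses the `image` crate); it is a std-only alternative that walks JPEG segments until it finds a start-of-frame (SOF) marker. The function name `jpeg_dimensions` is hypothetical, and the parser only handles the common marker layout.

```rust
/// Reads (width, height) from a JPEG byte stream by scanning for a
/// start-of-frame segment. Returns `None` for non-JPEG or truncated input.
fn jpeg_dimensions(bytes: &[u8]) -> Option<(u32, u32)> {
    // Every JPEG starts with the SOI marker FF D8.
    if bytes.len() < 4 || bytes[0] != 0xFF || bytes[1] != 0xD8 {
        return None;
    }
    let mut i = 2;
    while i + 4 <= bytes.len() {
        if bytes[i] != 0xFF {
            return None;
        }
        let marker = bytes[i + 1];
        if marker == 0xFF {
            // Fill byte before a marker: skip it.
            i += 1;
            continue;
        }
        match marker {
            // Standalone markers carry no length field.
            0xD8 | 0x01 | 0xD0..=0xD7 => {
                i += 2;
                continue;
            }
            // End of image or start of scan before any SOF: give up.
            0xD9 | 0xDA => return None,
            _ => {}
        }
        // The segment length is big-endian and includes its own two bytes.
        let len = u16::from_be_bytes([bytes[i + 2], bytes[i + 3]]) as usize;
        if len < 2 {
            return None;
        }
        // SOF markers are C0..=CF except C4 (DHT), C8 (JPG), CC (DAC).
        let is_sof = matches!(marker, 0xC0..=0xCF) && !matches!(marker, 0xC4 | 0xC8 | 0xCC);
        if is_sof {
            // Payload layout: precision (1 byte), height (2), width (2).
            let p = i + 4;
            if p + 5 > bytes.len() {
                return None;
            }
            let height = u16::from_be_bytes([bytes[p + 1], bytes[p + 2]]) as u32;
            let width = u16::from_be_bytes([bytes[p + 3], bytes[p + 4]]) as u32;
            return Some((width, height));
        }
        i += 2 + len;
    }
    None
}
```

Like the helper in §4.1, this returns `None` instead of an error, so a caller can apply the same (None, None) fallback.
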
### §4.2 V008 migration SQL + SqliteStore API

**File**: `migrations/V008__pdf_ocr_events.sql` (new)

```sql
CREATE TABLE pdf_ocr_events (
    id INTEGER PRIMARY KEY,
    run_id TEXT NOT NULL,
    ts TEXT NOT NULL,            -- ISO 8601 UTC (RFC 3339)
    doc_id TEXT,                 -- nullable (file detect skip)
    doc_path TEXT NOT NULL,
    page INTEGER NOT NULL,
    image_byte_size INTEGER,
    image_width INTEGER,
    image_height INTEGER,
    ms INTEGER NOT NULL,
    chars INTEGER NOT NULL,
    success INTEGER NOT NULL,    -- 0 = fail, 1 = success
    reason TEXT,                 -- "timeout" / "ocr_error" / NULL
    ocr_engine TEXT NOT NULL
);

CREATE INDEX idx_pdf_ocr_events_doc_id ON pdf_ocr_events(doc_id);
CREATE INDEX idx_pdf_ocr_events_run_id ON pdf_ocr_events(run_id);
CREATE INDEX idx_pdf_ocr_events_ts ON pdf_ocr_events(ts);
```

**File**: `crates/kebab-store-sqlite/src/lib.rs` (SqliteStore extension)

**New methods** (Mutex lock made explicit):

```rust
impl SqliteStore {
    pub fn record_pdf_ocr_event(
        &self,
        run_id: &str,
        ts: &str,
        doc_id: Option<&str>,
        doc_path: &str,
        page: u32,
        image_byte_size: Option<u64>,
        image_width: Option<u32>,
        image_height: Option<u32>,
        ms: u64,
        chars: u32,
        success: bool,
        reason: Option<&str>,
        ocr_engine: &str,
    ) -> anyhow::Result<()> {
        let conn = self.conn.lock().expect("sqlite lock poisoned");
        conn.execute(
            "INSERT INTO pdf_ocr_events (run_id, ts, doc_id, doc_path, page, \
             image_byte_size, image_width, image_height, ms, chars, success, \
             reason, ocr_engine) \
             VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
            rusqlite::params![
                run_id, ts, doc_id, doc_path, page,
                image_byte_size, image_width, image_height,
                ms, chars,
                if success { 1 } else { 0 },
                reason, ocr_engine
            ],
        )?;
        Ok(())
    }

    /// Deletes rows older than `retention_days`; returns the number deleted.
    pub fn prune_pdf_ocr_events(&self, retention_days: u32) -> anyhow::Result<u64> {
        let conn = self.conn.lock().expect("sqlite lock poisoned");
        // If the cutoff cannot be computed or formatted, it falls back to ""
        // and the DELETE below matches no rows.
        let cutoff_ts = time::OffsetDateTime::now_utc()
            .checked_sub(time::Duration::days(retention_days as i64))
            .and_then(|dt| dt.format(&time::format_description::well_known::Rfc3339).ok())
            .unwrap_or_default();
        // `ts` is RFC 3339 UTC text, so string comparison follows time order.
        let n = conn.execute(
            "DELETE FROM pdf_ocr_events WHERE ts < ?",
            rusqlite::params![cutoff_ts],
        )?;
        Ok(n as u64)
    }
}
```

### §4.3 wire schema: ocr_stats.v1 + ocr_failures.v1

**File**: `docs/wire-schema/v1/ocr_stats.schema.json` (new)

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "ocr_stats.v1",
  "type": "object",
  "properties": {
    "schema_version": { "const": "ocr_stats.v1" },
    "total_events": { "type": "integer" },
    "total_runs": { "type": "integer" },
    "success_count": { "type": "integer" },
    "failure_count": { "type": "integer" },
    "success_rate": { "type": "number" },
    "p50_ms": { "type": "integer" },
    "p90_ms": { "type": "integer" },
    "p99_ms": { "type": "integer" },
    "max_ms": { "type": "integer" },
    "by_engine": {
      "type": "object",
      "additionalProperties": { "type": "integer" }
    },
    "by_doc": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "doc_id": { "type": "string" },
          "failure_count": { "type": "integer" },
          "success_count": { "type": "integer" },
          "p90_ms": { "type": ["integer", "null"] }
        }
      }
    }
  },
  "required": ["schema_version", "total_events", "total_runs", "success_count", "failure_count", "success_rate"]
}
```

**File**: `docs/wire-schema/v1/ocr_failures.schema.json` (new)

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "ocr_failures.v1",
  "type": "object",
  "properties": {
    "schema_version": { "const": "ocr_failures.v1" },
    "doc_id": { "type": ["string", "null"] },
    "failure_count": { "type": "integer" },
    "failures": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "ts": { "type": "string" },
          "page": { "type": "integer" },
          "ms": { "type": "integer" },
          "reason": { "type": "string" },
          "image_byte_size": { "type": ["integer", "null"] }
        }
      }
    }
  },
  "required": ["schema_version", "failure_count", "failures"]
}
```

**File**: `docs/wire-schema/v1/schema.schema.json` (update)

Add `ocr_stats.v1` + `ocr_failures.v1` to the `schema.v1.wire.schemas` list (additive).
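
The `p50_ms` / `p90_ms` / `p99_ms` / `max_ms` fields of `ocr_stats.v1` need a concrete percentile definition. The sketch below assumes the nearest-rank method over an in-memory sorted vector; the spec only says "in-memory sort", so the exact index rule, the function names, and the tuple return shape are assumptions.

```rust
/// Nearest-rank percentile over an ascending-sorted slice.
/// rank = ceil(pct / 100 * n), 1-based. Returns `None` for empty input
/// or a percentile outside 1..=100.
fn percentile_nearest_rank(sorted_ms: &[u64], pct: u32) -> Option<u64> {
    if sorted_ms.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    let n = sorted_ms.len();
    // Integer ceiling of (n * pct) / 100; always within 1..=n here.
    let rank = (n * pct as usize + 99) / 100;
    Some(sorted_ms[rank - 1])
}

/// Sorts the samples and returns (p50, p90, p99, max), or `None` when
/// there are no samples to summarize.
fn latency_summary(mut samples_ms: Vec<u64>) -> Option<(u64, u64, u64, u64)> {
    samples_ms.sort_unstable();
    let p50 = percentile_nearest_rank(&samples_ms, 50)?;
    let p90 = percentile_nearest_rank(&samples_ms, 90)?;
    let p99 = percentile_nearest_rank(&samples_ms, 99)?;
    let max = *samples_ms.last()?;
    Some((p50, p90, p99, max))
}
```

With the ten samples 10, 20, ..., 100 this yields p50 = 50, p90 = 90, p99 = 100. An empty table yields `None`, which is consistent with the percentile fields not being listed under `required` in the schema.
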
### §4.4 CLI inspect ocr-stats + ocr-failures

**File**: `crates/kebab-cli/src/main.rs`

**New subcommands**:

```rust
#[derive(clap::Subcommand)]
pub enum Subcommand {
    /// Nested `inspect` subcommands.
    #[command(subcommand)]
    Inspect(InspectCommand),
    // ...
}

#[derive(clap::Subcommand)]
pub enum InspectCommand {
    OcrStats {
        #[arg(long, default_value = "false")]
        json: bool,
    },
    OcrFailures {
        #[arg(long)]
        doc_id: Option<String>,
        #[arg(long, default_value = "10")]
        limit: usize,
        #[arg(long, default_value = "false")]
        json: bool,
    },
}
```

**File**: `crates/kebab-app/src/lib.rs` (extension)

```rust
// NOTE: the return type names `OcrStats` / `OcrFailures` are placeholders
// for the structs that mirror ocr_stats.v1 / ocr_failures.v1.
impl App {
    pub fn inspect_ocr_stats(&self) -> anyhow::Result<OcrStats> {
        // SELECT queries: aggregate over pdf_ocr_events.
        // 1. Compute total_events, success_count, failure_count, success_rate.
        // 2. Percentiles via in-memory sort:
        //    SELECT ms FROM pdf_ocr_events WHERE success=1 ORDER BY ms.
        //    Fetch into a Vec<u64>, then index (p50 = 50%, p90 = 90%, p99 = 99%).
        // 3. by_engine group-by (success count per engine).
        // 4. by_doc top 10 (failure_count DESC).
        todo!()
    }

    pub fn inspect_ocr_failures(
        &self,
        doc_id: Option<&str>,
        limit: usize,
    ) -> anyhow::Result<OcrFailures> {
        // SELECT failure records WHERE success=0.
        // With doc_id: WHERE doc_id=?; without: ORDER BY ts DESC LIMIT limit.
        todo!()
    }
}
```

### §4.5 retention cleanup helper: file + SQLite

**File**: `crates/kebab-app/src/ingest_log.rs` (extension)

```rust
impl IngestLogWriter {
    pub fn open(cfg: &kebab_config::LoggingCfg) -> anyhow::Result<Option<Self>> {
        if !cfg.ingest_log_enabled {
            return Ok(None);
        }
        let run_id = generate_run_id();
        let log_dir = expand_log_dir(&cfg.ingest_log_dir);
        std::fs::create_dir_all(&log_dir)?;

        // Clean up file logs (before creating the new log).
        if let Err(e) = Self::cleanup_old_logs(&log_dir, cfg.keep_recent_runs, cfg.retention_days) {
            // Non-critical: continue without failing ingest.
            tracing::warn!(target: "kebab-app", "ingest log cleanup failed: {e}");
        }

        let path = log_dir.join(format!("ingest-{run_id}.ndjson"));
        let file = BufWriter::new(File::create(&path)?);
        Ok(Some(Self { file, path, run_id, started_at: SystemTime::now() }))
    }

    fn cleanup_old_logs(log_dir: &Path, keep_recent: u32, retention_days: u32) -> anyhow::Result<()> {
        let mut entries: Vec<_> = std::fs::read_dir(log_dir)?
            .filter_map(|e| e.ok())
            .filter(|e| {
                e.path()
                    .file_name()
                    .and_then(|n| n.to_str())
                    .map(|s| s.starts_with("ingest-") && s.ends_with(".ndjson"))
                    .unwrap_or(false)
            })
            .collect();

        // Sort by modified time descending (newest first).
        entries.sort_by_key(|e| std::cmp::Reverse(e.metadata().ok().and_then(|m| m.modified().ok())));

        let cutoff_time = SystemTime::now()
            - std::time::Duration::from_secs(retention_days as u64 * 86400);

        for (idx, entry) in entries.into_iter().enumerate() {
            let path = entry.path();
            let metadata = entry.metadata()?;
            let modified = metadata.modified()?;

            // Delete if (idx >= keep_recent) OR (modified <= cutoff).
            // Equivalent: keep iff (idx < keep_recent) AND (modified > cutoff), i.e. both fresh.
            // Per §3.4 OR-on-stale semantics.
            if idx < keep_recent as usize && modified > cutoff_time {
                continue;
            }
            std::fs::remove_file(&path)
                .map_err(|e| anyhow::anyhow!("failed to remove {}: {}", path.display(), e))?;
        }
        Ok(())
    }
}
```

### §4.6 Config extension: LoggingCfg

**File**: `crates/kebab-config/src/lib.rs` (LoggingCfg extension)

```rust
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct LoggingCfg {
    pub ingest_log_enabled: bool,
    pub ingest_log_dir: PathBuf,
    #[serde(default = "default_keep_recent_runs")]
    pub keep_recent_runs: u32,
    #[serde(default = "default_retention_days")]
    pub retention_days: u32,
}

fn default_keep_recent_runs() -> u32 { 100 }
fn default_retention_days() -> u32 { 30 }

impl Default for LoggingCfg {
    fn default() -> Self {
        Self {
            ingest_log_enabled: true,
            ingest_log_dir: PathBuf::from("{state_dir}/logs"),
            keep_recent_runs: 100,
            retention_days: 30,
        }
    }
}
```

**File**: `docs/SMOKE.md` (config example update)

```toml
[logging]
ingest_log_enabled = true
ingest_log_dir = "{state_dir}/logs"
keep_recent_runs = 100
retention_days = 30
```

### §4.7 IngestLogWriter dual-write integration: canonical.doc_id closure capture (F1)

**File**: `crates/kebab-app/src/ingest_log.rs` (write_event extension)

```rust
impl IngestLogWriter {
    pub fn write_event_with_db(
        &mut self,
        event: &LogEvent<'_>,
        store: Option<&SqliteStore>,
    ) -> anyhow::Result<()> {
        // Write to file first.
        serde_json::to_writer(&mut self.file, event)?;
        writeln!(self.file)?;

        // Write to SQLite if a store is provided and the event is Ocr.
        if let (
            Some(store),
            LogEvent::Ocr {
                ts, doc_id, doc_path, page, image_byte_size, image_width,
                image_height, ms, chars, success, reason, ocr_engine,
            },
        ) = (store, event)
        {
            let _ = store
                .record_pdf_ocr_event(
                    self.run_id(),
                    ts,
                    doc_id.as_deref(), // doc_id must be captured in closure scope (see below)
                    doc_path,
                    *page,
                    *image_byte_size,
                    *image_width,
                    *image_height,
                    *ms,
                    *chars,
                    *success,
                    *reason,
                    ocr_engine,
                )
                .map_err(|e| {
                    // Non-critical: log a warning but don't fail ingest.
                    tracing::warn!(target: "kebab-app", "sqlite ocr event insert failed: {e}");
                });
        }
        Ok(())
    }
}
```

**Caller example** (the emit_progress closure in kebab-app/src/ingest_one_pdf_asset or apply_ocr_to_pdf_pages):

```rust
// Pre-capture canonical.doc_id before the apply_ocr_to_pdf_pages closure:
let doc_id_for_log: String = canonical.doc_id.0.clone();
let doc_path_for_log = asset.workspace_path.0.clone();
let store = Arc::new(SqliteStore::open(&cfg.storage.sqlite)?);
let mut log_writer = IngestLogWriter::open(&cfg.logging)?;

// The closure mutates `log_writer`, so it is FnMut and must be bound as `mut`.
let mut emit_progress = move |progress: PdfOcrProgress| {
    if let Some(writer) = &mut log_writer {
        let event = LogEvent::Ocr {
            ts: now_ts(),
            doc_id: Some(doc_id_for_log.clone()), // ← captured in closure scope
            doc_path: doc_path_for_log.clone(),
            page: page_n, // taken from `progress`
            // ... other fields ...
        };
        // `store` is an Arc<SqliteStore>; borrow the inner store.
        let _ = writer.write_event_with_db(&event, Some(store.as_ref()));
    }
};
```

---

## §5 Acceptance criteria

**AC-1**: image_width + image_height non-null after PDF OCR.

- Integration test: run `ingest_one_pdf_asset` on a scanned PDF → check `pdf_ocr_summary.pages_ocrd > 0` in the IngestReport, and verify `image_width: Some(_)`, `image_height: Some(_)` in the log file.

**AC-2**: V008 migration successful + `pdf_ocr_events` table exists.

- Test: create a fresh DB → apply migrations → verify `SELECT name FROM sqlite_master WHERE type='table' AND name='pdf_ocr_events';`.

**AC-3**: after ingest, SQLite rows match the OCR records of the ndjson file 1:1.

- Integration test: after ingest, `SELECT COUNT(*) FROM pdf_ocr_events WHERE success=1` = the number of `success=true` OCR lines in the ndjson.

**AC-4**: `kebab inspect ocr-stats --json` emits correctly with schema_version `ocr_stats.v1`.

- CLI test: `kebab inspect ocr-stats --json | jq '.schema_version'` = `"ocr_stats.v1"`; `total_events` and `success_rate` are present.

**AC-5**: `kebab inspect ocr-failures --doc-id <DOC_ID> --json` emits correctly with `ocr_failures.v1`.

- CLI test: query with a doc_id that has failures → `failures[]` array is non-empty, `reason` field is present.

**AC-6**: `kebab inspect ocr-failures --json` (no doc-id) is corpus-wide.
- CLI test: `--limit 5` returns the 5 most recent failures, `failure_count >= 5`.

**AC-7**: log retention: with keep_recent_runs=2, the oldest file is deleted after the 3rd ingest.

- Integration test: temp log dir, 3 ingest runs with `keep_recent_runs=2` → only the newest 2 files remain.

**AC-8**: SQLite retention: with retention_days=0, old rows are deleted.

- Test: insert an old row (ts = 90 days ago) → `prune_pdf_ocr_events(0)` → row deleted.

**AC-9**: backward compat: an old config (no `[logging] retention_*` fields) parses with defaults.

- Test: load a pre-v0.20.x config (no `[logging]` section) → `logging.keep_recent_runs == 100` (default).

**AC-10**: workspace tests + clippy green.

- `cargo test --workspace -j 1`, `cargo clippy --all-targets`.

**AC-11**: integration tests (`ocr_inspect_smoke` + `pdf_ocr_events_insert_smoke`).

- New test binary: scanned PDF ingest → verify `kebab inspect ocr-stats / ocr-failures`.
- `crates/kebab-store-sqlite/tests/pdf_ocr_events_insert_smoke.rs`.

**AC-12**: `[logging] retention_*` defaults are emitted in the `kebab init` config.

- Test: `kebab init --config /tmp/test-cfg.toml` → `[logging] keep_recent_runs = 100` + `retention_days = 30` present.

**AC-13**: wire schema additive list is synced in `integrations/claude-code/kebab/SKILL.md`.

- Test: `grep -c 'ocr_stats\.v1' integrations/claude-code/kebab/SKILL.md` returns ≥1, same for `ocr_failures.v1`.

---

## §6 Risks + open questions

### R-1 dual-write transaction safety (file vs SQLite race)

**Issue**: if the SQLite insert fails after the emit_progress closure has written the file, the ndjson and the DB diverge.

**Mitigation**:

- File write first (durable, may fail).
- SQLite write second (non-critical, warn on fail, don't propagate the error).
- Per-run reconciliation tool (future enhancement, not in scope).

### R-2 V008 migration rollback (F6)

**Issue**: when a user downgrades from v0.20.x to an older version, is V008 rolled back?

**Mitigation**: follow the CLAUDE.md migration policy. V008 is an additive table → it works as long as the old version ignores the table.
**Manual rollback** (v0.19.x ↔ v0.20.x alternating dogfood):

```sql
DELETE FROM refinery_schema_history WHERE version=8;
DROP TABLE IF EXISTS pdf_ocr_events;
```

**Out-of-scope**: no automatic rollback path is provided for alternating v0.19.x ↔ v0.20.x runs.

### R-3 prune helper vs. concurrent ingest (stale lock)

**Issue**: what if another process writes while cleanup_old_logs is deleting files?

**Mitigation**:

- Cleanup runs only in `IngestLogWriter::open` (before ingest starts).
- Single ingest per process is assumed (current design).
- Concurrent ingest support is a future phase.

### R-4 image decode failure handling (corrupt JPEG fallback)

**Issue**: the JPEG is corrupt → extract_image_dimensions fails.

**Mitigation**: the helper returns `Option<(u32, u32)>` → (None, None) fallback. The completed OCR is still valid. Pushing a warning event is optional (future enhancement).

### R-5 wire schema additive minor: old consumers do not recognize the schema

**Issue**: can an old `kebab` binary (v0.19.x) consume the output of v0.20.x `kebab inspect ocr-stats`?

**Mitigation**:

- `schema_version` = `ocr_stats.v1` is explicit (an old consumer does not recognize the schema → skipping is OK).
- The wire.v1 list is additive → backward compatible (old consumers simply ignore the new list entries).
- Only new consumers recognize `ocr_stats.v1` / `ocr_failures.v1`.

### R-6 Log file loss caused by concurrent cleanup

**Issue**: what if a user tries to tail a file while it is being deleted by the keep_recent_runs / retention_days policy?

**Mitigation**: cleanup runs only at ingest start. A file the user is tailing is usually recent (within the top N) → safe.

### R-7 doc_id NULL wiring in LogEvent::Ocr closure (F1)

**Issue**: if doc_id is None or mismatched inside the emit_progress closure, the doc_id in the file ndjson and in the SQLite record diverge.

**Verification** (added to spec §6 R-7):

```bash
# Check where canonical.doc_id is set
grep -n "canonical.doc_id\|\.doc_id\s*=" crates/kebab-app/src/lib.rs | head -10
```

**Mitigation**:

- Pre-capture doc_id in closure scope (`let doc_id_for_log: String = canonical.doc_id.0.clone()`).
- Use the captured value when constructing `LogEvent::Ocr`.
- In the per-run integration test, verify that the doc_id in the file ndjson matches the doc_id from the SQLite SELECT.
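
The pre-capture mitigation for R-7 can be illustrated with a std-only sketch in which two in-memory vectors stand in for the ndjson file and the SQLite table. `OcrRecord` and `make_emitter` are hypothetical names for illustration only; the point is that the record is built once from the captured doc_id and the same value goes to both sinks.

```rust
/// Minimal stand-in for the fields of an OCR log event (hypothetical).
#[derive(Debug, Clone, PartialEq)]
struct OcrRecord {
    doc_id: Option<String>,
    page: u32,
    success: bool,
}

/// Builds an emitter that owns `doc_id` (moved in before any OCR work
/// starts) and writes an identical record to both sinks on every call.
fn make_emitter<'a>(
    doc_id: String,
    file_sink: &'a mut Vec<OcrRecord>,
    db_sink: &'a mut Vec<OcrRecord>,
) -> impl FnMut(u32, bool) + 'a {
    move |page, success| {
        // Build the record once so the two sinks cannot disagree on doc_id.
        let record = OcrRecord {
            doc_id: Some(doc_id.clone()),
            page,
            success,
        };
        file_sink.push(record.clone()); // stands in for the ndjson write
        db_sink.push(record); // stands in for the SQLite insert
    }
}
```

Comparing the two sinks for equality after a run is the in-memory analogue of the integration check described in the mitigation list.
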

---

## §7 References

- **Parent spec**: `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` (design contract).
- **Round 1 spec**: v0.20.0 sub-item 1 ingest log (PR #189).
- **Code ranges**:
  - `crates/kebab-app/src/pdf_ocr_apply.rs` lines 155, 188, 265 (6 emit points).
  - `crates/kebab-app/src/ingest_log.rs` (LogEvent, IngestLogWriter).
  - `crates/kebab-config/src/lib.rs` (Config, LoggingCfg).
  - `crates/kebab-store-sqlite/src/lib.rs` (SqliteStore).
  - `crates/kebab-cli/src/main.rs` (Subcommand).
- **Dependencies**:
  - `image` crate (transitive via kebab-parse-image).
  - `time` crate (RFC 3339 timestamp, already in workspace).
  - `rusqlite` (already in kebab-store-sqlite).
- **Config sections**:
  - `[logging]`: ingest_log_enabled, ingest_log_dir, keep_recent_runs, retention_days.
  - `[pdf.ocr]`: (unchanged from v0.20.0).
- **Wire schemas**:
  - `docs/wire-schema/v1/ocr_stats.schema.json` (new).
  - `docs/wire-schema/v1/ocr_failures.schema.json` (new).
  - `docs/wire-schema/v1/schema.schema.json` (additive: both schemas added to the wire.schemas list).
- **HOTFIXES contract**: new deviations go into `tasks/HOTFIXES.md` as a dated entry with a cross-link to this spec.
- **Version cascade**: adding image_width/height and the SQLite table schema does not trigger an index_version cascade (chunks/embeddings are unaffected).
- **Backward compat**: old config parses with `[logging]` defaults, wire schema additive minor.

---

## §8 Dependencies + imports

### Allowed dependencies

- `image` crate (for ImageReader::new, into_dimensions).
- `time` crate (RFC 3339 formatting, already in workspace).
- `rusqlite` (for SQL execute / query, already in kebab-store-sqlite).
- `serde_json` (for wire schema export, already in kebab-app).

### Forbidden dependencies

- **None new introduced.** All uses are transitive or workspace-existing.

---

## §9 Testing strategy

### Unit tests

- `extract_image_dimensions` helper: valid JPEG → Some((w, h)), corrupt JPEG → None.
- `cleanup_old_logs`: keep_recent_runs / retention_days logic, file deletion.
- LoggingCfg defaults: serde round-trip, backward compat.

### Integration tests

**New test files**:

- `crates/kebab-app/tests/ocr_inspect_smoke.rs`: scanned PDF ingest → inspect ocr-stats / ocr-failures validation.
- `crates/kebab-store-sqlite/tests/pdf_ocr_events_insert_smoke.rs`: V008 migration, dual-write, prune logic.

**Existing test updates**:

- `pdf_ocr_roundtrip` → verify image_width/height non-null.
- `ingest_report_snapshot` → verify ocr_stats output shape.

### Smoke test (docs/SMOKE.md)

- `kebab ingest` with a scanned PDF → check `~/.local/state/kebab/logs/ingest-*.ndjson` + the `pdf_ocr_events` table.
- `kebab inspect ocr-stats --json | jq '.schema_version'` = `"ocr_stats.v1"`.

### Regression tests

- Existing workspace test suite (1370 tests): zero regressions expected (cleanup is non-critical, file-only logging is unchanged).

---

## §10 Rollout + dogfood

### v0.20.x milestone

1. **Spec approval** (this round 0).
2. **Implementation** (A6 round 1, estimate 3-4 days for the 4 enhancements).
3. **Review + merge** (pull request via gitea-ops).
4. **Dogfood** (user runs the v0.20.x binary and accumulates OCR stats over 2-4 weeks).
5. **Data-driven tuning** (inspect ocr-stats → adjust the timeout default, release note v0.20.y).

### Backward compat notes

- Old binary (v0.19.x) + new config (v0.20.x with `[logging] retention_*`): config parses, logging ignores the new fields.
- New binary (v0.20.x) + old config (v0.19.x without `[logging]`): defaults apply, logging works.
- Wire schema: additive, consumers ignore unknown fields.

### Release notes

- **Wire schema additive minor** (`ocr_stats.v1`, `ocr_failures.v1` added) → not a release trigger (CLAUDE.md §Versioning cascade).
- Once data has accumulated during user dogfooding, the actual tuning (e.g. timeout adjustment) can land as a separate `chore: bump 0.20.x → 0.20.y` commit (when dogfood results are applied).

---

## §11 Contract stability

**Locked sections** (part of the design contract; future changes require a spec §N update):

- Wire schema `ocr_stats.v1` field list.
- Wire schema `ocr_failures.v1` field list.
- `[logging]` config fields.
- CLI `inspect ocr-stats / ocr-failures` output format.

**Flexible sections** (implementation detail; future refactors are OK):

- `extract_image_dimensions` helper location (may move within the crate).
- `cleanup_old_logs` policy (a more sophisticated algorithm is possible).
- SQLite index strategy (additional indexes are possible).

---

## Summary

v0.20.x ingest log round 2 extends the file-only ndjson of round 1 in 4 ways:

1. **Image dimension capture**: raster JPEG decode (trivial).
2. **SQLite mirror**: V008 migration + `pdf_ocr_events` table (medium).
3. **CLI inspect**: corpus-wide OCR statistics API (medium).
4. **Log retention**: automatic cleanup (low).

All are non-breaking additive changes: backward compat is guaranteed, the wire schema gets a minor bump, and the spec stays within 500-800 lines (this document ≈ 700 lines).