---
title: "v0.20.x ingest log feature: spec"
date: 2026-05-28
status: "DRAFT (round r1c)"
target_version: 0.20.x
phase: A4 (spec drafter rewrite)
parent_spec: ../../../tasks/p10/p10-1A-5-ingest-failure-log.md
contract_sections: []
references:
  - 2026-04-27-kebab-final-form-design.md
  - "pre-v0.20 dogfood Bug #15 (ocr timeout + log analysis)"
  - 2026-05-28 closure critic result
---

# v0.20.x ingest log feature: spec

## Motivation

After four rounds of dogfooding, the user made an explicit request:

> I'd like to be able to look at detailed logs whenever a failure occurs, OCR failures included. I think I need those statistics to find the right configuration sweet spot and design direction.

**Core problem**: There is no structured log surface that records ingest failures (OCR timeouts, parse errors, skips) as individual events plus summary stats. To find the sweet spot (OCR timeout, model choice), the user needs per-page statistics and aggregate stats.

**Surface choice**: a structured ndjson file (no wire schema change; internal, write-only). The user can analyze it directly with grep/jq, without a dedicated query CLI.

---

## Scope

### In scope

- `[logging]` config section (2 fields: `ingest_log_enabled`, `ingest_log_dir`)
- Per-ingest-run ndjson log file (`ingest-{run_id}.ndjson`, run_id = ISO 8601 + random suffix)
- Five log record kinds: `ocr`, `parse_error`, `skip`, `error`, `summary`
- OCR per-page events include the measured image byte size and dimensions
- Run ID and timestamps in ISO 8601 UTC format
- Backward compat: pre-v0.20 config files without a `[logging]` section keep working; the missing section falls back to defaults

### Out of scope

- Persistent SQLite storage (future enhancement)
- Log level switches (verbose / minimal); this round has a single level
- Log rotation / cleanup policy (user-managed)
- Query CLI (`kebab logs`)
- Wire schema changes (internal use only)

---

## Design decisions

### §3.1 Config schema: `[logging]` section

**File**: `crates/kebab-config/src/lib.rs` (the existing `Config` struct is at line 37+)

New struct `LoggingCfg`:

```rust
#[derive(Clone, Debug, Serialize, Deserialize, PartialEq)]
pub struct LoggingCfg {
    /// Write a structured ndjson log automatically during ingest. Default = true.
    /// When false, no log file is created.
    #[serde(default = "default_ingest_log_enabled")]
    pub ingest_log_enabled: bool,

    /// Directory for per-ingest-run log files. Default = `{state_dir}/logs`.
    /// The `{state_dir}` placeholder expands to the XDG state dir.
    #[serde(default = "default_ingest_log_dir")]
    pub ingest_log_dir: PathBuf,
}

fn default_ingest_log_enabled() -> bool {
    true
}

fn default_ingest_log_dir() -> PathBuf {
    // After expansion: ~/.local/state/kebab/logs
    PathBuf::from("{state_dir}/logs")
}

impl Default for LoggingCfg {
    fn default() -> Self {
        Self {
            ingest_log_enabled: default_ingest_log_enabled(),
            ingest_log_dir: default_ingest_log_dir(),
        }
    }
}
```

**Addition to the `Config` struct** (line 37+):

```rust
pub struct Config {
    // ... existing fields ...

    #[serde(default)]
    pub logging: LoggingCfg,
}
```

`#[serde(default)]` guarantees backward compat: when a pre-v0.20 config file has no `[logging]` section, the field is initialized to its default automatically.

### §3.2 Per-ingest-run filename + ID generation

**Filename format**: `ingest-{run_id}.ndjson`

**Run ID**: ISO 8601 timestamp + 8-char random suffix

- Example: `20260528T013000Z-abc123de`
- Format: `YYYYMMDDTHHmmssZ-{8-char random alphanumeric}`
- Sortable, readable, and unlikely to collide under concurrent runs

**Path expansion**: the `{state_dir}` placeholder expands to `~/.local/state/kebab`

- `{state_dir}` substitution is hand-rolled (`expand_path_with_base` in kebab-config does not support `{state_dir}`)
- The directory is created automatically if missing

### §3.3 Log content: ndjson schema

Each line is one JSON object (ndjson format). The `kind` field of each record is the union variant discriminator.

#### Record kind: `ocr`

One record per OCR page attempt (success or failure).

**HIGH-1 (PdfOcrProgress::Finished extend)**: `image_byte_size`, the image dimensions, and `failure_reason` are added to the `PdfOcrProgress::Finished` variant and carried from there.
```jsonc
{
  "ts": "2026-05-28T01:30:01.123Z",
  "kind": "ocr",
  "doc_path": "metro-korea.pdf",
  "page": 8,
  "image_byte_size": 989512,
  "image_width": 2480,
  "image_height": 3508,
  "ms": 180000,
  "chars": 0,
  "success": false,
  "reason": "timeout",
  "ocr_engine": "ollama-vision"
}
```

**Fields**:

- `ts`: ISO 8601 UTC (millisecond precision, using the `time` crate)
- `kind`: literal `"ocr"`
- `doc_path`: relative or absolute source path
- `page`: page number (0-indexed or 1-indexed; to be confirmed in implementation)
- `image_byte_size`: raster image size in bytes (from `PdfOcrProgress::Finished.image_byte_size`)
- `image_width`, `image_height`: raster dimensions in pixels (from `PdfOcrProgress::Finished.image_width` / `image_height`)
- `ms`: OCR duration in milliseconds
- `chars`: extracted character count (0 if failed)
- `success`: boolean; `false` when `PdfOcrProgress::Finished.failure_reason` is `Some`, `true` otherwise
- `reason`: "timeout" | "network_error" | "malformed_image" | "other" (from `PdfOcrProgress::Finished.failure_reason`)
- `ocr_engine`: "ollama-vision" or the configured engine name

#### Record kind: `parse_error`

Parse failure for a PDF or another format.

```jsonc
{
  "ts": "2026-05-28T01:30:02.456Z",
  "kind": "parse_error",
  "doc_path": "weird.pdf",
  "reason": "lopdf_error",
  "message": "unexpected EOF in xref table"
}
```

#### Record kind: `skip`

Document skipped (size_exceeded / gitignore / builtin_blacklist).

```jsonc
{
  "ts": "2026-05-28T01:30:03.789Z",
  "kind": "skip",
  "doc_path": "large.zip",
  "reason": "builtin_blacklist",
  "detail": ".zip extension"
}
```

#### Record kind: `error`

Fatal error, written to the log at the same time `error.v1` is emitted.

```jsonc
{
  "ts": "2026-05-28T01:30:04.012Z",
  "kind": "error",
  "code": "config_not_found",
  "message": "config file does not exist: /tmp/config.toml"
}
```

#### Record kind: `summary`

Final aggregate stats for the ingest run (always the last line).
```jsonc
{
  "ts": "2026-05-28T01:31:00.000Z",
  "kind": "summary",
  "run_id": "20260528T013000Z-abc123de",
  "scanned": 11,
  "new": 11,
  "errors": 0,
  "ocr_pages": 21,
  "ocr_failures": 2,
  "ocr_p50_ms": 1500,
  "ocr_p90_ms": 63000,
  "ocr_max_ms": 180007,
  "duration_ms": 555550
}
```

**Fields**:

- `run_id`: unique ID of this ingest run
- `scanned`: total number of documents attempted
- `new`: number of new documents actually ingested
- `errors`: fatal error count
- `ocr_pages`: total number of pages on which OCR was attempted
- `ocr_failures`: number of pages where OCR failed
- `ocr_p50_ms`, `ocr_p90_ms`, `ocr_max_ms`: percentiles and max (successful pages only; may be null)
- `duration_ms`: elapsed time of the entire run

---

## §4 Implementation spec

### §4.1 New module: `crates/kebab-app/src/ingest_log.rs`

```rust
use std::fs::File;
use std::io::{BufWriter, Write};
use std::path::PathBuf;
use std::time::SystemTime;

use serde::{Deserialize, Serialize};

pub struct IngestLogWriter {
    file: Option<BufWriter<File>>,
    run_id: String,
    started_at: SystemTime,
}

impl IngestLogWriter {
    /// Open the log file. Returns `None` when `cfg.ingest_log_enabled == false`.
    pub fn open(cfg: &kebab_config::LoggingCfg) -> anyhow::Result<Option<Self>> {
        if !cfg.ingest_log_enabled {
            return Ok(None);
        }
        let run_id = generate_run_id();
        let log_dir = expand_log_dir(&cfg.ingest_log_dir)?;
        std::fs::create_dir_all(&log_dir)?;
        let path = log_dir.join(format!("ingest-{run_id}.ndjson"));
        let file = BufWriter::new(File::create(&path)?);
        Ok(Some(Self {
            file: Some(file),
            run_id,
            started_at: SystemTime::now(),
        }))
    }

    /// Serialize one event as a single ndjson line.
    pub fn write_event(&mut self, event: &LogEvent) -> anyhow::Result<()> {
        if let Some(ref mut f) = self.file {
            serde_json::to_writer(&mut *f, event)?;
            writeln!(f)?;
        }
        Ok(())
    }

    pub fn write_summary(&mut self, summary: &IngestSummary) -> anyhow::Result<()> {
        // Wrap in LogEvent::Summary so the line carries `"kind": "summary"`;
        // serializing IngestSummary directly would omit the discriminator.
        let event = LogEvent::Summary {
            ts: summary.ts.clone(),
            run_id: summary.run_id.clone(),
            scanned: summary.scanned,
            new: summary.new,
            errors: summary.errors,
            ocr_pages: summary.ocr_pages,
            ocr_failures: summary.ocr_failures,
            ocr_p50_ms: summary.ocr_p50_ms,
            ocr_p90_ms: summary.ocr_p90_ms,
            ocr_max_ms: summary.ocr_max_ms,
            duration_ms: summary.duration_ms,
        };
        self.write_event(&event)
    }

    pub fn flush(&mut self) -> anyhow::Result<()> {
        if let Some(ref mut f) = self.file {
            f.flush()?;
        }
        Ok(())
    }

    pub fn run_id(&self) -> &str {
        &self.run_id
    }
}

impl Drop for IngestLogWriter {
    fn drop(&mut self) {
        // Best-effort flush; errors cannot be reported from Drop.
        let _ = self.flush();
    }
}

fn generate_run_id() -> String {
    use time::format_description;
    use time::OffsetDateTime;

    let now = OffsetDateTime::now_utc();
    // Format: 20260528T013000Z (compact ISO 8601)
    let ts = now
        .format(
            &format_description::parse("[year][month][day]T[hour][minute][second]Z")
                .expect("format_description is valid"),
        )
        .expect("format should succeed");
    let suffix: String = (0..8)
        .map(|_| {
            const CHARS: &[u8] = b"abcdefghijklmnopqrstuvwxyz0123456789";
            CHARS[rand::random::<usize>() % CHARS.len()] as char
        })
        .collect();
    format!("{ts}-{suffix}")
}

fn expand_log_dir(path: &std::path::Path) -> anyhow::Result<PathBuf> {
    // `{state_dir}` is expanded by hand: the existing expand_path /
    // expand_path_with_base helpers do not support this placeholder (§3.2).
    // {state_dir} → kebab_config::Config::xdg_state_dir()
    let path_str = path.to_string_lossy();
    if path_str.contains("{state_dir}") {
        let state_dir = kebab_config::Config::xdg_state_dir();
        Ok(PathBuf::from(
            path_str.replace("{state_dir}", &state_dir.to_string_lossy()),
        ))
    } else {
        Ok(path.to_path_buf())
    }
}

#[derive(Serialize, Deserialize)]
#[serde(tag = "kind")]
pub enum LogEvent<'a> {
    #[serde(rename = "ocr")]
    Ocr {
        ts: String,
        doc_path: &'a str,
        page: u32,
        image_byte_size: Option<u64>,
        image_width: Option<u32>,
        image_height: Option<u32>,
        ms: u64,
        chars: u32,
        success: bool,
        reason: Option<&'a str>,
        ocr_engine: &'a str,
    },
    #[serde(rename = "parse_error")]
    ParseError {
        ts: String,
        doc_path: &'a str,
        reason: &'a str,
        message: &'a str,
    },
    #[serde(rename = "skip")]
    Skip {
        ts: String,
        doc_path: &'a str,
        reason: &'a str,
        detail: Option<&'a str>,
    },
    #[serde(rename = "error")]
    Error {
        ts: String,
        code: &'a str,
        message: &'a str,
    },
    #[serde(rename = "summary")]
    Summary {
        ts: String,
        run_id: String,
        scanned: u32,
        new: u32,
        errors: u32,
        ocr_pages: u32,
        ocr_failures: u32,
        ocr_p50_ms: Option<u64>,
        ocr_p90_ms: Option<u64>,
        ocr_max_ms: Option<u64>,
        duration_ms: u64,
    },
}

#[derive(Serialize, Deserialize)]
pub struct IngestSummary {
    pub ts: String,
    pub run_id: String,
    pub scanned: u32,
    pub new: u32,
    pub errors: u32,
    pub ocr_pages: u32,
    pub ocr_failures: u32,
    pub ocr_p50_ms: Option<u64>,
    pub ocr_p90_ms: Option<u64>,
    pub ocr_max_ms: Option<u64>,
    pub duration_ms: u64,
}
```

### §4.2 Emit hook integration points

**Hook 1**: `crates/kebab-app/src/lib.rs::ingest_with_config` (line 234)

- `IngestLogWriter::open(&cfg.logging)` at function entry → `Option<Arc<Mutex<IngestLogWriter>>>`
- `IngestLogWriter::flush()` at function exit (success / error)
- Wrap the `IngestLogWriter` in `Arc<Mutex<IngestLogWriter>>` and carry it in the `IngestOpts` of `ingest_with_config_opts()`
- **MEDIUM-1 (ownership)**: the emit_progress closure of `apply_pdf_ocr` captures a clone of the `Arc<Mutex<IngestLogWriter>>`, then locks and writes. The per-asset loop is single-threaded and synchronous, so a blocking lock is safe.

**Hook 2**: OCR progress event (pdf_ocr_apply.rs)

- **HIGH-1 (PdfOcrProgress::Finished extend)**: extend the `PdfOcrProgress::Finished` variant with additive fields:

```rust
PdfOcrProgress::Finished {
    page: u32,
    ms: u64,
    chars: u32,
    skipped: bool,
    // NEW:
    image_byte_size: Option<u64>,
    image_width: Option<u32>,
    image_height: Option<u32>,
    failure_reason: Option<String>, // "timeout" | "ocr_error" | "network_error" | None
}
```

- Inside the emit_progress closure, pass the ocr event to the log writer (`Arc<Mutex<IngestLogWriter>>`)
- Cascade impact on ingest_progress.v1 (`IngestEvent::PdfOcrFinished`): additive minor change (backward compatible)

**Hook 3**: Parse error (app.rs / ingest pipeline error path)

- Emit a `parse_error` record when `Error::PdfFormat` / `Error::ImageFormat` occurs

**Hook 4**: Skip event (kebab-source-fs/src/connector.rs)

- On a size_exceeded / gitignore / builtin_blacklist skip, pass a skip event to the log writer

**Hook 5**: Fatal error (**HIGH-4 location correction**)

- **Location**: the error return path of `crates/kebab-app/src/lib.rs::ingest_with_config_opts` (per-asset catch + final Err arm)
- Emit `LogEvent::Error` to the log writer immediately after invoking `classify(err, verbose)`
- `error_wire::classify` itself stays unchanged (a pure conversion with no side effects)

### §4.3 Timestamp format

**HIGH-3 (chrono → time crate)**: ISO 8601 UTC, millisecond precision:

- Example: `2026-05-28T01:30:01.123Z`
- Use `time::OffsetDateTime::now_utc().format(&time::format_description::well_known::Rfc3339)`. `Rfc3339` does not fix the number of fractional digits, so truncate the value to millisecond precision before formatting if exactly three digits are required.
- Always UTC (independent of the system timezone)
- The workspace already uses the `time` crate (this avoids a duplicate `chrono` dependency)

### §4.4 Backward compat

The `logging` field of the `Config` struct carries the `#[serde(default)]` attribute:

```rust
#[serde(default)]
pub logging: LoggingCfg,
```

→ When a pre-v0.20 config file has no `[logging]` section, the field is initialized to `LoggingCfg::default()` automatically (enabled=true, dir=~/.local/state/kebab/logs).

---

## §5 Acceptance criteria

- **AC-1**: The `[logging]` section is emitted with defaults (new config or `config init`).
- **AC-2**: After running `kebab ingest`, the file `{log_dir}/ingest-{run_id}.ndjson` exists.
- **AC-3**: Every line is valid JSON, with a `kind` enum value and an ISO 8601 `ts`.
- **AC-4**: OCR per-page records are present, and the summary record is the last line.
- **AC-5**: Every failure type (size_exceeded / parse_error / ocr timeout) is recorded.
- **AC-6**: With `ingest_log_enabled = false`, no log file is created.
- **AC-7**: With the override `ingest_log_dir = "/tmp/custom"`, the file is written to that path.
- **AC-8**: `cargo test -p kebab-app --lib ingest_log` and `cargo clippy` are green.
- **AC-9** (**MEDIUM-2 actionability**): integration test: `cargo test -p kebab-app --test ingest_log_smoke -j 4 2>&1 | tail -3` → 1 passed; 0 failed. Test body:
  1. tempdir + minimal corpus (1 markdown + 1 image PDF).
  2. ingest with `[logging] ingest_log_dir = tempdir/logs`.
  3. assert: log file `tempdir/logs/ingest-{run_id}.ndjson` exists.
  4. parse each line as JSON; assert the kinds include at least `ocr` and `summary`.
  5. last line kind = "summary" + scanned > 0 && ocr_pages > 0.
- **AC-10**: A pre-v0.20 config is compatible with the v0.20 binary (the missing `[logging]` section falls back to defaults).

---

## §6 Risks and open questions

### Risks

- **R-1**: Accumulated log files consume disk space; cleanup is left to the user. State this in a doc comment.
- **R-2**: run_id collision between concurrent ingests: mitigated by the timestamp plus 8-char random suffix (36^8 ≈ 2.8 × 10^12 possible suffixes per one-second timestamp), which makes a collision very unlikely.
- **R-3**: `{state_dir}` placeholder expansion: hand-rolled via xdg_state_dir + string replace. The existing `expand_path_with_base` does not support `{state_dir}` (LOW-2).
- **R-4**: If the error path panics, the log writer is dropped without an explicit flush; mitigated by calling `flush()` in the `Drop` impl.
- **R-5**: Workspace dependencies: `rand` (run_id suffix) and `time` (timestamps); kebab-app already depends on them. Adding `chrono` is not allowed (HIGH-3).

### Open questions

- **OQ-1** (**promoted to the HIGH-1 design decision**): source of image_byte_size + dimensions: carried on `PdfOcrProgress::Finished` (Option A adopted). Raster image measurement already started in the Bug #11 follow-up.
- **OQ-2**: How to compute `ocr_p50_ms` and `ocr_p90_ms`: use the `quantiles` crate or a simple sorted vec? (plan drafter decides)
- **OQ-3**: Where in the user-facing docs should the manual log cleanup policy be stated? (README / SMOKE / config example; plan drafter + executor decide)

---

## §7 References

- **Parent task**: `tasks/p10/p10-1A-5-ingest-failure-log.md`
- **Parent design**: `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` (§8 wire schema, §9 versioning cascade; both unchanged)
- **Bug #11 follow-up**: OCR image metric capture (raster path in kebab-parse-pdf)
- **Related**: `crates/kebab-app/src/error_wire.rs` (ErrorV1 emit), `crates/kebab-app/src/ingest_progress.rs` (IngestEvent)
- **Config**: `crates/kebab-config/src/lib.rs` (Config struct, expand_path helpers, xdg_state_dir)
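
### Appendix: sorted-vec percentile sketch (OQ-2)

To make the "simple sorted vec" option in OQ-2 concrete, here is a minimal sketch using the nearest-rank method. It is not a decision: the names `percentile_ms`, `OcrStats`, and `ocr_stats` are hypothetical and do not exist in the codebase. It assumes the caller passes durations of successful pages only, and it returns `None` for empty input to match the nullable summary fields.

```rust
/// Nearest-rank percentile over OCR durations in milliseconds.
/// Returns `None` for empty input (maps to `null` in the summary record).
/// Hypothetical helper, not part of the existing codebase.
pub fn percentile_ms(durations_ms: &[u64], pct: u8) -> Option<u64> {
    if durations_ms.is_empty() {
        return None;
    }
    let mut sorted = durations_ms.to_vec();
    sorted.sort_unstable();
    let n = sorted.len();
    // Nearest-rank: rank = ceil(pct / 100 * n), clamped to 1..=n.
    let rank = (usize::from(pct.min(100)) * n + 99) / 100;
    let idx = rank.clamp(1, n) - 1;
    Some(sorted[idx])
}

/// The three OCR timing fields of the `summary` record (hypothetical struct).
#[derive(Debug, PartialEq)]
pub struct OcrStats {
    pub p50_ms: Option<u64>,
    pub p90_ms: Option<u64>,
    pub max_ms: Option<u64>,
}

/// Compute p50 / p90 / max from the durations of successful pages.
pub fn ocr_stats(success_ms: &[u64]) -> OcrStats {
    OcrStats {
        p50_ms: percentile_ms(success_ms, 50),
        p90_ms: percentile_ms(success_ms, 90),
        max_ms: success_ms.iter().copied().max(),
    }
}
```

For the page counts expected in a single ingest run, sorting a copy of the vector per percentile is likely cheap enough; whether that beats pulling in the `quantiles` crate remains the plan drafter's call.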