The Phase C4 executor's final `fix(test): clippy + fmt fixes` commit applied fmt only to the test files. A workspace-wide fmt turned out to be missing → applied cargo fmt --all. All imports are reordered alphabetically and line wrapping is aligned.
Untracked artifacts committed at the same time:
- docs/superpowers/specs/2026-05-28-v0.20-ingest-log-spec.md (491 lines, ACCEPT)
- docs/superpowers/plans/2026-05-28-v0.20-ingest-log-plan.md (616 lines, ACCEPT)
workspace test: 1370 passed / 0 failed / 50 ignored, ingest_log_smoke green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v0.20.x ingest log feature — spec
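Motivation
After four rounds of dogfooding, the user explicitly requested:
"When failures happen in general, OCR failures included, I want to be able to look at the logs in detail. I think I need to see those statistics to find the right configuration sweet spot and design direction."
Core problem: there is no structured log surface that records ingest failures such as OCR timeouts, parse errors, and skips as individual events plus summary stats. To find the sweet spot (OCR timeout, model choice), the user needs per-page statistics and aggregate stats.
Surface choice: structured ndjson file (zero wire schema changes, internal write-only). The user can analyze it directly with grep/jq, with no dedicated query CLI.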
Scope
In scope
- [logging] config section (2 fields: ingest_log_enabled, ingest_log_dir)
- Per-ingest-run ndjson log file (ingest-{run_id}.ndjson, run_id = ISO 8601 + random suffix)
- 5 kinds of log record: ocr, parse_error, skip, error, summary
- OCR per-page events include the measured image byte size + dimensions
- Run ID + timestamps in ISO 8601 UTC format
- Backward compat: pre-v0.20 config files, which have no [logging] fields, keep loading; the missing section is ignored and defaults apply
Out of scope
- SQLite persistent storage (future enhancement)
- Log level switches (verbose / minimal): this round has a single level
- Log rotation / cleanup policy (user-managed)
- Query CLI (kebab logs)
- Wire schema changes (internal use only)
Design decisions
§3.1 Config schema — [logging] section
File: crates/kebab-config/src/lib.rs (the existing Config struct is at line 37+)
New struct LoggingCfg:
#[derive(Clone, Debug, Serialize, Deserialize, PartialEq)]
pub struct LoggingCfg {
    /// Automatically write a structured ndjson log during ingest. default = true.
    /// When false, no log file is created.
    #[serde(default = "default_ingest_log_enabled")]
    pub ingest_log_enabled: bool,
    /// Per-ingest-run log file directory. default = `{state_dir}/logs`.
    /// The `{state_dir}` placeholder expands to the XDG state dir.
    #[serde(default = "default_ingest_log_dir")]
    pub ingest_log_dir: PathBuf,
}

fn default_ingest_log_enabled() -> bool {
    true
}

fn default_ingest_log_dir() -> PathBuf {
    // After expansion = ~/.local/state/kebab/logs
    PathBuf::from("{state_dir}/logs")
}

impl Default for LoggingCfg {
    fn default() -> Self {
        Self {
            ingest_log_enabled: default_ingest_log_enabled(),
            ingest_log_dir: default_ingest_log_dir(),
        }
    }
}
Added to the Config struct (line 37+):
pub struct Config {
    // ... existing fields ...
    #[serde(default)]
    pub logging: LoggingCfg,
}
#[serde(default)] guarantees backward compat (a pre-v0.20 config file with no [logging] section is automatically initialized with the defaults).
§3.2 Per-ingest-run filename + ID generation
Filename format: ingest-{run_id}.ndjson
Run ID: ISO 8601 timestamp + 8-char random suffix
- Example: 20260528T013000Z-abc123de
- Format: YYYYMMDDTHHmmssZ-{8-char random alphanumeric}
- Sortable + readable + concurrent collision unlikely
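The run ID layout above can be sketched with the standard library alone. This is only an illustration under stated assumptions: format_run_id and civil_from_days are hypothetical names, the caller supplies the Unix timestamp and the random suffix, and the spec itself relies on the time and rand crates rather than hand-rolled date math.

```rust
/// Convert days since 1970-01-01 into (year, month, day) in the proleptic
/// Gregorian calendar, using the common "civil from days" arithmetic.
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468; // shift the epoch to 0000-03-01
    let era = z.div_euclid(146_097); // 400-year eras
    let doe = z.rem_euclid(146_097); // day of era, 0..=146096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // year of era
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, March-based
    let mp = (5 * doy + 2) / 153; // month index, March = 0
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    // January and February belong to the next civil year.
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}

/// Build `YYYYMMDDTHHmmssZ-{suffix}` from a UTC Unix timestamp (hypothetical helper).
fn format_run_id(unix_secs: u64, suffix: &str) -> String {
    let days = (unix_secs / 86_400) as i64;
    let secs_of_day = unix_secs % 86_400;
    let (year, month, day) = civil_from_days(days);
    format!(
        "{year:04}{month:02}{day:02}T{:02}{:02}{:02}Z-{suffix}",
        secs_of_day / 3_600,
        (secs_of_day / 60) % 60,
        secs_of_day % 60,
    )
}
```

Because every field is zero-padded and ordered from year down to second, plain lexicographic ordering of the file names matches chronological ordering, which is what makes the IDs sortable.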
Path expansion: the {state_dir} placeholder expands to ~/.local/state/kebab
- The {state_dir} substitution is hand-rolled separately (expand_path_with_base in kebab-config does not support {state_dir})
- Directory auto-create if missing
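A minimal sketch of the hand-rolled placeholder expansion, using only the standard library. The names expand_state_dir and resolve_state_dir are hypothetical; in the spec the state directory comes from kebab_config::Config::xdg_state_dir(), whose exact behavior is not shown here. The fallback to $HOME/.local/state follows the XDG Base Directory convention.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical resolver: `$XDG_STATE_HOME/kebab` when set and non-empty,
/// otherwise `$HOME/.local/state/kebab`. Values are passed in explicitly so
/// the function stays pure and easy to test.
fn resolve_state_dir(xdg_state_home: Option<&str>, home: &str) -> PathBuf {
    match xdg_state_home {
        Some(dir) if !dir.is_empty() => Path::new(dir).join("kebab"),
        _ => Path::new(home).join(".local/state/kebab"),
    }
}

/// Replace the `{state_dir}` placeholder in a configured path.
/// Paths without the placeholder are returned unchanged.
fn expand_state_dir(raw: &Path, state_dir: &Path) -> PathBuf {
    let raw = raw.to_string_lossy();
    let state = state_dir.to_string_lossy();
    PathBuf::from(raw.replace("{state_dir}", &state))
}
```

Keeping the environment lookup separate from the string substitution means an override such as /tmp/custom passes through untouched, while the default {state_dir}/logs resolves under the state directory.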
§3.3 Log content — ndjson schema
Each line is one JSON object (ndjson format). The kind field of each record is the union variant discriminator.
Record kind: ocr
OCR per-page attempt (success / failure). HIGH-1 (PdfOcrProgress::Finished extend): image_byte/dims/failure_reason are added to the PdfOcrProgress::Finished variant and carried from there.
{
  "ts": "2026-05-28T01:30:01.123Z",
  "kind": "ocr",
  "doc_path": "metro-korea.pdf",
  "page": 8,
  "image_byte_size": 989512,
  "image_width": 2480,
  "image_height": 3508,
  "ms": 180000,
  "chars": 0,
  "success": false,
  "reason": "timeout",
  "ocr_engine": "ollama-vision"
}
Fields:
- ts: ISO 8601 UTC (milliseconds, using the time crate)
- kind: literal "ocr"
- doc_path: relative or absolute source path
- page: page number (0-indexed or 1-indexed; confirm in impl)
- image_byte_size: raster image byte size (from PdfOcrProgress::Finished.image_byte_size)
- image_width, image_height: raster dimensions in pixels (from PdfOcrProgress::Finished.image_width/height)
- ms: OCR duration in milliseconds
- chars: extracted character count (0 if failed)
- success: boolean, the negation of PdfOcrProgress::Finished.failure_reason.is_some() (failure_reason Some → false)
- reason: "timeout" | "network_error" | "malformed_image" | "other" (from PdfOcrProgress::Finished.failure_reason)
- ocr_engine: "ollama-vision" or the configured name
Record kind: parse_error
PDF / other format parse failure.
{
  "ts": "2026-05-28T01:30:02.456Z",
  "kind": "parse_error",
  "doc_path": "weird.pdf",
  "reason": "lopdf_error",
  "message": "unexpected EOF in xref table"
}
Record kind: skip
Document skip (size_exceeded / gitignore / builtin_blacklist).
{
  "ts": "2026-05-28T01:30:03.789Z",
  "kind": "skip",
  "doc_path": "large.zip",
  "reason": "builtin_blacklist",
  "detail": ".zip extension"
}
Record kind: error
Fatal error emit (written to the log at the same moment error.v1 is emitted).
{
  "ts": "2026-05-28T01:30:04.012Z",
  "kind": "error",
  "code": "config_not_found",
  "message": "config file does not exist: /tmp/config.toml"
}
Record kind: summary
Final aggregate stats of the ingest run (last line).
{
  "ts": "2026-05-28T01:31:00.000Z",
  "kind": "summary",
  "run_id": "20260528T013000Z-abc123de",
  "scanned": 11,
  "new": 11,
  "errors": 0,
  "ocr_pages": 21,
  "ocr_failures": 2,
  "ocr_p50_ms": 1500,
  "ocr_p90_ms": 63000,
  "ocr_max_ms": 180007,
  "duration_ms": 555550
}
Fields:
- run_id: unique ID of this ingest run
- scanned: total number of docs attempted
- new: number of new docs actually ingested
- errors: fatal error count
- ocr_pages: total number of pages on which OCR was attempted
- ocr_failures: number of pages on which OCR failed
- ocr_p50_ms, ocr_p90_ms, ocr_max_ms: percentiles + max (successful pages only; may be null)
- duration_ms: elapsed time of the entire run
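One way the latency fields could be computed is the plain sorted-vec option mentioned in OQ-2. The sketch below assumes the nearest-rank percentile definition, which the spec does not fix; OcrLatencyStats, percentile_ms, and ocr_latency_stats are hypothetical names.

```rust
/// Latency stats over successful OCR pages; every field is None when no page
/// succeeded, which maps to JSON null in the summary record.
#[derive(Debug, PartialEq)]
struct OcrLatencyStats {
    p50_ms: Option<u64>,
    p90_ms: Option<u64>,
    max_ms: Option<u64>,
}

/// Nearest-rank percentile over an already sorted slice.
/// rank = ceil(pct / 100 * n), clamped to 1..=n.
fn percentile_ms(sorted: &[u64], pct: usize) -> Option<u64> {
    if sorted.is_empty() {
        return None;
    }
    let n = sorted.len();
    let rank = ((pct * n + 99) / 100).clamp(1, n);
    Some(sorted[rank - 1])
}

/// Sort a copy of the per-page durations once and derive all three fields.
fn ocr_latency_stats(success_ms: &[u64]) -> OcrLatencyStats {
    let mut sorted = success_ms.to_vec();
    sorted.sort_unstable();
    OcrLatencyStats {
        p50_ms: percentile_ms(&sorted, 50),
        p90_ms: percentile_ms(&sorted, 90),
        max_ms: sorted.last().copied(),
    }
}
```

With only a handful of pages per run, sorting a small vector is cheap, so an extra crate may not be needed; the choice remains with the plan drafter as stated in OQ-2.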
§4 Implementation specification
§4.1 New module: crates/kebab-app/src/ingest_log.rs
use std::fs::File;
use std::io::{BufWriter, Write};
use std::path::PathBuf;
use std::time::SystemTime;
use serde::{Serialize, Deserialize};
pub struct IngestLogWriter {
    file: Option<BufWriter<File>>,
    run_id: String,
    started_at: SystemTime,
}

impl IngestLogWriter {
    /// Open the log file. Returns None when `cfg.ingest_log_enabled == false`.
    pub fn open(cfg: &kebab_config::LoggingCfg) -> anyhow::Result<Option<Self>> {
        if !cfg.ingest_log_enabled {
            return Ok(None);
        }
        let run_id = generate_run_id();
        let log_dir = expand_log_dir(&cfg.ingest_log_dir)?;
        std::fs::create_dir_all(&log_dir)?;
        let path = log_dir.join(format!("ingest-{run_id}.ndjson"));
        let file = BufWriter::new(File::create(&path)?);
        Ok(Some(Self {
            file: Some(file),
            run_id,
            started_at: SystemTime::now(),
        }))
    }

    /// Write one event as a single ndjson line.
    pub fn write_event(&mut self, event: &LogEvent) -> anyhow::Result<()> {
        if let Some(ref mut f) = self.file {
            serde_json::to_writer(&mut *f, event)?;
            writeln!(f)?;
        }
        Ok(())
    }

    /// Write the final summary record (expected to be the last line).
    pub fn write_summary(&mut self, summary: &IngestSummary) -> anyhow::Result<()> {
        if let Some(ref mut f) = self.file {
            serde_json::to_writer(&mut *f, summary)?;
            writeln!(f)?;
        }
        Ok(())
    }

    pub fn flush(&mut self) -> anyhow::Result<()> {
        if let Some(ref mut f) = self.file {
            f.flush()?;
        }
        Ok(())
    }

    pub fn run_id(&self) -> &str {
        &self.run_id
    }
}

impl Drop for IngestLogWriter {
    fn drop(&mut self) {
        // Best-effort flush; errors cannot be reported from Drop.
        let _ = self.flush();
    }
}
fn generate_run_id() -> String {
    use time::format_description;
    use time::OffsetDateTime;

    let now = OffsetDateTime::now_utc();
    // Format: 20260528T013000Z (compact ISO 8601)
    let ts = now
        .format(
            &format_description::parse("[year][month][day]T[hour][minute][second]Z")
                .expect("format_description is valid"),
        )
        .expect("format should succeed");
    // 8-char lowercase alphanumeric suffix. u32 is sampled instead of usize
    // because not every rand release can generate usize values directly.
    let suffix: String = (0..8)
        .map(|_| {
            const CHARS: &[u8] = b"abcdefghijklmnopqrstuvwxyz0123456789";
            CHARS[rand::random::<u32>() as usize % CHARS.len()] as char
        })
        .collect();
    format!("{ts}-{suffix}")
}
fn expand_log_dir(path: &std::path::Path) -> anyhow::Result<PathBuf> {
    // The existing expand_path / expand_path_with_base do not support {state_dir},
    // so the placeholder is replaced by hand using kebab_config::Config::xdg_state_dir().
    let path_str = path.to_string_lossy();
    if path_str.contains("{state_dir}") {
        let state_dir = kebab_config::Config::xdg_state_dir();
        // Return an error instead of panicking on a non-UTF-8 state dir.
        let state_dir_str = state_dir
            .to_str()
            .ok_or_else(|| anyhow::anyhow!("state dir is not valid UTF-8"))?;
        Ok(PathBuf::from(path_str.replace("{state_dir}", state_dir_str)))
    } else {
        Ok(path.to_path_buf())
    }
}
#[derive(Serialize, Deserialize)]
#[serde(tag = "kind")]
pub enum LogEvent<'a> {
    #[serde(rename = "ocr")]
    Ocr {
        ts: String,
        doc_path: &'a str,
        page: u32,
        image_byte_size: Option<u64>,
        image_width: Option<u32>,
        image_height: Option<u32>,
        ms: u64,
        chars: u32,
        success: bool,
        reason: Option<&'a str>,
        ocr_engine: &'a str,
    },
    #[serde(rename = "parse_error")]
    ParseError {
        ts: String,
        doc_path: &'a str,
        reason: &'a str,
        message: &'a str,
    },
    #[serde(rename = "skip")]
    Skip {
        ts: String,
        doc_path: &'a str,
        reason: &'a str,
        detail: Option<&'a str>,
    },
    #[serde(rename = "error")]
    Error {
        ts: String,
        code: &'a str,
        message: &'a str,
    },
    #[serde(rename = "summary")]
    Summary {
        ts: String,
        run_id: String,
        scanned: u32,
        new: u32,
        errors: u32,
        ocr_pages: u32,
        ocr_failures: u32,
        ocr_p50_ms: Option<u64>,
        ocr_p90_ms: Option<u64>,
        ocr_max_ms: Option<u64>,
        duration_ms: u64,
    },
}
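// write_summary serializes this struct on its own, so it carries its own
// "kind": "summary" tag to match the summary record schema in §3.3.
#[derive(Serialize, Deserialize)]
#[serde(tag = "kind", rename = "summary")]
pub struct IngestSummary {
    pub ts: String,
    pub run_id: String,
    pub scanned: u32,
    pub new: u32,
    pub errors: u32,
    pub ocr_pages: u32,
    pub ocr_failures: u32,
    pub ocr_p50_ms: Option<u64>,
    pub ocr_p90_ms: Option<u64>,
    pub ocr_max_ms: Option<u64>,
    pub duration_ms: u64,
}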
§4.2 Emit hook integration points
Hook 1: crates/kebab-app/src/lib.rs::ingest_with_config (line 234)
- IngestLogWriter::open(&cfg.logging) at function entry → Option<Arc<Mutex<IngestLogWriter>>>
- IngestLogWriter::flush() at function exit (success / error)
- Wrap IngestLogWriter in Arc<Mutex<_>> and carry it in the IngestOpts of ingest_with_config_opts()
- MEDIUM-1 (ownership): the emit_progress closure of apply_pdf_ocr captures a clone of the Arc<Mutex<IngestLogWriter>>, then locks + writes. The per-asset loop is single-threaded and synchronous, so a blocking lock is safe.
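The ownership pattern in MEDIUM-1 can be sketched with the standard library. MemoryLog stands in for IngestLogWriter and apply_ocr_pages for apply_pdf_ocr; both are hypothetical simplifications that collect lines in memory instead of writing a file.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for IngestLogWriter: keeps ndjson lines in memory.
#[derive(Default)]
struct MemoryLog {
    lines: Vec<String>,
}

impl MemoryLog {
    fn write_line(&mut self, line: String) {
        self.lines.push(line);
    }
}

/// Stand-in for the per-asset OCR loop: calls `emit_progress` once per page
/// with a fake duration of 100 ms per page number.
fn apply_ocr_pages(pages: u32, mut emit_progress: impl FnMut(u32, u64)) {
    for page in 1..=pages {
        emit_progress(page, 100 * u64::from(page));
    }
}

/// The caller keeps one handle and the closure owns a clone of the Arc.
fn run_with_shared_log(pages: u32) -> Vec<String> {
    let log = Arc::new(Mutex::new(MemoryLog::default()));
    let log_for_closure = Arc::clone(&log);
    apply_ocr_pages(pages, move |page, ms| {
        // Blocking lock: acceptable because the loop is single-threaded and sync.
        if let Ok(mut guard) = log_for_closure.lock() {
            guard.write_line(format!(r#"{{"kind":"ocr","page":{page},"ms":{ms}}}"#));
        }
    });
    // The closure and its Arc clone are gone by now; read the lines back.
    let lines = log.lock().expect("log mutex poisoned").lines.clone();
    lines
}
```

The caller's handle stays usable after the loop, which is what lets ingest_with_config write the summary and flush at function exit.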
Hook 2: OCR progress event (pdf_ocr_apply.rs)
- HIGH-1 (PdfOcrProgress::Finished extend): extend the PdfOcrProgress::Finished variant with additive fields:
  PdfOcrProgress::Finished {
      page: u32,
      ms: u64,
      chars: u32,
      skipped: bool,
      // NEW:
      image_byte_size: Option<u64>,
      image_width: Option<u32>,
      image_height: Option<u32>,
      failure_reason: Option<String>, // "timeout" | "ocr_error" | "network_error" | None
  }
- Inside the emit_progress closure, pass the ocr event to the log writer (Arc<Mutex>)
- Cascade impact on ingest_progress.v1 (IngestEvent::PdfOcrFinished): additive minor (backward compat)
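How the ocr record fields follow from the extended variant can be sketched as below. OcrFinished is a hypothetical, trimmed mirror of PdfOcrProgress::Finished (width and height are left out for brevity), and OcrOutcome and ocr_outcome are illustrative names rather than part of the spec.

```rust
/// Hypothetical, trimmed mirror of the extended PdfOcrProgress::Finished payload.
struct OcrFinished {
    page: u32,
    ms: u64,
    chars: u32,
    image_byte_size: Option<u64>,
    failure_reason: Option<String>,
}

/// The subset of `ocr` record fields derived from a finished page.
#[derive(Debug, PartialEq)]
struct OcrOutcome {
    page: u32,
    ms: u64,
    chars: u32,
    image_byte_size: Option<u64>,
    success: bool,
    reason: Option<String>,
}

/// success is the negation of failure_reason.is_some(); chars is forced to 0
/// on failure so the record matches the field description in §3.3.
fn ocr_outcome(fin: &OcrFinished) -> OcrOutcome {
    let success = fin.failure_reason.is_none();
    OcrOutcome {
        page: fin.page,
        ms: fin.ms,
        chars: if success { fin.chars } else { 0 },
        image_byte_size: fin.image_byte_size,
        success,
        reason: fin.failure_reason.clone(),
    }
}
```

Because the new fields are all Option values, producers that do not measure the raster image can leave them as None without changing existing behavior.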
Hook 3: Parse error (app.rs / ingest pipeline error path)
- Emit a parse_error record when Error::PdfFormat / Error::ImageFormat occurs
Hook 4: Skip event (kebab-source-fs/src/connector.rs)
- On a size_exceeded / gitignore / builtin_blacklist skip, pass a skip event to the log writer
Hook 5: Fatal error (HIGH-4 location correction)
- Location: the error return path of crates/kebab-app/src/lib.rs::ingest_with_config_opts (per-asset catch + final Err arm)
- Emit LogEvent::Error to the log writer right after invoking classify(err, verbose)
- error_wire::classify itself stays unchanged (a pure conversion with no side effects)
§4.3 Timestamp format
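HIGH-3 (chrono → time crate): ISO 8601 UTC, milliseconds precision:
- Example: 2026-05-28T01:30:01.123Z
- Use time::OffsetDateTime::now_utc().format(&time::format_description::well_known::Rfc3339). Rfc3339 prints as many fractional digits as the value carries (up to nanoseconds), so a fixed three-digit fraction likely needs a custom format description with [subsecond digits:3] instead.
- Always UTC (independent of the system timezone)
- The workspace already uses the time crate (avoids a duplicate chrono dependency)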
§4.4 Backward compat
The logging field of the Config struct carries the #[serde(default)] attribute:
#[serde(default)]
pub logging: LoggingCfg,
→ a pre-v0.20 config file with no [logging] section is automatically initialized with LoggingCfg::default() (enabled=true, dir={state_dir}/logs, which expands to ~/.local/state/kebab/logs)
§5 Acceptance criteria
- AC-1: the [logging] section is emitted with defaults (new config or config init).
- AC-2: after running kebab ingest, the file {log_dir}/ingest-{run_id}.ndjson exists.
- AC-3: each line is valid JSON + has a kind enum value + has ts in ISO 8601.
- AC-4: OCR per-page records + a summary record (last line).
- AC-5: every failure type (size_exceeded / parse_error / ocr timeout) is recorded.
- AC-6: with ingest_log_enabled = false, no log file is created.
- AC-7: with the override ingest_log_dir = "/tmp/custom", the file is emitted under that path.
- AC-8: cargo test -p kebab-app --lib ingest_log + cargo clippy green.
- AC-9 (MEDIUM-2 actionability): integration test: cargo test -p kebab-app --test ingest_log_smoke -j 4 2>&1 | tail -3 → 1 passed; 0 failed. Test body:
  - tempdir + minimal corpus (1 markdown + 1 image PDF).
  - ingest with [logging] ingest_log_dir = tempdir/logs.
  - assert: log file tempdir/logs/ingest-{run_id}.ndjson exists.
  - parse each line as JSON, assert kinds = [ocr, summary] or more.
  - last line kind = "summary" + scanned > 0 && ocr_pages > 0.
- AC-10: compatibility of a pre-v0.20 config with the v0.20 binary (the absence of the new [logging] fields is ignored).
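The structural checks in AC-3, AC-4, and AC-9 can be approximated with a deliberately naive, std-only sketch. extract_kind and check_log are hypothetical helpers; a real smoke test would more plausibly parse each line with serde_json and also assert scanned > 0 and ocr_pages > 0, which this sketch does not cover.

```rust
/// Naively pull the value of the "kind" field out of one ndjson line.
/// Assumes the kind value contains no escaped quotes, which holds for the
/// five kinds defined in this spec.
fn extract_kind(line: &str) -> Option<&str> {
    let after_key = line.split_once("\"kind\"")?.1;
    let after_colon = after_key.trim_start().strip_prefix(':')?;
    let value = after_colon.trim_start().strip_prefix('"')?;
    value.split_once('"').map(|(kind, _)| kind)
}

/// Check that every non-empty line has a known kind and that the last line
/// is the summary record.
fn check_log(text: &str) -> Result<(), String> {
    const KINDS: [&str; 5] = ["ocr", "parse_error", "skip", "error", "summary"];
    let lines: Vec<&str> = text.lines().filter(|l| !l.trim().is_empty()).collect();
    for (i, line) in lines.iter().enumerate() {
        let kind = extract_kind(line).ok_or_else(|| format!("line {}: no kind", i + 1))?;
        if !KINDS.contains(&kind) {
            return Err(format!("line {}: unknown kind {kind}", i + 1));
        }
    }
    match lines.last().copied().and_then(extract_kind) {
        Some("summary") => Ok(()),
        _ => Err("last line is not a summary record".to_string()),
    }
}
```

The same shape of check is easy to reproduce from the shell with grep or jq, which matches the stated goal of keeping the log analyzable without a dedicated CLI.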
§6 Risks + open questions
Risks
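- R-1: Disk usage from accumulated log files: the user cleans them up. State this in a doc comment.
- R-2: run_id collision between concurrent ingests: practically impossible with the 8-char random suffix + timestamp. Mitigated.
- R-3: {state_dir} placeholder expansion: hand-rolled via xdg_state_dir + string replace. The existing expand_path_with_base does not support {state_dir} (LOW-2).
- R-4: If the error path panics, the log writer is dropped without an explicit flush: mitigated by calling flush() in the Drop impl (Drop runs only when the panic unwinds, not under panic = "abort").
- R-5: Workspace dependencies: rand (run_id suffix) and time (timestamp) are already dependencies of kebab-app. Adding chrono is prohibited (HIGH-3).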
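The R-4 mitigation can be illustrated with a small std-only sketch. FlushOnDrop is a hypothetical stand-in for IngestLogWriter, and the panic is simulated with catch_unwind. Note that BufWriter also attempts a flush in its own Drop, so the explicit call mainly makes the intent visible.

```rust
use std::fs::{self, File};
use std::io::{BufWriter, Write};
use std::path::Path;

/// Hypothetical stand-in for IngestLogWriter: buffered lines, flushed on drop.
struct FlushOnDrop {
    file: BufWriter<File>,
}

impl FlushOnDrop {
    fn create(path: &Path) -> std::io::Result<Self> {
        Ok(Self { file: BufWriter::new(File::create(path)?) })
    }

    fn write_line(&mut self, line: &str) -> std::io::Result<()> {
        writeln!(self.file, "{line}")
    }
}

impl Drop for FlushOnDrop {
    fn drop(&mut self) {
        // Best-effort: errors cannot be propagated out of Drop.
        let _ = self.file.flush();
    }
}

/// Write one buffered line, panic before any explicit flush, and return what
/// reached the file once the writer was dropped during unwinding.
fn contents_after_panic(path: &Path) -> std::io::Result<String> {
    let target = path.to_path_buf();
    let result = std::panic::catch_unwind(move || {
        let mut writer = FlushOnDrop::create(&target).expect("create log file");
        writer.write_line(r#"{"kind":"error"}"#).expect("buffered write");
        panic!("simulated failure on the error path");
    });
    assert!(result.is_err());
    fs::read_to_string(path)
}
```

The line survives because unwinding drops the writer; a build configured to abort on panic would skip Drop entirely, so the mitigation only covers the unwinding case.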
Open questions
- OQ-1 (promoted to a design decision under HIGH-1): source of image_byte_size + dimensions: carried on PdfOcrProgress::Finished (Option A adopted). Raster image measurement already started in the Bug #11 follow-up.
- OQ-2: computing ocr_p50_ms, ocr_p90_ms: use the quantiles crate or a simple sorted vec? (decided by the plan drafter)
- OQ-3: where in the user-facing docs should the manual log file cleanup policy be stated? (README / SMOKE / config example; decided by the plan drafter + executor)
§7 References
- Parent task: tasks/p10/p10-1A-5-ingest-failure-log.md
- Parent design: docs/superpowers/specs/2026-04-27-kebab-final-form-design.md (§8 wire schema, §9 versioning cascade; zero changes)
- Bug #11 follow-up: OCR image metric capture (raster path in kebab-parse-pdf)
- Related: crates/kebab-app/src/error_wire.rs (ErrorV1 emit), crates/kebab-app/src/ingest_progress.rs (IngestEvent)
- Config: crates/kebab-config/src/lib.rs (Config struct, expand_path helpers, xdg_state_dir)