kebab/docs/superpowers/specs/2026-05-28-v0.20-ingest-log-spec.md
altair823 685007789a style: cargo fmt --all (round 4 ingest log feature follow-up)
The last `fix(test): clippy + fmt fixes` commit from the Phase C4 executor applied fmt
only to the test file. A workspace-wide fmt had been missed, so this commit applies
cargo fmt --all. All imports are reordered alphabetically and line wrapping is made consistent.

Additional untracked artifacts committed at the same time:
- docs/superpowers/specs/2026-05-28-v0.20-ingest-log-spec.md (491 lines, ACCEPT)
- docs/superpowers/plans/2026-05-28-v0.20-ingest-log-plan.md (616 lines, ACCEPT)

workspace test: 1370 passed / 0 failed / 50 ignored, ingest_log_smoke green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 04:18:40 +00:00


title: "v0.20.x ingest log feature — spec"
date: 2026-05-28
status: "DRAFT (round r1c)"
target_version: 0.20.x
phase: A4 (spec drafter rewrite)
parent_spec: ../../../tasks/p10/p10-1A-5-ingest-failure-log.md
contract_sections: []
references: [2026-04-27-kebab-final-form-design.md, pre-v0.20 dogfood Bug #15 (ocr timeout + log analysis), 2026-05-28 closure critic result]

v0.20.x ingest log feature — spec

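Motivation

After four rounds of dogfooding, the user explicitly requested:

I'd like to be able to see detailed logs whenever a failure occurs, OCR failures included. I think I need those statistics to find the right configuration sweet spot and design direction.

Core problem: there is no structured log surface that records ingest failures such as OCR timeouts, parse errors, and skips as individual events plus summary stats. To find the sweet spot (OCR timeout, model choice), the user needs per-page statistics and aggregate stats.

Surface choice: a structured ndjson file (zero wire schema changes, internal write-only). The user can analyze it directly with grep/jq, without a separate CLI query command.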


Scope

In scope

  • [logging] config section (2 fields: ingest_log_enabled, ingest_log_dir)
  • Per-ingest-run ndjson log file (ingest-{run_id}.ndjson, run_id = ISO 8601 timestamp + random suffix)
  • 5 kinds of log record: ocr, parse_error, skip, error, summary
  • OCR per-page events include the measured image byte size and dimensions
  • Run ID and timestamps in ISO 8601 UTC format
  • Backward compat: pre-v0.20 config files, which lack the new [logging] fields, keep working (defaults apply)

Out of scope

  • SQLite 영구 저장 (future enhancement)
  • Log level switches (verbose / minimal): this round has a single level
  • Log rotation / cleanup policy (user-managed)
  • Query CLI (kebab logs)
  • Wire schema 변경 (internal use only)

Design decisions

§3.1 Config schema — [logging] section

File: crates/kebab-config/src/lib.rs (the existing Config struct starts around line 37)

New struct LoggingCfg:

#[derive(Clone, Debug, Serialize, Deserialize, PartialEq)]
pub struct LoggingCfg {
    /// Automatically write a structured ndjson log during ingest. Default = true.
    /// When false, no log file is created.
    #[serde(default = "default_ingest_log_enabled")]
    pub ingest_log_enabled: bool,

    /// Per-ingest-run log file directory. Default = `{state_dir}/logs`.
    /// The `{state_dir}` placeholder expands to the XDG state dir.
    #[serde(default = "default_ingest_log_dir")]
    pub ingest_log_dir: PathBuf,
}

fn default_ingest_log_enabled() -> bool { true }
fn default_ingest_log_dir() -> PathBuf {
    // After expansion = ~/.local/state/kebab/logs
    PathBuf::from("{state_dir}/logs")
}

impl Default for LoggingCfg {
    fn default() -> Self {
        Self {
            ingest_log_enabled: default_ingest_log_enabled(),
            ingest_log_dir: default_ingest_log_dir(),
        }
    }
}

Addition to the Config struct (line 37+):

pub struct Config {
    // ... existing fields ...
    #[serde(default)]
    pub logging: LoggingCfg,
}

#[serde(default)] guarantees backward compat (when a pre-v0.20 config file has no [logging] section, the field is initialized to its default automatically).

§3.2 Per-ingest-run filename + ID generation

Filename format: ingest-{run_id}.ndjson

Run ID: ISO 8601 timestamp + 8-char random suffix

  • Example: 20260528T013000Z-abc123de
  • Format: YYYYMMDDTHHmmssZ-{8-char random alphanumeric}
  • Sortable + readable + concurrent collision unlikely

Path expansion: the {state_dir} placeholder expands to ~/.local/state/kebab

  • {state_dir} substitution is hand-rolled separately (kebab-config's expand_path_with_base does not support {state_dir})
  • Directory auto-create if missing
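The filename and run ID shape described in §3.2 can be checked mechanically, for instance from a test that looks for the log file of a run. The following is a minimal std-only sketch, not part of the spec: the helper names `run_id_from_file_name` and `is_valid_run_id` are hypothetical, and the suffix alphabet (lowercase letters and digits) is an assumption taken from the example ID.

```rust
/// Hypothetical helper: extracts the run ID from a log file name of the form
/// `ingest-{run_id}.ndjson`. Returns None when the name or the ID shape does not match.
pub fn run_id_from_file_name(name: &str) -> Option<&str> {
    let run_id = name.strip_prefix("ingest-")?.strip_suffix(".ndjson")?;
    is_valid_run_id(run_id).then_some(run_id)
}

/// Checks the `YYYYMMDDTHHmmssZ-{8 chars}` shape.
/// Assumption: the suffix uses lowercase ASCII letters and digits only.
/// Only the shape is checked, not whether the digits form a real calendar date.
pub fn is_valid_run_id(run_id: &str) -> bool {
    let bytes = run_id.as_bytes();
    if bytes.len() != 25 {
        return false;
    }
    bytes.iter().enumerate().all(|(i, b)| match i {
        8 => *b == b'T',
        15 => *b == b'Z',
        16 => *b == b'-',
        0..=14 => b.is_ascii_digit(),
        _ => b.is_ascii_lowercase() || b.is_ascii_digit(),
    })
}
```

Because the timestamp part is fixed-width and comes first, plain string ordering of run IDs (and therefore of file names) follows chronological order at one-second granularity.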

§3.3 Log content — ndjson schema

Each line is one JSON object (ndjson format). Each record's kind field is the union variant discriminator.

Record kind: ocr

OCR per-page attempt (success / failure). HIGH-1 (PdfOcrProgress::Finished extend): image_byte_size, the image dimensions, and failure_reason are added to the PdfOcrProgress::Finished variant and carried from there.

{
  "ts": "2026-05-28T01:30:01.123Z",
  "kind": "ocr",
  "doc_path": "metro-korea.pdf",
  "page": 8,
  "image_byte_size": 989512,
  "image_width": 2480,
  "image_height": 3508,
  "ms": 180000,
  "chars": 0,
  "success": false,
  "reason": "timeout",
  "ocr_engine": "ollama-vision"
}

Fields:

  • ts: ISO 8601 UTC (milliseconds, using time crate)
  • kind: literal "ocr"
  • doc_path: relative or absolute source path
  • page: page number (0-indexed or 1-indexed; confirm in impl)
  • image_byte_size: raster image byte size (from PdfOcrProgress::Finished.image_byte_size)
  • image_width, image_height: raster dimensions pixel (from PdfOcrProgress::Finished.image_width/height)
  • ms: OCR duration milliseconds
  • chars: extracted character count (0 if failed)
  • success: boolean, true exactly when PdfOcrProgress::Finished.failure_reason is None (Some → false)
  • reason: "timeout" | "network_error" | "malformed_image" | "other" (from PdfOcrProgress::Finished.failure_reason)
  • ocr_engine: "ollama-vision" or config'd name

Record kind: parse_error

PDF / other format parse failure.

{
  "ts": "2026-05-28T01:30:02.456Z",
  "kind": "parse_error",
  "doc_path": "weird.pdf",
  "reason": "lopdf_error",
  "message": "unexpected EOF in xref table"
}

Record kind: skip

Document skip (size_exceeded / gitignore / builtin_blacklist).

{
  "ts": "2026-05-28T01:30:03.789Z",
  "kind": "skip",
  "doc_path": "large.zip",
  "reason": "builtin_blacklist",
  "detail": ".zip extension"
}

Record kind: error

Fatal error emission (written to the log at the same time error.v1 is emitted).

{
  "ts": "2026-05-28T01:30:04.012Z",
  "kind": "error",
  "code": "config_not_found",
  "message": "config file does not exist: /tmp/config.toml"
}

Record kind: summary

Final aggregate stats for the ingest run (last line).

{
  "ts": "2026-05-28T01:31:00.000Z",
  "kind": "summary",
  "run_id": "20260528T013000Z-abc123de",
  "scanned": 11,
  "new": 11,
  "errors": 0,
  "ocr_pages": 21,
  "ocr_failures": 2,
  "ocr_p50_ms": 1500,
  "ocr_p90_ms": 63000,
  "ocr_max_ms": 180007,
  "duration_ms": 555550
}

Fields:

  • run_id: unique ID of this ingest run
  • scanned: total number of docs attempted
  • new: number of new docs actually ingested
  • errors: fatal error count
  • ocr_pages: total number of pages on which OCR was attempted
  • ocr_failures: number of pages on which OCR failed
  • ocr_p50_ms, ocr_p90_ms, ocr_max_ms: percentiles + max (successful pages only; may be null)
  • duration_ms: elapsed time of the entire run
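The percentile fields can be computed from a plain sorted vector of per-page durations. The sketch below assumes the nearest-rank definition of a percentile, which the spec does not prescribe; the names `percentile_ms`, `OcrLatency`, and `ocr_latency` are hypothetical. An empty input yields None, which corresponds to null in the summary record.

```rust
/// Nearest-rank percentile over durations in milliseconds.
/// Assumption: nearest-rank (rank = ceil(pct / 100 * n)); the spec leaves the method open.
/// Returns None for an empty input or a percentile above 100.
pub fn percentile_ms(durations_ms: &[u64], pct: u32) -> Option<u64> {
    if durations_ms.is_empty() || pct > 100 {
        return None;
    }
    let mut sorted = durations_ms.to_vec();
    sorted.sort_unstable();
    let n = sorted.len();
    // Integer ceiling of pct * n / 100, clamped so that pct = 0 maps to the minimum.
    let rank = (pct as usize * n + 99) / 100;
    Some(sorted[rank.clamp(1, n) - 1])
}

/// Hypothetical container for the three latency fields of the summary record.
#[derive(Debug, PartialEq)]
pub struct OcrLatency {
    pub p50_ms: Option<u64>,
    pub p90_ms: Option<u64>,
    pub max_ms: Option<u64>,
}

/// Builds the latency stats from the durations of successful pages only.
pub fn ocr_latency(successful_ms: &[u64]) -> OcrLatency {
    OcrLatency {
        p50_ms: percentile_ms(successful_ms, 50),
        p90_ms: percentile_ms(successful_ms, 90),
        max_ms: successful_ms.iter().copied().max(),
    }
}
```

With only a handful of pages per run, the higher percentiles collapse onto the maximum under nearest-rank, so p90 is informative mainly for larger corpora.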

§4 Implementation specification

§4.1 New module: crates/kebab-app/src/ingest_log.rs

use std::fs::File;
use std::io::{BufWriter, Write};
use std::path::PathBuf;
use std::time::SystemTime;
use serde::{Serialize, Deserialize};

pub struct IngestLogWriter {
    file: Option<BufWriter<File>>,
    run_id: String,
    started_at: SystemTime,
}

impl IngestLogWriter {
    /// Open log file. Returns None when `cfg.ingest_log_enabled == false`.
    pub fn open(cfg: &kebab_config::LoggingCfg) -> anyhow::Result<Option<Self>> {
        if !cfg.ingest_log_enabled {
            return Ok(None);
        }
        let run_id = generate_run_id();
        let log_dir = expand_log_dir(&cfg.ingest_log_dir)?;
        std::fs::create_dir_all(&log_dir)?;
        let path = log_dir.join(format!("ingest-{run_id}.ndjson"));
        let file = BufWriter::new(File::create(&path)?);
        Ok(Some(Self {
            file: Some(file),
            run_id,
            started_at: SystemTime::now(),
        }))
    }

    pub fn write_event(&mut self, event: &LogEvent) -> anyhow::Result<()> {
        if let Some(ref mut f) = self.file {
            serde_json::to_writer(&mut *f, event)?;
            writeln!(f)?;
        }
        Ok(())
    }

    pub fn write_summary(&mut self, summary: &IngestSummary) -> anyhow::Result<()> {
        if let Some(ref mut f) = self.file {
            serde_json::to_writer(&mut *f, summary)?;
            writeln!(f)?;
        }
        Ok(())
    }

    pub fn flush(&mut self) -> anyhow::Result<()> {
        if let Some(ref mut f) = self.file {
            f.flush()?;
        }
        Ok(())
    }

    pub fn run_id(&self) -> &str {
        &self.run_id
    }
}

impl Drop for IngestLogWriter {
    fn drop(&mut self) {
        let _ = self.flush();
    }
}

fn generate_run_id() -> String {
    use time::format_description;
    use time::OffsetDateTime;
    // Compact ISO 8601 timestamp, e.g. 20260528T013000Z.
    let format = format_description::parse("[year][month][day]T[hour][minute][second]Z")
        .expect("format description is valid");
    let ts = OffsetDateTime::now_utc()
        .format(&format)
        .expect("formatting a UTC timestamp should succeed");
    // 8-char suffix drawn from lowercase letters and digits.
    const CHARS: &[u8] = b"abcdefghijklmnopqrstuvwxyz0123456789";
    let suffix: String = (0..8)
        .map(|_| CHARS[rand::random::<usize>() % CHARS.len()] as char)
        .collect();
    format!("{ts}-{suffix}")
}

fn expand_log_dir(path: &std::path::Path) -> anyhow::Result<PathBuf> {
    // kebab-config's expand_path / expand_path_with_base do not support {state_dir},
    // so the placeholder is replaced by hand using Config::xdg_state_dir().
    let path_str = path.to_string_lossy();
    if path_str.contains("{state_dir}") {
        let state_dir = kebab_config::Config::xdg_state_dir();
        let state_dir = state_dir
            .to_str()
            .ok_or_else(|| anyhow::anyhow!("state dir is not valid UTF-8"))?;
        Ok(PathBuf::from(path_str.replace("{state_dir}", state_dir)))
    } else {
        Ok(path.to_path_buf())
    }
}

#[derive(Serialize, Deserialize)]
#[serde(tag = "kind")]
pub enum LogEvent<'a> {
    #[serde(rename = "ocr")]
    Ocr {
        ts: String,
        doc_path: &'a str,
        page: u32,
        image_byte_size: Option<u64>,
        image_width: Option<u32>,
        image_height: Option<u32>,
        ms: u64,
        chars: u32,
        success: bool,
        reason: Option<&'a str>,
        ocr_engine: &'a str,
    },
    #[serde(rename = "parse_error")]
    ParseError {
        ts: String,
        doc_path: &'a str,
        reason: &'a str,
        message: &'a str,
    },
    #[serde(rename = "skip")]
    Skip {
        ts: String,
        doc_path: &'a str,
        reason: &'a str,
        detail: Option<&'a str>,
    },
    #[serde(rename = "error")]
    Error {
        ts: String,
        code: &'a str,
        message: &'a str,
    },
    #[serde(rename = "summary")]
    Summary {
        ts: String,
        run_id: String,
        scanned: u32,
        new: u32,
        errors: u32,
        ocr_pages: u32,
        ocr_failures: u32,
        ocr_p50_ms: Option<u64>,
        ocr_p90_ms: Option<u64>,
        ocr_max_ms: Option<u64>,
        duration_ms: u64,
    },
}

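// `tag` + `rename` make serde emit "kind": "summary" ahead of the fields, as §3.3 requires.
#[derive(Serialize, Deserialize)]
#[serde(tag = "kind", rename = "summary")]
pub struct IngestSummary {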
    pub ts: String,
    pub run_id: String,
    pub scanned: u32,
    pub new: u32,
    pub errors: u32,
    pub ocr_pages: u32,
    pub ocr_failures: u32,
    pub ocr_p50_ms: Option<u64>,
    pub ocr_p90_ms: Option<u64>,
    pub ocr_max_ms: Option<u64>,
    pub duration_ms: u64,
}

§4.2 Emit hook integration points

Hook 1: crates/kebab-app/src/lib.rs::ingest_with_config (line 234)

  • IngestLogWriter::open(&cfg.logging) at function entry → Option<Arc<Mutex<IngestLogWriter>>>
  • IngestLogWriter::flush() at function exit (success / error)
  • Wrap IngestLogWriter in Arc<Mutex<_>>, then carry it in the IngestOpts passed to ingest_with_config_opts()
  • MEDIUM-1 (ownership): the emit_progress closure in apply_pdf_ocr captures a clone of the Arc<Mutex<IngestLogWriter>>, locks it, and writes. The per-asset loop is single-threaded and synchronous, so a blocking lock is safe.
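The ownership pattern in MEDIUM-1 can be sketched with std types only. Here `run_pages` is a hypothetical stand-in for apply_pdf_ocr, any `Write` implementor stands in for IngestLogWriter, and the hand-written JSON line is a placeholder for the serde-based write_event. Ignoring lock and write failures is an assumption of this sketch, not something the spec states.

```rust
use std::io::Write;
use std::sync::{Arc, Mutex};

/// Stand-in for Arc<Mutex<IngestLogWriter>>: any writer behind a shared lock.
pub type SharedLog<W> = Arc<Mutex<W>>;

/// Hypothetical stand-in for apply_pdf_ocr: reports (page, ms) once per page.
pub fn run_pages(pages: &[(u32, u64)], mut emit_progress: impl FnMut(u32, u64)) {
    for &(page, ms) in pages {
        emit_progress(page, ms);
    }
}

/// Clones the shared writer into the progress closure, then flushes at exit.
pub fn ingest_with_log<W: Write>(log: Option<SharedLog<W>>, pages: &[(u32, u64)]) {
    let log_for_closure = log.clone();
    run_pages(pages, move |page, ms| {
        if let Some(log) = &log_for_closure {
            // The loop is single-threaded, so this lock is never contended.
            // Assumption: logging failures are ignored so they cannot abort the ingest.
            if let Ok(mut w) = log.lock() {
                let _ = writeln!(w, "{{\"kind\":\"ocr\",\"page\":{page},\"ms\":{ms}}}");
            }
        }
    });
    // Flush once at function exit, mirroring the flush() bullet of Hook 1.
    if let Some(log) = &log {
        if let Ok(mut w) = log.lock() {
            let _ = w.flush();
        }
    }
}
```

Passing None models ingest_log_enabled = false: the closure still runs for every page but writes nothing.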

Hook 2: OCR progress event (pdf_ocr_apply.rs)

  • HIGH-1 (PdfOcrProgress::Finished extend): extend the PdfOcrProgress::Finished variant with additive fields:
    PdfOcrProgress::Finished {
        page: u32,
        ms: u64,
        chars: u32,
        skipped: bool,
        // NEW:
        image_byte_size: Option<u64>,
        image_width: Option<u32>,
        image_height: Option<u32>,
        failure_reason: Option<String>,  // "timeout" | "ocr_error" | "network_error" | None
    }
    
  • Inside the emit_progress closure, pass the ocr event to the log writer (Arc<Mutex>)
  • Cascade impact on ingest_progress.v1 (IngestEvent::PdfOcrFinished): additive minor (backward compat)

Hook 3: Parse error (app.rs / ingest pipeline error path)

  • Emit the parse_error kind when Error::PdfFormat / Error::ImageFormat occurs

Hook 4: Skip event (kebab-source-fs/src/connector.rs)

  • On a size_exceeded / gitignore / builtin_blacklist skip, pass a skip event to the log writer

Hook 5: Fatal error (HIGH-4 location correction)

  • Location: the error return path of crates/kebab-app/src/lib.rs::ingest_with_config_opts (per-asset catch + final Err arm)
  • Emit LogEvent::Error to the log writer immediately after invoking classify(err, verbose)
  • error_wire::classify itself stays unchanged (a pure conversion with no side effects)

§4.3 Timestamp format

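HIGH-3 (chrono → time crate): ISO 8601 UTC, milliseconds precision:

  • Example: 2026-05-28T01:30:01.123Z
  • Use time::OffsetDateTime::now_utc().format(&time::format_description::well_known::Rfc3339). Rfc3339 emits whatever subsecond precision the value carries, so truncate the value to milliseconds first (or use a custom format description with [subsecond digits:3]) to get the three-digit form shown in the example
  • Always UTC (independent of the system timezone)
  • The workspace already uses the time crate, which avoids adding chrono as a duplicate dependency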
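A test for the timestamp format can assert the fixed millisecond shape without pulling in a date library. This is a std-only sketch with a hypothetical helper name, `is_utc_millis_timestamp`; it checks the shape only and is not a full ISO 8601 validator.

```rust
/// Returns true when `ts` has the fixed shape `YYYY-MM-DDTHH:MM:SS.mmmZ`.
/// Shape only: field ranges (month <= 12, hour <= 23, ...) are not checked.
pub fn is_utc_millis_timestamp(ts: &str) -> bool {
    // 'd' marks a digit position; every other byte must match literally.
    const TEMPLATE: &[u8] = b"dddd-dd-ddTdd:dd:dd.dddZ";
    let bytes = ts.as_bytes();
    bytes.len() == TEMPLATE.len()
        && bytes.iter().zip(TEMPLATE).all(|(b, t)| match *t {
            b'd' => b.is_ascii_digit(),
            other => *b == other,
        })
}
```

A check like this rejects both second-precision values and values with more than three fractional digits, which is useful if the writer formats timestamps without truncating them first.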

§4.4 Backward compat

The logging field of the Config struct carries the #[serde(default)] attribute:

#[serde(default)]
pub logging: LoggingCfg,

→ when a pre-v0.20 config file has no [logging] section, the field is initialized to LoggingCfg::default() automatically (enabled=true, dir=~/.local/state/kebab/logs)


§5 Acceptance criteria

  • AC-1: the [logging] section is emitted with defaults (new config or config init).
  • AC-2: after running kebab ingest, the file {log_dir}/ingest-{run_id}.ndjson exists.
  • AC-3: each line is valid JSON with a kind enum value and an ISO 8601 ts.
  • AC-4: OCR per-page records + a summary record (last line).
  • AC-5: every failure type (size_exceeded / parse_error / ocr timeout) is recorded.
  • AC-6: with ingest_log_enabled = false, no log file is created.
  • AC-7: with the override ingest_log_dir = "/tmp/custom", the file is emitted at that path.
  • AC-8: cargo test -p kebab-app --lib ingest_log + cargo clippy green.
  • AC-9 (MEDIUM-2 actionability): integration test: cargo test -p kebab-app --test ingest_log_smoke -j 4 2>&1 | tail -3 → 1 passed; 0 failed. Test body:
    1. tempdir + minimal corpus (1 markdown + 1 image PDF).
    2. ingest with [logging] ingest_log_dir = tempdir/logs.
    3. assert: log file tempdir/logs/ingest-{run_id}.ndjson exists.
    4. parse each line as JSON, assert kinds = [ocr, summary] or more.
    5. last line kind = "summary" + scanned > 0 && ocr_pages > 0.
  • AC-10: a pre-v0.20 config works with the v0.20 binary (the missing [logging] fields fall back to defaults).

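§6 Risks + open questions

Risks

  • R-1: disk usage from accumulated log files. The user cleans up manually; state this in a doc comment.
  • R-2: run_id collision between concurrent ingests. Practically impossible with a timestamp plus an 8-char random suffix, which serves as the mitigation.
  • R-3: {state_dir} placeholder expansion. Hand-rolled via xdg_state_dir + string replace; the existing expand_path_with_base does not support {state_dir} (LOW-2).
  • R-4: if the error path panics, the log writer is dropped without flush having run. Mitigated by calling flush() in the Drop impl (Drop runs during unwinding, so this does not cover panic = "abort" builds).
  • R-5: workspace dependencies: rand (run_id suffix) and time (timestamp) are already dependencies of kebab-app. Adding chrono is prohibited (HIGH-3).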

Open questions

  • OQ-1 (promoted to a design decision under HIGH-1): source of image_byte_size + dimensions: carried on PdfOcrProgress::Finished (Option A adopted). Raster image measurement was already started in the Bug #11 follow-up.
  • OQ-2: computing ocr_p50_ms, ocr_p90_ms: use the quantiles crate or a simple sorted vec? (plan drafter decides)
  • OQ-3: where in the user-facing docs should the manual log file cleanup policy be stated? (README / SMOKE / config example; plan drafter + executor decide)

§7 References

  • Parent task: tasks/p10/p10-1A-5-ingest-failure-log.md
  • Parent design: docs/superpowers/specs/2026-04-27-kebab-final-form-design.md (§8 wire schema, §9 versioning cascade: no changes)
  • Bug #11 follow-up: OCR image metric capture (raster path in kebab-parse-pdf)
  • Related: crates/kebab-app/src/error_wire.rs (ErrorV1 emit), crates/kebab-app/src/ingest_progress.rs (IngestEvent)
  • Config: crates/kebab-config/src/lib.rs (Config struct, expand_path helpers, xdg_state_dir)