style: cargo fmt --all (round 4 ingest log feature follow-up)

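The final `fix(test): clippy + fmt fixes` commit from the Phase C4 executor
applied fmt to the test files only. A workspace-wide fmt pass turned out to be
missing, so this commit applies cargo fmt --all. All imports are reordered
alphabetically and line wrapping is made consistent.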

Untracked artifacts committed alongside:
- docs/superpowers/specs/2026-05-28-v0.20-ingest-log-spec.md (491 lines, ACCEPT)
- docs/superpowers/plans/2026-05-28-v0.20-ingest-log-plan.md (616 lines, ACCEPT)

workspace test: 1370 passed / 0 failed / 50 ignored, ingest_log_smoke green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 04:18:40 +00:00
parent 445b096215
commit 685007789a
235 changed files with 6520 additions and 3955 deletions


@@ -0,0 +1,491 @@
---
title: "v0.20.x ingest log feature — spec"
date: 2026-05-28
status: "DRAFT (round r1c)"
target_version: 0.20.x
phase: A4 (spec drafter rewrite)
parent_spec: ../../../tasks/p10/p10-1A-5-ingest-failure-log.md
contract_sections: []
references: [2026-04-27-kebab-final-form-design.md, pre-v0.20 dogfood Bug #15 (ocr timeout + log analysis), 2026-05-28 closure critic result]
---
# v0.20.x ingest log feature — spec
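## Motivation
After four dogfood rounds, the user explicitly requested:
> I'd like to be able to see detailed logs whenever a failure occurs, OCR failures included. I think I need those statistics to find the right configuration sweet spot and the right design direction.
**Core problem**: there is no structured log surface that records ingest failures such as OCR timeouts, parse errors, and skips as individual events plus summary stats. To find the sweet spot (OCR timeout, model choice), the user needs per-page statistics and aggregate stats.
**Surface choice**: a structured ndjson file (zero wire schema changes, internal write-only). The user can analyze it directly with grep/jq, without a separate query CLI.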
---
## Scope
### In scope
- `[logging]` config section (2 fields: `ingest_log_enabled`, `ingest_log_dir`)
- Per-ingest-run ndjson log file (`ingest-{run_id}.ndjson`, run_id = ISO 8601 + random suffix)
- Five kinds of log record: `ocr`, `parse_error`, `skip`, `error`, `summary`
- OCR per-page events include the measured image byte size and dimensions
- Run ID + timestamp in ISO 8601 UTC format
- Backward compat: pre-v0.20 config files without a `[logging]` section keep loading, with the new fields defaulted
### Out of scope
- SQLite 영구 저장 (future enhancement)
- Log level switches (verbose / minimal): this round ships a single level
- Log rotation / cleanup policy (user-managed)
- Query CLI (`kebab logs`)
- Wire schema 변경 (internal use only)
---
## Design decisions
### §3.1 Config schema — `[logging]` section
**File**: `crates/kebab-config/src/lib.rs` (the existing Config struct is at line 37+)
New struct `LoggingCfg`:
```rust
#[derive(Clone, Debug, Serialize, Deserialize, PartialEq)]
pub struct LoggingCfg {
    /// Write a structured ndjson log automatically during ingest. Default = true.
    /// When false, no log file is created.
    #[serde(default = "default_ingest_log_enabled")]
    pub ingest_log_enabled: bool,
    /// Directory for per-ingest-run log files. Default = `{state_dir}/logs`.
    /// The `{state_dir}` placeholder expands to the XDG state dir.
    #[serde(default = "default_ingest_log_dir")]
    pub ingest_log_dir: PathBuf,
}

fn default_ingest_log_enabled() -> bool {
    true
}

fn default_ingest_log_dir() -> PathBuf {
    // After expansion: ~/.local/state/kebab/logs
    PathBuf::from("{state_dir}/logs")
}

impl Default for LoggingCfg {
    fn default() -> Self {
        Self {
            ingest_log_enabled: default_ingest_log_enabled(),
            ingest_log_dir: default_ingest_log_dir(),
        }
    }
}
```
**Added to the Config struct** (line 37+):
```rust
pub struct Config {
    // ... existing fields ...
    #[serde(default)]
    pub logging: LoggingCfg,
}
```
`#[serde(default)]` guarantees backward compat (a pre-v0.20 config file with no `[logging]` section is initialized with the defaults automatically).
### §3.2 Per-ingest-run filename + ID generation
**Filename format**: `ingest-{run_id}.ndjson`
**Run ID**: ISO 8601 timestamp + 8-char random suffix
- Example: `20260528T013000Z-abc123de`
- Format: `YYYYMMDDTHHmmssZ-{8-char random alphanumeric}`
- Sortable + readable + concurrent collision unlikely
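
The run ID shape above is fixed-width, which is what makes it sortable. A minimal sketch of a shape check that a smoke test could use when locating `ingest-{run_id}.ndjson`; `is_valid_run_id` is a hypothetical helper, not part of this spec, and it assumes the random suffix is lowercase alphanumeric as in the example:

```rust
/// Hypothetical helper (not part of the spec): checks that `s` has the shape
/// `YYYYMMDDTHHmmssZ-{8 lowercase alphanumerics}`.
/// It validates the shape only, not calendar validity (month 13 would pass).
fn is_valid_run_id(s: &str) -> bool {
    // The timestamp part contains no '-', so the first '-' is the separator.
    let Some((ts, suffix)) = s.split_once('-') else {
        return false;
    };
    let ts = ts.as_bytes();
    // 8 date digits + 'T' + 6 time digits + 'Z' = 16 bytes.
    let ts_ok = ts.len() == 16
        && ts[..8].iter().all(u8::is_ascii_digit)
        && ts[8] == b'T'
        && ts[9..15].iter().all(u8::is_ascii_digit)
        && ts[15] == b'Z';
    let suffix_ok = suffix.len() == 8
        && suffix
            .bytes()
            .all(|b| b.is_ascii_lowercase() || b.is_ascii_digit());
    ts_ok && suffix_ok
}
```

Because every field is zero-padded and fixed-width, a plain lexicographic sort of run IDs (or of the log filenames) also orders them chronologically.
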
**Path expansion**: the `{state_dir}` placeholder expands to `~/.local/state/kebab`
- `{state_dir}` substitution is hand-rolled separately (`expand_path_with_base` in kebab-config does not support {state_dir})
- Directory auto-create if missing
### §3.3 Log content — ndjson schema
Each line is one JSON object (ndjson format). The `kind` field of each record is the union variant discriminator.
#### Record kind: `ocr`
OCR per-page attempt (success / failure). **HIGH-1 (PdfOcrProgress::Finished extend)**: image_byte/dims/failure_reason are added to the PdfOcrProgress::Finished variant and carried from there.
```jsonc
{
  "ts": "2026-05-28T01:30:01.123Z",
  "kind": "ocr",
  "doc_path": "metro-korea.pdf",
  "page": 8,
  "image_byte_size": 989512,
  "image_width": 2480,
  "image_height": 3508,
  "ms": 180000,
  "chars": 0,
  "success": false,
  "reason": "timeout",
  "ocr_engine": "ollama-vision"
}
```
**Fields**:
- `ts`: ISO 8601 UTC (milliseconds, using `time` crate)
- `kind`: literal `"ocr"`
- `doc_path`: relative or absolute source path
- `page`: page number (0-indexed or 1-indexed — confirm in impl)
- `image_byte_size`: raster image byte size (from PdfOcrProgress::Finished.image_byte_size)
- `image_width`, `image_height`: raster dimensions pixel (from PdfOcrProgress::Finished.image_width/height)
- `ms`: OCR duration milliseconds
- `chars`: extracted character count (0 if failed)
- `success`: boolean; `false` when `PdfOcrProgress::Finished.failure_reason` is `Some`, `true` when it is `None`
- `reason`: "timeout" | "network_error" | "malformed_image" | "other" (from PdfOcrProgress::Finished.failure_reason)
- `ocr_engine`: "ollama-vision" or config'd name
#### Record kind: `parse_error`
PDF / other format parse failure.
```jsonc
{
  "ts": "2026-05-28T01:30:02.456Z",
  "kind": "parse_error",
  "doc_path": "weird.pdf",
  "reason": "lopdf_error",
  "message": "unexpected EOF in xref table"
}
```
#### Record kind: `skip`
Document skip (size_exceeded / gitignore / builtin_blacklist).
```jsonc
{
  "ts": "2026-05-28T01:30:03.789Z",
  "kind": "skip",
  "doc_path": "large.zip",
  "reason": "builtin_blacklist",
  "detail": ".zip extension"
}
```
#### Record kind: `error`
Fatal error emit (written to the log at the same time error.v1 is emitted).
```jsonc
{
  "ts": "2026-05-28T01:30:04.012Z",
  "kind": "error",
  "code": "config_not_found",
  "message": "config file does not exist: /tmp/config.toml"
}
```
#### Record kind: `summary`
Final aggregate stats for the ingest run (last line).
```jsonc
{
  "ts": "2026-05-28T01:31:00.000Z",
  "kind": "summary",
  "run_id": "20260528T013000Z-abc123de",
  "scanned": 11,
  "new": 11,
  "errors": 0,
  "ocr_pages": 21,
  "ocr_failures": 2,
  "ocr_p50_ms": 1500,
  "ocr_p90_ms": 63000,
  "ocr_max_ms": 180007,
  "duration_ms": 555550
}
```
**Fields**:
- `run_id`: unique ID of this ingest run
- `scanned`: total number of docs attempted
- `new`: number of new docs actually ingested
- `errors`: fatal error count
- `ocr_pages`: total number of pages on which OCR was attempted
- `ocr_failures`: number of pages on which OCR failed
- `ocr_p50_ms`, `ocr_p90_ms`, `ocr_max_ms`: percentiles and max (successful pages only; may be null)
- `duration_ms`: elapsed time of the entire run
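
The percentile fields can be computed without an extra dependency. A minimal sketch of the "simple sorted vec" option using the nearest-rank method; `percentile_ms` is a hypothetical name, and the choice of nearest-rank (rather than interpolation) is an assumption, since this spec leaves the method open:

```rust
/// Hypothetical helper: nearest-rank percentile over the OCR durations (ms)
/// of successful pages. Returns `None` for an empty input, which maps to the
/// nullable `ocr_p50_ms` / `ocr_p90_ms` / `ocr_max_ms` summary fields.
fn percentile_ms(durations: &[u64], pct: u32) -> Option<u64> {
    if durations.is_empty() {
        return None;
    }
    let mut sorted = durations.to_vec();
    sorted.sort_unstable();
    let n = sorted.len();
    // Nearest rank: rank = ceil(pct / 100 * n), clamped to [1, n].
    let rank = ((pct as usize * n + 99) / 100).clamp(1, n);
    Some(sorted[rank - 1])
}
```

With this definition `percentile_ms(&v, 100)` is the maximum, so one helper covers all three summary fields. Only durations from successful pages should be passed in, matching the field description above.
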
---
## §4 Implementation spec
### §4.1 New module: `crates/kebab-app/src/ingest_log.rs`
```rust
use std::fs::File;
use std::io::{BufWriter, Write};
use std::path::PathBuf;
use std::time::SystemTime;
use serde::{Serialize, Deserialize};

pub struct IngestLogWriter {
    file: Option<BufWriter<File>>,
    run_id: String,
    started_at: SystemTime,
}

impl IngestLogWriter {
    /// Open the log file. Returns `None` when `cfg.ingest_log_enabled == false`.
    pub fn open(cfg: &kebab_config::LoggingCfg) -> anyhow::Result<Option<Self>> {
        if !cfg.ingest_log_enabled {
            return Ok(None);
        }
        let run_id = generate_run_id();
        let log_dir = expand_log_dir(&cfg.ingest_log_dir)?;
        std::fs::create_dir_all(&log_dir)?;
        let path = log_dir.join(format!("ingest-{run_id}.ndjson"));
        let file = BufWriter::new(File::create(&path)?);
        Ok(Some(Self {
            file: Some(file),
            run_id,
            started_at: SystemTime::now(),
        }))
    }

    /// Append one event as a single ndjson line.
    pub fn write_event(&mut self, event: &LogEvent) -> anyhow::Result<()> {
        if let Some(ref mut f) = self.file {
            serde_json::to_writer(&mut *f, event)?;
            writeln!(f)?;
        }
        Ok(())
    }

    /// Append the final summary record as a single ndjson line.
    pub fn write_summary(&mut self, summary: &IngestSummary) -> anyhow::Result<()> {
        if let Some(ref mut f) = self.file {
            serde_json::to_writer(&mut *f, summary)?;
            writeln!(f)?;
        }
        Ok(())
    }

    pub fn flush(&mut self) -> anyhow::Result<()> {
        if let Some(ref mut f) = self.file {
            f.flush()?;
        }
        Ok(())
    }

    pub fn run_id(&self) -> &str {
        &self.run_id
    }
}

impl Drop for IngestLogWriter {
    fn drop(&mut self) {
        // Errors cannot be reported from drop, so a failed flush is ignored.
        let _ = self.flush();
    }
}

fn generate_run_id() -> String {
    use time::OffsetDateTime;

    let now = OffsetDateTime::now_utc();
    // Compact ISO 8601, e.g. 20260528T013000Z. `T` and `Z` are literals
    // in the format description.
    let format = time::format_description::parse(
        "[year][month][day]T[hour][minute][second]Z",
    )
    .expect("format description is valid");
    let ts = now.format(&format).expect("formatting should succeed");
    // 8-char lowercase alphanumeric suffix to keep concurrent runs distinct.
    let suffix: String = (0..8)
        .map(|_| {
            const CHARS: &[u8] = b"abcdefghijklmnopqrstuvwxyz0123456789";
            // Draw a u32 so the result does not depend on pointer width.
            CHARS[rand::random::<u32>() as usize % CHARS.len()] as char
        })
        .collect();
    format!("{ts}-{suffix}")
}

fn expand_log_dir(path: &std::path::Path) -> anyhow::Result<PathBuf> {
    // `expand_path_with_base` does not support `{state_dir}`, so substitute
    // it by hand using kebab_config::Config::xdg_state_dir().
    let path_str = path.to_string_lossy();
    if path_str.contains("{state_dir}") {
        let state_dir = kebab_config::Config::xdg_state_dir();
        // Return an error instead of panicking on a non-UTF-8 state dir.
        let state_dir = state_dir
            .to_str()
            .ok_or_else(|| anyhow::anyhow!("state dir is not valid UTF-8"))?;
        Ok(PathBuf::from(path_str.replace("{state_dir}", state_dir)))
    } else {
        Ok(path.to_path_buf())
    }
}

// Serialize only: the log is write-only, and the borrowed `Option<&str>`
// fields would need `#[serde(borrow)]` before `Deserialize` could be derived.
#[derive(Serialize)]
#[serde(tag = "kind")]
pub enum LogEvent<'a> {
    #[serde(rename = "ocr")]
    Ocr {
        ts: String,
        doc_path: &'a str,
        page: u32,
        image_byte_size: Option<u64>,
        image_width: Option<u32>,
        image_height: Option<u32>,
        ms: u64,
        chars: u32,
        success: bool,
        reason: Option<&'a str>,
        ocr_engine: &'a str,
    },
    #[serde(rename = "parse_error")]
    ParseError {
        ts: String,
        doc_path: &'a str,
        reason: &'a str,
        message: &'a str,
    },
    #[serde(rename = "skip")]
    Skip {
        ts: String,
        doc_path: &'a str,
        reason: &'a str,
        detail: Option<&'a str>,
    },
    #[serde(rename = "error")]
    Error {
        ts: String,
        code: &'a str,
        message: &'a str,
    },
    #[serde(rename = "summary")]
    Summary {
        ts: String,
        run_id: String,
        scanned: u32,
        new: u32,
        errors: u32,
        ocr_pages: u32,
        ocr_failures: u32,
        ocr_p50_ms: Option<u64>,
        ocr_p90_ms: Option<u64>,
        ocr_max_ms: Option<u64>,
        duration_ms: u64,
    },
}

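// The struct-level tag adds `"kind": "summary"` on serialization, so the
// line written by `write_summary` carries the same discriminator as every
// other record.
#[derive(Serialize, Deserialize)]
#[serde(tag = "kind", rename = "summary")]
pub struct IngestSummary {
    pub ts: String,
    pub run_id: String,
    pub scanned: u32,
    pub new: u32,
    pub errors: u32,
    pub ocr_pages: u32,
    pub ocr_failures: u32,
    pub ocr_p50_ms: Option<u64>,
    pub ocr_p90_ms: Option<u64>,
    pub ocr_max_ms: Option<u64>,
    pub duration_ms: u64,
}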
```
### §4.2 Emit hook integration points
**Hook 1**: `crates/kebab-app/src/lib.rs::ingest_with_config` (line 234)
- `IngestLogWriter::open(&cfg.logging)` at function entry, wrapped into `Option<Arc<Mutex<IngestLogWriter>>>`
- `IngestLogWriter::flush()` at function exit (success / error)
- Wrap `IngestLogWriter` in `Arc<Mutex<_>>` and carry it into `ingest_with_config_opts()` via `IngestOpts`
- **MEDIUM-1 (ownership)**: the emit_progress closure in `apply_pdf_ocr` captures a clone of the `Arc<Mutex<IngestLogWriter>>`, then locks and writes. The per-asset loop is single-threaded and synchronous, so a blocking lock is safe.
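
The ownership pattern in MEDIUM-1 can be sketched with the standard library alone. `SharedLog` and `make_emit_progress` are hypothetical stand-ins for `IngestLogWriter` and the real emit_progress closure, the JSON line is hand-formatted to keep the sketch dependency-free, and silently dropping write errors is an assumption of this sketch rather than a decision made in this spec:

```rust
use std::io::Write;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for `IngestLogWriter`: wraps any `Write` sink.
struct SharedLog<W: Write> {
    sink: W,
}

impl<W: Write> SharedLog<W> {
    fn write_line(&mut self, line: &str) -> std::io::Result<()> {
        writeln!(self.sink, "{line}")
    }
}

/// Builds a progress callback that owns its own clone of the shared handle.
fn make_emit_progress<W: Write>(
    log: Arc<Mutex<SharedLog<W>>>,
) -> impl FnMut(u32, u64) {
    move |page, ms| {
        // A blocking lock is acceptable in a single-threaded per-asset loop.
        // Assumption: a poisoned lock or a write error is ignored so that a
        // logging problem never interrupts the ingest itself.
        if let Ok(mut guard) = log.lock() {
            let line = format!(r#"{{"kind":"ocr","page":{page},"ms":{ms}}}"#);
            let _ = guard.write_line(&line);
        }
    }
}
```

The caller keeps the original `Arc` and hands `Arc::clone(&log)` to the closure, so the summary record can still be written through the same writer after the OCR loop finishes.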
**Hook 2**: OCR progress event (pdf_ocr_apply.rs)
- **HIGH-1 (PdfOcrProgress::Finished extend)**: extend the `PdfOcrProgress::Finished` variant with additive fields:
```rust
PdfOcrProgress::Finished {
    page: u32,
    ms: u64,
    chars: u32,
    skipped: bool,
    // NEW:
    image_byte_size: Option<u64>,
    image_width: Option<u32>,
    image_height: Option<u32>,
    failure_reason: Option<String>, // "timeout" | "ocr_error" | "network_error" | None
}
```
- Inside the emit_progress closure, pass the ocr event to the log writer (`Arc<Mutex<IngestLogWriter>>`)
- Cascade impact on ingest_progress.v1 (IngestEvent::PdfOcrFinished): additive minor (backward compat)
**Hook 3**: Parse error (app.rs / ingest pipeline error path)
- Emit the parse_error kind when `Error::PdfFormat` / `Error::ImageFormat` occurs
**Hook 4**: Skip event (kebab-source-fs/src/connector.rs)
- On a size_exceeded / gitignore / builtin_blacklist skip, pass a skip event to the log writer
**Hook 5**: Fatal error (**HIGH-4 location corrected**)
- **Location**: the error return path of `crates/kebab-app/src/lib.rs::ingest_with_config_opts` (per-asset catch + final Err arm)
- Emit LogEvent::Error to the log writer right after `classify(err, verbose)` is invoked
- `error_wire::classify` itself stays unchanged (a pure conversion with no side effects)
### §4.3 Timestamp format
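**HIGH-3 (chrono → time crate)**: ISO 8601 UTC, millisecond precision:
- Example: `2026-05-28T01:30:01.123Z`
- Use `time::OffsetDateTime::now_utc().format(&time::format_description::well_known::Rfc3339)`. `Rfc3339` prints only as many fractional digits as the value carries, so a fixed three-digit fraction needs a custom format description (for example one using `[subsecond digits:3]`)
- Always UTC (independent of the system timezone)
- The workspace already uses the `time` crate (avoids a duplicate chrono dependency)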
### §4.4 Backward compat
The `logging` field of the `Config` struct carries a `#[serde(default)]` attribute:
```rust
#[serde(default)]
pub logging: LoggingCfg,
```
As a result, a pre-v0.20 config file with no `[logging]` section is initialized with `LoggingCfg::default()` automatically (enabled=true, dir=~/.local/state/kebab/logs).
---
## §5 Acceptance criteria
- **AC-1**: the `[logging]` section is emitted with its defaults (new config or `config init`).
- **AC-2**: after running `kebab ingest`, the file `{log_dir}/ingest-{run_id}.ndjson` exists.
- **AC-3**: each line is valid JSON with a `kind` enum value and an ISO 8601 `ts`.
- **AC-4**: OCR per-page records plus a summary record (last line).
- **AC-5**: every failure type (size_exceeded / parse_error / ocr timeout) is recorded.
- **AC-6**: with `ingest_log_enabled = false`, no log file is created.
- **AC-7**: with `ingest_log_dir = "/tmp/custom"` overridden, the file is emitted at that path.
- **AC-8**: `cargo test -p kebab-app --lib ingest_log` + `cargo clippy` green.
- **AC-9** (**MEDIUM-2 actionability**): integration test. `cargo test -p kebab-app --test ingest_log_smoke -j 4 2>&1 | tail -3` reports 1 passed; 0 failed. Test body:
1. tempdir + minimal corpus (1 markdown + 1 image PDF).
2. ingest with `[logging] ingest_log_dir = tempdir/logs`.
3. assert: log file `tempdir/logs/ingest-{run_id}.ndjson` exists.
4. parse each line as JSON, assert kinds = [ocr, summary] or more.
5. last line kind = "summary" + scanned > 0 && ocr_pages > 0.
- **AC-10**: a pre-v0.20 config is compatible with the v0.20 binary (a missing `[logging]` section falls back to the defaults).
---
## §6 Risks + open questions
### Risks
- **R-1**: log files accumulate disk usage; the user cleans them up manually. Stated in the doc comment.
- **R-2**: run_id collision between concurrent ingests is practically impossible with an 8-char random suffix plus timestamp. Mitigated.
- **R-3**: `{state_dir}` placeholder expansion is hand-rolled via xdg_state_dir + string replace. The existing `expand_path_with_base` does not support {state_dir} (LOW-2).
- **R-4**: if the error path panics, the log writer is dropped without an explicit flush; mitigated by calling flush() in the Drop impl.
- **R-5**: workspace dependencies: `rand` (run_id suffix) and `time` (timestamp) are already dependencies of kebab-app. Adding `chrono` is not allowed (HIGH-3).
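
The R-4 mitigation can be checked in isolation. A minimal sketch using only the standard library; `MemSink`, `FlushOnDrop`, and `bytes_after_panic` are hypothetical names, and the in-memory sink stands in for the real log file. It relies on unwinding, so it does not apply to builds with `panic = "abort"`:

```rust
use std::io::{BufWriter, Write};
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::{Arc, Mutex};

/// Hypothetical in-memory sink standing in for the log file.
#[derive(Clone, Default)]
struct MemSink(Arc<Mutex<Vec<u8>>>);

impl Write for MemSink {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        self.0.lock().unwrap().extend_from_slice(buf);
        Ok(buf.len())
    }
    fn flush(&mut self) -> std::io::Result<()> {
        Ok(())
    }
}

/// Hypothetical writer that mirrors the Drop impl of `IngestLogWriter`.
struct FlushOnDrop<W: Write> {
    inner: BufWriter<W>,
}

impl<W: Write> Drop for FlushOnDrop<W> {
    fn drop(&mut self) {
        // Errors cannot be returned from drop, so they are discarded.
        let _ = self.inner.flush();
    }
}

/// Buffers one record, panics, and returns what reached the sink.
fn bytes_after_panic() -> Vec<u8> {
    let sink = MemSink::default();
    let observer = sink.clone();
    let result = catch_unwind(AssertUnwindSafe(move || {
        let mut w = FlushOnDrop { inner: BufWriter::new(sink) };
        writeln!(w.inner, "{{\"kind\":\"skip\"}}").unwrap();
        panic!("simulated failure mid-ingest");
    }));
    assert!(result.is_err());
    let bytes = observer.0.lock().unwrap().clone();
    bytes
}
```

`BufWriter` also attempts a flush in its own drop, so the explicit Drop impl mainly documents the intent and keeps the behavior independent of which buffer type is used.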
### Open questions
- **OQ-1** (**promoted to a HIGH-1 design decision**): source of image_byte_size + dimensions: carried on PdfOcrProgress::Finished (Option A adopted). Raster image measurement was already started in the Bug #11 follow-up.
- **OQ-2**: computing `ocr_p50_ms`, `ocr_p90_ms`: use the `quantiles` crate or a simple sorted vec? (plan drafter decides)
- **OQ-3**: where in the user-facing docs should the manual log cleanup policy be stated? (README / SMOKE / config example; plan drafter + executor decide)
---
## §7 References
- **Parent task**: `tasks/p10/p10-1A-5-ingest-failure-log.md`
- **Parent design**: `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` (§8 wire schema, §9 versioning cascade; no changes)
- **Bug #11 follow-up**: OCR image metric capture (raster path in kebab-parse-pdf)
- **Related**: `crates/kebab-app/src/error_wire.rs` (ErrorV1 emit), `crates/kebab-app/src/ingest_progress.rs` (IngestEvent)
- **Config**: `crates/kebab-config/src/lib.rs` (Config struct, expand_path helpers, xdg_state_dir)