feat: ingest log round 2 — image_w/h + V008 SQLite mirror + CLI inspect + retention #190

Merged
altair823 merged 7 commits from feat/ingest-log-round2-enhancements into main 2026-05-28 08:12:27 +00:00

Summary

v0.20.x ingest log round 2: strengthens the surface for incremental sweet-spot analysis (4 enhancements).

Follow-up to the round 1 ingest log (file-only ndjson) from v0.20.0 sub-item 1 (merged in PR #189):

  1. PDF OCR raster image dimension capture (fixes the round 1 defect where the values were always null).
  2. V008 SQLite mirror — historical OCR query table.
  3. CLI inspect commands — kebab inspect ocr-stats + kebab inspect ocr-failures.
  4. Log retention policy — keep_recent_runs + retention_days.

7 commits on feat/ingest-log-round2-enhancements atop main 89d334a.

Commits

  • 5977c8c — feat(app): capture image_width/height in PDF OCR raster decode (Enhancement 1)
  • 6482bf1 — feat(store): V008 pdf_ocr_events migration + record/prune API (Enhancement 2)
  • 4e451c9 — feat(app): dual-write PDF OCR events to SQLite + ndjson (Enhancement 2 wiring)
  • d9ec7b8 — feat(cli): kebab inspect ocr-stats + ocr-failures (Enhancement 3 + wire schema additive minor)
  • 35c987d — feat(app): log retention — keep_recent_runs + retention_days (Enhancement 4)
  • 9a36a06 — style: cargo fmt --all (feature follow-up)
  • 7c24734 — docs(superpowers): spec + plan artifacts

Workflow

| Phase | Artifact | Result |
|-------|----------|--------|
| Spec | docs/superpowers/specs/2026-05-28-v0.20.x-logging-r2-spec.md (751 lines) | round 2 closure ACCEPT (7/7 critic r1 + 7/7 r2 traceability + 3 LOW non-blocking) |
| Plan | docs/superpowers/plans/2026-05-28-v0.20.x-logging-r2-plan.md (576 lines) | closure ACCEPT (6/6 step + 13/13 AC + G1/G2/G3 plan-level resolve) |
| Executor | 6 commits (Enhancement 1-4 + fmt) | DONE, 13/13 verifier rows green |

Key surface changes

Enhancement 1: image_width / image_height capture

  • The image_width / image_height fields of PdfOcrProgress::Finished and ingest_progress.v1.pdf_ocr_finished were hardcoded to null in round 1.
  • Dimensions are now extracted by decoding the raster JPEG bytes with image::ImageReader.
  • Added "jpeg" to the image crate features in crates/kebab-app/Cargo.toml (F3 closure finding).
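
To illustrate what capturing dimensions from raster bytes involves, here is a standard-library-only sketch that reads width and height from a JPEG start-of-frame header. It is not the PR's code: the PR decodes with image::ImageReader, and the name jpeg_dimensions is hypothetical. The sketch returns None when no frame header is found.

```rust
/// Hypothetical helper: read (width, height) from a JPEG byte stream by
/// scanning for a start-of-frame (SOF) segment. Stand-in for the PR's
/// image::ImageReader-based decode; it does not decode pixel data.
pub fn jpeg_dimensions(bytes: &[u8]) -> Option<(u32, u32)> {
    // Every JPEG starts with the SOI marker FF D8.
    if bytes.len() < 4 || bytes[0] != 0xFF || bytes[1] != 0xD8 {
        return None;
    }
    let mut i = 2;
    while i + 4 <= bytes.len() {
        if bytes[i] != 0xFF {
            return None; // not positioned on a marker: malformed input
        }
        let marker = bytes[i + 1];
        match marker {
            0xFF => { i += 1; continue; }               // fill byte before a marker
            0x01 | 0xD0..=0xD7 => { i += 2; continue; } // standalone markers, no length
            0xD9 | 0xDA => return None,                 // end of image or scan data came first
            _ => {}
        }
        // Segment length is big-endian and includes its own two bytes.
        let len = u16::from_be_bytes([bytes[i + 2], bytes[i + 3]]) as usize;
        if len < 2 {
            return None;
        }
        // SOF markers are C0..=CF except C4 (DHT), C8 (JPG) and CC (DAC).
        let is_sof = (0xC0..=0xCF).contains(&marker) && !matches!(marker, 0xC4 | 0xC8 | 0xCC);
        if is_sof {
            // Payload layout: precision (1 byte), height (2 bytes), width (2 bytes).
            let p = i + 4;
            if p + 5 > bytes.len() {
                return None;
            }
            let height = u16::from_be_bytes([bytes[p + 1], bytes[p + 2]]) as u32;
            let width = u16::from_be_bytes([bytes[p + 3], bytes[p + 4]]) as u32;
            return Some((width, height));
        }
        i += 2 + len;
    }
    None
}
```

A caller would map the Option directly onto the two nullable fields, so a failed header read simply leaves them unset rather than aborting the OCR event.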

Enhancement 2: V008 SQLite mirror

  • New pdf_ocr_events table: one persistent record per OCR call.
  • columns: run_id, ts, doc_id, doc_path, page, image_byte_size, image_width, image_height, ms, chars, success, reason, ocr_engine.
  • 3 indices (doc_id, run_id, ts).
  • Dual-write hook in pdf_ocr_apply.rs: each event is written to the ndjson file and inserted into SQLite together.
  • Arc::clone(&app.store) pattern (avoids opening the store twice; G1 closure finding).
  • canonical.doc_id is captured up front to avoid NULL values (F1 critic finding).
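
The dual-write commit message states the error policy: the file write is durable (errors propagate), while the SQLite insert is non-critical (a warning is logged and ingest continues). A minimal sketch of that policy, assuming a hypothetical dual_write function, a closure standing in for the SQLite insert, and eprintln! standing in for tracing::warn so the example needs only the standard library:

```rust
use std::io::{self, Write};

/// Hypothetical sketch of the dual-write policy: the ndjson line is the
/// durable record, the mirror insert is best-effort.
/// Returns Ok(true) if both writes succeeded, Ok(false) if only the file did.
pub fn dual_write<W, F>(file: &mut W, mirror_insert: F, line: &str) -> io::Result<bool>
where
    W: Write,
    F: FnOnce(&str) -> Result<(), String>,
{
    // Durable path: a file error is returned to the caller via `?`.
    file.write_all(line.as_bytes())?;
    file.write_all(b"\n")?;

    // Non-critical path: report the failure and keep going.
    match mirror_insert(line) {
        Ok(()) => Ok(true),
        Err(err) => {
            eprintln!("warn: mirror insert failed, continuing: {err}");
            Ok(false)
        }
    }
}
```

Writing the file first means a mirror failure can never lose an event: the ndjson line is already on its way to disk before the insert is attempted.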

Enhancement 3: CLI inspect commands

  • kebab inspect ocr-stats --json → ocr_stats.v1:
    { "total_events", "success_count", "failure_count", "success_rate",
      "p50_ms", "p90_ms", "p99_ms", "max_ms", "by_engine", "by_doc" }
    
  • kebab inspect ocr-failures --doc-id <id> --json → ocr_failures.v1 (single doc).
  • kebab inspect ocr-failures --json (no doc-id) → corpus-wide failures.
  • *_with_config facade rule (G2 closure finding).
  • Wire schema additive minor: new docs/wire-schema/v1/ocr_stats.schema.json and ocr_failures.schema.json; 2 entries added to wire.schemas in schema.schema.json; integrations/claude-code/kebab/SKILL.md synced.
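
The p50_ms, p90_ms, p99_ms and max_ms fields of ocr_stats.v1 are percentiles over the recorded per-call ms values. A minimal sketch of such a computation follows. It assumes the nearest-rank method, since the PR does not say which method ingest_log::percentiles uses, and the signature shown here is hypothetical.

```rust
/// Hypothetical sketch: (p50, p90, p99, max) over latency samples in ms,
/// using the nearest-rank definition (an assumption, not taken from the PR).
/// Returns None for an empty sample set.
pub fn percentiles(samples: &[u64]) -> Option<(u64, u64, u64, u64)> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();

    // Nearest rank: the smallest 1-based rank r with r >= p/100 * n,
    // computed with integer arithmetic as ceil(p * n / 100).
    let pick = |p: usize| -> u64 {
        let rank = (p * sorted.len() + 99) / 100;
        sorted[rank.max(1) - 1]
    };

    let max = sorted[sorted.len() - 1];
    Some((pick(50), pick(90), pick(99), max))
}
```

With nearest rank every reported value is an actual observed sample, which keeps the numbers easy to cross-check against individual rows in the event table.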

Enhancement 4: Log retention

  • [logging] keep_recent_runs: u32 default 100 (keeps the newest N files).
  • [logging] retention_days: u32 default 30 (keeps files modified within the last N days).
  • OR-on-stale semantics (a file is deleted if it is stale under either rule).
  • file cleanup helper in IngestLogWriter::open.
  • SQLite cleanup helper SqliteStore::prune_pdf_ocr_events(retention_days).
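
The retention commit message gives the rule as: delete iff (idx >= keep_recent) OR (modified <= cutoff), with files ordered newest first. A minimal sketch of that selection, assuming a hypothetical LogFile struct and select_stale function; it only picks the files to delete and performs no file I/O:

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical view of one existing ingest log file.
#[derive(Debug, Clone, PartialEq)]
pub struct LogFile {
    pub name: String,
    pub modified: SystemTime,
}

/// Return the names of files to delete under OR-on-stale semantics:
/// a file is stale if it falls outside the newest `keep_recent_runs`
/// files OR its modification time is at or before the retention cutoff.
pub fn select_stale(
    mut files: Vec<LogFile>,
    keep_recent_runs: usize,
    retention_days: u64,
    now: SystemTime,
) -> Vec<String> {
    // Newest first, so the index doubles as the recency rank.
    files.sort_by(|a, b| b.modified.cmp(&a.modified));

    let window = Duration::from_secs(retention_days.saturating_mul(86_400));
    // If the subtraction is not representable, fall back to the epoch.
    let cutoff = now.checked_sub(window).unwrap_or(SystemTime::UNIX_EPOCH);

    files
        .into_iter()
        .enumerate()
        .filter(|(idx, f)| *idx >= keep_recent_runs || f.modified <= cutoff)
        .map(|(_, f)| f.name)
        .collect()
}
```

Because the two conditions are OR-ed, either limit alone can trigger deletion: a burst of runs is trimmed by count even if every file is recent, and an old file is removed even if only a few files exist.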

Wire schema additive minor

  • New ocr_stats.v1 and ocr_failures.v1 schemas: 2 entries added to the schema.v1.wire.schemas enum.
  • The image_width / image_height fields of ingest_progress.v1.pdf_ocr_finished were already added additively in v0.20.0 (PR #189); this PR only fills them with Some(...) values instead of null.
  • Backward compatible: old consumers can ignore the unknown schemas.

D5 dogfood verification (R6, fresh KB)

  • [logging] keep_recent_runs = 100, retention_days = 30 config default emit ✓.
  • V008 migration succeeds and the pdf_ocr_events table exists ✓.
  • kebab inspect ocr-stats --json → ocr_stats.v1 schema emit ✓.
  • kebab inspect ocr-failures --json → ocr_failures.v1 schema emit ✓.
  • per-ingest-run log file (round 1 round trip) + dual-write to SQLite ✓.

Workspace test + clippy

  • cargo test --workspace --no-fail-fast -j 1 → all tests pass.
  • cargo clippy --workspace --all-targets -- -D warnings → exit 0.

🤖 Generated with Claude Code

altair823 added 7 commits 2026-05-28 08:11:06 +00:00
Add extract_image_dimensions(bytes) helper using image::ImageReader
and fill the 2 PdfOcrProgress::Finished emit points in pdf_ocr_apply.rs
where page_image_bytes is in scope (OCR error path + success path).
The no-DCTDecode skip path leaves None as page_image_bytes is absent.
Result: LogEvent::Ocr carries non-null image_width/image_height on
successful raster decode, enabling future size-conditioned timeout tuning.

Closure r1 F3: kebab-app/Cargo.toml image features += "jpeg" added as
direct [dependencies] entry (not relying on feature unification via
kebab-parse-image).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add migrations/V008__pdf_ocr_events.sql with the events table + 3
indices (doc_id, run_id, ts). SqliteStore gains two pub fn:
record_pdf_ocr_event (insert one OCR sample) and prune_pdf_ocr_events
(delete rows older than retention_days; returns the affected row
count). Both follow the existing Mutex<Connection> lock pattern.

Wiring into ingest path lands in the next commit.

Closure r1 F2: explicit lock acquisition in both methods.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pre-capture canonical.doc_id and Arc<SqliteStore> before the OCR
emit_progress closure so both the ndjson file and the SQLite mirror
carry the same doc_id for every event. File write is durable
(errors propagate); SQLite insert is non-critical (tracing::warn on
failure, ingest does not abort) per spec R-1.

LogEvent::Ocr gains a doc_id: Option<&str> field as an additive
Serde change — round 1 ndjson logs deserialize with doc_id=None.

Closure r1 F1: doc_id NULL in dual-write resolved via
let doc_id_for_log = canonical.doc_id.0.clone() pre-capture.
Closure r2 G1: Arc::clone(&app.sqlite) reused instead of opening a
second SqliteStore — eliminates double-open lock contention and
duplicate migration runs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two new wire schemas land as additive minor: ocr_stats.v1 (corpus-wide
aggregate — total_events, success_rate, p50/p90/p99/max_ms, by_engine,
top-10 by_doc by failure count) and ocr_failures.v1 (per-doc or
corpus-wide recent failures, with --doc-id + --limit). Both ship via
new CLI subcommands `kebab inspect ocr-stats` / `inspect ocr-failures`.

App gains four facade methods: inspect_ocr_stats /
inspect_ocr_failures plus their *_with_config companions — required by
CLAUDE.md "the facade rule" so `--config <path>` is honored. The CLI
dispatch arms thread cfg explicitly into the _with_config form.

Runtime introspection emit (WIRE_SCHEMAS in schema.rs) gains two
entries; the meta JSON Schema (schema.schema.json) is untouched
because its wire.schemas is pattern-based, not enum-based.

ingest_log::percentiles extended to (p50, p90, p99, max). p99 surfaces
only via inspect ocr-stats; IngestSummary (round 1) stays 3-percentile.

SKILL.md synced with the two new schemas (AC-13).

Closure r2 G2 (facade *_with_config pair) + G3 (runtime emit, not
meta schema file) + closure r1 F4 (p99) resolved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
LoggingCfg gains two fields with serde defaults: keep_recent_runs
(default 100, top-N file retention) and retention_days (default 30,
time-based retention for both ndjson files and the SQLite mirror).

IngestLogWriter::open now runs cleanup_old_logs before creating a new
ingest-*.ndjson — delete iff (idx >= keep_recent) OR (modified <=
cutoff). ingest_with_config_opts also calls
SqliteStore::prune_pdf_ocr_events(retention_days) at ingest start so
the SQLite mirror tracks the same retention window.

Backward compat (AC-9): both new fields use #[serde(default = ...)],
so a pre-v0.20.x config with only [logging] ingest_log_enabled +
ingest_log_dir parses unchanged. kebab init writes the new defaults
automatically via Config::default() -> toml::to_string_pretty (AC-12).

docs/SMOKE.md config example synced.

Closure r1 F5: explicit OR-on-stale comment inside cleanup_old_logs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Round artifacts for logging round 2 (4 enhancements: image_w/h + V008 SQLite mirror + CLI inspect + retention), committed after spec/plan ACCEPT.

- spec: 751 lines (ACCEPT, 7/7 critic round 1 findings + 7/7 closure r2 traceability)
- plan: 576 lines (ACCEPT, 6/6 step + 13/13 AC + G1/G2/G3 plan-level resolve)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
altair823 merged commit 7bbdc89ae3 into main 2026-05-28 08:12:27 +00:00
altair823 deleted branch feat/ingest-log-round2-enhancements 2026-05-28 08:12:29 +00:00

Reference: altair823-org/kebab#190