feat: ingest log round 2 — image_w/h + V008 SQLite mirror + CLI inspect + retention #190

Merged
altair823 merged 7 commits from feat/ingest-log-round2-enhancements into main 2026-05-28 08:12:27 +00:00

7 Commits

Author SHA1 Message Date
7c24734cc7 docs(superpowers): v0.20.x logging r2 spec + plan artifacts
Round artifacts for logging round 2 (4 enhancements: image_w/h + V008 SQLite mirror + CLI inspect + retention), committed after the spec and plan were ACCEPTed.

- spec: 751 lines (ACCEPT, 7/7 critic round 1 findings + 7/7 closure r2 traceability)
- plan: 576 lines (ACCEPT, 6/6 steps + 13/13 ACs + G1/G2/G3 resolved at plan level)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 08:04:32 +00:00
9a36a06f97 style: cargo fmt --all (v0.20.x logging r2 feature follow-up)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:34:01 +00:00
35c987df1c feat(app): log retention — keep_recent_runs + retention_days (Enhancement 4)
LoggingCfg gains two fields with serde defaults: keep_recent_runs
(default 100, top-N file retention) and retention_days (default 30,
time-based retention for both ndjson files and the SQLite mirror).

IngestLogWriter::open now runs cleanup_old_logs before creating a new
ingest-*.ndjson — delete iff (idx >= keep_recent) OR (modified <=
cutoff). ingest_with_config_opts also calls
SqliteStore::prune_pdf_ocr_events(retention_days) at ingest start so
the SQLite mirror tracks the same retention window.
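
The deletion rule above can be sketched as a pure function. This is a minimal illustration, not the actual cleanup_old_logs: the LogFile struct, the select_for_deletion name, and the newest-first ordering behind idx are assumptions made for the example.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical stand-in for one ingest-*.ndjson file on disk.
pub struct LogFile {
    pub name: String,
    pub modified: SystemTime,
}

/// Returns the names of the files the retention rule would delete.
/// Assumption: idx counts from the newest file, so idx 0 is the latest run.
pub fn select_for_deletion(
    mut files: Vec<LogFile>,
    keep_recent: usize,
    retention_days: u64,
    now: SystemTime,
) -> Vec<String> {
    // Sort newest first so the index reflects recency.
    files.sort_by(|a, b| b.modified.cmp(&a.modified));

    // Anything modified at or before this instant counts as stale.
    let cutoff = now
        .checked_sub(Duration::from_secs(retention_days * 86_400))
        .unwrap_or(SystemTime::UNIX_EPOCH);

    files
        .into_iter()
        .enumerate()
        // Delete iff beyond the top-N window OR stale; either condition suffices.
        .filter(|(idx, f)| *idx >= keep_recent || f.modified <= cutoff)
        .map(|(_, f)| f.name)
        .collect()
}
```

Because the two conditions are joined with OR, a stale file is removed even while it sits inside the top-N window, and a fresh file is removed once it falls outside that window.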

Backward compat (AC-9): both new fields use #[serde(default = ...)],
so a pre-v0.20.x config with only [logging] ingest_log_enabled +
ingest_log_dir parses unchanged. kebab init writes the new defaults
automatically via Config::default() -> toml::to_string_pretty (AC-12).

docs/SMOKE.md config example synced.

Closure r1 F5: explicit OR-on-stale comment inside cleanup_old_logs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:17:47 +00:00
d9ec7b8dc3 feat(cli): kebab inspect ocr-stats + ocr-failures (Enhancement 3 + wire schema additive minor)
Two new wire schemas land as additive minor: ocr_stats.v1 (corpus-wide
aggregate — total_events, success_rate, p50/p90/p99/max_ms, by_engine,
top-10 by_doc by failure count) and ocr_failures.v1 (per-doc or
corpus-wide recent failures, with --doc-id + --limit). Both ship via
new CLI subcommands `kebab inspect ocr-stats` / `inspect ocr-failures`.

App gains four facade methods: inspect_ocr_stats /
inspect_ocr_failures plus their *_with_config companions — required by
CLAUDE.md "the facade rule" so `--config <path>` is honored. The CLI
dispatch arms thread cfg explicitly into the _with_config form.

Runtime introspection emit (WIRE_SCHEMAS in schema.rs) gains two
entries; the meta JSON Schema (schema.schema.json) is untouched
because its wire.schemas is pattern-based, not enum-based.

ingest_log::percentiles extended to (p50, p90, p99, max). p99 surfaces
only via inspect ocr-stats; IngestSummary (round 1) stays 3-percentile.
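
A four-value percentile helper of this shape might look like the sketch below. The signature, the Option return for empty input, and the nearest-rank method are all assumptions; the commit does not say how ingest_log::percentiles computes its values.

```rust
/// Nearest-rank percentile over a non-empty, ascending-sorted slice.
fn nearest_rank(sorted: &[u64], pct: usize) -> u64 {
    let n = sorted.len();
    // rank = ceil(pct / 100 * n), computed in integers and clamped to 1..=n.
    let rank = (pct * n + 99) / 100;
    sorted[rank.clamp(1, n) - 1]
}

/// Hypothetical counterpart of the extended helper: (p50, p90, p99, max) in ms.
/// Returns None when there are no samples.
pub fn percentiles(samples_ms: &[u64]) -> Option<(u64, u64, u64, u64)> {
    if samples_ms.is_empty() {
        return None;
    }
    let mut sorted = samples_ms.to_vec();
    sorted.sort_unstable();
    let max = *sorted.last()?;
    Some((
        nearest_rank(&sorted, 50),
        nearest_rank(&sorted, 90),
        nearest_rank(&sorted, 99),
        max,
    ))
}
```

A caller that only needs three percentiles, as IngestSummary does, can ignore the p99 element of the tuple.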

SKILL.md synced with the two new schemas (AC-13).

Closure r2 G2 (facade *_with_config pair) + G3 (runtime emit, not
meta schema file) + closure r1 F4 (p99) resolved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:13:08 +00:00
4e451c9f7c feat(app): dual-write PDF OCR events to SQLite + ndjson (Enhancement 2 wiring)
Pre-capture canonical.doc_id and Arc<SqliteStore> before the OCR
emit_progress closure so both the ndjson file and the SQLite mirror
carry the same doc_id for every event. File write is durable
(errors propagate); SQLite insert is non-critical (tracing::warn on
failure, ingest does not abort) per spec R-1.
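
The asymmetric error handling can be sketched as follows. The Mirror trait, MemMirror, and dual_write are hypothetical names for illustration, and eprintln! stands in for the tracing::warn call mentioned above.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for the SQLite mirror insert.
pub trait Mirror {
    fn insert(&mut self, line: &str) -> Result<(), String>;
}

/// In-memory mirror for the example; `fail` simulates a broken store.
pub struct MemMirror {
    pub rows: Vec<String>,
    pub fail: bool,
}

impl Mirror for MemMirror {
    fn insert(&mut self, line: &str) -> Result<(), String> {
        if self.fail {
            return Err("mirror unavailable".to_string());
        }
        self.rows.push(line.to_string());
        Ok(())
    }
}

/// Writes one event line to the file, then mirrors it best-effort.
/// Returns Ok(true) if the mirror accepted the row, Ok(false) if it did not.
pub fn dual_write<W: Write, M: Mirror>(
    file: &mut W,
    mirror: &mut M,
    line: &str,
) -> io::Result<bool> {
    // Durable path: a file error propagates to the caller via `?`.
    writeln!(file, "{line}")?;

    // Non-critical path: log the failure and carry on with the ingest.
    match mirror.insert(line) {
        Ok(()) => Ok(true),
        Err(e) => {
            eprintln!("warn: mirror insert failed: {e}");
            Ok(false)
        }
    }
}
```

Writing the file first means a mirror failure can never cost an event from the ndjson log, which stays the source of truth.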

LogEvent::Ocr gains a doc_id: Option<&str> field as an additive
Serde change — round 1 ndjson logs deserialize with doc_id=None.

Closure r1 F1: doc_id NULL in dual-write resolved via
let doc_id_for_log = canonical.doc_id.0.clone() pre-capture.
Closure r2 G1: Arc::clone(&app.sqlite) reused instead of opening a
second SqliteStore — eliminates double-open lock contention and
duplicate migration runs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:06:03 +00:00
6482bf1321 feat(store): V008 pdf_ocr_events migration + record/prune API (Enhancement 2)
Add migrations/V008__pdf_ocr_events.sql with the events table + 3
indices (doc_id, run_id, ts). SqliteStore gains two pub fns:
record_pdf_ocr_event (insert one OCR sample) and prune_pdf_ocr_events
(delete rows older than retention_days; returns the affected row
count). Both follow the existing Mutex<Connection> lock pattern.

Wiring into ingest path lands in the next commit.

Closure r1 F2: explicit lock acquisition in both methods.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 05:56:54 +00:00
5977c8cdf1 feat(app): capture image_width/height in PDF OCR raster decode (Enhancement 1)
Add extract_image_dimensions(bytes) helper using image::ImageReader
and fill the 2 PdfOcrProgress::Finished emit points in pdf_ocr_apply.rs
where page_image_bytes is in scope (OCR error path + success path).
The no-DCTDecode skip path leaves both fields as None because page_image_bytes is absent there.
Result: LogEvent::Ocr carries non-null image_width/image_height on
successful raster decode, enabling future size-conditioned timeout tuning.

Closure r1 F3: kebab-app/Cargo.toml now declares the image crate with the
"jpeg" feature as a direct [dependencies] entry (instead of relying on
feature unification via kebab-parse-image).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 05:54:55 +00:00