feat(cli): kebab inspect ocr-stats + ocr-failures (Enhancement 3 + wire schema additive minor)

Two new wire schemas land as additive minor: ocr_stats.v1 (corpus-wide
aggregate — total_events, success_rate, p50/p90/p99/max_ms, by_engine,
top-10 by_doc by failure count) and ocr_failures.v1 (per-doc or
corpus-wide recent failures, with --doc-id + --limit). Both ship via
new CLI subcommands `kebab inspect ocr-stats` / `inspect ocr-failures`.

App gains four facade methods: inspect_ocr_stats /
inspect_ocr_failures plus their *_with_config companions — required by
CLAUDE.md "the facade rule" so `--config <path>` is honored. The CLI
dispatch arms thread cfg explicitly into the _with_config form.
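
A minimal sketch of the facade pairing described above, for readers unfamiliar
with the rule. Everything here except the two method names is hypothetical: the
`Config` type, its `db_path` field, `load_default`, and the `String` return type
are stand-ins, and the assumption that the plain form falls back to a default
config is not confirmed by this commit.

```rust
// Hypothetical stand-in for the real config type (not shown in this commit).
#[derive(Debug, Clone, PartialEq)]
struct Config {
    db_path: String,
}

impl Config {
    // Assumed fallback used when no `--config <path>` is given.
    fn load_default() -> Config {
        Config { db_path: "default.db".to_string() }
    }
}

// Hypothetical stand-in for the App facade.
struct App;

impl App {
    // Plain form: assumed here to delegate with a default config.
    fn inspect_ocr_stats(&self) -> String {
        self.inspect_ocr_stats_with_config(&Config::load_default())
    }

    // `_with_config` form: the caller (e.g. a CLI dispatch arm) passes the
    // config it resolved, so an explicit `--config <path>` is honored.
    fn inspect_ocr_stats_with_config(&self, cfg: &Config) -> String {
        format!("ocr_stats.v1 from {}", cfg.db_path)
    }
}
```

Under these assumptions, a dispatch arm that always calls the `_with_config`
form cannot silently ignore a user-supplied config path.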

The runtime introspection output (WIRE_SCHEMAS in schema.rs) gains two
entries; the meta JSON Schema (schema.schema.json) is untouched
because its wire.schemas field is validated by pattern, not by enum.

ingest_log::percentiles is extended to return (p50, p90, p99, max). p99
surfaces only via inspect ocr-stats; IngestSummary (round 1) stays
3-percentile.
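
For illustration, a four-value percentile helper could look like the sketch
below. This is not the code in ingest_log: the signature, the `u64`
millisecond samples, the `Option` return, and the nearest-rank method are all
assumptions chosen to keep the example short.

```rust
/// Hypothetical sketch: nearest-rank percentiles over durations in ms.
/// Returns (p50, p90, p99, max), or None when there are no samples.
fn percentiles(samples: &[u64]) -> Option<(u64, u64, u64, u64)> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let n = sorted.len();
    // Nearest-rank: rank = ceil(p / 100 * n), 1-indexed, clamped to [1, n].
    let pick = |p: usize| -> u64 {
        let rank = (p * n + 99) / 100;
        sorted[rank.clamp(1, n) - 1]
    };
    Some((pick(50), pick(90), pick(99), sorted[n - 1]))
}
```

With nearest-rank, every returned value is an actual observed sample, so p99
never exceeds max even for very small inputs.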

SKILL.md synced with the two new schemas (AC-13).

Closure r2 G2 (facade *_with_config pair) + G3 (runtime emit, not
meta schema file) + closure r1 F4 (p99) resolved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:13:08 +00:00
parent 4e451c9f7c
commit d9ec7b8dc3
9 changed files with 476 additions and 7 deletions


@@ -146,6 +146,7 @@ Claude Code spawns `kebab mcp` at session start; the process stays alive across
- `search_response.v1.truncated = true` means budget forced snippet shortening or k reduction. Independent of `next_cursor`: widen `max_tokens` for fuller snippets, follow `next_cursor` for more hits, or both.
- `ask`'s `citations[]` mirrors `search_hit.v1` minus retrieval internals — same `doc_path` / `citation` shape.
- Schema reference lives in the kebab repo at `docs/wire-schema/v1/*.schema.json` if a field is unclear. v0.20.x additive minor: `ingest_progress.v1` `pdf_ocr_finished` events carry 4 optional new fields (`image_byte_size`, `image_width`, `image_height`, `failure_reason`) — absent on pre-v0.20 events (backward compat).
- v0.20.x r2 additive minor: `ocr_stats.v1` — corpus-wide OCR statistics (`total_events`, `success_rate`, `p50/p90/p99/max_ms`, `by_engine`, top-10 `by_doc` by failure count) emitted by `kebab inspect ocr-stats --json`. `ocr_failures.v1` — per-doc or corpus-wide recent failure list (`failure_count`, `failures[]` with `ts`, `page`, `ms`, `reason`, `image_byte_size`) emitted by `kebab inspect ocr-failures [--doc-id <id>] [--limit N] --json`.
- `search_hit.v1` and `answer.v1.citations[]` carry `indexed_at` (RFC3339) + `stale` (bool). When `stale == true`, the source doc hasn't been re-processed since `config.search.stale_threshold_days`. Surface this caveat to the user when summarizing — the cited snapshot may not reflect current reality.
## Capability discovery