Final commit. Bulk update of the word `kb` across all .md files.

- 19 crate names (`kb-core`, `kb-app`, …) → `kebab-*` (including the Rust module path notation `kb_*` → `kebab_*`).
- Future components (`kb-tui`, `kb-desktop`, `kb-asr-whisper`, `kb-ocr`, `kb-mcp`, `kb-vlm`, `kb-rerank`, `kb-vision-ocr`, `kb-index`, `kb-smoke`, `kb-architecture`) → `kebab-*` (the same prefix will be used once P6+ starts).
- CLI command examples: `kb ingest` / `kb search` / `kb ask` / `kb init` / `kb doctor` / `kb inspect` / `kb list` / `kb eval` → `kebab <verb>`, in both fenced code blocks and inline backticks.
- XDG paths, env vars, and binary paths (`target/release/kb` → `target/release/kebab`) synchronized.
- All references unified across the design doc / initial report / SMOKE / HOTFIXES / phase epics / task specs.
- `git -c user.name=kb` in task-decomposition.md is author information that records past git history, so it is kept as-is (the author in the actual git history cannot be changed).
- `status: planned` → `completed` in `tasks/phase-5-evaluation.md` as well (left unapplied after the P5-1 + P5-2 PRs were merged).

## Verification

- `grep -rEn "\bkb-[a-z]|\bkb_[a-z]|\.config/kb\b|kb\.sqlite|\bKB_[A-Z]" --include="*.md"` returns 0 hits (excluding the git author in task-decomposition.md).
- All file path references resolve (every renamed file is updated to its new path).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
212 lines
11 KiB
Markdown
---
phase: P5
component: kebab-eval (metrics + compare)
task_id: p5-2
title: "Metrics computation + compare report"
status: completed
depends_on: [p5-1]
unblocks: []
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
contract_sections: [§5.7 eval_runs.aggregate_json, phase epic tasks/phase-5-evaluation.md]
---

# p5-2 — Metrics + compare

## Goal

Compute hit@k, MRR, recall@k_doc, citation_coverage, groundedness, empty_result_rate, refusal_correctness from stored `eval_query_results`. Write `aggregate_json` back into `eval_runs`. Provide `kebab eval compare a b` that diffs two runs.

## Why now / why this size

Metric formulas + comparison logic are pure computation. Splitting them from p5-1 keeps the runner simple and lets us re-compute metrics over historical runs as formulas evolve.

## Allowed dependencies

- `kebab-core`
- `kebab-config`
- `kebab-store-sqlite` (read eval rows, write `aggregate_json`)
- `serde`, `serde_json`
- `tracing`
- `thiserror`

## Forbidden dependencies

- `kebab-app`, `kebab-source-fs`, `kebab-parse-md`, `kebab-normalize`, `kebab-chunk`, `kebab-store-vector`, `kebab-embed*`, `kebab-search`, `kebab-llm*`, `kebab-rag`, `kebab-tui`, `kebab-desktop`

## Inputs

| input | type | source |
|-------|------|--------|
| `eval_query_results` rows | SQLite | from p5-1 |
| `eval_runs` row | SQLite | from p5-1 |
| `GoldenQuery[..]` | `Vec<GoldenQuery>` | re-loaded for `expected_*` and `must_contain` |

## Outputs

| output | type | downstream |
|--------|------|------------|
| `eval_runs.aggregate_json` updated | SQLite | history, CI checks |
| `CompareReport` | `kebab_eval::CompareReport` | `kebab-cli` printer |
| optional `runs_dir/<run_id>/report.md` | filesystem | human-readable summary |

## Public surface (signatures only — no new types)

```rust
pub struct AggregateMetrics {
    pub hit_at_k: std::collections::BTreeMap<u32, f32>, // k → hit@k
    pub mrr: f32,
    pub recall_at_k_doc: std::collections::BTreeMap<u32, f32>,
    pub citation_coverage: f32,
    pub groundedness: f32,
    pub empty_result_rate: f32,
    pub refusal_correctness: f32,
    pub total_queries: u32,
    pub failed_queries: u32,
}

pub struct CompareReport {
    pub run_a: String,
    pub run_b: String,
    pub aggregate_a: AggregateMetrics,
    pub aggregate_b: AggregateMetrics,
    pub deltas: serde_json::Value, // per-metric delta
    pub per_query: Vec<QueryComparison>,
}

pub struct QueryComparison {
    pub query_id: String,
    pub kind: ComparisonKind, // Win | Loss | Draw | Regression
    pub a_hit_rank: Option<u32>,
    pub b_hit_rank: Option<u32>,
    pub note: Option<String>,
}

pub enum ComparisonKind { Win, Loss, Draw, Regression }

pub fn compute_aggregate(run_id: &str) -> anyhow::Result<AggregateMetrics>;
pub fn store_aggregate(run_id: &str, agg: &AggregateMetrics) -> anyhow::Result<()>;
pub fn compare_runs(run_id_a: &str, run_id_b: &str) -> anyhow::Result<CompareReport>;
pub fn render_report_md(report: &CompareReport) -> String;
```
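
A minimal sketch of how the per-metric deltas behind `CompareReport.deltas` could be computed, using only the standard library. The helper names `delta` and `deltas_by_k` are hypothetical and not part of the spec surface. The `b - a` direction follows the behavior contract; returning `None` for an undefined side and skipping `k` values missing from either run are assumptions.

```rust
use std::collections::BTreeMap;

/// Delta `b - a` for a scalar metric (hypothetical helper).
/// Assumption: an undefined metric on either side gives an undefined delta.
pub fn delta(a: Option<f32>, b: Option<f32>) -> Option<f32> {
    match (a, b) {
        (Some(a), Some(b)) => Some(b - a),
        _ => None,
    }
}

/// Delta `b - a` for k-keyed metrics such as hit@k (hypothetical helper).
/// Assumption: a `k` present in only one run is skipped.
pub fn deltas_by_k(a: &BTreeMap<u32, f32>, b: &BTreeMap<u32, f32>) -> BTreeMap<u32, f32> {
    a.iter()
        .filter_map(|(k, va)| b.get(k).map(|vb| (*k, vb - va)))
        .collect()
}
```

A positive delta then means run b scored higher than run a on that metric.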

## Behavior contract

- `hit@k` for k ∈ {1, 3, 5, 10}: a query is a hit if any of its `expected_chunk_ids` appears in the run's top-k for that query (chunk-level). Aggregate = mean across queries with non-empty `expected_chunk_ids`.
- `MRR`: 1 / rank-of-first-correct-chunk; 0 if not found in top-10. Aggregate = mean across applicable queries.
- `recall@k_doc` for k ∈ {1, 3, 5, 10}: fraction of `expected_doc_ids` covered by the top-k hits' `doc_id`s, averaged across applicable queries.
- `citation_coverage`: fraction of RAG answers where every `Answer.citations[*].citation` resolves to a real chunk in the DB. Denominator = grounded RAG answers; if zero → metric is `NaN` and reported as `null` in JSON.
- `groundedness`: fraction of RAG answers where ALL `must_contain` strings appear AND no `forbidden` string appears. Denominator = RAG answers (excluding errors).
- `empty_result_rate`: fraction of queries returning zero `hits_top_k`.
- `refusal_correctness`: fraction of queries with `expected_doc_ids = []` (i.e., should refuse) that the system actually refused (`Answer.grounded == false`). Denominator = queries marked as "should refuse"; if zero → `null`.
- All metrics are rounded to 4 decimal places for storage.
- `compare_runs`:
  - Per-metric delta (`b - a`).
  - Per-query: `Win` if b found the correct chunk and a did not; `Loss` is the opposite; `Draw` if both found it at the same rank; `Regression` if a hit but b missed for the same expected chunk.
  - `note` may explain known causes (chunker version diff, embedding diff, prompt diff).
  - **Cross-version chunk_id matching is graceful, not a refusal.** When `chunker_version_a != chunker_version_b` the chunk-level criterion would be unstable (chunk_ids are part of the key), so per-query matching falls back to *doc_id + span overlap*: a hit counts if the run's top-k contains any chunk whose `doc_id` matches an expected `doc_id` AND whose `source_spans` overlap by at least 50% with one of the expected chunks' spans. The `CompareReport.deltas` JSON includes a top-level `"chunker_version_match": "exact" | "fallback_doc_span"` so consumers can see which mode was used. Pass `--strict-chunker-version` to revert to the old behavior (refuse). The default is graceful so that iterating on the chunker stays a natural workflow.
- `render_report_md` produces a single Markdown file summarizing aggregate deltas plus a Wins/Losses/Regressions table. It is not a wire schema and is for human consumption only.
- `store_aggregate` updates `eval_runs.aggregate_json` (`UPDATE eval_runs SET aggregate_json = :json WHERE run_id = :id`).
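
The retrieval metrics in the list above can be sketched as follows. This is a minimal, standard-library-only illustration, not the crate's implementation: `QueryEval` and every function name here are hypothetical, and treating "applicable queries" as those with a non-empty expected list is an assumption carried over from the `hit@k` rule.

```rust
use std::collections::HashSet;

/// Per-query data needed for the retrieval metrics (hypothetical shape).
pub struct QueryEval {
    pub expected_chunk_ids: Vec<String>,
    pub expected_doc_ids: Vec<String>,
    /// Ranked hits as (chunk_id, doc_id), best first.
    pub hits: Vec<(String, String)>,
}

impl QueryEval {
    pub fn new(chunks: &[&str], docs: &[&str], hits: &[(&str, &str)]) -> Self {
        Self {
            expected_chunk_ids: chunks.iter().map(|s| s.to_string()).collect(),
            expected_doc_ids: docs.iter().map(|s| s.to_string()).collect(),
            hits: hits.iter().map(|(c, d)| (c.to_string(), d.to_string())).collect(),
        }
    }
}

/// 1-based rank of the first expected chunk within the top `k`, if any.
pub fn first_hit_rank(q: &QueryEval, k: usize) -> Option<usize> {
    q.hits
        .iter()
        .take(k)
        .position(|(chunk_id, _)| q.expected_chunk_ids.contains(chunk_id))
        .map(|i| i + 1)
}

/// Mean of the values; NaN when there are none (zero denominator).
fn mean(values: &[f32]) -> f32 {
    if values.is_empty() {
        return f32::NAN;
    }
    values.iter().sum::<f32>() / values.len() as f32
}

/// hit@k: share of applicable queries with an expected chunk in the top k.
pub fn hit_at_k(queries: &[QueryEval], k: usize) -> f32 {
    let scores: Vec<f32> = queries
        .iter()
        .filter(|q| !q.expected_chunk_ids.is_empty())
        .map(|q| if first_hit_rank(q, k).is_some() { 1.0 } else { 0.0 })
        .collect();
    mean(&scores)
}

/// MRR: mean of 1/rank of the first expected chunk, 0 outside the top 10.
pub fn mrr(queries: &[QueryEval]) -> f32 {
    let scores: Vec<f32> = queries
        .iter()
        .filter(|q| !q.expected_chunk_ids.is_empty())
        .map(|q| first_hit_rank(q, 10).map_or(0.0, |r| 1.0 / r as f32))
        .collect();
    mean(&scores)
}

/// recall@k_doc: mean fraction of expected docs covered by the top-k hits.
pub fn recall_at_k_doc(queries: &[QueryEval], k: usize) -> f32 {
    let scores: Vec<f32> = queries
        .iter()
        .filter(|q| !q.expected_doc_ids.is_empty())
        .map(|q| {
            let seen: HashSet<&str> =
                q.hits.iter().take(k).map(|(_, doc)| doc.as_str()).collect();
            let expected: HashSet<&str> =
                q.expected_doc_ids.iter().map(String::as_str).collect();
            let covered = expected.iter().filter(|d| seen.contains(*d)).count();
            covered as f32 / expected.len() as f32
        })
        .collect();
    mean(&scores)
}
```

Returning NaN from `mean` on an empty input mirrors the zero-denominator convention in the list; rounding and `null` serialization would happen later, at storage time.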

## Storage / wire effects

- Writes: `eval_runs.aggregate_json`, optional `runs_dir/<run_id>/report.md`.
- Reads: `eval_runs`, `eval_query_results`.

## Test plan

| kind | description | fixture / data |
|------|-------------|----------------|
| unit | hit@k computation on hand-rolled fixture | inline (3 queries, ranks {1, 4, miss}) |
| unit | MRR computation matches expected | inline |
| unit | recall@k_doc computation | inline |
| unit | citation_coverage with broken citation marks 0.0 | inline |
| unit | groundedness false when forbidden string appears | inline |
| unit | refusal_correctness 1.0 when all "should refuse" queries refused | inline |
| unit | NaN metrics (zero denominator) serialize as `null` in JSON | inline |
| unit | `compare_runs` per-query Win/Loss/Draw/Regression on synthetic ranks | inline |
| determinism | running `compute_aggregate` twice produces identical `AggregateMetrics` | inline |
| snapshot | `CompareReport` JSON for a fixed pair of runs stable | `fixtures/eval/compare-1.json` |

All tests under `cargo test -p kebab-eval metrics`.
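
As a worked check for the first unit row, assume all three fixture queries have non-empty `expected_chunk_ids` and first-correct ranks of 1, 4, and miss, as listed in the table. Reusing the same ranks for the MRR row is an assumption, since that row only says "inline". Under those assumptions the expected aggregates, rounded to 4 decimals, would be:

$$
\begin{aligned}
\text{hit@1} &= \tfrac{1}{3} \approx 0.3333, &\quad \text{hit@3} &= \tfrac{1}{3} \approx 0.3333,\\
\text{hit@5} &= \tfrac{2}{3} \approx 0.6667, &\quad \text{hit@10} &= \tfrac{2}{3} \approx 0.6667,\\
\text{MRR} &= \tfrac{1}{3}\left(\tfrac{1}{1} + \tfrac{1}{4} + 0\right) = \tfrac{5}{12} \approx 0.4167.
\end{aligned}
$$

The rank-4 query only starts counting as a hit at k = 5, which is why hit@3 and hit@5 differ.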

## Definition of Done

- [ ] `cargo check -p kebab-eval` passes
- [ ] `cargo test -p kebab-eval metrics` passes
- [ ] No imports outside Allowed dependencies
- [ ] `eval_runs.aggregate_json` always populated after `store_aggregate`
- [ ] `kebab eval compare` CLI surface integrated via `kebab-app` (call `compare_runs` + `render_report_md`)
- [ ] PR links phase epic tasks/phase-5-evaluation.md

## Out of scope

- LLM-as-judge groundedness.
- Cross-corpus evaluation.
- HTTP server / dashboards.
- Metric weighting strategies (MRR weighting, etc.).

## Risks / notes

- Floating-point sums in MRR can cause minor cross-platform drift; round to 4 decimals on storage to keep snapshots stable.
- "Should refuse" queries are encoded as `expected_doc_ids: []`. Document this convention in the golden YAML header comment.
- Chunker version drift across runs is the COMMON case, not the error case (you almost always re-chunk before evaluating a chunker change). The default behavior is graceful fallback (doc + span overlap); only `--strict-chunker-version` refuses. The `chunker_version_match` field in `CompareReport.deltas` makes the mode auditable, so silent miscompares are still impossible.
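
The rounding note above, together with the "zero denominator → `null`" convention from the behavior contract, can be sketched with the standard library alone. Both helper names (`round4`, `to_json_token`) are hypothetical, and the hand-rolled JSON token stands in for whatever serializer the crate actually uses.

```rust
/// Round a metric to 4 decimal places for storage (hypothetical helper).
/// NaN or infinite values, e.g. from a zero denominator, become `None`.
pub fn round4(value: f32) -> Option<f32> {
    if value.is_finite() {
        Some((value * 10_000.0).round() / 10_000.0)
    } else {
        None
    }
}

/// Render a stored metric as a JSON token: a number, or `null` for `None`.
pub fn to_json_token(metric: Option<f32>) -> String {
    match metric {
        Some(v) => v.to_string(),
        None => "null".to_string(),
    }
}
```

Rounding before serialization means two platforms that disagree in the last bits of a float sum still store the same 4-decimal value, provided the drift does not straddle a rounding boundary.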

## Implementation deviations (intentional)

Recorded so reviewers don't trip on them; the runtime behavior is the
same one this spec defines; the names / wiring just differ.

- **Graceful fallback is doc-id-only, not doc + 50% span overlap.** The
  `chunker_version_match` audit field is `"fallback_doc"` (not
  `"fallback_doc_span"`). Span-overlap requires reading both runs'
  `chunks.source_spans` simultaneously, but a chunker-version change
  in practice re-indexes (overwrites) the chunks table, so by the time
  you compute run B the run A chunk rows are already gone. Doc-id
  matching is the strongest stable criterion under that workflow.
  Span-overlap moves to a future phase that owns chunker-version
  archival.
- **Helper signatures.** `compute_aggregate_with_config(cfg, run_id)` /
  `store_aggregate_with_config(cfg, run_id, agg)` /
  `compare_runs_with_config(cfg, a, b, opts)` exist alongside the
  spec-pinned `compute_aggregate(run_id)` / `store_aggregate(run_id, agg)`
  / `compare_runs(a, b)` so integration tests can drive the pipeline
  against a TempDir-backed `Config`. The spec-pinned forms, which take
  no `cfg`, wrap them with `Config::load(None)`.
- **CLI surface lives on `kebab-cli` directly, not via `kebab-app`.** DoD
  asks for `kebab eval compare` to be reached "via kebab-app", but `kebab-app`
  already depends on `kebab-eval` (the P5-1 runner uses the App facade),
  so routing the CLI through `kebab-app` would form a cycle. `kebab-cli` →
  `kebab-eval` is wired directly; `kebab-app` is unchanged.
- **`AggregateMetrics` is `Serialize + Deserialize`.** The spec defines
  only the field shape; we add `Deserialize` so the stored
  `aggregate_json` can round-trip back into the type for follow-up
  computations.
- **`anyhow`** is used in `Result` returns since the rest of the
  workspace already speaks anyhow; it is not in the spec's Allowed list
  but matches every other crate.
- **`kebab-eval` crate-level `kebab-app` dep stays.** The crate already
  depends on `kebab-app` from P5-1 (the runner uses the `App` facade), so
  the Cargo.toml entry remains. The new modules (`metrics.rs`,
  `compare.rs`) do not import `kebab-app` themselves; they're behind the
  same crate boundary as the runner, but the metric/compare *surface*
  is `kebab-app`-clean. Splitting the crate to avoid a transitive Cargo
  edge would be churn for no behavior gain.
- **`citation_coverage` is intentionally weaker than the spec literal.**
  Spec calls for "every citation resolves to a real chunk in the DB".
  Current implementation: an Answer counts as fully covered iff it has
  ≥1 citation AND every citation's path is non-empty. Tightening to a
  per-citation `document_exists_by_path` SqliteStore probe is the next
  step once that helper lands. Answers with empty citations no longer
  pass via the vacuous truth of `Iterator::all`.
- **`refusal_correctness` is undefined for non-RAG runs.** The metric
  judges whether the system *refused*; without an `Answer` (lexical-only
  or vector-only run), there's nothing to judge. We exclude such
  queries from the denominator rather than auto-failing them, so a
  search-only run reports `refusal_correctness` as `null` instead of a
  misleading 0.0.
- **`groundedness` skips queries with no `must_contain`/`forbidden`.**
  An unconfigured golden entry would otherwise score a free 1.0 (or
  0.0 if the answer happens to contain a forbidden string from a
  later spec change). Refusal-class queries are also excluded; their
  groundedness flows through `refusal_correctness`.
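
The weaker `citation_coverage` rule described in the list above, including the guard against the vacuous truth of `Iterator::all`, might look like the sketch below. `Citation`, `is_fully_covered`, and `citation_coverage` are hypothetical names for illustration, and the input slice is assumed to hold only the answers that count toward the denominator.

```rust
/// Hypothetical minimal citation shape; only the path matters here.
pub struct Citation {
    pub path: String,
}

/// An answer is fully covered iff it has at least one citation and
/// every citation's path is non-empty.
pub fn is_fully_covered(citations: &[Citation]) -> bool {
    // `all` returns true for an empty iterator, so check emptiness first.
    !citations.is_empty() && citations.iter().all(|c| !c.path.is_empty())
}

/// Fraction of answers that are fully covered; `None` when there are no
/// answers (zero denominator, stored as `null`).
pub fn citation_coverage(answers: &[Vec<Citation>]) -> Option<f32> {
    if answers.is_empty() {
        return None;
    }
    let covered = answers
        .iter()
        .filter(|a| is_fully_covered(a.as_slice()))
        .count();
    Some(covered as f32 / answers.len() as f32)
}
```

Without the `is_empty` guard, an answer with no citations at all would count as fully covered, which is the case the deviation note calls out.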