kebab/tasks/p5/p5-2-metrics-compare.md
altair823 f9714aa5cb docs(rename): kb → kebab — README, tasks/, docs/, design doc, report
Final commit. Bulk-updates every occurrence of the word `kb` across all .md files.

- 19 crate names (`kb-core`, `kb-app`, …) → `kebab-*` (including the
  Rust module path form `kb_*` → `kebab_*`).
- Future components (`kb-tui`, `kb-desktop`, `kb-asr-whisper`, `kb-ocr`,
  `kb-mcp`, `kb-vlm`, `kb-rerank`, `kb-vision-ocr`, `kb-index`,
  `kb-smoke`, `kb-architecture`) → `kebab-*` (the same prefix will be
  used when P6+ starts).
- CLI command examples: `kb ingest` / `kb search` / `kb ask` / `kb init` /
  `kb doctor` / `kb inspect` / `kb list` / `kb eval` →
  `kebab <verb>`, in both fenced code blocks and inline backticks.
- XDG paths + env vars + binary path (`target/release/kb` →
  `target/release/kebab`) synced.
- All references unified across the design doc / initial report /
  SMOKE / HOTFIXES / phase epics / task specs.
- The `git -c user.name=kb` in task-decomposition.md is kept as-is
  because it records author information from past git history (the
  author in the actual git history cannot be changed).
- `status: planned` → `completed` in `tasks/phase-5-evaluation.md` as
  well (not yet reflected after the P5-1 + P5-2 PRs were merged).

## Verification

- `grep -rEn "\bkb-[a-z]|\bkb_[a-z]|\.config/kb\b|kb\.sqlite|\bKB_[A-Z]"
   --include="*.md"` 0 hits (excluding the git author in
  task-decomposition.md).
- All file path references are live (every renamed file is updated to
  its new path).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 04:01:55 +00:00


| field | value |
|---|---|
| phase | P5 |
| component | kebab-eval (metrics + compare) |
| task_id | p5-2 |
| title | Metrics computation + compare report |
| status | completed |
| depends_on | p5-1 |
| unblocks | |
| contract_source | ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md |
| contract_sections | §5.7 eval_runs.aggregate_json |

Phase epic: tasks/phase-5-evaluation.md

p5-2 — Metrics + compare

Goal

Compute hit@k, MRR, recall@k_doc, citation_coverage, groundedness, empty_result_rate, refusal_correctness from stored eval_query_results. Write aggregate_json back into eval_runs. Provide kebab eval compare a b that diffs two runs.

Why now / why this size

Metric formulas + comparison logic are pure computation. Splitting them from p5-1 keeps the runner simple and lets us re-compute metrics over historical runs as formulas evolve.

Allowed dependencies

  • kebab-core
  • kebab-config
  • kebab-store-sqlite (read eval rows, write aggregate_json)
  • serde, serde_json
  • tracing
  • thiserror

Forbidden dependencies

  • kebab-app, kebab-source-fs, kebab-parse-md, kebab-normalize, kebab-chunk, kebab-store-vector, kebab-embed*, kebab-search, kebab-llm*, kebab-rag, kebab-tui, kebab-desktop

Inputs

| input | type | source |
|---|---|---|
| eval_query_results rows | SQLite | from p5-1 |
| eval_runs row | SQLite | from p5-1 |
| GoldenQuery[..] | Vec<GoldenQuery> | re-loaded for expected_* and must_contain |

Outputs

| output | type | downstream |
|---|---|---|
| eval_runs.aggregate_json | updated SQLite | history, CI checks |
| CompareReport | kebab_eval::CompareReport | kebab-cli printer |
| optional runs_dir/<run_id>/report.md | filesystem | human-readable summary |

Public surface (signatures only — no new types)

pub struct AggregateMetrics {
    pub hit_at_k:           std::collections::BTreeMap<u32, f32>,   // k → hit@k
    pub mrr:                f32,
    pub recall_at_k_doc:    std::collections::BTreeMap<u32, f32>,
    pub citation_coverage:  f32,
    pub groundedness:       f32,
    pub empty_result_rate:  f32,
    pub refusal_correctness: f32,
    pub total_queries:      u32,
    pub failed_queries:     u32,
}

pub struct CompareReport {
    pub run_a: String,
    pub run_b: String,
    pub aggregate_a: AggregateMetrics,
    pub aggregate_b: AggregateMetrics,
    pub deltas: serde_json::Value,             // per-metric delta
    pub per_query: Vec<QueryComparison>,
}

pub struct QueryComparison {
    pub query_id: String,
    pub kind: ComparisonKind,                  // Win | Loss | Draw | Regression
    pub a_hit_rank: Option<u32>,
    pub b_hit_rank: Option<u32>,
    pub note: Option<String>,
}

pub enum ComparisonKind { Win, Loss, Draw, Regression }

pub fn compute_aggregate(run_id: &str) -> anyhow::Result<AggregateMetrics>;
pub fn store_aggregate(run_id: &str, agg: &AggregateMetrics) -> anyhow::Result<()>;
pub fn compare_runs(run_id_a: &str, run_id_b: &str) -> anyhow::Result<CompareReport>;
pub fn render_report_md(report: &CompareReport) -> String;

Behavior contract

  • hit@k for k ∈ {1, 3, 5, 10}: query is a hit if any of its expected_chunk_ids appears in the run's top-k for that query (chunk-level). Aggregate = mean across queries with non-empty expected_chunk_ids.
  • MRR: 1 / rank-of-first-correct-chunk; 0 if not found in top-10. Aggregate = mean across applicable queries.
  • recall@k_doc for k ∈ {1, 3, 5, 10}: fraction of expected_doc_ids covered by the top-k hits' doc_ids, averaged across applicable queries.
  • citation_coverage: fraction of RAG answers where every Answer.citations[*].citation resolves to a real chunk in the DB. Denominator = grounded RAG answers; if zero → metric is NaN and reported as null in JSON.
  • groundedness: fraction of RAG answers where ALL must_contain strings appear AND no forbidden string appears. Denominator = RAG answers (excluding errors).
  • empty_result_rate: fraction of queries returning zero hits_top_k.
  • refusal_correctness: fraction of queries with expected_doc_ids = [] (i.e., should refuse) that the system actually refused (Answer.grounded == false). Denominator = queries marked as "should refuse"; if zero → null.
  • All metrics rounded to 4 decimal places for storage.
  • compare_runs:
    • Per-metric delta (b - a).
    • Per-query: Win if b found correct chunk, a did not. Loss opposite. Draw if both same rank. Regression if a hit but b miss for the same expected chunk.
    • note may explain known causes (chunker version diff, embedding diff, prompt diff).
    • Cross-version chunk_id matching is graceful, not a refusal. When chunker_version_a != chunker_version_b the chunk-level criterion would be unstable (chunk_ids are part of the key), so per-query matching falls back to doc_id + span overlap: a hit counts if the run's top-k contains any chunk whose doc_id matches an expected doc_id AND whose source_spans overlap by at least 50% with one of the expected chunks' spans. The CompareReport.deltas JSON includes a top-level "chunker_version_match": "exact" | "fallback_doc_span" so consumers see which mode was used. Set --strict-chunker-version to revert to the old behavior (refuse). Default is graceful so chunker iteration is the natural workflow it should be.
  • render_report_md produces a single Markdown file summarizing aggregate deltas + a Wins/Losses/Regressions table; not a wire schema; for human consumption only.
  • store_aggregate updates eval_runs.aggregate_json (UPDATE eval_runs SET aggregate_json = :json WHERE run_id = :id).
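
The recall@k_doc rule above can be sketched as a pure function over doc ids. This is a minimal illustration, not the crate's code: the function names and the tuple layout `(expected_doc_ids, hit_doc_ids)` are assumptions, and "applicable" is read here as "has at least one expected doc id".

```rust
use std::collections::BTreeSet;

/// Document-level recall@k for one query: the fraction of expected doc ids
/// that appear among the doc ids of the top-k hits.
/// Returns None when the query has no expected docs (assumed "not applicable").
fn recall_at_k_doc(expected_doc_ids: &[&str], hit_doc_ids: &[&str], k: usize) -> Option<f32> {
    if expected_doc_ids.is_empty() {
        return None;
    }
    let expected: BTreeSet<&str> = expected_doc_ids.iter().copied().collect();
    // Several top-k chunks may share a doc id, so deduplicate before intersecting.
    let top_k: BTreeSet<&str> = hit_doc_ids.iter().take(k).copied().collect();
    let covered = expected.intersection(&top_k).count();
    Some(covered as f32 / expected.len() as f32)
}

/// Mean over applicable queries; None when no query is applicable
/// (the zero-denominator case that the spec reports as null).
fn mean_recall_at_k_doc(queries: &[(Vec<&str>, Vec<&str>)], k: usize) -> Option<f32> {
    let scores: Vec<f32> = queries
        .iter()
        .filter_map(|(expected, hits)| recall_at_k_doc(expected, hits, k))
        .collect();
    if scores.is_empty() {
        None
    } else {
        Some(scores.iter().sum::<f32>() / scores.len() as f32)
    }
}
```

With expected docs `["a", "b"]` and hit docs `["a", "x", "b"]`, this gives 0.5 at k = 1 and 1.0 at k = 3. A "should refuse" query (empty expected list) drops out of the mean rather than counting as 0.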

Storage / wire effects

  • Writes: eval_runs.aggregate_json, optional runs_dir/<run_id>/report.md.
  • Reads: eval_runs, eval_query_results.

Test plan

| kind | description | fixture / data |
|---|---|---|
| unit | hit@k computation on hand-rolled fixture | inline (3 queries, ranks {1, 4, miss}) |
| unit | MRR computation matches expected | inline |
| unit | recall@k_doc computation | inline |
| unit | citation_coverage with broken citation marks 0.0 | inline |
| unit | groundedness false when forbidden string appears | inline |
| unit | refusal_correctness 1.0 when all "should refuse" queries refused | inline |
| unit | NaN metrics (zero denominator) serialize as null in JSON | inline |
| unit | compare_runs per-query Win/Loss/Draw/Regression on synthetic ranks | inline |
| determinism | running compute_aggregate twice produces identical AggregateMetrics | inline |
| snapshot | CompareReport JSON for a fixed pair of runs stable | fixtures/eval/compare-1.json |

All tests under cargo test -p kebab-eval metrics.
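
The first two unit rows in the test plan can be worked through on the ranks {1, 4, miss} fixture. The sketch below assumes each query has already been reduced to the 1-based rank of its first correct chunk (`None` for a miss). The names `hit_at_k` and `mrr` are illustrative, not the crate's actual helpers.

```rust
/// 1-based rank of the first expected chunk in a query's result list,
/// or None if no expected chunk was retrieved. (Assumed representation.)
type FirstCorrectRank = Option<u32>;

/// hit@k: share of queries whose first correct chunk sits at rank <= k.
/// An empty slice yields NaN (0 / 0), the zero-denominator case.
fn hit_at_k(ranks: &[FirstCorrectRank], k: u32) -> f32 {
    let hits = ranks.iter().flatten().filter(|&&rank| rank <= k).count();
    hits as f32 / ranks.len() as f32
}

/// MRR: mean of 1 / rank, scoring 0 for a miss or a rank beyond `cutoff`
/// (the spec uses a cutoff of 10). Ranks are assumed to start at 1.
fn mrr(ranks: &[FirstCorrectRank], cutoff: u32) -> f32 {
    let sum: f32 = ranks
        .iter()
        .map(|r| match r {
            Some(rank) if *rank <= cutoff => 1.0 / *rank as f32,
            _ => 0.0,
        })
        .sum();
    sum / ranks.len() as f32
}
```

For the fixture `[Some(1), Some(4), None]`, one of three queries hits within k = 1 and k = 3, and two of three hit within k = 5 and k = 10, so hit@1 = hit@3 = 1/3 and hit@5 = hit@10 = 2/3. MRR is (1 + 1/4 + 0) / 3 = 1.25 / 3, roughly 0.4167 after rounding to 4 decimals.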

Definition of Done

  • cargo check -p kebab-eval passes
  • cargo test -p kebab-eval metrics passes
  • No imports outside Allowed dependencies
  • eval_runs.aggregate_json always populated after store_aggregate
  • kebab eval compare CLI surface integrated via kebab-app (call compare_runs + render_report_md)
  • PR links phase epic tasks/phase-5-evaluation.md

Out of scope

  • LLM-as-judge groundedness.
  • Cross-corpus evaluation.
  • HTTP server / dashboards.
  • Metric weighting strategies (MRR weighting, etc.).

Risks / notes

  • Floating-point sums in MRR cause minor cross-platform drift; round to 4 decimals on storage to keep snapshots stable.
  • "Should refuse" queries are encoded as expected_doc_ids: []. Document this convention in the golden YAML header comment.
  • Chunker version drift across runs is the COMMON case, not the error case (you almost always re-chunk before evaluating a chunker change). Default behavior is graceful fallback (doc + span overlap); only --strict-chunker-version refuses. The chunker_version_match field in CompareReport.deltas makes the mode auditable, so silent miscompares are still impossible.
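
The rounding note above, together with the "zero denominator → null" rule from the behavior contract, can be sketched with std only. This is an assumption-laden illustration: `round4`, `ratio`, and `to_json` are hypothetical helpers, and using `Option<f32>` in place of NaN is one possible encoding, not necessarily what the crate does.

```rust
/// Round to 4 decimal places, as the spec requires for stored metrics.
fn round4(x: f32) -> f32 {
    (x * 10_000.0).round() / 10_000.0
}

/// Ratio with an explicit "undefined" state: None stands in for the NaN
/// that the spec reports as JSON null when the denominator is zero.
fn ratio(numerator: u32, denominator: u32) -> Option<f32> {
    if denominator == 0 {
        None
    } else {
        Some(round4(numerator as f32 / denominator as f32))
    }
}

/// Hand-rolled JSON rendering of a single metric value. A real
/// implementation would more likely go through serde_json; this keeps
/// the sketch dependency-free.
fn to_json(value: Option<f32>) -> String {
    match value {
        Some(v) => v.to_string(),
        None => "null".to_string(),
    }
}
```

Rounding before storage means two platforms that disagree in the last bits of a float sum still write the same 4-decimal value in almost all cases, which is what keeps the snapshot test stable.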

Implementation deviations (intentional)

Recorded so reviewers don't trip on them. Except where an item below says otherwise, the runtime behavior is the one this spec defines; only the names / wiring differ.

  • Graceful fallback is doc-id-only, not doc + 50% span overlap. The chunker_version_match audit field is "fallback_doc" (not "fallback_doc_span"). Span-overlap requires reading both runs' chunks.source_spans simultaneously — but a chunker-version change in practice re-indexes (overwrites) the chunks table, so by the time you compute run B the run A chunk rows are already gone. Doc-id matching is the strongest stable criterion under that workflow. Span-overlap moves to a future phase that owns chunker-version archival.
  • Helper signatures. compute_aggregate_with_config(cfg, run_id) / store_aggregate_with_config(cfg, run_id, agg) / compare_runs_with_config(cfg, a, b, opts) exist alongside the spec-pinned compute_aggregate(run_id) / store_aggregate(run_id, agg) / compare_runs(a, b) so integration tests can drive the pipeline against a TempDir-backed Config. The spec-pinned forms, which take no Config, wrap them with Config::load(None).
  • CLI surface lives on kebab-cli directly, not via kebab-app. DoD asks for kebab eval compare to be reached "via kebab-app", but kebab-eval already depends on kebab-app (the P5-1 runner uses the App facade), so routing the CLI through kebab-app would form a cycle (kebab-app → kebab-eval → kebab-app). kebab-cli → kebab-eval is wired directly; kebab-app is unchanged.
  • AggregateMetrics is Serialize + Deserialize. The spec defines only the field shape; we add Deserialize so the stored aggregate_json can round-trip back into the type for follow-up computations.
  • anyhow is used in Result returns since the rest of the workspace already speaks anyhow; not in the spec's Allowed list but matches every other crate.
  • kebab-eval crate-level kebab-app dep stays. The crate already depends on kebab-app from P5-1 (the runner uses the App facade), so the Cargo.toml entry remains. The new modules (metrics.rs, compare.rs) do not import kebab-app themselves — they're behind the same crate boundary as the runner, but the metric/compare surface is kebab-app-clean. Splitting the crate to avoid a transitive Cargo edge would be churn for no behavior gain.
  • citation_coverage is intentionally weaker than the spec literal. Spec calls for "every citation resolves to a real chunk in the DB". Current implementation: an Answer counts as fully covered iff it has ≥1 citation AND every citation's path is non-empty. Tightening to a per-citation document_exists_by_path SqliteStore probe is the next step once that helper lands. Empty-citations no longer pass through Iterator::all's vacuous-true.
  • refusal_correctness is undefined for non-RAG runs. The metric judges whether the system refused; without an Answer (lexical-only or vector-only run), there's nothing to judge. We exclude such queries from the denominator rather than auto-failing them, so a search-only run reports refusal_correctness as null instead of a misleading 0.0.
  • groundedness skips queries with no must_contain/forbidden. An unconfigured golden entry would otherwise score a free 1.0 (or 0.0 if the answer happens to contain a forbidden string from a later spec change). Refusal-class queries are also excluded — their groundedness flows through refusal_correctness.
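
The citation_coverage deviation above hinges on `Iterator::all` returning true for an empty iterator. A minimal sketch of the "≥1 citation AND every path non-empty" predicate follows. The `Citation` struct here is a hypothetical stand-in with only a `path` field, and the function names are illustrative.

```rust
/// Hypothetical minimal stand-in for a citation; only `path` matters here.
struct Citation {
    path: String,
}

/// An answer counts as fully covered iff it has at least one citation and
/// every citation carries a non-empty path. Without the `is_empty` guard,
/// `all` on an empty slice would be vacuously true and an answer with no
/// citations would pass.
fn is_fully_covered(citations: &[Citation]) -> bool {
    !citations.is_empty() && citations.iter().all(|c| !c.path.is_empty())
}

/// Fraction of answers that are fully covered; None when there are no
/// answers to judge (the case the spec reports as null).
fn citation_coverage(answers: &[Vec<Citation>]) -> Option<f32> {
    if answers.is_empty() {
        return None;
    }
    let covered = answers.iter().filter(|cs| is_fully_covered(cs)).count();
    Some(covered as f32 / answers.len() as f32)
}
```

Given four answers with citation lists `[a.md]`, `[]`, `[""]`, and `[b.md, c.md]`, only the first and last are covered, so the sketch yields 0.5. Tightening the path check to a database lookup would only change the closure inside `all`.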