Files
kebab/tasks/p5/p5-2-metrics-compare.md
kb b999a12ab5 tasks: address PR #1 review
- p3-3: SQLite-first/Lance-second + status marker (V003__embedding_status); drop "best-effort 2PC" misnomer
- p4-3: replace print_stream FnMut closure with mpsc::Sender<String> (RagPipeline stays Send+Sync)
- p4-3: tighten citation regex to strict [#<n>] only — reject [n]/prose/code-block false positives
- p5-2: compare_runs across chunker_version is graceful (doc + span overlap fallback) with chunker_version_match audit field; --strict-chunker-version restores refusal
- p7-1: per-page text via lopdf (pdf-extract has no per-page Rust API); use char count for spans
- p8-1: explicit rubato (FftFixedIn) for 16 kHz mono resample; symphonia decode only
- p9-5: drop cmd_read_pdf_page + pdfium native dep; cmd_read_file_bytes + frontend pdfjs; add traversal tests
2026-04-27 13:10:31 +00:00

7.7 KiB

phase, component, task_id, title, status, depends_on, unblocks, contract_source, contract_sections
phase component task_id title status depends_on unblocks contract_source contract_sections
P5 kb-eval (metrics + compare) p5-2 Metrics computation + compare report planned
p5-1
../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
§5.7 eval_runs.aggregate_json
phase epic tasks/phase-5-evaluation.md

p5-2 — Metrics + compare

Goal

Compute hit@k, MRR, recall@k_doc, citation_coverage, groundedness, empty_result_rate, refusal_correctness from stored eval_query_results. Write aggregate_json back into eval_runs. Provide kb eval compare a b that diffs two runs.

Why now / why this size

Metric formulas + comparison logic are pure computation. Splitting them from p5-1 keeps the runner simple and lets us re-compute metrics over historical runs as formulas evolve.

Allowed dependencies

  • kb-core
  • kb-config
  • kb-store-sqlite (read eval rows, write aggregate_json)
  • serde, serde_json
  • tracing
  • thiserror

Forbidden dependencies

  • kb-app, kb-source-fs, kb-parse-md, kb-normalize, kb-chunk, kb-store-vector, kb-embed*, kb-search, kb-llm*, kb-rag, kb-tui, kb-desktop

Inputs

input type source
eval_query_results rows SQLite from p5-1
eval_runs row SQLite from p5-1
GoldenQuery[..] Vec<GoldenQuery> re-loaded for expected_* and must_contain

Outputs

output type downstream
eval_runs.aggregate_json updated SQLite history, CI checks
CompareReport kb_eval::CompareReport kb-cli printer
optional runs_dir/<run_id>/report.md filesystem human-readable summary

Public surface (signatures only — no new types)

pub struct AggregateMetrics {
    pub hit_at_k:           std::collections::BTreeMap<u32, f32>,   // k → hit@k
    pub mrr:                f32,
    pub recall_at_k_doc:    std::collections::BTreeMap<u32, f32>,
    pub citation_coverage:  f32,
    pub groundedness:       f32,
    pub empty_result_rate:  f32,
    pub refusal_correctness: f32,
    pub total_queries:      u32,
    pub failed_queries:     u32,
}

pub struct CompareReport {
    pub run_a: String,
    pub run_b: String,
    pub aggregate_a: AggregateMetrics,
    pub aggregate_b: AggregateMetrics,
    pub deltas: serde_json::Value,             // per-metric delta
    pub per_query: Vec<QueryComparison>,
}

pub struct QueryComparison {
    pub query_id: String,
    pub kind: ComparisonKind,                  // Win | Loss | Draw | Regression
    pub a_hit_rank: Option<u32>,
    pub b_hit_rank: Option<u32>,
    pub note: Option<String>,
}

pub enum ComparisonKind { Win, Loss, Draw, Regression }

pub fn compute_aggregate(run_id: &str) -> anyhow::Result<AggregateMetrics>;
pub fn store_aggregate(run_id: &str, agg: &AggregateMetrics) -> anyhow::Result<()>;
pub fn compare_runs(run_id_a: &str, run_id_b: &str) -> anyhow::Result<CompareReport>;
pub fn render_report_md(report: &CompareReport) -> String;

Behavior contract

  • hit@k for k ∈ {1, 3, 5, 10}: query is a hit if any of its expected_chunk_ids appears in the run's top-k for that query (chunk-level). Aggregate = mean across queries with non-empty expected_chunk_ids.
  • MRR: 1 / rank-of-first-correct-chunk; 0 if not found in top-10. Aggregate = mean across applicable queries.
  • recall@k_doc for k ∈ {1, 3, 5, 10}: fraction of expected_doc_ids covered by the top-k hits' doc_ids, averaged across applicable queries.
  • citation_coverage: fraction of RAG answers where every Answer.citations[*].citation resolves to a real chunk in the DB. Denominator = grounded RAG answers; if zero → metric is NaN and reported as null in JSON.
  • groundedness: fraction of RAG answers where ALL must_contain strings appear AND no forbidden string appears. Denominator = RAG answers (excluding errors).
  • empty_result_rate: fraction of queries returning zero hits_top_k.
  • refusal_correctness: fraction of queries with expected_doc_ids = [] (i.e., should refuse) that the system actually refused (Answer.grounded == false). Denominator = queries marked as "should refuse"; if zero → null.
  • All metrics rounded to 4 decimal places for storage.
  • compare_runs:
    • Per-metric delta (b - a).
    • Per-query: Win if b found correct chunk, a did not. Loss opposite. Draw if both same rank. Regression if a hit but b miss for the same expected chunk.
    • note may explain known causes (chunker version diff, embedding diff, prompt diff).
    • Cross-version chunk_id matching is graceful, not a refusal. When chunker_version_a != chunker_version_b the chunk-level criterion would be unstable (chunk_ids are part of the key), so per-query matching falls back to doc_id + span overlap: a hit counts if the run's top-k contains any chunk whose doc_id matches an expected doc_id AND whose source_spans overlap by at least 50% with one of the expected chunks' spans. The CompareReport.deltas JSON includes a top-level "chunker_version_match": "exact" | "fallback_doc_span" so consumers see which mode was used. Set --strict-chunker-version to revert to the old behavior (refuse). Default is graceful so chunker iteration is the natural workflow it should be.
  • render_report_md produces a single Markdown file summarizing aggregate deltas + a Wins/Losses/Regressions table; not a wire schema; for human consumption only.
  • store_aggregate updates eval_runs.aggregate_json (UPDATE eval_runs SET aggregate_json = :json WHERE run_id = :id).

Storage / wire effects

  • Writes: eval_runs.aggregate_json, optional runs_dir/<run_id>/report.md.
  • Reads: eval_runs, eval_query_results.

Test plan

kind description fixture / data
unit hit@k computation on hand-rolled fixture inline (3 queries, ranks {1, 4, miss})
unit MRR computation matches expected inline
unit recall@k_doc computation inline
unit citation_coverage with broken citation marks 0.0 inline
unit groundedness false when forbidden string appears inline
unit refusal_correctness 1.0 when all "should refuse" queries refused inline
unit NaN metrics (zero denominator) serialize as null in JSON inline
unit compare_runs per-query Win/Loss/Draw/Regression on synthetic ranks inline
determinism running compute_aggregate twice produces identical AggregateMetrics inline
snapshot CompareReport JSON for a fixed pair of runs stable fixtures/eval/compare-1.json

All tests under cargo test -p kb-eval metrics.

Definition of Done

  • cargo check -p kb-eval passes
  • cargo test -p kb-eval metrics passes
  • No imports outside Allowed dependencies
  • eval_runs.aggregate_json always populated after store_aggregate
  • kb eval compare CLI surface integrated via kb-app (call compare_runs + render_report_md)
  • PR links phase epic tasks/phase-5-evaluation.md

Out of scope

  • LLM-as-judge groundedness.
  • Cross-corpus evaluation.
  • HTTP server / dashboards.
  • Metric weighting strategies (MRR weighting, etc.).

Risks / notes

  • Floating-point sums in MRR cause minor cross-platform drift; round to 4 decimals on storage to keep snapshots stable.
  • "Should refuse" queries are encoded as expected_doc_ids: []. Document this convention in the golden YAML header comment.
  • Chunker version drift across runs is the COMMON case, not the error case (you almost always re-chunk before evaluating a chunker change). Default behavior is graceful fallback (doc + span overlap); only --strict-chunker-version refuses. The chunker_version_match field in CompareReport.deltas makes the mode auditable, so silent miscompares are still impossible.