Compute hit@k, MRR, recall@k_doc, citation_coverage, groundedness, empty_result_rate, refusal_correctness from stored eval_query_results. Write aggregate_json back into eval_runs. Provide kb eval compare a b that diffs two runs.
Why now / why this size
Metric formulas + comparison logic are pure computation. Splitting them from p5-1 keeps the runner simple and lets us re-compute metrics over historical runs as formulas evolve.
hit@k for k ∈ {1, 3, 5, 10}: query is a hit if any of its expected_chunk_ids appears in the run's top-k for that query (chunk-level). Aggregate = mean across queries with non-empty expected_chunk_ids.
MRR: 1 / rank-of-first-correct-chunk; 0 if not found in top-10. Aggregate = mean across applicable queries.
recall@k_doc for k ∈ {1, 3, 5, 10}: fraction of expected_doc_ids covered by the top-k hits' doc_ids, averaged across applicable queries.
citation_coverage: fraction of RAG answers where every Answer.citations[*].citation resolves to a real chunk in the DB. Denominator = grounded RAG answers; if zero → metric is NaN and reported as null in JSON.
groundedness: fraction of RAG answers where ALL must_contain strings appear AND no forbidden string appears. Denominator = RAG answers (excluding errors).
empty_result_rate: fraction of queries returning zero hits_top_k.
refusal_correctness: fraction of queries with expected_doc_ids = [] (i.e., should refuse) that the system actually refused (Answer.grounded == false). Denominator = queries marked as "should refuse"; if zero → null.
All metrics rounded to 4 decimal places for storage.
compare_runs:
Per-metric delta (b - a).
Per-query: Win if b found correct chunk, a did not. Loss opposite. Draw if both same rank. Regression if a hit but b miss for the same expected chunk.
note may explain known causes (chunker version diff, embedding diff, prompt diff).
Cross-version chunk_id matching is graceful, not a refusal. When chunker_version_a != chunker_version_b the chunk-level criterion would be unstable (chunk_ids are part of the key), so per-query matching falls back to doc_id + span overlap: a hit counts if the run's top-k contains any chunk whose doc_id matches an expected doc_id AND whose source_spans overlap by at least 50% with one of the expected chunks' spans. The CompareReport.deltas JSON includes a top-level "chunker_version_match": "exact" | "fallback_doc_span" so consumers see which mode was used. Set --strict-chunker-version to revert to the old behavior (refuse). Default is graceful so chunker iteration is the natural workflow it should be.
render_report_md produces a single Markdown file summarizing aggregate deltas + a Wins/Losses/Regressions table; not a wire schema; for human consumption only.
store_aggregate updates eval_runs.aggregate_json (UPDATE eval_runs SET aggregate_json = :json WHERE run_id = :id).
Floating-point sums in MRR cause minor cross-platform drift; round to 4 decimals on storage to keep snapshots stable.
"Should refuse" queries are encoded as expected_doc_ids: []. Document this convention in the golden YAML header comment.
Chunker version drift across runs is the COMMON case, not the error case (you almost always re-chunk before evaluating a chunker change). Default behavior is graceful fallback (doc + span overlap); only --strict-chunker-version refuses. The chunker_version_match field in CompareReport.deltas makes the mode auditable, so silent miscompares are still impossible.