diff --git a/tasks/p5/p5-1-golden-fixture-runner.md b/tasks/p5/p5-1-golden-fixture-runner.md new file mode 100644 index 0000000..ec44acb --- /dev/null +++ b/tasks/p5/p5-1-golden-fixture-runner.md @@ -0,0 +1,154 @@ +--- +phase: P5 +component: kb-eval (runner) +task_id: p5-1 +title: "Golden query fixture loader + per-query runner" +status: planned +depends_on: [p4-3] +unblocks: [p5-2] +contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md +contract_sections: [§5.7 eval_runs/eval_query_results, §6.3 runs_dir, phase epic tasks/phase-5-evaluation.md] +--- + +# p5-1 — Golden fixture runner + +## Goal + +Load `fixtures/golden_queries.yaml`, run each query through `kb-app` (lexical / vector / hybrid / rag), and persist results into `eval_query_results` + `runs_dir//per_query.jsonl`. + +## Why now / why this size + +The runner is the data collector; metrics computation is p5-2's job. Splitting them makes each piece simple and lets us re-compute metrics from stored runs without re-querying. + +## Allowed dependencies + +- `kb-core` +- `kb-config` +- `kb-app` (calls facade for search / ask) +- `kb-store-sqlite` (writes eval rows) +- `serde`, `serde_yaml`, `serde_json` +- `time` +- `tracing` +- `thiserror` + +## Forbidden dependencies + +- `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-store-vector`, `kb-embed*`, `kb-search`, `kb-llm*`, `kb-rag` (all reached via `kb-app` facade only), `kb-tui`, `kb-desktop` + +## Inputs + +| input | type | source | +|-------|------|--------| +| `fixtures/golden_queries.yaml` | YAML | repo-shipped | +| `EvalRunOpts` | suite, mode, with_rag, k, temperature, seed | CLI | +| `kb-app` facade | search/ask | runtime | + +## Outputs + +| output | type | downstream | +|--------|------|------------| +| `eval_runs` row | SQLite | p5-2, history | +| `eval_query_results` rows | SQLite | p5-2 | +| `runs_dir//per_query.jsonl` | filesystem | external tools, audits | +| `EvalRun` struct | `kb_eval::EvalRun` | caller | + +## Public surface (signatures only — no new types) + +```rust +pub struct GoldenQuery { + pub id: String, + pub query: String, + pub lang: kb_core::Lang, + pub expected_doc_ids: Vec, + pub expected_chunk_ids: Vec, + pub must_contain: Vec, + pub forbidden: Vec, + pub difficulty: Option, +} + +pub struct EvalRunOpts { + pub suite: String, // "golden" default + pub mode: kb_core::SearchMode, + pub with_rag: bool, + pub k: usize, + pub temperature: Option, + pub seed: Option, +} + +pub struct EvalRun { + pub run_id: String, + pub created_at: time::OffsetDateTime, + pub commit_hash: Option, + pub config_snapshot_json: serde_json::Value, + pub per_query: Vec, +} + +pub struct QueryResult { + pub query_id: String, + pub query: String, + pub mode: kb_core::SearchMode, + pub hits_top_k: Vec, + pub answer: Option, + pub elapsed_ms: u32, + pub error: Option, +} + +pub fn load_golden_set(path: &std::path::Path) -> anyhow::Result>; +pub fn run_eval(opts: &EvalRunOpts) -> anyhow::Result; +``` + +## Behavior contract + +- `load_golden_set`: + - Parses YAML; required fields: `id`, `query`. Optional: everything else (defaults to empty / `None`). + - Validates uniqueness of `id` and that `expected_doc_ids` / `expected_chunk_ids` exist in DB; missing → return error listing the offenders. +- `run_eval`: + - Loads `fixtures/golden_queries.yaml` (path overridable via env `KB_EVAL_GOLDEN`). + - Generates `run_id = "run_" + ulid_lower()`. + - Captures `config_snapshot_json`: serialized `kb_config::Config` plus `chunker_version`, `embedding_model+version+dims`, `llm.model_id`, `prompt_template_version`, `score_gate`, `rrf_k`, `index_version`. + - For each query: call `kb_app::search(SearchQuery { mode: opts.mode, k: opts.k, .. })`. If `opts.with_rag`, also call `kb_app::ask(query, AskOpts { mode: opts.mode, k: opts.k, explain: true, temperature: opts.temperature, seed: opts.seed, .. })`. + - Each `QueryResult` measured by elapsed wall-clock (ms). + - Errors are caught per-query (do not abort the run). Failed queries record `error: Some(msg)` and `hits_top_k = vec![]`. + - Determinism: with `temperature=0` and fixed `seed`, two consecutive runs produce byte-identical `per_query.jsonl` for non-RAG queries; RAG queries may differ in negligible token budget telemetry. + - Persists `eval_runs` row with `aggregate_json = {}` (filled by p5-2). Persists `eval_query_results` rows. Also writes `per_query.jsonl` to `runs_dir//`. +- `run_eval` does NOT compute hit@k or other metrics (that is p5-2). + +## Storage / wire effects + +- Writes: `eval_runs`, `eval_query_results`, `runs_dir//per_query.jsonl`. +- Reads: golden YAML, chunk/doc rows (via DB). + +## Test plan + +| kind | description | fixture / data | +|------|-------------|----------------| +| unit | YAML loader rejects duplicate IDs | inline YAML | +| unit | YAML loader rejects unknown `expected_chunk_id` | seeded DB | +| unit | runner records `elapsed_ms ≥ 0` for each query | tiny corpus + 3 queries | +| unit | runner captures config_snapshot with all expected version fields | inline | +| unit | failing query (forced via mock retriever) records `error: Some(_)` and continues | mock | +| determinism | re-running same suite + fixed seed → identical `per_query.jsonl` (lexical only) | tmp DB, fixed corpus | +| snapshot | `EvalRun` (with mock LM for `with_rag`) JSON stable | `fixtures/eval/run-1.json` | + +All tests under `cargo test -p kb-eval runner`. + +## Definition of Done + +- [ ] `cargo check -p kb-eval` passes +- [ ] `cargo test -p kb-eval runner` passes +- [ ] `fixtures/golden_queries.yaml` template shipped (≥ 5 example entries) +- [ ] No imports outside Allowed dependencies +- [ ] PR links design §5.7 + +## Out of scope + +- Metric computation (p5-2). +- LLM-as-judge. +- Compare report generation. +- HTTP/server integrations. + +## Risks / notes + +- Large RAG suites can be slow. Consider `--max-queries` for incremental runs (kept here as a flag spec; implementation is the responsibility of this task). +- `expected_chunk_id` references depend on `chunker_version`. If chunker bumps, golden set must be re-curated. Fail fast in the loader. +- Use `time::OffsetDateTime::now_utc()` for `created_at`; never local TZ. diff --git a/tasks/p5/p5-2-metrics-compare.md b/tasks/p5/p5-2-metrics-compare.md new file mode 100644 index 0000000..54c0eef --- /dev/null +++ b/tasks/p5/p5-2-metrics-compare.md @@ -0,0 +1,151 @@ +--- +phase: P5 +component: kb-eval (metrics + compare) +task_id: p5-2 +title: "Metrics computation + compare report" +status: planned +depends_on: [p5-1] +unblocks: [] +contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md +contract_sections: [§5.7 eval_runs.aggregate_json, phase epic tasks/phase-5-evaluation.md] +--- + +# p5-2 — Metrics + compare + +## Goal + +Compute hit@k, MRR, recall@k_doc, citation_coverage, groundedness, empty_result_rate, refusal_correctness from stored `eval_query_results`. Write `aggregate_json` back into `eval_runs`. Provide `kb eval compare a b` that diffs two runs. + +## Why now / why this size + +Metric formulas + comparison logic are pure computation. Splitting them from p5-1 keeps the runner simple and lets us re-compute metrics over historical runs as formulas evolve. + +## Allowed dependencies + +- `kb-core` +- `kb-config` +- `kb-store-sqlite` (read eval rows, write `aggregate_json`) +- `serde`, `serde_json` +- `tracing` +- `thiserror` + +## Forbidden dependencies + +- `kb-app`, `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-store-vector`, `kb-embed*`, `kb-search`, `kb-llm*`, `kb-rag`, `kb-tui`, `kb-desktop` + +## Inputs + +| input | type | source | +|-------|------|--------| +| `eval_query_results` rows | SQLite | from p5-1 | +| `eval_runs` row | SQLite | from p5-1 | +| `GoldenQuery[..]` | `Vec` | re-loaded for `expected_*` and `must_contain` | + +## Outputs + +| output | type | downstream | +|--------|------|------------| +| `eval_runs.aggregate_json` updated | SQLite | history, CI checks | +| `CompareReport` | `kb_eval::CompareReport` | `kb-cli` printer | +| optional `runs_dir//report.md` | filesystem | human-readable summary | + +## Public surface (signatures only — no new types) + +```rust +pub struct AggregateMetrics { + pub hit_at_k: std::collections::BTreeMap, // k → hit@k + pub mrr: f32, + pub recall_at_k_doc: std::collections::BTreeMap, + pub citation_coverage: f32, + pub groundedness: f32, + pub empty_result_rate: f32, + pub refusal_correctness: f32, + pub total_queries: u32, + pub failed_queries: u32, +} + +pub struct CompareReport { + pub run_a: String, + pub run_b: String, + pub aggregate_a: AggregateMetrics, + pub aggregate_b: AggregateMetrics, + pub deltas: serde_json::Value, // per-metric delta + pub per_query: Vec, +} + +pub struct QueryComparison { + pub query_id: String, + pub kind: ComparisonKind, // Win | Loss | Draw | Regression + pub a_hit_rank: Option, + pub b_hit_rank: Option, + pub note: Option, +} + +pub enum ComparisonKind { Win, Loss, Draw, Regression } + +pub fn compute_aggregate(run_id: &str) -> anyhow::Result; +pub fn store_aggregate(run_id: &str, agg: &AggregateMetrics) -> anyhow::Result<()>; +pub fn compare_runs(run_id_a: &str, run_id_b: &str) -> anyhow::Result; +pub fn render_report_md(report: &CompareReport) -> String; +``` + +## Behavior contract + +- `hit@k` for k ∈ {1, 3, 5, 10}: query is a hit if any of its `expected_chunk_ids` appears in the run's top-k for that query (chunk-level). Aggregate = mean across queries with non-empty `expected_chunk_ids`. +- `MRR`: 1 / rank-of-first-correct-chunk; 0 if not found in top-10. Aggregate = mean across applicable queries. +- `recall@k_doc` for k ∈ {1, 3, 5, 10}: fraction of `expected_doc_ids` covered by the top-k hits' `doc_id`s, averaged across applicable queries. +- `citation_coverage`: fraction of RAG answers where every `Answer.citations[*].citation` resolves to a real chunk in the DB. Denominator = grounded RAG answers; if zero → metric is `NaN` and reported as `null` in JSON. +- `groundedness`: fraction of RAG answers where ALL `must_contain` strings appear AND no `forbidden` string appears. Denominator = RAG answers (excluding errors). +- `empty_result_rate`: fraction of queries returning zero `hits_top_k`. +- `refusal_correctness`: fraction of queries with `expected_doc_ids = []` (i.e., should refuse) that the system actually refused (Answer.grounded == false). Denominator = queries marked as "should refuse"; if zero → null. +- All metrics rounded to 4 decimal places for storage. +- `compare_runs`: + - Per-metric delta (`b - a`). + - Per-query: `Win` if b found correct chunk, a did not. `Loss` opposite. `Draw` if both same rank. `Regression` if a hit but b miss for the same expected chunk. + - `note` may explain known causes (chunker version diff, embedding diff, prompt diff). +- `render_report_md` produces a single Markdown file summarizing aggregate deltas + a Wins/Losses/Regressions table; not a wire schema; for human consumption only. +- `store_aggregate` updates `eval_runs.aggregate_json` (`UPDATE eval_runs SET aggregate_json = :json WHERE run_id = :id`). + +## Storage / wire effects + +- Writes: `eval_runs.aggregate_json`, optional `runs_dir//report.md`. +- Reads: `eval_runs`, `eval_query_results`. + +## Test plan + +| kind | description | fixture / data | +|------|-------------|----------------| +| unit | hit@k computation on hand-rolled fixture | inline (3 queries, ranks {1, 4, miss}) | +| unit | MRR computation matches expected | inline | +| unit | recall@k_doc computation | inline | +| unit | citation_coverage with broken citation marks 0.0 | inline | +| unit | groundedness false when forbidden string appears | inline | +| unit | refusal_correctness 1.0 when all "should refuse" queries refused | inline | +| unit | NaN metrics (zero denominator) serialize as `null` in JSON | inline | +| unit | `compare_runs` per-query Win/Loss/Draw/Regression on synthetic ranks | inline | +| determinism | running `compute_aggregate` twice produces identical `AggregateMetrics` | inline | +| snapshot | `CompareReport` JSON for a fixed pair of runs stable | `fixtures/eval/compare-1.json` | + +All tests under `cargo test -p kb-eval metrics`. + +## Definition of Done + +- [ ] `cargo check -p kb-eval` passes +- [ ] `cargo test -p kb-eval metrics` passes +- [ ] No imports outside Allowed dependencies +- [ ] `eval_runs.aggregate_json` always populated after `store_aggregate` +- [ ] `kb eval compare` CLI surface integrated via `kb-app` (call `compare_runs` + `render_report_md`) +- [ ] PR links phase epic tasks/phase-5-evaluation.md + +## Out of scope + +- LLM-as-judge groundedness. +- Cross-corpus evaluation. +- HTTP server / dashboards. +- Metric weighting strategies (MRR weighting, etc.). + +## Risks / notes + +- Floating-point sums in MRR cause minor cross-platform drift; round to 4 decimals on storage to keep snapshots stable. +- "Should refuse" queries are encoded as `expected_doc_ids: []`. Document this convention in the golden YAML header comment. +- Chunker version drift across runs makes `expected_chunk_ids` invalid; `compare_runs` should refuse to compare runs with mismatched `chunker_version` and emit a clear error rather than silent miscompares.