Load fixtures/golden_queries.yaml, run each query through kb-app (lexical / vector / hybrid / rag), and persist results into eval_query_results + runs_dir/<run_id>/per_query.jsonl.
Why now / why this size
The runner is the data collector; metrics computation is p5-2's job. Splitting them makes each piece simple and lets us re-compute metrics from stored runs without re-querying.
For each query: call kb_app::search(SearchQuery { mode: opts.mode, k: opts.k, .. }). If opts.with_rag, also call kb_app::ask(query, AskOpts { mode: opts.mode, k: opts.k, explain: true, temperature: opts.temperature, seed: opts.seed, .. }).
Each QueryResult records its elapsed wall-clock time in milliseconds.
Errors are caught per-query (do not abort the run). Failed queries record error: Some(msg) and hits_top_k = vec![].
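The per-query loop described above (time each call, record failures, keep going) could look roughly like the sketch below. It is std-only and makes several assumptions: the `QueryResult` fields other than `error` and `hits_top_k` are hypothetical, hits are reduced to plain strings, and a closure stands in for `kb_app::search` so the behavior can be exercised with a mock retriever.

```rust
use std::time::Instant;

/// Hypothetical per-query record. Only `error` and `hits_top_k` are named in
/// the spec; `query_id` and `elapsed_ms` are illustrative field names.
#[derive(Debug, Clone, PartialEq)]
pub struct QueryResult {
    pub query_id: String,
    pub hits_top_k: Vec<String>,
    pub elapsed_ms: u128,
    pub error: Option<String>,
}

/// Runs every `(id, text)` query through `search`, timing each call and
/// recording a failure on the result instead of aborting the whole run.
/// `search` is a stand-in for the real retriever call.
pub fn run_queries<F>(queries: &[(String, String)], mut search: F) -> Vec<QueryResult>
where
    F: FnMut(&str) -> Result<Vec<String>, String>,
{
    queries
        .iter()
        .map(|(id, text)| {
            let started = Instant::now();
            let outcome = search(text.as_str());
            let elapsed_ms = started.elapsed().as_millis();
            match outcome {
                Ok(hits) => QueryResult {
                    query_id: id.clone(),
                    hits_top_k: hits,
                    elapsed_ms,
                    error: None,
                },
                // A failed query keeps its slot in the output with no hits.
                Err(msg) => QueryResult {
                    query_id: id.clone(),
                    hits_top_k: Vec::new(),
                    elapsed_ms,
                    error: Some(msg),
                },
            }
        })
        .collect()
}
```

Passing the retriever as a closure keeps the loop testable without a database, which matches the mock-retriever test listed for this task.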
Determinism: with temperature=0 and a fixed seed, two consecutive runs produce byte-identical per_query.jsonl for non-RAG queries; RAG queries may differ only in token-budget telemetry, which is treated as negligible.
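When a determinism check fails, it helps to know where two runs diverge. The helper below is a small illustrative sketch (the function name is hypothetical, not part of kb-eval) that compares two per_query.jsonl payloads byte for byte and reports the 1-based line where they first differ.

```rust
/// Returns `None` when both payloads are byte-identical, otherwise the
/// 1-based number of the first line (split on `\n`) where they differ.
/// A difference in trailing newlines also counts as a difference.
pub fn first_differing_line(run_a: &[u8], run_b: &[u8]) -> Option<usize> {
    if run_a == run_b {
        return None;
    }
    let mut a = run_a.split(|&byte| byte == b'\n');
    let mut b = run_b.split(|&byte| byte == b'\n');
    let mut line_no = 1;
    loop {
        match (a.next(), b.next()) {
            // Same line on both sides: keep scanning.
            (Some(x), Some(y)) if x == y => line_no += 1,
            // Content mismatch, or one side ran out of lines first.
            _ => return Some(line_no),
        }
    }
}
```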
Persists eval_runs row with aggregate_json = {} (filled by p5-2). Persists eval_query_results rows. Also writes per_query.jsonl to runs_dir/<run_id>/.
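The file half of this step could be sketched as follows. This is a std-only illustration under stated assumptions: the function name is hypothetical, each record is assumed to be already serialized to a single-line JSON string (for example by serde_json elsewhere), and the database writes are left out entirely.

```rust
use std::fs::{self, File};
use std::io::{self, BufWriter, Write};
use std::path::{Path, PathBuf};

/// Writes one pre-serialized JSON record per line to
/// `<runs_dir>/<run_id>/per_query.jsonl`, creating the run directory
/// if needed, and returns the path of the file that was written.
pub fn write_per_query_jsonl(
    runs_dir: &Path,
    run_id: &str,
    json_lines: &[String],
) -> io::Result<PathBuf> {
    let run_dir = runs_dir.join(run_id);
    fs::create_dir_all(&run_dir)?;

    let path = run_dir.join("per_query.jsonl");
    let mut out = BufWriter::new(File::create(&path)?);
    for line in json_lines {
        // Records are written in input order, each terminated by '\n'.
        writeln!(out, "{}", line)?;
    }
    out.flush()?;
    Ok(path)
}
```

Writing records in input order with a fixed line terminator is one of the preconditions for the byte-identical output the determinism requirement asks for.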
run_eval does NOT compute hit@k or other metrics (that is p5-2).
runner captures config_snapshot with all expected version fields (fixture: inline)
unit: failing query (forced via mock retriever) records error: Some(_) and continues (fixture: mock)
determinism: re-running the same suite with a fixed seed → identical per_query.jsonl, lexical only (fixture: tmp DB, fixed corpus)
snapshot: EvalRun (with mock LM for with_rag) JSON stable (fixture: fixtures/eval/run-1.json)
All tests run under cargo test -p kb-eval runner.
Definition of Done
cargo check -p kb-eval passes
cargo test -p kb-eval runner passes
fixtures/golden_queries.yaml template shipped (≥ 5 example entries)
No imports outside Allowed dependencies
PR links design §5.7
Out of scope
Metric computation (p5-2).
LLM-as-judge.
Compare report generation.
HTTP/server integrations.
Risks / notes
Large RAG suites can be slow. Consider a --max-queries flag for incremental runs; the flag is specified here, and implementing it is part of this task.
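A cap like --max-queries could be applied right after the suite is loaded. The sketch below assumes the flag is parsed into an `Option<usize>`; the function name is hypothetical.

```rust
/// Keeps at most `max_queries` entries from the front of the suite.
/// `None` means the flag was not given and the full suite is run.
pub fn apply_max_queries<T>(queries: Vec<T>, max_queries: Option<usize>) -> Vec<T> {
    match max_queries {
        Some(limit) => queries.into_iter().take(limit).collect(),
        None => queries,
    }
}
```

Taking a prefix rather than a random sample keeps truncated runs reproducible.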
expected_chunk_id references depend on chunker_version. If the chunker version is bumped, the golden set must be re-curated. The loader should fail fast on a version mismatch.
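A fail-fast check in the loader might look like the sketch below. It assumes the golden file records the chunker_version it was curated against and that the current version is available as a string; the function name and the error wording are illustrative only.

```rust
/// Rejects a golden set curated against a different chunker version,
/// since its expected_chunk_id references would no longer be valid.
pub fn ensure_chunker_version(golden_version: &str, current_version: &str) -> Result<(), String> {
    if golden_version == current_version {
        Ok(())
    } else {
        Err(format!(
            "golden set was curated for chunker_version {} but the current chunker_version is {}; re-curate the golden set",
            golden_version, current_version
        ))
    }
}
```

Calling this before any query runs means a stale golden set is rejected up front instead of producing misleading misses later.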
Use time::OffsetDateTime::now_utc() for created_at; never local TZ.