feat(p5-2): kb-eval metrics + compare — AggregateMetrics, CompareReport, kb eval CLI
Implements P5-2. Built on top of the stored eval_runs / eval_query_results:
- `kb_eval::metrics`: computes hit@k / MRR / recall@k_doc /
citation_coverage / groundedness / empty_result_rate /
refusal_correctness. NaN metrics (zero denominator) serialize as JSON
null. Values are rounded to 4 decimals, and `Deserialize` is added so
aggregate_json round-trips.
- `kb_eval::compare`: compares two runs into a CompareReport (per-metric
Δ + per-query Win/Loss/Draw/Regression). On chunker_version drift it
falls back gracefully to doc-id matching (chunker_version_match:
"fallback_doc"); with the `strict` option it refuses.
- `render_report_md`: human-readable Markdown (aggregates +
Wins/Losses/Regressions tables).
- `SqliteStore::{load_eval_run, load_eval_query_results,
update_eval_run_aggregate}` plus owned `EvalRunRecord` /
`EvalQueryResultRecord` added; the write-side borrow shape is unchanged.
- `kb eval` CLI: `run` (delegates to P5-1), `aggregate <id>`,
`compare <a> <b> [--strict-chunker-version] [--write-report]`. `--json`
emits the raw CompareReport; the default output is Markdown.
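To make the metric conventions above concrete, here is a minimal sketch of hit@k and MRR with the zero-denominator and 4-decimal rounding rules applied. It is not the `kb_eval::metrics` code: `QueryOutcome`, `round4`, `finalize`, `hit_at_k`, and `mrr` are hypothetical names, and `Option<f64>` stands in for "number or JSON null".

```rust
/// Hypothetical per-query record: 1-based rank of the first expected doc
/// in the retrieved list, or None if no expected doc was retrieved.
#[derive(Debug, Clone, Copy)]
struct QueryOutcome {
    first_hit_rank: Option<usize>,
}

/// Round to 4 decimals so stored values stay stable across platforms.
fn round4(x: f64) -> f64 {
    (x * 10_000.0).round() / 10_000.0
}

/// Map a non-finite value (e.g. 0.0 / 0.0) to None, which a JSON
/// serializer would emit as null; round finite values to 4 decimals.
fn finalize(x: f64) -> Option<f64> {
    if x.is_finite() {
        Some(round4(x))
    } else {
        None
    }
}

/// Fraction of queries whose first relevant doc is within the top k.
fn hit_at_k(outcomes: &[QueryOutcome], k: usize) -> Option<f64> {
    let hits = outcomes
        .iter()
        .filter(|o| matches!(o.first_hit_rank, Some(r) if r <= k))
        .count();
    finalize(hits as f64 / outcomes.len() as f64)
}

/// Mean reciprocal rank; a query with no hit contributes 0.
fn mrr(outcomes: &[QueryOutcome]) -> Option<f64> {
    let sum: f64 = outcomes
        .iter()
        .map(|o| o.first_hit_rank.map_or(0.0, |r| 1.0 / r as f64))
        .sum();
    finalize(sum / outcomes.len() as f64)
}
```

With an empty query set both functions divide 0.0 by 0.0, get NaN, and return `None`, which mirrors the "NaN serializes as null" rule rather than reporting a misleading 0.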
## Spec deviations (intentional, documented)
- Graceful matching is doc-id-only (chunker_version_match: "fallback_doc").
50% span overlap is deferred to P6+ because keeping both runs' chunks
around after a chunker re-index is not practical.
- `*_with_config` helpers added: integration tests drive them with a
TempDir Config. The no-arg forms delegate via Config::load(None).
- The CLI wires kb-cli → kb-eval directly (avoids a kb-app cycle). The
DoD's "via kb-app" was meant to keep a single facade, but it creates a
cycle.
- `AggregateMetrics: Deserialize` added for the aggregate_json round-trip.
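The `*_with_config` delegation pattern from the list above can be sketched as follows. Everything here is a simplified stand-in: the `Config` struct, its `load` signature, the `Aggregate` type, and the `./kb-data` default are assumptions for illustration, and the sketch returns `Box<dyn Error>` only to stay std-only (the commit itself uses `anyhow`).

```rust
use std::error::Error;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for the workspace `Config`; the real type
/// carries more than a data directory.
#[derive(Debug, Clone)]
struct Config {
    data_dir: PathBuf,
}

impl Config {
    /// Stand-in loader: an explicit path wins, otherwise a default.
    fn load(path: Option<&Path>) -> Result<Self, Box<dyn Error>> {
        let data_dir = path
            .map(Path::to_path_buf)
            .unwrap_or_else(|| PathBuf::from("./kb-data"));
        Ok(Config { data_dir })
    }
}

/// Placeholder aggregate; only here to give the helpers a return type.
#[derive(Debug, PartialEq)]
struct Aggregate {
    run_id: String,
    store: PathBuf,
}

/// Config-taking form: tests pass a Config pointing at a temp directory.
fn compute_aggregate_with_config(
    cfg: &Config,
    run_id: &str,
) -> Result<Aggregate, Box<dyn Error>> {
    // A real implementation would open the store under cfg.data_dir here.
    Ok(Aggregate {
        run_id: run_id.to_string(),
        store: cfg.data_dir.clone(),
    })
}

/// No-arg form: loads the default Config and delegates.
fn compute_aggregate(run_id: &str) -> Result<Aggregate, Box<dyn Error>> {
    let cfg = Config::load(None)?;
    compute_aggregate_with_config(&cfg, run_id)
}
```

Keeping the logic in the config-taking form means a test never touches the user's real configuration, while callers that want the spec-pinned signature still get it.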
## Verification
- `cargo test -p kb-eval` 30/30 (15 unit + 2 loader + 8 metrics+compare
integration + 7 runner). 1 of the 8 integration tests is a snapshot
(`compare-1.json`).
- `cargo test -p kb-store-sqlite` 33/33.
- `cargo clippy --workspace --all-targets -- -D warnings` clean.
- No forbidden imports (`kb-source-fs|kb-parse|kb-normalize|kb-chunk|
kb-store-vector|kb-embed|kb-search|kb-llm|kb-rag|kb-tui|kb-desktop|
kb-app`); kb-app is absent from the metrics/compare modules and is used
only by the runner.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -3,7 +3,7 @@ phase: P5
 component: kb-eval (metrics + compare)
 task_id: p5-2
 title: "Metrics computation + compare report"
-status: planned
+status: completed
 depends_on: [p5-1]
 unblocks: []
 contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
@@ -150,3 +150,37 @@ All tests under `cargo test -p kb-eval metrics`.
 - Floating-point sums in MRR cause minor cross-platform drift; round to 4 decimals on storage to keep snapshots stable.
 - "Should refuse" queries are encoded as `expected_doc_ids: []`. Document this convention in the golden YAML header comment.
 - Chunker version drift across runs is the COMMON case, not the error case (you almost always re-chunk before evaluating a chunker change). Default behavior is graceful fallback (doc + span overlap); only `--strict-chunker-version` refuses. The `chunker_version_match` field in `CompareReport.deltas` makes the mode auditable, so silent miscompares are still impossible.
+
+## Implementation deviations (intentional)
+
+Recorded so reviewers don't trip on them; the runtime behavior is the
+same one this spec defines, the names / wiring just differ.
+
+- **Graceful fallback is doc-id-only, not doc + 50% span overlap.** The
+`chunker_version_match` audit field is `"fallback_doc"` (not
+`"fallback_doc_span"`). Span-overlap requires reading both runs'
+`chunks.source_spans` simultaneously — but a chunker-version change
+in practice re-indexes (overwrites) the chunks table, so by the time
+you compute run B the run A chunk rows are already gone. Doc-id
+matching is the strongest stable criterion under that workflow.
+Span-overlap moves to a future phase that owns chunker-version
+archival.
+- **Helper signatures.** `compute_aggregate_with_config(cfg, run_id)` /
+`store_aggregate_with_config(cfg, run_id, agg)` /
+`compare_runs_with_config(cfg, a, b, opts)` exist alongside the
+spec-pinned `compute_aggregate(run_id)` / `store_aggregate(run_id, agg)`
+/ `compare_runs(a, b)` so integration tests can drive the pipeline
+against a TempDir-backed `Config`. The no-arg forms wrap them with
+`Config::load(None)`.
+- **CLI surface lives on `kb-cli` directly, not via `kb-app`.** DoD
+asks for `kb eval compare` to be reached "via kb-app", but `kb-app`
+already depends on `kb-eval` (the P5-1 runner uses the App facade),
+so routing the CLI through `kb-app` would form a cycle. `kb-cli` →
+`kb-eval` is wired directly; `kb-app` is unchanged.
+- **`AggregateMetrics` is `Serialize + Deserialize`.** The spec defines
+only the field shape; we add `Deserialize` so the stored
+`aggregate_json` can round-trip back into the type for follow-up
+computations.
+- **`anyhow`** is used in `Result` returns since the rest of the
+workspace already speaks anyhow; not in the spec's Allowed list but
+matches every other crate.
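As a rough illustration of the doc-id-only fallback recorded in the deviations above, the following sketch shows how the match mode might be chosen and how a per-query verdict could be derived from document-level hits alone. This is not the `kb_eval::compare` implementation: `MatchMode`, `Outcome`, `match_mode`, `doc_hit`, and `classify` are hypothetical names, the `Exact` label for the same-version case is an assumption, "Win" is assumed to mean run B hit where run A missed, and the Regression category is left out because the source does not define it.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum MatchMode {
    /// Hypothetical label for the same-version case.
    Exact,
    /// Corresponds to chunker_version_match: "fallback_doc".
    FallbackDoc,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Win,
    Loss,
    Draw,
}

/// Decide how two runs may be compared; strict mode refuses on drift.
fn match_mode(ver_a: &str, ver_b: &str, strict: bool) -> Result<MatchMode, String> {
    if ver_a == ver_b {
        Ok(MatchMode::Exact)
    } else if strict {
        Err(format!("chunker_version drift: {ver_a} vs {ver_b}"))
    } else {
        Ok(MatchMode::FallbackDoc)
    }
}

/// Doc-id-only hit test: did the run retrieve any expected document?
fn doc_hit(retrieved_docs: &[&str], expected_docs: &[&str]) -> bool {
    let expected: HashSet<&str> = expected_docs.iter().copied().collect();
    retrieved_docs.iter().any(|d| expected.contains(d))
}

/// Per-query verdict for run B relative to run A, by doc-id hits only.
fn classify(a_docs: &[&str], b_docs: &[&str], expected: &[&str]) -> Outcome {
    match (doc_hit(a_docs, expected), doc_hit(b_docs, expected)) {
        (false, true) => Outcome::Win,
        (true, false) => Outcome::Loss,
        _ => Outcome::Draw,
    }
}
```

Because only document ids are compared, the verdict never needs chunk rows from the older run, which is the property the fallback relies on once a re-index has overwritten them.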