fix(p5-2): apply push-time review items — citation/refusal correctness + nits

두 reviewer 의 should-fix 4 건 + nit 5 건 push 전 반영. ## should-fix - `citation_coverage`: 빈 citations[] 가 `Iterator::all` vacuous-true 로 1.0 새는 거 차단 — `!is_empty() && all(non-empty path)` 로 변경. 또한 `_store: &SqliteStore` dead 인자 시그니처에서 제거 (호출 사이트 + 테스트 helper 정리). - `refusal_correctness`: lexical-only run 에서 `answer == None` 인 경우 분모 증가 안 함 (NaN/null 출력) — 자동 fail 처리하던 게 metric 의미를 왜곡함. 새 unit test `refusal_correctness_nan_for_non_rag_run` 추가. - `groundedness`: `must_contain.is_empty() && forbidden.is_empty()`인 golden 은 분모에서 제외. unconfigured entry 가 free 1.0 받지 않게. 새 unit test `groundedness_skips_unconfigured_goldens` 추가. - `kb-cli/Cargo.toml` rationale 코멘트 사실 오류 정정 — kb-eval → kb-app 의존이지 그 반대 아님. ## nits - `KB_EVAL_GOLDEN` / `DEFAULT_GOLDEN_PATH` 중복 — `metrics::` 의 `pub(crate)` 로 단일화, `runner` 가 import. - `render_report_md` 의 `{:?}` `ComparisonKind` → 명시적 lowercase 매핑 함수 (`win`/`loss`/`draw`/`regression`) — JSON 직렬화 컨벤션과 통일. - `extract_chunker_version` `None == None` 매치 silent 위험에 대한 defensive 코멘트. - `delta_null_when_either_nan` 테스트의 `let mut` suppress hack → struct update syntax 로 정리. - `empty_store` test helper + 매번 `mem::forget(tmp)` 죽은 코드 제거. ## 추가 spec doc `tasks/p5/p5-2-metrics-compare.md` deviations 섹션 4 항목 추가: - `kb-eval` crate-level `kb-app` dep — P5-1 inheritance, 새 모듈 surface 는 import 안 함. - `citation_coverage` 약화된 resolver — `document_exists_by_path` 기다리는 중. - `refusal_correctness` non-RAG 런 NaN. - `groundedness` no-check golden skip. ## 검증 - `cargo test -p kb-eval` 35/35 (18 unit + 2 loader + 8 integration + 7 runner; 새 3 unit test). - `cargo clippy --workspace --all-targets -- -D warnings` clean. - `compare_report_snapshot_matches_fixture` 변경 없이 통과 — 새 동작이 스냅샷 입력 (lexical-only, no must_contain, no should-refuse) 영향 없음. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 03:17:32 +00:00
parent d9a5b88d27
commit ee1f2339dd
5 changed files with 151 additions and 108 deletions
--- a/tasks/p5/p5-2-metrics-compare.md
+++ b/tasks/p5/p5-2-metrics-compare.md
@@ -184,3 +184,28 @@ same one this spec defines, the names / wiring just differ.
 - **`anyhow`** is used in `Result` returns since the rest of the
  workspace already speaks anyhow; not in the spec's Allowed list but
  matches every other crate.
+- **`kb-eval` crate-level `kb-app` dep stays.** The crate already
+  depends on `kb-app` from P5-1 (the runner uses the `App` facade), so
+  the Cargo.toml entry remains. The new modules (`metrics.rs`,
+  `compare.rs`) do not import `kb-app` themselves — they're behind the
+  same crate boundary as the runner, but the metric/compare *surface*
+  is `kb-app`-clean. Splitting the crate to avoid a transitive Cargo
+  edge would be churn for no behavior gain.
+- **`citation_coverage` is intentionally weaker than the spec literal.**
+  Spec calls for "every citation resolves to a real chunk in the DB".
+  Current implementation: an Answer counts as fully covered iff it has
+  ≥1 citation AND every citation's path is non-empty. Tightening to a
+  per-citation `document_exists_by_path` SqliteStore probe is the next
+  step once that helper lands. Empty-citations no longer pass through
+  `Iterator::all`'s vacuous-true.
+- **`refusal_correctness` is undefined for non-RAG runs.** The metric
+  judges whether the system *refused*; without an `Answer` (lexical-
+  only or vector-only run), there's nothing to judge. We exclude such
+  queries from the denominator rather than auto-failing them, so a
+  search-only run reports `refusal_correctness` as `null` instead of a
+  misleading 0.0.
+- **`groundedness` skips queries with no `must_contain`/`forbidden`.**
+  An unconfigured golden entry would otherwise score a free 1.0 (or
+  0.0 if the answer happens to contain a forbidden string from a
+  later spec change). Refusal-class queries are also excluded — their
+  groundedness flows through `refusal_correctness`.