fix(p5-2): apply push-time review items — citation/refusal correctness + nits

두 reviewer 의 should-fix 4 건 + nit 5 건 push 전 반영.

## should-fix

- `citation_coverage`: 빈 citations[] 가 `Iterator::all` vacuous-true 로
  1.0 새는 거 차단 — `!is_empty() && all(non-empty path)` 로 변경.
  또한 `_store: &SqliteStore` dead 인자 시그니처에서 제거 (호출 사이트
  + 테스트 helper 정리).
- `refusal_correctness`: lexical-only run 에서 `answer == None` 인 경우
  분모 증가 안 함 (NaN/null 출력) — 자동 fail 처리하던 게 metric 의미를
  왜곡함. 새 unit test `refusal_correctness_nan_for_non_rag_run` 추가.
- `groundedness`: `must_contain.is_empty() && forbidden.is_empty()`인
  golden 은 분모에서 제외. unconfigured entry 가 free 1.0 받지 않게.
  새 unit test `groundedness_skips_unconfigured_goldens` 추가.
- `kb-cli/Cargo.toml` rationale 코멘트 사실 오류 정정 — kb-eval →
  kb-app 의존이지 그 반대 아님.

## nits

- `KB_EVAL_GOLDEN` / `DEFAULT_GOLDEN_PATH` 중복 — `metrics::` 의
  `pub(crate)` 로 단일화, `runner` 가 import.
- `render_report_md` 의 `{:?}` `ComparisonKind` → 명시적 lowercase
  매핑 함수 (`win`/`loss`/`draw`/`regression`) — JSON 직렬화 컨벤션과
  통일.
- `extract_chunker_version` `None == None` 매치 silent 위험에 대한
  defensive 코멘트.
- `delta_null_when_either_nan` 테스트의 `let mut` suppress hack →
  struct update syntax 로 정리.
- `empty_store` test helper + 매번 `mem::forget(tmp)` 죽은 코드 제거.

## 추가 spec doc

`tasks/p5/p5-2-metrics-compare.md` deviations 섹션 4 항목 추가:

- `kb-eval` crate-level `kb-app` dep — P5-1 inheritance, 새 모듈 surface
  는 import 안 함.
- `citation_coverage` 약화된 resolver — `document_exists_by_path` 기다리는
  중.
- `refusal_correctness` non-RAG 런 NaN.
- `groundedness` no-check golden skip.

## 검증

- `cargo test -p kb-eval` 35/35 (18 unit + 2 loader + 8 integration + 7
  runner; 새 3 unit test).
- `cargo clippy --workspace --all-targets -- -D warnings` clean.
- `compare_report_snapshot_matches_fixture` 변경 없이 통과 — 새 동작이
  스냅샷 입력 (lexical-only, no must_contain, no should-refuse) 영향 없음.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-02 03:17:32 +00:00
parent d9a5b88d27
commit ee1f2339dd
5 changed files with 151 additions and 108 deletions

View File

@@ -184,3 +184,28 @@ same one this spec defines, the names / wiring just differ.
- **`anyhow`** is used in `Result` returns since the rest of the
workspace already speaks anyhow; not in the spec's Allowed list but
matches every other crate.
- **`kb-eval` crate-level `kb-app` dep stays.** The crate already
depends on `kb-app` from P5-1 (the runner uses the `App` facade), so
the Cargo.toml entry remains. The new modules (`metrics.rs`,
`compare.rs`) do not import `kb-app` themselves — they're behind the
same crate boundary as the runner, but the metric/compare *surface*
is `kb-app`-clean. Splitting the crate to avoid a transitive Cargo
edge would be churn for no behavior gain.
- **`citation_coverage` is intentionally weaker than the spec literal.**
Spec calls for "every citation resolves to a real chunk in the DB".
Current implementation: an Answer counts as fully covered iff it has
≥1 citation AND every citation's path is non-empty. Tightening to a
per-citation `document_exists_by_path` SqliteStore probe is the next
step once that helper lands. Empty-citations no longer pass through
`Iterator::all`'s vacuous-true.
- **`refusal_correctness` is undefined for non-RAG runs.** The metric
judges whether the system *refused*; without an `Answer` (lexical-
only or vector-only run), there's nothing to judge. We exclude such
queries from the denominator rather than auto-failing them, so a
search-only run reports `refusal_correctness` as `null` instead of a
misleading 0.0.
- **`groundedness` skips queries with no `must_contain`/`forbidden`.**
An unconfigured golden entry would otherwise score a free 1.0 (or
0.0 if the answer happens to contain a forbidden string from a
later spec change). Refusal-class queries are also excluded — their
groundedness flows through `refusal_correctness`.