fix(p5-2): apply push-time review items — citation/refusal correctness + nits
두 reviewer 의 should-fix 4 건 + nit 5 건 push 전 반영.
## should-fix
- `citation_coverage`: 빈 citations[] 가 `Iterator::all` vacuous-true 로
1.0 새는 거 차단 — `!is_empty() && all(non-empty path)` 로 변경.
또한 `_store: &SqliteStore` dead 인자 시그니처에서 제거 (호출 사이트
+ 테스트 helper 정리).
- `refusal_correctness`: lexical-only run 에서 `answer == None` 인 경우
분모 증가 안 함 (NaN/null 출력) — 자동 fail 처리하던 게 metric 의미를
왜곡함. 새 unit test `refusal_correctness_nan_for_non_rag_run` 추가.
- `groundedness`: `must_contain.is_empty() && forbidden.is_empty()`인
golden 은 분모에서 제외. unconfigured entry 가 free 1.0 받지 않게.
새 unit test `groundedness_skips_unconfigured_goldens` 추가.
- `kb-cli/Cargo.toml` rationale 코멘트 사실 오류 정정 — kb-eval →
kb-app 의존이지 그 반대 아님.
## nits
- `KB_EVAL_GOLDEN` / `DEFAULT_GOLDEN_PATH` 중복 — `metrics::` 의
`pub(crate)` 로 단일화, `runner` 가 import.
- `render_report_md` 의 `{:?}` `ComparisonKind` → 명시적 lowercase
매핑 함수 (`win`/`loss`/`draw`/`regression`) — JSON 직렬화 컨벤션과
통일.
- `extract_chunker_version` `None == None` 매치 silent 위험에 대한
defensive 코멘트.
- `delta_null_when_either_nan` 테스트의 `let mut` suppress hack →
struct update syntax 로 정리.
- `empty_store` test helper + 매번 `mem::forget(tmp)` 죽은 코드 제거.
## 추가 spec doc
`tasks/p5/p5-2-metrics-compare.md` deviations 섹션 4 항목 추가:
- `kb-eval` crate-level `kb-app` dep — P5-1 inheritance, 새 모듈 surface
는 import 안 함.
- `citation_coverage` 약화된 resolver — `document_exists_by_path` 기다리는
중.
- `refusal_correctness` non-RAG 런 NaN.
- `groundedness` no-check golden skip.
## 검증
- `cargo test -p kb-eval` 35/35 (18 unit + 2 loader + 8 integration + 7
runner; 새 3 unit test).
- `cargo clippy --workspace --all-targets -- -D warnings` clean.
- `compare_report_snapshot_matches_fixture` 변경 없이 통과 — 새 동작이
스냅샷 입력 (lexical-only, no must_contain, no should-refuse) 영향 없음.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -184,3 +184,28 @@ same one this spec defines, the names / wiring just differ.
|
||||
- **`anyhow`** is used in `Result` returns since the rest of the
|
||||
workspace already speaks anyhow; not in the spec's Allowed list but
|
||||
matches every other crate.
|
||||
- **`kb-eval` crate-level `kb-app` dep stays.** The crate already
|
||||
depends on `kb-app` from P5-1 (the runner uses the `App` facade), so
|
||||
the Cargo.toml entry remains. The new modules (`metrics.rs`,
|
||||
`compare.rs`) do not import `kb-app` themselves — they're behind the
|
||||
same crate boundary as the runner, but the metric/compare *surface*
|
||||
is `kb-app`-clean. Splitting the crate to avoid a transitive Cargo
|
||||
edge would be churn for no behavior gain.
|
||||
- **`citation_coverage` is intentionally weaker than the spec literal.**
|
||||
Spec calls for "every citation resolves to a real chunk in the DB".
|
||||
Current implementation: an Answer counts as fully covered iff it has
|
||||
≥1 citation AND every citation's path is non-empty. Tightening to a
|
||||
per-citation `document_exists_by_path` SqliteStore probe is the next
|
||||
step once that helper lands. Empty-citations no longer pass through
|
||||
`Iterator::all`'s vacuous-true.
|
||||
- **`refusal_correctness` is undefined for non-RAG runs.** The metric
|
||||
judges whether the system *refused*; without an `Answer` (lexical-
|
||||
only or vector-only run), there's nothing to judge. We exclude such
|
||||
queries from the denominator rather than auto-failing them, so a
|
||||
search-only run reports `refusal_correctness` as `null` instead of a
|
||||
misleading 0.0.
|
||||
- **`groundedness` skips queries with no `must_contain`/`forbidden`.**
|
||||
An unconfigured golden entry would otherwise score a free 1.0 (or
|
||||
0.0 if the answer happens to contain a forbidden string from a
|
||||
later spec change). Refusal-class queries are also excluded — their
|
||||
groundedness flows through `refusal_correctness`.
|
||||
|
||||
Reference in New Issue
Block a user