docs(fb-39): golden header + design §10.3 eval + spec status + INDEX

Strengthen the fixtures/golden_queries.yaml header with a precision_at_k_chunk explanation and measurement guidance.
Add a §10.3 Eval metrics section to the frozen design, documenting the retrieval metrics (hit@k, MRR, recall@k_doc, P@k_chunk) and the groundedness metrics.
Flip the p9-fb-39 spec status from open to completed (eval foundation only, lever deferral noted).
Update the fb-39 row in tasks/INDEX.md to mirror the fb-42 row (merged, with a deferral note).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 22:35:15 +09:00
parent bb0ec0469f
commit f00fb376fe
4 changed files with 40 additions and 9 deletions
--- a/docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
+++ b/docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
@@ -1510,6 +1510,26 @@ the agent branches). HTTP-SSE transport is P+ per the fb-29 deferral. The classify
 module has its single source in `kebab-app::error_wire`, shared by kebab-cli
 and kebab-mcp.

+### 10.3 Eval metrics (fb-39)
+
+#### Retrieval metrics (ground-truth curated)
+
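+`kebab eval run` computes metrics against the golden query suite (`fixtures/golden_queries.yaml`). A metric is measurable only once the curator has filled in the field it depends on (`expected_chunk_ids` or `expected_doc_ids`); the shipped template leaves both empty, so each workspace fills in its own.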
+
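+| Metric | Definition | Requires |
+|--------|------------|----------|
+| `hit_at_k` | Whether any expected chunk appears in the top-k (binary per query); reported as the mean over queries, i.e. the fraction of queries with a hit | `expected_chunk_ids` filled in |
+| `MRR` | Mean Reciprocal Rank: the mean over queries of the reciprocal rank of the first relevant chunk | `expected_chunk_ids` filled in |
+| `recall_at_k_doc` | Fraction of expected docs found in the top-k (`\|top-k_docs ∩ expected_doc_ids\| / \|expected_doc_ids\|`) | `expected_doc_ids` filled in |
+| `precision_at_k_chunk` (fb-39) | Fraction of the top-k hits whose chunk_id is in `expected_chunk_ids`. The denominator is fixed at k, so a `top-k` shortfall counts as precision loss. Queries with empty `expected_chunk_ids` are skipped. | `expected_chunk_ids` filled in |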
+
+#### Groundedness metrics (rule-based)
+
+| Metric | Definition |
+|--------|------------|
+| `must_contain` pass | The answer string contains every substring in `golden.must_contain` |
+| `forbidden` pass | The answer string contains none of the substrings in `golden.forbidden` |
+
 ---

 ## 11. Freeze scope / change policy
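
As a reading aid for the §10.3 retrieval metrics added in the hunk above, here is a minimal per-query sketch in Rust. It is not the `kebab-eval` implementation: the function names and signatures are hypothetical, chunk and doc ids are modelled as plain strings, and the reciprocal rank is taken over the whole ranked list because the section states no cutoff. Aggregation across queries (the mean, skipping queries that return `None`) is left out.

```rust
use std::collections::HashSet;

/// hit@k for one query: true when any of the top-k retrieved chunk ids
/// is in the expected set.
fn hit_at_k(retrieved: &[&str], expected: &HashSet<&str>, k: usize) -> bool {
    retrieved.iter().take(k).any(|id| expected.contains(*id))
}

/// Reciprocal rank for one query: 1 / (1-based rank of the first expected
/// chunk), or 0.0 when no expected chunk was retrieved.
/// Assumption: no rank cutoff is applied.
fn reciprocal_rank(retrieved: &[&str], expected: &HashSet<&str>) -> f64 {
    retrieved
        .iter()
        .position(|id| expected.contains(*id))
        .map_or(0.0, |idx| 1.0 / (idx as f64 + 1.0))
}

/// recall@k over documents: |top-k docs ∩ expected| / |expected|.
/// Returns None when `expected` is empty (metric not measurable).
fn recall_at_k_doc(retrieved_docs: &[&str], expected: &HashSet<&str>, k: usize) -> Option<f64> {
    if expected.is_empty() {
        return None;
    }
    // Several hits may come from the same document, so deduplicate first.
    let top: HashSet<&str> = retrieved_docs.iter().take(k).copied().collect();
    let found = expected.iter().filter(|doc| top.contains(**doc)).count();
    Some(found as f64 / expected.len() as f64)
}

/// precision@k over chunks: expected hits in the top-k divided by k.
/// The denominator stays k even when fewer than k hits came back, so a
/// shortfall lowers precision. Returns None when `expected` is empty.
fn precision_at_k_chunk(retrieved: &[&str], expected: &HashSet<&str>, k: usize) -> Option<f64> {
    if expected.is_empty() || k == 0 {
        return None;
    }
    let hits = retrieved
        .iter()
        .take(k)
        .filter(|id| expected.contains(**id))
        .count();
    Some(hits as f64 / k as f64)
}
```

For example, with the ranked list `["c1", "c7", "c3"]`, expected chunks `{c3, c9}` and k = 5, the query is a hit, its reciprocal rank is 1/3, and its precision is 1/5 rather than 1/3, because the two missing slots still count in the denominator.
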
--- a/fixtures/golden_queries.yaml
+++ b/fixtures/golden_queries.yaml
@@ -1,4 +1,4 @@
-# Golden query suite for `kb eval run` (P5-1 / P5-2).
+# Golden query suite for `kebab eval run` (P5-1 / P5-2 / fb-39).
 #
 # Top-level: list of queries. Required fields: `id`, `query`. All
 # others are optional and default to empty / null.
@@ -7,8 +7,16 @@
 # real rows in the active workspace's SQLite store at run time. Stale
 # references make the runner bail at start. The shipped template
 # leaves them empty so the file is loadable on any fresh workspace —
-# fill them in after a `kb ingest` to enable hit@k / MRR metrics
-# (P5-2).
+# fill them in after a `kebab ingest` to enable the metrics that
+# require ground truth (P5-2 + fb-39):
+#
+#   - `expected_chunk_ids` →  hit_at_k, MRR, precision_at_k_chunk (fb-39)
+#   - `expected_doc_ids`   →  recall_at_k_doc
+#
+# `precision_at_k_chunk` (fb-39): the fraction of the top-k retrieved
+# hits whose `chunk_id` is in `expected_chunk_ids`. The denominator is
+# fixed at k, so a `top-k` shortfall counts as precision loss. Queries
+# with empty `expected_chunk_ids` are excluded from this metric.
 #
 # `must_contain` / `forbidden` drive the rule-based groundedness
 # metric (P5-2).
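
The rule-based groundedness checks driven by `must_contain` / `forbidden` can be sketched as two substring predicates. This is an illustration only, with hypothetical function names; it assumes plain case-sensitive substring matching, which the header comment does not specify.

```rust
/// Passes when the answer contains every required substring.
/// An empty `must_contain` list passes vacuously.
fn must_contain_pass(answer: &str, must_contain: &[&str]) -> bool {
    must_contain.iter().all(|needle| answer.contains(*needle))
}

/// Passes when the answer contains none of the forbidden substrings.
/// An empty `forbidden` list passes vacuously.
fn forbidden_pass(answer: &str, forbidden: &[&str]) -> bool {
    !forbidden.iter().any(|needle| answer.contains(*needle))
}
```

Under these assumptions a query with neither list filled in passes both checks, which matches the template's "optional, default to empty" fields.
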
--- a/tasks/INDEX.md
+++ b/tasks/INDEX.md
@@ -129,7 +129,7 @@ P0~P5 are serial. P6~P9 can run in parallel after P5.

    ### 🎯 0.5.0: RAG quality (comes with a cascade: V00X + reindex)
    - [p9-fb-38 score semantics](p9/p9-fb-38-score-semantics.md): ✅ merged (2026-05-10)
-    - [p9-fb-39 retrieval precision tuning](p9/p9-fb-39-retrieval-precision-tuning.md): ⏳ not implemented, needs brainstorm (embedding_version cascade)
+    - [p9-fb-39 retrieval precision tuning](p9/p9-fb-39-retrieval-precision-tuning.md): ✅ merged (2026-05-10), eval foundation only, lever application deferred
    - [p9-fb-40 fact-grounded answer](p9/p9-fb-40-fact-grounded-answer.md): ✅ merged (2026-05-10)

    ### 🎯 0.6.0 or P+: reasoning
--- a/tasks/p9/p9-fb-39-retrieval-precision-tuning.md
+++ b/tasks/p9/p9-fb-39-retrieval-precision-tuning.md
@@ -1,20 +1,23 @@
 ---
 phase: P9
-component: kebab-search + kebab-rag + kebab-chunk
+component: kebab-eval + docs
 task_id: p9-fb-39
 title: "Retrieval precision tuning (reduce rank 5+ noise)"
-status: open
-target_version: 0.5.0
+status: completed
+target_version: 0.7.0
 depends_on: []
 unblocks: []
 contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
-contract_sections: [§3 chunking, §4 search, §7 RAG]
+contract_sections: [§3 chunking, §4 search, §7 RAG, §10.3 eval metrics]
 source_feedback: User dogfooding on 2026-05-06. After using the kebab CLI, Claude Code noted that "noise creeps in from rank 5+"; precision-at-k drops past k=5.
 ---

 # p9-fb-39: Retrieval precision tuning

-> ⏳ **Backlog only, not implemented.** This spec is a skeleton built from dogfooding feedback. Before implementation starts, a design phase via [superpowers:brainstorming](../../docs/superpowers/) is required. Which lever (chunk policy / RRF k / score gate / cross-encoder / embedding upgrade) to touch first, and whether an eval golden set must come first, will be decided after the brainstorm.
+> ✅ **Eval foundation part implemented.** Added the P@k metric (P@5, P@10). Applying this spec's levers (chunk policy / RRF / cross-encoder / embedding upgrade) is split out into separate tasks (fb-39b onward).
+>
+> - Design: [`docs/superpowers/specs/2026-05-10-p9-fb-39-eval-foundation-design.md`](../../docs/superpowers/specs/2026-05-10-p9-fb-39-eval-foundation-design.md)
+> - Plan: [`docs/superpowers/plans/2026-05-10-p9-fb-39-eval-foundation.md`](../../docs/superpowers/plans/2026-05-10-p9-fb-39-eval-foundation.md)

 ## Symptoms / motivation