docs(fb-39): golden header + design §10.3 eval + spec status + INDEX

Strengthen the fixtures/golden_queries.yaml header with a precision_at_k_chunk explanation and measurement guidance.
Add a §10.3 Eval metrics section to the frozen design, documenting the retrieval metrics (hit@k, MRR, recall@k_doc, P@k_chunk) and the groundedness metrics.
Flip the p9-fb-39 spec status from open to completed (eval foundation only, lever deferral noted).
Update the tasks/INDEX.md fb-39 row to mirror fb-42 (merged, deferred note).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -1510,6 +1510,26 @@ the agent branches). HTTP-SSE transport is P+ per the fb-29 deferral. The classify
 module is single-sourced in `kebab-app::error_wire`, shared by kebab-cli +
 kebab-mcp.

+### 10.3 Eval metrics (fb-39)
+
+#### Retrieval metrics (ground-truth curated)
+
+`kebab eval run` computes metrics over the golden query suite (`fixtures/golden_queries.yaml`). They are measurable only when the curator sets `expected_chunk_ids` and `expected_doc_ids` (the shipped template leaves them empty; each workspace fills in its own).
+
+| Metric | Definition | Condition |
+|--------|------|------|
+| `hit_at_k` | Whether an expected chunk appears in the top-k (binary per query). Reported as the mean over queries, i.e. P(hit@k=true) | `expected_chunk_ids` populated |
+| `MRR` | Mean Reciprocal Rank (mean of the reciprocal rank of the first relevant chunk) | `expected_chunk_ids` populated |
+| `recall_at_k_doc` | Fraction of expected docs found in the top-k (`|top-k_docs ∩ expected_doc_ids| / |expected_doc_ids|`) | `expected_doc_ids` populated |
+| `precision_at_k_chunk` (fb-39) | Fraction of top-k chunk_ids that are in `expected_chunk_ids`. Denominator = k (fixed), so a `top-k` shortfall also counts as precision loss. Queries with empty `expected_chunk_ids` are skipped. | `expected_chunk_ids` populated |
+
+#### Groundedness metrics (rule-based)
+
+| Metric | Definition |
+|--------|------|
+| `must_contain` pass | The answer string contains every substring in `golden.must_contain` |
+| `forbidden` pass | The answer string contains none of the substrings in `golden.forbidden` |
+
 ---

 ## 11. Freeze scope / change policy

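The four retrieval metrics defined in §10.3 can be sketched in a few lines of Python. This is only an illustration of the definitions as stated in the table, not the kebab-eval implementation: the function names, the `None` return for skipped queries, the 0.0 reciprocal rank for a miss, and the choice to scan the whole ranked list for the reciprocal rank are all assumptions made for the example.

```python
from __future__ import annotations

from typing import Optional, Sequence


def hit_at_k(retrieved: Sequence[str], expected: set[str], k: int) -> bool:
    """True if any expected chunk id appears among the first k results."""
    return any(cid in expected for cid in retrieved[:k])


def reciprocal_rank(retrieved: Sequence[str], expected: set[str]) -> float:
    """1 / rank of the first relevant chunk; 0.0 on a miss (assumption)."""
    for rank, cid in enumerate(retrieved, start=1):
        if cid in expected:
            return 1.0 / rank
    return 0.0


def recall_at_k_doc(
    retrieved_docs: Sequence[str], expected_docs: set[str], k: int
) -> Optional[float]:
    """|top-k docs ∩ expected| / |expected|; None when nothing is expected."""
    if not expected_docs:
        return None
    return len(set(retrieved_docs[:k]) & expected_docs) / len(expected_docs)


def precision_at_k_chunk(
    retrieved: Sequence[str], expected: set[str], k: int
) -> Optional[float]:
    """Relevant chunks in the top-k divided by k (fixed denominator).

    Returns None for queries with no expected chunk ids, which are skipped.
    """
    if not expected:
        return None
    hits = sum(1 for cid in retrieved[:k] if cid in expected)
    return hits / k  # a short result list still divides by k


# One hypothetical query: only 3 results came back for k = 5.
retrieved = ["c3", "c1", "c9"]
expected = {"c1", "c2"}

print(hit_at_k(retrieved, expected, 5))              # True
print(reciprocal_rank(retrieved, expected))          # 0.5
print(precision_at_k_chunk(retrieved, expected, 5))  # 0.2
```

With a fixed denominator the example query scores 1/5 = 0.2 rather than 1/3, which is how a `top-k` shortfall shows up as precision loss. MRR and the reported `hit_at_k` are then the means of the per-query values over all queries that have `expected_chunk_ids`.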
@@ -1,4 +1,4 @@
-# Golden query suite for `kb eval run` (P5-1 / P5-2).
+# Golden query suite for `kebab eval run` (P5-1 / P5-2 / fb-39).
 #
 # Top-level: list of queries. Required fields: `id`, `query`. All
 # others are optional and default to empty / null.
@@ -7,8 +7,16 @@
 # real rows in the active workspace's SQLite store at run time. Stale
 # references make the runner bail at start. The shipped template
 # leaves them empty so the file is loadable on any fresh workspace —
-# fill them in after a `kb ingest` to enable hit@k / MRR metrics
-# (P5-2).
+# fill them in after a `kebab ingest` to enable the metrics that
+# require ground truth (P5-2 + fb-39):
+#
+# - `expected_chunk_ids` → hit_at_k, MRR, precision_at_k_chunk (fb-39)
+# - `expected_doc_ids` → recall_at_k_doc
+#
+# `precision_at_k_chunk` (fb-39): of the top-k retrieved hits, what
+# fraction's `chunk_id` is in `expected_chunk_ids`. Denominator is k
+# (fixed) — `top-k` shortfall is treated as precision loss. Queries
+# with empty `expected_chunk_ids` are skipped from this metric.
 #
 # `must_contain` / `forbidden` drive the rule-based groundedness
 # metric (P5-2).

||||
@@ -129,7 +129,7 @@ P0~P5 are serial. P6~P9 can run in parallel after P5.

 ### 🎯 0.5.0 - RAG quality (with cascade: V00X + reindex)
 - [p9-fb-38 score semantics](p9/p9-fb-38-score-semantics.md) - ✅ merged (2026-05-10)
-- [p9-fb-39 retrieval precision tuning](p9/p9-fb-39-retrieval-precision-tuning.md) - ⏳ not implemented, brainstorm needed (embedding_version cascade)
+- [p9-fb-39 retrieval precision tuning](p9/p9-fb-39-retrieval-precision-tuning.md) - ✅ merged (2026-05-10), eval foundation only, lever application deferred
 - [p9-fb-40 fact-grounded answer](p9/p9-fb-40-fact-grounded-answer.md) - ✅ merged (2026-05-10)

 ### 🎯 0.6.0 or P+ - reasoning

@@ -1,20 +1,23 @@
 ---
 phase: P9
-component: kebab-search + kebab-rag + kebab-chunk
+component: kebab-eval + docs
 task_id: p9-fb-39
 title: "Retrieval precision tuning (reduce rank 5+ noise)"
-status: open
-target_version: 0.5.0
+status: completed
+target_version: 0.7.0
 depends_on: []
 unblocks: []
 contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
-contract_sections: [§3 chunking, §4 search, §7 RAG]
+contract_sections: [§3 chunking, §4 search, §7 RAG, §10.3 eval metrics]
 source_feedback: User dogfooding 2026-05-06; after using the kebab CLI, Claude Code noted that "noise gets mixed in from rank 5+". precision-at-k drops after k=5.
 ---

 # p9-fb-39 - Retrieval precision tuning

-> ⏳ **Backlog only, not implemented.** This spec is a skeleton built from dogfooding feedback. A design phase via [superpowers:brainstorming](../../docs/superpowers/) must come before implementation starts. Which lever to touch first (chunk policy / RRF k / score gate / cross-encoder / embedding upgrade), and whether an eval golden set has to come first, will be decided after the brainstorm.
+> ✅ **Eval foundation part implemented.** Added the P@k metric (P@5, P@10). Applying this spec's levers (chunk policy / RRF / cross-encoder / embedding upgrade) is split out into separate tasks (fb-39b onward).
+>
+> - Design: [`docs/superpowers/specs/2026-05-10-p9-fb-39-eval-foundation-design.md`](../../docs/superpowers/specs/2026-05-10-p9-fb-39-eval-foundation-design.md)
+> - Plan: [`docs/superpowers/plans/2026-05-10-p9-fb-39-eval-foundation.md`](../../docs/superpowers/plans/2026-05-10-p9-fb-39-eval-foundation.md)

 ## Symptoms / motivation
