From f00fb376fe8bb9a349160c2d1e1baa876a35aa99 Mon Sep 17 00:00:00 2001
From: th-kim0823
Date: Sun, 10 May 2026 22:35:15 +0900
Subject: [PATCH] =?UTF-8?q?docs(fb-39):=20golden=20header=20+=20design=20?=
 =?UTF-8?q?=C2=A710.3=20eval=20+=20spec=20status=20+=20INDEX?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Strengthen fixtures/golden_queries.yaml header with precision_at_k_chunk
explanation + measurement guidance. Add §10.3 Eval metrics section to
frozen design documenting retrieval metrics (hit@k, MRR, recall@k_doc,
P@k_chunk) + groundedness metrics. Flip p9-fb-39 spec status from
open → completed (eval foundation only, lever deferral noted). Update
tasks/INDEX.md fb-39 row mirror to fb-42 (merged, deferred note).

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../2026-04-27-kebab-final-form-design.md     | 20 +++++++++++++++++++
 fixtures/golden_queries.yaml                  | 14 ++++++++++---
 tasks/INDEX.md                                |  2 +-
 .../p9/p9-fb-39-retrieval-precision-tuning.md | 13 +++++++-----
 4 files changed, 40 insertions(+), 9 deletions(-)

diff --git a/docs/superpowers/specs/2026-04-27-kebab-final-form-design.md b/docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
index a4efb74..199fcc9 100644
--- a/docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
+++ b/docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
@@ -1510,6 +1510,26 @@ agent 가 분기). HTTP-SSE transport 는 fb-29 deferral 따라 P+.
 
 classify 모듈은 `kebab-app::error_wire` 에 single source — kebab-cli + kebab-mcp 공유.
 
+### 10.3 Eval metrics (fb-39)
+
+#### Retrieval metrics (ground-truth curated)
+
+`kebab eval run` 이 golden query suite (`fixtures/golden_queries.yaml`) 대해 메트릭 계산. Curator 가 `expected_chunk_ids` 및 `expected_doc_ids` 설정 시에만 측정 가능 (shipped template 은 empty — workspace 별 자체 채움).
+
+| 메트릭 | 정의 | 조건 |
+|--------|------|------|
+| `hit_at_k` | top-k 안 expected chunk 존재 여부 (binary). P(hit@k=true) 평균 | `expected_chunk_ids` 채움 |
+| `MRR` | Mean Reciprocal Rank (첫 관련 chunk rank 역수 평균) | `expected_chunk_ids` 채움 |
+| `recall_at_k_doc` | top-k 안 expected doc 비율 (`|top-k_docs ∩ expected_doc_ids| / |expected_doc_ids|`) | `expected_doc_ids` 채움 |
+| `precision_at_k_chunk` (fb-39) | top-k 안 chunk_id 가 `expected_chunk_ids` 에 포함된 비율. 분모 = k (fixed) — `top-k` 부족도 precision 손실로 간주. 빈 `expected_chunk_ids` query 는 skip. | `expected_chunk_ids` 채움 |
+
+#### Groundedness metrics (rule-based)
+
+| 메트릭 | 정의 |
+|--------|------|
+| `must_contain` pass | answer 문자열 이 `golden.must_contain` 의 모든 substring 포함 |
+| `forbidden` pass | answer 문자열 이 `golden.forbidden` 의 substring 미포함 |
+
 ---
 
 ## 11. 동결 범위 / 변경 정책
diff --git a/fixtures/golden_queries.yaml b/fixtures/golden_queries.yaml
index eecf629..ec631fc 100644
--- a/fixtures/golden_queries.yaml
+++ b/fixtures/golden_queries.yaml
@@ -1,4 +1,4 @@
-# Golden query suite for `kb eval run` (P5-1 / P5-2).
+# Golden query suite for `kebab eval run` (P5-1 / P5-2 / fb-39).
 #
 # Top-level: list of queries. Required fields: `id`, `query`. All
 # others are optional and default to empty / null.
@@ -7,8 +7,16 @@
 # real rows in the active workspace's SQLite store at run time. Stale
 # references make the runner bail at start. The shipped template
 # leaves them empty so the file is loadable on any fresh workspace —
-# fill them in after a `kb ingest` to enable hit@k / MRR metrics
-# (P5-2).
+# fill them in after a `kebab ingest` to enable the metrics that
+# require ground truth (P5-2 + fb-39):
+#
+# - `expected_chunk_ids` → hit_at_k, MRR, precision_at_k_chunk (fb-39)
+# - `expected_doc_ids` → recall_at_k_doc
+#
+# `precision_at_k_chunk` (fb-39): of the top-k retrieved hits, what
+# fraction's `chunk_id` is in `expected_chunk_ids`. Denominator is k
+# (fixed) — `top-k` shortfall is treated as precision loss. Queries
+# with empty `expected_chunk_ids` are skipped from this metric.
 #
 # `must_contain` / `forbidden` drive the rule-based groundedness
 # metric (P5-2).
diff --git a/tasks/INDEX.md b/tasks/INDEX.md
index aa42923..912b9f0 100644
--- a/tasks/INDEX.md
+++ b/tasks/INDEX.md
@@ -129,7 +129,7 @@ P0~P5 는 직렬. P6~P9 는 P5 이후 병렬 가능.
 ### 🎯 0.5.0 — RAG quality (cascade 동반: V00X + reindex)
 
  - [p9-fb-38 score semantics](p9/p9-fb-38-score-semantics.md) — ✅ 머지 (2026-05-10)
- - [p9-fb-39 retrieval precision 튜닝](p9/p9-fb-39-retrieval-precision-tuning.md) — ⏳ 미구현, brainstorm 필요 (embedding_version cascade)
+ - [p9-fb-39 retrieval precision 튜닝](p9/p9-fb-39-retrieval-precision-tuning.md) — ✅ 머지 (2026-05-10) — eval foundation only, lever 적용 deferred
  - [p9-fb-40 fact-grounded answer](p9/p9-fb-40-fact-grounded-answer.md) — ✅ 머지 (2026-05-10)
 
 ### 🎯 0.6.0 또는 P+ — reasoning
diff --git a/tasks/p9/p9-fb-39-retrieval-precision-tuning.md b/tasks/p9/p9-fb-39-retrieval-precision-tuning.md
index 724a6fb..7f9641e 100644
--- a/tasks/p9/p9-fb-39-retrieval-precision-tuning.md
+++ b/tasks/p9/p9-fb-39-retrieval-precision-tuning.md
@@ -1,20 +1,23 @@
 ---
 phase: P9
-component: kebab-search + kebab-rag + kebab-chunk
+component: kebab-eval + docs
 task_id: p9-fb-39
 title: "Retrieval precision 튜닝 (rank 5+ 노이즈 완화)"
-status: open
-target_version: 0.5.0
+status: completed
+target_version: 0.7.0
 depends_on: []
 unblocks: []
 contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
-contract_sections: [§3 chunking, §4 search, §7 RAG]
+contract_sections: [§3 chunking, §4 search, §7 RAG, §10.3 eval metrics]
 source_feedback: 사용자 도그푸딩 2026-05-06 — Claude Code 가 kebab CLI 사용 후 "rank 5+ 부터 노이즈 섞임" 지적. precision-at-k 가 k=5 이후 떨어짐.
 ---
 
 # p9-fb-39 — Retrieval precision 튜닝
 
-> ⏳ **백로그 only — 미구현.** 본 spec 은 도그푸딩 피드백 skeleton. 구현 착수 전 [superpowers:brainstorming](../../docs/superpowers/) 으로 설계 단계 선행 필요. 어느 lever (chunk policy / RRF k / score gate / cross-encoder / embedding 업그레이드) 부터 손볼지, eval golden set 선행 여부 brainstorm 후 결정.
+> ✅ **Eval foundation 부분 구현 완료.** P@k metric (P@5, P@10) 추가. 본 spec 의 lever 적용 (chunk policy / RRF / cross-encoder / embedding 업그레이드) 은 별도 task 로 분리 (fb-39b 이후).
+>
+> - Design: [`docs/superpowers/specs/2026-05-10-p9-fb-39-eval-foundation-design.md`](../../docs/superpowers/specs/2026-05-10-p9-fb-39-eval-foundation-design.md)
+> - Plan: [`docs/superpowers/plans/2026-05-10-p9-fb-39-eval-foundation.md`](../../docs/superpowers/plans/2026-05-10-p9-fb-39-eval-foundation.md)
 
 ## 증상 / 동기