feat(fb-39): eval foundation — precision_at_k_chunk metric #136

Merged
altair823 merged 5 commits from feat/fb-39-eval-foundation into main 2026-05-10 13:41:07 +00:00

Summary

  • AggregateMetrics gains precision_at_k_chunk: BTreeMap<u32, f32> (P@5, P@10).
  • Binary relevance via expected_chunk_ids; the denominator is fixed at k, so a query that returns fewer than k hits still loses precision; queries with empty expected_chunk_ids are skipped (mirrors the hit_at_k policy).
  • 6 unit tests cover backwards-compat deserialize, exact match, partial top-k, zero relevant, empty expected, multi-query average.
  • Golden YAML header strengthened with ground-truth doc.
  • Design §10.3 eval metrics section added.
  • Lever application (chunk policy / RRF / cross-encoder / embedding upgrade) deferred to fb-39b — measurement infrastructure first.
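
The metric policy listed above (binary relevance, fixed denominator k, skipping queries without ground truth) can be sketched roughly as follows. This is an illustration under assumptions, not the code merged in this PR: the `QueryResult` struct, the function names, and the choice to omit a k entry when no query has ground truth are all hypothetical, and the sketch assumes the ranked hits contain no duplicate chunk ids.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical per-query record: ranked chunk ids returned by retrieval,
/// plus the ground-truth chunk ids from the golden set.
pub struct QueryResult {
    pub hits: Vec<String>,
    pub expected_chunk_ids: Vec<String>,
}

/// Precision@k for a single query with binary relevance.
/// The denominator is always k, so a result list shorter than k is penalized.
/// Returns None when there is no ground truth, so the query is skipped.
pub fn precision_at_k(q: &QueryResult, k: u32) -> Option<f32> {
    if q.expected_chunk_ids.is_empty() || k == 0 {
        return None;
    }
    let expected: HashSet<&str> =
        q.expected_chunk_ids.iter().map(String::as_str).collect();
    // Count how many of the top-k hits are in the expected set.
    let relevant = q
        .hits
        .iter()
        .take(k as usize)
        .filter(|id| expected.contains(id.as_str()))
        .count();
    Some(relevant as f32 / k as f32)
}

/// Mean precision@k over the queries that have ground truth, for each k.
/// Assumption: a k with no scorable queries is left out of the map.
pub fn precision_at_k_chunk(queries: &[QueryResult], ks: &[u32]) -> BTreeMap<u32, f32> {
    let mut out = BTreeMap::new();
    for &k in ks {
        let scores: Vec<f32> = queries
            .iter()
            .filter_map(|q| precision_at_k(q, k))
            .collect();
        if !scores.is_empty() {
            out.insert(k, scores.iter().sum::<f32>() / scores.len() as f32);
        }
    }
    out
}
```

With this sketch, a query returning three hits of which two are expected scores 2/5 at k = 5 rather than 2/3, which is the "fixed denominator" behavior described in the summary.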

Design: docs/superpowers/specs/2026-05-10-p9-fb-39-eval-foundation-design.md
Plan: docs/superpowers/plans/2026-05-10-p9-fb-39-eval-foundation.md

Test plan

  • [x] cargo test --workspace --no-fail-fast -j 1 green
  • [x] cargo clippy --workspace --all-targets -- -D warnings clean
  • [x] kebab-eval: 6 new unit tests + existing snapshot fixture additively regenerated
  • [ ] Manual smoke: kebab eval run against any workspace with non-empty expected_chunk_ids → metrics_json contains precision_at_k_chunk

🤖 Generated with Claude Code

altair823 added 4 commits 2026-05-10 13:36:08 +00:00
- Add precision_at_k_chunk: BTreeMap<u32, f32> (P@5, P@10) to
  AggregateMetrics, with binary relevance via expected_chunk_ids
- Denominator fixed at k (hits.len() < k also counts as a precision loss)
- Queries with empty expected_chunk_ids are skipped (same policy as hit_at_k)
- Lever application (chunk policy / RRF / cross-encoder / embedding) is
  out of scope for this spec; it becomes a separate task from fb-39b onward
- Golden set schema unchanged; only the header comments of the shipped fixtures are strengthened

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks: AggregateMetrics.precision_at_k_chunk field + serde
backwards-compat, compute aggregation in loop with 5 unit tests,
golden YAML header doc strengthening, design §11 + INDEX + status
flip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Strengthen fixtures/golden_queries.yaml header with precision_at_k_chunk
explanation + measurement guidance. Add §10.3 Eval metrics section to
frozen design documenting retrieval metrics (hit@k, MRR, recall@k_doc,
P@k_chunk) + groundedness metrics. Flip p9-fb-39 spec status from open
→ completed (eval foundation only, lever deferral noted). Update
tasks/INDEX.md fb-39 row mirror to fb-42 (merged, deferred note).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
altair823 added 1 commit 2026-05-10 13:39:18 +00:00
kebab eval compare now surfaces precision_at_k_chunk delta in both
human-readable table + deltas JSON. Snapshot fixture regenerated
additively.
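
A per-k delta of this kind could be computed along the following lines. This is only a sketch under assumptions: the function name `precision_deltas` and the rule of reporting a delta only for k values present in both runs are hypothetical and not taken from the kebab source. One consequence of that rule is that a baseline run recorded before the field existed (deserialized as an empty map) yields no deltas instead of misleading ones.

```rust
use std::collections::BTreeMap;

/// Per-k delta between two runs (candidate minus baseline).
/// Assumption: a k missing from either run is skipped, not treated as 0.0.
pub fn precision_deltas(
    baseline: &BTreeMap<u32, f32>,
    candidate: &BTreeMap<u32, f32>,
) -> BTreeMap<u32, f32> {
    baseline
        .iter()
        .filter_map(|(k, b)| candidate.get(k).map(|c| (*k, c - b)))
        .collect()
}
```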

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
altair823 merged commit 240120ee80 into main 2026-05-10 13:41:07 +00:00
altair823 deleted branch feat/fb-39-eval-foundation 2026-05-10 13:41:08 +00:00

Reference: altair823-org/kebab#136