Files
kebab/fixtures/golden_queries.yaml
th-kim0823 f00fb376fe docs(fb-39): golden header + design §10.3 eval + spec status + INDEX
Strengthen fixtures/golden_queries.yaml header with precision_at_k_chunk
explanation + measurement guidance. Add §10.3 Eval metrics section to
frozen design documenting retrieval metrics (hit@k, MRR, recall@k_doc,
P@k_chunk) + groundedness metrics. Flip p9-fb-39 spec status from open
→ completed (eval foundation only, lever deferral noted). Update
tasks/INDEX.md fb-39 row mirror to fb-42 (merged, deferred note).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 22:35:15 +09:00

54 lines
1.7 KiB
YAML

# Golden query suite for `kebab eval run` (P5-1 / P5-2 / fb-39).
#
# Top-level: list of queries. Required fields: `id`, `query`. All
# others are optional and default to empty / null.
#
# Curators: `expected_doc_ids` and `expected_chunk_ids` MUST refer to
# real rows in the active workspace's SQLite store at run time. Stale
# references make the runner bail at start. The shipped template
# leaves them empty so the file is loadable on any fresh workspace —
# fill them in after a `kebab ingest` to enable the metrics that
# require ground truth (P5-2 + fb-39):
#
# - `expected_chunk_ids` → hit_at_k, MRR, precision_at_k_chunk (fb-39)
# - `expected_doc_ids` → recall_at_k_doc
#
# `precision_at_k_chunk` (fb-39): of the top-k retrieved hits, what
# fraction's `chunk_id` is in `expected_chunk_ids`. Denominator is k
# (fixed) — `top-k` shortfall is treated as precision loss. Queries
# with empty `expected_chunk_ids` are skipped from this metric.
#
# `must_contain` / `forbidden` drive the rule-based groundedness
# metric (P5-2).
- id: g001
query: "Cargo workspace 멤버 추가하는 법"
lang: ko
must_contain: ["[workspace]", "members"]
difficulty: easy
- id: g002
query: "What is Rust ownership?"
lang: en
must_contain: ["borrow", "lifetime"]
difficulty: easy
- id: g003
query: "Markdown chunking 규칙은?"
lang: ko
must_contain: ["heading"]
forbidden: ["embedding"]
difficulty: medium
- id: g004
query: "How does FTS5 tokenization work for Korean text?"
lang: en
must_contain: ["unicode61", "tokenizer"]
difficulty: medium
- id: g005
query: "RAG citation 검증은 어떻게 동작?"
lang: ko
must_contain: ["citation", "marker"]
difficulty: hard