Last commit. Bulk update of every `kb` occurrence in all .md files.

- 19 crate names (`kb-core`, `kb-app`, …) → `kebab-*` (including the Rust module path form `kb_*` → `kebab_*`).
- Future components (`kb-tui`, `kb-desktop`, `kb-asr-whisper`, `kb-ocr`, `kb-mcp`, `kb-vlm`, `kb-rerank`, `kb-vision-ocr`, `kb-index`, `kb-smoke`, `kb-architecture`) → `kebab-*` (the same prefix will be used when P6+ starts).
- CLI command examples: `kb ingest` / `kb search` / `kb ask` / `kb init` / `kb doctor` / `kb inspect` / `kb list` / `kb eval` → `kebab <verb>`, in both fenced code blocks and inline backticks.
- XDG paths, env vars, and binary paths (`target/release/kb` → `target/release/kebab`) synchronized.
- All references unified across the design doc, initial report, SMOKE, HOTFIXES, phase epics, and task specs.
- `git -c user.name=kb` in task-decomposition.md is kept as-is because it records the author of past git history (the author in the actual git history cannot be changed).
- Also updates `status: planned` → `completed` in `tasks/phase-5-evaluation.md` (not yet reflected after the P5-1 + P5-2 PRs were merged).

## Verification

- `grep -rEn "\bkb-[a-z]|\bkb_[a-z]|\.config/kb\b|kb\.sqlite|\bKB_[A-Z]" --include="*.md"` returns 0 hits (excluding the git author in task-decomposition.md).
- All file path references resolve (every renamed file is updated to its new path).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
155 lines
6.3 KiB
Markdown
---
phase: P5
component: kebab-eval (runner)
task_id: p5-1
title: "Golden query fixture loader + per-query runner"
status: completed
depends_on: [p4-3]
unblocks: [p5-2]
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
contract_sections: [§5.7 eval_runs/eval_query_results, §6.3 runs_dir, phase epic tasks/phase-5-evaluation.md]
---

# p5-1 — Golden fixture runner

## Goal

Load `fixtures/golden_queries.yaml`, run each query through `kebab-app` (lexical / vector / hybrid / rag), and persist results into `eval_query_results` + `runs_dir/<run_id>/per_query.jsonl`.

## Why now / why this size

The runner is the data collector; metrics computation is p5-2's job. Splitting them makes each piece simple and lets us re-compute metrics from stored runs without re-querying.

## Allowed dependencies

- `kebab-core`
- `kebab-config`
- `kebab-app` (calls facade for search / ask)
- `kebab-store-sqlite` (writes eval rows)
- `serde`, `serde_yaml`, `serde_json`
- `time`
- `tracing`
- `thiserror`

## Forbidden dependencies

- `kebab-source-fs`, `kebab-parse-md`, `kebab-normalize`, `kebab-chunk`, `kebab-store-vector`, `kebab-embed*`, `kebab-search`, `kebab-llm*`, `kebab-rag` (all reached via `kebab-app` facade only), `kebab-tui`, `kebab-desktop`

## Inputs

| input | type | source |
|-------|------|--------|
| `fixtures/golden_queries.yaml` | YAML | repo-shipped |
| `EvalRunOpts` | suite, mode, with_rag, k, temperature, seed | CLI |
| `kebab-app` facade | search/ask | runtime |

## Outputs

| output | type | downstream |
|--------|------|------------|
| `eval_runs` row | SQLite | p5-2, history |
| `eval_query_results` rows | SQLite | p5-2 |
| `runs_dir/<run_id>/per_query.jsonl` | filesystem | external tools, audits |
| `EvalRun` struct | `kebab_eval::EvalRun` | caller |

## Public surface (signatures only — no new types)

```rust
pub struct GoldenQuery {
    pub id: String,
    pub query: String,
    pub lang: kebab_core::Lang,
    pub expected_doc_ids: Vec<kebab_core::DocumentId>,
    pub expected_chunk_ids: Vec<kebab_core::ChunkId>,
    pub must_contain: Vec<String>,
    pub forbidden: Vec<String>,
    pub difficulty: Option<String>,
}

pub struct EvalRunOpts {
    pub suite: String,                 // "golden" default
    pub mode: kebab_core::SearchMode,
    pub with_rag: bool,
    pub k: usize,
    pub temperature: Option<f32>,
    pub seed: Option<u64>,
}

pub struct EvalRun {
    pub run_id: String,
    pub created_at: time::OffsetDateTime,
    pub commit_hash: Option<String>,
    pub config_snapshot_json: serde_json::Value,
    pub per_query: Vec<QueryResult>,
}

pub struct QueryResult {
    pub query_id: String,
    pub query: String,
    pub mode: kebab_core::SearchMode,
    pub hits_top_k: Vec<kebab_core::SearchHit>,
    pub answer: Option<kebab_core::Answer>,
    pub elapsed_ms: u32,
    pub error: Option<String>,
}

pub fn load_golden_set(path: &std::path::Path) -> anyhow::Result<Vec<GoldenQuery>>;
pub fn run_eval(opts: &EvalRunOpts) -> anyhow::Result<EvalRun>;
```

## Behavior contract

- `load_golden_set`:
  - Parses YAML; required fields: `id`, `query`. Optional: everything else (defaults to empty / `None`).
  - Validates uniqueness of `id` and that `expected_doc_ids` / `expected_chunk_ids` exist in DB; missing → return error listing the offenders.
- `run_eval`:
  - Loads `fixtures/golden_queries.yaml` (path overridable via env `KEBAB_EVAL_GOLDEN`).
  - Generates `run_id = "run_" + ulid_lower()`.
  - Captures `config_snapshot_json`: serialized `kebab_config::Config` plus `chunker_version`, `embedding_model+version+dims`, `llm.model_id`, `prompt_template_version`, `score_gate`, `rrf_k`, `index_version`.
  - For each query: call `kebab_app::search(SearchQuery { mode: opts.mode, k: opts.k, .. })`. If `opts.with_rag`, also call `kebab_app::ask(query, AskOpts { mode: opts.mode, k: opts.k, explain: true, temperature: opts.temperature, seed: opts.seed, .. })`.
  - Each `QueryResult` records the elapsed wall-clock time of its query in milliseconds (`elapsed_ms`).
  - Errors are caught per-query (do not abort the run). Failed queries record `error: Some(msg)` and `hits_top_k = vec![]`.
  - Determinism: with `temperature=0` and fixed `seed`, two consecutive runs produce byte-identical `per_query.jsonl` for non-RAG queries; RAG queries may differ in negligible token budget telemetry.
  - Persists `eval_runs` row with `aggregate_json = {}` (filled by p5-2). Persists `eval_query_results` rows. Also writes `per_query.jsonl` to `runs_dir/<run_id>/`.
- `run_eval` does NOT compute hit@k or other metrics (that is p5-2).

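The uniqueness check in `load_golden_set` can be sketched with the standard library alone. This is a minimal illustration, not the shipped loader: `GoldenQueryStub` and `check_unique_ids` are hypothetical names, the error is a plain `String` rather than the crate's real error type, and the DB existence check for `expected_doc_ids` / `expected_chunk_ids` is left out.

```rust
use std::collections::{BTreeSet, HashSet};

/// Hypothetical stand-in for the spec's `GoldenQuery`; only the fields
/// needed for the uniqueness check are kept.
#[derive(Debug, Clone)]
pub struct GoldenQueryStub {
    pub id: String,
    pub query: String,
}

/// Returns `Ok(())` when every `id` is unique. Otherwise returns an error
/// message that lists each duplicated id once, in sorted order.
pub fn check_unique_ids(queries: &[GoldenQueryStub]) -> Result<(), String> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut duplicates: BTreeSet<&str> = BTreeSet::new();
    for q in queries {
        // `HashSet::insert` returns false when the id was already present.
        if !seen.insert(q.id.as_str()) {
            duplicates.insert(q.id.as_str());
        }
    }
    if duplicates.is_empty() {
        Ok(())
    } else {
        let offenders: Vec<&str> = duplicates.into_iter().collect();
        Err(format!("duplicate golden query ids: {}", offenders.join(", ")))
    }
}
```

Collecting all offenders before returning, rather than failing on the first one, matches the contract's "return error listing the offenders" wording and lets a curator fix the fixture in one pass.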
## Storage / wire effects

- Writes: `eval_runs`, `eval_query_results`, `runs_dir/<run_id>/per_query.jsonl`.
- Reads: golden YAML, chunk/doc rows (via DB).

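For the filesystem write, a sketch along these lines may help. It assumes each record has already been serialized to a single-line JSON string (in the real crate that would presumably come from `serde_json`); the function name `write_per_query_jsonl` and its signature are illustrative assumptions, not part of the spec.

```rust
use std::fs::{self, File};
use std::io::{self, BufWriter, Write};
use std::path::{Path, PathBuf};

/// Writes one pre-serialized JSON object per line to
/// `<runs_dir>/<run_id>/per_query.jsonl` and returns the file path.
/// Hypothetical helper; serialization is assumed to happen upstream.
pub fn write_per_query_jsonl(
    runs_dir: &Path,
    run_id: &str,
    json_lines: &[String],
) -> io::Result<PathBuf> {
    let run_dir = runs_dir.join(run_id);
    fs::create_dir_all(&run_dir)?;
    let path = run_dir.join("per_query.jsonl");
    let mut out = BufWriter::new(File::create(&path)?);
    for line in json_lines {
        // JSONL needs exactly one record per line, so reject embedded newlines.
        if line.contains('\n') {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "JSONL record must not contain a newline",
            ));
        }
        out.write_all(line.as_bytes())?;
        out.write_all(b"\n")?;
    }
    // Flush explicitly so write errors surface here instead of being lost on drop.
    out.flush()?;
    Ok(path)
}
```

Writing records in query order with a fixed line terminator is one way to keep the file byte-comparable across runs, which the determinism requirement depends on.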
## Test plan

| kind | description | fixture / data |
|------|-------------|----------------|
| unit | YAML loader rejects duplicate IDs | inline YAML |
| unit | YAML loader rejects unknown `expected_chunk_id` | seeded DB |
| unit | runner records `elapsed_ms ≥ 0` for each query | tiny corpus + 3 queries |
| unit | runner captures config_snapshot with all expected version fields | inline |
| unit | failing query (forced via mock retriever) records `error: Some(_)` and continues | mock |
| determinism | re-running same suite + fixed seed → identical `per_query.jsonl` (lexical only) | tmp DB, fixed corpus |
| snapshot | `EvalRun` (with mock LM for `with_rag`) JSON stable | `fixtures/eval/run-1.json` |

All tests under `cargo test -p kebab-eval runner`.

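The "failing query records an error and continues" case can be illustrated with a closure standing in for the retriever. This is a sketch under assumptions: `QueryResultStub` is a hypothetical, trimmed-down version of `QueryResult` (hits reduced to id strings), and `run_queries` takes the search function as a parameter instead of calling the `kebab-app` facade.

```rust
use std::time::Instant;

/// Hypothetical, simplified stand-in for the spec's `QueryResult`.
#[derive(Debug, Clone, PartialEq)]
pub struct QueryResultStub {
    pub query_id: String,
    pub hits_top_k: Vec<String>,
    pub elapsed_ms: u32,
    pub error: Option<String>,
}

/// Runs every `(id, text)` query through `search`, timing each call and
/// turning a failure into a recorded error instead of aborting the run.
pub fn run_queries<F>(queries: &[(String, String)], mut search: F) -> Vec<QueryResultStub>
where
    F: FnMut(&str) -> Result<Vec<String>, String>,
{
    queries
        .iter()
        .map(|(id, text)| {
            let started = Instant::now();
            let outcome = search(text);
            // Saturate instead of wrapping if the duration exceeds u32::MAX ms.
            let elapsed_ms = started.elapsed().as_millis().min(u128::from(u32::MAX)) as u32;
            let (hits_top_k, error) = match outcome {
                Ok(hits) => (hits, None),
                // A failed query keeps its slot, with empty hits and the message.
                Err(msg) => (Vec::new(), Some(msg)),
            };
            QueryResultStub { query_id: id.clone(), hits_top_k, elapsed_ms, error }
        })
        .collect()
}
```

A unit test can then pass a closure that returns `Err` for one chosen query and assert that the result vector still has one entry per input query.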
## Definition of Done

- [ ] `cargo check -p kebab-eval` passes
- [ ] `cargo test -p kebab-eval runner` passes
- [ ] `fixtures/golden_queries.yaml` template shipped (≥ 5 example entries)
- [ ] No imports outside Allowed dependencies
- [ ] PR links design §5.7

## Out of scope

- Metric computation (p5-2).
- LLM-as-judge.
- Compare report generation.
- HTTP/server integrations.

## Risks / notes

- Large RAG suites can be slow. A `--max-queries` flag is deferred to P5-2 / a follow-up. Rationale: (a) the runner currently runs all queries unconditionally; (b) without metrics aggregation it adds little incremental value; (c) it is trivial to add as a `Vec::truncate` once the eval CLI subcommand exists.
- `expected_chunk_id` references depend on `chunker_version`. If the chunker version is bumped, the golden set must be re-curated. Fail fast in the loader.
- Use `time::OffsetDateTime::now_utc()` for `created_at`; never local TZ.
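The deferred `--max-queries` cap could look roughly like the following once the CLI subcommand exists. The function name `apply_max_queries` and the `Option<usize>` parameter are assumptions for illustration; the spec only states that the cap would be a `Vec::truncate`.

```rust
/// Applies an optional cap to the loaded query list, keeping file order.
/// `None` keeps every query, matching the current run-everything behavior.
/// Hypothetical helper; not part of the p5-1 public surface.
pub fn apply_max_queries<T>(mut queries: Vec<T>, max_queries: Option<usize>) -> Vec<T> {
    if let Some(limit) = max_queries {
        // `Vec::truncate` is a no-op when `limit >= queries.len()`.
        queries.truncate(limit);
    }
    queries
}
```

Truncating after loading, rather than while parsing, keeps the loader's validation covering the full fixture even when only a prefix is executed.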