
altair823 f9714aa5cb docs(rename): kb → kebab — README, tasks/, docs/, design doc, report

Final commit. Bulk-updates every occurrence of the word `kb` in all .md files.

- 19 crate names (`kb-core`, `kb-app`, …) → `kebab-*` (including the
  Rust module path form `kb_*` → `kebab_*`).
- Future components (`kb-tui`, `kb-desktop`, `kb-asr-whisper`, `kb-ocr`,
  `kb-mcp`, `kb-vlm`, `kb-rerank`, `kb-vision-ocr`, `kb-index`,
  `kb-smoke`, `kb-architecture`) → `kebab-*` (the same prefix will be
  used when P6+ starts).
- CLI command examples: `kb ingest` / `kb search` / `kb ask` / `kb init` /
  `kb doctor` / `kb inspect` / `kb list` / `kb eval` →
  `kebab <verb>`, in both fenced code blocks and inline backticks.
- XDG paths, env vars, and the binary path (`target/release/kb` →
  `target/release/kebab`) synchronized.
- All references unified across the design doc, initial report, SMOKE,
  HOTFIXES, phase epics, and task specs.
- The `git -c user.name=kb` in task-decomposition.md is left as is,
  because it is author information recorded for past git history (the
  author in the actual git history cannot be changed).
- Also changes `status: planned` → `completed` in
  `tasks/phase-5-evaluation.md` (not yet reflected after the P5-1 +
  P5-2 PRs were merged).

## Verification

- `grep -rEn "\bkb-[a-z]|\bkb_[a-z]|\.config/kb\b|kb\.sqlite|\bKB_[A-Z]"
   --include="*.md"` 0 hits (excluding the git author in
  task-decomposition.md).
- All file path references are live (every renamed file is updated to
  its new path).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-02 04:01:55 +00:00

6.3 KiB


field	value
phase	
component	kebab-eval (runner)
task_id	p5-1
title	Golden query fixture loader + per-query runner
status	completed
depends_on	p4-3
unblocks	p5-2
contract_source	../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
contract_sections	§5.7 eval_runs/eval_query_results; §6.3 runs_dir

phase epic tasks/phase-5-evaluation.md

p5-1 — Golden fixture runner

Goal

Load fixtures/golden_queries.yaml, run each query through kebab-app (lexical / vector / hybrid / rag), and persist results into eval_query_results + runs_dir/<run_id>/per_query.jsonl.

Why now / why this size

The runner is the data collector; metrics computation is p5-2's job. Splitting them makes each piece simple and lets us re-compute metrics from stored runs without re-querying.

Allowed dependencies

kebab-core
kebab-config
kebab-app (calls facade for search / ask)
kebab-store-sqlite (writes eval rows)
serde, serde_yaml, serde_json
time
tracing
thiserror

Forbidden dependencies

kebab-source-fs, kebab-parse-md, kebab-normalize, kebab-chunk, kebab-store-vector, kebab-embed*, kebab-search, kebab-llm*, kebab-rag (all reached via kebab-app facade only), kebab-tui, kebab-desktop

Inputs

input	type	source
`fixtures/golden_queries.yaml`	YAML	repo-shipped
`EvalRunOpts`	suite, mode, with_rag, k, temperature, seed	CLI
`kebab-app` facade	search/ask	runtime

Outputs

output	type	downstream
`eval_runs` row	SQLite	p5-2, history
`eval_query_results` rows	SQLite	p5-2
`runs_dir/<run_id>/per_query.jsonl`	filesystem	external tools, audits
`EvalRun` struct	`kebab_eval::EvalRun`	caller
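
The `runs_dir/<run_id>/per_query.jsonl` output listed above could be produced along these lines. This is a minimal sketch using only the standard library, not the kebab-eval implementation: `write_per_query_jsonl` is a hypothetical name, and each record is assumed to arrive already serialized as a single-line JSON string (the real runner would presumably serialize `QueryResult` with serde_json first).

```rust
use std::fs::{self, File};
use std::io::{self, BufWriter, Write};
use std::path::{Path, PathBuf};

/// Hypothetical helper: writes one pre-serialized JSON record per line to
/// `<runs_dir>/<run_id>/per_query.jsonl` and returns the path of that file.
pub fn write_per_query_jsonl(
    runs_dir: &Path,
    run_id: &str,
    lines: &[String],
) -> io::Result<PathBuf> {
    // Each run gets its own directory under runs_dir.
    let run_dir = runs_dir.join(run_id);
    fs::create_dir_all(&run_dir)?;

    let path = run_dir.join("per_query.jsonl");
    let mut out = BufWriter::new(File::create(&path)?);
    for line in lines {
        // JSONL holds exactly one record per line, so reject embedded newlines.
        if line.contains('\n') {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "record contains a newline",
            ));
        }
        out.write_all(line.as_bytes())?;
        out.write_all(b"\n")?;
    }
    out.flush()?;
    Ok(path)
}
```

Writing records in query order with a fixed line terminator keeps the file comparable byte for byte across runs, provided the serialized records themselves are stable.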

Public surface (signatures only — no new types)

pub struct GoldenQuery {
    pub id: String,
    pub query: String,
    pub lang: kebab_core::Lang,
    pub expected_doc_ids: Vec<kebab_core::DocumentId>,
    pub expected_chunk_ids: Vec<kebab_core::ChunkId>,
    pub must_contain: Vec<String>,
    pub forbidden: Vec<String>,
    pub difficulty: Option<String>,
}

pub struct EvalRunOpts {
    pub suite: String,                    // "golden" default
    pub mode:  kebab_core::SearchMode,
    pub with_rag: bool,
    pub k: usize,
    pub temperature: Option<f32>,
    pub seed: Option<u64>,
}

pub struct EvalRun {
    pub run_id: String,
    pub created_at: time::OffsetDateTime,
    pub commit_hash: Option<String>,
    pub config_snapshot_json: serde_json::Value,
    pub per_query: Vec<QueryResult>,
}

pub struct QueryResult {
    pub query_id: String,
    pub query: String,
    pub mode: kebab_core::SearchMode,
    pub hits_top_k: Vec<kebab_core::SearchHit>,
    pub answer: Option<kebab_core::Answer>,
    pub elapsed_ms: u32,
    pub error: Option<String>,
}

pub fn load_golden_set(path: &std::path::Path) -> anyhow::Result<Vec<GoldenQuery>>;
pub fn run_eval(opts: &EvalRunOpts) -> anyhow::Result<EvalRun>;

Behavior contract

load_golden_set:
- Parses YAML; required fields: id, query. Optional: everything else (defaults to empty / None).
- Validates uniqueness of id and that expected_doc_ids / expected_chunk_ids exist in DB; missing → return error listing the offenders.
run_eval:
- Loads fixtures/golden_queries.yaml (path overridable via env KEBAB_EVAL_GOLDEN).
- Generates run_id = "run_" + ulid_lower().
- Captures config_snapshot_json: serialized kebab_config::Config plus chunker_version, embedding_model+version+dims, llm.model_id, prompt_template_version, score_gate, rrf_k, index_version.
- For each query: call kebab_app::search(SearchQuery { mode: opts.mode, k: opts.k, .. }). If opts.with_rag, also call kebab_app::ask(query, AskOpts { mode: opts.mode, k: opts.k, explain: true, temperature: opts.temperature, seed: opts.seed, .. }).
- Each QueryResult records the wall-clock time its query took, in milliseconds (elapsed_ms).
- Errors are caught per-query (do not abort the run). Failed queries record error: Some(msg) and hits_top_k = vec![].
- Determinism: with temperature=0 and a fixed seed, two consecutive runs produce byte-identical per_query.jsonl for non-RAG queries; RAG queries may still differ in token budget telemetry, which is treated as negligible.
- Persists eval_runs row with aggregate_json = {} (filled by p5-2). Persists eval_query_results rows. Also writes per_query.jsonl to runs_dir/<run_id>/.
run_eval does NOT compute hit@k or other metrics (that is p5-2).
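
The per-query timing and error-isolation rules above can be sketched as follows. This is a minimal illustration using only the standard library, not the actual runner: `QueryResultSketch` and `run_all` are hypothetical names, the closure stands in for the `kebab-app` search facade, and hits are reduced to plain strings.

```rust
use std::time::Instant;

/// Simplified stand-in for the spec's `QueryResult` (hypothetical type).
#[derive(Debug)]
pub struct QueryResultSketch {
    pub query_id: String,
    pub hits_top_k: Vec<String>,
    pub elapsed_ms: u32,
    pub error: Option<String>,
}

/// Runs every `(id, query)` pair through `search`. A failing query is
/// recorded with `error: Some(msg)` and empty hits; it never aborts the run.
pub fn run_all<F>(queries: &[(String, String)], mut search: F) -> Vec<QueryResultSketch>
where
    F: FnMut(&str) -> Result<Vec<String>, String>,
{
    queries
        .iter()
        .map(|(id, query)| {
            let started = Instant::now();
            let outcome = search(query.as_str());
            // Saturate instead of wrapping if a query somehow exceeds u32::MAX ms.
            let elapsed_ms = started.elapsed().as_millis().min(u32::MAX as u128) as u32;
            let (hits_top_k, error) = match outcome {
                Ok(hits) => (hits, None),
                Err(msg) => (Vec::new(), Some(msg)),
            };
            QueryResultSketch {
                query_id: id.clone(),
                hits_top_k,
                elapsed_ms,
                error,
            }
        })
        .collect()
}
```

Passing the search step in as a closure also mirrors the test plan's "mock retriever" case: a test can inject a closure that fails for one query and check that the remaining queries still produce results.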

Storage / wire effects

Writes: eval_runs, eval_query_results, runs_dir/<run_id>/per_query.jsonl.
Reads: golden YAML, chunk/doc rows (via DB).

Test plan

kind	description	fixture / data
unit	YAML loader rejects duplicate IDs	inline YAML
unit	YAML loader rejects unknown `expected_chunk_id`	seeded DB
unit	runner records `elapsed_ms ≥ 0` for each query	tiny corpus + 3 queries
unit	runner captures config_snapshot with all expected version fields	inline
unit	failing query (forced via mock retriever) records `error: Some(_)` and continues	mock
determinism	re-running same suite + fixed seed → identical `per_query.jsonl` (lexical only)	tmp DB, fixed corpus
snapshot	`EvalRun` (with mock LM for `with_rag`) JSON stable	`fixtures/eval/run-1.json`

All tests under cargo test -p kebab-eval runner.
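
The two loader-rejection cases in the test plan rest on checks like the ones below. This is a sketch under stated assumptions rather than the loader itself: `check_unique_ids` and `find_missing` are hypothetical helpers, IDs are treated as plain strings, and the set of known IDs is passed in as a slice instead of being read from the DB.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: returns Err with the sorted list of IDs that
/// appear more than once, so the caller can report every offender at once.
pub fn check_unique_ids(ids: &[&str]) -> Result<(), Vec<String>> {
    let mut seen = BTreeSet::new();
    let mut duplicates = BTreeSet::new();
    for id in ids {
        // `insert` returns false when the value was already present.
        if !seen.insert(*id) {
            duplicates.insert((*id).to_string());
        }
    }
    if duplicates.is_empty() {
        Ok(())
    } else {
        Err(duplicates.into_iter().collect())
    }
}

/// Hypothetical helper: lists the expected IDs that are absent from `known`,
/// in the order they were referenced.
pub fn find_missing(expected: &[&str], known: &[&str]) -> Vec<String> {
    let known: BTreeSet<&str> = known.iter().copied().collect();
    expected
        .iter()
        .filter(|id| !known.contains(**id))
        .map(|id| (*id).to_string())
        .collect()
}
```

Collecting all offenders before returning matches the contract's "return error listing the offenders" wording, instead of stopping at the first bad entry.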

Definition of Done

cargo check -p kebab-eval passes
cargo test -p kebab-eval runner passes
fixtures/golden_queries.yaml template shipped (≥ 5 example entries)
No imports outside Allowed dependencies
PR links design §5.7

Out of scope

Metric computation (p5-2).
LLM-as-judge.
Compare report generation.
HTTP/server integrations.

Risks / notes

Large RAG suites can be slow. A --max-queries flag is deferred to P5-2 or a follow-up because (a) the runner currently runs all queries unconditionally; (b) without metrics aggregation it adds little incremental value; and (c) it is trivial to add as a Vec::truncate once the eval CLI subcommand exists.
expected_chunk_id references depend on chunker_version. If the chunker version is bumped, the golden set must be re-curated. Fail fast in the loader.
Use time::OffsetDateTime::now_utc() for created_at; never local TZ.
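
For the deferred --max-queries flag mentioned in the first note, the `Vec::truncate` approach might look like this. It is only a sketch of a feature that is out of scope for this task: `apply_max_queries` and its `Option<usize>` parameter are hypothetical and not part of the public surface defined here.

```rust
/// Hypothetical helper: keeps only the first `max` queries when a cap is
/// given; with `None` the full suite is left untouched.
pub fn apply_max_queries<T>(queries: &mut Vec<T>, max_queries: Option<usize>) {
    if let Some(max) = max_queries {
        // `Vec::truncate` is a no-op when `max` is at least the current length.
        queries.truncate(max);
    }
}
```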
