Cmd::Eval now loads Config via cli.config (same pattern as all other
subcommands) before dispatching to the inner match. Each arm now calls
the *_with_config variant:
run_eval(&opts) → run_eval_with_config(&cfg, &opts)
compute_aggregate(run_id) → compute_aggregate_with_config(&cfg, run_id)
store_aggregate(run_id, ..) → store_aggregate_with_config(&cfg, run_id, ..)
Compare already called compare_runs_with_config but sourced cfg from
Config::load(None) — that redundant load is removed; cfg comes from
the shared binding above.
Fixes the same facade-rule regression pattern as P3-5 / P4-3: previously
`kebab --config /build/dogfood/config.toml eval run` silently evaluated
the XDG-default (empty) KB instead of the dogfood KB.
Also fixes runner.rs test that hardcoded rag-v2 after commit 5719969
bumped the default prompt_template_version to rag-v3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PR-2 of fb-41 multi-hop RAG. Decompose + retrieve + synthesize 3-stage
pipeline가 `opts.multi_hop=true` 일 때 dispatch. Dynamic decide loop
는 PR-3.
- `AskOpts.multi_hop: bool` 필드 추가 + `impl Default for AskOpts`
도입 (HOTFIXES 2026-05-07 의 known limitation 해소). 9 explicit
init site 모두 `multi_hop: false` 추가 — Default 도입으로 향후
`..Default::default()` 점진 migrate 가능.
- `RagPipeline::ask` 의 entry 에 dispatcher 한 줄
(`if opts.multi_hop { return self.ask_multi_hop(...) }`).
- `RagPipeline::ask_multi_hop` 신규 method. 1) decompose LLM call
→ JSON array of strings parse, 2) 각 sub-query 로 retrieve +
chunk_id dedup pool, 3) score gate / no-chunks 가드, 4)
pack_context (single-pass 와 helper 공유), 5) synthesize LLM
call w/ MULTI_HOP_SYNTHESIZE_SYSTEM_PROMPT, 6) citation extract
+ Answer build. `prompt_template_version` = "rag-multi-hop-v1"
로 stamp — eval `compare` 가 single-pass vs multi-hop 분리.
- Prompt const 신규: MULTI_HOP_DECOMPOSE_SYSTEM_PROMPT +
MULTI_HOP_DECOMPOSE_USER_TEMPLATE + MULTI_HOP_SYNTHESIZE_SYSTEM_PROMPT
+ PROMPT_TEMPLATE_VERSION_MULTI_HOP + MULTI_HOP_MAX_SUB_QUERIES_DEFAULT.
- `kebab_core::RefusalReason::MultiHopDecomposeFailed` variant 신규.
Cascade: kebab-store-sqlite `refusal_reason_label` + kebab-tui `ask
refusal render` exhaustive match 갱신.
- `parse_decompose_response` + `strip_markdown_json_fence` helper —
markdown code fence (```json / ```) strip + JSON array of strings
parse + trim + drop empty + cap at MULTI_HOP_MAX_SUB_QUERIES_DEFAULT.
None 반환 시 caller 가 `MultiHopDecomposeFailed` refusal.
Tests (55 passing total, 8 신규):
- 6 unit (parse_decompose_response 의 bare array / fence variants /
garbage / cap / trim 회귀 핀).
- 2 integration: `ask_multi_hop_dispatches_and_decompose_garbage_refuses`
(decompose garbage → MultiHopDecomposeFailed + 정확히 1 LLM call) +
`ask_with_multi_hop_false_keeps_single_pass_path` (회귀 핀, 기존
caller 자동 backwards-compat).
Happy-path multi-hop (decompose 성공 → synthesize) 의 integration
test 는 ScriptedLm helper 가 PR-3 의 decide loop 와 함께 도입될
때 같이 추가. 현 `MockLanguageModel` 는 canned single response 라
2-LLM-call sequence 핀 불가.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR-1 of fb-41 multi-hop RAG (spec: docs/superpowers/specs/2026-05-25-
p9-fb-41-multi-hop-rag-design.md, plan: docs/superpowers/plans/2026-
05-25-p9-fb-41-multi-hop-rag.md).
XL 작업의 첫 PR — baseline 측정 anchor 만 추가. RAG pipeline 미변경,
fixture file + parse 회귀 핀.
사용자 결정 4 axis (2026-05-25):
- approach: query decomposition (LLM 서브-질문)
- trigger: explicit `--multi-hop` flag
- MVP scope: dynamic N-hop (LLM 이 depth 결정, decompose seed +
ReAct-style decide loop hybrid)
- eval: multi-hop golden set 먼저 (본 PR)
본 PR:
- `fixtures/multi_hop_golden.yaml` 신규. 15 question (5 cross-doc +
5 intra-doc + 5 single-fact negative). 기존 `GoldenQuery` struct
그대로 사용 — 별 loader / type 변경 없음. `expected_chunk_ids`
비어 있어 curator 가 `kebab ingest` 후 채울 수 있는 template
형태. `must_contain` 으로 baseline 측정 가능 (P5-2 metric).
- `crates/kebab-eval/tests/loader.rs::loads_multi_hop_golden_fixture`
신규 회귀 핀. fixture parse OK + 15 question + 5/5/5 bucket
분포 + 모든 question 에 must_contain 최소 1 개.
baseline 측정 protocol (별 run, commit 에 artifact 안 포함):
1. v0.17.2 binary 로 single-pass `kebab eval run --fixture
multi_hop_golden.yaml` 실행
2. P@5, P@10, must_contain pass rate, citation_coverage 캡처
3. PR-3 (dynamic iter 머지) 후 동일 fixture + `multi_hop=true` 로
재실행 → Δ 비교
PR 분할 6 단계 (plan 참조): PR-1 (본 PR — fixture only), PR-2
(RagPipeline::ask_multi_hop fixed depth=2), PR-3 (dynamic iter),
PR-4 (CLI flag + wire), PR-5 (MCP + SKILL.md), PR-6 (TUI toggle +
trace render). 마지막 PR 후 v0.18.0 cut.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trigram tokenizer 가 snippet 단위 + 단어 경계 + BM25 raw score 분포를
모두 바꿔서 unicode61 assumption 기반의 3 test 가 regression.
- wire_search_response::search_json_truncates_with_max_tokens +
search_plain_emits_truncated_hint_to_stderr: 단일 doc + 작은
max_tokens 로는 snippet 이 짧아서 budget loop 가 trip 안 함.
다중 doc fixture (5 doc) + budget 30 token 으로 hit-pop 경로
통해 truncated=true 보장.
- fetch_integration::fetch_chunk_with_context_returns_neighbors:
fixture body 의 2-char tokens (A1/A3 등) 가 trigram 비호환으로
0-hit. apples/banana/cherry/durian/elder 5-char unique words
로 갱신, query 도 cherry 로 deterministic pin.
- eval/runner::runner_per_query_snapshot_matches_fixture: trigram
token stream 으로 BM25 raw score 변동. UPDATE_SNAPSHOTS=1 로
regenerate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`runner_lexical_is_deterministic_per_query_payload` 가 full-suite 첫 실행에서
간헐적으로 `elapsed_ms: 0` vs `elapsed_ms: 1` 차이로 깨지는 timing flake 가
있었음 (PR #140 회차 0 의 full-suite 실행에서 관찰).
원인: per_query 전체 JSON 을 byte-identical 비교하는데 QueryResult.elapsed_ms
가 timing 기반이라 µs-scale wall-clock jitter 가 그대로 비교에 들어감. 의도는
"timing 외에 byte-identical" — 인접 snapshot test #7 은 projection 으로
timing 을 명시적으로 제외하지만 #6 은 누락.
Fix: 비교 직전 양쪽 run 의 elapsed_ms 를 0 으로 normalize. 의도 그대로
표현하고 다른 field 의 결정성 검증은 보존. 50회 반복 stress 통과 (이전:
간헐 실패).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire two new optional fields onto SearchHit (skip_serializing_if = None)
and two Vec<String> filter fields onto SearchFilters (serde default).
Add RetrievalDetail::Default impl (manual, uses SearchMode::Hybrid as
sentinel). Patch all downstream SearchHit / SearchFilters literal
constructors with repo: None / code_lang: None / vec![] as appropriate.
Also covers Citation::Code arm in kebab-eval metrics match.
kebab eval compare now surfaces precision_at_k_chunk delta in both
human-readable table + deltas JSON. Snapshot fixture regenerated
additively.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add missing score_kind field to SearchHit constructors in:
- kebab-tui/tests/search.rs::make_hit()
- kebab-eval/tests/metrics_and_compare.rs::hit()
- kebab-eval/src/metrics.rs::hit()
All test fixtures default to Rrf (hybrid mode), matching the field's
Default impl and the test semantics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec PR #59 의 §3.8 multi-turn behaviour 구현. RAG facade 가 prior
turns 받아 prompt 에 prepend, retrieval query expansion 적용,
Answer 에 conversation_id / turn_index 채움.
신규 (kebab-core):
- Answer 에 conversation_id (Option<String>) / turn_index (Option<u32>)
field 추가. serde skip_serializing_if 로 single-shot 의 wire
output 변경 0 (기존 외부 wrapper 영향 없음).
- Turn struct (question + answer + citations + created_at).
- RefusalReason::LlmStreamAborted variant.
신규 (kebab-rag):
- AskOpts 에 history (Vec<Turn>) / conversation_id / turn_index 3 field.
- AskOpts::single_shot(mode) helper.
- RagPipeline::ask_with_history(query, history, conversation_id,
turn_index, opts) — combined opts 로 ask 호출.
- expand_query_with_history: history.last() 의 answer 첫 200 자
concat 해 SearchQuery.text 확장 (spec §3.8 의 \"cheap concat\";
LLM-based standalone-question rewriting 은 P+).
- serialize_history + remaining_history_budget_chars: spec 의 priority
enforcement — system+question 필수, retrieved chunks 가 차지한
뒤 남은 char budget 안에서 newest 우선, oldest drop.
- ask 본문: history 가 비어있지 않으면 [이전 대화] 블록을 user
prompt 위에 prepend. Answer 생성 site 3 곳 (정상 / NoChunks /
ScoreGate refuse) 모두 conversation_id / turn_index 채움.
신규 (kebab-store-sqlite):
- refusal_reason_label 가 LlmStreamAborted → 'llm_stream_aborted'.
기존 caller 변경 (single-shot 동작 동일):
- kebab-cli main.rs Cmd::Ask: AskOpts 에 history=Vec::new(),
conversation_id=None, turn_index=None 명시 (CLI multi-turn 은
p9-fb-18 의 --session/--repl 가 채움).
- kebab-tui src/ask.rs spawn site 동일 (multi-turn UI 는 p9-fb-16).
- kebab-eval runner.rs golden eval 동일 (single-shot per query).
- kebab-app tests/ask_smoke.rs / kebab-tui tests/ask.rs / kebab-rag
tests/pipeline.rs / kebab-eval metrics.rs Answer literal 갱신.
Test:
- 9 신규 lib unit (expand_query 4 / serialize_history 3 / remaining_budget 2).
- 기존 12 PASS 회귀 0.
Plan 갱신:
- p9-fb-15 status planned → in_progress. 머지 후 한 줄 commit
으로 completed flip.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>