fix(rag): fb-41 PR-7 - multi-hop pre-decompose score-gate (S7 hallucination regression pin)
Fixes a **safety regression** found while dogfooding fb-41 multi-hop RAG ahead of the v0.18 cut. For the detailed dogfood results, see the 2026-05-25 fb-41 pre-v0.18 entry in `tasks/HOTFIXES.md` and `/build/cache/dogfood-v018/results/SUMMARY.md`.

## Problem (S7)

Query: `What is the chemical formula of caffeine?` (a fact that is not in the KB).

- Single-pass `kebab ask`: the top retrieve score is below the default `rag.score_gate = 0.30` → `refuse_score_gate` → safe refusal.
- Multi-hop `kebab ask --multi-hop`: **`grounded = true`**, with the body `"카페인의 화학식은 C₉H₁₅N₃O 입니다 [#6]"` ("The chemical formula of caffeine is C₉H₁₅N₃O [#6]"), which is a hallucination (the actual formula is C₈H₁₀N₄O₂). In addition, `[#6]` cites the body of the Adam optimizer chunk, `g_t = ∂L/∂θ_i`, triggered by a match on short structured tokens that look visually similar.

Cause: the score-gate check in `ask_multi_hop` looked only at the *pool's top_score*. The multi-hop pool is the union of 5 sub-queries' results, so when one sub-query's top score is above the gate, the pool passes the gate even if its other chunks are unrelated to the original query. Synthesis then runs and the LLM hallucinates.

## Fix

Add a **pre-decompose probe** at the entry of `ask_multi_hop`:

1. Retrieve once with the *original query* (0 LLM calls, ~ms).
2. Probe empty → `refuse_no_chunks(None)` (no decompose, hops=None).
3. Probe top_score < gate → `refuse_score_gate(None)` (no decompose).
4. Probe passes → the existing decompose / decide / synthesize flow, unchanged.

Multi-hop's safety floor now matches single-pass exactly: multi-hop adds cross-doc reasoning only when the *original query is already within KB scope*.

Cost: one retrieve (a few ms), no LLM call. Negligible against multi-hop's LLM-dominated latency.

## Tests

Three new regression pins (`crates/kebab-rag/tests/multi_hop.rs`):

- `multi_hop_below_probe_gate_refuses_before_any_llm_call`: the **direct S7 regression pin**. Low-score chunk + empty LM script → score_gate refusal, 0 LM calls, hops=None. Panics immediately if the fix is reverted.
- `multi_hop_empty_probe_pool_refuses_before_any_llm_call`: on an empty retrieve, NoChunks refusal with 0 LM calls.
- `multi_hop_above_probe_gate_proceeds_to_decompose`: when the probe passes, the full multi-hop flow runs normally (decompose + decide + synth).

The 7 existing multi-hop tests prepend a *probe-pass entry* to their `ScriptedRetriever` and raise the `retriever_handle.calls()` expectation by 1. Tests that already had two entries, such as test 2 / test 4, also get the prepend (3 entries).

`multi_hop_refuse_no_chunks_preserves_hops_trace` / `multi_hop_refuse_score_gate_preserves_hops_trace` are narrowed in meaning: they now verify only the *decompose-driven* refusal (after the probe passes, the sub-query retrieve is empty or below-gate). The *probe-driven* refusal has hops=None (decompose never runs), and the new tests pin that path.

## Verification

- `cargo test -p kebab-rag -j 1`: 10 multi-hop (7 updated + 3 new) + 19 pipeline + 31 unit + 3 prompt_template + 3 streaming all pass. No regressions.
- `cargo clippy -p kebab-rag --all-targets -j 1 -- -D warnings` is clean.
- Single-crate serial build (16 GB RAM constraint).

## Unchanged

- Wire schema: the `Answer.hops` shape and the `refusal_reason` enum are the same.
- Other dogfood findings (synthesize citation consistency, latency, binary path confusion) are left to v0.18.1 or a separate PR. They are listed in the "Other dogfood findings" ("다른 도그푸딩 발견") section of HOTFIXES.

## Next

After PR-7 merges:

1. Workspace `Cargo.toml` version 0.17.2 → 0.18.0 (minor bump).
2. Update HANDOFF.md / INDEX.md + frozen design §3.8 multi-hop sub-section.
3. `gitea-release v0.18.0 --auto-notes`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
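The gap described in the commit message can be reproduced in isolation. The following is a minimal sketch, not the kebab-rag implementation: `Hit`, `Gate`, `gate`, `pool_only_gate`, and `probe_then_pool_gate` are hypothetical names, and the scores are illustrative values chosen around the default gate of 0.30.

```rust
// Hypothetical, simplified types for illustration; not the kebab-rag API.
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub chunk_id: &'static str,
    pub score: f32,
}

#[derive(Debug, PartialEq)]
pub enum Gate {
    Pass,
    RefuseNoChunks,
    RefuseScoreGate,
}

/// Gate on the best score in `hits` (does not assume `hits` is sorted).
pub fn gate(hits: &[Hit], score_gate: f32) -> Gate {
    let top = hits.iter().map(|h| h.score).fold(f32::NEG_INFINITY, f32::max);
    if hits.is_empty() {
        Gate::RefuseNoChunks
    } else if top < score_gate {
        Gate::RefuseScoreGate
    } else {
        Gate::Pass
    }
}

/// Pre-fix shape: only the union pool of sub-query hits is gated, so a
/// single strong sub-query hit is enough to pass.
pub fn pool_only_gate(sub_query_hits: &[Vec<Hit>], score_gate: f32) -> Gate {
    let pool: Vec<Hit> = sub_query_hits.iter().flatten().cloned().collect();
    gate(&pool, score_gate)
}

/// Post-fix shape: the original query's own hits are gated first; the pool
/// gate only runs when that probe passes.
pub fn probe_then_pool_gate(
    probe_hits: &[Hit],
    sub_query_hits: &[Vec<Hit>],
    score_gate: f32,
) -> Gate {
    match gate(probe_hits, score_gate) {
        Gate::Pass => pool_only_gate(sub_query_hits, score_gate),
        refusal => refusal,
    }
}
```

With a probe whose best score is 0.05 and a sub-query that scores 0.85 against an unrelated chunk, `pool_only_gate` passes while `probe_then_pool_gate` refuses, which is the S7 behavior before and after the fix.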
@@ -619,6 +619,60 @@ impl RagPipeline {
     /// eval `compare` can isolate multi-hop runs from single-pass.
     pub fn ask_multi_hop(&self, query: &str, opts: AskOpts) -> Result<Answer> {
         let started = std::time::Instant::now();
+        let k_effective = opts.k.max(self.config.search.default_k);
+
+        // ── 0. Pre-decompose score-gate probe (v0.18 dogfood fix) ──────────
+        //
+        // p9-fb-41 v0.18 pre-cut dogfood (`/build/cache/dogfood-v018/
+        // results/SUMMARY.md`) found that an out-of-corpus query
+        // ("What is the chemical formula of caffeine?") on the multi-
+        // hop path returned `grounded=true` with hallucinated content +
+        // a misattributed citation marker. Cause: multi-hop's 5-sub-
+        // query union pool fills with chunks each loosely matching one
+        // sub-query, then the post-pool score gate (which inspects
+        // `pool[0].fusion_score`) sees a sub-query's hit and passes —
+        // even though the *original* query never matched anything
+        // above the gate.
+        //
+        // Fix: probe the original query exactly the way single-pass
+        // `ask` would, before any decompose / decide LLM call. If
+        // top_score < gate (or hits empty), refuse with the same
+        // envelope single-pass would emit. Multi-hop's safety floor
+        // is now identical to single-pass's — multi-hop only *adds*
+        // the cross-doc reasoning when the original query is already
+        // in scope.
+        //
+        // Cost: one extra retrieve call (~ms, no LLM). Negligible vs.
+        // the LLM-dominated multi-hop latency.
+        let probe_query = SearchQuery {
+            text: query.to_string(),
+            mode: opts.mode,
+            k: k_effective,
+            filters: SearchFilters::default(),
+        };
+        let mut probe_hits = self
+            .retriever
+            .search(&probe_query)
+            .context("kb-rag: multi-hop probe retriever.search")?;
+        let probe_now = OffsetDateTime::now_utc();
+        let probe_threshold = self.config.search.stale_threshold_days;
+        for h in &mut probe_hits {
+            h.stale = compute_stale(h.indexed_at, probe_now, probe_threshold);
+        }
+        if probe_hits.is_empty() {
+            return self.refuse_no_chunks(query, &opts, k_effective, started, None);
+        }
+        if probe_hits[0].retrieval.fusion_score < self.config.rag.score_gate {
+            return self.refuse_score_gate(
+                query,
+                &opts,
+                &probe_hits,
+                k_effective,
+                started,
+                None,
+            );
+        }
+
         let mut hops: Vec<HopRecord> = Vec::new();

         // ── 1. Decompose (iter 0) ──────────────────────────────────────────
@@ -647,7 +701,7 @@ impl RagPipeline {
         // is needed. The LLM emits new sub-queries (continue) or `[]`
         // (stop); the loop also breaks when `max_depth` or
         // `max_pool_chunks` cap fires (`forced_stop = true`).
-        let k_effective = opts.k.max(self.config.search.default_k);
+        // `k_effective` already computed at the probe step above.
         let max_depth = self.config.rag.multi_hop_max_depth;
         let max_pool = self.config.rag.multi_hop_max_pool_chunks as usize;
         let mut pool: Vec<SearchHit> = Vec::new();
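The test hunks that follow all make the same adjustment: one extra scripted result is prepended for the probe, and the expected retriever call count goes up by one. A rough sketch of why, assuming a scripted test double that pops one pre-recorded result per call; `ScriptedScores` and `run_flow` are hypothetical stand-ins, not the real `ScriptedRetriever` or pipeline:

```rust
use std::cell::{Cell, RefCell};
use std::collections::VecDeque;

/// Hypothetical stand-in for a scripted retriever: each `search` call pops
/// the next pre-recorded list of scores and bumps a call counter.
pub struct ScriptedScores {
    script: RefCell<VecDeque<Vec<f32>>>,
    calls: Cell<usize>,
}

impl ScriptedScores {
    pub fn new(script: Vec<Vec<f32>>) -> Self {
        Self {
            script: RefCell::new(script.into()),
            calls: Cell::new(0),
        }
    }

    pub fn search(&self) -> Vec<f32> {
        self.calls.set(self.calls.get() + 1);
        self.script
            .borrow_mut()
            .pop_front()
            .expect("retriever script exhausted")
    }

    pub fn calls(&self) -> usize {
        self.calls.get()
    }
}

/// Simplified flow: one probe retrieve, then one retrieve per sub-query.
/// Returns true when the flow got past the probe gate.
pub fn run_flow(retriever: &ScriptedScores, sub_queries: usize, score_gate: f32) -> bool {
    let probe = retriever.search();
    let top = probe.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    if probe.is_empty() || top < score_gate {
        return false; // refused before any sub-query retrieve
    }
    for _ in 0..sub_queries {
        let _hits = retriever.search();
    }
    true
}
```

A script of two entries with one sub-query yields two calls (probe + sub-query), while a below-gate probe stops after a single call. These are the counts the updated and new assertions check.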
@@ -58,7 +58,9 @@ fn multi_hop_decide_stop_triggers_synthesize() {
     let did = id32("d1");
     env.seed_chunk(&cid, &did, "notes/a.md", "Body text.", &["Intro"]);
     let hits = vec![mk_hit(1, &cid, &did, "notes/a.md", 0.85, &["Intro"])];
-    let retriever = Arc::new(ScriptedRetriever::new(vec![hits]));
+    // PR-7: ScriptedRetriever entry 0 = probe retrieve (pre-decompose
+    // score-gate), entry 1 = decompose-driven retrieve for "q1".
+    let retriever = Arc::new(ScriptedRetriever::new(vec![hits.clone(), hits]));
     let retriever_handle = retriever.clone();
     let retriever_dyn: Arc<dyn Retriever> = retriever;

@@ -84,8 +86,8 @@ fn multi_hop_decide_stop_triggers_synthesize() {
     );
     assert_eq!(
         retriever_handle.calls(),
-        1,
-        "single sub-query → single retrieval"
+        2,
+        "probe retrieve + 1 sub-query retrieve = 2"
     );

     let hops = answer.hops.expect("multi-hop happy path stamps Some(hops)");
@@ -115,9 +117,10 @@ fn multi_hop_decide_continue_adds_more_chunks() {
     let did2 = id32("d2");
     env.seed_chunk(&cid1, &did1, "notes/a.md", "Chunk one.", &["A"]);
     env.seed_chunk(&cid2, &did2, "notes/b.md", "Chunk two.", &["B"]);
-    // iter 1 retrieves chunk 1; iter 2 retrieves chunk 2 (different
-    // chunk_id → pool grows).
+    // PR-7: entry 0 = probe (above gate), entry 1 = iter 1 retrieves
+    // chunk 1, entry 2 = iter 2 retrieves chunk 2.
     let retriever = Arc::new(ScriptedRetriever::new(vec![
+        vec![mk_hit(1, &cid1, &did1, "notes/a.md", 0.85, &["A"])],
         vec![mk_hit(1, &cid1, &did1, "notes/a.md", 0.85, &["A"])],
         vec![mk_hit(1, &cid2, &did2, "notes/b.md", 0.80, &["B"])],
     ]));
@@ -146,8 +149,8 @@ fn multi_hop_decide_continue_adds_more_chunks() {
     );
     assert_eq!(
         retriever_handle.calls(),
-        2,
-        "iter 1 retrieves q1, iter 2 retrieves q2"
+        3,
+        "probe + iter 1 retrieves q1 + iter 2 retrieves q2"
     );
     assert_eq!(
         answer.retrieval.chunks_returned, 2,
@@ -187,7 +190,8 @@ fn multi_hop_max_depth_force_stops() {
     cfg.rag.multi_hop_max_depth = 1;

     let hits = vec![mk_hit(1, &cid, &did, "notes/a.md", 0.85, &["Intro"])];
-    let retriever = Arc::new(ScriptedRetriever::new(vec![hits]));
+    // PR-7: entry 0 = probe, entry 1 = decompose-driven retrieve.
+    let retriever = Arc::new(ScriptedRetriever::new(vec![hits.clone(), hits]));
     let retriever_handle = retriever.clone();
     let retriever_dyn: Arc<dyn Retriever> = retriever;

@@ -210,7 +214,7 @@ fn multi_hop_max_depth_force_stops() {
         2,
         "depth-cap skips decide → only decompose + synthesize"
     );
-    assert_eq!(retriever_handle.calls(), 1);
+    assert_eq!(retriever_handle.calls(), 2, "probe + 1 decompose retrieve");

     let hops = answer.hops.expect("happy path stamps hops");
     assert_eq!(hops.len(), 3, "[Decompose, Decide(forced_stop), Synthesize]");
@@ -238,9 +242,10 @@ fn multi_hop_pool_chunks_dedup_by_chunk_id() {
     let did = id32("d1");
     env.seed_chunk(&cid, &did, "notes/a.md", "Shared chunk text.", &["X"]);
     // Both sub-queries retrieve the same chunk_id — dedup must
-    // keep exactly one pool entry.
+    // keep exactly one pool entry. PR-7: entry 0 = probe.
     let shared_hit = mk_hit(1, &cid, &did, "notes/a.md", 0.85, &["X"]);
     let retriever = Arc::new(ScriptedRetriever::new(vec![
+        vec![shared_hit.clone()],
         vec![shared_hit.clone()],
         vec![shared_hit],
     ]));
@@ -262,8 +267,8 @@ fn multi_hop_pool_chunks_dedup_by_chunk_id() {
     assert!(answer.grounded);
     assert_eq!(
         retriever_handle.calls(),
-        2,
-        "two sub-queries → two retrieval calls"
+        3,
+        "probe + two sub-query retrieves"
     );
     assert_eq!(
         answer.retrieval.chunks_returned, 1,
@@ -295,7 +300,8 @@ fn multi_hop_decide_parse_failure_falls_through_to_synthesize() {
     let did = id32("d1");
     env.seed_chunk(&cid, &did, "notes/a.md", "Body text.", &["Intro"]);
     let hits = vec![mk_hit(1, &cid, &did, "notes/a.md", 0.85, &["Intro"])];
-    let retriever = Arc::new(ScriptedRetriever::new(vec![hits]));
+    // PR-7: entry 0 = probe, entry 1 = decompose-driven retrieve.
+    let retriever = Arc::new(ScriptedRetriever::new(vec![hits.clone(), hits]));
     let retriever_dyn: Arc<dyn Retriever> = retriever;

     // Decide LLM emits non-JSON garbage. Spec §9: this is NOT a
@@ -347,15 +353,25 @@ fn multi_hop_decide_parse_failure_falls_through_to_synthesize() {
 //
 // PR-3b-ii widens `refuse_no_chunks` to accept `hops:
 // Option<Vec<HopRecord>>` and wires `ask_multi_hop` to forward the
-// partial trace. This test pins that contract — a refusal Answer
-// still carries the decompose + decide hops the loop accumulated
-// before pool came up empty.
+// partial trace. PR-7 added a pre-decompose probe — so this test
+// now exercises the *decompose-driven* empty-pool path: probe
+// passes (KB has at least one relevant chunk), decompose emits
+// sub-queries, but the sub-query retrieve hits nothing → pool stays
+// empty → refuse_no_chunks with the partial hop trace preserved.
+// (For the *probe-driven* refusal, see
+// `multi_hop_empty_probe_pool_refuses_before_any_llm_call` —
+// that path returns hops=None because decompose never ran.)

 #[test]
 fn multi_hop_refuse_no_chunks_preserves_hops_trace() {
     let env = RagEnv::new();
-    // Retriever returns 0 hits — pool stays empty → refuse_no_chunks.
-    let retriever = Arc::new(ScriptedRetriever::new(vec![vec![]]));
+    let cid = id32("c1");
+    let did = id32("d1");
+    env.seed_chunk(&cid, &did, "notes/a.md", "Body text.", &["Intro"]);
+    let probe_hits = vec![mk_hit(1, &cid, &did, "notes/a.md", 0.85, &["Intro"])];
+    // PR-7: entry 0 = probe (passes gate), entry 1 = decompose-driven
+    // retrieve (empty — sub-query returned nothing).
+    let retriever = Arc::new(ScriptedRetriever::new(vec![probe_hits, vec![]]));
     let retriever_handle = retriever.clone();
     let retriever_dyn: Arc<dyn Retriever> = retriever;

@@ -373,7 +389,11 @@ fn multi_hop_refuse_no_chunks_preserves_hops_trace() {

     assert!(!answer.grounded);
     assert_eq!(answer.refusal_reason, Some(RefusalReason::NoChunks));
-    assert_eq!(retriever_handle.calls(), 1, "single sub-query → single retrieve");
+    assert_eq!(
+        retriever_handle.calls(),
+        2,
+        "probe (passes) + 1 decompose-driven retrieve (empty)"
+    );
     assert_eq!(lm_handle.calls(), 1, "decompose only — decide skipped (empty pool), no synthesize");

     let hops = answer
@@ -398,14 +418,25 @@ fn multi_hop_refuse_no_chunks_preserves_hops_trace() {

 #[test]
 fn multi_hop_refuse_score_gate_preserves_hops_trace() {
+    // PR-7 narrowed this path: with the pre-decompose probe gate,
+    // the *probe* must pass (high-score chunk) for decompose to
+    // run at all. The *decompose-driven* retrieve can then return
+    // a below-gate hit that triggers the post-pool gate refusal —
+    // which is the surface that preserves hops.
+    //
+    // For the *probe-driven* gate refusal (single-pass-equivalent
+    // safety floor), see
+    // `multi_hop_below_probe_gate_refuses_before_any_llm_call` —
+    // that returns hops=None because decompose never ran.
     let env = RagEnv::new();
-    let (cid, did) = seed_low_score_chunk(&env);
-    // Top score 0.10 is well below the default gate (0.30) — the
-    // score-gate refusal fires after the pool has been built, so
-    // the decide LLM call did run and the hop trace contains both
-    // Decompose and Decide entries.
-    let hits = vec![mk_hit(1, &cid, &did, "notes/low.md", 0.10, &["Low"])];
-    let retriever = Arc::new(ScriptedRetriever::new(vec![hits]));
+    let (low_cid, low_did) = seed_low_score_chunk(&env);
+    let high_cid = id32("c_high");
+    let high_did = id32("d_high");
+    env.seed_chunk(&high_cid, &high_did, "notes/high.md", "high score body", &["High"]);
+
+    let probe_hits = vec![mk_hit(1, &high_cid, &high_did, "notes/high.md", 0.85, &["High"])];
+    let decompose_hits = vec![mk_hit(1, &low_cid, &low_did, "notes/low.md", 0.10, &["Low"])];
+    let retriever = Arc::new(ScriptedRetriever::new(vec![probe_hits, decompose_hits]));
     let retriever_dyn: Arc<dyn Retriever> = retriever;

     // decompose + decide (pool not empty so decide fires) — synthesize
@@ -456,3 +487,134 @@ fn seed_low_score_chunk(env: &RagEnv) -> (String, String) {
     env.seed_chunk(&cid, &did, "notes/low.md", "low score text", &["Low"]);
     (cid, did)
 }
+
+// ── p9-fb-41 v0.18 dogfood fix: pre-decompose score-gate probe ────────────
+//
+// Out-of-corpus query that single-pass would have refused via
+// score-gate must also refuse on the multi-hop path — *before* any
+// decompose / decide / synthesize LLM call. Otherwise the decompose
+// can emit sub-queries that pull in chunks loosely matching each
+// sub-query, fill the pool past the gate, and let the synthesize
+// hallucinate over chunks that were never relevant to the *original*
+// query. Dogfood S7 (`/build/cache/dogfood-v018/results/SUMMARY.md`)
+// is the symptom; these tests pin the fix.
+
+#[test]
+fn multi_hop_below_probe_gate_refuses_before_any_llm_call() {
+    let env = RagEnv::new();
+    let cid = id32("c_low");
+    let did = id32("d_low");
+    env.seed_chunk(&cid, &did, "notes/low.md", "low score body", &["Low"]);
+    // Single hit far below the default 0.30 gate.
+    let hits = vec![mk_hit(1, &cid, &did, "notes/low.md", 0.05, &["Low"])];
+    let retriever = Arc::new(ScriptedRetriever::new(vec![hits]));
+    let retriever_handle = retriever.clone();
+    let retriever_dyn: Arc<dyn Retriever> = retriever;
+
+    // Empty LM script — ANY LLM call panics on exhaustion. The fix
+    // must short-circuit before decompose.
+    let lm = Arc::new(ScriptedLm::new(vec![]));
+    let lm_handle = lm.clone();
+    let lm_dyn: Arc<dyn LanguageModel> = lm;
+    let pipeline =
+        RagPipeline::new(env.config.clone(), retriever_dyn, lm_dyn, env.sqlite.clone());
+
+    let answer = pipeline.ask("out-of-corpus query", multi_hop_opts()).unwrap();
+
+    assert!(!answer.grounded);
+    assert_eq!(answer.refusal_reason, Some(RefusalReason::ScoreGate));
+    assert_eq!(
+        lm_handle.calls(),
+        0,
+        "below-gate must short-circuit BEFORE any LLM call (no decompose, decide, or synthesize)"
+    );
+    assert_eq!(
+        retriever_handle.calls(),
+        1,
+        "only the probe retrieve happened — no decompose-driven retrieves"
+    );
+    // S7 dogfood: in the pre-fix world the multi-hop path would have
+    // returned grounded=true with hallucinated content. This test
+    // pins the safe envelope.
+    assert!(
+        answer.hops.is_none(),
+        "pre-decompose refusal carries no hop trace (decompose never ran)"
+    );
+}
+
+#[test]
+fn multi_hop_empty_probe_pool_refuses_before_any_llm_call() {
+    let env = RagEnv::new();
+    // Retriever returns 0 hits — probe is empty.
+    let retriever = Arc::new(ScriptedRetriever::new(vec![vec![]]));
+    let retriever_handle = retriever.clone();
+    let retriever_dyn: Arc<dyn Retriever> = retriever;
+
+    let lm = Arc::new(ScriptedLm::new(vec![]));
+    let lm_handle = lm.clone();
+    let lm_dyn: Arc<dyn LanguageModel> = lm;
+    let pipeline =
+        RagPipeline::new(env.config.clone(), retriever_dyn, lm_dyn, env.sqlite.clone());
+
+    let answer = pipeline.ask("q", multi_hop_opts()).unwrap();
+
+    assert!(!answer.grounded);
+    assert_eq!(answer.refusal_reason, Some(RefusalReason::NoChunks));
+    assert_eq!(
+        lm_handle.calls(),
+        0,
+        "empty probe must short-circuit BEFORE any LLM call"
+    );
+    assert_eq!(
+        retriever_handle.calls(),
+        1,
+        "only the probe retrieve happened — no decompose retrieves"
+    );
+    assert!(answer.hops.is_none());
+}
+
+#[test]
+fn multi_hop_above_probe_gate_proceeds_to_decompose() {
+    // Sanity counterpart: a query that PASSES the probe gate still
+    // exercises the full multi-hop flow (decompose → decide → synth).
+    // Guards against the fix accidentally short-circuiting valid
+    // multi-hop calls.
+    let env = RagEnv::new();
+    let cid = id32("c1");
+    let did = id32("d1");
+    env.seed_chunk(&cid, &did, "notes/a.md", "Body text.", &["Intro"]);
+    // Probe retrieve returns a high-score hit (above gate),
+    // decompose-driven retrieve returns the same chunk again.
+    let probe_hits = vec![mk_hit(1, &cid, &did, "notes/a.md", 0.85, &["Intro"])];
+    let decompose_hits = vec![mk_hit(1, &cid, &did, "notes/a.md", 0.85, &["Intro"])];
+    let retriever = Arc::new(ScriptedRetriever::new(vec![probe_hits, decompose_hits]));
+    let retriever_handle = retriever.clone();
+    let retriever_dyn: Arc<dyn Retriever> = retriever;
+
+    let lm = Arc::new(ScriptedLm::new(vec![
+        r#"["q1"]"#,
+        r#"[]"#,
+        "answer [#1]",
+    ]));
+    let lm_handle = lm.clone();
+    let lm_dyn: Arc<dyn LanguageModel> = lm;
+    let pipeline =
+        RagPipeline::new(env.config.clone(), retriever_dyn, lm_dyn, env.sqlite.clone());
+
+    let answer = pipeline.ask("valid query", multi_hop_opts()).unwrap();
+
+    assert!(answer.grounded);
+    assert_eq!(answer.refusal_reason, None);
+    assert_eq!(
+        lm_handle.calls(),
+        3,
+        "decompose + decide + synthesize all ran"
+    );
+    assert_eq!(
+        retriever_handle.calls(),
+        2,
+        "probe retrieve + decompose-driven retrieve"
+    );
+    let hops = answer.hops.expect("happy path stamps hops");
+    assert_eq!(hops.len(), 3);
+}
@@ -14,6 +14,58 @@ historical contract that was implemented; this file accumulates the
 deltas so phase 5+ readers can find the live behavior without diffing
 git history.

+## 2026-05-25 - fb-41 pre-v0.18 dogfood: multi-hop score-gate bypass (S7 hallucination regression pin)
+
+A **score_gate bypass + hallucination case** found while dogfooding fb-41 multi-hop RAG before the v0.18.0 cut (`/build/cache/dogfood-v018/`, a 33 assets / 205 chunks corpus: 16 new markdown files in 5 clusters + v017 carryover, gemma3:4b CPU only / 16 GB RAM).
+
+### Symptom (S7)
+
+Query: `"What is the chemical formula of caffeine?"` (a fact that is not in the KB).
+- **Single-pass** `kebab ask`: the top retrieve score is below the default `rag.score_gate = 0.30` → `refuse_score_gate` → safe refusal.
+- **Multi-hop** `kebab ask --multi-hop`: `grounded = true`, with the body `"카페인의 화학식은 C₉H₁₅N₃O 입니다 [#6]"` ("The chemical formula of caffeine is C₉H₁₅N₃O [#6]"), a **hallucination** (the actual formula is C₈H₁₀N₄O₂). In addition, `[#6]` cites the body of the *Adam optimizer chunk*, `g_t = ∂L/∂θ_i`, triggered by a match on short structured tokens that visually resemble a chemical formula.
+
+### Root cause
+
+The score-gate check in `ask_multi_hop` (`crates/kebab-rag/src/pipeline.rs`) looks only at the *pool's top_score*. The multi-hop pool is the *union of 5 sub-queries*: when one sub-query's top score is above the gate, the pool passes the gate even if its other chunks are unrelated to the original query, the synthesize stage runs, and the LLM hallucinates over those chunks.
+
+For the same query, single-pass checks the *top retrieve score of the original query* and rejects. Only multi-hop has the hole.
+
+### Fix (PR-7)
+
+Add a **pre-decompose probe** at the entry of `ask_multi_hop`:
+1. Retrieve once with the *original query* (0 LLM calls, ~ms).
+2. Probe empty → `refuse_no_chunks(None)` (no decompose, and no hop trace).
+3. Probe top_score < gate → `refuse_score_gate(None)` (no decompose).
+4. Probe passes → the existing decompose / decide / synthesize flow, unchanged.
+
+Multi-hop's safety floor now matches single-pass exactly: multi-hop adds cross-doc reasoning only when the *original query is already within KB scope*.
+
+### Test updates
+
+- Three new regression pins (`crates/kebab-rag/tests/multi_hop.rs`):
+  - `multi_hop_below_probe_gate_refuses_before_any_llm_call`: the direct S7 regression pin. Low-score chunk + empty LM script → score_gate refusal, 0 LM calls, hops=None.
+  - `multi_hop_empty_probe_pool_refuses_before_any_llm_call`: on an empty retrieve, NoChunks refusal with 0 LM calls.
+  - `multi_hop_above_probe_gate_proceeds_to_decompose`: when the probe passes, the full multi-hop flow (decompose + decide + synth) works normally.
+- The 7 existing multi-hop tests prepend a probe entry to their `ScriptedRetriever` and raise the `retriever_handle.calls()` expectation by 1.
+- `multi_hop_refuse_no_chunks_preserves_hops_trace` / `multi_hop_refuse_score_gate_preserves_hops_trace` are narrowed in meaning: they now verify only the *decompose-driven* refusal (after the probe passes, the sub-query retrieve is empty or below-gate). The *probe-driven* refusal has hops=None (decompose never runs), and the new tests pin that path.
+
+### Other dogfood findings (separate fixes deferred)
+
+Other items found in the same dogfood run. They are outside the scope of the PR-7 fix and are targeted at v0.18.1 or a follow-up PR:
+
+- **Inconsistent synthesize citation markers** (S1/S2/S3, P1): with the large prompt of a 30-chunk pool, gemma3:4b loses the `[#N]` citation rule → the answer body is fine but is surfaced as `grounded=false (LlmSelfJudge)`. **Recommended mitigation**: strengthen the citation rule in `MULTI_HOP_SYNTHESIZE_SYSTEM_PROMPT`, or lower the `multi_hop_max_pool_chunks` default from 30 to 15.
+- **20-25× latency cost**: single-pass 30s vs multi-hop 590-685s (the synthesize stage dominates the cost). This is larger than the spec's "2-5× LLM cost". Reducing the pool size is the mitigation.
+- **Release binary path confusion**: `/home/altair823/kebab/target/release/kebab` (v0.17.1, stale) vs `/build/out/cargo-target/release/kebab` (v0.17.2, latest). This is an effect of the CARGO_TARGET_DIR env. A one-line note in the docs is recommended.
+
+### User impact
+
+After PR-7 merges (just before the v0.18.0 cut):
+- Multi-hop safety equals single-pass. Out-of-KB queries no longer receive hallucinated answers.
+- Normal multi-hop use cases (compound / cross-doc reasoning) are unaffected: when the probe passes, the existing flow runs unchanged.
+- No wire change (`Answer.hops` exposure unchanged, refusal_reason values unchanged).
+
+Cross-link: `/build/cache/dogfood-v018/results/SUMMARY.md` (full dogfood report), spec `docs/superpowers/specs/2026-05-25-p9-fb-41-multi-hop-rag-design.md`.
+
 ## 2026-05-25 - v0.17.0 post-dogfood: `[models.llm] request_timeout_secs` knob + recommended model guide

 Found in follow-up dogfooding of v0.17.0: when a user tried the default `gemma4:e4b` (8B Q4, 9.6 GB) in a CPU only / 16 GB RAM environment, the first RAG answer always exceeded the 5-minute limit (hard-coded 300 s) and failed with `error: kb-rag: llm.generate_stream`. Memory was also squeezed, down to ollama RSS 10.7 GB / free 2 GB. For the results of the follow-up dogfood run (32 minutes / 199 mem-monitor samples), see this entry in `tasks/HOTFIXES.md` and the dogfood report in the conversation.