feat(rag): fb-41 PR-9c-2: pipeline integration + mock test + SKILL.md (★ NLI actually enabled)

Activates behavior on top of PR-9c-1's wire surface: the step 8.5 hook in `ask_multi_hop` now actually runs NLI verification when `[rag] nli_threshold > 0`. **First user-visible behavior change** in PR-9.

- crates/kebab-rag/src/pipeline.rs:
  - ask_multi_hop step 8.5 NLI hook (empty-answer guard + truncate_for_nli + verifier.score + verification field + refusal branches).
  - refuse_nli_verification helper (verification: Some(...) + RefusalReason::NliVerificationFailed).
  - refuse_nli_model_unavailable helper (verification: None + RefusalReason::NliModelUnavailable).
  - truncate_for_nli helper (module-level pub fn; chars-based budget of MAX_NLI_PREMISE_CHARS = 4 * 400 = 1600 chars; _hypothesis is an unused placeholder, a candidate for the v0.18.1 token-budget update).
  - Removed the two #[allow(dead_code)] attributes left by PR-9c-1 (verifier field + with_verifier builder; the transitional sentences in the doc comments are cleaned up too). Closes the N1 carry-forward from the round-1 PR-9c-1 review.
- crates/kebab-app/src/app.rs:
  - NliVerifier construction in App::open_with_config: when config.rag.nli_threshold > 0, OnnxNliVerifier::new + Arc::new wrap, then with_verifier is called when the RagPipeline is initialized later. Failures propagate via ? (the signature stays Result<Self>, so no cascading changes in callers).
  - Added the kebab-nli path dependency to kebab-app/Cargo.toml.
- crates/kebab-rag/tests/multi_hop.rs + tests/common/mod.rs:
  - MockNliVerifier (pass / fail / err constructors + score instrumented with call_count).
  - multi_hop_nli_pass_keeps_grounded — entailment 0.9 → grounded=true, verification.nli_passed=true.
  - multi_hop_nli_fail_refuses — entailment 0.1 → refusal=NliVerificationFailed.
  - multi_hop_nli_disabled_skip_verify — threshold 0.0 → verify skip, verification=None.
  - multi_hop_nli_model_unavailable_refuses — verifier Err → refusal=NliModelUnavailable.
  - multi_hop_truncate_for_nli_preserves_hypothesis: long premise truncation + hypothesis preserved.
- integrations/claude-code/kebab/SKILL.md: one paragraph of NLI guidance in the mcp__kebab__ask section (meaning of verification.nli_passed + threshold tuning + nli_verification_failed/nli_model_unavailable refusal handling).

Verification: cargo test --workspace -j 1: 5 new multi-hop tests pass + 0 regressions (the pre-existing kebab-mcp::tools_call_ask_multi_hop is flaky as before). cargo clippy --workspace --all-targets -j 1 -- -D warnings is clean.
Wire impact: *behavior wiring* for PR-9c-1's schema change: the answer.v1.verification field is now populated on both the multi-hop happy path and the refusal path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 00:55:02 +00:00
parent 681c48b2a3
commit 00ffe9c792
8 changed files with 539 additions and 29 deletions

Cargo.lock generated

@@ -4142,6 +4142,7 @@ dependencies = [
"kebab-embed-local",
"kebab-llm",
"kebab-llm-local",
"kebab-nli",
"kebab-normalize",
"kebab-parse-code",
"kebab-parse-image",


@@ -23,6 +23,11 @@ kebab-embed-local = { path = "../kebab-embed-local" }
kebab-llm = { path = "../kebab-llm" }
kebab-llm-local = { path = "../kebab-llm-local" }
kebab-rag = { path = "../kebab-rag" }
# p9-fb-41 PR-9c-2: facade construction of OnnxNliVerifier when
# `[rag] nli_threshold > 0`. Trait-only consumption via kebab-rag's
# `with_verifier`; no kebab-nli internals leak into kebab-app code
# beyond the construction site in `open_with_config`.
kebab-nli = { path = "../kebab-nli" }
# P6-4: image extractor + OCR + caption adapters live here. App
# threads them into the per-asset dispatch (see `ingest_one_asset`
# image branch). Trait-only consumption — no `kebab-parse-image`


@@ -133,6 +133,17 @@ pub struct App {
/// `corpus_revision` snapshot embedded in `SearchCacheKey`
/// invalidates every entry the moment a new ingest commit lands.
search_cache: Option<Mutex<LruCache<SearchCacheKey, Vec<SearchHit>>>>,
/// p9-fb-41 PR-9c-2: NLI verifier built eagerly at
/// `open_with_config` time when `config.rag.nli_threshold > 0`,
/// consumed by `RagPipeline::with_verifier` on every `ask` /
/// `ask_with_session` call. `None` when the gate is disabled
/// (default, threshold = 0) — multi-hop skips step 8.5 entirely
/// and single-pass never touches the verifier.
///
/// Built eagerly (not lazy) so the `open_with_config` `?`
/// propagation surfaces NLI model construction errors at App
/// boot time, before any user query runs.
pipeline_verifier: Option<Arc<dyn kebab_nli::NliVerifier>>,
}
/// p9-fb-19: cache key for `App::search`. Includes every field that
@@ -193,6 +204,21 @@ impl App {
// `None` (cache disabled — every search hits the retrievers).
let search_cache = NonZeroUsize::new(config.search.cache_capacity)
.map(|cap| Mutex::new(LruCache::new(cap)));
// p9-fb-41 PR-9c-2: build the NLI verifier when the gate is
// enabled. App carries it on `RagPipeline` via
// `with_verifier` so the rag crate doesn't have to know about
// kebab-nli construction. Failure (`?`) surfaces as a user-
// facing error at App boot — never a panic in the pipeline's
// `expect("verifier must be Some when nli_threshold > 0.0")`.
let pipeline_verifier: Option<Arc<dyn kebab_nli::NliVerifier>> =
if config.rag.nli_threshold > 0.0 {
let v = kebab_nli::OnnxNliVerifier::new(&config).context(
"kebab-app: construct OnnxNliVerifier (config.rag.nli_threshold > 0)",
)?;
Some(Arc::new(v))
} else {
None
};
Ok(Self {
config,
sqlite: Arc::new(sqlite),
@@ -200,6 +226,7 @@ impl App {
vector: OnceLock::new(),
llm: OnceLock::new(),
search_cache,
pipeline_verifier,
})
}
@@ -553,9 +580,26 @@ impl App {
pub fn ask(&self, query: &str, opts: AskOpts) -> Result<Answer> {
let retriever = self.build_retriever(opts.mode)?;
let llm = self.llm()?;
let pipeline = self.build_pipeline(retriever, llm);
pipeline.ask(query, opts)
}
/// p9-fb-41 PR-9c-2: shared pipeline builder used by [`Self::ask`]
/// and [`Self::ask_with_session`]. Attaches the App-built NLI
/// verifier (when `cfg.rag.nli_threshold > 0`) via
/// `RagPipeline::with_verifier`, keeping the construction site in
/// a single place so the two call paths can't drift.
fn build_pipeline(
&self,
retriever: Arc<dyn Retriever>,
llm: Arc<dyn LanguageModel>,
) -> RagPipeline {
let pipeline =
RagPipeline::new(self.config.clone(), retriever, llm, self.sqlite.clone());
pipeline.ask(query, opts)
match &self.pipeline_verifier {
Some(v) => pipeline.with_verifier(v.clone()),
None => pipeline,
}
}
/// p9-fb-18: shared retriever-stack builder used by [`Self::ask`]
@@ -660,10 +704,11 @@ impl App {
// p9-fb-18 R1: shared retriever builder removes the prior
// copy of `ask`'s 35-line stack — see [`Self::build_retriever`].
// p9-fb-41 PR-9c-2: shared `build_pipeline` attaches the NLI
// verifier when the gate is enabled.
let retriever = self.build_retriever(opts.mode)?;
let llm = self.llm()?;
let pipeline =
RagPipeline::new(self.config.clone(), retriever, llm, self.sqlite.clone());
let pipeline = self.build_pipeline(retriever, llm);
let answer = pipeline.ask_with_history(
query,
history,
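
The eager-construction pattern in `open_with_config` can be illustrated in isolation. The sketch below is only an illustration under stated assumptions: `Verifier`, `FixedVerifier`, `build_verifier` and the `model_available` flag are hypothetical stand-ins for the real `kebab_nli` types, and plain `String` errors replace `anyhow`.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for `kebab_nli::NliVerifier`.
pub trait Verifier {
    fn score(&self, premise: &str, hypothesis: &str) -> Result<f32, String>;
}

/// Hypothetical verifier whose constructor can fail, like a model load.
pub struct FixedVerifier(f32);

impl FixedVerifier {
    pub fn new(model_available: bool) -> Result<Self, String> {
        if model_available {
            Ok(FixedVerifier(0.9))
        } else {
            Err("model file missing".to_string())
        }
    }
}

impl Verifier for FixedVerifier {
    fn score(&self, _premise: &str, _hypothesis: &str) -> Result<f32, String> {
        Ok(self.0)
    }
}

/// Build the verifier only when the gate is enabled. A construction
/// error is returned to the caller at boot time instead of surfacing
/// later inside the pipeline.
pub fn build_verifier(
    threshold: f32,
    model_available: bool,
) -> Result<Option<Arc<dyn Verifier>>, String> {
    if threshold > 0.0 {
        let v = FixedVerifier::new(model_available)
            .map_err(|e| format!("construct verifier (nli_threshold > 0): {e}"))?;
        let v: Arc<dyn Verifier> = Arc::new(v);
        Ok(Some(v))
    } else {
        // Gate disabled: no model is loaded, so a missing model is not an error.
        Ok(None)
    }
}
```

With the gate disabled the constructor is never called, so a missing model only matters once `nli_threshold` is raised above zero.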


@@ -22,4 +22,4 @@ pub use kebab_core::{Answer, AnswerCitation, AnswerRetrievalSummary, RefusalReas
mod pipeline;
pub use pipeline::{AskOpts, RagPipeline, StreamEvent};
pub use pipeline::{AskOpts, MAX_NLI_PREMISE_CHARS, RagPipeline, StreamEvent, truncate_for_nli};


@@ -42,7 +42,7 @@ use kebab_core::{
Answer, AnswerCitation, AnswerRetrievalSummary, Citation, FinishReason,
GenerateRequest, HopKind, HopRecord, LanguageModel, ModelRef, RefusalReason,
Retriever, SearchFilters, SearchHit, SearchMode, SearchQuery, TokenChunk,
TokenUsage, TraceId, Turn,
TokenUsage, TraceId, Turn, VerificationSummary,
};
use kebab_core::versions::PromptTemplateVersion;
use kebab_store_sqlite::SqliteStore;
@@ -197,13 +197,11 @@ pub struct RagPipeline {
retriever: Arc<dyn Retriever>,
llm: Arc<dyn LanguageModel>,
docs: Arc<SqliteStore>,
/// p9-fb-41 PR-9c-1: optional NLI verifier injected via
/// [`Self::with_verifier`]. Not yet read — PR-9c-2 wires the
/// `ask_multi_hop` step 8.5 (post-synthesize gate) that consumes
/// it. Until then the field is `#[allow(dead_code)]`; the
/// attribute is removed in the PR-9c-2 commit that adds the
/// read site so leftover dead code can never sneak in.
#[allow(dead_code)]
/// p9-fb-41 PR-9c-1/PR-9c-2: optional NLI verifier injected via
/// [`Self::with_verifier`]. Consumed by `ask_multi_hop` step 8.5
/// (post-synthesize gate) when `cfg.rag.nli_threshold > 0`.
/// `None` when the gate is disabled — single-pass `ask` never
/// touches this field.
verifier: Option<Arc<dyn kebab_nli::NliVerifier>>,
}
@@ -231,17 +229,12 @@ impl RagPipeline {
}
}
/// p9-fb-41 PR-9c-1: inject the post-synthesize NLI verifier.
/// Caller (kebab-app facade, PR-9c-2) builds an
/// p9-fb-41 PR-9c-1/PR-9c-2: inject the post-synthesize NLI
/// verifier. Caller (kebab-app facade) builds an
/// `Arc<OnnxNliVerifier>` from `cfg.models.nli` when
/// `cfg.rag.nli_threshold > 0`, then chains
/// `RagPipeline::new(...).with_verifier(v)`.
///
/// Currently unused — PR-9c-2 wires the read site (step 8.5 of
/// `ask_multi_hop`). `#[allow(dead_code)]` survives only until
/// that PR's commit, which removes it together with adding the
/// hook that reads `self.verifier`.
#[allow(dead_code)]
/// `RagPipeline::new(...).with_verifier(v)`. Consumed by
/// `ask_multi_hop` step 8.5 (post-synthesize gate).
pub fn with_verifier(mut self, v: Arc<dyn kebab_nli::NliVerifier>) -> Self {
self.verifier = Some(v);
self
@@ -1031,6 +1024,48 @@ impl RagPipeline {
(false, Some(RefusalReason::LlmSelfJudge))
};
// ── 8.5 NLI groundedness verification (multi-hop only, v0.18) ─────
// spec §2.7: single-pass `ask` keeps the LlmSelfJudge gate as-is;
// NLI verification is multi-hop only this round.
//
// Empty answer guard: if synthesize bailed (stream abort / LM
// crash), `acc` is empty. That path has its own refusal
// (LlmStreamAborted) above; skipping the NLI gate here avoids
// tokenizing an empty hypothesis (degenerate CLS-SEP-SEP that
// would yield a near-uniform softmax and a misleading nli_passed).
let verification = if self.config.rag.nli_threshold > 0.0 && !acc.trim().is_empty() {
let v = self.verifier.as_ref().expect(
"verifier must be Some when nli_threshold > 0.0 \
(kebab-app's open_with_config enforces this invariant)",
);
let (truncated_premise, _was_truncated) = truncate_for_nli(&packed_text, &acc);
match v.score(&truncated_premise, &acc) {
Ok(scores) => {
let passed = scores.entailment >= self.config.rag.nli_threshold;
Some(VerificationSummary {
nli_score: scores.entailment,
nli_threshold: self.config.rag.nli_threshold,
nli_passed: passed,
})
}
Err(e) => {
tracing::warn!(
target: "kebab-rag",
error = %e,
"NLI verifier failed (model unavailable / inference err); refusing"
);
return self.refuse_nli_model_unavailable(query, &opts, hops, started);
}
}
} else {
None
};
if let Some(v) = &verification
&& !v.nli_passed
{
return self.refuse_nli_verification(query, &opts, hops, *v, started);
}
// ── 8. Build Answer ────────────────────────────────────────────────
let cited_set: std::collections::BTreeSet<u32> = extracted.iter().copied().collect();
let citations: Vec<AnswerCitation> = packed_entries
@@ -1101,11 +1136,10 @@ impl RagPipeline {
// currently lose the trace (cleanup deferred — would
// require widening helper signatures, PR-3b-ii / follow-up).
hops: Some(hops),
// p9-fb-41 PR-9c-1: surface-only field — PR-9c-2 wires
// step 8.5 between citation-validate and Answer-build to
// stamp this with the actual NLI score when
// `cfg.rag.nli_threshold > 0`. Until then, stays None.
verification: None,
// p9-fb-41 PR-9c-2: step 8.5 stamped this when
// `cfg.rag.nli_threshold > 0`. None when the gate is
// disabled (default).
verification,
};
tracing::debug!(
@@ -1554,6 +1588,137 @@ impl RagPipeline {
}
Ok(answer)
}
/// p9-fb-41 PR-9c-2: refusal path for step 8.5 NLI gate failure —
/// `RefusalReason::NliVerificationFailed`. The synthesized answer
/// existed (acc was non-empty) but the entailment score fell below
/// `cfg.rag.nli_threshold`. We stamp the `VerificationSummary` on
/// the wire so the user can see what score was rejected.
fn refuse_nli_verification(
&self,
query: &str,
opts: &AskOpts,
hops: Vec<HopRecord>,
v: VerificationSummary,
started: std::time::Instant,
) -> Result<Answer> {
let elapsed_ms = u32::try_from(started.elapsed().as_millis()).unwrap_or(u32::MAX);
let trace_id = mint_trace_id(query, 0.0, &self.llm.model_ref().id);
let k_effective = opts.k.max(self.config.search.default_k);
let answer = Answer {
answer: "근거 부족. 생성된 답변이 검색된 문서 내용에 충분히 entail 되지 않음."
.to_string(),
citations: Vec::new(),
grounded: false,
refusal_reason: Some(RefusalReason::NliVerificationFailed),
model: self.llm.model_ref(),
embedding: embedding_ref_for(opts.mode, &self.config),
prompt_template_version: PromptTemplateVersion(
PROMPT_TEMPLATE_VERSION_MULTI_HOP.to_string(),
),
retrieval: AnswerRetrievalSummary {
trace_id,
mode: opts.mode,
k: k_effective,
score_gate: self.config.rag.score_gate,
top_score: 0.0,
chunks_returned: 0,
chunks_used: 0,
},
usage: TokenUsage {
prompt_tokens: 0,
completion_tokens: 0,
latency_ms: elapsed_ms,
},
created_at: OffsetDateTime::now_utc(),
conversation_id: opts.conversation_id.clone(),
turn_index: opts.turn_index,
// PR-9c-2: NLI refusal still carries the hop trace built
// up to step 8.5 — synthesize ran, so the trace is the
// full decompose+decide chain (terminal Synthesize hop is
// NOT appended for the refusal path; cleanup deferred to
// a follow-up if the user-visible trace shape needs the
// synthesize entry).
hops: Some(hops),
verification: Some(v),
};
if let Some(sink) = &opts.stream_sink {
let _ = sink.send(StreamEvent::Final {
answer: answer.clone(),
});
}
if let Err(e) = self.docs.put_answer(&answer, query, None) {
tracing::warn!(
target: "kebab-rag",
error = %e,
"kb-rag: put_answer (NliVerificationFailed) failed"
);
}
Ok(answer)
}
/// p9-fb-41 PR-9c-2: refusal path for step 8.5 NLI model
/// unavailable — `RefusalReason::NliModelUnavailable`. The verifier
/// raised an inference/download error so we cannot summarize the
/// verification result; `verification` is `None`. Treat as a soft
/// refusal — the user can opt out by setting `[rag] nli_threshold
/// = 0` and retrying.
fn refuse_nli_model_unavailable(
&self,
query: &str,
opts: &AskOpts,
hops: Vec<HopRecord>,
started: std::time::Instant,
) -> Result<Answer> {
let elapsed_ms = u32::try_from(started.elapsed().as_millis()).unwrap_or(u32::MAX);
let trace_id = mint_trace_id(query, 0.0, &self.llm.model_ref().id);
let k_effective = opts.k.max(self.config.search.default_k);
let answer = Answer {
answer: "근거 부족. NLI 검증 모델을 사용할 수 없음 — `[rag] nli_threshold = 0` 으로 비활성화 후 재시도 가능."
.to_string(),
citations: Vec::new(),
grounded: false,
refusal_reason: Some(RefusalReason::NliModelUnavailable),
model: self.llm.model_ref(),
embedding: embedding_ref_for(opts.mode, &self.config),
prompt_template_version: PromptTemplateVersion(
PROMPT_TEMPLATE_VERSION_MULTI_HOP.to_string(),
),
retrieval: AnswerRetrievalSummary {
trace_id,
mode: opts.mode,
k: k_effective,
score_gate: self.config.rag.score_gate,
top_score: 0.0,
chunks_returned: 0,
chunks_used: 0,
},
usage: TokenUsage {
prompt_tokens: 0,
completion_tokens: 0,
latency_ms: elapsed_ms,
},
created_at: OffsetDateTime::now_utc(),
conversation_id: opts.conversation_id.clone(),
turn_index: opts.turn_index,
hops: Some(hops),
// No VerificationSummary — verification didn't happen.
verification: None,
};
if let Some(sink) = &opts.stream_sink {
let _ = sink.send(StreamEvent::Final {
answer: answer.clone(),
});
}
if let Err(e) = self.docs.put_answer(&answer, query, None) {
tracing::warn!(
target: "kebab-rag",
error = %e,
"kb-rag: put_answer (NliModelUnavailable) failed"
);
}
Ok(answer)
}
}
// ── Helpers ────────────────────────────────────────────────────────────────
@@ -1623,6 +1788,35 @@ pub(crate) const PROMPT_TEMPLATE_VERSION_MULTI_HOP: &str = "rag-multi-hop-v1";
/// the same PR.
pub(crate) const MULTI_HOP_MAX_SUB_QUERIES_HARD_CAP: usize = 10;
/// p9-fb-41 PR-9c-2: premise budget for NLI input. mDeBERTa-v3's
/// positional embedding caps at 512 tokens; with the hypothesis +
/// special-token budget reserved (~32 chars conservative), the
/// premise gets ≈1600 chars at 4 chars/token (English BPE baseline).
/// Korean SentencePiece is denser (1-2 char/token) — the tokenizer's
/// `OnlyFirst` strategy (configured in kebab-nli) is the backup
/// truncation when the char-based budget still overflows the token
/// limit. v0.18.1 candidate: token-count-based budget once we have
/// measured KR truncation rates from dogfood retest.
pub const MAX_NLI_PREMISE_CHARS: usize = 4 * 400;
/// p9-fb-41 PR-9c-2: truncate `premise` to fit the NLI input budget
/// while preserving `hypothesis` in full. Returns `(truncated_premise,
/// was_truncated)`. `was_truncated` is informational for tracing —
/// the v0.18 wire doesn't surface it; a v0.19+ extension might.
///
/// `_hypothesis` is currently unused — placeholder for the v0.18.1
/// token-budget version that would carve the budget *around* the
/// hypothesis. Kept on the signature to preserve the contract from
/// spec §2.2.3 / spec §3 PR-9c-2.
pub fn truncate_for_nli(premise: &str, _hypothesis: &str) -> (String, bool) {
if premise.chars().count() <= MAX_NLI_PREMISE_CHARS {
(premise.to_string(), false)
} else {
let truncated: String = premise.chars().take(MAX_NLI_PREMISE_CHARS).collect();
(truncated, true)
}
}
const MULTI_HOP_DECOMPOSE_SYSTEM_PROMPT: &str = "당신은 사용자의 질문을 다단계 검색에 필요한 sub-question 들로 분해하는 도구다.\n- multi-hop 정보가 필요한 경우 독립적으로 검색 가능한 sub-question 들로 분해한다.\n- 각 sub-question 은 자기 자신만으로 의미가 통해야 한다 (대명사 / \"위 답변\" 같은 reference 금지).\n- 원본이 이미 단순하면 원본 그대로 1 개만 반환한다.\n- 응답은 JSON array of strings 만 출력한다. 다른 prose / markdown fence / 설명 금지.";
const MULTI_HOP_DECIDE_SYSTEM_PROMPT: &str = "당신은 multi-hop 검색의 매 iter 에서 \"추가 retrieval 이 필요한가?\" 를 판단하는 도구다.\n- 지금까지 모은 [근거] 가 [원본 질문] 의 모든 측면을 cover 하는지 평가한다.\n- 추가가 필요하면 새 sub-question 들 (이미 모은 정보로 답할 수 없는 부분만, 독립적으로 검색 가능한 형태로) 을 JSON array of strings 로 반환한다.\n- 충분하면 빈 array `[]` 를 반환한다.\n- 응답은 JSON array of strings 만 출력한다. 다른 prose / markdown fence / 설명 금지.\n- 각 sub-question 은 자기 자신만으로 의미가 통해야 한다 (대명사 / \"위 답변\" 같은 reference 금지).";
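
The branching of step 8.5 can be summarized as a pure function. This is a minimal sketch, not the pipeline code: `GateOutcome` and `nli_gate` are hypothetical names, and the verifier is modeled as a closure returning an entailment score or an error string.

```rust
/// Hypothetical mirror of the outcomes of step 8.5.
#[derive(Debug, PartialEq)]
pub enum GateOutcome {
    /// Gate disabled (threshold not above 0) or empty answer: no verification.
    Skipped,
    /// Entailment score at or above the threshold.
    Passed { score: f32 },
    /// Entailment score below the threshold (maps to NliVerificationFailed).
    Failed { score: f32 },
    /// Verifier returned an error (maps to NliModelUnavailable).
    ModelUnavailable,
}

pub fn nli_gate<F>(threshold: f32, answer: &str, premise: &str, score: F) -> GateOutcome
where
    F: Fn(&str, &str) -> Result<f32, String>,
{
    // Same guard as the hook: run only when the threshold is above zero
    // and the synthesized answer is not blank.
    if !(threshold > 0.0) || answer.trim().is_empty() {
        return GateOutcome::Skipped;
    }
    match score(premise, answer) {
        Ok(s) if s >= threshold => GateOutcome::Passed { score: s },
        Ok(s) => GateOutcome::Failed { score: s },
        Err(_) => GateOutcome::ModelUnavailable,
    }
}
```

A score exactly equal to the threshold passes, matching the `>=` comparison in the hook.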


@@ -12,13 +12,14 @@
#![allow(dead_code)]
use std::sync::Arc;
use std::sync::{Arc, Mutex};
use kebab_config::Config;
use kebab_core::{
ChunkerVersion, ChunkId, Citation, DocumentId, IndexVersion, RetrievalDetail,
Retriever, SearchHit, SearchMode, SearchQuery, WorkspacePath,
};
use kebab_nli::{NliScores, NliVerifier};
use kebab_store_sqlite::SqliteStore;
use rusqlite::params;
use tempfile::TempDir;
@@ -384,3 +385,73 @@ impl kebab_core::LanguageModel for ScriptedLm {
Ok(Box::new(chunks.into_iter().map(Ok)))
}
}
/// p9-fb-41 PR-9c-2: mock NLI verifier for multi-hop step 8.5 tests.
/// Three constructors mirror the test scenarios:
/// - [`MockNliVerifier::pass`] — high entailment score (0.9), `nli_passed`
/// is true at the production default threshold (0.5).
/// - [`MockNliVerifier::fail`] — low entailment (0.1), refuses at any
/// threshold > 0.1.
/// - [`MockNliVerifier::err`] — returns an `anyhow::Error` so the pipeline
/// surfaces `RefusalReason::NliModelUnavailable`.
///
/// `call_count` instrumented (Mutex-wrapped usize) so a test can assert
/// the verifier ran the expected number of times — useful for pinning
/// the "threshold = 0 skips verify" invariant when the verifier is
/// nonetheless attached to the pipeline.
pub struct MockNliVerifier {
pub mode: MockMode,
pub call_count: Mutex<usize>,
}
pub enum MockMode {
/// Return these scores. Used by pass / fail variants.
Scores(NliScores),
/// Return an `anyhow::Error`. Used by the err variant.
Err(String),
}
impl MockNliVerifier {
pub fn pass() -> Arc<Self> {
Arc::new(Self {
mode: MockMode::Scores(NliScores {
entailment: 0.9,
neutral: 0.07,
contradiction: 0.03,
}),
call_count: Mutex::new(0),
})
}
pub fn fail() -> Arc<Self> {
Arc::new(Self {
mode: MockMode::Scores(NliScores {
entailment: 0.1,
neutral: 0.4,
contradiction: 0.5,
}),
call_count: Mutex::new(0),
})
}
pub fn err() -> Arc<Self> {
Arc::new(Self {
mode: MockMode::Err("mock NLI unavailable".into()),
call_count: Mutex::new(0),
})
}
pub fn calls(&self) -> usize {
*self.call_count.lock().unwrap()
}
}
impl NliVerifier for MockNliVerifier {
fn score(&self, _premise: &str, _hypothesis: &str) -> anyhow::Result<NliScores> {
*self.call_count.lock().unwrap() += 1;
match &self.mode {
MockMode::Scores(s) => Ok(*s),
MockMode::Err(e) => anyhow::bail!("{e}"),
}
}
}
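
As a possible alternative to the `Mutex<usize>` counter, a call counter can also be kept in an `AtomicUsize`, which avoids the lock and the `unwrap()` on a poisoned mutex. The sketch below is an assumption-laden illustration: `CountingVerifier` is a hypothetical type that does not implement the real `NliVerifier` trait and returns a bare `f32`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical mock that returns a fixed entailment score and counts calls.
pub struct CountingVerifier {
    entailment: f32,
    calls: AtomicUsize,
}

impl CountingVerifier {
    pub fn new(entailment: f32) -> Self {
        Self {
            entailment,
            calls: AtomicUsize::new(0),
        }
    }

    /// Record the call, then return the configured score.
    pub fn score(&self, _premise: &str, _hypothesis: &str) -> f32 {
        // Relaxed ordering suffices here: the counter is a standalone
        // tally and does not guard any other memory.
        self.calls.fetch_add(1, Ordering::Relaxed);
        self.entailment
    }

    pub fn calls(&self) -> usize {
        self.calls.load(Ordering::Relaxed)
    }
}
```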


@@ -26,9 +26,10 @@ mod common;
use std::sync::Arc;
use common::{RagEnv, ScriptedLm, ScriptedRetriever, id32, mk_hit};
use common::{MockNliVerifier, RagEnv, ScriptedLm, ScriptedRetriever, id32, mk_hit};
use kebab_core::{HopKind, LanguageModel, RefusalReason, Retriever, SearchMode};
use kebab_rag::{AskOpts, RagPipeline};
use kebab_nli::NliVerifier;
use kebab_rag::{AskOpts, RagPipeline, truncate_for_nli};
/// Default `AskOpts` for multi-hop tests: deterministic seed,
/// lexical mode (so the test crate doesn't need to wire up an
@@ -618,3 +619,195 @@ fn multi_hop_above_probe_gate_proceeds_to_decompose() {
let hops = answer.hops.expect("happy path stamps hops");
assert_eq!(hops.len(), 3);
}
// ── p9-fb-41 PR-9c-2: step 8.5 NLI verification tests ──────────────────────
//
// Five tests pin the NLI hook on the multi-hop path:
// 1. `multi_hop_nli_pass_keeps_grounded` — entailment 0.9 ≥ threshold 0.5 →
// happy path, `verification.nli_passed = true`.
// 2. `multi_hop_nli_fail_refuses` — entailment 0.1 < threshold 0.5 →
// refusal with `RefusalReason::NliVerificationFailed` + verification stamp.
// 3. `multi_hop_nli_disabled_skip_verify` — threshold 0.0 → verify skipped,
// `Answer.verification` stays `None` (no verifier attached).
// 4. `multi_hop_nli_model_unavailable_refuses` — verifier returns `Err` →
// refusal with `RefusalReason::NliModelUnavailable` + `verification = None`.
// 5. `multi_hop_truncate_for_nli_preserves_hypothesis` — pure unit test on
// `truncate_for_nli`'s char-budget contract.
/// Helper to build a "valid multi-hop happy-path" scenario where probe +
/// decompose retrieves the same single chunk, decompose emits one
/// sub-query, decide signals stop, and synthesize produces a cited
/// answer. Returns the seeded `RagEnv`, scripted retriever (so the
/// test can assert call count), and scripted LM with the 3-call
/// script ready.
fn happy_multi_hop_env() -> (RagEnv, Arc<ScriptedRetriever>, Arc<ScriptedLm>) {
let env = RagEnv::new();
let cid = id32("c1");
let did = id32("d1");
env.seed_chunk(&cid, &did, "notes/a.md", "Body text.", &["Intro"]);
let hits = vec![mk_hit(1, &cid, &did, "notes/a.md", 0.85, &["Intro"])];
let retriever = Arc::new(ScriptedRetriever::new(vec![hits.clone(), hits]));
let lm = Arc::new(ScriptedLm::new(vec![
r#"["q1"]"#,
r#"[]"#,
"answer body [#1]",
]));
(env, retriever, lm)
}
#[test]
fn multi_hop_nli_pass_keeps_grounded() {
let (env, retriever, lm) = happy_multi_hop_env();
let mut cfg = env.config.clone();
cfg.rag.nli_threshold = 0.5;
let retriever_dyn: Arc<dyn Retriever> = retriever;
let lm_dyn: Arc<dyn LanguageModel> = lm;
let verifier = MockNliVerifier::pass();
let verifier_handle = verifier.clone();
let verifier_dyn: Arc<dyn NliVerifier> = verifier;
let pipeline =
RagPipeline::new(cfg, retriever_dyn, lm_dyn, env.sqlite.clone())
.with_verifier(verifier_dyn);
let answer = pipeline.ask("compound", multi_hop_opts()).unwrap();
assert!(answer.grounded, "NLI-pass synthesize must stay grounded");
assert_eq!(answer.refusal_reason, None);
assert_eq!(
verifier_handle.calls(),
1,
"verifier called exactly once on the synthesized answer"
);
let v = answer
.verification
.expect("nli_threshold > 0 stamps Some(verification)");
assert!(v.nli_passed, "entailment 0.9 ≥ threshold 0.5");
assert!((v.nli_score - 0.9).abs() < 1e-5, "got: {}", v.nli_score);
assert!((v.nli_threshold - 0.5).abs() < 1e-5);
}
#[test]
fn multi_hop_nli_fail_refuses() {
let (env, retriever, lm) = happy_multi_hop_env();
let mut cfg = env.config.clone();
cfg.rag.nli_threshold = 0.5;
let retriever_dyn: Arc<dyn Retriever> = retriever;
let lm_dyn: Arc<dyn LanguageModel> = lm;
let verifier = MockNliVerifier::fail();
let verifier_handle = verifier.clone();
let verifier_dyn: Arc<dyn NliVerifier> = verifier;
let pipeline =
RagPipeline::new(cfg, retriever_dyn, lm_dyn, env.sqlite.clone())
.with_verifier(verifier_dyn);
let answer = pipeline.ask("compound", multi_hop_opts()).unwrap();
assert!(!answer.grounded);
assert_eq!(
answer.refusal_reason,
Some(RefusalReason::NliVerificationFailed)
);
assert_eq!(verifier_handle.calls(), 1);
let v = answer
.verification
.expect("refusal still stamps verification summary");
assert!(!v.nli_passed, "entailment 0.1 < threshold 0.5");
assert!((v.nli_score - 0.1).abs() < 1e-5, "got: {}", v.nli_score);
}
#[test]
fn multi_hop_nli_disabled_skip_verify() {
let (env, retriever, lm) = happy_multi_hop_env();
// Default config keeps `nli_threshold = 0.0` — gate disabled. No
// verifier is attached to the pipeline; the hook short-circuits
// entirely (`Answer.verification` stays `None`).
let cfg = env.config.clone();
assert!(
(cfg.rag.nli_threshold - 0.0).abs() < f32::EPSILON,
"default nli_threshold must be 0.0 (gate disabled)"
);
let retriever_dyn: Arc<dyn Retriever> = retriever;
let lm_dyn: Arc<dyn LanguageModel> = lm;
// No `with_verifier` call — pipeline.verifier stays None.
let pipeline =
RagPipeline::new(cfg, retriever_dyn, lm_dyn, env.sqlite.clone());
let answer = pipeline.ask("compound", multi_hop_opts()).unwrap();
assert!(answer.grounded);
assert_eq!(answer.refusal_reason, None);
assert!(
answer.verification.is_none(),
"threshold = 0.0 must skip step 8.5 and leave verification = None"
);
}
#[test]
fn multi_hop_nli_model_unavailable_refuses() {
let (env, retriever, lm) = happy_multi_hop_env();
let mut cfg = env.config.clone();
cfg.rag.nli_threshold = 0.5;
let retriever_dyn: Arc<dyn Retriever> = retriever;
let lm_dyn: Arc<dyn LanguageModel> = lm;
let verifier = MockNliVerifier::err();
let verifier_handle = verifier.clone();
let verifier_dyn: Arc<dyn NliVerifier> = verifier;
let pipeline =
RagPipeline::new(cfg, retriever_dyn, lm_dyn, env.sqlite.clone())
.with_verifier(verifier_dyn);
let answer = pipeline.ask("compound", multi_hop_opts()).unwrap();
assert!(!answer.grounded);
assert_eq!(
answer.refusal_reason,
Some(RefusalReason::NliModelUnavailable)
);
assert_eq!(verifier_handle.calls(), 1, "verifier was invoked once before failing");
assert!(
answer.verification.is_none(),
"NliModelUnavailable: can't summarize a verification that didn't happen"
);
}
#[test]
fn multi_hop_truncate_for_nli_preserves_hypothesis() {
// Long premise (>1600 chars) gets truncated, short hypothesis is
// passed unchanged (signature placeholder for v0.18.1 token-budget
// version). MAX_NLI_PREMISE_CHARS = 4 * 400 = 1600.
let long_premise: String = "a".repeat(2000);
let (truncated, was_truncated) = truncate_for_nli(&long_premise, "short hypothesis");
assert!(was_truncated);
assert_eq!(
truncated.chars().count(),
1600,
"premise truncated to MAX_NLI_PREMISE_CHARS"
);
// Short premise (under budget): no truncation, `was_truncated = false`.
let short_premise = "short premise text";
let (passthrough, was_truncated) = truncate_for_nli(short_premise, "anything");
assert!(!was_truncated);
assert_eq!(passthrough, short_premise);
// Multi-byte safety: 1600 Korean chars (3 bytes each in UTF-8) fit
// within the char budget even though the byte length is 4800.
let kr_short: String = "가".repeat(1600);
let (passthrough_kr, was_truncated_kr) = truncate_for_nli(&kr_short, "h");
assert!(!was_truncated_kr, "1600 KR chars == budget, no truncation");
assert_eq!(passthrough_kr.chars().count(), 1600);
// Multi-byte over-budget: truncation must count chars, not bytes.
let kr_long: String = "가".repeat(2000);
let (truncated_kr, was_truncated_kr) = truncate_for_nli(&kr_long, "h");
assert!(was_truncated_kr);
assert_eq!(
truncated_kr.chars().count(),
1600,
"char-based truncation must not over-cut on multi-byte input"
);
}
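
The char-versus-byte distinction these assertions pin can be shown with a borrowing variant of the cut. This is a sketch only: `take_chars` is a hypothetical helper, not part of kebab-rag, and it returns a slice rather than the owned `String` that `truncate_for_nli` returns.

```rust
/// Hypothetical helper: keep at most `max_chars` Unicode scalar values,
/// always cutting on a UTF-8 character boundary.
pub fn take_chars(s: &str, max_chars: usize) -> &str {
    // `char_indices` yields the byte offset of each char, so the offset of
    // the first char past the budget is a valid slice end.
    match s.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s, // already within budget
    }
}
```

For a string of 3-byte Korean characters, cutting at a raw byte offset such as 4 would fall inside a character, which is why the budget is counted in chars.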


@@ -88,6 +88,7 @@ Input:
- For follow-up turns on the same topic, pass `session_id` (e.g. `"team-onboarding-2026-05"`) and reuse it across the conversation. Sessions persist until `kebab reset --data-only`.
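- p9-fb-40: the default is `prompt_template_version = "rag-v2"`. Answers are stricter: facts are quoted as verbatim spans, drawing on training knowledge is forbidden, and "확실하지 않다" ("not certain") may appear when the evidence is ambiguous. If the user explicitly sets `[rag] prompt_template_version = "rag-v1"`, the legacy behavior applies.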
- **p9-fb-41 `multi_hop: true`** — opt the ask into the multi-hop pipeline. The query is decomposed into sub-questions, each retrieved independently (LLM-driven decide loop, up to `rag.multi_hop_max_depth` iters), then synthesized over the merged chunk pool. Cost trade-off: 25× LLM calls vs. single-pass. **Use** for compound questions ("X 와 Y 의 차이는?", prereq chains, cross-doc reasoning where one chunk alone is insufficient). **Don't** for simple fact-finding (single-pass is faster + cheaper). When set, `answer.v1.hops[]` carries the per-hop trace (`{iter, kind, sub_queries[], context_chunks_added, forced_stop, llm_call_ms}`) — surface a brief "Searched in N hops" note when the trace is non-trivial. Decompose-failure (model emitted non-JSON) → `refusal_reason = "multi_hop_decompose_failed"`; treat like any other refusal.
- **v0.18+ multi-hop NLI verification** — multi-hop ask (`mcp__kebab__ask` with `multi_hop: true`) runs a post-synthesize NLI groundedness gate when `[rag] nli_threshold > 0` is set in the user's config. `answer.v1.verification.nli_passed == true` means the generated answer is entailed by the retrieved chunks (grounded); `false` means the answer is refused with `refusal_reason = "nli_verification_failed"` and the `verification` block still ships so the agent can show what entailment score was rejected. Threshold tuning: 0.5 is the production default, 0.9 is strict mode. If the NLI model download / inference fails the pipeline emits `refusal_reason = "nli_model_unavailable"` — user-side workaround is `[rag] nli_threshold = 0` then retry multi-hop. Single-pass `ask` (multi_hop: false / unset) is unaffected — it keeps the LLM self-judge gate as the only verification.
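
One way a client might turn these fields into a user-facing note is sketched below. It is an illustration under assumptions: `AnswerView` and `user_note` are hypothetical, the struct is a trimmed-down view of `answer.v1` rather than the real schema, and the message wording is invented for the example.

```rust
/// Hypothetical, trimmed-down view of the `answer.v1` fields discussed above.
pub struct AnswerView<'a> {
    pub refusal_reason: Option<&'a str>,
    /// (nli_score, nli_threshold, nli_passed) when a verification block is present.
    pub nli: Option<(f32, f32, bool)>,
}

/// Map the refusal reason and verification block to a short note.
pub fn user_note(a: &AnswerView) -> String {
    match (a.refusal_reason, a.nli) {
        // Refused by the NLI gate: the verification block still ships.
        (Some("nli_verification_failed"), Some((score, threshold, _))) => format!(
            "Answer refused: entailment {score:.2} is below the threshold {threshold:.2}."
        ),
        // Verifier could not run: suggest the documented workaround.
        (Some("nli_model_unavailable"), _) => {
            "NLI model unavailable; set `[rag] nli_threshold = 0` and retry.".to_string()
        }
        (Some(other), _) => format!("Answer refused: {other}."),
        (None, Some((score, _, true))) => {
            format!("Answer verified by NLI (entailment {score:.2}).")
        }
        (None, _) => "Answer not NLI-verified.".to_string(),
    }
}
```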
### `mcp__kebab__fetch` — when you need raw text