feat(rag): fb-41 PR-9c-2 - pipeline integration + mock test + SKILL.md (★ NLI activated for real)

Activates behavior on top of the wire surface from PR-9c-1: the step 8.5 hook in `ask_multi_hop` now performs real NLI verification when `[rag] nli_threshold > 0`. This is the **first user-visible behavior change** in PR-9.

- crates/kebab-rag/src/pipeline.rs:
  - ask_multi_hop step 8.5 NLI hook (empty-answer guard + truncate_for_nli + verifier.score + verification field + refusal branch).
  - refuse_nli_verification helper (verification: Some(...) + RefusalReason::NliVerificationFailed).
  - refuse_nli_model_unavailable helper (verification: None + RefusalReason::NliModelUnavailable).
  - truncate_for_nli helper (module-level pub fn with a char-based budget of MAX_NLI_PREMISE_CHARS = 4 * 400 = 1600 chars; `_hypothesis` is an unused placeholder and a candidate for the v0.18.1 token-budget update).
  - Removed the two #[allow(dead_code)] attributes left by PR-9c-1 (verifier field + with_verifier builder) and cleaned up the transitional sentences in their docs. This closes the N1 carry-forward from the round-1 PR-9c-1 review.
- crates/kebab-app/src/app.rs:
  - NliVerifier construction in App::open_with_config: when config.rag.nli_threshold > 0, build OnnxNliVerifier::new, wrap it in Arc::new, and call with_verifier when the RagPipeline is later initialized. Failures propagate via `?` (the signature stays Result<Self>, so nothing cascades to callers).
  - Added the kebab-nli path dependency to kebab-app/Cargo.toml.
- crates/kebab-rag/tests/multi_hop.rs + tests/common/mod.rs:
  - MockNliVerifier (pass / fail / err constructors, with an instrumented call_count on score).
  - multi_hop_nli_pass_keeps_grounded: entailment 0.9 → grounded=true, verification.nli_passed=true.
  - multi_hop_nli_fail_refuses: entailment 0.1 → refusal=NliVerificationFailed.
  - multi_hop_nli_disabled_skip_verify: threshold 0.0 → verification skipped, verification=None.
  - multi_hop_nli_model_unavailable_refuses: verifier Err → refusal=NliModelUnavailable.
  - multi_hop_truncate_for_nli_preserves_hypothesis: long-premise truncation with the hypothesis preserved.
- integrations/claude-code/kebab/SKILL.md: one paragraph of NLI guidance in the mcp__kebab__ask section (meaning of verification.nli_passed, threshold tuning, and handling of the nli_verification_failed / nli_model_unavailable refusals).

Verification: cargo test --workspace -j 1 passes the 5 new multi-hop tests with 0 regressions (the pre-existing kebab-mcp::tools_call_ask_multi_hop flake is unchanged). cargo clippy --workspace --all-targets -j 1 -- -D warnings is clean.

Wire impact: *behavior wiring* for the PR-9c-1 schema change. The answer.v1.verification field is now populated on the multi-hop happy path and on the NliVerificationFailed refusal path; it stays None on the NliModelUnavailable refusal path and when the gate is disabled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
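The step 8.5 decision described in this message can be sketched in isolation. This is a minimal illustration and not the code in pipeline.rs: `Scores`, `Gate`, and `nli_gate` are hypothetical names, and the verifier is modeled as a plain closure instead of the `NliVerifier` trait.

```rust
/// Hypothetical score type; only `entailment` matters for the gate.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Scores {
    pub entailment: f32,
}

/// Hypothetical outcome of the gate in this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Gate {
    /// Threshold is not positive, or the answer is empty: nothing ran.
    Skipped,
    /// Entailment met the threshold; the answer stays grounded.
    Passed { score: f32 },
    /// Entailment fell below the threshold (maps to NliVerificationFailed).
    Refused { score: f32 },
    /// The verifier itself errored (maps to NliModelUnavailable).
    Unavailable,
}

/// Decide what the gate does for one synthesized answer.
/// `score` stands in for a verifier call taking (premise, hypothesis).
pub fn nli_gate<F>(threshold: f32, answer: &str, premise: &str, score: F) -> Gate
where
    F: Fn(&str, &str) -> Result<Scores, String>,
{
    // Mirrors the guard in the commit: gate disabled or empty answer => skip.
    if !(threshold > 0.0) || answer.trim().is_empty() {
        return Gate::Skipped;
    }
    match score(premise, answer) {
        Ok(s) if s.entailment >= threshold => Gate::Passed { score: s.entailment },
        Ok(s) => Gate::Refused { score: s.entailment },
        Err(_) => Gate::Unavailable,
    }
}
```

Modeling the four outcomes as one enum makes it easy to see that only `Passed` and `Refused` carry a score, which matches the commit's choice to ship a verification summary on those two paths only.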
2026-05-26 00:55:02 +00:00
parent 681c48b2a3
commit 00ffe9c792
8 changed files with 539 additions and 29 deletions
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -4142,6 +4142,7 @@ dependencies = [
 "kebab-embed-local",
 "kebab-llm",
 "kebab-llm-local",
+ "kebab-nli",
 "kebab-normalize",
 "kebab-parse-code",
 "kebab-parse-image",
--- a/crates/kebab-app/Cargo.toml
+++ b/crates/kebab-app/Cargo.toml
@@ -23,6 +23,11 @@ kebab-embed-local = { path = "../kebab-embed-local" }
 kebab-llm = { path = "../kebab-llm" }
 kebab-llm-local = { path = "../kebab-llm-local" }
 kebab-rag = { path = "../kebab-rag" }
+# p9-fb-41 PR-9c-2: facade construction of OnnxNliVerifier when
+# `[rag] nli_threshold > 0`. Trait-only consumption via kebab-rag's
+# `with_verifier`; no kebab-nli internals leak into kebab-app code
+# beyond the construction site in `open_with_config`.
+kebab-nli = { path = "../kebab-nli" }
 # P6-4: image extractor + OCR + caption adapters live here. App
 # threads them into the per-asset dispatch (see `ingest_one_asset`
 # image branch). Trait-only consumption — no `kebab-parse-image`
--- a/crates/kebab-app/src/app.rs
+++ b/crates/kebab-app/src/app.rs
@@ -133,6 +133,17 @@ pub struct App {
    /// `corpus_revision` snapshot embedded in `SearchCacheKey`
    /// invalidates every entry the moment a new ingest commit lands.
    search_cache: Option<Mutex<LruCache<SearchCacheKey, Vec<SearchHit>>>>,
+    /// p9-fb-41 PR-9c-2: NLI verifier built eagerly at
+    /// `open_with_config` time when `config.rag.nli_threshold > 0`,
+    /// consumed by `RagPipeline::with_verifier` on every `ask` /
+    /// `ask_with_session` call. `None` when the gate is disabled
+    /// (default, threshold = 0) — multi-hop skips step 8.5 entirely
+    /// and single-pass never touches the verifier.
+    ///
+    /// Built eagerly (not lazy) so the `open_with_config` `?`
+    /// propagation surfaces NLI model construction errors at App
+    /// boot time, before any user query runs.
+    pipeline_verifier: Option<Arc<dyn kebab_nli::NliVerifier>>,
 }

 /// p9-fb-19: cache key for `App::search`. Includes every field that
@@ -193,6 +204,21 @@ impl App {
        // `None` (cache disabled — every search hits the retrievers).
        let search_cache = NonZeroUsize::new(config.search.cache_capacity)
            .map(|cap| Mutex::new(LruCache::new(cap)));
+        // p9-fb-41 PR-9c-2: build the NLI verifier when the gate is
+        // enabled. App carries it on `RagPipeline` via
+        // `with_verifier` so the rag crate doesn't have to know about
+        // kebab-nli construction. Failure (`?`) surfaces as a user-
+        // facing error at App boot — never a panic in the pipeline's
+        // `expect("verifier must be Some when nli_threshold > 0.0")`.
+        let pipeline_verifier: Option<Arc<dyn kebab_nli::NliVerifier>> =
+            if config.rag.nli_threshold > 0.0 {
+                let v = kebab_nli::OnnxNliVerifier::new(&config).context(
+                    "kebab-app: construct OnnxNliVerifier (config.rag.nli_threshold > 0)",
+                )?;
+                Some(Arc::new(v))
+            } else {
+                None
+            };
        Ok(Self {
            config,
            sqlite: Arc::new(sqlite),
@@ -200,6 +226,7 @@ impl App {
            vector: OnceLock::new(),
            llm: OnceLock::new(),
            search_cache,
+            pipeline_verifier,
        })
    }

@@ -553,9 +580,26 @@ impl App {
    pub fn ask(&self, query: &str, opts: AskOpts) -> Result<Answer> {
        let retriever = self.build_retriever(opts.mode)?;
        let llm = self.llm()?;
+        let pipeline = self.build_pipeline(retriever, llm);
+        pipeline.ask(query, opts)
+    }
+
+    /// p9-fb-41 PR-9c-2: shared pipeline builder used by [`Self::ask`]
+    /// and [`Self::ask_with_session`]. Attaches the App-built NLI
+    /// verifier (when `cfg.rag.nli_threshold > 0`) via
+    /// `RagPipeline::with_verifier`, keeping the construction site in
+    /// a single place so the two call paths can't drift.
+    fn build_pipeline(
+        &self,
+        retriever: Arc<dyn Retriever>,
+        llm: Arc<dyn LanguageModel>,
+    ) -> RagPipeline {
        let pipeline =
            RagPipeline::new(self.config.clone(), retriever, llm, self.sqlite.clone());
-        pipeline.ask(query, opts)
+        match &self.pipeline_verifier {
+            Some(v) => pipeline.with_verifier(v.clone()),
+            None => pipeline,
+        }
    }

    /// p9-fb-18: shared retriever-stack builder used by [`Self::ask`]
@@ -660,10 +704,11 @@ impl App {

        // p9-fb-18 R1: shared retriever builder removes the prior
        // copy of `ask`'s 35-line stack — see [`Self::build_retriever`].
+        // p9-fb-41 PR-9c-2: shared `build_pipeline` attaches the NLI
+        // verifier when the gate is enabled.
        let retriever = self.build_retriever(opts.mode)?;
        let llm = self.llm()?;
-        let pipeline =
-            RagPipeline::new(self.config.clone(), retriever, llm, self.sqlite.clone());
+        let pipeline = self.build_pipeline(retriever, llm);
        let answer = pipeline.ask_with_history(
            query,
            history,
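The eager-construction pattern in the app.rs changes above can be sketched as a standalone example. Everything here is a simplified stand-in under stated assumptions: `Verifier`, `FixedVerifier`, `load_verifier`, `Pipeline`, and this `App` are hypothetical and do not reflect the real kebab-app or kebab-nli types.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the NLI verifier trait.
pub trait Verifier {
    fn score(&self, premise: &str, hypothesis: &str) -> Result<f32, String>;
}

/// Hypothetical verifier that always returns the same score.
pub struct FixedVerifier(pub f32);

impl Verifier for FixedVerifier {
    fn score(&self, _premise: &str, _hypothesis: &str) -> Result<f32, String> {
        Ok(self.0)
    }
}

/// Hypothetical loader; fails when the model is not available.
fn load_verifier(model_available: bool) -> Result<FixedVerifier, String> {
    if model_available {
        Ok(FixedVerifier(0.9))
    } else {
        Err("NLI model unavailable".to_string())
    }
}

pub struct Pipeline {
    verifier: Option<Arc<dyn Verifier>>,
}

impl Pipeline {
    pub fn new() -> Self {
        Self { verifier: None }
    }

    /// Builder-style injection, same shape as `with_verifier` in the diff.
    pub fn with_verifier(mut self, v: Arc<dyn Verifier>) -> Self {
        self.verifier = Some(v);
        self
    }

    pub fn has_verifier(&self) -> bool {
        self.verifier.is_some()
    }
}

pub struct App {
    verifier: Option<Arc<dyn Verifier>>,
}

impl App {
    /// Build the verifier at open time so a load failure surfaces here,
    /// before any query runs.
    pub fn open(threshold: f32, model_available: bool) -> Result<Self, String> {
        let verifier = if threshold > 0.0 {
            Some(Arc::new(load_verifier(model_available)?) as Arc<dyn Verifier>)
        } else {
            None
        };
        Ok(Self { verifier })
    }

    /// Single construction site shared by every ask path.
    pub fn build_pipeline(&self) -> Pipeline {
        match &self.verifier {
            Some(v) => Pipeline::new().with_verifier(Arc::clone(v)),
            None => Pipeline::new(),
        }
    }
}
```

Cloning the `Arc` per pipeline is cheap (a reference-count bump), so building a fresh pipeline for each call does not reload the model.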
--- a/crates/kebab-rag/src/lib.rs
+++ b/crates/kebab-rag/src/lib.rs
@@ -22,4 +22,4 @@ pub use kebab_core::{Answer, AnswerCitation, AnswerRetrievalSummary, RefusalReas

 mod pipeline;

-pub use pipeline::{AskOpts, RagPipeline, StreamEvent};
+pub use pipeline::{AskOpts, MAX_NLI_PREMISE_CHARS, RagPipeline, StreamEvent, truncate_for_nli};
--- a/crates/kebab-rag/src/pipeline.rs
+++ b/crates/kebab-rag/src/pipeline.rs
@@ -42,7 +42,7 @@ use kebab_core::{
    Answer, AnswerCitation, AnswerRetrievalSummary, Citation, FinishReason,
    GenerateRequest, HopKind, HopRecord, LanguageModel, ModelRef, RefusalReason,
    Retriever, SearchFilters, SearchHit, SearchMode, SearchQuery, TokenChunk,
-    TokenUsage, TraceId, Turn,
+    TokenUsage, TraceId, Turn, VerificationSummary,
 };
 use kebab_core::versions::PromptTemplateVersion;
 use kebab_store_sqlite::SqliteStore;
@@ -197,13 +197,11 @@ pub struct RagPipeline {
    retriever: Arc<dyn Retriever>,
    llm: Arc<dyn LanguageModel>,
    docs: Arc<SqliteStore>,
-    /// p9-fb-41 PR-9c-1: optional NLI verifier injected via
-    /// [`Self::with_verifier`]. Not yet read — PR-9c-2 wires the
-    /// `ask_multi_hop` step 8.5 (post-synthesize gate) that consumes
-    /// it. Until then the field is `#[allow(dead_code)]`; the
-    /// attribute is removed in the PR-9c-2 commit that adds the
-    /// read site so leftover dead code can never sneak in.
-    #[allow(dead_code)]
+    /// p9-fb-41 PR-9c-1/PR-9c-2: optional NLI verifier injected via
+    /// [`Self::with_verifier`]. Consumed by `ask_multi_hop` step 8.5
+    /// (post-synthesize gate) when `cfg.rag.nli_threshold > 0`.
+    /// `None` when the gate is disabled — single-pass `ask` never
+    /// touches this field.
    verifier: Option<Arc<dyn kebab_nli::NliVerifier>>,
 }

@@ -231,17 +229,12 @@ impl RagPipeline {
        }
    }

-    /// p9-fb-41 PR-9c-1: inject the post-synthesize NLI verifier.
-    /// Caller (kebab-app facade, PR-9c-2) builds an
+    /// p9-fb-41 PR-9c-1/PR-9c-2: inject the post-synthesize NLI
+    /// verifier. Caller (kebab-app facade) builds an
    /// `Arc<OnnxNliVerifier>` from `cfg.models.nli` when
    /// `cfg.rag.nli_threshold > 0`, then chains
-    /// `RagPipeline::new(...).with_verifier(v)`.
-    ///
-    /// Currently unused — PR-9c-2 wires the read site (step 8.5 of
-    /// `ask_multi_hop`). `#[allow(dead_code)]` survives only until
-    /// that PR's commit, which removes it together with adding the
-    /// hook that reads `self.verifier`.
-    #[allow(dead_code)]
+    /// `RagPipeline::new(...).with_verifier(v)`. Consumed by
+    /// `ask_multi_hop` step 8.5 (post-synthesize gate).
    pub fn with_verifier(mut self, v: Arc<dyn kebab_nli::NliVerifier>) -> Self {
        self.verifier = Some(v);
        self
@@ -1031,6 +1024,48 @@ impl RagPipeline {
            (false, Some(RefusalReason::LlmSelfJudge))
        };

+        // ── 8.5 NLI groundedness verification (multi-hop only, v0.18) ─────
+        // spec §2.7: single-pass `ask` keeps the LlmSelfJudge gate as-is;
+        // NLI verification is multi-hop only this round.
+        //
+        // Empty answer guard: if synthesize bailed (stream abort / LM
+        // crash), `acc` is empty. That path has its own refusal
+        // (LlmStreamAborted) above; skipping the NLI gate here avoids
+        // tokenizing an empty hypothesis (degenerate CLS-SEP-SEP that
+        // would yield a near-uniform softmax and a misleading nli_passed).
+        let verification = if self.config.rag.nli_threshold > 0.0 && !acc.trim().is_empty() {
+            let v = self.verifier.as_ref().expect(
+                "verifier must be Some when nli_threshold > 0.0 \
+                 (kebab-app's open_with_config enforces this invariant)",
+            );
+            let (truncated_premise, _was_truncated) = truncate_for_nli(&packed_text, &acc);
+            match v.score(&truncated_premise, &acc) {
+                Ok(scores) => {
+                    let passed = scores.entailment >= self.config.rag.nli_threshold;
+                    Some(VerificationSummary {
+                        nli_score: scores.entailment,
+                        nli_threshold: self.config.rag.nli_threshold,
+                        nli_passed: passed,
+                    })
+                }
+                Err(e) => {
+                    tracing::warn!(
+                        target: "kebab-rag",
+                        error = %e,
+                        "NLI verifier failed (model unavailable / inference err); refusing"
+                    );
+                    return self.refuse_nli_model_unavailable(query, &opts, hops, started);
+                }
+            }
+        } else {
+            None
+        };
+        if let Some(v) = &verification
+            && !v.nli_passed
+        {
+            return self.refuse_nli_verification(query, &opts, hops, *v, started);
+        }
+
        // ── 8. Build Answer ────────────────────────────────────────────────
        let cited_set: std::collections::BTreeSet<u32> = extracted.iter().copied().collect();
        let citations: Vec<AnswerCitation> = packed_entries
@@ -1101,11 +1136,10 @@ impl RagPipeline {
            // currently lose the trace (cleanup deferred — would
            // require widening helper signatures, PR-3b-ii / follow-up).
            hops: Some(hops),
-            // p9-fb-41 PR-9c-1: surface-only field — PR-9c-2 wires
-            // step 8.5 between citation-validate and Answer-build to
-            // stamp this with the actual NLI score when
-            // `cfg.rag.nli_threshold > 0`. Until then, stays None.
-            verification: None,
+            // p9-fb-41 PR-9c-2: step 8.5 stamped this when
+            // `cfg.rag.nli_threshold > 0`. None when the gate is
+            // disabled (default).
+            verification,
        };

        tracing::debug!(
@@ -1554,6 +1588,137 @@ impl RagPipeline {
        }
        Ok(answer)
    }
+
+    /// p9-fb-41 PR-9c-2: refusal path for step 8.5 NLI gate failure —
+    /// `RefusalReason::NliVerificationFailed`. The synthesized answer
+    /// existed (acc was non-empty) but the entailment score fell below
+    /// `cfg.rag.nli_threshold`. We stamp the `VerificationSummary` on
+    /// the wire so the user can see what score was rejected.
+    fn refuse_nli_verification(
+        &self,
+        query: &str,
+        opts: &AskOpts,
+        hops: Vec<HopRecord>,
+        v: VerificationSummary,
+        started: std::time::Instant,
+    ) -> Result<Answer> {
+        let elapsed_ms = u32::try_from(started.elapsed().as_millis()).unwrap_or(u32::MAX);
+        let trace_id = mint_trace_id(query, 0.0, &self.llm.model_ref().id);
+        let k_effective = opts.k.max(self.config.search.default_k);
+        let answer = Answer {
+            answer: "근거 부족. 생성된 답변이 검색된 문서 내용에 충분히 entail 되지 않음."
+                .to_string(),
+            citations: Vec::new(),
+            grounded: false,
+            refusal_reason: Some(RefusalReason::NliVerificationFailed),
+            model: self.llm.model_ref(),
+            embedding: embedding_ref_for(opts.mode, &self.config),
+            prompt_template_version: PromptTemplateVersion(
+                PROMPT_TEMPLATE_VERSION_MULTI_HOP.to_string(),
+            ),
+            retrieval: AnswerRetrievalSummary {
+                trace_id,
+                mode: opts.mode,
+                k: k_effective,
+                score_gate: self.config.rag.score_gate,
+                top_score: 0.0,
+                chunks_returned: 0,
+                chunks_used: 0,
+            },
+            usage: TokenUsage {
+                prompt_tokens: 0,
+                completion_tokens: 0,
+                latency_ms: elapsed_ms,
+            },
+            created_at: OffsetDateTime::now_utc(),
+            conversation_id: opts.conversation_id.clone(),
+            turn_index: opts.turn_index,
+            // PR-9c-2: NLI refusal still carries the hop trace built
+            // up to step 8.5 — synthesize ran, so the trace is the
+            // full decompose+decide chain (terminal Synthesize hop is
+            // NOT appended for the refusal path; cleanup deferred to
+            // a follow-up if the user-visible trace shape needs the
+            // synthesize entry).
+            hops: Some(hops),
+            verification: Some(v),
+        };
+        if let Some(sink) = &opts.stream_sink {
+            let _ = sink.send(StreamEvent::Final {
+                answer: answer.clone(),
+            });
+        }
+        if let Err(e) = self.docs.put_answer(&answer, query, None) {
+            tracing::warn!(
+                target: "kebab-rag",
+                error = %e,
+                "kb-rag: put_answer (NliVerificationFailed) failed"
+            );
+        }
+        Ok(answer)
+    }
+
+    /// p9-fb-41 PR-9c-2: refusal path for step 8.5 NLI model
+    /// unavailable — `RefusalReason::NliModelUnavailable`. The verifier
+    /// raised an inference/download error so we cannot summarize the
+    /// verification result; `verification` is `None`. Treat as a soft
+    /// refusal — the user can opt out by setting `[rag] nli_threshold
+    /// = 0` and retrying.
+    fn refuse_nli_model_unavailable(
+        &self,
+        query: &str,
+        opts: &AskOpts,
+        hops: Vec<HopRecord>,
+        started: std::time::Instant,
+    ) -> Result<Answer> {
+        let elapsed_ms = u32::try_from(started.elapsed().as_millis()).unwrap_or(u32::MAX);
+        let trace_id = mint_trace_id(query, 0.0, &self.llm.model_ref().id);
+        let k_effective = opts.k.max(self.config.search.default_k);
+        let answer = Answer {
+            answer: "근거 부족. NLI 검증 모델을 사용할 수 없음 — `[rag] nli_threshold = 0` 으로 비활성화 후 재시도 가능."
+                .to_string(),
+            citations: Vec::new(),
+            grounded: false,
+            refusal_reason: Some(RefusalReason::NliModelUnavailable),
+            model: self.llm.model_ref(),
+            embedding: embedding_ref_for(opts.mode, &self.config),
+            prompt_template_version: PromptTemplateVersion(
+                PROMPT_TEMPLATE_VERSION_MULTI_HOP.to_string(),
+            ),
+            retrieval: AnswerRetrievalSummary {
+                trace_id,
+                mode: opts.mode,
+                k: k_effective,
+                score_gate: self.config.rag.score_gate,
+                top_score: 0.0,
+                chunks_returned: 0,
+                chunks_used: 0,
+            },
+            usage: TokenUsage {
+                prompt_tokens: 0,
+                completion_tokens: 0,
+                latency_ms: elapsed_ms,
+            },
+            created_at: OffsetDateTime::now_utc(),
+            conversation_id: opts.conversation_id.clone(),
+            turn_index: opts.turn_index,
+            hops: Some(hops),
+            // No VerificationSummary — verification didn't happen.
+            verification: None,
+        };
+        if let Some(sink) = &opts.stream_sink {
+            let _ = sink.send(StreamEvent::Final {
+                answer: answer.clone(),
+            });
+        }
+        if let Err(e) = self.docs.put_answer(&answer, query, None) {
+            tracing::warn!(
+                target: "kebab-rag",
+                error = %e,
+                "kb-rag: put_answer (NliModelUnavailable) failed"
+            );
+        }
+        Ok(answer)
+    }
 }

 // ── Helpers ────────────────────────────────────────────────────────────────
@@ -1623,6 +1788,35 @@ pub(crate) const PROMPT_TEMPLATE_VERSION_MULTI_HOP: &str = "rag-multi-hop-v1";
 /// the same PR.
 pub(crate) const MULTI_HOP_MAX_SUB_QUERIES_HARD_CAP: usize = 10;

+/// p9-fb-41 PR-9c-2: premise budget for NLI input. mDeBERTa-v3's
+/// positional embedding caps at 512 tokens; with the hypothesis +
+/// special-token budget reserved (~32 chars conservative), the
+/// premise gets ≈1600 chars at 4 chars/token (English BPE baseline).
+/// Korean SentencePiece is denser (1-2 char/token) — the tokenizer's
+/// `OnlyFirst` strategy (configured in kebab-nli) is the backup
+/// truncation when the char-based budget still overflows the token
+/// limit. v0.18.1 candidate: token-count-based budget once we have
+/// measured KR truncation rates from dogfood retest.
+pub const MAX_NLI_PREMISE_CHARS: usize = 4 * 400;
+
+/// p9-fb-41 PR-9c-2: truncate `premise` to fit the NLI input budget
+/// while preserving `hypothesis` in full. Returns `(truncated_premise,
+/// was_truncated)`. `was_truncated` is informational for tracing —
+/// the v0.18 wire doesn't surface it; a v0.19+ extension might.
+///
+/// `_hypothesis` is currently unused — placeholder for the v0.18.1
+/// token-budget version that would carve the budget *around* the
+/// hypothesis. Kept on the signature to preserve the contract from
+/// spec §2.2.3 / spec §3 PR-9c-2.
+pub fn truncate_for_nli(premise: &str, _hypothesis: &str) -> (String, bool) {
+    if premise.chars().count() <= MAX_NLI_PREMISE_CHARS {
+        (premise.to_string(), false)
+    } else {
+        let truncated: String = premise.chars().take(MAX_NLI_PREMISE_CHARS).collect();
+        (truncated, true)
+    }
+}
+
 const MULTI_HOP_DECOMPOSE_SYSTEM_PROMPT: &str = "당신은 사용자의 질문을 다단계 검색에 필요한 sub-question 들로 분해하는 도구다.\n- multi-hop 정보가 필요한 경우 독립적으로 검색 가능한 sub-question 들로 분해한다.\n- 각 sub-question 은 자기 자신만으로 의미가 통해야 한다 (대명사 / \"위 답변\" 같은 reference 금지).\n- 원본이 이미 단순하면 원본 그대로 1 개만 반환한다.\n- 응답은 JSON array of strings 만 출력한다. 다른 prose / markdown fence / 설명 금지.";

 const MULTI_HOP_DECIDE_SYSTEM_PROMPT: &str = "당신은 multi-hop 검색의 매 iter 에서 \"추가 retrieval 이 필요한가?\" 를 판단하는 도구다.\n- 지금까지 모은 [근거] 가 [원본 질문] 의 모든 측면을 cover 하는지 평가한다.\n- 추가가 필요하면 새 sub-question 들 (이미 모은 정보로 답할 수 없는 부분만, 독립적으로 검색 가능한 형태로) 을 JSON array of strings 로 반환한다.\n- 충분하면 빈 array `[]` 를 반환한다.\n- 응답은 JSON array of strings 만 출력한다. 다른 prose / markdown fence / 설명 금지.\n- 각 sub-question 은 자기 자신만으로 의미가 통해야 한다 (대명사 / \"위 답변\" 같은 reference 금지).";
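The char-based budget in `truncate_for_nli` matters because slicing a `&str` at an arbitrary byte offset panics when the offset falls inside a multi-byte character. The sketch below shows an alternative formulation of the same contract, not the function added in this commit: `truncate_chars` is a hypothetical helper that returns a borrowed slice and walks the string once.

```rust
/// Truncate `s` to at most `max_chars` Unicode scalar values.
/// Returns the (possibly shortened) slice and whether truncation happened.
pub fn truncate_chars(s: &str, max_chars: usize) -> (&str, bool) {
    // `nth(max_chars)` yields the first char that is over budget, if any.
    // Its byte index is a valid char boundary, so slicing there is safe.
    match s.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => (&s[..byte_idx], true),
        None => (s, false),
    }
}
```

Compared with counting chars and then collecting into a new `String`, this variant avoids the allocation, at the cost of tying the result's lifetime to the input.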
--- a/crates/kebab-rag/tests/common/mod.rs
+++ b/crates/kebab-rag/tests/common/mod.rs
@@ -12,13 +12,14 @@

 #![allow(dead_code)]

-use std::sync::Arc;
+use std::sync::{Arc, Mutex};

 use kebab_config::Config;
 use kebab_core::{
    ChunkerVersion, ChunkId, Citation, DocumentId, IndexVersion, RetrievalDetail,
    Retriever, SearchHit, SearchMode, SearchQuery, WorkspacePath,
 };
+use kebab_nli::{NliScores, NliVerifier};
 use kebab_store_sqlite::SqliteStore;
 use rusqlite::params;
 use tempfile::TempDir;
@@ -384,3 +385,73 @@ impl kebab_core::LanguageModel for ScriptedLm {
        Ok(Box::new(chunks.into_iter().map(Ok)))
    }
 }
+
+/// p9-fb-41 PR-9c-2: mock NLI verifier for multi-hop step 8.5 tests.
+/// Three constructors mirror the test scenarios:
+/// - [`MockNliVerifier::pass`] — high entailment score (0.9), `nli_passed`
+///   is true at the production default threshold (0.5).
+/// - [`MockNliVerifier::fail`] — low entailment (0.1), refuses at any
+///   threshold > 0.1.
+/// - [`MockNliVerifier::err`] — returns an `anyhow::Error` so the pipeline
+///   surfaces `RefusalReason::NliModelUnavailable`.
+///
+/// `call_count` instrumented (Mutex-wrapped usize) so a test can assert
+/// the verifier ran the expected number of times — useful for pinning
+/// the "threshold = 0 skips verify" invariant when the verifier is
+/// nonetheless attached to the pipeline.
+pub struct MockNliVerifier {
+    pub mode: MockMode,
+    pub call_count: Mutex<usize>,
+}
+
+pub enum MockMode {
+    /// Return these scores. Used by pass / fail variants.
+    Scores(NliScores),
+    /// Return an `anyhow::Error`. Used by the err variant.
+    Err(String),
+}
+
+impl MockNliVerifier {
+    pub fn pass() -> Arc<Self> {
+        Arc::new(Self {
+            mode: MockMode::Scores(NliScores {
+                entailment: 0.9,
+                neutral: 0.07,
+                contradiction: 0.03,
+            }),
+            call_count: Mutex::new(0),
+        })
+    }
+
+    pub fn fail() -> Arc<Self> {
+        Arc::new(Self {
+            mode: MockMode::Scores(NliScores {
+                entailment: 0.1,
+                neutral: 0.4,
+                contradiction: 0.5,
+            }),
+            call_count: Mutex::new(0),
+        })
+    }
+
+    pub fn err() -> Arc<Self> {
+        Arc::new(Self {
+            mode: MockMode::Err("mock NLI unavailable".into()),
+            call_count: Mutex::new(0),
+        })
+    }
+
+    pub fn calls(&self) -> usize {
+        *self.call_count.lock().unwrap()
+    }
+}
+
+impl NliVerifier for MockNliVerifier {
+    fn score(&self, _premise: &str, _hypothesis: &str) -> anyhow::Result<NliScores> {
+        *self.call_count.lock().unwrap() += 1;
+        match &self.mode {
+            MockMode::Scores(s) => Ok(*s),
+            MockMode::Err(e) => anyhow::bail!("{e}"),
+        }
+    }
+}
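The call-count instrumentation on the mock can also be written with an atomic counter instead of a `Mutex<usize>`. The following is a sketch of that alternative under stated assumptions: `Verifier` and `CountingVerifier` are hypothetical names, and the score is reduced to a single `f32` for brevity.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the NLI verifier trait.
pub trait Verifier {
    fn score(&self, premise: &str, hypothesis: &str) -> Result<f32, String>;
}

/// Mock that replays a fixed result and counts how often it was called.
pub struct CountingVerifier {
    result: Result<f32, String>,
    calls: AtomicUsize,
}

impl CountingVerifier {
    pub fn returning(result: Result<f32, String>) -> Arc<Self> {
        Arc::new(Self {
            result,
            calls: AtomicUsize::new(0),
        })
    }

    pub fn calls(&self) -> usize {
        self.calls.load(Ordering::SeqCst)
    }
}

impl Verifier for CountingVerifier {
    fn score(&self, _premise: &str, _hypothesis: &str) -> Result<f32, String> {
        // Count the call before replaying the canned result.
        self.calls.fetch_add(1, Ordering::SeqCst);
        self.result.clone()
    }
}
```

As in the tests in this commit, the caller keeps one typed `Arc` handle for assertions and hands a second, trait-object `Arc` to the code under test.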
--- a/crates/kebab-rag/tests/multi_hop.rs
+++ b/crates/kebab-rag/tests/multi_hop.rs
@@ -26,9 +26,10 @@ mod common;

 use std::sync::Arc;

-use common::{RagEnv, ScriptedLm, ScriptedRetriever, id32, mk_hit};
+use common::{MockNliVerifier, RagEnv, ScriptedLm, ScriptedRetriever, id32, mk_hit};
 use kebab_core::{HopKind, LanguageModel, RefusalReason, Retriever, SearchMode};
-use kebab_rag::{AskOpts, RagPipeline};
+use kebab_nli::NliVerifier;
+use kebab_rag::{AskOpts, RagPipeline, truncate_for_nli};

 /// Default `AskOpts` for multi-hop tests: deterministic seed,
 /// lexical mode (so the test crate doesn't need to wire up an
@@ -618,3 +619,195 @@ fn multi_hop_above_probe_gate_proceeds_to_decompose() {
    let hops = answer.hops.expect("happy path stamps hops");
    assert_eq!(hops.len(), 3);
 }
+
+// ── p9-fb-41 PR-9c-2: step 8.5 NLI verification tests ──────────────────────
+//
+// Five tests pin the NLI hook on the multi-hop path:
+// 1. `multi_hop_nli_pass_keeps_grounded` — entailment 0.9 ≥ threshold 0.5 →
+//    happy path, `verification.nli_passed = true`.
+// 2. `multi_hop_nli_fail_refuses` — entailment 0.1 < threshold 0.5 →
+//    refusal with `RefusalReason::NliVerificationFailed` + verification stamp.
+// 3. `multi_hop_nli_disabled_skip_verify` — threshold 0.0 → verify skipped,
+//    `Answer.verification` stays `None` (no verifier attached).
+// 4. `multi_hop_nli_model_unavailable_refuses` — verifier returns `Err` →
+//    refusal with `RefusalReason::NliModelUnavailable` + `verification = None`.
+// 5. `multi_hop_truncate_for_nli_preserves_hypothesis` — pure unit test on
+//    `truncate_for_nli`'s char-budget contract.
+
+/// Helper to build a "valid multi-hop happy-path" scenario where probe +
+/// decompose retrieves the same single chunk, decompose emits one
+/// sub-query, decide signals stop, and synthesize produces a cited
+/// answer. Returns the seeded `RagEnv`, scripted retriever (so the
+/// test can assert call count), and scripted LM with the 3-call
+/// script ready.
+fn happy_multi_hop_env() -> (RagEnv, Arc<ScriptedRetriever>, Arc<ScriptedLm>) {
+    let env = RagEnv::new();
+    let cid = id32("c1");
+    let did = id32("d1");
+    env.seed_chunk(&cid, &did, "notes/a.md", "Body text.", &["Intro"]);
+    let hits = vec![mk_hit(1, &cid, &did, "notes/a.md", 0.85, &["Intro"])];
+    let retriever = Arc::new(ScriptedRetriever::new(vec![hits.clone(), hits]));
+    let lm = Arc::new(ScriptedLm::new(vec![
+        r#"["q1"]"#,
+        r#"[]"#,
+        "answer body [#1]",
+    ]));
+    (env, retriever, lm)
+}
+
+#[test]
+fn multi_hop_nli_pass_keeps_grounded() {
+    let (env, retriever, lm) = happy_multi_hop_env();
+    let mut cfg = env.config.clone();
+    cfg.rag.nli_threshold = 0.5;
+
+    let retriever_dyn: Arc<dyn Retriever> = retriever;
+    let lm_dyn: Arc<dyn LanguageModel> = lm;
+    let verifier = MockNliVerifier::pass();
+    let verifier_handle = verifier.clone();
+    let verifier_dyn: Arc<dyn NliVerifier> = verifier;
+    let pipeline =
+        RagPipeline::new(cfg, retriever_dyn, lm_dyn, env.sqlite.clone())
+            .with_verifier(verifier_dyn);
+
+    let answer = pipeline.ask("compound", multi_hop_opts()).unwrap();
+
+    assert!(answer.grounded, "NLI-pass synthesize must stay grounded");
+    assert_eq!(answer.refusal_reason, None);
+    assert_eq!(
+        verifier_handle.calls(),
+        1,
+        "verifier called exactly once on the synthesized answer"
+    );
+    let v = answer
+        .verification
+        .expect("nli_threshold > 0 stamps Some(verification)");
+    assert!(v.nli_passed, "entailment 0.9 ≥ threshold 0.5");
+    assert!((v.nli_score - 0.9).abs() < 1e-5, "got: {}", v.nli_score);
+    assert!((v.nli_threshold - 0.5).abs() < 1e-5);
+}
+
+#[test]
+fn multi_hop_nli_fail_refuses() {
+    let (env, retriever, lm) = happy_multi_hop_env();
+    let mut cfg = env.config.clone();
+    cfg.rag.nli_threshold = 0.5;
+
+    let retriever_dyn: Arc<dyn Retriever> = retriever;
+    let lm_dyn: Arc<dyn LanguageModel> = lm;
+    let verifier = MockNliVerifier::fail();
+    let verifier_handle = verifier.clone();
+    let verifier_dyn: Arc<dyn NliVerifier> = verifier;
+    let pipeline =
+        RagPipeline::new(cfg, retriever_dyn, lm_dyn, env.sqlite.clone())
+            .with_verifier(verifier_dyn);
+
+    let answer = pipeline.ask("compound", multi_hop_opts()).unwrap();
+
+    assert!(!answer.grounded);
+    assert_eq!(
+        answer.refusal_reason,
+        Some(RefusalReason::NliVerificationFailed)
+    );
+    assert_eq!(verifier_handle.calls(), 1);
+    let v = answer
+        .verification
+        .expect("refusal still stamps verification summary");
+    assert!(!v.nli_passed, "entailment 0.1 < threshold 0.5");
+    assert!((v.nli_score - 0.1).abs() < 1e-5, "got: {}", v.nli_score);
+}
+
+#[test]
+fn multi_hop_nli_disabled_skip_verify() {
+    let (env, retriever, lm) = happy_multi_hop_env();
+    // Default config keeps `nli_threshold = 0.0` — gate disabled. No
+    // verifier is attached to the pipeline; the hook short-circuits
+    // entirely (`Answer.verification` stays `None`).
+    let cfg = env.config.clone();
+    assert!(
+        (cfg.rag.nli_threshold - 0.0).abs() < f32::EPSILON,
+        "default nli_threshold must be 0.0 (gate disabled)"
+    );
+
+    let retriever_dyn: Arc<dyn Retriever> = retriever;
+    let lm_dyn: Arc<dyn LanguageModel> = lm;
+    // No `with_verifier` call — pipeline.verifier stays None.
+    let pipeline =
+        RagPipeline::new(cfg, retriever_dyn, lm_dyn, env.sqlite.clone());
+
+    let answer = pipeline.ask("compound", multi_hop_opts()).unwrap();
+
+    assert!(answer.grounded);
+    assert_eq!(answer.refusal_reason, None);
+    assert!(
+        answer.verification.is_none(),
+        "threshold = 0.0 must skip step 8.5 and leave verification = None"
+    );
+}
+
+#[test]
+fn multi_hop_nli_model_unavailable_refuses() {
+    let (env, retriever, lm) = happy_multi_hop_env();
+    let mut cfg = env.config.clone();
+    cfg.rag.nli_threshold = 0.5;
+
+    let retriever_dyn: Arc<dyn Retriever> = retriever;
+    let lm_dyn: Arc<dyn LanguageModel> = lm;
+    let verifier = MockNliVerifier::err();
+    let verifier_handle = verifier.clone();
+    let verifier_dyn: Arc<dyn NliVerifier> = verifier;
+    let pipeline =
+        RagPipeline::new(cfg, retriever_dyn, lm_dyn, env.sqlite.clone())
+            .with_verifier(verifier_dyn);
+
+    let answer = pipeline.ask("compound", multi_hop_opts()).unwrap();
+
+    assert!(!answer.grounded);
+    assert_eq!(
+        answer.refusal_reason,
+        Some(RefusalReason::NliModelUnavailable)
+    );
+    assert_eq!(verifier_handle.calls(), 1, "verifier was invoked once before failing");
+    assert!(
+        answer.verification.is_none(),
+        "NliModelUnavailable: can't summarize a verification that didn't happen"
+    );
+}
+
+#[test]
+fn multi_hop_truncate_for_nli_preserves_hypothesis() {
+    // Long premise (>1600 chars) gets truncated, short hypothesis is
+    // passed unchanged (signature placeholder for v0.18.1 token-budget
+    // version). MAX_NLI_PREMISE_CHARS = 4 * 400 = 1600.
+    let long_premise: String = "a".repeat(2000);
+    let (truncated, was_truncated) = truncate_for_nli(&long_premise, "short hypothesis");
+    assert!(was_truncated);
+    assert_eq!(
+        truncated.chars().count(),
+        1600,
+        "premise truncated to MAX_NLI_PREMISE_CHARS"
+    );
+
+    // Short premise (under budget): no truncation, `was_truncated = false`.
+    let short_premise = "short premise text";
+    let (passthrough, was_truncated) = truncate_for_nli(short_premise, "anything");
+    assert!(!was_truncated);
+    assert_eq!(passthrough, short_premise);
+
+    // Multi-byte safety: 1600 Korean chars (3 bytes each in UTF-8) fit
+    // within the char budget even though the byte length is 4800.
+    let kr_short: String = "가".repeat(1600);
+    let (passthrough_kr, was_truncated_kr) = truncate_for_nli(&kr_short, "h");
+    assert!(!was_truncated_kr, "1600 KR chars == budget, no truncation");
+    assert_eq!(passthrough_kr.chars().count(), 1600);
+
+    // Multi-byte over-budget: truncation must count chars, not bytes.
+    let kr_long: String = "가".repeat(2000);
+    let (truncated_kr, was_truncated_kr) = truncate_for_nli(&kr_long, "h");
+    assert!(was_truncated_kr);
+    assert_eq!(
+        truncated_kr.chars().count(),
+        1600,
+        "char-based truncation must not over-cut on multi-byte input"
+    );
+}
--- a/integrations/claude-code/kebab/SKILL.md
+++ b/integrations/claude-code/kebab/SKILL.md
@@ -88,6 +88,7 @@ Input:
 - For follow-up turns on the same topic, pass `session_id` (e.g. `"team-onboarding-2026-05"`) and reuse it across the conversation. Sessions persist until `kebab reset --data-only`.
 - p9-fb-40: 기본 `prompt_template_version = "rag-v2"`. 답변이 더 strict — fact 인용 시 verbatim span, 학습 지식 동원 금지, 근거 모호 시 "확실하지 않다" 출현 가능. user 가 `[rag] prompt_template_version = "rag-v1"` 명시 시 legacy 동작.
 - **p9-fb-41 `multi_hop: true`** — opt the ask into the multi-hop pipeline. The query is decomposed into sub-questions, each retrieved independently (LLM-driven decide loop, up to `rag.multi_hop_max_depth` iters), then synthesized over the merged chunk pool. Cost trade-off: 2–5× LLM calls vs. single-pass. **Use** for compound questions ("X 와 Y 의 차이는?", prereq chains, cross-doc reasoning where one chunk alone is insufficient). **Don't** for simple fact-finding (single-pass is faster + cheaper). When set, `answer.v1.hops[]` carries the per-hop trace (`{iter, kind, sub_queries[], context_chunks_added, forced_stop, llm_call_ms}`) — surface a brief "Searched in N hops" note when the trace is non-trivial. Decompose-failure (model emitted non-JSON) → `refusal_reason = "multi_hop_decompose_failed"`; treat like any other refusal.
+- **v0.18+ multi-hop NLI verification**: multi-hop ask (`mcp__kebab__ask` with `multi_hop: true`) runs a post-synthesize NLI groundedness gate when `[rag] nli_threshold > 0` is set in the user's config. `answer.v1.verification.nli_passed == true` means the generated answer is entailed by the retrieved chunks (grounded); `false` means the answer is refused with `refusal_reason = "nli_verification_failed"` and the `verification` block still ships so the agent can show what entailment score was rejected. Threshold tuning: the gate is off by default (`nli_threshold = 0`); once enabled, 0.5 is the production default and 0.9 is strict mode. If the NLI model download / inference fails, the pipeline emits `refusal_reason = "nli_model_unavailable"` with no `verification` block; the user-side workaround is to set `[rag] nli_threshold = 0` and retry multi-hop. Single-pass `ask` (multi_hop: false / unset) is unaffected and keeps the LLM self-judge gate as the only verification.

 ### `mcp__kebab__fetch` — when you need raw text