From 00ffe9c792b5101a8bf2fcfc6f0ed409c4881271 Mon Sep 17 00:00:00 2001
From: altair823
Date: Tue, 26 May 2026 00:55:02 +0000
Subject: [PATCH] =?UTF-8?q?feat(rag):=20fb-41=20PR-9c-2=20=E2=80=94=20pipe?=
 =?UTF-8?q?line=20integration=20+=20mock=20test=20+=20SKILL.md=20(?=
 =?UTF-8?q?=E2=98=85=20NLI=20=EC=8B=A4=20=ED=99=9C=EC=84=B1=ED=99=94)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Activates behavior on top of PR-9c-1's wire surface: the step 8.5 hook in `ask_multi_hop` now actually runs NLI verification when `[rag] nli_threshold > 0`. This is the **first user-visible behavior change** in PR-9.

- crates/kebab-rag/src/pipeline.rs:
  - ask_multi_hop step 8.5 NLI hook (empty-answer guard + truncate_for_nli + verifier.score + verification field + refusal branch).
  - refuse_nli_verification helper (verification: Some(...) + RefusalReason::NliVerificationFailed).
  - refuse_nli_model_unavailable helper (verification: None + RefusalReason::NliModelUnavailable).
  - truncate_for_nli helper (module-level pub fn; chars-based budget of MAX_NLI_PREMISE_CHARS = 4 * 400 = 1600 chars; _hypothesis is an unused placeholder, a candidate for the v0.18.1 token-budget update).
  - Removed the two #[allow(dead_code)] attributes from PR-9c-1 (verifier field + with_verifier builder; the transitional sentences in the docs are cleaned up as well). Closes the N1 carry-forward from the round-1 PR-9c-1 review.
- crates/kebab-app/src/app.rs:
  - NliVerifier construction in App::open_with_config: when config.rag.nli_threshold > 0, OnnxNliVerifier::new + Arc::new wrap, then with_verifier is called during the subsequent RagPipeline initialization. Failure propagates via ? (the Result signature is unchanged, so there are no cascading changes for callers).
  - Added the kebab-nli path dependency to kebab-app/Cargo.toml.
- crates/kebab-rag/tests/multi_hop.rs + tests/common/mod.rs:
  - MockNliVerifier (pass / fail / err constructors + call_count instrumentation on score).
  - multi_hop_nli_pass_keeps_grounded: entailment 0.9 → grounded=true, verification.nli_passed=true.
  - multi_hop_nli_fail_refuses: entailment 0.1 → refusal=NliVerificationFailed.
  - multi_hop_nli_disabled_skip_verify: threshold 0.0 → verify skipped, verification=None.
  - multi_hop_nli_model_unavailable_refuses: verifier Err → refusal=NliModelUnavailable.
  - multi_hop_truncate_for_nli_preserves_hypothesis: long premise truncation + hypothesis preserved.
- integrations/claude-code/kebab/SKILL.md: one paragraph of NLI guidance in the mcp__kebab__ask section (meaning of verification.nli_passed + threshold tuning + handling of the nli_verification_failed / nli_model_unavailable refusals).

Verification: cargo test --workspace -j 1: 5 new multi-hop tests pass, 0 regressions (the pre-existing kebab-mcp::tools_call_ask_multi_hop is flaky as before). cargo clippy --workspace --all-targets -j 1 -- -D warnings is clean.

Wire impact: *behavior wiring* for PR-9c-1's schema change. The answer.v1.verification field is now populated on both the multi-hop happy path and the refuse path.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 Cargo.lock                              |   1 +
 crates/kebab-app/Cargo.toml             |   5 +
 crates/kebab-app/src/app.rs             |  51 ++++-
 crates/kebab-rag/src/lib.rs             |   2 +-
 crates/kebab-rag/src/pipeline.rs        | 238 +++++++++++++++++++++---
 crates/kebab-rag/tests/common/mod.rs    |  73 +++++++-
 crates/kebab-rag/tests/multi_hop.rs     | 197 +++++++++++++++++++-
 integrations/claude-code/kebab/SKILL.md |   1 +
 8 files changed, 539 insertions(+), 29 deletions(-)

diff --git a/Cargo.lock b/Cargo.lock
index af56201..a0bb5de 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -4142,6 +4142,7 @@ dependencies = [
  "kebab-embed-local",
  "kebab-llm",
  "kebab-llm-local",
+ "kebab-nli",
  "kebab-normalize",
  "kebab-parse-code",
  "kebab-parse-image",
diff --git a/crates/kebab-app/Cargo.toml b/crates/kebab-app/Cargo.toml
index 5dfca4b..164e72c 100644
--- a/crates/kebab-app/Cargo.toml
+++ b/crates/kebab-app/Cargo.toml
@@ -23,6 +23,11 @@ kebab-embed-local = { path = "../kebab-embed-local" }
 kebab-llm = { path = "../kebab-llm" }
 kebab-llm-local = { path = "../kebab-llm-local" }
 kebab-rag = { path = "../kebab-rag" }
+# p9-fb-41 PR-9c-2: facade construction of OnnxNliVerifier when
+# `[rag] nli_threshold > 0`.
Trait-only consumption via kebab-rag's +# `with_verifier`; no kebab-nli internals leak into kebab-app code +# beyond the construction site in `open_with_config`. +kebab-nli = { path = "../kebab-nli" } # P6-4: image extractor + OCR + caption adapters live here. App # threads them into the per-asset dispatch (see `ingest_one_asset` # image branch). Trait-only consumption — no `kebab-parse-image` diff --git a/crates/kebab-app/src/app.rs b/crates/kebab-app/src/app.rs index 0b07e0f..7eabe82 100644 --- a/crates/kebab-app/src/app.rs +++ b/crates/kebab-app/src/app.rs @@ -133,6 +133,17 @@ pub struct App { /// `corpus_revision` snapshot embedded in `SearchCacheKey` /// invalidates every entry the moment a new ingest commit lands. search_cache: Option>>>, + /// p9-fb-41 PR-9c-2: NLI verifier built eagerly at + /// `open_with_config` time when `config.rag.nli_threshold > 0`, + /// consumed by `RagPipeline::with_verifier` on every `ask` / + /// `ask_with_session` call. `None` when the gate is disabled + /// (default, threshold = 0) — multi-hop skips step 8.5 entirely + /// and single-pass never touches the verifier. + /// + /// Built eagerly (not lazy) so the `open_with_config` `?` + /// propagation surfaces NLI model construction errors at App + /// boot time, before any user query runs. + pipeline_verifier: Option>, } /// p9-fb-19: cache key for `App::search`. Includes every field that @@ -193,6 +204,21 @@ impl App { // `None` (cache disabled — every search hits the retrievers). let search_cache = NonZeroUsize::new(config.search.cache_capacity) .map(|cap| Mutex::new(LruCache::new(cap))); + // p9-fb-41 PR-9c-2: build the NLI verifier when the gate is + // enabled. App carries it on `RagPipeline` via + // `with_verifier` so the rag crate doesn't have to know about + // kebab-nli construction. Failure (`?`) surfaces as a user- + // facing error at App boot — never a panic in the pipeline's + // `expect("verifier must be Some when nli_threshold > 0.0")`. 
+ let pipeline_verifier: Option> = + if config.rag.nli_threshold > 0.0 { + let v = kebab_nli::OnnxNliVerifier::new(&config).context( + "kebab-app: construct OnnxNliVerifier (config.rag.nli_threshold > 0)", + )?; + Some(Arc::new(v)) + } else { + None + }; Ok(Self { config, sqlite: Arc::new(sqlite), @@ -200,6 +226,7 @@ impl App { vector: OnceLock::new(), llm: OnceLock::new(), search_cache, + pipeline_verifier, }) } @@ -553,9 +580,26 @@ impl App { pub fn ask(&self, query: &str, opts: AskOpts) -> Result { let retriever = self.build_retriever(opts.mode)?; let llm = self.llm()?; + let pipeline = self.build_pipeline(retriever, llm); + pipeline.ask(query, opts) + } + + /// p9-fb-41 PR-9c-2: shared pipeline builder used by [`Self::ask`] + /// and [`Self::ask_with_session`]. Attaches the App-built NLI + /// verifier (when `cfg.rag.nli_threshold > 0`) via + /// `RagPipeline::with_verifier`, keeping the construction site in + /// a single place so the two call paths can't drift. + fn build_pipeline( + &self, + retriever: Arc, + llm: Arc, + ) -> RagPipeline { let pipeline = RagPipeline::new(self.config.clone(), retriever, llm, self.sqlite.clone()); - pipeline.ask(query, opts) + match &self.pipeline_verifier { + Some(v) => pipeline.with_verifier(v.clone()), + None => pipeline, + } } /// p9-fb-18: shared retriever-stack builder used by [`Self::ask`] @@ -660,10 +704,11 @@ impl App { // p9-fb-18 R1: shared retriever builder removes the prior // copy of `ask`'s 35-line stack — see [`Self::build_retriever`]. + // p9-fb-41 PR-9c-2: shared `build_pipeline` attaches the NLI + // verifier when the gate is enabled. 
let retriever = self.build_retriever(opts.mode)?; let llm = self.llm()?; - let pipeline = - RagPipeline::new(self.config.clone(), retriever, llm, self.sqlite.clone()); + let pipeline = self.build_pipeline(retriever, llm); let answer = pipeline.ask_with_history( query, history, diff --git a/crates/kebab-rag/src/lib.rs b/crates/kebab-rag/src/lib.rs index e527dc0..1cc5328 100644 --- a/crates/kebab-rag/src/lib.rs +++ b/crates/kebab-rag/src/lib.rs @@ -22,4 +22,4 @@ pub use kebab_core::{Answer, AnswerCitation, AnswerRetrievalSummary, RefusalReas mod pipeline; -pub use pipeline::{AskOpts, RagPipeline, StreamEvent}; +pub use pipeline::{AskOpts, MAX_NLI_PREMISE_CHARS, RagPipeline, StreamEvent, truncate_for_nli}; diff --git a/crates/kebab-rag/src/pipeline.rs b/crates/kebab-rag/src/pipeline.rs index 8d77eec..1a4b648 100644 --- a/crates/kebab-rag/src/pipeline.rs +++ b/crates/kebab-rag/src/pipeline.rs @@ -42,7 +42,7 @@ use kebab_core::{ Answer, AnswerCitation, AnswerRetrievalSummary, Citation, FinishReason, GenerateRequest, HopKind, HopRecord, LanguageModel, ModelRef, RefusalReason, Retriever, SearchFilters, SearchHit, SearchMode, SearchQuery, TokenChunk, - TokenUsage, TraceId, Turn, + TokenUsage, TraceId, Turn, VerificationSummary, }; use kebab_core::versions::PromptTemplateVersion; use kebab_store_sqlite::SqliteStore; @@ -197,13 +197,11 @@ pub struct RagPipeline { retriever: Arc, llm: Arc, docs: Arc, - /// p9-fb-41 PR-9c-1: optional NLI verifier injected via - /// [`Self::with_verifier`]. Not yet read — PR-9c-2 wires the - /// `ask_multi_hop` step 8.5 (post-synthesize gate) that consumes - /// it. Until then the field is `#[allow(dead_code)]`; the - /// attribute is removed in the PR-9c-2 commit that adds the - /// read site so leftover dead code can never sneak in. - #[allow(dead_code)] + /// p9-fb-41 PR-9c-1/PR-9c-2: optional NLI verifier injected via + /// [`Self::with_verifier`]. 
Consumed by `ask_multi_hop` step 8.5 + /// (post-synthesize gate) when `cfg.rag.nli_threshold > 0`. + /// `None` when the gate is disabled — single-pass `ask` never + /// touches this field. verifier: Option>, } @@ -231,17 +229,12 @@ impl RagPipeline { } } - /// p9-fb-41 PR-9c-1: inject the post-synthesize NLI verifier. - /// Caller (kebab-app facade, PR-9c-2) builds an + /// p9-fb-41 PR-9c-1/PR-9c-2: inject the post-synthesize NLI + /// verifier. Caller (kebab-app facade) builds an /// `Arc` from `cfg.models.nli` when /// `cfg.rag.nli_threshold > 0`, then chains - /// `RagPipeline::new(...).with_verifier(v)`. - /// - /// Currently unused — PR-9c-2 wires the read site (step 8.5 of - /// `ask_multi_hop`). `#[allow(dead_code)]` survives only until - /// that PR's commit, which removes it together with adding the - /// hook that reads `self.verifier`. - #[allow(dead_code)] + /// `RagPipeline::new(...).with_verifier(v)`. Consumed by + /// `ask_multi_hop` step 8.5 (post-synthesize gate). pub fn with_verifier(mut self, v: Arc) -> Self { self.verifier = Some(v); self @@ -1031,6 +1024,48 @@ impl RagPipeline { (false, Some(RefusalReason::LlmSelfJudge)) }; + // ── 8.5 NLI groundedness verification (multi-hop only, v0.18) ───── + // spec §2.7: single-pass `ask` keeps the LlmSelfJudge gate as-is; + // NLI verification is multi-hop only this round. + // + // Empty answer guard: if synthesize bailed (stream abort / LM + // crash), `acc` is empty. That path has its own refusal + // (LlmStreamAborted) above; skipping the NLI gate here avoids + // tokenizing an empty hypothesis (degenerate CLS-SEP-SEP that + // would yield a near-uniform softmax and a misleading nli_passed). 
+ let verification = if self.config.rag.nli_threshold > 0.0 && !acc.trim().is_empty() { + let v = self.verifier.as_ref().expect( + "verifier must be Some when nli_threshold > 0.0 \ + (kebab-app's open_with_config enforces this invariant)", + ); + let (truncated_premise, _was_truncated) = truncate_for_nli(&packed_text, &acc); + match v.score(&truncated_premise, &acc) { + Ok(scores) => { + let passed = scores.entailment >= self.config.rag.nli_threshold; + Some(VerificationSummary { + nli_score: scores.entailment, + nli_threshold: self.config.rag.nli_threshold, + nli_passed: passed, + }) + } + Err(e) => { + tracing::warn!( + target: "kebab-rag", + error = %e, + "NLI verifier failed (model unavailable / inference err); refusing" + ); + return self.refuse_nli_model_unavailable(query, &opts, hops, started); + } + } + } else { + None + }; + if let Some(v) = &verification + && !v.nli_passed + { + return self.refuse_nli_verification(query, &opts, hops, *v, started); + } + // ── 8. Build Answer ──────────────────────────────────────────────── let cited_set: std::collections::BTreeSet = extracted.iter().copied().collect(); let citations: Vec = packed_entries @@ -1101,11 +1136,10 @@ impl RagPipeline { // currently lose the trace (cleanup deferred — would // require widening helper signatures, PR-3b-ii / follow-up). hops: Some(hops), - // p9-fb-41 PR-9c-1: surface-only field — PR-9c-2 wires - // step 8.5 between citation-validate and Answer-build to - // stamp this with the actual NLI score when - // `cfg.rag.nli_threshold > 0`. Until then, stays None. - verification: None, + // p9-fb-41 PR-9c-2: step 8.5 stamped this when + // `cfg.rag.nli_threshold > 0`. None when the gate is + // disabled (default). + verification, }; tracing::debug!( @@ -1554,6 +1588,137 @@ impl RagPipeline { } Ok(answer) } + + /// p9-fb-41 PR-9c-2: refusal path for step 8.5 NLI gate failure — + /// `RefusalReason::NliVerificationFailed`. 
The synthesized answer + /// existed (acc was non-empty) but the entailment score fell below + /// `cfg.rag.nli_threshold`. We stamp the `VerificationSummary` on + /// the wire so the user can see what score was rejected. + fn refuse_nli_verification( + &self, + query: &str, + opts: &AskOpts, + hops: Vec, + v: VerificationSummary, + started: std::time::Instant, + ) -> Result { + let elapsed_ms = u32::try_from(started.elapsed().as_millis()).unwrap_or(u32::MAX); + let trace_id = mint_trace_id(query, 0.0, &self.llm.model_ref().id); + let k_effective = opts.k.max(self.config.search.default_k); + let answer = Answer { + answer: "근거 부족. 생성된 답변이 검색된 문서 내용에 충분히 entail 되지 않음." + .to_string(), + citations: Vec::new(), + grounded: false, + refusal_reason: Some(RefusalReason::NliVerificationFailed), + model: self.llm.model_ref(), + embedding: embedding_ref_for(opts.mode, &self.config), + prompt_template_version: PromptTemplateVersion( + PROMPT_TEMPLATE_VERSION_MULTI_HOP.to_string(), + ), + retrieval: AnswerRetrievalSummary { + trace_id, + mode: opts.mode, + k: k_effective, + score_gate: self.config.rag.score_gate, + top_score: 0.0, + chunks_returned: 0, + chunks_used: 0, + }, + usage: TokenUsage { + prompt_tokens: 0, + completion_tokens: 0, + latency_ms: elapsed_ms, + }, + created_at: OffsetDateTime::now_utc(), + conversation_id: opts.conversation_id.clone(), + turn_index: opts.turn_index, + // PR-9c-2: NLI refusal still carries the hop trace built + // up to step 8.5 — synthesize ran, so the trace is the + // full decompose+decide chain (terminal Synthesize hop is + // NOT appended for the refusal path; cleanup deferred to + // a follow-up if the user-visible trace shape needs the + // synthesize entry). 
+ hops: Some(hops), + verification: Some(v), + }; + if let Some(sink) = &opts.stream_sink { + let _ = sink.send(StreamEvent::Final { + answer: answer.clone(), + }); + } + if let Err(e) = self.docs.put_answer(&answer, query, None) { + tracing::warn!( + target: "kebab-rag", + error = %e, + "kb-rag: put_answer (NliVerificationFailed) failed" + ); + } + Ok(answer) + } + + /// p9-fb-41 PR-9c-2: refusal path for step 8.5 NLI model + /// unavailable — `RefusalReason::NliModelUnavailable`. The verifier + /// raised an inference/download error so we cannot summarize the + /// verification result; `verification` is `None`. Treat as a soft + /// refusal — the user can opt out by setting `[rag] nli_threshold + /// = 0` and retrying. + fn refuse_nli_model_unavailable( + &self, + query: &str, + opts: &AskOpts, + hops: Vec, + started: std::time::Instant, + ) -> Result { + let elapsed_ms = u32::try_from(started.elapsed().as_millis()).unwrap_or(u32::MAX); + let trace_id = mint_trace_id(query, 0.0, &self.llm.model_ref().id); + let k_effective = opts.k.max(self.config.search.default_k); + let answer = Answer { + answer: "근거 부족. NLI 검증 모델을 사용할 수 없음 — `[rag] nli_threshold = 0` 으로 비활성화 후 재시도 가능." + .to_string(), + citations: Vec::new(), + grounded: false, + refusal_reason: Some(RefusalReason::NliModelUnavailable), + model: self.llm.model_ref(), + embedding: embedding_ref_for(opts.mode, &self.config), + prompt_template_version: PromptTemplateVersion( + PROMPT_TEMPLATE_VERSION_MULTI_HOP.to_string(), + ), + retrieval: AnswerRetrievalSummary { + trace_id, + mode: opts.mode, + k: k_effective, + score_gate: self.config.rag.score_gate, + top_score: 0.0, + chunks_returned: 0, + chunks_used: 0, + }, + usage: TokenUsage { + prompt_tokens: 0, + completion_tokens: 0, + latency_ms: elapsed_ms, + }, + created_at: OffsetDateTime::now_utc(), + conversation_id: opts.conversation_id.clone(), + turn_index: opts.turn_index, + hops: Some(hops), + // No VerificationSummary — verification didn't happen. 
+ verification: None, + }; + if let Some(sink) = &opts.stream_sink { + let _ = sink.send(StreamEvent::Final { + answer: answer.clone(), + }); + } + if let Err(e) = self.docs.put_answer(&answer, query, None) { + tracing::warn!( + target: "kebab-rag", + error = %e, + "kb-rag: put_answer (NliModelUnavailable) failed" + ); + } + Ok(answer) + } } // ── Helpers ──────────────────────────────────────────────────────────────── @@ -1623,6 +1788,35 @@ pub(crate) const PROMPT_TEMPLATE_VERSION_MULTI_HOP: &str = "rag-multi-hop-v1"; /// the same PR. pub(crate) const MULTI_HOP_MAX_SUB_QUERIES_HARD_CAP: usize = 10; +/// p9-fb-41 PR-9c-2: premise budget for NLI input. mDeBERTa-v3's +/// positional embedding caps at 512 tokens; with the hypothesis + +/// special-token budget reserved (~32 chars conservative), the +/// premise gets ≈1600 chars at 4 chars/token (English BPE baseline). +/// Korean SentencePiece is denser (1-2 char/token) — the tokenizer's +/// `OnlyFirst` strategy (configured in kebab-nli) is the backup +/// truncation when the char-based budget still overflows the token +/// limit. v0.18.1 candidate: token-count-based budget once we have +/// measured KR truncation rates from dogfood retest. +pub const MAX_NLI_PREMISE_CHARS: usize = 4 * 400; + +/// p9-fb-41 PR-9c-2: truncate `premise` to fit the NLI input budget +/// while preserving `hypothesis` in full. Returns `(truncated_premise, +/// was_truncated)`. `was_truncated` is informational for tracing — +/// the v0.18 wire doesn't surface it; a v0.19+ extension might. +/// +/// `_hypothesis` is currently unused — placeholder for the v0.18.1 +/// token-budget version that would carve the budget *around* the +/// hypothesis. Kept on the signature to preserve the contract from +/// spec §2.2.3 / spec §3 PR-9c-2. 
+pub fn truncate_for_nli(premise: &str, _hypothesis: &str) -> (String, bool) { + if premise.chars().count() <= MAX_NLI_PREMISE_CHARS { + (premise.to_string(), false) + } else { + let truncated: String = premise.chars().take(MAX_NLI_PREMISE_CHARS).collect(); + (truncated, true) + } +} + const MULTI_HOP_DECOMPOSE_SYSTEM_PROMPT: &str = "당신은 사용자의 질문을 다단계 검색에 필요한 sub-question 들로 분해하는 도구다.\n- multi-hop 정보가 필요한 경우 독립적으로 검색 가능한 sub-question 들로 분해한다.\n- 각 sub-question 은 자기 자신만으로 의미가 통해야 한다 (대명사 / \"위 답변\" 같은 reference 금지).\n- 원본이 이미 단순하면 원본 그대로 1 개만 반환한다.\n- 응답은 JSON array of strings 만 출력한다. 다른 prose / markdown fence / 설명 금지."; const MULTI_HOP_DECIDE_SYSTEM_PROMPT: &str = "당신은 multi-hop 검색의 매 iter 에서 \"추가 retrieval 이 필요한가?\" 를 판단하는 도구다.\n- 지금까지 모은 [근거] 가 [원본 질문] 의 모든 측면을 cover 하는지 평가한다.\n- 추가가 필요하면 새 sub-question 들 (이미 모은 정보로 답할 수 없는 부분만, 독립적으로 검색 가능한 형태로) 을 JSON array of strings 로 반환한다.\n- 충분하면 빈 array `[]` 를 반환한다.\n- 응답은 JSON array of strings 만 출력한다. 다른 prose / markdown fence / 설명 금지.\n- 각 sub-question 은 자기 자신만으로 의미가 통해야 한다 (대명사 / \"위 답변\" 같은 reference 금지)."; diff --git a/crates/kebab-rag/tests/common/mod.rs b/crates/kebab-rag/tests/common/mod.rs index acd9429..6d64c0b 100644 --- a/crates/kebab-rag/tests/common/mod.rs +++ b/crates/kebab-rag/tests/common/mod.rs @@ -12,13 +12,14 @@ #![allow(dead_code)] -use std::sync::Arc; +use std::sync::{Arc, Mutex}; use kebab_config::Config; use kebab_core::{ ChunkerVersion, ChunkId, Citation, DocumentId, IndexVersion, RetrievalDetail, Retriever, SearchHit, SearchMode, SearchQuery, WorkspacePath, }; +use kebab_nli::{NliScores, NliVerifier}; use kebab_store_sqlite::SqliteStore; use rusqlite::params; use tempfile::TempDir; @@ -384,3 +385,73 @@ impl kebab_core::LanguageModel for ScriptedLm { Ok(Box::new(chunks.into_iter().map(Ok))) } } + +/// p9-fb-41 PR-9c-2: mock NLI verifier for multi-hop step 8.5 tests. 
+/// Three constructors mirror the test scenarios: +/// - [`MockNliVerifier::pass`] — high entailment score (0.9), `nli_passed` +/// is true at the production default threshold (0.5). +/// - [`MockNliVerifier::fail`] — low entailment (0.1), refuses at any +/// threshold > 0.1. +/// - [`MockNliVerifier::err`] — returns an `anyhow::Error` so the pipeline +/// surfaces `RefusalReason::NliModelUnavailable`. +/// +/// `call_count` instrumented (Mutex-wrapped usize) so a test can assert +/// the verifier ran the expected number of times — useful for pinning +/// the "threshold = 0 skips verify" invariant when the verifier is +/// nonetheless attached to the pipeline. +pub struct MockNliVerifier { + pub mode: MockMode, + pub call_count: Mutex, +} + +pub enum MockMode { + /// Return these scores. Used by pass / fail variants. + Scores(NliScores), + /// Return an `anyhow::Error`. Used by the err variant. + Err(String), +} + +impl MockNliVerifier { + pub fn pass() -> Arc { + Arc::new(Self { + mode: MockMode::Scores(NliScores { + entailment: 0.9, + neutral: 0.07, + contradiction: 0.03, + }), + call_count: Mutex::new(0), + }) + } + + pub fn fail() -> Arc { + Arc::new(Self { + mode: MockMode::Scores(NliScores { + entailment: 0.1, + neutral: 0.4, + contradiction: 0.5, + }), + call_count: Mutex::new(0), + }) + } + + pub fn err() -> Arc { + Arc::new(Self { + mode: MockMode::Err("mock NLI unavailable".into()), + call_count: Mutex::new(0), + }) + } + + pub fn calls(&self) -> usize { + *self.call_count.lock().unwrap() + } +} + +impl NliVerifier for MockNliVerifier { + fn score(&self, _premise: &str, _hypothesis: &str) -> anyhow::Result { + *self.call_count.lock().unwrap() += 1; + match &self.mode { + MockMode::Scores(s) => Ok(*s), + MockMode::Err(e) => anyhow::bail!("{e}"), + } + } +} diff --git a/crates/kebab-rag/tests/multi_hop.rs b/crates/kebab-rag/tests/multi_hop.rs index c8d042a..ae21e64 100644 --- a/crates/kebab-rag/tests/multi_hop.rs +++ b/crates/kebab-rag/tests/multi_hop.rs 
@@ -26,9 +26,10 @@ mod common; use std::sync::Arc; -use common::{RagEnv, ScriptedLm, ScriptedRetriever, id32, mk_hit}; +use common::{MockNliVerifier, RagEnv, ScriptedLm, ScriptedRetriever, id32, mk_hit}; use kebab_core::{HopKind, LanguageModel, RefusalReason, Retriever, SearchMode}; -use kebab_rag::{AskOpts, RagPipeline}; +use kebab_nli::NliVerifier; +use kebab_rag::{AskOpts, RagPipeline, truncate_for_nli}; /// Default `AskOpts` for multi-hop tests: deterministic seed, /// lexical mode (so the test crate doesn't need to wire up an @@ -618,3 +619,195 @@ fn multi_hop_above_probe_gate_proceeds_to_decompose() { let hops = answer.hops.expect("happy path stamps hops"); assert_eq!(hops.len(), 3); } + +// ── p9-fb-41 PR-9c-2: step 8.5 NLI verification tests ────────────────────── +// +// Five tests pin the NLI hook on the multi-hop path: +// 1. `multi_hop_nli_pass_keeps_grounded` — entailment 0.9 ≥ threshold 0.5 → +// happy path, `verification.nli_passed = true`. +// 2. `multi_hop_nli_fail_refuses` — entailment 0.1 < threshold 0.5 → +// refusal with `RefusalReason::NliVerificationFailed` + verification stamp. +// 3. `multi_hop_nli_disabled_skip_verify` — threshold 0.0 → verify skipped, +// `Answer.verification` stays `None` (no verifier attached). +// 4. `multi_hop_nli_model_unavailable_refuses` — verifier returns `Err` → +// refusal with `RefusalReason::NliModelUnavailable` + `verification = None`. +// 5. `multi_hop_truncate_for_nli_preserves_hypothesis` — pure unit test on +// `truncate_for_nli`'s char-budget contract. + +/// Helper to build a "valid multi-hop happy-path" scenario where probe + +/// decompose retrieves the same single chunk, decompose emits one +/// sub-query, decide signals stop, and synthesize produces a cited +/// answer. Returns the seeded `RagEnv`, scripted retriever (so the +/// test can assert call count), and scripted LM with the 3-call +/// script ready. 
+fn happy_multi_hop_env() -> (RagEnv, Arc, Arc) { + let env = RagEnv::new(); + let cid = id32("c1"); + let did = id32("d1"); + env.seed_chunk(&cid, &did, "notes/a.md", "Body text.", &["Intro"]); + let hits = vec![mk_hit(1, &cid, &did, "notes/a.md", 0.85, &["Intro"])]; + let retriever = Arc::new(ScriptedRetriever::new(vec![hits.clone(), hits])); + let lm = Arc::new(ScriptedLm::new(vec![ + r#"["q1"]"#, + r#"[]"#, + "answer body [#1]", + ])); + (env, retriever, lm) +} + +#[test] +fn multi_hop_nli_pass_keeps_grounded() { + let (env, retriever, lm) = happy_multi_hop_env(); + let mut cfg = env.config.clone(); + cfg.rag.nli_threshold = 0.5; + + let retriever_dyn: Arc = retriever; + let lm_dyn: Arc = lm; + let verifier = MockNliVerifier::pass(); + let verifier_handle = verifier.clone(); + let verifier_dyn: Arc = verifier; + let pipeline = + RagPipeline::new(cfg, retriever_dyn, lm_dyn, env.sqlite.clone()) + .with_verifier(verifier_dyn); + + let answer = pipeline.ask("compound", multi_hop_opts()).unwrap(); + + assert!(answer.grounded, "NLI-pass synthesize must stay grounded"); + assert_eq!(answer.refusal_reason, None); + assert_eq!( + verifier_handle.calls(), + 1, + "verifier called exactly once on the synthesized answer" + ); + let v = answer + .verification + .expect("nli_threshold > 0 stamps Some(verification)"); + assert!(v.nli_passed, "entailment 0.9 ≥ threshold 0.5"); + assert!((v.nli_score - 0.9).abs() < 1e-5, "got: {}", v.nli_score); + assert!((v.nli_threshold - 0.5).abs() < 1e-5); +} + +#[test] +fn multi_hop_nli_fail_refuses() { + let (env, retriever, lm) = happy_multi_hop_env(); + let mut cfg = env.config.clone(); + cfg.rag.nli_threshold = 0.5; + + let retriever_dyn: Arc = retriever; + let lm_dyn: Arc = lm; + let verifier = MockNliVerifier::fail(); + let verifier_handle = verifier.clone(); + let verifier_dyn: Arc = verifier; + let pipeline = + RagPipeline::new(cfg, retriever_dyn, lm_dyn, env.sqlite.clone()) + .with_verifier(verifier_dyn); + + let answer = 
pipeline.ask("compound", multi_hop_opts()).unwrap(); + + assert!(!answer.grounded); + assert_eq!( + answer.refusal_reason, + Some(RefusalReason::NliVerificationFailed) + ); + assert_eq!(verifier_handle.calls(), 1); + let v = answer + .verification + .expect("refusal still stamps verification summary"); + assert!(!v.nli_passed, "entailment 0.1 < threshold 0.5"); + assert!((v.nli_score - 0.1).abs() < 1e-5, "got: {}", v.nli_score); +} + +#[test] +fn multi_hop_nli_disabled_skip_verify() { + let (env, retriever, lm) = happy_multi_hop_env(); + // Default config keeps `nli_threshold = 0.0` — gate disabled. No + // verifier is attached to the pipeline; the hook short-circuits + // entirely (`Answer.verification` stays `None`). + let cfg = env.config.clone(); + assert!( + (cfg.rag.nli_threshold - 0.0).abs() < f32::EPSILON, + "default nli_threshold must be 0.0 (gate disabled)" + ); + + let retriever_dyn: Arc = retriever; + let lm_dyn: Arc = lm; + // No `with_verifier` call — pipeline.verifier stays None. 
+ let pipeline = + RagPipeline::new(cfg, retriever_dyn, lm_dyn, env.sqlite.clone()); + + let answer = pipeline.ask("compound", multi_hop_opts()).unwrap(); + + assert!(answer.grounded); + assert_eq!(answer.refusal_reason, None); + assert!( + answer.verification.is_none(), + "threshold = 0.0 must skip step 8.5 and leave verification = None" + ); +} + +#[test] +fn multi_hop_nli_model_unavailable_refuses() { + let (env, retriever, lm) = happy_multi_hop_env(); + let mut cfg = env.config.clone(); + cfg.rag.nli_threshold = 0.5; + + let retriever_dyn: Arc = retriever; + let lm_dyn: Arc = lm; + let verifier = MockNliVerifier::err(); + let verifier_handle = verifier.clone(); + let verifier_dyn: Arc = verifier; + let pipeline = + RagPipeline::new(cfg, retriever_dyn, lm_dyn, env.sqlite.clone()) + .with_verifier(verifier_dyn); + + let answer = pipeline.ask("compound", multi_hop_opts()).unwrap(); + + assert!(!answer.grounded); + assert_eq!( + answer.refusal_reason, + Some(RefusalReason::NliModelUnavailable) + ); + assert_eq!(verifier_handle.calls(), 1, "verifier was invoked once before failing"); + assert!( + answer.verification.is_none(), + "NliModelUnavailable: can't summarize a verification that didn't happen" + ); +} + +#[test] +fn multi_hop_truncate_for_nli_preserves_hypothesis() { + // Long premise (>1600 chars) gets truncated, short hypothesis is + // passed unchanged (signature placeholder for v0.18.1 token-budget + // version). MAX_NLI_PREMISE_CHARS = 4 * 400 = 1600. + let long_premise: String = "a".repeat(2000); + let (truncated, was_truncated) = truncate_for_nli(&long_premise, "short hypothesis"); + assert!(was_truncated); + assert_eq!( + truncated.chars().count(), + 1600, + "premise truncated to MAX_NLI_PREMISE_CHARS" + ); + + // Short premise (under budget): no truncation, `was_truncated = false`. 
+ let short_premise = "short premise text"; + let (passthrough, was_truncated) = truncate_for_nli(short_premise, "anything"); + assert!(!was_truncated); + assert_eq!(passthrough, short_premise); + + // Multi-byte safety: 1600 Korean chars (3 bytes each in UTF-8) fits + // within the char budget even though byte length exceeds 4800. + let kr_short: String = "가".repeat(1600); + let (passthrough_kr, was_truncated_kr) = truncate_for_nli(&kr_short, "h"); + assert!(!was_truncated_kr, "1600 KR chars == budget, no truncation"); + assert_eq!(passthrough_kr.chars().count(), 1600); + + // Multi-byte over-budget: truncation must count chars, not bytes. + let kr_long: String = "가".repeat(2000); + let (truncated_kr, was_truncated_kr) = truncate_for_nli(&kr_long, "h"); + assert!(was_truncated_kr); + assert_eq!( + truncated_kr.chars().count(), + 1600, + "char-based truncation must not over-cut on multi-byte input" + ); +} diff --git a/integrations/claude-code/kebab/SKILL.md b/integrations/claude-code/kebab/SKILL.md index 3a3359f..42e6d28 100644 --- a/integrations/claude-code/kebab/SKILL.md +++ b/integrations/claude-code/kebab/SKILL.md @@ -88,6 +88,7 @@ Input: - For follow-up turns on the same topic, pass `session_id` (e.g. `"team-onboarding-2026-05"`) and reuse it across the conversation. Sessions persist until `kebab reset --data-only`. - p9-fb-40: 기본 `prompt_template_version = "rag-v2"`. 답변이 더 strict — fact 인용 시 verbatim span, 학습 지식 동원 금지, 근거 모호 시 "확실하지 않다" 출현 가능. user 가 `[rag] prompt_template_version = "rag-v1"` 명시 시 legacy 동작. - **p9-fb-41 `multi_hop: true`** — opt the ask into the multi-hop pipeline. The query is decomposed into sub-questions, each retrieved independently (LLM-driven decide loop, up to `rag.multi_hop_max_depth` iters), then synthesized over the merged chunk pool. Cost trade-off: 2–5× LLM calls vs. single-pass. **Use** for compound questions ("X 와 Y 의 차이는?", prereq chains, cross-doc reasoning where one chunk alone is insufficient). 
**Don't** for simple fact-finding (single-pass is faster + cheaper). When set, `answer.v1.hops[]` carries the per-hop trace (`{iter, kind, sub_queries[], context_chunks_added, forced_stop, llm_call_ms}`) — surface a brief "Searched in N hops" note when the trace is non-trivial. Decompose-failure (model emitted non-JSON) → `refusal_reason = "multi_hop_decompose_failed"`; treat like any other refusal. +- **v0.18+ multi-hop NLI verification** — multi-hop ask (`mcp__kebab__ask` with `multi_hop: true`) runs a post-synthesize NLI groundedness gate when `[rag] nli_threshold > 0` is set in the user's config. `answer.v1.verification.nli_passed == true` means the generated answer is entailed by the retrieved chunks (grounded); `false` means the answer is refused with `refusal_reason = "nli_verification_failed"` and the `verification` block still ships so the agent can show what entailment score was rejected. Threshold tuning: 0.5 is the production default, 0.9 is strict mode. If the NLI model download / inference fails the pipeline emits `refusal_reason = "nli_model_unavailable"` — user-side workaround is `[rag] nli_threshold = 0` then retry multi-hop. Single-pass `ask` (multi_hop: false / unset) is unaffected — it keeps the LLM self-judge gate as the only verification. ### `mcp__kebab__fetch` — when you need raw text -- 2.49.1
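
Reviewer note (illustrative only, not part of the patch): the step 8.5 gate that this patch adds to `ask_multi_hop` can be read as a small decision function. The sketch below is a simplified, self-contained model of that control flow, assuming a plain closure in place of the `NliVerifier` trait. `GateOutcome`, `Verification`, `truncate_premise`, and `nli_gate` are hypothetical names introduced here for illustration; they are not identifiers from the kebab crates.

```rust
// Hypothetical stand-in for the patch's `VerificationSummary`.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Verification {
    nli_score: f32,
    nli_threshold: f32,
    nli_passed: bool,
}

// Hypothetical summary of the four outcomes the patch distinguishes.
#[derive(Debug, PartialEq)]
enum GateOutcome {
    /// Threshold is 0 (gate disabled) or the synthesized answer is empty.
    Skipped,
    /// Entailment >= threshold: the answer proceeds with a verification stamp.
    Passed(Verification),
    /// Entailment < threshold: maps to `NliVerificationFailed` in the patch.
    Refused(Verification),
    /// Scorer returned an error: maps to `NliModelUnavailable` in the patch.
    ModelUnavailable,
}

// Same value as the patch's constant: 4 chars/token * 400 tokens.
const MAX_NLI_PREMISE_CHARS: usize = 4 * 400;

// Char-based (not byte-based) truncation, so multi-byte text is never
// split inside a code point.
fn truncate_premise(premise: &str) -> (String, bool) {
    if premise.chars().count() <= MAX_NLI_PREMISE_CHARS {
        (premise.to_string(), false)
    } else {
        (premise.chars().take(MAX_NLI_PREMISE_CHARS).collect(), true)
    }
}

// `score` is an assumed closure returning the entailment probability.
fn nli_gate<F>(threshold: f32, premise: &str, answer: &str, score: F) -> GateOutcome
where
    F: Fn(&str, &str) -> Result<f32, String>,
{
    let enabled = threshold > 0.0 && !answer.trim().is_empty();
    if !enabled {
        return GateOutcome::Skipped;
    }
    let (premise, _was_truncated) = truncate_premise(premise);
    match score(&premise, answer) {
        Ok(entailment) => {
            let v = Verification {
                nli_score: entailment,
                nli_threshold: threshold,
                nli_passed: entailment >= threshold,
            };
            if v.nli_passed {
                GateOutcome::Passed(v)
            } else {
                GateOutcome::Refused(v)
            }
        }
        Err(_) => GateOutcome::ModelUnavailable,
    }
}
```

Under these assumptions, a call such as `nli_gate(0.5, packed_text, answer, scorer)` mirrors the four mock-verifier tests in `multi_hop.rs`: a disabled threshold or empty answer skips the scorer entirely, and a scorer error becomes a refusal outcome rather than a propagated error.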