feat(rag): fb-41 PR-9c-2: pipeline integration + mock test + SKILL.md (★ NLI actually enabled)

Activates behavior on top of the wire surface from PR-9c-1: the step 8.5 hook in `ask_multi_hop` now actually runs NLI verification when `[rag] nli_threshold > 0`. This is the **first user-visible behavior change** in PR-9.

- crates/kebab-rag/src/pipeline.rs:
  - ask_multi_hop step 8.5 NLI hook (empty-answer guard + truncate_for_nli + verifier.score + verification field + refusal branch).
  - refuse_nli_verification helper (verification: Some(...) + RefusalReason::NliVerificationFailed).
  - refuse_nli_model_unavailable helper (verification: None + RefusalReason::NliModelUnavailable).
  - truncate_for_nli helper (module-level pub fn; chars-based budget of MAX_NLI_PREMISE_CHARS = 4 * 400 = 1600 chars; `_hypothesis` is an unused placeholder and a candidate for the v0.18.1 token-budget update).
  - Removed the two #[allow(dead_code)] attributes left by PR-9c-1 (verifier field + with_verifier builder; the transitional sentences in the docs are cleaned up as well). Closes the N1 carry-forward from the round-1 PR-9c-1 review.
- crates/kebab-app/src/app.rs:
  - NliVerifier construction in App::open_with_config: when config.rag.nli_threshold > 0, call OnnxNliVerifier::new, wrap it in Arc::new, and call with_verifier when the RagPipeline is initialized later. Failure propagates via ? (the signature stays Result<Self>, so callers need no cascading changes).
  - Added the kebab-nli path dependency to kebab-app/Cargo.toml.
- crates/kebab-rag/tests/multi_hop.rs + tests/common/mod.rs:
  - MockNliVerifier (pass / fail / err constructors + instrumented score call_count).
  - multi_hop_nli_pass_keeps_grounded: entailment 0.9 → grounded=true, verification.nli_passed=true.
  - multi_hop_nli_fail_refuses: entailment 0.1 → refusal=NliVerificationFailed.
  - multi_hop_nli_disabled_skip_verify: threshold 0.0 → verify skipped, verification=None.
  - multi_hop_nli_model_unavailable_refuses: verifier Err → refusal=NliModelUnavailable.
  - multi_hop_truncate_for_nli_preserves_hypothesis: long premise is truncated, hypothesis is preserved.
- integrations/claude-code/kebab/SKILL.md: one paragraph of NLI guidance in the mcp__kebab__ask section (meaning of verification.nli_passed + threshold tuning + handling of the nli_verification_failed/nli_model_unavailable refusals).

Verification: cargo test --workspace -j 1: the 5 new multi-hop tests pass with 0 regressions (the pre-existing kebab-mcp::tools_call_ask_multi_hop is flaky as before). cargo clippy --workspace --all-targets -j 1 -- -D warnings is clean.

Wire impact: *behavior wiring* on top of the PR-9c-1 schema change: the answer.v1.verification field is now populated on both the multi-hop happy path and the refuse path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
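
As a reading aid, the step 8.5 decision described in the message above can be sketched as a small pure function. This is a minimal sketch, not the pipeline code: `GateOutcome`, `Scorer`, `FixedScorer`, and `nli_gate` are hypothetical names introduced only for illustration, and the real hook works on `VerificationSummary` and `RefusalReason` as shown in the diff.

```rust
/// Hypothetical outcome type mirroring the branches of the step 8.5 gate.
#[derive(Debug, PartialEq)]
enum GateOutcome {
    /// Threshold is 0 (gate disabled) or the synthesized answer is empty.
    Skipped,
    /// Entailment score met the threshold.
    Passed { score: f32 },
    /// Entailment score fell below the threshold.
    RefusedLowScore { score: f32 },
    /// The verifier itself returned an error.
    RefusedModelUnavailable,
}

/// Hypothetical stand-in for the verifier trait: returns an entailment
/// score, or an error message when the model cannot run.
trait Scorer {
    fn entailment(&self, premise: &str, hypothesis: &str) -> Result<f32, String>;
}

/// Test double that always returns the same result.
struct FixedScorer(Result<f32, String>);

impl Scorer for FixedScorer {
    fn entailment(&self, _premise: &str, _hypothesis: &str) -> Result<f32, String> {
        self.0.clone()
    }
}

/// Decide what the gate does for one synthesized answer.
fn nli_gate(threshold: f32, scorer: &dyn Scorer, premise: &str, answer: &str) -> GateOutcome {
    // Gate disabled or nothing to verify: skip without calling the scorer.
    if threshold <= 0.0 || answer.trim().is_empty() {
        return GateOutcome::Skipped;
    }
    match scorer.entailment(premise, answer) {
        Ok(score) if score >= threshold => GateOutcome::Passed { score },
        Ok(score) => GateOutcome::RefusedLowScore { score },
        Err(_) => GateOutcome::RefusedModelUnavailable,
    }
}
```

The four non-skipped and skipped outcomes line up with the four pipeline tests listed in the message (pass, fail, disabled, model unavailable).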
Cargo.lock
generated
@@ -4142,6 +4142,7 @@ dependencies = [
  "kebab-embed-local",
  "kebab-llm",
  "kebab-llm-local",
+ "kebab-nli",
  "kebab-normalize",
  "kebab-parse-code",
  "kebab-parse-image",
@@ -23,6 +23,11 @@ kebab-embed-local = { path = "../kebab-embed-local" }
 kebab-llm = { path = "../kebab-llm" }
 kebab-llm-local = { path = "../kebab-llm-local" }
 kebab-rag = { path = "../kebab-rag" }
+# p9-fb-41 PR-9c-2: facade construction of OnnxNliVerifier when
+# `[rag] nli_threshold > 0`. Trait-only consumption via kebab-rag's
+# `with_verifier`; no kebab-nli internals leak into kebab-app code
+# beyond the construction site in `open_with_config`.
+kebab-nli = { path = "../kebab-nli" }
 # P6-4: image extractor + OCR + caption adapters live here. App
 # threads them into the per-asset dispatch (see `ingest_one_asset`
 # image branch). Trait-only consumption — no `kebab-parse-image`
@@ -133,6 +133,17 @@ pub struct App {
     /// `corpus_revision` snapshot embedded in `SearchCacheKey`
     /// invalidates every entry the moment a new ingest commit lands.
     search_cache: Option<Mutex<LruCache<SearchCacheKey, Vec<SearchHit>>>>,
+    /// p9-fb-41 PR-9c-2: NLI verifier built eagerly at
+    /// `open_with_config` time when `config.rag.nli_threshold > 0`,
+    /// consumed by `RagPipeline::with_verifier` on every `ask` /
+    /// `ask_with_session` call. `None` when the gate is disabled
+    /// (default, threshold = 0) — multi-hop skips step 8.5 entirely
+    /// and single-pass never touches the verifier.
+    ///
+    /// Built eagerly (not lazy) so the `open_with_config` `?`
+    /// propagation surfaces NLI model construction errors at App
+    /// boot time, before any user query runs.
+    pipeline_verifier: Option<Arc<dyn kebab_nli::NliVerifier>>,
 }

 /// p9-fb-19: cache key for `App::search`. Includes every field that
@@ -193,6 +204,21 @@ impl App {
         // `None` (cache disabled — every search hits the retrievers).
         let search_cache = NonZeroUsize::new(config.search.cache_capacity)
             .map(|cap| Mutex::new(LruCache::new(cap)));
+        // p9-fb-41 PR-9c-2: build the NLI verifier when the gate is
+        // enabled. App carries it on `RagPipeline` via
+        // `with_verifier` so the rag crate doesn't have to know about
+        // kebab-nli construction. Failure (`?`) surfaces as a user-
+        // facing error at App boot — never a panic in the pipeline's
+        // `expect("verifier must be Some when nli_threshold > 0.0")`.
+        let pipeline_verifier: Option<Arc<dyn kebab_nli::NliVerifier>> =
+            if config.rag.nli_threshold > 0.0 {
+                let v = kebab_nli::OnnxNliVerifier::new(&config).context(
+                    "kebab-app: construct OnnxNliVerifier (config.rag.nli_threshold > 0)",
+                )?;
+                Some(Arc::new(v))
+            } else {
+                None
+            };
         Ok(Self {
             config,
             sqlite: Arc::new(sqlite),
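
The eager-construction pattern in the hunk above (build the optional verifier once at open time and let `?` surface failures before any query runs) can be illustrated in isolation. This is a minimal sketch under stated assumptions: `Verifier`, `DummyVerifier`, `AppSketch`, and the `model_available` flag are hypothetical stand-ins, and plain `String` errors replace the `anyhow` context used in the real code.

```rust
use std::sync::Arc;

/// Hypothetical trait standing in for the verifier interface.
trait Verifier {
    fn name(&self) -> &str;
}

/// Hypothetical verifier whose construction can fail (e.g. missing model files).
struct DummyVerifier;

impl DummyVerifier {
    fn new(model_available: bool) -> Result<Self, String> {
        if model_available {
            Ok(DummyVerifier)
        } else {
            Err("model files missing".to_string())
        }
    }
}

impl Verifier for DummyVerifier {
    fn name(&self) -> &str {
        "dummy"
    }
}

/// Hypothetical facade: holds the verifier only when the gate is enabled.
struct AppSketch {
    verifier: Option<Arc<dyn Verifier>>,
}

impl AppSketch {
    /// Builds the verifier eagerly so a construction error is returned
    /// from `open` instead of surfacing later inside a query.
    fn open(threshold: f32, model_available: bool) -> Result<Self, String> {
        let verifier: Option<Arc<dyn Verifier>> = if threshold > 0.0 {
            let v = DummyVerifier::new(model_available)
                .map_err(|e| format!("construct verifier: {e}"))?;
            Some(Arc::new(v) as Arc<dyn Verifier>)
        } else {
            // Gate disabled: never touch the model at all.
            None
        };
        Ok(Self { verifier })
    }
}
```

With this shape, a disabled gate never constructs the model, and an enabled gate with a broken model fails at open time rather than at the first multi-hop query.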
@@ -200,6 +226,7 @@ impl App {
             vector: OnceLock::new(),
             llm: OnceLock::new(),
             search_cache,
+            pipeline_verifier,
         })
     }

@@ -553,9 +580,26 @@ impl App {
     pub fn ask(&self, query: &str, opts: AskOpts) -> Result<Answer> {
         let retriever = self.build_retriever(opts.mode)?;
         let llm = self.llm()?;
+        let pipeline = self.build_pipeline(retriever, llm);
+        pipeline.ask(query, opts)
+    }
+
+    /// p9-fb-41 PR-9c-2: shared pipeline builder used by [`Self::ask`]
+    /// and [`Self::ask_with_session`]. Attaches the App-built NLI
+    /// verifier (when `cfg.rag.nli_threshold > 0`) via
+    /// `RagPipeline::with_verifier`, keeping the construction site in
+    /// a single place so the two call paths can't drift.
+    fn build_pipeline(
+        &self,
+        retriever: Arc<dyn Retriever>,
+        llm: Arc<dyn LanguageModel>,
+    ) -> RagPipeline {
         let pipeline =
             RagPipeline::new(self.config.clone(), retriever, llm, self.sqlite.clone());
-        pipeline.ask(query, opts)
+        match &self.pipeline_verifier {
+            Some(v) => pipeline.with_verifier(v.clone()),
+            None => pipeline,
+        }
     }

     /// p9-fb-18: shared retriever-stack builder used by [`Self::ask`]
@@ -660,10 +704,11 @@ impl App {

         // p9-fb-18 R1: shared retriever builder removes the prior
         // copy of `ask`'s 35-line stack — see [`Self::build_retriever`].
+        // p9-fb-41 PR-9c-2: shared `build_pipeline` attaches the NLI
+        // verifier when the gate is enabled.
         let retriever = self.build_retriever(opts.mode)?;
         let llm = self.llm()?;
-        let pipeline =
-            RagPipeline::new(self.config.clone(), retriever, llm, self.sqlite.clone());
+        let pipeline = self.build_pipeline(retriever, llm);
         let answer = pipeline.ask_with_history(
             query,
             history,
@@ -22,4 +22,4 @@ pub use kebab_core::{Answer, AnswerCitation, AnswerRetrievalSummary, RefusalReas

 mod pipeline;

-pub use pipeline::{AskOpts, RagPipeline, StreamEvent};
+pub use pipeline::{AskOpts, MAX_NLI_PREMISE_CHARS, RagPipeline, StreamEvent, truncate_for_nli};
@@ -42,7 +42,7 @@ use kebab_core::{
     Answer, AnswerCitation, AnswerRetrievalSummary, Citation, FinishReason,
     GenerateRequest, HopKind, HopRecord, LanguageModel, ModelRef, RefusalReason,
     Retriever, SearchFilters, SearchHit, SearchMode, SearchQuery, TokenChunk,
-    TokenUsage, TraceId, Turn,
+    TokenUsage, TraceId, Turn, VerificationSummary,
 };
 use kebab_core::versions::PromptTemplateVersion;
 use kebab_store_sqlite::SqliteStore;
@@ -197,13 +197,11 @@ pub struct RagPipeline {
     retriever: Arc<dyn Retriever>,
     llm: Arc<dyn LanguageModel>,
     docs: Arc<SqliteStore>,
-    /// p9-fb-41 PR-9c-1: optional NLI verifier injected via
-    /// [`Self::with_verifier`]. Not yet read — PR-9c-2 wires the
-    /// `ask_multi_hop` step 8.5 (post-synthesize gate) that consumes
-    /// it. Until then the field is `#[allow(dead_code)]`; the
-    /// attribute is removed in the PR-9c-2 commit that adds the
-    /// read site so leftover dead code can never sneak in.
-    #[allow(dead_code)]
+    /// p9-fb-41 PR-9c-1/PR-9c-2: optional NLI verifier injected via
+    /// [`Self::with_verifier`]. Consumed by `ask_multi_hop` step 8.5
+    /// (post-synthesize gate) when `cfg.rag.nli_threshold > 0`.
+    /// `None` when the gate is disabled — single-pass `ask` never
+    /// touches this field.
     verifier: Option<Arc<dyn kebab_nli::NliVerifier>>,
 }

@@ -231,17 +229,12 @@ impl RagPipeline {
         }
     }

-    /// p9-fb-41 PR-9c-1: inject the post-synthesize NLI verifier.
-    /// Caller (kebab-app facade, PR-9c-2) builds an
+    /// p9-fb-41 PR-9c-1/PR-9c-2: inject the post-synthesize NLI
+    /// verifier. Caller (kebab-app facade) builds an
     /// `Arc<OnnxNliVerifier>` from `cfg.models.nli` when
     /// `cfg.rag.nli_threshold > 0`, then chains
-    /// `RagPipeline::new(...).with_verifier(v)`.
-    ///
-    /// Currently unused — PR-9c-2 wires the read site (step 8.5 of
-    /// `ask_multi_hop`). `#[allow(dead_code)]` survives only until
-    /// that PR's commit, which removes it together with adding the
-    /// hook that reads `self.verifier`.
-    #[allow(dead_code)]
+    /// `RagPipeline::new(...).with_verifier(v)`. Consumed by
+    /// `ask_multi_hop` step 8.5 (post-synthesize gate).
     pub fn with_verifier(mut self, v: Arc<dyn kebab_nli::NliVerifier>) -> Self {
         self.verifier = Some(v);
         self
@@ -1031,6 +1024,48 @@ impl RagPipeline {
             (false, Some(RefusalReason::LlmSelfJudge))
         };

+        // ── 8.5 NLI groundedness verification (multi-hop only, v0.18) ─────
+        // spec §2.7: single-pass `ask` keeps the LlmSelfJudge gate as-is;
+        // NLI verification is multi-hop only this round.
+        //
+        // Empty answer guard: if synthesize bailed (stream abort / LM
+        // crash), `acc` is empty. That path has its own refusal
+        // (LlmStreamAborted) above; skipping the NLI gate here avoids
+        // tokenizing an empty hypothesis (degenerate CLS-SEP-SEP that
+        // would yield a near-uniform softmax and a misleading nli_passed).
+        let verification = if self.config.rag.nli_threshold > 0.0 && !acc.trim().is_empty() {
+            let v = self.verifier.as_ref().expect(
+                "verifier must be Some when nli_threshold > 0.0 \
+                 (kebab-app's open_with_config enforces this invariant)",
+            );
+            let (truncated_premise, _was_truncated) = truncate_for_nli(&packed_text, &acc);
+            match v.score(&truncated_premise, &acc) {
+                Ok(scores) => {
+                    let passed = scores.entailment >= self.config.rag.nli_threshold;
+                    Some(VerificationSummary {
+                        nli_score: scores.entailment,
+                        nli_threshold: self.config.rag.nli_threshold,
+                        nli_passed: passed,
+                    })
+                }
+                Err(e) => {
+                    tracing::warn!(
+                        target: "kebab-rag",
+                        error = %e,
+                        "NLI verifier failed (model unavailable / inference err); refusing"
+                    );
+                    return self.refuse_nli_model_unavailable(query, &opts, hops, started);
+                }
+            }
+        } else {
+            None
+        };
+        if let Some(v) = &verification
+            && !v.nli_passed
+        {
+            return self.refuse_nli_verification(query, &opts, hops, *v, started);
+        }
+
         // ── 8. Build Answer ────────────────────────────────────────────────
         let cited_set: std::collections::BTreeSet<u32> = extracted.iter().copied().collect();
         let citations: Vec<AnswerCitation> = packed_entries
@@ -1101,11 +1136,10 @@ impl RagPipeline {
             // currently lose the trace (cleanup deferred — would
             // require widening helper signatures, PR-3b-ii / follow-up).
             hops: Some(hops),
-            // p9-fb-41 PR-9c-1: surface-only field — PR-9c-2 wires
-            // step 8.5 between citation-validate and Answer-build to
-            // stamp this with the actual NLI score when
-            // `cfg.rag.nli_threshold > 0`. Until then, stays None.
-            verification: None,
+            // p9-fb-41 PR-9c-2: step 8.5 stamped this when
+            // `cfg.rag.nli_threshold > 0`. None when the gate is
+            // disabled (default).
+            verification,
         };

         tracing::debug!(
@@ -1554,6 +1588,137 @@ impl RagPipeline {
         }
         Ok(answer)
     }
+
+    /// p9-fb-41 PR-9c-2: refusal path for step 8.5 NLI gate failure —
+    /// `RefusalReason::NliVerificationFailed`. The synthesized answer
+    /// existed (acc was non-empty) but the entailment score fell below
+    /// `cfg.rag.nli_threshold`. We stamp the `VerificationSummary` on
+    /// the wire so the user can see what score was rejected.
+    fn refuse_nli_verification(
+        &self,
+        query: &str,
+        opts: &AskOpts,
+        hops: Vec<HopRecord>,
+        v: VerificationSummary,
+        started: std::time::Instant,
+    ) -> Result<Answer> {
+        let elapsed_ms = u32::try_from(started.elapsed().as_millis()).unwrap_or(u32::MAX);
+        let trace_id = mint_trace_id(query, 0.0, &self.llm.model_ref().id);
+        let k_effective = opts.k.max(self.config.search.default_k);
+        let answer = Answer {
+            answer: "근거 부족. 생성된 답변이 검색된 문서 내용에 충분히 entail 되지 않음."
+                .to_string(),
+            citations: Vec::new(),
+            grounded: false,
+            refusal_reason: Some(RefusalReason::NliVerificationFailed),
+            model: self.llm.model_ref(),
+            embedding: embedding_ref_for(opts.mode, &self.config),
+            prompt_template_version: PromptTemplateVersion(
+                PROMPT_TEMPLATE_VERSION_MULTI_HOP.to_string(),
+            ),
+            retrieval: AnswerRetrievalSummary {
+                trace_id,
+                mode: opts.mode,
+                k: k_effective,
+                score_gate: self.config.rag.score_gate,
+                top_score: 0.0,
+                chunks_returned: 0,
+                chunks_used: 0,
+            },
+            usage: TokenUsage {
+                prompt_tokens: 0,
+                completion_tokens: 0,
+                latency_ms: elapsed_ms,
+            },
+            created_at: OffsetDateTime::now_utc(),
+            conversation_id: opts.conversation_id.clone(),
+            turn_index: opts.turn_index,
+            // PR-9c-2: NLI refusal still carries the hop trace built
+            // up to step 8.5 — synthesize ran, so the trace is the
+            // full decompose+decide chain (terminal Synthesize hop is
+            // NOT appended for the refusal path; cleanup deferred to
+            // a follow-up if the user-visible trace shape needs the
+            // synthesize entry).
+            hops: Some(hops),
+            verification: Some(v),
+        };
+        if let Some(sink) = &opts.stream_sink {
+            let _ = sink.send(StreamEvent::Final {
+                answer: answer.clone(),
+            });
+        }
+        if let Err(e) = self.docs.put_answer(&answer, query, None) {
+            tracing::warn!(
+                target: "kebab-rag",
+                error = %e,
+                "kb-rag: put_answer (NliVerificationFailed) failed"
+            );
+        }
+        Ok(answer)
+    }
+
+    /// p9-fb-41 PR-9c-2: refusal path for step 8.5 NLI model
+    /// unavailable — `RefusalReason::NliModelUnavailable`. The verifier
+    /// raised an inference/download error so we cannot summarize the
+    /// verification result; `verification` is `None`. Treat as a soft
+    /// refusal — the user can opt out by setting `[rag] nli_threshold
+    /// = 0` and retrying.
+    fn refuse_nli_model_unavailable(
+        &self,
+        query: &str,
+        opts: &AskOpts,
+        hops: Vec<HopRecord>,
+        started: std::time::Instant,
+    ) -> Result<Answer> {
+        let elapsed_ms = u32::try_from(started.elapsed().as_millis()).unwrap_or(u32::MAX);
+        let trace_id = mint_trace_id(query, 0.0, &self.llm.model_ref().id);
+        let k_effective = opts.k.max(self.config.search.default_k);
+        let answer = Answer {
+            answer: "근거 부족. NLI 검증 모델을 사용할 수 없음 — `[rag] nli_threshold = 0` 으로 비활성화 후 재시도 가능."
+                .to_string(),
+            citations: Vec::new(),
+            grounded: false,
+            refusal_reason: Some(RefusalReason::NliModelUnavailable),
+            model: self.llm.model_ref(),
+            embedding: embedding_ref_for(opts.mode, &self.config),
+            prompt_template_version: PromptTemplateVersion(
+                PROMPT_TEMPLATE_VERSION_MULTI_HOP.to_string(),
+            ),
+            retrieval: AnswerRetrievalSummary {
+                trace_id,
+                mode: opts.mode,
+                k: k_effective,
+                score_gate: self.config.rag.score_gate,
+                top_score: 0.0,
+                chunks_returned: 0,
+                chunks_used: 0,
+            },
+            usage: TokenUsage {
+                prompt_tokens: 0,
+                completion_tokens: 0,
+                latency_ms: elapsed_ms,
+            },
+            created_at: OffsetDateTime::now_utc(),
+            conversation_id: opts.conversation_id.clone(),
+            turn_index: opts.turn_index,
+            hops: Some(hops),
+            // No VerificationSummary — verification didn't happen.
+            verification: None,
+        };
+        if let Some(sink) = &opts.stream_sink {
+            let _ = sink.send(StreamEvent::Final {
+                answer: answer.clone(),
+            });
+        }
+        if let Err(e) = self.docs.put_answer(&answer, query, None) {
+            tracing::warn!(
+                target: "kebab-rag",
+                error = %e,
+                "kb-rag: put_answer (NliModelUnavailable) failed"
+            );
+        }
+        Ok(answer)
+    }
 }

 // ── Helpers ────────────────────────────────────────────────────────────────
@@ -1623,6 +1788,35 @@ pub(crate) const PROMPT_TEMPLATE_VERSION_MULTI_HOP: &str = "rag-multi-hop-v1";
 /// the same PR.
 pub(crate) const MULTI_HOP_MAX_SUB_QUERIES_HARD_CAP: usize = 10;

+/// p9-fb-41 PR-9c-2: premise budget for NLI input. mDeBERTa-v3's
+/// positional embedding caps at 512 tokens; with the hypothesis +
+/// special-token budget reserved (~32 chars conservative), the
+/// premise gets ≈1600 chars at 4 chars/token (English BPE baseline).
+/// Korean SentencePiece is denser (1-2 char/token) — the tokenizer's
+/// `OnlyFirst` strategy (configured in kebab-nli) is the backup
+/// truncation when the char-based budget still overflows the token
+/// limit. v0.18.1 candidate: token-count-based budget once we have
+/// measured KR truncation rates from dogfood retest.
+pub const MAX_NLI_PREMISE_CHARS: usize = 4 * 400;
+
+/// p9-fb-41 PR-9c-2: truncate `premise` to fit the NLI input budget
+/// while preserving `hypothesis` in full. Returns `(truncated_premise,
+/// was_truncated)`. `was_truncated` is informational for tracing —
+/// the v0.18 wire doesn't surface it; a v0.19+ extension might.
+///
+/// `_hypothesis` is currently unused — placeholder for the v0.18.1
+/// token-budget version that would carve the budget *around* the
+/// hypothesis. Kept on the signature to preserve the contract from
+/// spec §2.2.3 / spec §3 PR-9c-2.
+pub fn truncate_for_nli(premise: &str, _hypothesis: &str) -> (String, bool) {
+    if premise.chars().count() <= MAX_NLI_PREMISE_CHARS {
+        (premise.to_string(), false)
+    } else {
+        let truncated: String = premise.chars().take(MAX_NLI_PREMISE_CHARS).collect();
+        (truncated, true)
+    }
+}
+
 const MULTI_HOP_DECOMPOSE_SYSTEM_PROMPT: &str = "당신은 사용자의 질문을 다단계 검색에 필요한 sub-question 들로 분해하는 도구다.\n- multi-hop 정보가 필요한 경우 독립적으로 검색 가능한 sub-question 들로 분해한다.\n- 각 sub-question 은 자기 자신만으로 의미가 통해야 한다 (대명사 / \"위 답변\" 같은 reference 금지).\n- 원본이 이미 단순하면 원본 그대로 1 개만 반환한다.\n- 응답은 JSON array of strings 만 출력한다. 다른 prose / markdown fence / 설명 금지.";

 const MULTI_HOP_DECIDE_SYSTEM_PROMPT: &str = "당신은 multi-hop 검색의 매 iter 에서 \"추가 retrieval 이 필요한가?\" 를 판단하는 도구다.\n- 지금까지 모은 [근거] 가 [원본 질문] 의 모든 측면을 cover 하는지 평가한다.\n- 추가가 필요하면 새 sub-question 들 (이미 모은 정보로 답할 수 없는 부분만, 독립적으로 검색 가능한 형태로) 을 JSON array of strings 로 반환한다.\n- 충분하면 빈 array `[]` 를 반환한다.\n- 응답은 JSON array of strings 만 출력한다. 다른 prose / markdown fence / 설명 금지.\n- 각 sub-question 은 자기 자신만으로 의미가 통해야 한다 (대명사 / \"위 답변\" 같은 reference 금지).";
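
The char-based budget in `truncate_for_nli` matters because slicing a `&str` at an arbitrary byte offset can panic in the middle of a multi-byte character. A single-pass variant of the same idea is sketched below as an illustration only; `truncate_chars` is a hypothetical helper, not part of this commit, and it uses `char_indices().nth(..)` to find the cut point instead of counting all chars first.

```rust
/// Hypothetical single-pass variant: keep at most `max_chars` characters.
/// Returns the (possibly shortened) text and whether anything was cut.
fn truncate_chars(s: &str, max_chars: usize) -> (String, bool) {
    // `nth(max_chars)` yields the byte offset of the first character
    // that does NOT fit; that offset is always a valid char boundary.
    match s.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => (s[..byte_idx].to_string(), true),
        // Fewer than or exactly `max_chars` characters: nothing to cut.
        None => (s.to_string(), false),
    }
}
```

Either form counts characters rather than bytes, so 1600 Hangul characters (3 bytes each in UTF-8) fit the budget exactly, matching the multi-byte cases in the truncation test.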
@@ -12,13 +12,14 @@

 #![allow(dead_code)]

-use std::sync::Arc;
+use std::sync::{Arc, Mutex};

 use kebab_config::Config;
 use kebab_core::{
     ChunkerVersion, ChunkId, Citation, DocumentId, IndexVersion, RetrievalDetail,
     Retriever, SearchHit, SearchMode, SearchQuery, WorkspacePath,
 };
+use kebab_nli::{NliScores, NliVerifier};
 use kebab_store_sqlite::SqliteStore;
 use rusqlite::params;
 use tempfile::TempDir;
@@ -384,3 +385,73 @@ impl kebab_core::LanguageModel for ScriptedLm {
         Ok(Box::new(chunks.into_iter().map(Ok)))
     }
 }
+
+/// p9-fb-41 PR-9c-2: mock NLI verifier for multi-hop step 8.5 tests.
+/// Three constructors mirror the test scenarios:
+/// - [`MockNliVerifier::pass`] — high entailment score (0.9), `nli_passed`
+/// is true at the production default threshold (0.5).
+/// - [`MockNliVerifier::fail`] — low entailment (0.1), refuses at any
+/// threshold > 0.1.
+/// - [`MockNliVerifier::err`] — returns an `anyhow::Error` so the pipeline
+/// surfaces `RefusalReason::NliModelUnavailable`.
+///
+/// `call_count` instrumented (Mutex-wrapped usize) so a test can assert
+/// the verifier ran the expected number of times — useful for pinning
+/// the "threshold = 0 skips verify" invariant when the verifier is
+/// nonetheless attached to the pipeline.
+pub struct MockNliVerifier {
+    pub mode: MockMode,
+    pub call_count: Mutex<usize>,
+}
+
+pub enum MockMode {
+    /// Return these scores. Used by pass / fail variants.
+    Scores(NliScores),
+    /// Return an `anyhow::Error`. Used by the err variant.
+    Err(String),
+}
+
+impl MockNliVerifier {
+    pub fn pass() -> Arc<Self> {
+        Arc::new(Self {
+            mode: MockMode::Scores(NliScores {
+                entailment: 0.9,
+                neutral: 0.07,
+                contradiction: 0.03,
+            }),
+            call_count: Mutex::new(0),
+        })
+    }
+
+    pub fn fail() -> Arc<Self> {
+        Arc::new(Self {
+            mode: MockMode::Scores(NliScores {
+                entailment: 0.1,
+                neutral: 0.4,
+                contradiction: 0.5,
+            }),
+            call_count: Mutex::new(0),
+        })
+    }
+
+    pub fn err() -> Arc<Self> {
+        Arc::new(Self {
+            mode: MockMode::Err("mock NLI unavailable".into()),
+            call_count: Mutex::new(0),
+        })
+    }
+
+    pub fn calls(&self) -> usize {
+        *self.call_count.lock().unwrap()
+    }
+}
+
+impl NliVerifier for MockNliVerifier {
+    fn score(&self, _premise: &str, _hypothesis: &str) -> anyhow::Result<NliScores> {
+        *self.call_count.lock().unwrap() += 1;
+        match &self.mode {
+            MockMode::Scores(s) => Ok(*s),
+            MockMode::Err(e) => anyhow::bail!("{e}"),
+        }
+    }
+}
@@ -26,9 +26,10 @@ mod common;

 use std::sync::Arc;

-use common::{RagEnv, ScriptedLm, ScriptedRetriever, id32, mk_hit};
+use common::{MockNliVerifier, RagEnv, ScriptedLm, ScriptedRetriever, id32, mk_hit};
 use kebab_core::{HopKind, LanguageModel, RefusalReason, Retriever, SearchMode};
-use kebab_rag::{AskOpts, RagPipeline};
+use kebab_nli::NliVerifier;
+use kebab_rag::{AskOpts, RagPipeline, truncate_for_nli};

 /// Default `AskOpts` for multi-hop tests: deterministic seed,
 /// lexical mode (so the test crate doesn't need to wire up an
@@ -618,3 +619,195 @@ fn multi_hop_above_probe_gate_proceeds_to_decompose() {
     let hops = answer.hops.expect("happy path stamps hops");
     assert_eq!(hops.len(), 3);
 }
+
+// ── p9-fb-41 PR-9c-2: step 8.5 NLI verification tests ──────────────────────
+//
+// Five tests pin the NLI hook on the multi-hop path:
+// 1. `multi_hop_nli_pass_keeps_grounded` — entailment 0.9 ≥ threshold 0.5 →
+// happy path, `verification.nli_passed = true`.
+// 2. `multi_hop_nli_fail_refuses` — entailment 0.1 < threshold 0.5 →
+// refusal with `RefusalReason::NliVerificationFailed` + verification stamp.
+// 3. `multi_hop_nli_disabled_skip_verify` — threshold 0.0 → verify skipped,
+// `Answer.verification` stays `None` (no verifier attached).
+// 4. `multi_hop_nli_model_unavailable_refuses` — verifier returns `Err` →
+// refusal with `RefusalReason::NliModelUnavailable` + `verification = None`.
+// 5. `multi_hop_truncate_for_nli_preserves_hypothesis` — pure unit test on
+// `truncate_for_nli`'s char-budget contract.
+
+/// Helper to build a "valid multi-hop happy-path" scenario where probe +
+/// decompose retrieves the same single chunk, decompose emits one
+/// sub-query, decide signals stop, and synthesize produces a cited
+/// answer. Returns the seeded `RagEnv`, scripted retriever (so the
+/// test can assert call count), and scripted LM with the 3-call
+/// script ready.
+fn happy_multi_hop_env() -> (RagEnv, Arc<ScriptedRetriever>, Arc<ScriptedLm>) {
+    let env = RagEnv::new();
+    let cid = id32("c1");
+    let did = id32("d1");
+    env.seed_chunk(&cid, &did, "notes/a.md", "Body text.", &["Intro"]);
+    let hits = vec![mk_hit(1, &cid, &did, "notes/a.md", 0.85, &["Intro"])];
+    let retriever = Arc::new(ScriptedRetriever::new(vec![hits.clone(), hits]));
+    let lm = Arc::new(ScriptedLm::new(vec![
+        r#"["q1"]"#,
+        r#"[]"#,
+        "answer body [#1]",
+    ]));
+    (env, retriever, lm)
+}
+
+#[test]
+fn multi_hop_nli_pass_keeps_grounded() {
+    let (env, retriever, lm) = happy_multi_hop_env();
+    let mut cfg = env.config.clone();
+    cfg.rag.nli_threshold = 0.5;
+
+    let retriever_dyn: Arc<dyn Retriever> = retriever;
+    let lm_dyn: Arc<dyn LanguageModel> = lm;
+    let verifier = MockNliVerifier::pass();
+    let verifier_handle = verifier.clone();
+    let verifier_dyn: Arc<dyn NliVerifier> = verifier;
+    let pipeline =
+        RagPipeline::new(cfg, retriever_dyn, lm_dyn, env.sqlite.clone())
+            .with_verifier(verifier_dyn);
+
+    let answer = pipeline.ask("compound", multi_hop_opts()).unwrap();
+
+    assert!(answer.grounded, "NLI-pass synthesize must stay grounded");
+    assert_eq!(answer.refusal_reason, None);
+    assert_eq!(
+        verifier_handle.calls(),
+        1,
+        "verifier called exactly once on the synthesized answer"
+    );
+    let v = answer
+        .verification
+        .expect("nli_threshold > 0 stamps Some(verification)");
+    assert!(v.nli_passed, "entailment 0.9 ≥ threshold 0.5");
+    assert!((v.nli_score - 0.9).abs() < 1e-5, "got: {}", v.nli_score);
+    assert!((v.nli_threshold - 0.5).abs() < 1e-5);
+}
+
+#[test]
+fn multi_hop_nli_fail_refuses() {
+    let (env, retriever, lm) = happy_multi_hop_env();
+    let mut cfg = env.config.clone();
+    cfg.rag.nli_threshold = 0.5;
+
+    let retriever_dyn: Arc<dyn Retriever> = retriever;
+    let lm_dyn: Arc<dyn LanguageModel> = lm;
+    let verifier = MockNliVerifier::fail();
+    let verifier_handle = verifier.clone();
+    let verifier_dyn: Arc<dyn NliVerifier> = verifier;
+    let pipeline =
+        RagPipeline::new(cfg, retriever_dyn, lm_dyn, env.sqlite.clone())
+            .with_verifier(verifier_dyn);
+
+    let answer = pipeline.ask("compound", multi_hop_opts()).unwrap();
+
+    assert!(!answer.grounded);
+    assert_eq!(
+        answer.refusal_reason,
+        Some(RefusalReason::NliVerificationFailed)
+    );
+    assert_eq!(verifier_handle.calls(), 1);
+    let v = answer
+        .verification
+        .expect("refusal still stamps verification summary");
+    assert!(!v.nli_passed, "entailment 0.1 < threshold 0.5");
+    assert!((v.nli_score - 0.1).abs() < 1e-5, "got: {}", v.nli_score);
+}
+
+#[test]
+fn multi_hop_nli_disabled_skip_verify() {
+    let (env, retriever, lm) = happy_multi_hop_env();
+    // Default config keeps `nli_threshold = 0.0` — gate disabled. No
+    // verifier is attached to the pipeline; the hook short-circuits
+    // entirely (`Answer.verification` stays `None`).
+    let cfg = env.config.clone();
+    assert!(
+        (cfg.rag.nli_threshold - 0.0).abs() < f32::EPSILON,
+        "default nli_threshold must be 0.0 (gate disabled)"
+    );
+
+    let retriever_dyn: Arc<dyn Retriever> = retriever;
+    let lm_dyn: Arc<dyn LanguageModel> = lm;
+    // No `with_verifier` call — pipeline.verifier stays None.
+    let pipeline =
+        RagPipeline::new(cfg, retriever_dyn, lm_dyn, env.sqlite.clone());
+
+    let answer = pipeline.ask("compound", multi_hop_opts()).unwrap();
+
+    assert!(answer.grounded);
+    assert_eq!(answer.refusal_reason, None);
+    assert!(
+        answer.verification.is_none(),
+        "threshold = 0.0 must skip step 8.5 and leave verification = None"
+    );
+}
+
+#[test]
+fn multi_hop_nli_model_unavailable_refuses() {
+    let (env, retriever, lm) = happy_multi_hop_env();
+    let mut cfg = env.config.clone();
+    cfg.rag.nli_threshold = 0.5;
+
+    let retriever_dyn: Arc<dyn Retriever> = retriever;
+    let lm_dyn: Arc<dyn LanguageModel> = lm;
+    let verifier = MockNliVerifier::err();
+    let verifier_handle = verifier.clone();
+    let verifier_dyn: Arc<dyn NliVerifier> = verifier;
+    let pipeline =
+        RagPipeline::new(cfg, retriever_dyn, lm_dyn, env.sqlite.clone())
+            .with_verifier(verifier_dyn);
+
+    let answer = pipeline.ask("compound", multi_hop_opts()).unwrap();
+
+    assert!(!answer.grounded);
+    assert_eq!(
+        answer.refusal_reason,
+        Some(RefusalReason::NliModelUnavailable)
+    );
+    assert_eq!(verifier_handle.calls(), 1, "verifier was invoked once before failing");
+    assert!(
+        answer.verification.is_none(),
+        "NliModelUnavailable: can't summarize a verification that didn't happen"
+    );
+}
+
+#[test]
+fn multi_hop_truncate_for_nli_preserves_hypothesis() {
+    // Long premise (>1600 chars) gets truncated, short hypothesis is
+    // passed unchanged (signature placeholder for v0.18.1 token-budget
+    // version). MAX_NLI_PREMISE_CHARS = 4 * 400 = 1600.
+    let long_premise: String = "a".repeat(2000);
+    let (truncated, was_truncated) = truncate_for_nli(&long_premise, "short hypothesis");
+    assert!(was_truncated);
+    assert_eq!(
+        truncated.chars().count(),
+        1600,
+        "premise truncated to MAX_NLI_PREMISE_CHARS"
+    );
+
+    // Short premise (under budget): no truncation, `was_truncated = false`.
+    let short_premise = "short premise text";
+    let (passthrough, was_truncated) = truncate_for_nli(short_premise, "anything");
+    assert!(!was_truncated);
+    assert_eq!(passthrough, short_premise);
+
+    // Multi-byte safety: 1600 Korean chars (3 bytes each in UTF-8) fits
+    // within the char budget even though byte length exceeds 4800.
+    let kr_short: String = "가".repeat(1600);
+    let (passthrough_kr, was_truncated_kr) = truncate_for_nli(&kr_short, "h");
+    assert!(!was_truncated_kr, "1600 KR chars == budget, no truncation");
+    assert_eq!(passthrough_kr.chars().count(), 1600);
+
+    // Multi-byte over-budget: truncation must count chars, not bytes.
+    let kr_long: String = "가".repeat(2000);
+    let (truncated_kr, was_truncated_kr) = truncate_for_nli(&kr_long, "h");
+    assert!(was_truncated_kr);
+    assert_eq!(
+        truncated_kr.chars().count(),
+        1600,
+        "char-based truncation must not over-cut on multi-byte input"
+    );
+}
@@ -88,6 +88,7 @@ Input:
 - For follow-up turns on the same topic, pass `session_id` (e.g. `"team-onboarding-2026-05"`) and reuse it across the conversation. Sessions persist until `kebab reset --data-only`.
 - p9-fb-40: 기본 `prompt_template_version = "rag-v2"`. 답변이 더 strict — fact 인용 시 verbatim span, 학습 지식 동원 금지, 근거 모호 시 "확실하지 않다" 출현 가능. user 가 `[rag] prompt_template_version = "rag-v1"` 명시 시 legacy 동작.
 - **p9-fb-41 `multi_hop: true`** — opt the ask into the multi-hop pipeline. The query is decomposed into sub-questions, each retrieved independently (LLM-driven decide loop, up to `rag.multi_hop_max_depth` iters), then synthesized over the merged chunk pool. Cost trade-off: 2–5× LLM calls vs. single-pass. **Use** for compound questions ("X 와 Y 의 차이는?", prereq chains, cross-doc reasoning where one chunk alone is insufficient). **Don't** for simple fact-finding (single-pass is faster + cheaper). When set, `answer.v1.hops[]` carries the per-hop trace (`{iter, kind, sub_queries[], context_chunks_added, forced_stop, llm_call_ms}`) — surface a brief "Searched in N hops" note when the trace is non-trivial. Decompose-failure (model emitted non-JSON) → `refusal_reason = "multi_hop_decompose_failed"`; treat like any other refusal.
+- **v0.18+ multi-hop NLI verification** — multi-hop ask (`mcp__kebab__ask` with `multi_hop: true`) runs a post-synthesize NLI groundedness gate when `[rag] nli_threshold > 0` is set in the user's config. `answer.v1.verification.nli_passed == true` means the generated answer is entailed by the retrieved chunks (grounded); `false` means the answer is refused with `refusal_reason = "nli_verification_failed"` and the `verification` block still ships so the agent can show what entailment score was rejected. Threshold tuning: 0.5 is the production default, 0.9 is strict mode. If the NLI model download / inference fails the pipeline emits `refusal_reason = "nli_model_unavailable"` — user-side workaround is `[rag] nli_threshold = 0` then retry multi-hop. Single-pass `ask` (multi_hop: false / unset) is unaffected — it keeps the LLM self-judge gate as the only verification.

 ### `mcp__kebab__fetch` — when you need raw text

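
The refusal handling that the SKILL.md paragraph above asks of an agent can be sketched as a small mapping from the wire fields to a user-facing note. This is an illustrative sketch only: the `Verification` struct and `describe_outcome` function are hypothetical, and only the field names and the two refusal strings (`nli_verification_failed`, `nli_model_unavailable`) come from the text above.

```rust
/// Hypothetical, trimmed-down view of the `verification` block on the wire.
struct Verification {
    nli_score: f32,
    nli_threshold: f32,
    nli_passed: bool,
}

/// Turn the refusal reason and verification block into a short note.
fn describe_outcome(refusal_reason: Option<&str>, verification: Option<&Verification>) -> String {
    match (refusal_reason, verification) {
        // The gate ran and rejected the answer: show the rejected score.
        (Some("nli_verification_failed"), Some(v)) => format!(
            "Refused: entailment score {:.2} is below the configured threshold {:.2}.",
            v.nli_score, v.nli_threshold
        ),
        // The verifier could not run: point at the documented workaround.
        (Some("nli_model_unavailable"), _) => String::from(
            "Refused: the NLI model is unavailable. Setting `[rag] nli_threshold = 0` and retrying disables the gate.",
        ),
        // Any other refusal is passed through unchanged.
        (Some(other), _) => format!("Refused: {other}"),
        (None, Some(v)) if v.nli_passed => format!("Grounded (NLI score {:.2}).", v.nli_score),
        (None, _) => String::from("Answered without NLI verification."),
    }
}
```

Keeping the `nli_model_unavailable` arm independent of the verification block reflects the wire contract in this commit, where that refusal ships with `verification = None`.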