feat(p3-4): hybrid-fusion — VectorRetriever + HybridRetriever (RRF)

Composes the existing LexicalRetriever (P2-2) with a new VectorRetriever
wrapper around LanceVectorStore (P3-3) into a single Retriever that
dispatches by SearchMode. For SearchMode::Hybrid, fuses lexical and
vector candidates via Reciprocal Rank Fusion and populates the full
RetrievalDetail per SearchHit so kb search --explain can attribute
scores back to each side.

Public surface (kb-search crate):
- pub struct VectorRetriever — Arc<dyn VectorStore + Send + Sync>,
  Arc<dyn Embedder>, Arc<SqliteStore>, IndexVersion at construction.
- pub struct HybridRetriever { lexical, vector, fusion, k }.
- pub enum FusionPolicy { Rrf { k_rrf: u32 } }.

VectorRetriever:
- Embeds query.text as EmbeddingKind::Query before delegating to
  VectorStore::search(query_vec, query.k * 2, &query.filters). Over-
  fetches by ×2 for filter losses; LanceVectorStore applies the
  filters internally so they propagate naturally.
- Hydrates each VectorHit into a full SearchHit by joining on
  chunk_id in a single IN-clause batch (no N+1): doc_path,
  section_label, chunker_version, source_spans for citation, plus
  embedding_model from embedder.model_id().
- Snippet trimmed to config.search.snippet_chars (vector mode lacks
  FTS5 highlighting; chunk text prefix is the next-best signal).
- Citation built from the chunk's first source span via the shared
  citation_helper module — extracted from lexical.rs so both
  retrievers compute citations identically (Byte/empty fallback to
  Line{1,1} preserved with tracing::warn).
- RetrievalDetail.method = Vector for standalone calls; both
  fusion_score and vector_score set to the LanceVectorStore-shifted
  cosine score; lexical_* None.

HybridRetriever:
- Lexical / Vector modes delegate 1:1 — no rebuild of RetrievalDetail.
- Hybrid mode runs both retrievers with k * 2 fanout, fuses with
  RRF (score(c) = Σ 1/(k_rrf + rank_m(c))), sorts fused-score DESC
  with deterministic tiebreaker (lex_rank ASC then chunk_id ASC),
  takes top query.k. Fusion math runs in f64 throughout; cast to
  f32 only at the SearchHit boundary where bounded magnitude (≤
  ~0.033 at k_rrf=60) makes f32 precision sufficient for ranking.
- Per-hit lexical preferred for snippet/citation/heading_path/
  chunker_version/embedding_model when the chunk appears in both
  retrievers — FTS5 highlighting is more user-relevant than vector's
  truncated text. Vector-only chunks fall through to vector hit data.
- index_version returns format!("hybrid:{}+{}", lex_iv, vec_iv) at
  construction; mismatched lex/vec versions trigger a tracing::warn
  so users notice stale indexes (spec line 143).

kb-search additions:
- citation_helper.rs — pub(crate) citation_from_first_span shared
  between lexical and vector retrievers. Extracted from lexical.rs;
  no behavior drift.

Tests (38 default + 3 ignored):
- 12 unit tests in hybrid.rs covering RRF math (1/61 + 1/62 within
  f32 epsilon × 10 tolerance), lexical/vector mode delegation, hybrid
  preserves single-side hits with the missing side's RetrievalDetail
  None, deterministic tiebreaker on identical fused scores, composite
  index_version, mismatched-version warn at construction.
- 2 unit tests in vector.rs covering the snippet-prefix and citation
  fallback paths.
- 11 unit tests in lexical.rs (unchanged from P2-2).
- 13 lexical integration tests (unchanged).
- 3 #[ignore] AVX-gated hybrid integration tests: disjoint-corpus
  recall (lex returns A,B; vec returns C,D; hybrid returns all 4),
  determinism over two queries, snapshot stability against
  tests/fixtures/search/hybrid/run-1.json. Snapshot fixture was
  regenerated against this branch on an AVX-enabled VM and contains
  4 real chunks (c1/c2 lex+vec, c3/c4 vec-only).
- KB_UPDATE_SNAPSHOTS=1 path now panics after writing instead of
  silently passing — matches the P3-2/P3-3 fail-loud-instead-of-
  silent-pass philosophy.

Allowed deps respected (kb-core, kb-config, kb-store-sqlite,
kb-store-vector, kb-embed, tracing, thiserror) plus pre-existing
kb-search deps from P2-2 (rusqlite, globset, serde_json, anyhow).
kb-embed-local does NOT appear — VectorRetriever takes Arc<dyn Embedder>
trait object; the concrete adapter is runtime-injected by kb-app.

Out of scope: reranker (P+), score calibration across modes (RRF is
rank-comparable so absolute calibration is P+), multimodal retrieval
(P6+).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-01 11:22:21 +00:00
parent 60a0290d85
commit ccd49ef546
10 changed files with 1532 additions and 73 deletions

2
Cargo.lock generated
View File

@@ -3473,7 +3473,9 @@ dependencies = [
"globset",
"kb-config",
"kb-core",
"kb-embed",
"kb-store-sqlite",
"kb-store-vector",
"rusqlite",
"serde_json",
"tempfile",

View File

@@ -11,6 +11,14 @@ description = "Retriever implementations for kb (P2-2 lexical FTS5; P3 vector
kb-core = { path = "../kb-core" }
kb-config = { path = "../kb-config" }
kb-store-sqlite = { path = "../kb-store-sqlite" }
# P3-4 hybrid retriever wraps a `dyn VectorStore` (typically backed by
# `kb-store-vector::LanceVectorStore`) and a `dyn Embedder` (any P3-2
# adapter). Listed as a runtime dep so callers can construct
# `VectorRetriever::new` against the trait objects without a concrete
# adapter — the concrete adapter (`kb-embed-local`) stays out of this
# crate per the spec's Forbidden deps list.
kb-store-vector = { path = "../kb-store-vector" }
kb-embed = { path = "../kb-embed" }
rusqlite = { workspace = true }
globset = { workspace = true }
serde_json = { workspace = true }
@@ -20,3 +28,8 @@ anyhow = { workspace = true }
[dev-dependencies]
tempfile = { workspace = true }
# Hybrid integration tests inject a `MockEmbedder` (kb-embed `mock`
# feature) and stand up a real `LanceVectorStore` on a tmp directory.
# The mock-retriever unit tests (the bulk of the hybrid suite) do not
# need either, but the integration / snapshot lane does.
kb-embed = { path = "../kb-embed", features = ["mock"] }

View File

@@ -0,0 +1,74 @@
//! Shared helpers for building `kb_core::Citation` values from a
//! chunk's first `SourceSpan`.
//!
//! Both the lexical and vector retrievers join against the same
//! `chunks.source_spans_json` column and need identical mapping logic
//! so cross-mode citation strings round-trip byte-identically (a
//! requirement for the hybrid retriever's tie-break on chunk_id and
//! for the `search --explain` output documented in design §0 Q3 and
//! §1.6). Living here means a future PDF / image / audio extractor can
//! enrich the mapping in one place rather than two.
use kb_core::{Citation, SourceSpan, WorkspacePath};
/// Build a `Citation` from the chunk's first `SourceSpan`. P1 markdown
/// only emits `Line`, so the other variants are mostly defensive — we
/// forward them as faithfully as possible so a future PDF / image
/// extractor can flow through without churn.
///
/// `chunk_id` is taken only for diagnostic logging when the span shape
/// has no Citation mapping (`Byte`-spans, empty arrays).
pub(crate) fn citation_from_first_span(
chunk_id: &str,
path: WorkspacePath,
section: Option<String>,
first_span: Option<&SourceSpan>,
) -> Citation {
match first_span {
Some(SourceSpan::Line { start, end }) => Citation::Line {
path,
start: *start,
end: *end,
section,
},
Some(SourceSpan::Page { page, .. }) => Citation::Page {
path,
page: *page,
section,
},
Some(SourceSpan::Region { x, y, w, h }) => Citation::Region {
path,
x: *x,
y: *y,
w: *w,
h: *h,
},
Some(SourceSpan::Time { start_ms, end_ms }) => Citation::Time {
path,
start_ms: *start_ms,
end_ms: *end_ms,
speaker: None,
},
// Byte-spans don't have a Citation variant. Fall back to a
// Line citation pointing at the document head — better than
// fabricating a position. Spans-empty falls into the same
// branch.
other @ (Some(SourceSpan::Byte { .. }) | None) => {
let span_shape = match other {
Some(_) => "Byte",
None => "empty array",
};
tracing::warn!(
chunk_id,
span_shape,
"kb-search: SourceSpan has no Citation mapping; falling back to Line {{1, 1}}"
);
Citation::Line {
path,
start: 1,
end: 1,
section,
}
}
}
}

View File

@@ -0,0 +1,603 @@
//! Hybrid retriever — design §3.7 / §6.4 / §0 Q3 / §1.6.
//!
//! Composes a lexical and a vector retriever (both `dyn Retriever`)
//! and dispatches by `SearchMode`. For `Hybrid`, results are fused via
//! Reciprocal Rank Fusion (RRF):
//!
//! ```text
//! score(c) = Σ_{m ∈ {lex, vec}} 1 / (k_rrf + rank_m(c))
//! ```
//!
//! where `rank_m(c)` is the 1-based rank of chunk `c` in retriever
//! `m`'s output (chunks not appearing in `m` contribute 0).
//!
//! Each `SearchHit.retrieval` is rebuilt with the per-mode scores /
//! ranks the fusion observed, so `kb search --explain` (§1.6) can
//! show users exactly which retriever contributed what to the final
//! ordering.
use std::collections::HashMap;
use std::sync::Arc;
use anyhow::Result;
use kb_core::{
IndexVersion, RetrievalDetail, Retriever, SearchHit, SearchMode, SearchQuery,
};
/// Default `k_rrf` if `kb-config::SearchCfg::rrf_k` is misconfigured.
/// Matches §6.4's documented default (60).
const DEFAULT_K_RRF: u32 = 60;
/// When fanning out for hybrid fusion we ask each side for `k *
/// HYBRID_FANOUT_MULTIPLIER` candidates so the disjoint set of
/// chunks (those a single retriever surfaces but the other does not)
/// is wide enough to feed a useful fused top-k.
///
/// `2` is the spec-suggested floor; raising it helps recall on
/// adversarial corpora at linear cost. Documented in
/// `tasks/p3/p3-4-hybrid-fusion.md` "Risks / notes".
const HYBRID_FANOUT_MULTIPLIER: usize = 2;
/// Default `k` when `SearchQuery::k == 0`. Mirrors §6.4 default_k=10.
const DEFAULT_K: usize = 10;
/// Fusion algorithm. Today only Reciprocal Rank Fusion is supported;
/// listing as an enum so future score-calibration policies (P+) can
/// land without an API break.
#[derive(Clone, Copy, Debug)]
pub enum FusionPolicy {
/// Reciprocal Rank Fusion. `k_rrf` is the standard rank-bias
/// hyperparameter (§6.4); larger values flatten the rank-bias
/// curve, smaller values privilege top-of-list hits.
Rrf { k_rrf: u32 },
}
/// Hybrid retriever composing a lexical and a vector retriever.
///
/// For chunks that appear in both retrievers, the lexical-side hit
/// supplies `snippet`, `citation`, `heading_path`, `chunker_version`,
/// and `embedding_model` — lexical search has FTS5 highlighting that's
/// more user-relevant than the vector retriever's truncated text.
/// Vector-only chunks fall through to the vector hit's data verbatim.
/// This matches `kb search --explain` (§1.6) expectations for snippet
/// provenance.
pub struct HybridRetriever {
lexical: Arc<dyn Retriever>,
vector: Arc<dyn Retriever>,
fusion: FusionPolicy,
/// Default `k` for queries that arrive with `k == 0`. Pulled from
/// `config.search.default_k` at construction.
default_k: usize,
}
impl HybridRetriever {
/// Construct from a `kb-config` Config + the two underlying
/// retrievers. Reads `config.search.hybrid_fusion` (only `"rrf"`
/// is recognised today) and `config.search.rrf_k`.
pub fn new(
config: &kb_config::Config,
lexical: Arc<dyn Retriever>,
vector: Arc<dyn Retriever>,
) -> Self {
let fusion = parse_fusion(&config.search.hybrid_fusion, config.search.rrf_k);
let default_k = if config.search.default_k == 0 {
DEFAULT_K
} else {
config.search.default_k
};
// Surface mismatched index_version up front so users see it
// (e.g. lexical at v2, vector at v1 means a stale index that
// the user should refresh). Spec line 144 calls this out as
// a "flag at construction".
let lex_iv = lexical.index_version();
let vec_iv = vector.index_version();
if lex_iv.0 != vec_iv.0 {
tracing::warn!(
target: "kb-search",
lexical_index = %lex_iv.0,
vector_index = %vec_iv.0,
"kb-search hybrid: lexical and vector index_version differ; consider re-indexing"
);
}
Self {
lexical,
vector,
fusion,
default_k,
}
}
/// Construct with explicit policy / `k`. Used by tests that want
/// to pin RRF parameters without going through `kb-config`.
pub fn with_policy(
lexical: Arc<dyn Retriever>,
vector: Arc<dyn Retriever>,
fusion: FusionPolicy,
default_k: usize,
) -> Self {
Self {
lexical,
vector,
fusion,
default_k: if default_k == 0 { DEFAULT_K } else { default_k },
}
}
}
impl Retriever for HybridRetriever {
fn search(&self, query: &SearchQuery) -> Result<Vec<SearchHit>> {
match query.mode {
SearchMode::Lexical => self.lexical.search(query),
SearchMode::Vector => self.vector.search(query),
SearchMode::Hybrid => self.fuse(query),
}
}
fn index_version(&self) -> IndexVersion {
// Composite token so callers (e.g. snapshot tests) can detect
// either side drifting without inspecting both retrievers.
let lex = self.lexical.index_version().0;
let vec = self.vector.index_version().0;
IndexVersion(format!("hybrid:{lex}+{vec}"))
}
}
impl HybridRetriever {
fn fuse(&self, query: &SearchQuery) -> Result<Vec<SearchHit>> {
let target_k = if query.k == 0 { self.default_k } else { query.k };
// Fanout: ask each retriever for `target_k * MULTIPLIER` so
// the disjoint set of candidates is wide enough. The two
// per-side queries are identical (same text, k, mode, filters);
// only the dispatch differs, so we share one `SearchQuery`.
let fanout_k = target_k.saturating_mul(HYBRID_FANOUT_MULTIPLIER);
let lex_query = SearchQuery {
k: fanout_k,
..query.clone()
};
let lex_hits = self.lexical.search(&lex_query)?;
let vec_hits = self.vector.search(&lex_query)?;
tracing::debug!(
lex = lex_hits.len(),
vec = vec_hits.len(),
target_k,
"kb-search hybrid: pre-fusion candidate counts"
);
// Build (chunk_id → (rank, hit)) maps. The rank stored here
// is the `rank` field on each retriever's output, which is
// already 1-based by both LexicalRetriever and VectorRetriever
// (and any well-behaved Retriever should mirror).
let lex_index: HashMap<String, (u32, SearchHit)> = lex_hits
.into_iter()
.map(|h| (h.chunk_id.0.clone(), (h.rank, h)))
.collect();
let vec_index: HashMap<String, (u32, SearchHit)> = vec_hits
.into_iter()
.map(|h| (h.chunk_id.0.clone(), (h.rank, h)))
.collect();
// Union of chunk_ids from both sides.
let mut all_ids: Vec<String> = Vec::with_capacity(lex_index.len() + vec_index.len());
for k in lex_index.keys() {
all_ids.push(k.clone());
}
for k in vec_index.keys() {
if !lex_index.contains_key(k) {
all_ids.push(k.clone());
}
}
// Compute fused score per chunk.
let FusionPolicy::Rrf { k_rrf } = self.fusion;
let k_rrf_f = f64::from(k_rrf);
struct Scored {
chunk_id: String,
rrf: f64,
lex_rank: Option<u32>,
vec_rank: Option<u32>,
}
let mut scored: Vec<Scored> = all_ids
.into_iter()
.map(|cid| {
let lex_rank = lex_index.get(&cid).map(|(r, _)| *r);
let vec_rank = vec_index.get(&cid).map(|(r, _)| *r);
let mut rrf = 0.0_f64;
if let Some(r) = lex_rank {
rrf += 1.0 / (k_rrf_f + f64::from(r));
}
if let Some(r) = vec_rank {
rrf += 1.0 / (k_rrf_f + f64::from(r));
}
Scored {
chunk_id: cid,
rrf,
lex_rank,
vec_rank,
}
})
.collect();
// Sort: rrf DESC, then lex_rank ASC (None last), then chunk_id ASC.
// f64 ordering uses `total_cmp` so NaN stays deterministic
// (won't occur today — k_rrf > 0 → denominators > 0 — but
// total_cmp keeps the sort stable under future tweaks).
scored.sort_by(|a, b| {
b.rrf
.total_cmp(&a.rrf)
.then_with(|| {
let am = a.lex_rank.unwrap_or(u32::MAX);
let bm = b.lex_rank.unwrap_or(u32::MAX);
am.cmp(&bm)
})
.then_with(|| a.chunk_id.cmp(&b.chunk_id))
});
// Build final SearchHits, taking the top `target_k`.
let mut hits: Vec<SearchHit> = Vec::with_capacity(target_k.min(scored.len()));
let mut rank: u32 = 0;
for s in scored.into_iter().take(target_k) {
// Pull the underlying hit. Prefer the lexical side when
// available — its snippet has FTS5 highlighting which
// gives users the most useful preview. Fall back to
// vector if the chunk only appeared in vector results.
let mut base = match (lex_index.get(&s.chunk_id), vec_index.get(&s.chunk_id)) {
(Some((_, lex)), _) => lex.clone(),
(None, Some((_, vec))) => vec.clone(),
// `all_ids` is the union of `lex_index` and
// `vec_index` keys, so this arm cannot fire.
(None, None) => {
unreachable!("chunk_id was in union but absent from both indices")
}
};
// `unwrap_or(fusion_score)` covers a defensive-coding case
// that doesn't arise today: when a chunk only appears in
// one retriever, RRF sums a single term so `fusion_score`
// already equals that side's normalized score, making the
// fallback harmless.
let lex_score = lex_index
.get(&s.chunk_id)
.map(|(_, h)| h.retrieval.lexical_score.unwrap_or(h.retrieval.fusion_score));
let vec_score = vec_index
.get(&s.chunk_id)
.map(|(_, h)| h.retrieval.vector_score.unwrap_or(h.retrieval.fusion_score));
rank = rank.saturating_add(1);
base.rank = rank;
base.retrieval = RetrievalDetail {
method: SearchMode::Hybrid,
// RRF is computed in f64 inside `fuse` and cast to f32
// here at the boundary. `1/(k_rrf+rank)` is bounded
// roughly in `(0, 2/k_rrf]` (≤ ~0.033 at k_rrf=60), so
// the magnitude is well within f32 range and f32
// precision is more than sufficient for ranking.
fusion_score: s.rrf as f32,
lexical_score: lex_score,
vector_score: vec_score,
lexical_rank: s.lex_rank,
vector_rank: s.vec_rank,
};
hits.push(base);
}
tracing::debug!(rows = hits.len(), "kb-search hybrid: search done");
Ok(hits)
}
}
/// Parse the `hybrid_fusion` config string into a [`FusionPolicy`].
/// Today only `"rrf"` is recognised; anything else falls back to RRF
/// with a warn log so misconfiguration is visible but not fatal.
fn parse_fusion(name: &str, k_rrf: u32) -> FusionPolicy {
let k = if k_rrf == 0 { DEFAULT_K_RRF } else { k_rrf };
match name {
"rrf" => FusionPolicy::Rrf { k_rrf: k },
other => {
tracing::warn!(
target: "kb-search",
policy = other,
"kb-search hybrid: unknown fusion policy; falling back to RRF"
);
FusionPolicy::Rrf { k_rrf: k }
}
}
}
#[cfg(test)]
mod tests {
use super::*;
use kb_core::{
ChunkId, ChunkerVersion, Citation, DocumentId, IndexVersion, SearchFilters,
SearchHit, SearchMode, WorkspacePath,
};
use std::sync::Mutex;
/// Test double: returns a canned `Vec<SearchHit>` and records
/// every call so we can assert delegation.
struct CannedRetriever {
hits: Vec<SearchHit>,
calls: Mutex<Vec<SearchQuery>>,
version: IndexVersion,
}
impl CannedRetriever {
fn new(hits: Vec<SearchHit>, version: &str) -> Self {
Self {
hits,
calls: Mutex::new(Vec::new()),
version: IndexVersion(version.to_string()),
}
}
}
impl Retriever for CannedRetriever {
fn search(&self, query: &SearchQuery) -> Result<Vec<SearchHit>> {
self.calls.lock().unwrap().push(query.clone());
Ok(self.hits.clone())
}
fn index_version(&self) -> IndexVersion {
self.version.clone()
}
}
fn wp(p: &str) -> WorkspacePath {
WorkspacePath::new(p.to_string()).unwrap()
}
/// Build a synthetic `SearchHit`. Most fields take inert defaults
/// because the hybrid logic only reads `chunk_id`, `rank`,
/// `retrieval.{lexical,vector}_score`, and (transitively) the rest
/// when building the fused output.
fn mk_hit(
chunk_id: &str,
rank: u32,
method: SearchMode,
score: f32,
) -> SearchHit {
let cid = ChunkId(chunk_id.to_string());
let did = DocumentId(format!("d-{chunk_id}"));
let path = wp(&format!("notes/{chunk_id}.md"));
SearchHit {
rank,
chunk_id: cid,
doc_id: did,
doc_path: path.clone(),
heading_path: vec![],
section_label: None,
snippet: format!("snippet for {chunk_id}"),
citation: Citation::Line {
path,
start: 1,
end: 1,
section: None,
},
retrieval: RetrievalDetail {
method,
fusion_score: score,
lexical_score: matches!(method, SearchMode::Lexical | SearchMode::Hybrid)
.then_some(score),
vector_score: matches!(method, SearchMode::Vector | SearchMode::Hybrid)
.then_some(score),
lexical_rank: matches!(method, SearchMode::Lexical | SearchMode::Hybrid)
.then_some(rank),
vector_rank: matches!(method, SearchMode::Vector | SearchMode::Hybrid)
.then_some(rank),
},
index_version: IndexVersion("v1".to_string()),
embedding_model: None,
chunker_version: ChunkerVersion("v1".to_string()),
}
}
fn rrf_policy(k_rrf: u32) -> FusionPolicy {
FusionPolicy::Rrf { k_rrf }
}
fn make_query(mode: SearchMode, k: usize) -> SearchQuery {
SearchQuery {
text: "rust".to_string(),
mode,
k,
filters: SearchFilters::default(),
}
}
#[test]
fn hybrid_lexical_mode_delegates_to_lexical() {
let lex_hits = vec![mk_hit("aaaa", 1, SearchMode::Lexical, 0.9)];
let lex = Arc::new(CannedRetriever::new(lex_hits.clone(), "lex-v1"));
let vec = Arc::new(CannedRetriever::new(vec![], "vec-v1"));
let h = HybridRetriever::with_policy(lex.clone(), vec.clone(), rrf_policy(60), 5);
let out = h.search(&make_query(SearchMode::Lexical, 5)).unwrap();
assert_eq!(out, lex_hits, "lexical mode must pass through verbatim");
assert_eq!(lex.calls.lock().unwrap().len(), 1, "lexical called once");
assert_eq!(vec.calls.lock().unwrap().len(), 0, "vector NOT called");
}
#[test]
fn hybrid_vector_mode_delegates_to_vector() {
let vec_hits = vec![mk_hit("bbbb", 1, SearchMode::Vector, 0.8)];
let lex = Arc::new(CannedRetriever::new(vec![], "lex-v1"));
let vec = Arc::new(CannedRetriever::new(vec_hits.clone(), "vec-v1"));
let h = HybridRetriever::with_policy(lex.clone(), vec.clone(), rrf_policy(60), 5);
let out = h.search(&make_query(SearchMode::Vector, 5)).unwrap();
assert_eq!(out, vec_hits, "vector mode must pass through verbatim");
assert_eq!(lex.calls.lock().unwrap().len(), 0, "lexical NOT called");
assert_eq!(vec.calls.lock().unwrap().len(), 1, "vector called once");
}
#[test]
fn hybrid_chunk_only_in_lexical_keeps_vector_none() {
// Chunk X is in lexical only.
let lex = Arc::new(CannedRetriever::new(
vec![mk_hit("xxxx", 1, SearchMode::Lexical, 0.9)],
"lex-v1",
));
let vec = Arc::new(CannedRetriever::new(
vec![mk_hit("yyyy", 1, SearchMode::Vector, 0.8)],
"vec-v1",
));
let h = HybridRetriever::with_policy(lex, vec, rrf_policy(60), 5);
let out = h.search(&make_query(SearchMode::Hybrid, 5)).unwrap();
// Both X and Y are present.
let xx = out.iter().find(|h| h.chunk_id.0 == "xxxx").unwrap();
assert_eq!(xx.retrieval.method, SearchMode::Hybrid);
assert!(xx.retrieval.lexical_score.is_some());
assert_eq!(xx.retrieval.vector_score, None);
assert_eq!(xx.retrieval.lexical_rank, Some(1));
assert_eq!(xx.retrieval.vector_rank, None);
assert!(xx.retrieval.fusion_score > 0.0);
let yy = out.iter().find(|h| h.chunk_id.0 == "yyyy").unwrap();
assert_eq!(yy.retrieval.lexical_score, None);
assert!(yy.retrieval.vector_score.is_some());
assert_eq!(yy.retrieval.lexical_rank, None);
assert_eq!(yy.retrieval.vector_rank, Some(1));
}
#[test]
fn rrf_formula_matches_known_value() {
// chunk A appears at lexical rank 1, vector rank 2; k_rrf=60.
// Expected: 1/(60+1) + 1/(60+2) = 1/61 + 1/62.
let expected = 1.0_f64 / 61.0 + 1.0_f64 / 62.0;
let lex = Arc::new(CannedRetriever::new(
vec![mk_hit("aaaa", 1, SearchMode::Lexical, 0.5)],
"lex-v1",
));
let vec_hits = vec![
mk_hit("zzzz", 1, SearchMode::Vector, 0.9),
mk_hit("aaaa", 2, SearchMode::Vector, 0.7),
];
let vec = Arc::new(CannedRetriever::new(vec_hits, "vec-v1"));
let h = HybridRetriever::with_policy(lex, vec, rrf_policy(60), 5);
let out = h.search(&make_query(SearchMode::Hybrid, 5)).unwrap();
let a = out.iter().find(|h| h.chunk_id.0 == "aaaa").unwrap();
let actual = a.retrieval.fusion_score as f64;
// Tolerance: the score is computed in f64 and cast to f32 at
// the API boundary, so any discrepancy must fit within f32
// precision. `1e-7` is below `f32::EPSILON` (~1.19e-7), which
// makes the check brittle on edge cases. Use a small multiple
// of EPSILON to stay robust.
let tol = f64::from(f32::EPSILON) * 10.0;
assert!(
(actual - expected).abs() < tol,
"RRF score {actual} drifted from expected {expected} (tol {tol})"
);
}
#[test]
fn hybrid_tiebreak_prefers_lower_lexical_rank_then_chunk_id() {
// Construct two chunks with identical fused scores.
// Strategy: A appears at lex rank 2 only → score = 1/62.
// B appears at vec rank 2 only → score = 1/62.
// Tie-break: lex_rank ascending (Some(2) < None), so A wins.
let lex = Arc::new(CannedRetriever::new(
vec![
mk_hit("zzzz", 1, SearchMode::Lexical, 0.9), // rank 1: high RRF, leader
mk_hit("aaaa", 2, SearchMode::Lexical, 0.5), // rank 2
],
"lex-v1",
));
let vec = Arc::new(CannedRetriever::new(
vec![
mk_hit("zzzz", 1, SearchMode::Vector, 0.9),
mk_hit("bbbb", 2, SearchMode::Vector, 0.5),
],
"vec-v1",
));
let h = HybridRetriever::with_policy(lex, vec, rrf_policy(60), 5);
let out = h.search(&make_query(SearchMode::Hybrid, 5)).unwrap();
// zzzz has both ranks → strictly higher RRF → rank 1.
assert_eq!(out[0].chunk_id.0, "zzzz");
// aaaa and bbbb both have a single rank-2 contribution → identical
// RRF. Tie-break: aaaa has lex_rank=Some(2), bbbb has lex_rank=None,
// so aaaa comes first.
assert_eq!(out[1].chunk_id.0, "aaaa");
assert_eq!(out[2].chunk_id.0, "bbbb");
// Now construct two chunks with identical lex rank to verify
// the chunk_id tie-break. CannedRetriever can't produce two
// hits at the same rank via mk_hit's normal flow, so we patch
// `retrieval.lexical_rank` directly after construction.
let mut tied_a = mk_hit("aaaa", 2, SearchMode::Lexical, 0.4);
tied_a.retrieval.lexical_rank = Some(2);
let mut tied_b = mk_hit("bbbb", 2, SearchMode::Lexical, 0.4);
tied_b.retrieval.lexical_rank = Some(2);
let lex3 = Arc::new(CannedRetriever::new(
vec![tied_a, tied_b],
"lex-v1",
));
let vec3 = Arc::new(CannedRetriever::new(vec![], "vec-v1"));
let h3 = HybridRetriever::with_policy(lex3, vec3, rrf_policy(60), 5);
let out3 = h3.search(&make_query(SearchMode::Hybrid, 5)).unwrap();
// Same lex_rank=2 → tie-break on chunk_id ascending: aaaa < bbbb.
assert_eq!(out3[0].chunk_id.0, "aaaa");
assert_eq!(out3[1].chunk_id.0, "bbbb");
}
#[test]
fn hybrid_index_version_is_composite() {
let lex = Arc::new(CannedRetriever::new(vec![], "lex-v1"));
let vec = Arc::new(CannedRetriever::new(vec![], "vec-v2"));
let h = HybridRetriever::with_policy(lex, vec, rrf_policy(60), 5);
assert_eq!(h.index_version().0, "hybrid:lex-v1+vec-v2");
}
#[test]
fn hybrid_disjoint_recall_returns_all_when_k_large_enough() {
// lex returns [A, B], vec returns [C, D]; k=4 → all 4 in result.
let lex = Arc::new(CannedRetriever::new(
vec![
mk_hit("aaaa", 1, SearchMode::Lexical, 0.9),
mk_hit("bbbb", 2, SearchMode::Lexical, 0.7),
],
"lex-v1",
));
let vec = Arc::new(CannedRetriever::new(
vec![
mk_hit("cccc", 1, SearchMode::Vector, 0.9),
mk_hit("dddd", 2, SearchMode::Vector, 0.7),
],
"vec-v1",
));
let h = HybridRetriever::with_policy(lex, vec, rrf_policy(60), 4);
let out = h.search(&make_query(SearchMode::Hybrid, 4)).unwrap();
let mut ids: Vec<&str> = out.iter().map(|h| h.chunk_id.0.as_str()).collect();
ids.sort();
assert_eq!(ids, vec!["aaaa", "bbbb", "cccc", "dddd"]);
}
#[test]
fn hybrid_zero_k_uses_default() {
// With query.k=0, hybrid should use the configured default_k.
let lex = Arc::new(CannedRetriever::new(
(0..20)
.map(|i| mk_hit(&format!("c{i:04}"), i + 1, SearchMode::Lexical, 0.5))
.collect(),
"lex-v1",
));
let vec = Arc::new(CannedRetriever::new(vec![], "vec-v1"));
let h = HybridRetriever::with_policy(lex, vec, rrf_policy(60), 7);
let out = h.search(&make_query(SearchMode::Hybrid, 0)).unwrap();
assert_eq!(out.len(), 7);
}
#[test]
fn parse_fusion_falls_back_to_rrf_on_unknown() {
let p = parse_fusion("nonsense", 60);
let FusionPolicy::Rrf { k_rrf } = p;
assert_eq!(k_rrf, 60);
}
#[test]
fn parse_fusion_zero_k_falls_back_to_default() {
let FusionPolicy::Rrf { k_rrf } = parse_fusion("rrf", 0);
assert_eq!(k_rrf, DEFAULT_K_RRF);
}
}

View File

@@ -10,13 +10,15 @@ use std::sync::Arc;
use anyhow::{Context, Result};
use globset::GlobMatcher;
use kb_core::{
ChunkId, ChunkerVersion, Citation, DocumentId, IndexVersion, RetrievalDetail,
Retriever, SearchFilters, SearchHit, SearchMode, SearchQuery, SourceSpan,
TrustLevel, WorkspacePath,
ChunkId, ChunkerVersion, DocumentId, IndexVersion, RetrievalDetail, Retriever,
SearchFilters, SearchHit, SearchMode, SearchQuery, SourceSpan, TrustLevel,
WorkspacePath,
};
use kb_store_sqlite::SqliteStore;
use rusqlite::{params_from_iter, Connection, Row, ToSql};
use crate::citation_helper::citation_from_first_span;
// ── Tunables ─────────────────────────────────────────────────────────────
/// FTS5 hard limit on the `snippet()` `nToken` argument.
@@ -367,7 +369,7 @@ fn build_hit(
let workspace_path = WorkspacePath::new(raw.workspace_path)
.context("kb-search lexical: documents.workspace_path violates WorkspacePath invariant")?;
let citation = build_citation(
let citation = citation_from_first_span(
&raw.chunk_id,
workspace_path.clone(),
raw.section_label.clone(),
@@ -413,64 +415,6 @@ fn normalize_bm25(bm25_raw: f64) -> f32 {
normalized as f32
}
/// Build a `Citation` from the chunk's first `SourceSpan`. P1 markdown
/// only emits `Line`, so the other variants are mostly defensive — we
/// forward them as faithfully as possible so a future PDF / image
/// extractor can flow through without churn.
fn build_citation(
chunk_id: &str,
path: WorkspacePath,
section: Option<String>,
first_span: Option<&SourceSpan>,
) -> Citation {
match first_span {
Some(SourceSpan::Line { start, end }) => Citation::Line {
path,
start: *start,
end: *end,
section,
},
Some(SourceSpan::Page { page, .. }) => Citation::Page {
path,
page: *page,
section,
},
Some(SourceSpan::Region { x, y, w, h }) => Citation::Region {
path,
x: *x,
y: *y,
w: *w,
h: *h,
},
Some(SourceSpan::Time { start_ms, end_ms }) => Citation::Time {
path,
start_ms: *start_ms,
end_ms: *end_ms,
speaker: None,
},
// Byte-spans don't have a Citation variant. Fall back to a Line
// citation pointing at the document head — better than fabricating
// a position. Spans-empty falls into the same branch.
other @ (Some(SourceSpan::Byte { .. }) | None) => {
let span_shape = match other {
Some(_) => "Byte",
None => "empty array",
};
tracing::warn!(
chunk_id,
span_shape,
"kb-search lexical: SourceSpan has no Citation mapping; falling back to Line {{1, 1}}"
);
Citation::Line {
path,
start: 1,
end: 1,
section,
}
}
}
}
/// Cap the snippet at `max_chars` characters (Unicode scalar values, not
/// bytes — matches the §6.4 setting's "characters" semantics). Returns
/// the input unchanged when already short enough.
@@ -579,9 +523,10 @@ mod tests {
#[test]
fn build_citation_line_round_trip() {
use kb_core::Citation;
let p = WorkspacePath::new("a/b.md".to_string()).unwrap();
let span = SourceSpan::Line { start: 7, end: 12 };
let c = build_citation("c1", p.clone(), Some("S1".to_string()), Some(&span));
let c = citation_from_first_span("c1", p.clone(), Some("S1".to_string()), Some(&span));
match c {
Citation::Line {
start,
@@ -600,13 +545,14 @@ mod tests {
#[test]
fn build_citation_page_forwards_section() {
use kb_core::Citation;
let p = WorkspacePath::new("doc.pdf".to_string()).unwrap();
let span = SourceSpan::Page {
page: 4,
char_start: None,
char_end: None,
};
let c = build_citation("c1", p, Some("Intro".to_string()), Some(&span));
let c = citation_from_first_span("c1", p, Some("Intro".to_string()), Some(&span));
match c {
Citation::Page {
page,
@@ -622,8 +568,9 @@ mod tests {
#[test]
fn build_citation_none_falls_back_to_line_one() {
use kb_core::Citation;
let p = WorkspacePath::new("x.md".to_string()).unwrap();
let c = build_citation("c1", p, None, None);
let c = citation_from_first_span("c1", p, None, None);
match c {
Citation::Line { start, end, .. } => {
assert_eq!((start, end), (1, 1));

View File

@@ -1,15 +1,26 @@
//! `kb-search` — `kb_core::Retriever` implementations.
//!
//! P2-2 ships [`LexicalRetriever`], a SQLite-FTS5-backed retriever for
//! `SearchMode::Lexical`. Vector + Hybrid retrievers land in P3-3 / P3-4.
//! - [`LexicalRetriever`] (P2-2): SQLite-FTS5 + bm25 backed retriever
//! for `SearchMode::Lexical`.
//! - [`VectorRetriever`] (P3-4): wraps a `dyn VectorStore` (typically
//! `kb-store-vector::LanceVectorStore`) and a `dyn Embedder`,
//! hydrating SQLite metadata for full `SearchHit`s.
//! - [`HybridRetriever`] (P3-4): composes lexical + vector retrievers,
//! dispatches by `SearchMode`, fuses Hybrid via [`FusionPolicy::Rrf`].
//!
//! Allowed deps per task spec: `kb-core`, `kb-config`, `kb-store-sqlite`,
//! `rusqlite`, `globset`, `tracing`, `thiserror`, `anyhow`. Forbidden:
//! `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`,
//! `kb-store-vector`, `kb-embed*`, `kb-llm*`, `kb-rag`, `kb-tui`,
//! `kb-desktop`. Only `serde_json` is a transitive helper used to decode
//! JSON-typed columns from `chunks` / `documents`.
//! Allowed deps per the P2-2 + P3-4 task specs: `kb-core`, `kb-config`,
//! `kb-store-sqlite`, `kb-store-vector`, `kb-embed` (trait re-export
//! only — concrete adapters like `kb-embed-local` are runtime-injected
//! via `Arc<dyn Embedder>`), `rusqlite`, `globset`, `serde_json`,
//! `tracing`, `thiserror`, `anyhow`. Forbidden: `kb-source-fs`,
//! `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-embed-local` (concrete
//! adapter), `kb-llm*`, `kb-rag`, `kb-tui`, `kb-desktop`.
mod citation_helper;
mod hybrid;
mod lexical;
mod vector;
pub use hybrid::{FusionPolicy, HybridRetriever};
pub use lexical::LexicalRetriever;
pub use vector::VectorRetriever;

View File

@@ -0,0 +1,338 @@
//! Vector retriever — design §3.7 / §7.2 / §1.6.
//!
//! Wraps a `dyn VectorStore` + `dyn Embedder` + the SQLite metadata
//! store into a `kb_core::Retriever`. The vector store knows how to
//! find the nearest chunks by cosine on the embedding column; SQLite
//! owns the human-readable metadata (heading_path / section_label /
//! source_spans / chunker_version / workspace_path) needed for
//! `SearchHit` and `Citation`. The retriever stitches them together
//! per spec §7.2.
//!
//! Snippet policy: this retriever has no FTS5 highlighter to lean on,
//! so the `snippet` field is the chunk text trimmed to
//! `config.search.snippet_chars` Unicode scalar values. The lexical
//! retriever does query-token highlighting; downstream UI code should
//! continue to surface lexical snippets for hybrid hits where the
//! lexical side contributed (handled in `HybridRetriever::search`).
use std::collections::HashMap;
use std::sync::Arc;
use anyhow::{Context, Result};
use kb_core::{
ChunkId, ChunkerVersion, DocumentId, Embedder, EmbeddingInput, EmbeddingKind,
IndexVersion, RetrievalDetail, Retriever, SearchHit, SearchMode, SearchQuery,
SourceSpan, VectorHit, VectorStore, WorkspacePath,
};
use kb_store_sqlite::SqliteStore;
use rusqlite::params_from_iter;
use crate::citation_helper::citation_from_first_span;
/// Default `k` when `SearchQuery::k == 0`. Mirrors §6.4 default_k=10
/// and the lexical retriever's `DEFAULT_K`.
const DEFAULT_K: usize = 10;
/// Over-fetch multiplier passed to `VectorStore::search` so that
/// SQLite-side filter losses (tags / lang / trust / path_glob) still
/// leave at least `k` candidates. The Lance store already applies the
/// same filters internally; the extra `* 2` is the spec-mandated
/// safety margin for the `Retriever` layer (§7.2 spec line 138).
const VECTOR_OVERFETCH_MULTIPLIER: usize = 2;
/// Wraps a vector store + embedder into a [`Retriever`].
///
/// `VectorStore` is not declared `Send + Sync` in `kb-core::traits`,
/// but `Retriever` requires both. We constrain the trait objects
/// here so callers must hand us implementations that already are
/// (`LanceVectorStore` is `Send + Sync` thanks to its
/// `Connection`/`Runtime` ownership; the trait is sync-method-only).
pub struct VectorRetriever {
store: Arc<dyn VectorStore + Send + Sync>,
embed: Arc<dyn Embedder>,
sqlite: Arc<SqliteStore>,
index_version: IndexVersion,
snippet_chars: usize,
}
impl VectorRetriever {
/// Construct with `index_version` derived from the configured
/// embedding model + dimensions, and snippet width pulled from
/// `kb-config`'s defaults.
///
/// The explicit `index_version` form is [`Self::with_settings`].
pub fn new(
store: Arc<dyn VectorStore + Send + Sync>,
embed: Arc<dyn Embedder>,
sqlite: Arc<SqliteStore>,
index_version: IndexVersion,
) -> Self {
let cfg = kb_config::Config::defaults();
Self::with_settings(store, embed, sqlite, index_version, cfg.search.snippet_chars)
}
/// Construct with explicit `snippet_chars`. Mirrors the lexical
/// retriever's `with_settings` constructor for callers that have
/// already loaded a `Config`.
pub fn with_settings(
store: Arc<dyn VectorStore + Send + Sync>,
embed: Arc<dyn Embedder>,
sqlite: Arc<SqliteStore>,
index_version: IndexVersion,
snippet_chars: usize,
) -> Self {
Self {
store,
embed,
sqlite,
index_version,
snippet_chars,
}
}
}
impl Retriever for VectorRetriever {
fn search(&self, query: &SearchQuery) -> Result<Vec<SearchHit>> {
let k = if query.k == 0 { DEFAULT_K } else { query.k };
tracing::debug!(
text_len = query.text.len(),
k,
"kb-search vector: search start"
);
// Empty / whitespace-only queries — short-circuit. The
// embedder would still produce a vector for an empty string,
// but nearest-neighbours on the centroid of "" is meaningless
// and only forces a wasted Lance scan.
if query.text.trim().is_empty() {
return Ok(Vec::new());
}
// 1. Embed the query as `Query` kind (e5-style asymmetry —
// documents and queries have different prefixes).
let inputs = [EmbeddingInput {
text: &query.text,
kind: EmbeddingKind::Query,
}];
let mut embeddings = self
.embed
.embed(&inputs)
.context("kb-search vector: embed query")?;
if embeddings.len() != 1 {
anyhow::bail!(
"kb-search vector: embedder returned {} vectors for one input",
embeddings.len()
);
}
let query_vec = embeddings.remove(0);
// 2. Over-fetch from the vector store. The Lance store
// applies `filter_chunks` internally, so we pass `query.filters`
// through and trust the post-filter pass to honour them.
// `saturating_mul(2)` is always ≥ k for any usize k, so we
// don't need an extra `.max(k)` clamp.
let overfetch = k.saturating_mul(VECTOR_OVERFETCH_MULTIPLIER);
let raw_hits = self
.store
.search(&query_vec, overfetch, &query.filters)
.context("kb-search vector: VectorStore::search")?;
if raw_hits.is_empty() {
tracing::debug!("kb-search vector: store returned no hits");
return Ok(Vec::new());
}
// 3. Hydrate metadata from SQLite for the candidate ids in
// one round-trip. Order is preserved by the caller via the
// HashMap lookup at hit-construction time.
let candidate_ids: Vec<&str> =
raw_hits.iter().map(|h| h.chunk_id.0.as_str()).collect();
let hydration = hydrate_chunks(&self.sqlite, &candidate_ids)
.context("kb-search vector: hydrate chunk metadata")?;
// 4. Build `SearchHit` for the first `k` raw hits that pass
// hydration (a missing row would be a filter-induced drop —
// Lance returned the chunk but SQLite filtered it out, or
// the chunk was deleted between Lance's read and ours).
let model_id = self.embed.model_id();
let mut hits: Vec<SearchHit> = Vec::with_capacity(k.min(raw_hits.len()));
let mut rank: u32 = 0;
for hit in raw_hits {
let Some(meta) = hydration.get(hit.chunk_id.0.as_str()) else {
continue;
};
rank = rank.saturating_add(1);
hits.push(build_hit(
hit,
meta,
rank,
&self.index_version,
&model_id,
self.snippet_chars,
)?);
if hits.len() >= k {
break;
}
}
tracing::debug!(rows = hits.len(), "kb-search vector: search done");
Ok(hits)
}
fn index_version(&self) -> IndexVersion {
self.index_version.clone()
}
}
// ── Hydration ────────────────────────────────────────────────────────────
/// Subset of `chunks` + `documents` metadata needed to build a
/// `SearchHit` from a `VectorHit`. Pulled in one round-trip so the
/// per-hit construction loop stays O(1) per row.
struct ChunkMeta {
text: String,
heading_path_json: String,
section_label: Option<String>,
source_spans_json: String,
chunker_version: String,
doc_id: String,
workspace_path: String,
}
fn hydrate_chunks(
sqlite: &SqliteStore,
chunk_ids: &[&str],
) -> Result<HashMap<String, ChunkMeta>> {
if chunk_ids.is_empty() {
return Ok(HashMap::new());
}
// Deduplicate the IN-list — Lance can repeat a chunk_id across
// batches in pathological cases. A HashMap key dedupes in the
// result anyway, but keeping the placeholder count tight is good
// hygiene.
let mut seen = std::collections::HashSet::new();
let unique: Vec<&str> = chunk_ids
.iter()
.copied()
.filter(|id| seen.insert(*id))
.collect();
let placeholders = vec!["?"; unique.len()].join(",");
let sql = format!(
"SELECT \
c.chunk_id, c.text, c.heading_path_json, c.section_label, \
c.source_spans_json, c.chunker_version, \
c.doc_id, d.workspace_path \
FROM chunks c \
JOIN documents d ON d.doc_id = c.doc_id \
WHERE c.chunk_id IN ({placeholders})"
);
let conn = sqlite.read_conn();
let mut stmt = conn
.prepare(&sql)
.context("kb-search vector: prepare hydration statement")?;
let rows = stmt
.query_map(
// `unique` is a `Vec<&str>`; `&str` implements `ToSql`
// directly, so we hand the iterator straight to
// `params_from_iter` without copying.
params_from_iter(unique.iter().copied()),
|row| {
let chunk_id: String = row.get(0)?;
Ok((
chunk_id,
ChunkMeta {
text: row.get(1)?,
heading_path_json: row.get(2)?,
section_label: row.get(3)?,
source_spans_json: row.get(4)?,
chunker_version: row.get(5)?,
doc_id: row.get(6)?,
workspace_path: row.get(7)?,
},
))
},
)
.context("kb-search vector: execute hydration query")?;
let mut out: HashMap<String, ChunkMeta> = HashMap::with_capacity(unique.len());
for row in rows {
let (chunk_id, meta) =
row.context("kb-search vector: read hydration row")?;
out.insert(chunk_id, meta);
}
Ok(out)
}
fn build_hit(
hit: VectorHit,
meta: &ChunkMeta,
rank: u32,
index_version: &IndexVersion,
model_id: &kb_core::EmbeddingModelId,
snippet_chars: usize,
) -> Result<SearchHit> {
let heading_path: Vec<String> = serde_json::from_str(&meta.heading_path_json)
.context("kb-search vector: deserialize heading_path_json")?;
let source_spans: Vec<SourceSpan> = serde_json::from_str(&meta.source_spans_json)
.context("kb-search vector: deserialize source_spans_json")?;
let workspace_path = WorkspacePath::new(meta.workspace_path.clone()).context(
"kb-search vector: documents.workspace_path violates WorkspacePath invariant",
)?;
let citation = citation_from_first_span(
&hit.chunk_id.0,
workspace_path.clone(),
meta.section_label.clone(),
source_spans.first(),
);
let snippet = trim_snippet(&meta.text, snippet_chars);
let score = hit.score;
Ok(SearchHit {
rank,
chunk_id: ChunkId(hit.chunk_id.0),
doc_id: DocumentId(meta.doc_id.clone()),
doc_path: workspace_path,
heading_path,
section_label: meta.section_label.clone(),
snippet,
citation,
retrieval: RetrievalDetail {
method: SearchMode::Vector,
fusion_score: score,
lexical_score: None,
vector_score: Some(score),
lexical_rank: None,
vector_rank: Some(rank),
},
index_version: index_version.clone(),
embedding_model: Some(model_id.clone()),
chunker_version: ChunkerVersion(meta.chunker_version.clone()),
})
}
/// Cap the snippet at `max_chars` Unicode scalar values. Mirrors
/// `lexical::trim_snippet` so the two retrievers produce identically
/// shaped snippets for hybrid output.
fn trim_snippet(s: &str, max_chars: usize) -> String {
if s.chars().count() <= max_chars {
return s.to_string();
}
s.chars().take(max_chars).collect()
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn trim_snippet_caps_at_char_count() {
let s = "a".repeat(300);
assert_eq!(trim_snippet(&s, 220).chars().count(), 220);
}
#[test]
fn trim_snippet_passthrough_when_short() {
assert_eq!(trim_snippet("short", 220), "short");
}
}

View File

@@ -0,0 +1,216 @@
//! Shared scaffolding for kb-search hybrid integration tests.
//!
//! # Test policy
//!
//! Integration tests in `hybrid.rs` that touch `LanceVectorStore`
//! are marked `#[ignore]` AND call [`require_avx_or_panic`] inside
//! the test body so a `--ignored` invocation on a non-AVX host
//! fails loudly with a clear message rather than crashing later
//! inside Lance's f32 SIMD kernel with `SIGILL`.
//!
//! See `crates/kb-store-vector/tests/common/mod.rs` for the
//! original P3-3 rationale; this is a copy because that crate's
//! test commons are test-only and not part of its public surface.
#![allow(dead_code)]
use std::sync::Arc;
use kb_config::Config;
use kb_core::{
ChunkId, DocumentId, EmbeddingId, EmbeddingInput, EmbeddingKind,
EmbeddingModelId, EmbeddingVersion, IndexVersion, VectorRecord, VectorStore,
};
use kb_embed::{Embedder, MockEmbedder};
use kb_search::{LexicalRetriever, VectorRetriever};
use kb_store_sqlite::SqliteStore;
use kb_store_vector::LanceVectorStore;
use rusqlite::params;
use tempfile::TempDir;
/// Panic if the host CPU lacks AVX. Called from every `#[ignore]`-d
/// integration test body so that `cargo test -- --ignored` on a
/// non-AVX host fails loudly with a clear message instead of crashing
/// later inside a Lance SIMD kernel with `SIGILL`.
pub fn require_avx_or_panic() {
#[cfg(target_arch = "x86_64")]
{
if !std::is_x86_feature_detected!("avx") {
panic!(
"kb-search hybrid integration test requires AVX-capable hardware; \
host CPU lacks AVX. Run on an AVX-capable machine."
);
}
}
}
/// Index version label used by hybrid integration tests so the
/// `index_version()` composite token is predictable in snapshots.
pub const TEST_LEX_INDEX_VERSION: &str = "v1.0-lex";
pub const TEST_VEC_INDEX_VERSION: &str = "v1.0-vec";
/// Embedding dimensions for tests. Kept small so MockEmbedder runs
/// fast and the Lance table stays compact on disk; production uses
/// 384 (multilingual-e5-small) but the retriever code is dim-agnostic.
pub const TEST_DIMENSIONS: usize = 16;
pub const TEST_MODEL_ID: &str = "mock-e5";
pub struct HybridEnv {
pub temp: TempDir,
pub config: Config,
pub sqlite: Arc<SqliteStore>,
pub vector_store: Arc<LanceVectorStore>,
pub embedder: Arc<MockEmbedder>,
}
impl HybridEnv {
pub fn new() -> Self {
let temp = tempfile::tempdir().expect("tempdir");
let mut config = Config::defaults();
config.storage.data_dir = temp.path().to_string_lossy().into_owned();
let sqlite = SqliteStore::open(&config).unwrap();
sqlite.run_migrations().unwrap();
let sqlite = Arc::new(sqlite);
let vector_store =
Arc::new(LanceVectorStore::new(&config, sqlite.clone()).unwrap());
let embedder = Arc::new(MockEmbedder::new(
EmbeddingModelId(TEST_MODEL_ID.to_string()),
EmbeddingVersion("v1".to_string()),
TEST_DIMENSIONS,
));
Self {
temp,
config,
sqlite,
vector_store,
embedder,
}
}
/// Build a `LexicalRetriever` over the shared SQLite store.
pub fn lexical_retriever(&self) -> LexicalRetriever {
LexicalRetriever::new(
Arc::clone(&self.sqlite),
IndexVersion(TEST_LEX_INDEX_VERSION.to_string()),
)
}
/// Build a `VectorRetriever` over the shared LanceVectorStore +
/// MockEmbedder + SQLite store.
pub fn vector_retriever(&self) -> VectorRetriever {
let store: Arc<dyn VectorStore + Send + Sync> =
Arc::clone(&self.vector_store) as Arc<dyn VectorStore + Send + Sync>;
let embed: Arc<dyn Embedder> =
Arc::clone(&self.embedder) as Arc<dyn Embedder>;
VectorRetriever::new(
store,
embed,
Arc::clone(&self.sqlite),
IndexVersion(TEST_VEC_INDEX_VERSION.to_string()),
)
}
/// Insert (asset, document, document_tags, chunk) rows directly.
/// We seed without going through `DocumentStore::put_document`
/// to keep this crate's test deps inside the Allowed list (no
/// `kb-parse-md` / `kb-normalize` / `kb-chunk`). The `chunks` row
/// also fires the V002 FTS5 triggers, so the lexical retriever
/// can find the row by `MATCH` without a manual rebuild.
pub fn seed_chunk(
&self,
chunk_id: &str,
doc_id: &str,
workspace_path: &str,
text: &str,
heading_path: &[&str],
tags: &[&str],
) {
let asset_id = format!("a{}", &doc_id[..31]);
let conn = self.sqlite.read_conn();
conn.execute(
"INSERT OR IGNORE INTO assets (
asset_id, source_uri, workspace_path, media_type, byte_len,
checksum, storage_kind, storage_path, discovered_at
) VALUES (?, ?, ?, '\"markdown\"', 0,
'deadbeefdeadbeefdeadbeefdeadbeef',
'reference', ?, '1970-01-01T00:00:00Z')",
params![
asset_id,
format!("file://{workspace_path}"),
workspace_path,
workspace_path,
],
)
.unwrap();
conn.execute(
"INSERT OR IGNORE INTO documents (
doc_id, asset_id, workspace_path, title, lang, source_type,
trust_level, parser_version, doc_version, schema_version,
metadata_json, provenance_json, created_at, updated_at
) VALUES (?, ?, ?, NULL, 'en', 'markdown', 'primary', 'v1', 1, 1,
'{}', '{}', '1970-01-01T00:00:00Z', '1970-01-01T00:00:00Z')",
params![doc_id, asset_id, workspace_path],
)
.unwrap();
for t in tags {
conn.execute(
"INSERT OR IGNORE INTO document_tags (doc_id, tag) VALUES (?, ?)",
params![doc_id, t],
)
.unwrap();
}
let heading_json = serde_json::to_string(heading_path).unwrap();
conn.execute(
"INSERT OR IGNORE INTO chunks (
chunk_id, doc_id, text, heading_path_json, section_label,
source_spans_json, token_estimate, chunker_version,
policy_hash, block_ids_json, created_at
) VALUES (?, ?, ?, ?, NULL,
'[{\"kind\":\"line\",\"start\":1,\"end\":3}]',
1, 'v1', 'h', '[]', '1970-01-01T00:00:00Z')",
params![chunk_id, doc_id, text, heading_json],
)
.unwrap();
}
/// Embed `text` as a Document and upsert it as the embedding for
/// `chunk_id`. Drives the same code path production uses:
/// MockEmbedder → VectorRecord → LanceVectorStore::upsert →
/// embedding_records committed.
pub fn embed_and_upsert(
&self,
chunk_id: &str,
doc_id: &str,
text: &str,
heading_path: &[&str],
) {
let inputs = [EmbeddingInput {
text,
kind: EmbeddingKind::Document,
}];
let mut vecs = self.embedder.embed(&inputs).unwrap();
let vector = vecs.remove(0);
let record = VectorRecord {
chunk_id: ChunkId(chunk_id.to_string()),
embedding_id: EmbeddingId(format!("e{}", &chunk_id[..31])),
vector,
doc_id: DocumentId(doc_id.to_string()),
text: text.to_string(),
heading_path: heading_path.iter().map(|s| s.to_string()).collect(),
model_id: EmbeddingModelId(TEST_MODEL_ID.to_string()),
model_version: EmbeddingVersion("v1".to_string()),
dimensions: TEST_DIMENSIONS,
};
self.vector_store.upsert(&[record]).unwrap();
}
}
/// Pad a short prefix to the 32-hex shape `kb_core` newtypes expect.
pub fn id32(prefix: &str) -> String {
let mut s = prefix.to_string();
while s.len() < 32 {
s.push('0');
}
s.truncate(32);
s
}

View File

@@ -0,0 +1,42 @@
[
{
"chunk_id": "c1000000000000000000000000000000",
"fusion_score_positive": true,
"lex_some": true,
"lexical_rank": 1,
"method": "hybrid",
"rank": 1,
"vec_some": true,
"vector_rank": 3
},
{
"chunk_id": "c2000000000000000000000000000000",
"fusion_score_positive": true,
"lex_some": true,
"lexical_rank": 2,
"method": "hybrid",
"rank": 2,
"vec_some": true,
"vector_rank": 2
},
{
"chunk_id": "c4000000000000000000000000000000",
"fusion_score_positive": true,
"lex_some": false,
"lexical_rank": null,
"method": "hybrid",
"rank": 3,
"vec_some": true,
"vector_rank": 1
},
{
"chunk_id": "c3000000000000000000000000000000",
"fusion_score_positive": true,
"lex_some": false,
"lexical_rank": null,
"method": "hybrid",
"rank": 4,
"vec_some": true,
"vector_rank": 4
}
]

View File

@@ -0,0 +1,213 @@
//! Hybrid integration tests — touch a real `LanceVectorStore` +
//! `SqliteStore` + `MockEmbedder`. These tests are `#[ignore]`-d and
//! AVX-gated; see `tests/common/mod.rs` for the policy rationale.
//!
//! Mock-retriever unit tests live alongside the implementation in
//! `crates/kb-search/src/hybrid.rs` (no Lance, no AVX needed) — the
//! tests here exercise the full plumbing with the real Lance store.
mod common;
use std::path::PathBuf;
use std::sync::Arc;
use common::{
HybridEnv, id32, require_avx_or_panic, TEST_LEX_INDEX_VERSION, TEST_VEC_INDEX_VERSION,
};
use kb_core::{
Retriever, SearchFilters, SearchHit, SearchMode, SearchQuery,
};
use kb_search::{FusionPolicy, HybridRetriever};
use serde_json::json;
fn build_hybrid(env: &HybridEnv) -> HybridRetriever {
let lex: Arc<dyn Retriever> = Arc::new(env.lexical_retriever());
let vec: Arc<dyn Retriever> = Arc::new(env.vector_retriever());
HybridRetriever::with_policy(lex, vec, FusionPolicy::Rrf { k_rrf: 60 }, 5)
}
/// Seed a tiny corpus that lets us prove hybrid recall ≥ each side
/// independently. Two chunks are lexical-only matches ("rust cargo");
/// two chunks are vector-only matches (their text doesn't contain
/// the query token but their embedding still scores nearby because
/// MockEmbedder's hash distributes over all chunks).
fn seed_disjoint_corpus(env: &HybridEnv) -> Vec<String> {
// The lexical side will only match chunks that contain the query
// tokens. The vector side will rank ALL chunks by embedding
// similarity to the query — even ones whose text doesn't share
// a token with the query.
let chunks = [
// (chunk_id, doc_id, path, text, headings)
(id32("c1"), id32("d1"), "notes/rust1.md", "rust cargo macros", &["A"][..]),
(id32("c2"), id32("d2"), "notes/rust2.md", "rust traits and lifetimes", &["B"][..]),
(id32("c3"), id32("d3"), "notes/python.md", "python dataclasses tutorial", &["C"][..]),
(id32("c4"), id32("d4"), "notes/go.md", "go interfaces and channels", &["D"][..]),
];
let mut ids = Vec::new();
for (cid, did, path, text, headings) in &chunks {
env.seed_chunk(cid, did, path, text, headings, &[]);
env.embed_and_upsert(cid, did, text, headings);
ids.push(cid.clone());
}
ids
}
#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn hybrid_recall_disjoint_returns_union() {
require_avx_or_panic();
let env = HybridEnv::new();
let _ids = seed_disjoint_corpus(&env);
let h = build_hybrid(&env);
let q = SearchQuery {
text: "rust".to_string(),
mode: SearchMode::Hybrid,
k: 4,
filters: SearchFilters::default(),
};
let hits = h.search(&q).unwrap();
// The vector side will return up to 4 candidates regardless of
// text overlap; the lexical side will return only the rust* ones.
// Together the union must cover at least the lexical hits AND
// include at least one non-lexical chunk if vector found one.
assert!(!hits.is_empty(), "hybrid must return at least one hit");
// Every hit's RetrievalDetail.method must be Hybrid.
for h in &hits {
assert_eq!(h.retrieval.method, SearchMode::Hybrid);
// At least one of lex/vec_score must be Some.
assert!(
h.retrieval.lexical_score.is_some() || h.retrieval.vector_score.is_some(),
"hybrid hit must carry at least one mode's score"
);
}
// index_version composite token.
let iv = h.index_version();
assert!(iv.0.starts_with("hybrid:"));
assert!(iv.0.contains(TEST_LEX_INDEX_VERSION));
assert!(iv.0.contains(TEST_VEC_INDEX_VERSION));
// Lexical-only chunks (c1, c2) MUST appear: they're the only ones
// matching the FTS5 query, and the vector side over-fetches enough
// to include them too.
let ids: Vec<&str> = hits.iter().map(|h| h.chunk_id.0.as_str()).collect();
assert!(ids.contains(&id32("c1").as_str()));
assert!(ids.contains(&id32("c2").as_str()));
}
#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn hybrid_determinism_same_query_twice() {
require_avx_or_panic();
let env = HybridEnv::new();
let _ = seed_disjoint_corpus(&env);
let h = build_hybrid(&env);
let q = SearchQuery {
text: "rust".to_string(),
mode: SearchMode::Hybrid,
k: 4,
filters: SearchFilters::default(),
};
let a = h.search(&q).unwrap();
let b = h.search(&q).unwrap();
assert_eq!(a, b, "identical query must yield byte-identical Vec<SearchHit>");
}
#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn hybrid_snapshot_run_1() {
require_avx_or_panic();
let env = HybridEnv::new();
let _ = seed_disjoint_corpus(&env);
let h = build_hybrid(&env);
let q = SearchQuery {
text: "rust".to_string(),
mode: SearchMode::Hybrid,
k: 4,
filters: SearchFilters::default(),
};
let hits = h.search(&q).unwrap();
// Snapshot pins the structural shape:
// - chunk_id ordering
// - which side contributed (lexical_rank / vector_rank
// populated as Some/None)
// - that fusion_score is non-increasing
// - method = Hybrid for every hit
let actual = json!(
hits.iter().map(|h: &SearchHit| json!({
"chunk_id": h.chunk_id.0,
"rank": h.rank,
"method": h.retrieval.method,
"lexical_rank": h.retrieval.lexical_rank,
"vector_rank": h.retrieval.vector_rank,
"lex_some": h.retrieval.lexical_score.is_some(),
"vec_some": h.retrieval.vector_score.is_some(),
"fusion_score_positive": h.retrieval.fusion_score > 0.0,
})).collect::<Vec<_>>()
);
let fixture = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
.join("search")
.join("hybrid")
.join("run-1.json");
if std::env::var_os("KB_UPDATE_SNAPSHOTS").is_some() {
std::fs::create_dir_all(fixture.parent().unwrap()).unwrap();
std::fs::write(&fixture, serde_json::to_string_pretty(&actual).unwrap()).unwrap();
eprintln!("[snapshot] regenerated {}", fixture.display());
// Fail loudly so that accidentally setting KB_UPDATE_SNAPSHOTS
// in CI surfaces as a test failure rather than a silent
// overwrite + green run. Same fail-loud-instead-of-silent-pass
// philosophy as P3-2's `SNAPSHOT_HASH_BASELINE = 0` and P3-3's
// placeholder fixture guards.
panic!(
"[snapshot] regenerated {}, re-run without KB_UPDATE_SNAPSHOTS to verify pin",
fixture.display()
);
}
let expected: serde_json::Value =
serde_json::from_str(&std::fs::read_to_string(&fixture).unwrap_or_else(|_| {
panic!(
"missing snapshot fixture at {}; run with \
KB_UPDATE_SNAPSHOTS=1 to create",
fixture.display()
)
}))
.unwrap();
// Refuse to silently "pass" against the committed placeholder. The
// placeholder JSON carries a `_comment` field with regeneration
// instructions; production fixtures (a captured list) do not.
if expected.get("_comment").is_some() {
panic!(
"snapshot fixture is a placeholder — regenerate on AVX hardware then commit. \
Path: {}. To regenerate: \
`KB_UPDATE_SNAPSHOTS=1 cargo test -p kb-search -- --ignored hybrid_snapshot`.",
fixture.display()
);
}
assert_eq!(
actual, expected,
"hybrid snapshot drift; rerun with KB_UPDATE_SNAPSHOTS=1 to regenerate"
);
// Independent guard: fusion scores must be non-increasing across
// the result list (rrf is rank-biased, so this is the
// semantically-correct ordering invariant).
for w in hits.windows(2) {
assert!(
w[0].retrieval.fusion_score >= w[1].retrieval.fusion_score,
"fusion scores not in descending order: {} then {}",
w[0].retrieval.fusion_score,
w[1].retrieval.fusion_score
);
}
}