Composes the existing LexicalRetriever (P2-2) with a new VectorRetriever
wrapper around LanceVectorStore (P3-3) into a single Retriever that
dispatches by SearchMode. For SearchMode::Hybrid, fuses lexical and
vector candidates via Reciprocal Rank Fusion and populates the full
RetrievalDetail per SearchHit so kb search --explain can attribute
scores back to each side.

Public surface (kb-search crate):
- pub struct VectorRetriever: takes Arc<dyn VectorStore + Send + Sync>,
  Arc<dyn Embedder>, Arc<SqliteStore>, and IndexVersion at construction.
- pub struct HybridRetriever { lexical, vector, fusion, k }.
- pub enum FusionPolicy { Rrf { k_rrf: u32 } }.

VectorRetriever:
- Embeds query.text as EmbeddingKind::Query before delegating to
  VectorStore::search(query_vec, query.k * 2, &query.filters).
  Over-fetches by ×2 to cover filter losses; LanceVectorStore applies
  the filters internally so they propagate naturally.
- Hydrates each VectorHit into a full SearchHit by joining on
  chunk_id in a single IN-clause batch (no N+1): doc_path,
  section_label, chunker_version, source_spans for citation, plus
  embedding_model from embedder.model_id().
- Snippet is the chunk text prefix, trimmed to
  config.search.snippet_chars (vector mode lacks FTS5 highlighting;
  the prefix is the next-best signal).
- Citation is built from the chunk's first source span via the shared
  citation_helper module, extracted from lexical.rs so both
  retrievers compute citations identically (Byte/empty fallback to
  Line{1,1} preserved with tracing::warn).
- RetrievalDetail.method = Vector for standalone calls; both
  fusion_score and vector_score are set to the LanceVectorStore-shifted
  cosine score; lexical_* are None.
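
A minimal sketch of the snippet-prefix trimming described above. It assumes the limit counts Unicode scalar values rather than bytes (the commit does not say which), and `snippet_prefix` is a hypothetical name, not the crate's function.

```rust
/// Return at most `max_chars` characters from the start of `text`.
/// Cuts on a char boundary, so multi-byte UTF-8 text never panics.
/// Hypothetical helper; assumes the configured limit is in chars.
pub fn snippet_prefix(text: &str, max_chars: usize) -> &str {
    match text.char_indices().nth(max_chars) {
        // `byte_idx` is the byte offset of the first char past the limit.
        Some((byte_idx, _)) => &text[..byte_idx],
        // Fewer than `max_chars + 1` chars: the whole text fits.
        None => text,
    }
}
```

Slicing by byte offset instead (`&text[..n]`) would panic whenever `n` lands inside a multi-byte character, which is why the sketch walks `char_indices`.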

HybridRetriever:
- Lexical / Vector modes delegate 1:1, with no rebuild of RetrievalDetail.
- Hybrid mode runs both retrievers with k * 2 fanout, fuses with
  RRF (score(c) = Σ_m 1/(k_rrf + rank_m(c)), summed over the
  retrievers m that returned c), sorts by fused score DESC with a
  deterministic tiebreaker (lex_rank ASC then chunk_id ASC), and
  takes the top query.k. Fusion math runs in f64 throughout; it is cast
  to f32 only at the SearchHit boundary, where the bounded magnitude
  (≤ ~0.033 at k_rrf=60) makes f32 precision sufficient for ranking.
- Per hit, lexical data is preferred for snippet/citation/heading_path/
  chunker_version/embedding_model when the chunk appears in both
  retrievers, because FTS5 highlighting is more user-relevant than
  vector's truncated text. Vector-only chunks fall through to vector
  hit data.
- index_version returns the composite
  format!("hybrid:{}+{}", lex_iv, vec_iv) built at construction;
  mismatched lex/vec versions trigger a tracing::warn so users notice
  stale indexes (spec line 143).
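
The fusion step above can be sketched as follows. This is an illustration under stated assumptions, not the code in hybrid.rs: `Fused` and `rrf_fuse` are hypothetical names, ranks are taken as 1-based (consistent with the 1/61 + 1/62 test case at k_rrf=60), and a chunk with no lexical rank is assumed to sort after any chunk that has one when fused scores tie.

```rust
use std::collections::HashMap;

/// One fused candidate. Hypothetical stand-in, not the crate's RetrievalDetail.
#[derive(Debug, Clone, PartialEq)]
pub struct Fused {
    pub chunk_id: String,
    pub fused_score: f64,
    pub lex_rank: Option<usize>,
    pub vec_rank: Option<usize>,
}

fn blank(id: &str) -> Fused {
    Fused { chunk_id: id.to_string(), fused_score: 0.0, lex_rank: None, vec_rank: None }
}

/// Fuse two best-first lists of chunk ids with Reciprocal Rank Fusion.
/// A chunk missing from one list simply gets no term from that list.
pub fn rrf_fuse(lexical: &[&str], vector: &[&str], k_rrf: u32, k: usize) -> Vec<Fused> {
    let mut by_id: HashMap<&str, Fused> = HashMap::new();
    // All arithmetic stays in f64; callers may narrow to f32 afterwards.
    let term = |rank: usize| 1.0 / (f64::from(k_rrf) + rank as f64);

    for (i, id) in lexical.iter().copied().enumerate() {
        let entry = by_id.entry(id).or_insert_with(|| blank(id));
        if entry.lex_rank.is_none() {
            entry.lex_rank = Some(i + 1);
            entry.fused_score += term(i + 1);
        }
    }
    for (i, id) in vector.iter().copied().enumerate() {
        let entry = by_id.entry(id).or_insert_with(|| blank(id));
        if entry.vec_rank.is_none() {
            entry.vec_rank = Some(i + 1);
            entry.fused_score += term(i + 1);
        }
    }

    let mut fused: Vec<Fused> = by_id.into_values().collect();
    // Fused score DESC, then lex_rank ASC (assumption: missing rank last),
    // then chunk_id ASC so the order never depends on HashMap iteration.
    fused.sort_by(|a, b| {
        b.fused_score
            .total_cmp(&a.fused_score)
            .then_with(|| {
                a.lex_rank.unwrap_or(usize::MAX).cmp(&b.lex_rank.unwrap_or(usize::MAX))
            })
            .then_with(|| a.chunk_id.cmp(&b.chunk_id))
    });
    fused.truncate(k);
    fused
}
```

With k_rrf=60 the largest possible fused score is 2/61 ≈ 0.0328 (rank 1 on both sides), which is the bound the commit cites for the f32 cast.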

kb-search additions:
- citation_helper.rs: pub(crate) citation_from_first_span shared
  between lexical and vector retrievers. Extracted from lexical.rs;
  no behavior drift.

Tests (38 default + 3 ignored):
- 12 unit tests in hybrid.rs covering RRF math (1/61 + 1/62 within
  f32 epsilon × 10 tolerance), lexical/vector mode delegation, hybrid
  preserving single-side hits with the missing side's RetrievalDetail
  fields None, deterministic tiebreaker on identical fused scores,
  composite index_version, mismatched-version warn at construction.
- 2 unit tests in vector.rs covering the snippet-prefix and citation
  fallback paths.
- 11 unit tests in lexical.rs (unchanged from P2-2).
- 13 lexical integration tests (unchanged).
- 3 #[ignore] AVX-gated hybrid integration tests: disjoint-corpus
  recall (lex returns A,B; vec returns C,D; hybrid returns all 4),
  determinism over two queries, snapshot stability against
  tests/fixtures/search/hybrid/run-1.json. Snapshot fixture was
  regenerated against this branch on an AVX-enabled VM and contains
  4 real chunks (c1/c2 lex+vec, c3/c4 vec-only).
- KB_UPDATE_SNAPSHOTS=1 path now panics after writing instead of
  silently passing, matching the P3-2/P3-3
  fail-loud-instead-of-silent-pass philosophy.
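
A rough sketch of the fail-loud snapshot pattern from the last bullet. `check_snapshot` and `update_requested` are hypothetical names; the update flag is passed in as a parameter here so the sketch does not mutate process environment, whereas the real tests presumably read KB_UPDATE_SNAPSHOTS directly.

```rust
use std::fs;
use std::path::Path;

/// True when the caller asked for fixtures to be regenerated.
pub fn update_requested() -> bool {
    matches!(std::env::var("KB_UPDATE_SNAPSHOTS").as_deref(), Ok("1"))
}

/// Compare `actual` against the fixture at `path`.
/// In update mode the fixture is rewritten and the call then panics,
/// so a regeneration run can never be mistaken for a passing test.
pub fn check_snapshot(path: &Path, actual: &str, update: bool) {
    if update {
        fs::write(path, actual).expect("write snapshot fixture");
        panic!(
            "snapshot regenerated at {}; rerun without KB_UPDATE_SNAPSHOTS",
            path.display()
        );
    }
    let expected = fs::read_to_string(path).expect("read snapshot fixture");
    assert_eq!(expected, actual, "snapshot drifted: {}", path.display());
}
```

A test body would call `check_snapshot(&fixture_path, &rendered, update_requested())`; the panic after writing forces a second, clean run before the test can go green.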

Allowed deps respected (kb-core, kb-config, kb-store-sqlite,
kb-store-vector, kb-embed, tracing, thiserror) plus pre-existing
kb-search deps from P2-2 (rusqlite, globset, serde_json, anyhow).
kb-embed-local does NOT appear: VectorRetriever takes an
Arc<dyn Embedder> trait object; the concrete adapter is
runtime-injected by kb-app.

Out of scope: reranker (P+), score calibration across modes (RRF is
rank-comparable so absolute calibration is P+), multimodal retrieval
(P6+).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
217 lines
7.9 KiB
Rust
//! Shared scaffolding for kb-search hybrid integration tests.
//!
//! # Test policy
//!
//! Integration tests in `hybrid.rs` that touch `LanceVectorStore`
//! are marked `#[ignore]` AND call [`require_avx_or_panic`] inside
//! the test body so a `--ignored` invocation on a non-AVX host
//! fails loudly with a clear message rather than crashing later
//! inside Lance's f32 SIMD kernel with `SIGILL`.
//!
//! See `crates/kb-store-vector/tests/common/mod.rs` for the
//! original P3-3 rationale; this is a copy because that crate's
//! test commons are test-only and not part of its public surface.

#![allow(dead_code)]

use std::sync::Arc;

use kb_config::Config;
use kb_core::{
    ChunkId, DocumentId, EmbeddingId, EmbeddingInput, EmbeddingKind,
    EmbeddingModelId, EmbeddingVersion, IndexVersion, VectorRecord, VectorStore,
};
use kb_embed::{Embedder, MockEmbedder};
use kb_search::{LexicalRetriever, VectorRetriever};
use kb_store_sqlite::SqliteStore;
use kb_store_vector::LanceVectorStore;
use rusqlite::params;
use tempfile::TempDir;

/// Panic if the host CPU lacks AVX. Called from every `#[ignore]`-d
/// integration test body so that `cargo test -- --ignored` on a
/// non-AVX host fails loudly with a clear message instead of crashing
/// later inside a Lance SIMD kernel with `SIGILL`.
pub fn require_avx_or_panic() {
    #[cfg(target_arch = "x86_64")]
    {
        if !std::is_x86_feature_detected!("avx") {
            panic!(
                "kb-search hybrid integration test requires AVX-capable hardware; \
                 host CPU lacks AVX. Run on an AVX-capable machine."
            );
        }
    }
}

/// Index version label used by hybrid integration tests so the
/// `index_version()` composite token is predictable in snapshots.
pub const TEST_LEX_INDEX_VERSION: &str = "v1.0-lex";
pub const TEST_VEC_INDEX_VERSION: &str = "v1.0-vec";

/// Embedding dimensions for tests. Kept small so MockEmbedder runs
/// fast and the Lance table stays compact on disk; production uses
/// 384 (multilingual-e5-small) but the retriever code is dim-agnostic.
pub const TEST_DIMENSIONS: usize = 16;
pub const TEST_MODEL_ID: &str = "mock-e5";

pub struct HybridEnv {
    pub temp: TempDir,
    pub config: Config,
    pub sqlite: Arc<SqliteStore>,
    pub vector_store: Arc<LanceVectorStore>,
    pub embedder: Arc<MockEmbedder>,
}

impl HybridEnv {
    pub fn new() -> Self {
        let temp = tempfile::tempdir().expect("tempdir");
        let mut config = Config::defaults();
        config.storage.data_dir = temp.path().to_string_lossy().into_owned();
        let sqlite = SqliteStore::open(&config).unwrap();
        sqlite.run_migrations().unwrap();
        let sqlite = Arc::new(sqlite);
        let vector_store =
            Arc::new(LanceVectorStore::new(&config, sqlite.clone()).unwrap());
        let embedder = Arc::new(MockEmbedder::new(
            EmbeddingModelId(TEST_MODEL_ID.to_string()),
            EmbeddingVersion("v1".to_string()),
            TEST_DIMENSIONS,
        ));
        Self {
            temp,
            config,
            sqlite,
            vector_store,
            embedder,
        }
    }

    /// Build a `LexicalRetriever` over the shared SQLite store.
    pub fn lexical_retriever(&self) -> LexicalRetriever {
        LexicalRetriever::new(
            Arc::clone(&self.sqlite),
            IndexVersion(TEST_LEX_INDEX_VERSION.to_string()),
        )
    }

    /// Build a `VectorRetriever` over the shared LanceVectorStore +
    /// MockEmbedder + SQLite store.
    pub fn vector_retriever(&self) -> VectorRetriever {
        let store: Arc<dyn VectorStore + Send + Sync> =
            Arc::clone(&self.vector_store) as Arc<dyn VectorStore + Send + Sync>;
        let embed: Arc<dyn Embedder> =
            Arc::clone(&self.embedder) as Arc<dyn Embedder>;
        VectorRetriever::new(
            store,
            embed,
            Arc::clone(&self.sqlite),
            IndexVersion(TEST_VEC_INDEX_VERSION.to_string()),
        )
    }

    /// Insert (asset, document, document_tags, chunk) rows directly.
    /// We seed without going through `DocumentStore::put_document`
    /// to keep this crate's test deps inside the Allowed list (no
    /// `kb-parse-md` / `kb-normalize` / `kb-chunk`). The `chunks` row
    /// also fires the V002 FTS5 triggers, so the lexical retriever
    /// can find the row by `MATCH` without a manual rebuild.
    pub fn seed_chunk(
        &self,
        chunk_id: &str,
        doc_id: &str,
        workspace_path: &str,
        text: &str,
        heading_path: &[&str],
        tags: &[&str],
    ) {
        let asset_id = format!("a{}", &doc_id[..31]);
        let conn = self.sqlite.read_conn();
        conn.execute(
            "INSERT OR IGNORE INTO assets (
                asset_id, source_uri, workspace_path, media_type, byte_len,
                checksum, storage_kind, storage_path, discovered_at
             ) VALUES (?, ?, ?, '\"markdown\"', 0,
                       'deadbeefdeadbeefdeadbeefdeadbeef',
                       'reference', ?, '1970-01-01T00:00:00Z')",
            params![
                asset_id,
                format!("file://{workspace_path}"),
                workspace_path,
                workspace_path,
            ],
        )
        .unwrap();
        conn.execute(
            "INSERT OR IGNORE INTO documents (
                doc_id, asset_id, workspace_path, title, lang, source_type,
                trust_level, parser_version, doc_version, schema_version,
                metadata_json, provenance_json, created_at, updated_at
             ) VALUES (?, ?, ?, NULL, 'en', 'markdown', 'primary', 'v1', 1, 1,
                       '{}', '{}', '1970-01-01T00:00:00Z', '1970-01-01T00:00:00Z')",
            params![doc_id, asset_id, workspace_path],
        )
        .unwrap();
        for t in tags {
            conn.execute(
                "INSERT OR IGNORE INTO document_tags (doc_id, tag) VALUES (?, ?)",
                params![doc_id, t],
            )
            .unwrap();
        }
        let heading_json = serde_json::to_string(heading_path).unwrap();
        conn.execute(
            "INSERT OR IGNORE INTO chunks (
                chunk_id, doc_id, text, heading_path_json, section_label,
                source_spans_json, token_estimate, chunker_version,
                policy_hash, block_ids_json, created_at
             ) VALUES (?, ?, ?, ?, NULL,
                       '[{\"kind\":\"line\",\"start\":1,\"end\":3}]',
                       1, 'v1', 'h', '[]', '1970-01-01T00:00:00Z')",
            params![chunk_id, doc_id, text, heading_json],
        )
        .unwrap();
    }

    /// Embed `text` as a Document and upsert it as the embedding for
    /// `chunk_id`. Drives the same code path production uses:
    /// MockEmbedder → VectorRecord → LanceVectorStore::upsert →
    /// embedding_records committed.
    pub fn embed_and_upsert(
        &self,
        chunk_id: &str,
        doc_id: &str,
        text: &str,
        heading_path: &[&str],
    ) {
        let inputs = [EmbeddingInput {
            text,
            kind: EmbeddingKind::Document,
        }];
        let mut vecs = self.embedder.embed(&inputs).unwrap();
        let vector = vecs.remove(0);
        let record = VectorRecord {
            chunk_id: ChunkId(chunk_id.to_string()),
            embedding_id: EmbeddingId(format!("e{}", &chunk_id[..31])),
            vector,
            doc_id: DocumentId(doc_id.to_string()),
            text: text.to_string(),
            heading_path: heading_path.iter().map(|s| s.to_string()).collect(),
            model_id: EmbeddingModelId(TEST_MODEL_ID.to_string()),
            model_version: EmbeddingVersion("v1".to_string()),
            dimensions: TEST_DIMENSIONS,
        };
        self.vector_store.upsert(&[record]).unwrap();
    }
}

/// Pad a short prefix to the 32-hex shape `kb_core` newtypes expect.
pub fn id32(prefix: &str) -> String {
    let mut s = prefix.to_string();
    while s.len() < 32 {
        s.push('0');
    }
    s.truncate(32);
    s
}