kebab/crates/kb-store-vector/tests/upsert_search.rs
altair823 3cd5117a7e feat(p3-3): kb-store-vector — LanceDB VectorStore + V003 embedding status
First VectorStore implementation. Per-model Lance tables under
config.storage.vector_dir, three-phase upsert (SQLite-pending → Lance
MergeInsert → SQLite-committed) with crash-safe retry, search via
cosine distance with the spec's score-shift (which preserves the
negative-similarity ranking signal that clamping would crush).
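
A minimal sketch of the score shift, assuming only the formula stated in this message (similarity = 1 - distance, score = (similarity + 1) / 2). The names `shifted_score` and `clamped_score` are hypothetical and are not taken from the crate; the real code also logs through tracing::warn on NaN, which is omitted here.

```rust
/// Map a cosine distance in [0, 2] to a score in [0, 1] by shifting.
/// Hypothetical helper; the crate's real function may look different.
fn shifted_score(distance: f32) -> f32 {
    if distance.is_nan() {
        // The commit coerces NaN to 0 (the warning log is left out of this sketch).
        return 0.0;
    }
    let similarity = 1.0 - distance; // in [-1, 1]
    (similarity + 1.0) / 2.0 // in [0, 1]
}

/// The rejected alternative: clamp negative similarity to 0.
fn clamped_score(distance: f32) -> f32 {
    (1.0 - distance).max(0.0)
}
```

With distances 1.2 and 1.8 (both negative similarity), `shifted_score` still ranks the first above the second, while `clamped_score` maps both to 0 and loses the ordering.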

V003 migration:
- Adds status (CHECK constraint pending|committed|tombstone, default
  pending) and vector_committed columns to embedding_records.
- BEFORE DELETE trigger on chunks flips dependent rows to tombstone.
  Currently overshadowed by V001's ON DELETE CASCADE FK: the trigger's
  UPDATE runs first, then the row vanishes via CASCADE. Spec-faithful
  tombstone preservation requires recreating embedding_records to drop
  the CASCADE; this is deferred to a P+ migration since no production
  rows exist yet (P3-3 is the first writer). A comment in the V003 SQL
  explains.

LanceVectorStore:
- ensure_table is idempotent: opens existing or creates with the
  Arrow schema (chunk_id, doc_id, embedding FixedSizeList<Float32,
  dim>, model_id, embedding_version, text, heading_path, created_at).
- IndexId computed via id_for_index with collection="chunk_embeddings",
  index_kind="flat", params_hash = blake3(descriptor JSON). Schema
  bumps automatically rotate the IndexId.
- upsert: phase-1 INSERT OR REPLACE INTO embedding_records (status=
  'pending') in a single SQLite tx; phase-2 Lance MergeInsert keyed
  on chunk_id (idempotent re-run); phase-3 UPDATE status='committed',
  vector_committed=1. If phase-2 fails the rows stay 'pending' and
  the next upsert call retries idempotently.
- search joins embedding_records WHERE status='committed' so partial-
  write rows never surface. Cosine distance from Lance ∈ [0, 2] →
  similarity = 1 - distance ∈ [-1, 1] → score = (similarity + 1)/2 ∈
  [0, 1]. NaN coerced to 0 with tracing::warn. Filter by SearchFilters
  via SqliteStore::filter_chunks (added in this commit).
- Sync trait + async LanceDB bridged by an embedded current-thread
  tokio runtime. Doc-comment on the struct flags the "do NOT call
  from inside another tokio runtime" panic (block_on cannot nest).
  kb-app's job scheduler is sync today.
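
The upsert phases and the retry behaviour described above can be modelled with a toy in-memory store. This is a sketch under stated assumptions, not the crate's code: `ToyStore`, its fields, and the `fail_phase2` switch are hypothetical, a single id stands in for both embedding_id and chunk_id, and two HashMaps stand in for SQLite and Lance.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Pending,
    Committed,
}

/// `ledger` plays the role of embedding_records, `vectors` the Lance table.
#[derive(Default)]
struct ToyStore {
    ledger: HashMap<String, Status>,
    vectors: HashMap<String, Vec<f32>>,
}

impl ToyStore {
    /// `fail_phase2` simulates a crash or Lance error after phase 1.
    fn upsert(&mut self, recs: &[(String, Vec<f32>)], fail_phase2: bool) -> Result<(), String> {
        // Phase 1: stage every row as pending (INSERT OR REPLACE semantics).
        for (id, _) in recs {
            self.ledger.insert(id.clone(), Status::Pending);
        }
        // Phase 2: keyed merge; a re-run overwrites and never duplicates.
        if fail_phase2 {
            return Err("simulated phase-2 failure".to_string());
        }
        for (id, v) in recs {
            self.vectors.insert(id.clone(), v.clone());
        }
        // Phase 3: promote only the rows that are still pending.
        for (id, _) in recs {
            if let Some(s) = self.ledger.get_mut(id) {
                if *s == Status::Pending {
                    *s = Status::Committed;
                }
            }
        }
        Ok(())
    }

    /// Search-side visibility: only committed ids may surface.
    fn visible(&self) -> Vec<&str> {
        let mut ids: Vec<&str> = self
            .ledger
            .iter()
            .filter(|(_, s)| **s == Status::Committed)
            .map(|(id, _)| id.as_str())
            .collect();
        ids.sort_unstable();
        ids
    }
}
```

A failed call leaves the row pending and invisible; calling `upsert` again with the same records promotes it, and a further repeat leaves exactly one vector per id.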

kb-store-sqlite additions:
- pub fn put_embedding_records_pending(&[EmbeddingRecordRow]) — phase-1
  INSERT OR REPLACE (status='pending', vector_committed=0).
- pub fn mark_embedding_records_committed(&[EmbeddingId]) — phase-3
  single UPDATE … WHERE embedding_id IN (?, ?, …) via
  params_from_iter, guarded by WHERE status='pending' so tombstones
  don't get clobbered.
- pub fn filter_chunks(&[ChunkId], &SearchFilters) → Vec<ChunkId>
  consolidates the JOIN against documents/document_tags/
  embedding_records + path_glob via globset. Lets kb-store-vector
  honor SearchFilters without depending on rusqlite or globset
  directly. (kb-search's filter logic is structurally different —
  interleaved with the FTS5 SELECT — so it stays as-is for now;
  consolidation is a P+ refactor.)
- 4 new unit tests cover the phase-1 round-trip, empty batch,
  replay reset of pending rows, and the WHERE-status-pending guard.
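
The WHERE status='pending' guard on the phase-3 UPDATE can be illustrated without SQLite. The sketch below is an in-memory analogue only: `mark_committed` and the `Status` enum are hypothetical stand-ins, and the return value mimics an UPDATE's rows-affected count.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Pending,
    Committed,
    Tombstone,
}

/// In-memory analogue of:
///   UPDATE ... SET status = 'committed'
///   WHERE embedding_id IN (...) AND status = 'pending'
/// Returns how many rows actually changed.
fn mark_committed(rows: &mut HashMap<String, Status>, ids: &[&str]) -> usize {
    let mut changed = 0;
    for id in ids {
        if let Some(status) = rows.get_mut(*id) {
            // The guard: tombstoned or already committed rows are left alone.
            if *status == Status::Pending {
                *status = Status::Committed;
                changed += 1;
            }
        }
    }
    changed
}
```

Passing the ids of a pending row, a tombstoned row, and an unknown row changes only the pending one; the tombstone keeps its status.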

Tests:
- 9 lib unit tests in kb-store-vector covering paths/sanitization,
  arrow_batch dim validation + descriptor hash, bm25-style cosine
  score shift math.
- 4 new kb-store-sqlite unit tests on filter_chunks (committed-only,
  tags/lang/trust/path_glob, order preservation, empty input).
- 4 new kb-store-sqlite unit tests on the embedding_records helpers.
- 8 integration tests in upsert_search.rs and 1 snapshot test marked
  #[ignore = "requires AVX-capable hardware (LanceDB)"]. They invoke
  require_avx_or_panic() at the top of each body so a missing-AVX
  --ignored run fails loudly instead of silently passing. This dev
  host (qemu64 model) lacks AVX so these were NOT exercised end-to-
  end here — first CI lane on AVX hardware will validate them.
- Snapshot fixture tests/fixtures/vector/run-1.json is a placeholder
  with an _comment marker. Snapshot test panics until the placeholder
  is replaced via KB_UPDATE_SNAPSHOTS=1 on AVX hardware.
- Workspace 241 passed, 19 ignored, 0 failed; cargo clippy --workspace
  --all-targets -- -D warnings clean.
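
For orientation, a guard in the spirit of require_avx_or_panic() could look like the sketch below. The real helper lives in tests/common/mod.rs and is not shown in this commit view, so the function names and the panic message here are assumptions; only the std macro `is_x86_feature_detected!` is a known API.

```rust
/// Report whether the running CPU advertises AVX (x86 / x86_64 only).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn avx_available() -> bool {
    std::is_x86_feature_detected!("avx")
}

/// On other architectures there is no AVX to detect.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn avx_available() -> bool {
    false
}

/// Hypothetical stand-in for the helper: fail loudly instead of
/// letting an --ignored run pass silently on unsupported hardware.
fn require_avx_or_panic_sketch() {
    assert!(
        avx_available(),
        "AVX not detected: this test needs AVX-capable hardware (LanceDB)"
    );
}
```

Calling the guard as the first statement of each test body means a run with --ignored on a host without AVX reports a failure rather than a false pass.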

Allowed deps respected (kb-core, kb-config, kb-store-sqlite, lancedb,
arrow + arrow-array + arrow-schema, serde, serde_json, tracing,
thiserror) plus forced waivers — anyhow (trait return type), tokio
+ futures (LanceDB async-only API), blake3 (params_hash). rusqlite
and globset are NOT direct deps of kb-store-vector — confirmed via
cargo metadata --no-deps. rusqlite stays in [dev-dependencies] for
the test fixture seeder only.

Out of scope: IVF/PQ index tuning (P+), image vectors (P6), kb-app
embed_index orchestration (P3-4 facade).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:01:31 +00:00


//! Integration tests for `LanceVectorStore` covering ensure_table,
//! upsert, search, dimension mismatch, filters, model isolation, and
//! determinism.
//!
//! Every test in this file is `#[ignore]` and requires an AVX-capable
//! x86_64 host. Run with:
//!
//! ```text
//! cargo test -p kb-store-vector -- --ignored
//! ```
//!
//! See `tests/common/mod.rs` for the full rationale.

use kb_core::{EmbeddingModelId, SearchFilters, VectorStore};
use kb_store_sqlite::EmbeddingRecordRow;
use rusqlite::params;
use time::OffsetDateTime;

mod common;
use common::{TestEnv, make_record, require_avx_or_panic};

const MODEL: &str = "test-model";

/// Helper: produce a unit-norm 4-D vector pointing along one of the
/// four coordinate axes. The axes are mutually orthogonal, so cosine
/// similarities stay cleanly distinct and search-ordering tests don't
/// depend on float jitter.
fn dir(idx: u8) -> Vec<f32> {
    match idx {
        0 => vec![1.0, 0.0, 0.0, 0.0],
        1 => vec![0.0, 1.0, 0.0, 0.0],
        2 => vec![0.0, 0.0, 1.0, 0.0],
        _ => vec![0.0, 0.0, 0.0, 1.0],
    }
}

#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn ensure_table_idempotent_returns_same_index_id() {
    require_avx_or_panic();
    let env = TestEnv::new();
    let model = EmbeddingModelId(MODEL.to_string());
    let id1 = env.vector.ensure_table(&model, 4).unwrap();
    let id2 = env.vector.ensure_table(&model, 4).unwrap();
    assert_eq!(id1, id2);
}

#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn search_before_upsert_returns_empty() {
    require_avx_or_panic();
    let env = TestEnv::new();
    let hits = env
        .vector
        .search(&dir(0), 5, &SearchFilters::default())
        .unwrap();
    assert!(hits.is_empty());
}

#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn upsert_ten_then_search_returns_five() {
    require_avx_or_panic();
    let env = TestEnv::new();
    let mut recs = Vec::new();
    for i in 0..10u8 {
        // 4-D vectors clustered near dir(0) for the first half, dir(1)
        // for the rest, with small per-row jitter so they stay
        // distinct in the index.
        let mut v = if i < 5 { dir(0) } else { dir(1) };
        v[3] = (i as f32) * 0.001;
        let rec = make_record(i, i, v, &format!("text-{i}"), &["A"], MODEL);
        env.seed_chunk(
            &rec.chunk_id.0,
            &rec.doc_id.0,
            &format!("notes/{i}.md"),
            "en",
            &[],
            "primary",
        );
        recs.push(rec);
    }
    env.vector.upsert(&recs).unwrap();

    // 1:1 alignment check: every record has a committed embedding row.
    {
        let conn = env.sqlite.read_conn();
        let count: i64 = conn
            .query_row(
                "SELECT COUNT(*) FROM embedding_records WHERE status = 'committed'",
                [],
                |r| r.get(0),
            )
            .unwrap();
        assert_eq!(count, 10);
    }

    let hits = env
        .vector
        .search(&dir(0), 5, &SearchFilters::default())
        .unwrap();
    assert_eq!(hits.len(), 5, "expected 5 hits, got {}", hits.len());

    // Top hits should be from the first half (clustered around dir(0)).
    // make_record lays chunk_idx into the low bits of `0x1100 + i`, so
    // `chunk_idx = u32::from_str_radix(last4, 16) - 0x1100`. The first
    // half (chunk_idx < 5) lives in 0x1100..=0x1104.
    for h in &hits {
        let suffix_hex = &h.chunk_id.0[h.chunk_id.0.len() - 4..];
        let idx = u32::from_str_radix(suffix_hex, 16).unwrap();
        let chunk_idx = idx - 0x1100;
        assert!(
            chunk_idx < 5,
            "top-5 hit unexpectedly came from second cluster: idx={chunk_idx}"
        );
    }
}

#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn dimension_mismatch_errors_and_writes_nothing() {
    require_avx_or_panic();
    let env = TestEnv::new();
    let model = EmbeddingModelId(MODEL.to_string());

    // First populate a 4-D table with one row so it exists on disk.
    let r0 = make_record(0, 0, dir(0), "first", &[], MODEL);
    env.seed_chunk(&r0.chunk_id.0, &r0.doc_id.0, "notes/0.md", "en", &[], "primary");
    env.vector.upsert(&[r0]).unwrap();
    assert_eq!(
        env.vector.ensure_table(&model, 4).unwrap(),
        env.vector.ensure_table(&model, 4).unwrap()
    );

    // Now drive a record-vs-table dimension mismatch through `upsert`.
    // The table-name function bakes dim into the name, so an honest
    // 8-D record would simply target a different table. To hit the
    // existing 4-D table, declare `dimensions = 4` while the embedded
    // vector is actually 8-D. Spec line 94: must error, write nothing
    // extra.
    let mut bad = make_record(1, 1, vec![0.1_f32; 8], "second", &[], MODEL);
    // Pretend this is a 4-D vector for table-name purposes; build_batch
    // then enforces that vector.len() == dim and bails.
    bad.dimensions = 4;
    env.seed_chunk(&bad.chunk_id.0, &bad.doc_id.0, "notes/1.md", "en", &[], "primary");
    let bad_chunk = bad.chunk_id.0.clone();

    let err = env.vector.upsert(&[bad]).unwrap_err();
    let msg = format!("{err:#}");
    assert!(
        msg.to_lowercase().contains("dim")
            || msg.contains("does not match table dim"),
        "unexpected error message: {msg}"
    );

    // The phase-1 row may have landed before phase 2 detected the
    // mismatch, but the on-disk Lance table must NOT contain the bad
    // record. So we assert that no `committed` row corresponds to the
    // chunk_id of the bad record.
    let conn = env.sqlite.read_conn();
    let committed: i64 = conn
        .query_row(
            "SELECT COUNT(*) FROM embedding_records WHERE chunk_id = ? AND status = 'committed'",
            params![bad_chunk],
            |r| r.get(0),
        )
        .unwrap();
    assert_eq!(committed, 0, "bad record reached committed state despite dim mismatch");
}

#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn filter_tags_any_drops_non_matching_docs() {
    require_avx_or_panic();
    let env = TestEnv::new();

    // Two docs: one with tag "ko-style", one without.
    let r_a = make_record(0xaa, 0xaa, dir(0), "alpha", &[], MODEL);
    let r_b = make_record(0xbb, 0xbb, dir(0), "beta", &[], MODEL);
    env.seed_chunk(
        &r_a.chunk_id.0,
        &r_a.doc_id.0,
        "notes/a.md",
        "en",
        &["ko-style"],
        "primary",
    );
    env.seed_chunk(
        &r_b.chunk_id.0,
        &r_b.doc_id.0,
        "notes/b.md",
        "en",
        &["other"],
        "primary",
    );
    let expected_doc_id = r_a.doc_id.0.clone();
    env.vector.upsert(&[r_a, r_b]).unwrap();

    let filters = SearchFilters {
        tags_any: vec!["ko-style".to_string()],
        ..Default::default()
    };
    let hits = env.vector.search(&dir(0), 10, &filters).unwrap();
    assert_eq!(hits.len(), 1, "expected only the tagged doc to match");
    let payload = &hits[0].payload;
    assert_eq!(payload["doc_id"], expected_doc_id);
}

#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn model_isolation_two_models_two_directories() {
    require_avx_or_panic();
    let env = TestEnv::new();
    let r1 = make_record(0xaa, 0xaa, dir(0), "alpha", &[], "model-A");
    env.seed_chunk(
        &r1.chunk_id.0,
        &r1.doc_id.0,
        "notes/a.md",
        "en",
        &[],
        "primary",
    );
    let chunk_id = r1.chunk_id.0.clone();
    env.vector.upsert(&[r1]).unwrap();

    // Same chunk_id, different model: should land in a separate table.
    let mut r2 = make_record(0xaa, 0xaa, dir(0), "alpha", &[], "model-B");
    r2.embedding_id = kb_core::EmbeddingId(
        "ee01ee01ee01ee01ee01ee01ee01ee01".to_string(),
    );
    env.vector.upsert(&[r2]).unwrap();

    // Two on-disk Lance directories, distinguished by table name.
    let lancedb_root = env.data_dir().join("lancedb");
    let entries: Vec<_> = std::fs::read_dir(&lancedb_root)
        .unwrap()
        .filter_map(Result::ok)
        .map(|e| e.file_name().to_string_lossy().into_owned())
        .collect();
    let a_count = entries
        .iter()
        .filter(|e| e.contains("model-A"))
        .count();
    let b_count = entries
        .iter()
        .filter(|e| e.contains("model-B"))
        .count();
    assert!(a_count >= 1, "model-A table missing: {entries:?}");
    assert!(b_count >= 1, "model-B table missing: {entries:?}");

    // Two embedding_records rows for the same chunk_id, one per model.
    let conn = env.sqlite.read_conn();
    let count: i64 = conn
        .query_row(
            "SELECT COUNT(*) FROM embedding_records WHERE chunk_id = ?",
            params![chunk_id],
            |r| r.get(0),
        )
        .unwrap();
    assert_eq!(count, 2);
}

#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn determinism_same_query_same_top_k() {
    require_avx_or_panic();
    let env = TestEnv::new();
    let recs: Vec<_> = (0..6u8)
        .map(|i| {
            let mut v = dir(i % 4);
            v[3] = (i as f32) * 0.001;
            let rec = make_record(i, i, v, &format!("t-{i}"), &[], MODEL);
            env.seed_chunk(
                &rec.chunk_id.0,
                &rec.doc_id.0,
                &format!("notes/{i}.md"),
                "en",
                &[],
                "primary",
            );
            rec
        })
        .collect();
    env.vector.upsert(&recs).unwrap();

    let q = dir(0);
    let h1 = env.vector.search(&q, 4, &SearchFilters::default()).unwrap();
    let h2 = env.vector.search(&q, 4, &SearchFilters::default()).unwrap();
    let ids1: Vec<_> = h1.iter().map(|h| h.chunk_id.0.clone()).collect();
    let ids2: Vec<_> = h2.iter().map(|h| h.chunk_id.0.clone()).collect();
    assert_eq!(ids1, ids2);
}

#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn upsert_retry_promotes_pending_to_committed() {
    // Crash-recovery contract: a phase-1 row that was already
    // committed by a prior batch is left alone by phase-3, but a
    // pending row gets retried and reaches committed once Lance
    // accepts it.
    //
    // Construction of the "crash" state:
    //
    // 1. Stage a row directly via the SQLite phase-1 helper
    //    (`put_embedding_records_pending`). NO Lance write happens
    //    here; this is exactly the on-disk state after a crash
    //    between phase 1 and phase 2. Confirm the row is at
    //    `status='pending'` before doing anything else.
    //
    // 2. Run `LanceVectorStore::upsert` with a `VectorRecord` whose
    //    `embedding_id` matches the pending row. Phase 1's
    //    `INSERT OR REPLACE` is idempotent here (same row payload),
    //    phase 2 actually writes to Lance for the first time, and
    //    phase 3 flips the row to 'committed'.
    //
    // 3. Verify status='committed' and vector_committed=1.
    //
    // This actually exercises the "rows stuck at pending get promoted
    // on next upsert" semantics. The previous version pre-seeded via
    // raw SQL, but the same upsert call then overwrote the seed via
    // INSERT OR REPLACE before phase 2 ran, so the recovery path
    // never executed.
    require_avx_or_panic();
    let env = TestEnv::new();
    let rec = make_record(0xaa, 0xaa, dir(0), "alpha", &[], MODEL);
    let chunk_id = rec.chunk_id.0.clone();
    let doc_id = rec.doc_id.0.clone();
    let embedding_id = rec.embedding_id.0.clone();
    env.seed_chunk(&chunk_id, &doc_id, "notes/a.md", "en", &[], "primary");

    // Phase 1 only: go through the same kb-store-sqlite helper that
    // `LanceVectorStore::upsert` uses internally. No Lance write
    // happens, so this models "crashed between phase 1 and phase 2".
    let pending_row = EmbeddingRecordRow {
        embedding_id: embedding_id.clone(),
        chunk_id: chunk_id.clone(),
        model_id: MODEL.to_string(),
        model_version: "v1".to_string(),
        dimensions: 4,
        lance_table: format!("chunk_embeddings_{MODEL}_4"),
        created_at: OffsetDateTime::UNIX_EPOCH,
    };
    env.sqlite
        .put_embedding_records_pending(std::slice::from_ref(&pending_row))
        .unwrap();

    // Sanity: the row is staged but NOT yet committed and Lance has
    // no record of it.
    {
        let conn = env.sqlite.read_conn();
        let (status, committed): (String, i64) = conn
            .query_row(
                "SELECT status, vector_committed FROM embedding_records WHERE embedding_id = ?",
                params![embedding_id],
                |r| Ok((r.get(0)?, r.get(1)?)),
            )
            .unwrap();
        assert_eq!(status, "pending", "row should be at status=pending after phase-1-only");
        assert_eq!(committed, 0);
    }

    // Now run upsert with the matching record. Phase 1's INSERT OR
    // REPLACE is a no-op equivalent (same row payload), phase 2 lands
    // the Lance row for the first time, phase 3 promotes
    // status='committed'.
    env.vector.upsert(&[rec]).unwrap();

    let conn = env.sqlite.read_conn();
    let (status, committed): (String, i64) = conn
        .query_row(
            "SELECT status, vector_committed FROM embedding_records WHERE embedding_id = ?",
            params![embedding_id],
            |r| Ok((r.get(0)?, r.get(1)?)),
        )
        .unwrap();
    assert_eq!(status, "committed");
    assert_eq!(committed, 1);
}