feat(p3-3): kb-store-vector — LanceDB VectorStore + V003 embedding status

First VectorStore implementation. Per-model Lance tables under
config.storage.vector_dir, two-phase upsert (SQLite-pending → Lance
MergeInsert → SQLite-committed) with crash-safe retry, search via
cosine distance with the spec's score-shift (preserves negative
similarity ranking signal that clamping would crush).
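
The two-phase flow above can be modelled in memory. This is a minimal sketch only: `ToyStore`, `Status`, and the `fail_vector_write` flag are hypothetical stand-ins for the SQLite table, the Lance table, and a crash in phase 2; none of them are names from this commit.

```rust
use std::collections::HashMap;

/// Lifecycle marker mirroring the `status` column (hypothetical stand-in).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Status {
    Pending,
    Committed,
}

/// Toy model: `records` plays the role of SQLite `embedding_records`,
/// `vectors` plays the role of the Lance table.
#[derive(Default)]
pub struct ToyStore {
    pub records: HashMap<String, Status>,
    pub vectors: HashMap<String, Vec<f32>>,
}

impl ToyStore {
    /// Run the three phases; `fail_vector_write` simulates a failure in phase 2.
    pub fn upsert(
        &mut self,
        batch: &[(String, Vec<f32>)],
        fail_vector_write: bool,
    ) -> Result<(), String> {
        // Phase 1: stage every row as pending (replaces any earlier attempt).
        for (id, _) in batch {
            self.records.insert(id.clone(), Status::Pending);
        }
        // Phase 2: keyed write into the vector side. Re-running it with the
        // same keys overwrites the same entries, so a retry is idempotent.
        if fail_vector_write {
            return Err("simulated vector write failure".to_string());
        }
        for (id, v) in batch {
            self.vectors.insert(id.clone(), v.clone());
        }
        // Phase 3: promote only rows that are still pending.
        for (id, _) in batch {
            if let Some(s) = self.records.get_mut(id) {
                if *s == Status::Pending {
                    *s = Status::Committed;
                }
            }
        }
        Ok(())
    }

    /// Search-side visibility: only committed rows surface.
    pub fn visible(&self, id: &str) -> bool {
        self.records.get(id) == Some(&Status::Committed)
    }
}
```

In this model a failed phase 2 leaves the row pending and invisible to search; calling `upsert` again with the same batch completes the write.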

V003 migration:
- Adds status (CHECK constraint pending|committed|tombstone, default
  pending) and vector_committed columns to embedding_records.
- BEFORE DELETE trigger on chunks flips dependent rows to tombstone.
  Currently overshadowed by V001's ON DELETE CASCADE FK: the trigger's
  UPDATE runs first, then the row is removed by the CASCADE. Spec-faithful
  tombstone preservation requires recreating embedding_records to drop
  the CASCADE; that is deferred to a P+ migration since no production
  rows exist yet (P3-3 is the first writer). A comment in the V003 SQL
  explains.
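
As an illustration of the three values the CHECK constraint admits, a caller reading the column back could mirror it as an enum. This is a sketch under assumptions: `EmbeddingStatus` and its methods are hypothetical and do not appear in this commit.

```rust
use std::str::FromStr;

/// Hypothetical Rust mirror of the V003 `status` CHECK constraint.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub enum EmbeddingStatus {
    /// Column default: staged, vector write not yet confirmed.
    #[default]
    Pending,
    Committed,
    Tombstone,
}

impl EmbeddingStatus {
    /// The exact string stored in the column.
    pub fn as_str(self) -> &'static str {
        match self {
            EmbeddingStatus::Pending => "pending",
            EmbeddingStatus::Committed => "committed",
            EmbeddingStatus::Tombstone => "tombstone",
        }
    }

    /// Only committed rows may surface in search results.
    pub fn is_searchable(self) -> bool {
        matches!(self, EmbeddingStatus::Committed)
    }
}

impl FromStr for EmbeddingStatus {
    type Err = String;

    /// Reject anything the CHECK constraint would reject.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "pending" => Ok(EmbeddingStatus::Pending),
            "committed" => Ok(EmbeddingStatus::Committed),
            "tombstone" => Ok(EmbeddingStatus::Tombstone),
            other => Err(format!("unknown embedding status {other:?}")),
        }
    }
}
```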

LanceVectorStore:
- ensure_table is idempotent: opens existing or creates with the
  Arrow schema (chunk_id, doc_id, embedding FixedSizeList<Float32,
  dim>, model_id, embedding_version, text, heading_path, created_at).
- IndexId computed via id_for_index with collection="chunk_embeddings",
  index_kind="flat", params_hash = blake3(descriptor JSON). Schema
  bumps automatically rotate the IndexId.
- upsert: phase-1 INSERT OR REPLACE INTO embedding_records (status=
  'pending') in a single SQLite tx; phase-2 Lance MergeInsert keyed
  on chunk_id (idempotent re-run); phase-3 UPDATE status='committed',
  vector_committed=1. If phase-2 fails the rows stay 'pending' and
  the next upsert call retries idempotently.
- search joins embedding_records WHERE status='committed' so partial-
  write rows never surface. Cosine distance from Lance ∈ [0, 2] →
  similarity = 1 - distance ∈ [-1, 1] → score = (similarity + 1)/2 ∈
  [0, 1]. NaN coerced to 0 with tracing::warn. Filter by SearchFilters
  via SqliteStore::filter_chunks (added in this commit).
- Sync trait + async LanceDB bridged by an embedded current-thread
  tokio runtime. Doc-comment on the struct flags the "do NOT call
  from inside another tokio runtime" panic (block_on cannot nest).
  kb-app's job scheduler is sync today.
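
The score shift described in the search bullet can be sketched as a pure function. The function names are hypothetical, and `eprintln!` stands in for the `tracing::warn` call so the sketch needs only the standard library.

```rust
/// Map a cosine distance in [0, 2] to a score in [0, 1]:
/// similarity = 1 - distance, score = (similarity + 1) / 2.
pub fn score_from_cosine_distance(distance: f32) -> f32 {
    let similarity = 1.0 - distance;
    let score = (similarity + 1.0) / 2.0;
    if score.is_nan() {
        // Assumption: the real code logs through tracing::warn here.
        eprintln!("warn: NaN score coerced to 0");
        0.0
    } else {
        score
    }
}

/// For contrast only: clamping similarity at 0 collapses every
/// negative similarity to the same value and loses their ordering.
pub fn clamped_score(distance: f32) -> f32 {
    (1.0 - distance).max(0.0)
}
```

With the shift, distances 1.2 and 1.8 (both negative similarity) still rank in order, whereas the clamped variant maps both to 0.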

kb-store-sqlite additions:
- pub fn put_embedding_records_pending(&[EmbeddingRecordRow]) — phase-1
  INSERT OR REPLACE (status='pending', vector_committed=0).
- pub fn mark_embedding_records_committed(&[EmbeddingId]) — phase-3
  single UPDATE … WHERE embedding_id IN (?, ?, …) via
  params_from_iter, guarded by WHERE status='pending' so tombstones
  don't get clobbered.
- pub fn filter_chunks(&[ChunkId], &SearchFilters) → Vec<ChunkId>
  consolidates the JOIN against documents/document_tags/
  embedding_records + path_glob via globset. Lets kb-store-vector
  honor SearchFilters without depending on rusqlite or globset
  directly. (kb-search's filter logic is structurally different —
  interleaved with the FTS5 SELECT — so it stays as-is for now;
  consolidation is a P+ refactor.)
- 4 new unit tests cover the phase-1 round-trip, empty batch,
  replay reset of pending rows, and the WHERE-status-pending guard.
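
The phase-3 statement shape can be sketched without a database. This is an illustration, not the committed code: `placeholders` and `mark_committed_sql` are hypothetical helper names, and the sketch builds the placeholder list with `vec!` where the commit uses `std::iter::repeat_n`.

```rust
/// Build "?,?,?" for `n` bound parameters (empty string when n == 0).
pub fn placeholders(n: usize) -> String {
    vec!["?"; n].join(",")
}

/// Build the single batched UPDATE for `n` embedding ids.
/// Returns None for an empty batch, mirroring the early return in the
/// real helper (an empty `IN ()` list is not valid SQL).
pub fn mark_committed_sql(n: usize) -> Option<String> {
    if n == 0 {
        return None;
    }
    Some(format!(
        "UPDATE embedding_records \
         SET status='committed', vector_committed=1 \
         WHERE status='pending' AND embedding_id IN ({})",
        placeholders(n)
    ))
}
```

The `status='pending'` predicate is what keeps tombstoned or already-committed rows untouched; the ids themselves are bound as parameters and never interpolated into the SQL text.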

Tests:
- 9 lib unit tests in kb-store-vector covering paths/sanitization,
  arrow_batch dim validation + descriptor hash, bm25-style cosine
  score shift math.
- 4 new kb-store-sqlite unit tests on filter_chunks (committed-only,
  tags/lang/trust/path_glob, order preservation, empty input).
- 4 new kb-store-sqlite unit tests on the embedding_records helpers.
- 8 integration tests in upsert_search.rs and 1 snapshot test marked
  #[ignore = "requires AVX-capable hardware (LanceDB)"]. They invoke
  require_avx_or_panic() at the top of each body so a missing-AVX
  --ignored run fails loudly instead of silently passing. This dev
  host (qemu64 model) lacks AVX so these were NOT exercised end-to-
  end here — first CI lane on AVX hardware will validate them.
- Snapshot fixture tests/fixtures/vector/run-1.json is a placeholder
  with an _comment marker. Snapshot test panics until the placeholder
  is replaced via KB_UPDATE_SNAPSHOTS=1 on AVX hardware.
- Workspace 241 passed, 19 ignored, 0 failed; cargo clippy --workspace
  --all-targets -- -D warnings clean.

Allowed deps respected (kb-core, kb-config, kb-store-sqlite, lancedb,
arrow + arrow-array + arrow-schema, serde, serde_json, tracing,
thiserror) plus forced waivers — anyhow (trait return type), tokio
+ futures (LanceDB async-only API), blake3 (params_hash). rusqlite
and globset are NOT direct deps of kb-store-vector — confirmed via
cargo metadata --no-deps. rusqlite stays in [dev-dependencies] for
the test fixture seeder only.

Out of scope: IVF/PQ index tuning (P+), image vectors (P6), kb-app
embed_index orchestration (P3-4 facade).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:01:31 +00:00
parent 9beef930b4
commit 3cd5117a7e
16 changed files with 6399 additions and 70 deletions

Cargo.lock (generated, 3953 lines): file diff suppressed because it is too large.


@@ -9,6 +9,7 @@ members = [
"crates/kb-normalize",
"crates/kb-chunk",
"crates/kb-store-sqlite",
"crates/kb-store-vector",
"crates/kb-search",
"crates/kb-embed",
"crates/kb-embed-local",
@@ -43,3 +44,13 @@ proptest = "1"
# downloads). Pinned to the 4.x line per task p3-2 (current 5.x release
# remains untested for this workspace).
fastembed = "4.9"
# LanceDB embedded vector store (P3-3). 0.23.x pulls arrow / arrow-array /
# arrow-schema 56.x transitively (via lance 1.0); the kb-store-vector
# crate matches that major to share the same Arrow types without a
# re-export adapter.
lancedb = { version = "0.23", default-features = false }
arrow = "56"
arrow-array = "56"
arrow-schema = "56"
tokio = { version = "1", features = ["rt", "macros"] }
futures = "0.3"


@@ -14,6 +14,11 @@ kb-config = { path = "../kb-config" }
# Explicitly NOT `bundled-sqlcipher` per task allowed-deps list.
rusqlite = { version = "0.32", features = ["bundled"] }
refinery = { version = "0.8", features = ["rusqlite"] }
# Used by `filter_chunks` for the optional `path_glob` post-filter.
# The SQL prefilter handles tags / lang / trust / committed-status; the
# Rust-side glob keeps the SQL surface small (no LIKE-vs-glob impedance
# mismatch) and matches the pattern kb-search/src/lexical.rs uses.
globset = { workspace = true }
serde_json = { workspace = true }
time = { workspace = true }
blake3 = { workspace = true }


@@ -0,0 +1,317 @@
//! Embedding-records writers used by `kb-store-vector` (P3-3).
//!
//! The `VectorStore` impl in `kb-store-vector` performs a two-phase write:
//! phase 1 stages an `embedding_records` row at `status='pending'` before
//! issuing the Lance write, and phase 3 promotes those same rows to
//! `status='committed'` after the Lance commit lands. We surface those
//! two SQL statements here (rather than expose a generic write
//! connection) so the SQL stays inside the crate that owns the schema —
//! kb-store-vector consumes a typed, narrowly-scoped API and never
//! touches the connection mutex itself.
//!
//! Each helper runs inside a single SQLite transaction: phase 1 issues
//! one `INSERT OR REPLACE` per row, phase 3 issues one `UPDATE … WHERE
//! embedding_id IN (…)` for the whole batch. A failure rolls the batch
//! back, so a batch is applied in full or not at all, never partially.
use anyhow::{Context, Result};
use rusqlite::{params, params_from_iter};
use time::OffsetDateTime;
use time::format_description::well_known::Rfc3339;
use crate::error::StoreError;
use crate::store::SqliteStore;
/// Row payload for [`SqliteStore::put_embedding_records_pending`].
///
/// Mirrors the columns of `embedding_records` minus the lifecycle markers
/// (`status` and `vector_committed`) — those are forced to `'pending'`
/// and `0` by phase 1.
///
/// `created_at` is `OffsetDateTime` rather than a pre-formatted string so
/// the helper owns the RFC3339 formatting (the same formatting choice
/// the asset / document / job writers make).
#[derive(Clone, Debug)]
pub struct EmbeddingRecordRow {
pub embedding_id: String,
pub chunk_id: String,
pub model_id: String,
pub model_version: String,
pub dimensions: usize,
pub lance_table: String,
pub created_at: OffsetDateTime,
}
impl SqliteStore {
/// Phase 1 of the kb-store-vector two-phase write: stage every
/// `embedding_records` row with `status='pending'`,
/// `vector_committed=0`. `INSERT OR REPLACE` (rather than UPSERT) is
/// the right shape here because re-running phase 1 for an
/// already-pending row resets `vector_committed` to 0 and the
/// `created_at` to the new attempt's timestamp — both desired,
/// because a retry should look like a fresh attempt to the GC pass.
///
/// All rows are written in a single transaction; if any row fails
/// the entire batch is rolled back and the caller can retry without
/// worrying about partial pending state.
pub fn put_embedding_records_pending(
&self,
rows: &[EmbeddingRecordRow],
) -> Result<()> {
if rows.is_empty() {
return Ok(());
}
let mut conn = self.lock_conn();
let tx = conn.transaction().map_err(StoreError::from)?;
{
let mut stmt = tx
.prepare(
"INSERT OR REPLACE INTO embedding_records (
embedding_id, chunk_id, model_id, model_version,
dimensions, lance_table, created_at,
status, vector_committed
) VALUES (?, ?, ?, ?, ?, ?, ?, 'pending', 0)",
)
.map_err(StoreError::from)?;
for row in rows {
let created_at = row
.created_at
.format(&Rfc3339)
.context("format embedding_records.created_at")?;
stmt.execute(params![
row.embedding_id,
row.chunk_id,
row.model_id,
row.model_version,
row.dimensions as i64,
row.lance_table,
created_at,
])
.map_err(StoreError::from)?;
}
}
tx.commit().map_err(StoreError::from)?;
Ok(())
}
/// Phase 3 of the kb-store-vector two-phase write: after the Lance
/// MergeInsert commits, flip the listed embedding rows to
/// `status='committed'`, `vector_committed=1`. Rows that aren't
/// currently `pending` (e.g. already committed by a duplicate batch,
/// or tombstoned by a chunks DELETE between phase 1 and phase 3)
/// are deliberately left alone via `WHERE status='pending'` — we
/// never resurrect a tombstone, and we never blindly re-mark a
/// committed row.
///
/// All updates run in a single statement (single SQL `UPDATE …
/// WHERE embedding_id IN (?, ?, …)`) inside one transaction —
/// avoids the per-row `execute()` round-trip the previous
/// implementation paid.
pub fn mark_embedding_records_committed(
&self,
embedding_ids: &[String],
) -> Result<()> {
if embedding_ids.is_empty() {
return Ok(());
}
let mut conn = self.lock_conn();
let tx = conn.transaction().map_err(StoreError::from)?;
{
let placeholders = std::iter::repeat_n("?", embedding_ids.len())
.collect::<Vec<_>>()
.join(",");
let sql = format!(
"UPDATE embedding_records
SET status='committed', vector_committed=1
WHERE status='pending'
AND embedding_id IN ({placeholders})"
);
tx.execute(&sql, params_from_iter(embedding_ids.iter()))
.map_err(StoreError::from)?;
}
tx.commit().map_err(StoreError::from)?;
Ok(())
}
}
#[cfg(test)]
mod tests {
use super::*;
use kb_config::Config;
use tempfile::TempDir;
use time::OffsetDateTime;
/// Minimal config pointing at a tempdir for the SQLite file.
fn config_for(tmp: &TempDir) -> Config {
let mut c = Config::defaults();
c.storage.data_dir = tmp.path().to_string_lossy().into_owned();
c
}
/// Seed a chunks row + the doc / asset rows it FKs to. The minimum
/// needed for embedding_records inserts not to fail the FK to
/// chunks.
fn seed_chunk(store: &SqliteStore, chunk_id: &str) {
let conn = store.lock_conn();
// Asset, document, chunk — all hand-rolled at the SQL layer to
// keep the test self-contained (no kb-parse/kb-chunk dep).
conn.execute(
"INSERT INTO assets (
asset_id, source_uri, workspace_path, media_type, byte_len,
checksum, storage_kind, storage_path, discovered_at
) VALUES (?, ?, ?, ?, ?, ?, 'reference', '/tmp/x', ?)",
params![
"0123456789abcdef0123456789abcdef",
"file:///tmp/x",
"x.md",
"{}",
0_i64,
"deadbeef",
"1970-01-01T00:00:00Z",
],
)
.unwrap();
conn.execute(
"INSERT INTO documents (
doc_id, asset_id, workspace_path, title, lang, source_type,
trust_level, parser_version, doc_version, schema_version,
metadata_json, provenance_json, created_at, updated_at
) VALUES (?, ?, ?, NULL, NULL, 'fs', 'unverified', 'v1', 1, 1, '{}', '{}', ?, ?)",
params![
"fedcba9876543210fedcba9876543210",
"0123456789abcdef0123456789abcdef",
"x.md",
"1970-01-01T00:00:00Z",
"1970-01-01T00:00:00Z",
],
)
.unwrap();
conn.execute(
"INSERT INTO chunks (
chunk_id, doc_id, text, heading_path_json, section_label,
source_spans_json, token_estimate, chunker_version,
policy_hash, block_ids_json, created_at
) VALUES (?, ?, 'hi', '[]', NULL, '[]', 1, 'v1', 'hash', '[]', ?)",
params![chunk_id, "fedcba9876543210fedcba9876543210", "1970-01-01T00:00:00Z"],
)
.unwrap();
}
fn open_store(tmp: &TempDir) -> SqliteStore {
let cfg = config_for(tmp);
let store = SqliteStore::open(&cfg).unwrap();
store.run_migrations().unwrap();
store
}
#[test]
fn pending_then_committed_round_trip() {
let tmp = TempDir::new().unwrap();
let store = open_store(&tmp);
let chunk = "11112222333344445555666677778888";
seed_chunk(&store, chunk);
let row = EmbeddingRecordRow {
embedding_id: "aaaa1111bbbb2222cccc3333dddd4444".to_string(),
chunk_id: chunk.to_string(),
model_id: "test-model".to_string(),
model_version: "v1".to_string(),
dimensions: 4,
lance_table: "chunk_embeddings_test_model_4".to_string(),
created_at: OffsetDateTime::now_utc(),
};
store
.put_embedding_records_pending(std::slice::from_ref(&row))
.unwrap();
// Inspect: the row exists at status='pending'.
{
let conn = store.read_conn();
let (status, committed): (String, i64) = conn
.query_row(
"SELECT status, vector_committed FROM embedding_records WHERE embedding_id = ?",
params![row.embedding_id],
|r| Ok((r.get(0)?, r.get(1)?)),
)
.unwrap();
assert_eq!(status, "pending");
assert_eq!(committed, 0);
}
store
.mark_embedding_records_committed(std::slice::from_ref(&row.embedding_id))
.unwrap();
{
let conn = store.read_conn();
let (status, committed): (String, i64) = conn
.query_row(
"SELECT status, vector_committed FROM embedding_records WHERE embedding_id = ?",
params![row.embedding_id],
|r| Ok((r.get(0)?, r.get(1)?)),
)
.unwrap();
assert_eq!(status, "committed");
assert_eq!(committed, 1);
}
}
#[test]
fn empty_batches_are_noops() {
let tmp = TempDir::new().unwrap();
let store = open_store(&tmp);
store.put_embedding_records_pending(&[]).unwrap();
store.mark_embedding_records_committed(&[]).unwrap();
}
#[test]
fn replay_phase_one_resets_vector_committed() {
// INSERT OR REPLACE: a phase-1 retry on a row that briefly
// reached `committed` (in some adversarial out-of-order replay)
// resets it to `pending`. Confirms the documented semantics.
let tmp = TempDir::new().unwrap();
let store = open_store(&tmp);
let chunk = "11112222333344445555666677778888";
seed_chunk(&store, chunk);
let row = EmbeddingRecordRow {
embedding_id: "aaaa1111bbbb2222cccc3333dddd4444".to_string(),
chunk_id: chunk.to_string(),
model_id: "test-model".to_string(),
model_version: "v1".to_string(),
dimensions: 4,
lance_table: "chunk_embeddings_test_model_4".to_string(),
created_at: OffsetDateTime::now_utc(),
};
store
.put_embedding_records_pending(std::slice::from_ref(&row))
.unwrap();
store
.mark_embedding_records_committed(std::slice::from_ref(&row.embedding_id))
.unwrap();
store
.put_embedding_records_pending(std::slice::from_ref(&row))
.unwrap();
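let conn = store.read_conn();
// Check both lifecycle markers: the test name promises that
// `vector_committed` is reset, not just `status`.
let (status, committed): (String, i64) = conn
.query_row(
"SELECT status, vector_committed FROM embedding_records WHERE embedding_id = ?",
params![row.embedding_id],
|r| Ok((r.get(0)?, r.get(1)?)),
)
.unwrap();
assert_eq!(status, "pending");
assert_eq!(committed, 0);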
}
#[test]
fn mark_committed_skips_non_pending() {
// The phase-3 UPDATE explicitly filters `status='pending'`, so
// calling it on an embedding_id that was never staged (or that
// already became a tombstone) is a no-op rather than an error.
let tmp = TempDir::new().unwrap();
let store = open_store(&tmp);
store
.mark_embedding_records_committed(&["does-not-exist".to_string()])
.unwrap();
}
}


@@ -0,0 +1,452 @@
//! Chunk-level filter helpers shared between retrievers.
//!
//! `kb-store-vector::search` post-filters its Lance candidate set
//! against the SQLite-side metadata (committed-status / lang / tags /
//! trust / path_glob). Rather than open a private SQL surface in
//! `kb-store-vector`, the JOIN logic lives here so:
//!
//! - The schema (and CHECK / FK invariants) stays owned by the crate
//! that ships the migrations.
//! - `kb-store-vector` doesn't need its own `rusqlite` / `globset`
//! direct deps — both are forbidden by the P3-3 spec's allowed-dep
//! list.
//! - Future retrievers (e.g. a hybrid blender) can reuse the same
//! helper without re-deriving the SQL.
//!
//! `kb-search::lexical` already has a similar `tags / lang / trust /
//! path_glob` filter pass for FTS5 results; we deliberately do *not*
//! refactor that one in this PR — its SQL is interleaved with the
//! `bm25 + snippet()` SELECT, so sharing would force an awkward
//! trait split. P3-3 spec line 27 only mandates the move for
//! `kb-store-vector`'s usage.
use std::collections::{HashMap, HashSet};
use anyhow::{Context, Result};
use rusqlite::{params_from_iter, ToSql};
use crate::store::SqliteStore;
impl SqliteStore {
/// Filter `chunk_ids` down to those whose owning document passes
/// `filters` AND whose embedding row is at `status='committed'`.
///
/// The result preserves the input order so the caller can feed it
/// back to a Lance distance-asc result list and `take(k)` directly.
///
/// `filters` semantics mirror `kb_core::SearchFilters`:
///
/// - `tags_any`: doc must own at least one of the listed tags
/// (empty vec ⇒ no tag constraint).
/// - `lang`: exact match against `documents.lang`.
/// - `trust_min`: doc trust ≥ the supplied level (Generated <
/// Secondary < Primary, mirroring `list_documents` and
/// `kb-search::lexical`).
/// - `path_glob`: shell-style glob (`*` does **not** cross `/`)
/// against `documents.workspace_path`. Compiled in Rust via
/// `globset` rather than translated to SQLite GLOB so the
/// semantics match `kb-search::lexical` exactly.
///
/// The `embedding_records.status='committed'` predicate is always
/// applied: tombstoned and pending rows must never surface to
/// search callers (spec §5.6).
pub fn filter_chunks(
&self,
chunk_ids: &[kb_core::ChunkId],
filters: &kb_core::SearchFilters,
) -> Result<Vec<kb_core::ChunkId>> {
if chunk_ids.is_empty() {
return Ok(Vec::new());
}
// Deduplicate the IN-list so a pathological caller passing
// `[c1, c1, c1]` doesn't blow the SQL placeholder count.
let unique_ids: Vec<String> = {
let mut seen = HashSet::new();
chunk_ids
.iter()
.filter_map(|c| {
if seen.insert(c.0.as_str()) {
Some(c.0.clone())
} else {
None
}
})
.collect()
};
let placeholders = std::iter::repeat_n("?", unique_ids.len())
.collect::<Vec<_>>()
.join(",");
let mut sql = format!(
"SELECT er.chunk_id, d.workspace_path
FROM embedding_records er
JOIN chunks c ON c.chunk_id = er.chunk_id
JOIN documents d ON d.doc_id = c.doc_id
WHERE er.status = 'committed'
AND er.chunk_id IN ({placeholders})"
);
let mut bind: Vec<Box<dyn ToSql>> = unique_ids
.iter()
.map(|s| Box::new(s.clone()) as Box<dyn ToSql>)
.collect();
if let Some(lang) = &filters.lang {
sql.push_str(" AND d.lang = ?");
bind.push(Box::new(lang.0.clone()));
}
if let Some(min) = &filters.trust_min {
// Mirror `list_documents` / `kb-search::lexical`: rank
// Generated=1 < Secondary=2 < Primary=3.
sql.push_str(
" AND CASE d.trust_level
WHEN 'primary' THEN 3
WHEN 'secondary' THEN 2
WHEN 'generated' THEN 1
ELSE 0 END >= ?",
);
let rank: i64 = match min {
kb_core::TrustLevel::Primary => 3,
kb_core::TrustLevel::Secondary => 2,
kb_core::TrustLevel::Generated => 1,
};
bind.push(Box::new(rank));
}
if !filters.tags_any.is_empty() {
let tag_ph = std::iter::repeat_n("?", filters.tags_any.len())
.collect::<Vec<_>>()
.join(",");
sql.push_str(&format!(
" AND EXISTS (SELECT 1 FROM document_tags t \
WHERE t.doc_id = d.doc_id AND t.tag IN ({tag_ph}))"
));
for tag in &filters.tags_any {
bind.push(Box::new(tag.clone()));
}
}
// Optional path_glob: applied in Rust on the rows we get back,
// not in SQL — matching `kb-search::lexical`'s post-filter so
// the glob semantics are byte-identical between retrievers.
let path_matcher = match filters.path_glob.as_deref() {
Some(pat) => Some(
globset::GlobBuilder::new(pat)
.literal_separator(true)
.build()
.with_context(|| {
format!("kb-store-sqlite::filter_chunks: invalid path_glob {pat:?}")
})?
.compile_matcher(),
),
None => None,
};
let conn = self.read_conn();
let mut stmt = conn
.prepare(&sql)
.context("kb-store-sqlite::filter_chunks: prepare SQL")?;
let rows = stmt
.query_map(
params_from_iter(bind.iter().map(|b| b.as_ref())),
|row| {
let chunk_id: String = row.get(0)?;
let workspace_path: String = row.get(1)?;
Ok((chunk_id, workspace_path))
},
)
.context("kb-store-sqlite::filter_chunks: execute SQL")?;
let mut allowed: HashMap<String, String> = HashMap::new();
for r in rows {
let (chunk_id, workspace_path) =
r.context("kb-store-sqlite::filter_chunks: read row")?;
allowed.insert(chunk_id, workspace_path);
}
let mut out = Vec::with_capacity(chunk_ids.len());
for cand in chunk_ids {
let workspace_path = match allowed.get(&cand.0) {
Some(p) => p,
None => continue,
};
if let Some(m) = &path_matcher {
if !m.is_match(workspace_path) {
continue;
}
}
out.push(cand.clone());
}
Ok(out)
}
}
#[cfg(test)]
mod tests {
use super::*;
use kb_config::Config;
use kb_core::{ChunkId, Lang, SearchFilters, TrustLevel};
use rusqlite::params;
use tempfile::TempDir;
use time::OffsetDateTime;
use crate::EmbeddingRecordRow;
fn open_store(tmp: &TempDir) -> SqliteStore {
let mut c = Config::defaults();
c.storage.data_dir = tmp.path().to_string_lossy().into_owned();
let store = SqliteStore::open(&c).unwrap();
store.run_migrations().unwrap();
store
}
/// Seed (asset, document, document_tags, chunk) rows + a
/// committed embedding_records row for a single chunk_id. Mirrors
/// the shape `kb-store-vector` builds in production.
fn seed_committed(
store: &SqliteStore,
chunk_id: &str,
doc_id: &str,
workspace_path: &str,
lang: &str,
tags: &[&str],
trust: &str,
) {
let asset_id = format!("a{}", &doc_id[..31]);
{
let conn = store.lock_conn();
conn.execute(
"INSERT INTO assets (
asset_id, source_uri, workspace_path, media_type, byte_len,
checksum, storage_kind, storage_path, discovered_at
) VALUES (?, ?, ?, '{}', 0, 'deadbeefdeadbeefdeadbeefdeadbeef',
'reference', ?, '1970-01-01T00:00:00Z')",
params![
asset_id,
format!("file://{workspace_path}"),
workspace_path,
workspace_path,
],
)
.unwrap();
conn.execute(
"INSERT INTO documents (
doc_id, asset_id, workspace_path, title, lang, source_type,
trust_level, parser_version, doc_version, schema_version,
metadata_json, provenance_json, created_at, updated_at
) VALUES (?, ?, ?, NULL, ?, 'markdown', ?, 'v1', 1, 1,
'{}', '{}', '1970-01-01T00:00:00Z', '1970-01-01T00:00:00Z')",
params![doc_id, asset_id, workspace_path, lang, trust],
)
.unwrap();
for t in tags {
conn.execute(
"INSERT INTO document_tags (doc_id, tag) VALUES (?, ?)",
params![doc_id, t],
)
.unwrap();
}
conn.execute(
"INSERT INTO chunks (
chunk_id, doc_id, text, heading_path_json, section_label,
source_spans_json, token_estimate, chunker_version,
policy_hash, block_ids_json, created_at
) VALUES (?, ?, 'hi', '[]', NULL, '[]', 1, 'v1', 'h', '[]',
'1970-01-01T00:00:00Z')",
params![chunk_id, doc_id],
)
.unwrap();
}
let embed_row = EmbeddingRecordRow {
embedding_id: format!("e{}", &chunk_id[..31]),
chunk_id: chunk_id.to_string(),
model_id: "m".to_string(),
model_version: "v1".to_string(),
dimensions: 4,
lance_table: "t".to_string(),
created_at: OffsetDateTime::UNIX_EPOCH,
};
store
.put_embedding_records_pending(std::slice::from_ref(&embed_row))
.unwrap();
store
.mark_embedding_records_committed(std::slice::from_ref(
&embed_row.embedding_id,
))
.unwrap();
}
fn cid(s: &str) -> ChunkId {
ChunkId(s.to_string())
}
#[test]
fn filter_chunks_drops_uncommitted_rows() {
let tmp = TempDir::new().unwrap();
let store = open_store(&tmp);
let c1 = "11111111111111111111111111111111";
let c2 = "22222222222222222222222222222222";
let d1 = "d1d1d1d1d1d1d1d1d1d1d1d1d1d1d1d1";
let d2 = "d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2";
seed_committed(&store, c1, d1, "a.md", "en", &[], "primary");
// c2: chunk + doc but no committed embedding row.
let asset_id = format!("a{}", &d2[..31]);
let conn = store.lock_conn();
conn.execute(
"INSERT INTO assets (
asset_id, source_uri, workspace_path, media_type, byte_len,
checksum, storage_kind, storage_path, discovered_at
) VALUES (?, 'file://b.md', 'b.md', '{}', 0,
'deadbeefdeadbeefdeadbeefdeadbeef',
'reference', 'b.md', '1970-01-01T00:00:00Z')",
params![asset_id],
)
.unwrap();
conn.execute(
"INSERT INTO documents (
doc_id, asset_id, workspace_path, title, lang, source_type,
trust_level, parser_version, doc_version, schema_version,
metadata_json, provenance_json, created_at, updated_at
) VALUES (?, ?, 'b.md', NULL, 'en', 'markdown', 'primary', 'v1',
1, 1, '{}', '{}',
'1970-01-01T00:00:00Z', '1970-01-01T00:00:00Z')",
params![d2, asset_id],
)
.unwrap();
conn.execute(
"INSERT INTO chunks (
chunk_id, doc_id, text, heading_path_json, section_label,
source_spans_json, token_estimate, chunker_version,
policy_hash, block_ids_json, created_at
) VALUES (?, ?, 'hi', '[]', NULL, '[]', 1, 'v1', 'h', '[]',
'1970-01-01T00:00:00Z')",
params![c2, d2],
)
.unwrap();
drop(conn);
let out = store
.filter_chunks(&[cid(c1), cid(c2)], &SearchFilters::default())
.unwrap();
assert_eq!(out, vec![cid(c1)]);
}
#[test]
fn filter_chunks_tags_any_lang_trust_path_glob() {
let tmp = TempDir::new().unwrap();
let store = open_store(&tmp);
// c1: tags=[ko-style], lang=en, primary, notes/a.md
// c2: tags=[other], lang=en, primary, notes/b.md
// c3: tags=[ko-style], lang=ko, secondary, notes/c.md
// c4: tags=[ko-style], lang=en, generated, src/d.md
let chunks = [
("11111111111111111111111111111111", "d1d1d1d1d1d1d1d1d1d1d1d1d1d1d1d1", "notes/a.md", "en", "primary", &["ko-style"][..]),
("22222222222222222222222222222222", "d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2", "notes/b.md", "en", "primary", &["other"][..]),
("33333333333333333333333333333333", "d3d3d3d3d3d3d3d3d3d3d3d3d3d3d3d3", "notes/c.md", "ko", "secondary", &["ko-style"][..]),
("44444444444444444444444444444444", "d4d4d4d4d4d4d4d4d4d4d4d4d4d4d4d4", "src/d.md", "en", "generated", &["ko-style"][..]),
];
for (c, d, p, l, t, tags) in &chunks {
seed_committed(&store, c, d, p, l, tags, t);
}
// tags_any=[ko-style] → c1, c3, c4 (drop c2).
let f = SearchFilters {
tags_any: vec!["ko-style".to_string()],
..Default::default()
};
let out = store
.filter_chunks(
&chunks.iter().map(|c| cid(c.0)).collect::<Vec<_>>(),
&f,
)
.unwrap();
let mut got: Vec<&str> = out.iter().map(|c| c.0.as_str()).collect();
got.sort();
assert_eq!(got, vec![chunks[0].0, chunks[2].0, chunks[3].0]);
// + lang=en → drops c3.
let f = SearchFilters {
tags_any: vec!["ko-style".to_string()],
lang: Some(Lang("en".to_string())),
..Default::default()
};
let out = store
.filter_chunks(
&chunks.iter().map(|c| cid(c.0)).collect::<Vec<_>>(),
&f,
)
.unwrap();
let mut got: Vec<&str> = out.iter().map(|c| c.0.as_str()).collect();
got.sort();
assert_eq!(got, vec![chunks[0].0, chunks[3].0]);
// + trust_min=Secondary → drops c4 (generated < secondary).
let f = SearchFilters {
tags_any: vec!["ko-style".to_string()],
lang: Some(Lang("en".to_string())),
trust_min: Some(TrustLevel::Secondary),
..Default::default()
};
let out = store
.filter_chunks(
&chunks.iter().map(|c| cid(c.0)).collect::<Vec<_>>(),
&f,
)
.unwrap();
let got: Vec<&str> = out.iter().map(|c| c.0.as_str()).collect();
assert_eq!(got, vec![chunks[0].0]);
// path_glob = "notes/*.md" with no other constraint → c1, c2, c3.
let f = SearchFilters {
path_glob: Some("notes/*.md".to_string()),
..Default::default()
};
let out = store
.filter_chunks(
&chunks.iter().map(|c| cid(c.0)).collect::<Vec<_>>(),
&f,
)
.unwrap();
let mut got: Vec<&str> = out.iter().map(|c| c.0.as_str()).collect();
got.sort();
assert_eq!(got, vec![chunks[0].0, chunks[1].0, chunks[2].0]);
}
#[test]
fn filter_chunks_preserves_input_order_and_dedupes() {
let tmp = TempDir::new().unwrap();
let store = open_store(&tmp);
let c1 = "11111111111111111111111111111111";
let c2 = "22222222222222222222222222222222";
let c3 = "33333333333333333333333333333333";
seed_committed(&store, c1, "d1d1d1d1d1d1d1d1d1d1d1d1d1d1d1d1", "a.md", "en", &[], "primary");
seed_committed(&store, c2, "d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2", "b.md", "en", &[], "primary");
seed_committed(&store, c3, "d3d3d3d3d3d3d3d3d3d3d3d3d3d3d3d3", "c.md", "en", &[], "primary");
// Ask in the order c3, c1, c2; result must preserve that order.
let out = store
.filter_chunks(&[cid(c3), cid(c1), cid(c2)], &SearchFilters::default())
.unwrap();
assert_eq!(out, vec![cid(c3), cid(c1), cid(c2)]);
// Duplicates in the input survive in the output (dedup is for
// the SQL IN-list only — caller may want repeats for ranking).
let out = store
.filter_chunks(&[cid(c1), cid(c1), cid(c2)], &SearchFilters::default())
.unwrap();
assert_eq!(out, vec![cid(c1), cid(c1), cid(c2)]);
}
#[test]
fn filter_chunks_empty_input_short_circuits() {
let tmp = TempDir::new().unwrap();
let store = open_store(&tmp);
let out = store.filter_chunks(&[], &SearchFilters::default()).unwrap();
assert!(out.is_empty());
}
}
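
The order-preserving step at the end of `filter_chunks` can be isolated as a small pure function. This is a simplified sketch: `retain_in_order` is a hypothetical name, plain `String` ids replace `ChunkId`, and a closure stands in for the compiled `globset` matcher.

```rust
use std::collections::HashMap;

/// Keep the candidates that appear in `allowed` (id -> workspace path)
/// and whose path passes `path_ok`, preserving input order and repeats.
pub fn retain_in_order<F>(
    candidates: &[String],
    allowed: &HashMap<String, String>,
    path_ok: F,
) -> Vec<String>
where
    F: Fn(&str) -> bool,
{
    candidates
        .iter()
        // Walking the input (not the map) is what preserves the caller's
        // ranking; HashMap iteration order is unspecified.
        .filter(|c| {
            allowed
                .get(c.as_str())
                .is_some_and(|p| path_ok(p.as_str()))
        })
        .cloned()
        .collect()
}
```

Because the loop iterates over the caller's slice, a distance-ascending candidate list stays distance-ascending after filtering, and duplicate ids in the input survive in the output.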


@@ -8,18 +8,25 @@
//!
//! Allowed deps per task spec: `kb-core`, `kb-config`, `rusqlite`,
//! `refinery`, `serde_json`, `time`, `blake3`, `tracing`, `anyhow`,
//! `thiserror`. NOT allowed: `kb-parse-*`, `kb-normalize`, `kb-chunk`,
//! `kb-store-vector`, `kb-source-fs`, etc. (`kb-parse-md`, `kb-normalize`,
//! `kb-chunk` may appear as **dev-deps** — see `Cargo.toml` — to drive
//! the contract round-trip test off a real Markdown fixture.)
//! `thiserror`. `globset` was added in P3-3 to back the
//! `filter_chunks` helper (used by `kb-store-vector`'s post-filter
//! pass — moving the SQL JOIN into this crate kept `kb-store-vector`
//! from needing its own `rusqlite` / `globset` direct deps). NOT
//! allowed: `kb-parse-*`, `kb-normalize`, `kb-chunk`, `kb-store-vector`,
//! `kb-source-fs`, etc. (`kb-parse-md`, `kb-normalize`, `kb-chunk` may
//! appear as **dev-deps** — see `Cargo.toml` — to drive the contract
//! round-trip test off a real Markdown fixture.)
mod documents;
mod embeddings;
mod error;
mod filters;
mod fts;
mod jobs;
mod schema;
mod store;
pub use embeddings::EmbeddingRecordRow;
pub use error::StoreError;
pub use fts::rebuild_chunks_fts;
pub use store::SqliteStore;


@@ -0,0 +1,55 @@
[package]
name = "kb-store-vector"
version = { workspace = true }
edition = { workspace = true }
rust-version = { workspace = true }
license = { workspace = true }
repository = { workspace = true }
description = "LanceDB-backed VectorStore for kb (§5.6 embedding_records, §6.3 lancedb tables, §7.2 VectorStore)"
[dependencies]
kb-core = { path = "../kb-core" }
kb-config = { path = "../kb-config" }
# kb-store-sqlite is allowed for writing and reading embedding_records
# rows only (P3-3 spec: "Allowed dep `kb-store-sqlite` for writing/reading
# rows in embedding_records"). The two-phase upsert flow uses
# `put_embedding_records_pending` + `mark_embedding_records_committed`;
# search post-filtering uses `filter_chunks`.
kb-store-sqlite = { path = "../kb-store-sqlite" }
# LanceDB embedded vector store. `default-features=false` opts out of
# the cloud object-store integrations (aws / gcs / azure / dynamodb /
# oss); kb is always-local for v1, so dragging in those SDKs would just
# inflate the build.
lancedb = { workspace = true }
arrow = { workspace = true }
arrow-array = { workspace = true }
arrow-schema = { workspace = true }
# Embedded async runtime. The VectorStore trait is sync (§7.2) but
# LanceDB's Rust API is async-only; we own a current-thread
# tokio::Runtime and `block_on` per trait method. current-thread avoids
# the worker-thread pool a multi-thread runtime would spawn (one thread
# per CPU core by default); kb-app already serializes vector ops behind
# its own job scheduler, so the extra parallelism would go unused.
tokio = { workspace = true }
# `try_collect` for streaming Lance query results into a Vec<RecordBatch>.
futures = { workspace = true }
serde = { workspace = true }
serde_json = { workspace = true }
tracing = { workspace = true }
thiserror = { workspace = true }
anyhow = { workspace = true }
blake3 = { workspace = true }
time = { workspace = true }
[dev-dependencies]
tempfile = { workspace = true }
serde_json = { workspace = true }
# Integration tests seed `documents` / `chunks` fixtures by raw SQL
# (no kb-parse-md / kb-normalize / kb-chunk dep) so they can construct
# adversarial filter / dim-mismatch states. rusqlite is a dev-dependency
# only: the runtime crate uses kb-store-sqlite's typed surface
# (`filter_chunks`, `put_embedding_records_pending`, …) and does not
# touch rusqlite directly (P3-3 spec: kb-store-vector must not list
# rusqlite/globset as direct deps).
rusqlite = { workspace = true }


@@ -0,0 +1,232 @@
//! Arrow schema + RecordBatch builder for the per-model Lance table.
//!
//! Per design §6.3 the per-row layout is:
//!
//! ```text
//! chunk_id : Utf8 (primary)
//! doc_id : Utf8
//! embedding : FixedSizeList<Float32, dim>
//! model_id : Utf8
//! embedding_version : Utf8
//! text : Utf8
//! heading_path : Utf8 (JSON-encoded Vec<String>)
//! created_at : Timestamp(Microsecond, UTC)
//! ```
//!
//! `heading_path` is encoded as a JSON string rather than a Lance
//! `List<Utf8>` to keep the `only_if` SQL filter surface clean — Lance
//! exposes scalar columns to its query DSL trivially, but list columns
//! need `array_contains`-style helpers that aren't required by the
//! current `SearchFilters` shape.
use std::sync::Arc;
use anyhow::{Context, Result};
use arrow_array::{
ArrayRef, FixedSizeListArray, Float32Array, RecordBatch, StringArray,
TimestampMicrosecondArray,
};
use arrow_schema::{DataType, Field, Schema, SchemaRef, TimeUnit};
use kb_core::VectorRecord;
use time::OffsetDateTime;
/// Arrow schema for a Lance table whose vector column is FixedSizeList
/// of `dim` Float32. All non-vector columns are non-nullable; the
/// vector column itself is non-nullable but the inner Float32 slot is
/// nullable per Arrow convention (Lance ignores the inner-nullable
/// flag when the outer field is non-null).
pub(crate) fn schema_for(dim: usize) -> SchemaRef {
Arc::new(Schema::new(vec![
Field::new("chunk_id", DataType::Utf8, false),
Field::new("doc_id", DataType::Utf8, false),
Field::new(
"embedding",
DataType::FixedSizeList(
Arc::new(Field::new("item", DataType::Float32, true)),
dim as i32,
),
false,
),
Field::new("model_id", DataType::Utf8, false),
Field::new("embedding_version", DataType::Utf8, false),
Field::new("text", DataType::Utf8, false),
Field::new("heading_path", DataType::Utf8, false),
Field::new(
"created_at",
DataType::Timestamp(TimeUnit::Microsecond, Some("UTC".into())),
false,
),
]))
}
/// Build a `RecordBatch` from `recs`. All records must share `dim`;
/// callers are expected to pre-bucket per-table batches before reaching
/// here. The batch carries `recs.len()` rows; `now` is folded into
/// `created_at` for every row to match design §6.3.
pub(crate) fn build_batch(
recs: &[VectorRecord],
dim: usize,
now: OffsetDateTime,
) -> Result<RecordBatch> {
let schema = schema_for(dim);
let chunk_ids = StringArray::from(
recs.iter().map(|r| r.chunk_id.0.as_str()).collect::<Vec<_>>(),
);
let doc_ids = StringArray::from(
recs.iter().map(|r| r.doc_id.0.as_str()).collect::<Vec<_>>(),
);
let model_ids = StringArray::from(
recs.iter().map(|r| r.model_id.0.as_str()).collect::<Vec<_>>(),
);
let model_versions = StringArray::from(
recs.iter()
.map(|r| r.model_version.0.as_str())
.collect::<Vec<_>>(),
);
let texts =
StringArray::from(recs.iter().map(|r| r.text.as_str()).collect::<Vec<_>>());
// heading_path: JSON-encode each record's Vec<String> into one Utf8 cell.
let heading_paths: Vec<String> = recs
.iter()
.map(|r| serde_json::to_string(&r.heading_path))
.collect::<std::result::Result<_, _>>()
.context("serialize heading_path JSON")?;
let heading_path_arr = StringArray::from(
heading_paths.iter().map(String::as_str).collect::<Vec<_>>(),
);
// Embedding: FixedSizeList<Float32, dim>. Build from the flat
// contiguous f32 buffer.
let mut flat: Vec<f32> = Vec::with_capacity(recs.len() * dim);
for r in recs {
if r.vector.len() != dim {
anyhow::bail!(
"vector length {} does not match table dim {} for chunk {}",
r.vector.len(),
dim,
r.chunk_id.0
);
}
flat.extend_from_slice(&r.vector);
}
let values = Float32Array::from(flat);
let embedding_field =
Arc::new(Field::new("item", DataType::Float32, true));
let embedding = FixedSizeListArray::try_new(
embedding_field,
dim as i32,
Arc::new(values),
None,
)
.context("build FixedSizeList embedding column")?;
// created_at: microseconds since Unix epoch, UTC.
let micros: Vec<i64> = std::iter::repeat_n(
(now.unix_timestamp_nanos() / 1_000) as i64,
recs.len(),
)
.collect();
let created_at = TimestampMicrosecondArray::from(micros).with_timezone("UTC");
let arrays: Vec<ArrayRef> = vec![
Arc::new(chunk_ids) as ArrayRef,
Arc::new(doc_ids),
Arc::new(embedding),
Arc::new(model_ids),
Arc::new(model_versions),
Arc::new(texts),
Arc::new(heading_path_arr),
Arc::new(created_at),
];
RecordBatch::try_new(schema, arrays).context("assemble RecordBatch")
}
/// blake3 hex digest of the JSON schema descriptor. Used as
/// `params_hash` for `id_for_index` so the `IndexId` stays stable
/// across invocations with the same `dim`.
pub(crate) fn schema_params_hash(dim: usize) -> String {
// Keep the hash input shape self-describing so a future schema
// tweak (extra column, type change, …) bumps the hash and produces
// a different `IndexId` automatically.
let descriptor = serde_json::json!({
"version": 1,
"dim": dim,
"columns": [
{"name": "chunk_id", "type": "Utf8"},
{"name": "doc_id", "type": "Utf8"},
{"name": "embedding", "type": "FixedSizeList<Float32>", "size": dim},
{"name": "model_id", "type": "Utf8"},
{"name": "embedding_version", "type": "Utf8"},
{"name": "text", "type": "Utf8"},
{"name": "heading_path", "type": "Utf8"},
{"name": "created_at", "type": "Timestamp<us, UTC>"},
],
});
let bytes = descriptor_bytes(&descriptor);
blake3::hash(&bytes).to_hex().to_string()
}
/// Serialize the schema descriptor to bytes for hashing. Plain
/// `serde_json::to_vec` rather than a canonical-JSON crate is fine
/// here because the descriptor is built from a fixed `serde_json::json!`
/// literal in `schema_params_hash`, and `serde_json` walks an object's
/// keys deterministically: sorted order by default (`Map` is backed by
/// `BTreeMap`), or insertion order when the `preserve_order` feature is
/// enabled. Either way the byte output is stable across runs of the
/// same build without a canonicalizer. The empty-vec fallback on the
/// (unreachable, given our literal input) error path keeps the
/// function infallible.
fn descriptor_bytes(v: &serde_json::Value) -> Vec<u8> {
serde_json::to_vec(v).unwrap_or_default()
}
#[cfg(test)]
mod tests {
use super::*;
use kb_core::{ChunkId, DocumentId, EmbeddingId, EmbeddingModelId, EmbeddingVersion};
use time::OffsetDateTime;
fn make_rec(chunk_idx: u8, dim: usize) -> VectorRecord {
VectorRecord {
chunk_id: ChunkId(format!("{:032x}", chunk_idx)),
embedding_id: EmbeddingId(format!("{:032x}", 0xeeeeu16 + chunk_idx as u16)),
vector: vec![0.1_f32; dim],
doc_id: DocumentId("aaaa".repeat(8)),
text: format!("text-{chunk_idx}"),
heading_path: vec!["A".to_string(), "B".to_string()],
model_id: EmbeddingModelId("test".to_string()),
model_version: EmbeddingVersion("v1".to_string()),
dimensions: dim,
}
}
#[test]
fn build_batch_round_trip_basic() {
let recs = vec![make_rec(1, 4), make_rec(2, 4)];
let batch = build_batch(&recs, 4, OffsetDateTime::UNIX_EPOCH).unwrap();
assert_eq!(batch.num_rows(), 2);
assert_eq!(batch.num_columns(), 8);
let schema = batch.schema();
assert_eq!(schema.field(0).name(), "chunk_id");
assert_eq!(schema.field(2).name(), "embedding");
}
#[test]
fn build_batch_dim_mismatch_errors() {
let mut rec = make_rec(1, 4);
rec.vector = vec![0.0_f32; 3];
let err = build_batch(&[rec], 4, OffsetDateTime::UNIX_EPOCH).unwrap_err();
let msg = format!("{err}");
assert!(msg.contains("does not match table dim"), "msg={msg}");
}
#[test]
fn schema_params_hash_is_stable_for_dim() {
let h1 = schema_params_hash(384);
let h2 = schema_params_hash(384);
assert_eq!(h1, h2);
let h3 = schema_params_hash(512);
assert_ne!(h1, h3);
}
}
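
The flat-buffer layout that `build_batch` feeds into the `FixedSizeList` column can be illustrated without Arrow. The following is a minimal std-only sketch, not the crate's code: `flatten_vectors` and `row` are hypothetical helper names, and the error type is a plain `String` rather than `anyhow::Error`. It shows the same length check and the row-to-slice mapping a fixed-size list implies.

```rust
/// Flatten per-record vectors into one contiguous buffer, rejecting any
/// vector whose length differs from `dim` (hypothetical helper).
fn flatten_vectors(vectors: &[Vec<f32>], dim: usize) -> Result<Vec<f32>, String> {
    let mut flat = Vec::with_capacity(vectors.len() * dim);
    for (row, v) in vectors.iter().enumerate() {
        if v.len() != dim {
            return Err(format!(
                "row {row}: vector length {} does not match dim {dim}",
                v.len()
            ));
        }
        flat.extend_from_slice(v);
    }
    Ok(flat)
}

/// Row `i` of a fixed-size list of width `dim` is the slice
/// `[i * dim, (i + 1) * dim)` of the flat buffer (hypothetical helper).
fn row(flat: &[f32], dim: usize, i: usize) -> &[f32] {
    &flat[i * dim..(i + 1) * dim]
}
```

Because every row has the same width, no offsets array is needed; that is the property a fixed-size list relies on, and it is why a single wrong-length vector has to be rejected before the buffer is assembled.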


@@ -0,0 +1,31 @@
//! `kb-store-vector` — LanceDB-backed [`kb_core::VectorStore`] for kb.
//!
//! Stores per-model Lance tables under `config.storage.vector_dir/`
//! (`chunk_embeddings_<model>_<dim>.lance/`). `upsert` runs the
//! SQLite-first / Lance-second two-phase write described in design
//! §5.6: phase 1 stages `embedding_records` rows at `status='pending'`,
//! phase 2 issues a Lance `MergeInsert` keyed on `chunk_id`, phase 3
//! flips the rows to `status='committed'`. `search` joins against
//! `embedding_records WHERE status='committed'` so partial-write Lance
//! rows never surface to callers; if the process crashes between phase
//! 2 and phase 3 (or phase 2 itself fails), the next `upsert` call
//! retries the still-pending rows idempotently because Lance MergeInsert
//! dedupes on `chunk_id`.
//!
//! Sync / async bridge: `VectorStore` is a sync trait (§7.2) and
//! LanceDB's Rust API is async-only. We own a private current-thread
//! `tokio::runtime::Runtime` and `block_on` per trait method. The
//! tradeoff is documented inline: a multi-thread runtime would let two
//! upserts run concurrently, but kb-app's job scheduler already
//! serializes vector ops, and current-thread avoids the worker-thread
//! pool (one thread per CPU core by default) that a multi-thread
//! runtime spawns.
//!
//! See `docs/superpowers/specs/2026-04-27-kb-final-form-design.md`
//! §5.6 (embedding_records DDL), §6.3 (lancedb table naming),
//! §7.2 (VectorStore), §9 (versioning).
mod arrow_batch;
mod paths;
mod store;
pub use store::LanceVectorStore;
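
The two-phase write described in the module docs can be modeled with in-memory maps. This is a rough sketch under stated assumptions, not the crate's implementation: `Stores`, `Status`, and the `fail_phase2` switch are hypothetical, with `ledger` standing in for `embedding_records` and `vectors` standing in for the Lance table.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Pending,
    Committed,
}

/// In-memory stand-ins (hypothetical): `ledger` plays the role of
/// `embedding_records`, `vectors` the role of the Lance table.
#[derive(Default)]
struct Stores {
    ledger: HashMap<String, Status>,
    vectors: HashMap<String, Vec<f32>>,
}

impl Stores {
    /// `fail_phase2` simulates a vector-store failure after staging.
    fn upsert(&mut self, recs: &[(String, Vec<f32>)], fail_phase2: bool) -> Result<(), String> {
        // Phase 1: stage every row as pending.
        for (id, _) in recs {
            self.ledger.insert(id.clone(), Status::Pending);
        }
        // Phase 2: keyed merge into the vector store. Re-running it with
        // the same ids overwrites rather than duplicates.
        if fail_phase2 {
            return Err("simulated vector-store failure".to_string());
        }
        for (id, v) in recs {
            self.vectors.insert(id.clone(), v.clone());
        }
        // Phase 3: flip the staged rows to committed.
        for (id, _) in recs {
            self.ledger.insert(id.clone(), Status::Committed);
        }
        Ok(())
    }

    /// Only committed ids are visible to search.
    fn visible(&self, id: &str) -> bool {
        self.ledger.get(id) == Some(&Status::Committed)
    }
}
```

A failed phase 2 leaves the ledger row at `Pending`, so it stays invisible to search; calling `upsert` again with the same records completes the write without duplicating the vector.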


@@ -0,0 +1,119 @@
//! Path expansion + table-name sanitization.
//!
//! Mirrors `kb-store-sqlite::store::expand_data_dir` and
//! `kb-embed-local::expand_path` so the three crates resolve
//! `${XDG_DATA_HOME:-…}` / leading `~` / `{data_dir}` identically. A
//! shared helper would live in `kb-config`, but the task spec forbids
//! adding new types to `kb-config`, so we keep a private clone.
use std::path::PathBuf;
/// Expand `{data_dir}` → `data_dir`, `${XDG_DATA_HOME:-…}` → env or
/// default, leading `~` → `$HOME`. Pass an empty `data_dir` when
/// resolving `data_dir` itself (the `{data_dir}` substitution is a
/// no-op in that case).
pub(crate) fn expand_path(raw: &str, data_dir: &str) -> PathBuf {
let mut s = raw.to_string();
if !data_dir.is_empty() {
s = s.replace("{data_dir}", data_dir);
}
// ${XDG_DATA_HOME:-~/.local/share}: env override, else default after `:-`.
if let Some(start) = s.find("${XDG_DATA_HOME") {
if let Some(rel_end) = s[start..].find('}') {
let end = start + rel_end + 1;
let inner = &s[start + 2..end - 1];
let replacement = match std::env::var("XDG_DATA_HOME") {
Ok(v) if !v.is_empty() => v,
_ => {
if let Some((_, default)) = inner.split_once(":-") {
default.to_string()
} else {
String::new()
}
}
};
s.replace_range(start..end, &replacement);
}
}
if let Some(rest) = s.strip_prefix('~') {
if let Some(home) = std::env::var_os("HOME").map(PathBuf::from) {
return home.join(rest.trim_start_matches('/'));
}
}
PathBuf::from(s)
}
/// Build the per-model Lance table name. Per design §6.3:
/// `chunk_embeddings_<model>_<dim>.lance`. Model IDs may contain
/// characters that are illegal in directory names on some filesystems
/// (Windows reserved chars, `/`, …) — squash anything outside
/// `[A-Za-z0-9-]` to `_` so the name is portable.
///
/// LanceDB's `connect(uri).open_table(name)` resolves `name` against
/// the connection root and appends the `.lance` suffix itself when it
/// materializes the table directory, so we pass the bare logical name
/// (`chunk_embeddings_<model>_<dim>`) and let Lance manage the suffix.
/// The spec's suffixed form refers to the on-disk directory; the bare
/// form is the logical table name.
pub(crate) fn lance_table_name(model_id: &str, dim: usize) -> String {
let sanitized = sanitize_model_id(model_id);
format!("chunk_embeddings_{sanitized}_{dim}")
}
/// Replace anything outside `[A-Za-z0-9-]` with `_`. Idempotent.
pub(crate) fn sanitize_model_id(model_id: &str) -> String {
model_id
.chars()
.map(|c| {
if c.is_ascii_alphanumeric() || c == '-' {
c
} else {
'_'
}
})
.collect()
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn sanitize_replaces_path_separators() {
assert_eq!(sanitize_model_id("BAAI/bge-small-en"), "BAAI_bge-small-en");
}
#[test]
fn sanitize_keeps_dash_and_alpha_num() {
assert_eq!(sanitize_model_id("e5-small-v2"), "e5-small-v2");
}
#[test]
fn sanitize_squashes_dot_and_colon() {
assert_eq!(sanitize_model_id("model.v1:fast"), "model_v1_fast");
}
#[test]
fn lance_table_name_format() {
assert_eq!(
lance_table_name("BAAI/bge-small-en", 384),
"chunk_embeddings_BAAI_bge-small-en_384"
);
}
#[test]
fn expand_path_substitutes_data_dir() {
let p = expand_path("{data_dir}/lancedb", "/tmp/kbtest");
assert_eq!(p, PathBuf::from("/tmp/kbtest/lancedb"));
}
#[test]
fn expand_path_passthrough_absolute() {
let p = expand_path("/abs/dir", "/ignored");
assert_eq!(p, PathBuf::from("/abs/dir"));
}
}
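
The `${XDG_DATA_HOME:-…}` handling in `expand_path` reads the process environment, which makes it awkward to unit-test. One possible approach, sketched here as an assumption rather than the crate's design, is to inject the lookup. `expand_var_default` and its `lookup` parameter are hypothetical names; the slicing mirrors the logic in `expand_path`.

```rust
/// Expand the first `${NAME:-default}` (or `${NAME}`) occurrence in `raw`.
/// `lookup` stands in for `std::env::var` so the sketch is testable
/// without mutating the process environment (hypothetical helper).
fn expand_var_default(
    raw: &str,
    name: &str,
    lookup: impl Fn(&str) -> Option<String>,
) -> String {
    // Builds the literal prefix `${NAME`.
    let marker = format!("${{{name}");
    let Some(start) = raw.find(&marker) else {
        return raw.to_string();
    };
    let Some(rel_end) = raw[start..].find('}') else {
        return raw.to_string();
    };
    let end = start + rel_end + 1; // one past the closing brace
    let inner = &raw[start + 2..end - 1]; // text between `${` and `}`
    let replacement = match lookup(name) {
        // A set, non-empty variable wins.
        Some(v) if !v.is_empty() => v,
        // Otherwise fall back to the text after `:-`, or to nothing.
        _ => inner
            .split_once(":-")
            .map(|(_, default)| default.to_string())
            .unwrap_or_default(),
    };
    format!("{}{}{}", &raw[..start], replacement, &raw[end..])
}
```

Passing `|name| std::env::var(name).ok()` as `lookup` would reproduce the environment-backed behavior; tests can pass a closure that returns a fixed value instead.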


@@ -0,0 +1,551 @@
//! `LanceVectorStore` — `kb_core::VectorStore` impl over LanceDB.
//!
//! See module-level docs in `lib.rs` for the high-level shape (two-phase
//! upsert, sync/async bridge, table layout).
use std::collections::HashSet;
use std::path::PathBuf;
use std::sync::Arc;
use anyhow::{Context, Result};
use arrow_array::{Array, Float32Array, RecordBatch, StringArray};
use arrow_schema::SchemaRef;
use futures::TryStreamExt;
use kb_core::{
ChunkId, DocumentId, EmbeddingModelId, IndexId, SearchFilters,
VectorHit, VectorRecord, VectorStore,
};
use kb_store_sqlite::{EmbeddingRecordRow, SqliteStore};
use lancedb::Connection;
use lancedb::query::{ExecutableQuery, QueryBase};
use serde_json::json;
use time::OffsetDateTime;
use tokio::runtime::{Builder as RuntimeBuilder, Runtime};
use crate::arrow_batch::{build_batch, schema_for, schema_params_hash};
use crate::paths::{expand_path, lance_table_name};
/// Overfetch multiplier: when post-filtering Lance results against
/// SQLite-side filters we ask for `2 * k` candidates so a moderately
/// selective filter still returns `k` hits. P3-3 spec line 138 caps
/// the doubling at this multiplier; deeper retries are out of scope.
const OVERFETCH_MULTIPLIER: usize = 2;
/// `IndexId` collection label per design §4.2.
const INDEX_COLLECTION: &str = "chunk_embeddings";
/// `IndexId` kind label — flat cosine for v1 (§7.2 + spec line 85).
const INDEX_KIND: &str = "flat";
/// `IndexVersion` token. The schema doesn't expose IndexVersion as a
/// dimension we vary per call, but `id_for_index` requires one; pin to
/// `v1` so re-runs produce stable IDs.
const INDEX_VERSION: &str = "v1";
/// Lance VectorStore.
///
/// Holds a single `lancedb::Connection` opened against
/// `config.storage.vector_dir/`. The connection is cheap to clone
/// (it is `Arc`-backed internally) and is reused across `ensure_table` /
/// `upsert` / `search`. The `tokio::Runtime` is current-thread;
/// multi-thread would buy concurrency we don't currently exploit
/// (kb-app's job scheduler serializes vector ops) at the cost of a
/// worker-thread pool (one thread per CPU core by default).
///
/// # Async context
///
/// `LanceVectorStore` owns a private `tokio::runtime::Runtime` and
/// drives every `VectorStore` trait method through `runtime.block_on`.
/// **Do NOT construct or call any of these methods from inside another
/// tokio runtime context** — `block_on` panics with `"Cannot start a
/// runtime from within a runtime"` in that case. `kb-app`'s job
/// scheduler is synchronous so this is safe today; if a future caller
/// wants to embed `LanceVectorStore` inside an async server they must
/// wrap calls in `tokio::task::spawn_blocking` (or move to an
/// async-native `VectorStore` impl).
pub struct LanceVectorStore {
runtime: Runtime,
connection: Connection,
sqlite: Arc<SqliteStore>,
/// Resolved absolute path to the Lance root. Kept for diagnostics
/// only — the `Connection` already knows it.
#[allow(dead_code)]
vector_dir: PathBuf,
}
impl LanceVectorStore {
/// Open (or create) the Lance directory under
/// `config.storage.vector_dir`, build a current-thread tokio
/// runtime, and return a ready-to-use store. Migrations on the
/// SQLite side must already have been applied (`run_migrations`)
/// — this constructor does not touch the SQLite schema.
///
/// **Caveat:** internally calls `runtime.block_on` to open the
/// Lance connection. Calling this from inside another tokio
/// runtime context will panic with `"Cannot start a runtime from
/// within a runtime"`. See the struct-level `# Async context`
/// section.
pub fn new(config: &kb_config::Config, sqlite: Arc<SqliteStore>) -> Result<Self> {
let data_dir = expand_path(&config.storage.data_dir, "");
let vector_dir =
expand_path(&config.storage.vector_dir, &data_dir.to_string_lossy());
std::fs::create_dir_all(&vector_dir)
.with_context(|| format!("create vector_dir {}", vector_dir.display()))?;
// current-thread runtime: see module docs. Multi-thread would
// spawn a worker-thread pool we don't use.
let runtime = RuntimeBuilder::new_current_thread()
.enable_all()
.build()
.context("build tokio runtime for kb-store-vector")?;
let uri = vector_dir.to_string_lossy().into_owned();
let connection = runtime
.block_on(async {
lancedb::connect(&uri)
.execute()
.await
.context("lancedb::connect")
})?;
tracing::debug!(
target: "kb-store-vector",
vector_dir = %vector_dir.display(),
"opened LanceVectorStore"
);
Ok(Self {
runtime,
connection,
sqlite,
vector_dir,
})
}
/// Open or create the Lance table with the current schema. Returns
/// a handle the caller can use for queries.
async fn ensure_table_async(
connection: &Connection,
table_name: &str,
dim: usize,
) -> Result<lancedb::Table> {
match connection.open_table(table_name).execute().await {
Ok(t) => Ok(t),
Err(lancedb::Error::TableNotFound { .. }) => {
let schema = schema_for(dim);
let table = connection
.create_empty_table(table_name, schema)
.execute()
.await
.context("create_empty_table")?;
tracing::info!(
target: "kb-store-vector",
table = table_name,
dim,
"created Lance table"
);
Ok(table)
}
Err(e) => Err(anyhow::Error::from(e)).context("open_table"),
}
}
/// Validate that the on-disk Lance table's schema matches what
/// `schema_for(dim)` produces. Used by `upsert` to fail fast on a
/// dim mismatch BEFORE any phase-1 SQLite write lands.
fn check_dim(table_schema: &SchemaRef, dim: usize) -> Result<()> {
let field = table_schema
.field_with_name("embedding")
.context("table missing 'embedding' column")?;
match field.data_type() {
arrow_schema::DataType::FixedSizeList(_, table_dim) => {
if (*table_dim as usize) != dim {
anyhow::bail!(
"dimension mismatch: table has dim {}, records have dim {}",
table_dim,
dim
);
}
Ok(())
}
other => anyhow::bail!(
"embedding column has unexpected Arrow type {:?}",
other
),
}
}
}
impl VectorStore for LanceVectorStore {
fn ensure_table(
&self,
model: &EmbeddingModelId,
dim: usize,
) -> Result<IndexId> {
let table_name = lance_table_name(&model.0, dim);
// The trait method only needs the IndexId — we don't return the
// Lance handle. Open (or create) the table to enforce idempotence
// (a second call with the same params must succeed and yield
// the same IndexId).
self.runtime.block_on(async {
Self::ensure_table_async(&self.connection, &table_name, dim).await
})?;
let params_hash = schema_params_hash(dim);
let id = kb_core::id_for_index(
INDEX_COLLECTION,
model,
dim,
&kb_core::IndexVersion(INDEX_VERSION.to_string()),
INDEX_KIND,
&params_hash,
);
Ok(id)
}
fn upsert(&self, recs: &[VectorRecord]) -> Result<()> {
if recs.is_empty() {
return Ok(());
}
// All records in a single upsert call must share (model_id,
// model_version, dimensions). Callers (kb-app indexer) already
// batch by model; we enforce here so a misuse fails loudly.
let model_id = recs[0].model_id.clone();
let model_version = recs[0].model_version.clone();
let dim = recs[0].dimensions;
for r in recs {
if r.model_id != model_id
|| r.model_version != model_version
|| r.dimensions != dim
{
anyhow::bail!(
"kb-store-vector::upsert called with mixed (model_id, model_version, dim) — caller must bucket per table"
);
}
}
let table_name = lance_table_name(&model_id.0, dim);
// Open (or create) the Lance table FIRST and check its on-disk
// dim against what the records claim. A mismatch must error
// before any phase-1 SQLite write — spec line 94: "Dimension
// mismatch returns Error from upsert and writes nothing."
let table = self.runtime.block_on(async {
Self::ensure_table_async(&self.connection, &table_name, dim).await
})?;
let table_schema = self
.runtime
.block_on(async { table.schema().await.context("read table schema") })?;
Self::check_dim(&table_schema, dim)?;
// Phase 1: stage embedding_records rows at status='pending'.
let now = OffsetDateTime::now_utc();
let pending_rows: Vec<EmbeddingRecordRow> = recs
.iter()
.map(|r| EmbeddingRecordRow {
embedding_id: r.embedding_id.0.clone(),
chunk_id: r.chunk_id.0.clone(),
model_id: r.model_id.0.clone(),
model_version: r.model_version.0.clone(),
dimensions: r.dimensions,
lance_table: table_name.clone(),
created_at: now,
})
.collect();
self.sqlite
.put_embedding_records_pending(&pending_rows)
.context("phase 1: stage pending embedding_records")?;
// Phase 2: Lance MergeInsert keyed on chunk_id.
let batch = build_batch(recs, dim, now)?;
merge_insert_batch(&self.runtime, &table, batch)
.context("phase 2: Lance MergeInsert")?;
// Phase 3: flip rows to status='committed'. If we crashed
// between phase 2 and phase 3, the rows stay 'pending' and a
// future upsert call retries them (Lance MergeInsert dedupes
// on chunk_id, so the retry is a no-op on the Lance side).
let embedding_ids: Vec<String> =
recs.iter().map(|r| r.embedding_id.0.clone()).collect();
self.sqlite
.mark_embedding_records_committed(&embedding_ids)
.context("phase 3: mark embedding_records committed")?;
tracing::info!(
target: "kb-store-vector",
table = %table_name,
rows = recs.len(),
"upsert committed"
);
Ok(())
}
fn search(
&self,
query_vec: &[f32],
k: usize,
filters: &SearchFilters,
) -> Result<Vec<VectorHit>> {
if k == 0 {
return Ok(Vec::new());
}
// We need to know which table to query. SearchFilters doesn't
// carry a model_id (the trait doesn't expose one to the
// caller), so we scan known tables on disk and pick the one
// matching `query_vec.len()`. In v1 there's typically one
// model in play; if there are several we pick the first match.
let dim = query_vec.len();
let table_name = match self
.runtime
.block_on(async { find_matching_table(&self.connection, dim).await })?
{
Some(name) => name,
None => {
tracing::debug!(
target: "kb-store-vector",
dim,
"search: no Lance table matches query dim — returning empty"
);
return Ok(Vec::new());
}
};
// Pre-fetch 2*k Lance rows; we'll filter against SQLite
// afterwards. If filters are empty we still over-fetch to
// exclude tombstoned / pending rows.
let overfetch = k.saturating_mul(OVERFETCH_MULTIPLIER).max(k);
let raw_hits = self.runtime.block_on(async {
let table = match self.connection.open_table(&table_name).execute().await
{
Ok(t) => t,
Err(lancedb::Error::TableNotFound { .. }) => return Ok(Vec::new()),
Err(e) => return Err(anyhow::Error::from(e)),
};
let stream = table
.vector_search(query_vec)
.context("vector_search")?
.distance_type(lancedb::DistanceType::Cosine)
.limit(overfetch)
.execute()
.await
.context("execute vector query")?;
let batches: Vec<RecordBatch> =
stream.try_collect().await.context("collect batches")?;
Result::<Vec<RecordBatch>>::Ok(batches)
})?;
let candidates = decode_lance_hits(&raw_hits)?;
// Filter against embedding_records (status='committed') and
// documents (tags / lang / path / trust). For the empty filter
// case the join still excludes tombstoned / pending rows.
// The `filter_chunks` helper lives in kb-store-sqlite (the
// crate that owns the schema), so this crate doesn't need its
// own rusqlite / globset direct deps.
let candidate_ids: Vec<ChunkId> = {
// Deduplicate — Lance result batches can in principle
// repeat a chunk_id across batches; the JOIN is most
// efficient if we ask once per id.
let mut seen = HashSet::new();
candidates
.iter()
.filter(|c| seen.insert(c.chunk_id.0.clone()))
.map(|c| c.chunk_id.clone())
.collect()
};
let allowed_set: HashSet<String> = self
.sqlite
.filter_chunks(&candidate_ids, filters)
.context("post-filter chunks via kb-store-sqlite")?
.into_iter()
.map(|c| c.0)
.collect();
let mut hits: Vec<VectorHit> = candidates
.into_iter()
.filter(|c| allowed_set.contains(&c.chunk_id.0))
.take(k)
.map(LanceCandidate::into_hit)
.collect();
// Re-rank by score desc to give callers a consistent ordering
// regardless of post-filter shuffling.
hits.sort_by(|a, b| {
b.score
.partial_cmp(&a.score)
.unwrap_or(std::cmp::Ordering::Equal)
});
Ok(hits)
}
}
/// One Lance row decoded from a query batch, paired with the converted
/// score and the fields that `into_hit` later assembles into the JSON
/// payload. We keep `chunk_id` separately so the SQLite filter pass
/// can JOIN against it without re-parsing the payload.
struct LanceCandidate {
chunk_id: ChunkId,
doc_id: DocumentId,
text: String,
heading_path: Vec<String>,
score: f32,
}
impl LanceCandidate {
fn into_hit(self) -> VectorHit {
let payload = json!({
"doc_id": self.doc_id.0,
"text": self.text,
"heading_path": self.heading_path,
});
VectorHit {
chunk_id: self.chunk_id,
score: self.score,
payload,
}
}
}
/// Decode a list of Lance result batches into typed candidates.
/// Lance's vector query attaches a `_distance: Float32` column; we
/// convert to similarity via `1 - distance` then shift to `[0, 1]`
/// via `(sim + 1) / 2` per spec line 96. NaN distances get score 0
/// (with a warn log).
fn decode_lance_hits(batches: &[RecordBatch]) -> Result<Vec<LanceCandidate>> {
let mut out = Vec::new();
for batch in batches {
let chunk_ids = batch
.column_by_name("chunk_id")
.context("missing chunk_id col")?
.as_any()
.downcast_ref::<StringArray>()
.context("chunk_id wrong type")?;
let doc_ids = batch
.column_by_name("doc_id")
.context("missing doc_id col")?
.as_any()
.downcast_ref::<StringArray>()
.context("doc_id wrong type")?;
let texts = batch
.column_by_name("text")
.context("missing text col")?
.as_any()
.downcast_ref::<StringArray>()
.context("text wrong type")?;
let heading_path_str = batch
.column_by_name("heading_path")
.context("missing heading_path col")?
.as_any()
.downcast_ref::<StringArray>()
.context("heading_path wrong type")?;
let distances = batch
.column_by_name("_distance")
.context("missing _distance col")?
.as_any()
.downcast_ref::<Float32Array>()
.context("_distance wrong type")?;
for i in 0..batch.num_rows() {
let dist = distances.value(i);
let score = score_from_distance(dist);
let heading_path: Vec<String> = serde_json::from_str(
heading_path_str.value(i),
)
.unwrap_or_default();
out.push(LanceCandidate {
chunk_id: ChunkId(chunk_ids.value(i).to_string()),
doc_id: DocumentId(doc_ids.value(i).to_string()),
text: texts.value(i).to_string(),
heading_path,
score,
});
}
}
Ok(out)
}
/// Convert a cosine distance (LanceDB returns `1 - cosine_similarity`,
/// which lies in `[0, 2]` for any pair of non-zero vectors) to a
/// `[0, 1]` score via `score = ((1 - distance) + 1) / 2`. Per spec line
/// 96 the shift (rather than clamp) preserves ordering between
/// unrelated and opposite vectors. NaN, which Lance can produce when
/// one side is the all-zero vector, collapses to 0 with a warn.
fn score_from_distance(distance: f32) -> f32 {
if distance.is_nan() {
tracing::warn!(
target: "kb-store-vector",
"NaN cosine distance from Lance — coercing to score 0"
);
return 0.0;
}
let sim = 1.0 - distance;
(sim + 1.0) / 2.0
}
/// Find a Lance table whose embedding column is FixedSizeList<Float32, dim>.
async fn find_matching_table(
connection: &Connection,
dim: usize,
) -> Result<Option<String>> {
let names = connection
.table_names()
.execute()
.await
.context("table_names")?;
for name in names {
if !name.starts_with("chunk_embeddings_") {
continue;
}
match connection.open_table(&name).execute().await {
Ok(t) => {
let schema = t.schema().await.context("schema for table")?;
if let Ok(field) = schema.field_with_name("embedding") {
if let arrow_schema::DataType::FixedSizeList(_, table_dim) =
field.data_type()
{
if (*table_dim as usize) == dim {
return Ok(Some(name));
}
}
}
}
Err(e) => {
tracing::warn!(
target: "kb-store-vector",
table = %name,
error = %e,
"search: skipped unopenable table"
);
}
}
}
Ok(None)
}
/// Run the Lance MergeInsert under our embedded runtime. Pulled out
/// of `upsert` so the trait method stays compact.
fn merge_insert_batch(
runtime: &Runtime,
table: &lancedb::Table,
batch: RecordBatch,
) -> Result<()> {
let schema = batch.schema();
runtime.block_on(async move {
let reader = arrow_array::RecordBatchIterator::new(
vec![Ok(batch)].into_iter(),
schema,
);
let mut builder = table.merge_insert(&["chunk_id"]);
builder
.when_matched_update_all(None)
.when_not_matched_insert_all();
builder
.execute(Box::new(reader))
.await
.context("MergeInsert execute")?;
Result::<()>::Ok(())
})
}
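
To see why `score_from_distance` shifts rather than clamps, it helps to compare the two mappings side by side. The sketch below is illustrative only: `shifted_score` restates the formula above without the `tracing` call, and `clamped_score` is a hypothetical alternative that is not part of this crate.

```rust
/// Map a cosine distance in [0, 2] to a score in [0, 1] by shifting:
/// similarity = 1 - distance, score = (similarity + 1) / 2.
fn shifted_score(distance: f32) -> f32 {
    if distance.is_nan() {
        return 0.0;
    }
    let similarity = 1.0 - distance;
    (similarity + 1.0) / 2.0
}

/// Hypothetical alternative for comparison: clamp negative similarity
/// to 0 instead of shifting.
fn clamped_score(distance: f32) -> f32 {
    if distance.is_nan() {
        return 0.0;
    }
    (1.0 - distance).clamp(0.0, 1.0)
}
```

With shifting, an orthogonal pair (distance 1) scores 0.5 and an opposite pair (distance 2) scores 0, so the two remain distinguishable. With clamping, every distance at or above 1 collapses to 0 and that ranking signal is lost.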


@@ -0,0 +1,185 @@
//! Shared scaffolding for kb-store-vector integration tests.
//!
//! # Test policy
//!
//! Integration tests in this crate are marked `#[ignore]` and require
//! AVX-capable hardware. They are excluded from the default `cargo
//! test -p kb-store-vector` lane and only run when explicitly opted
//! in:
//!
//! ```text
//! cargo test -p kb-store-vector -- --ignored
//! ```
//!
//! The reason: LanceDB's f32 SIMD path uses unconditional AVX
//! intrinsics (`__m256` in `lance-linalg::simd::f32`). On x86_64
//! CPUs without AVX support — notably QEMU's default `qemu64` model
//! in CI sandboxes and some bare-metal dev boxes — those instructions
//! trigger `SIGILL: illegal instruction` at the first `vector_search`
//! call. Rather than silently turn that into a "passing" test (which
//! it isn't), we gate the integration suite behind `#[ignore]` and
//! call [`require_avx_or_panic`] inside each test body so that an
//! `--ignored` invocation on a non-AVX host fails loudly rather than
//! crashing later inside a Lance kernel.
//!
//! This mirrors P3-2's `#[ignore]` policy on tests that require a
//! model download — both are CI-lane decisions, not silent skips.
//!
//! Each test owns a `TempDir` (vector_dir + sqlite db live underneath
//! it), a fully-migrated `SqliteStore`, and a `LanceVectorStore`
//! pointed at both. We seed `documents` / `chunks` rows directly via
//! SQL (rather than going through `DocumentStore::put_document`) so
//! the tests stay independent of kb-parse-md / kb-normalize / kb-chunk
//! and so we can construct adversarial fixtures (filtered tags,
//! mismatched langs) without reproducing a Markdown round-trip.
#![allow(dead_code)]
use std::path::PathBuf;
use std::sync::Arc;
/// Panic if the host CPU lacks AVX. Called from every `#[ignore]`-d
/// integration test body so that `cargo test -- --ignored` on a
/// non-AVX host fails loudly with a clear message instead of crashing
/// later inside a Lance SIMD kernel with `SIGILL`.
///
/// On non-x86_64 hosts this is a no-op (Lance's AVX requirement is
/// x86-only — ARM/Apple Silicon paths use different intrinsics that
/// the workspace doesn't currently target).
pub fn require_avx_or_panic() {
#[cfg(target_arch = "x86_64")]
{
if !std::is_x86_feature_detected!("avx") {
panic!(
"kb-store-vector integration test requires AVX-capable hardware; \
host CPU lacks AVX. Run on an AVX-capable machine. \
See crates/kb-store-vector/tests/common/mod.rs."
);
}
}
}
use kb_config::Config;
use kb_core::{
ChunkId, DocumentId, EmbeddingId, EmbeddingModelId, EmbeddingVersion, VectorRecord,
};
use kb_store_sqlite::SqliteStore;
use kb_store_vector::LanceVectorStore;
use rusqlite::params;
use tempfile::TempDir;
pub struct TestEnv {
pub temp: TempDir,
pub config: Config,
pub sqlite: Arc<SqliteStore>,
pub vector: LanceVectorStore,
}
impl TestEnv {
pub fn new() -> Self {
let temp = tempfile::tempdir().expect("tempdir");
let mut config = Config::defaults();
config.storage.data_dir = temp.path().to_string_lossy().into_owned();
let sqlite = SqliteStore::open(&config).unwrap();
sqlite.run_migrations().unwrap();
let sqlite = Arc::new(sqlite);
let vector = LanceVectorStore::new(&config, sqlite.clone()).unwrap();
Self {
temp,
config,
sqlite,
vector,
}
}
pub fn data_dir(&self) -> PathBuf {
self.temp.path().to_path_buf()
}
/// Insert minimum (asset, document, chunk) rows so phase-1
/// embedding_records inserts don't trip the FK to chunks /
/// documents.
pub fn seed_chunk(
&self,
chunk_id: &str,
doc_id: &str,
workspace_path: &str,
lang: &str,
tags: &[&str],
trust_level: &str,
) {
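// Asset id is derived deterministically from doc_id: "a" plus its
// first 31 characters. Doc ids that differ only in their final
// character therefore share one asset row, which the
// `INSERT OR IGNORE` below tolerates. `doc_id` must be at least
// 31 bytes long or the slice panics.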
let asset_id = format!("a{}", &doc_id[..31]);
let conn = self.sqlite.read_conn();
conn.execute(
"INSERT OR IGNORE INTO assets (
asset_id, source_uri, workspace_path, media_type, byte_len,
checksum, storage_kind, storage_path, discovered_at
) VALUES (?, ?, ?, ?, 0, ?, 'reference', ?, '1970-01-01T00:00:00Z')",
params![
asset_id,
format!("file://{workspace_path}"),
workspace_path,
"{}",
"deadbeefdeadbeefdeadbeefdeadbeef",
workspace_path,
],
)
.unwrap();
conn.execute(
"INSERT OR IGNORE INTO documents (
doc_id, asset_id, workspace_path, title, lang, source_type,
trust_level, parser_version, doc_version, schema_version,
metadata_json, provenance_json, created_at, updated_at
) VALUES (?, ?, ?, NULL, ?, 'markdown', ?, 'v1', 1, 1, '{}', '{}',
'1970-01-01T00:00:00Z', '1970-01-01T00:00:00Z')",
params![doc_id, asset_id, workspace_path, lang, trust_level],
)
.unwrap();
for t in tags {
conn.execute(
"INSERT OR IGNORE INTO document_tags (doc_id, tag) VALUES (?, ?)",
params![doc_id, t],
)
.unwrap();
}
conn.execute(
"INSERT OR IGNORE INTO chunks (
chunk_id, doc_id, text, heading_path_json, section_label,
source_spans_json, token_estimate, chunker_version,
policy_hash, block_ids_json, created_at
) VALUES (?, ?, 'hi', '[]', NULL, '[]', 1, 'v1', 'h', '[]', '1970-01-01T00:00:00Z')",
params![chunk_id, doc_id],
)
.unwrap();
}
}
/// Build a deterministic test VectorRecord from a few simple inputs.
/// `vector` is taken verbatim, `dimensions` is set from `vector.len()`.
pub fn make_record(
chunk_idx: u8,
doc_idx: u8,
vector: Vec<f32>,
text: &str,
heading: &[&str],
model: &str,
) -> VectorRecord {
let dim = vector.len();
let chunk_id = ChunkId(format!("{:032x}", 0x1100u32 + chunk_idx as u32));
let doc_id = DocumentId(format!("{:032x}", 0xd0c0u32 + doc_idx as u32));
let embedding_id =
EmbeddingId(format!("{:032x}", 0xeeee0000u32 + chunk_idx as u32));
VectorRecord {
chunk_id,
embedding_id,
vector,
doc_id,
text: text.to_string(),
heading_path: heading.iter().map(|s| s.to_string()).collect(),
model_id: EmbeddingModelId(model.to_string()),
model_version: EmbeddingVersion("v1".to_string()),
dimensions: dim,
}
}
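
The id scheme used by `make_record` and `seed_chunk` above can be sketched in isolation. This is a minimal std-only illustration, not part of the crate: the helper names `chunk_id_for`, `doc_id_for`, `asset_id_for` and `chunk_idx_of` are hypothetical, and only the `format!` patterns and the `0x1100` / `0xd0c0` offsets are taken from the test helpers.

```rust
/// Same layout as `make_record`: a 32-character, zero-padded,
/// lowercase hex string built from a small offset plus the index.
fn chunk_id_for(chunk_idx: u8) -> String {
    format!("{:032x}", 0x1100u32 + chunk_idx as u32)
}

fn doc_id_for(doc_idx: u8) -> String {
    format!("{:032x}", 0xd0c0u32 + doc_idx as u32)
}

/// Same derivation as `seed_chunk`: "a" plus the first 31 characters
/// of the doc id. Like the original, this panics on shorter input.
fn asset_id_for(doc_id: &str) -> String {
    format!("a{}", &doc_id[..31])
}

/// Hypothetical inverse of `chunk_id_for`: read the last four hex
/// digits and subtract the offset. Returns `None` on malformed input
/// instead of panicking or underflowing.
fn chunk_idx_of(chunk_id: &str) -> Option<u32> {
    let start = chunk_id.len().checked_sub(4)?;
    let suffix = chunk_id.get(start..)?;
    u32::from_str_radix(suffix, 16).ok()?.checked_sub(0x1100)
}
```

Because the asset id drops the final hex digit of the doc id, doc indices 0 and 1 resolve to the same asset id, while 0x0f and 0x10 do not.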

@@ -0,0 +1,4 @@
{
"_comment": "PLACEHOLDER — regenerate via `KB_UPDATE_SNAPSHOTS=1 cargo test -p kb-store-vector -- --ignored snapshot` on an AVX-capable host. Until then the snapshot test panics with a clear 'placeholder' message.",
"hits": []
}

@@ -0,0 +1,119 @@
//! Snapshot test: a fixed corpus + fixed query produces a stable
//! `Vec<VectorHit>` JSON. Pinning the snapshot here catches accidental
//! drift in score scaling, payload shape, or top-k ordering.
//!
//! This test is `#[ignore]` and requires AVX-capable hardware. Run
//! with `cargo test -p kb-store-vector -- --ignored snapshot`.
//!
//! The committed fixture at `tests/fixtures/vector/run-1.json` is a
//! placeholder until first regenerated on AVX hardware. The test
//! detects the placeholder via its `_comment` field and panics with
//! a clear "regenerate me" message (see the `_comment` check just
//! before the final `assert_eq!` in the test body).
use std::path::PathBuf;
use kb_core::{SearchFilters, VectorStore};
use serde_json::json;
mod common;
use common::{TestEnv, make_record, require_avx_or_panic};
const MODEL: &str = "snapshot-model";
#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn vector_hits_snapshot_run_1() {
require_avx_or_panic();
let env = TestEnv::new();
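// Fixed deterministic corpus: 4 (approximately) unit-norm vectors,
// each with a known doc / chunk / heading. The query equals chunk
// 0's vector, so chunk 0 ranks first and chunk 1 second. Chunks 2
// and 3 are both orthogonal to the query and tie for the third
// slot, so the snapshot also pins how that tie is broken.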
let corpus = vec![
(0u8, vec![1.0_f32, 0.0, 0.0, 0.0], "alpha", &["A"][..]),
(1u8, vec![0.95_f32, 0.31, 0.0, 0.0], "beta", &["A", "B"][..]),
(2u8, vec![0.0_f32, 1.0, 0.0, 0.0], "gamma", &["B"][..]),
(3u8, vec![0.0_f32, 0.0, 1.0, 0.0], "delta", &[][..]),
];
let mut recs = Vec::new();
for (i, vec, text, headings) in &corpus {
let rec = make_record(*i, *i, vec.clone(), text, headings, MODEL);
env.seed_chunk(
&rec.chunk_id.0,
&rec.doc_id.0,
&format!("notes/{i}.md"),
"en",
&[],
"primary",
);
recs.push(rec);
}
env.vector.upsert(&recs).unwrap();
let q = vec![1.0_f32, 0.0, 0.0, 0.0];
let hits = env.vector.search(&q, 3, &SearchFilters::default()).unwrap();
// The snapshot pins:
// - top-3 chunk_id ordering (by score desc)
// - payload shape: { doc_id, text, heading_path }
// - that each score lies in [0, 1], recorded as a boolean so the
//   exact float value is not part of the fixture
let actual = json!(
hits.iter().map(|h| json!({
"chunk_id": h.chunk_id.0,
"score_in_unit_interval": (0.0..=1.0).contains(&h.score),
"payload": h.payload,
})).collect::<Vec<_>>()
);
let fixture = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
.join("vector")
.join("run-1.json");
if std::env::var_os("KB_UPDATE_SNAPSHOTS").is_some() {
std::fs::create_dir_all(fixture.parent().unwrap()).unwrap();
std::fs::write(&fixture, serde_json::to_string_pretty(&actual).unwrap())
.unwrap();
return;
}
let expected: serde_json::Value =
serde_json::from_str(&std::fs::read_to_string(&fixture).unwrap_or_else(
|_| panic!(
"missing snapshot fixture at {}; run with KB_UPDATE_SNAPSHOTS=1 to create",
fixture.display()
),
))
.unwrap();
// Refuse to silently "pass" when the fixture is the committed
// placeholder. The placeholder JSON carries a `_comment` field
// with regeneration instructions; production fixtures (a captured
// hits array) do not.
if expected.get("_comment").is_some() {
panic!(
"snapshot fixture is a placeholder — regenerate on AVX hardware then commit. \
Path: {}. To regenerate: \
`KB_UPDATE_SNAPSHOTS=1 cargo test -p kb-store-vector -- --ignored snapshot`.",
fixture.display()
);
}
assert_eq!(
actual, expected,
"snapshot drift; rerun with KB_UPDATE_SNAPSHOTS=1 to regenerate"
);
// Independent guard: scores must be non-increasing.
for w in hits.windows(2) {
assert!(
w[0].score >= w[1].score,
"scores not in descending order: {} then {}",
w[0].score,
w[1].score
);
}
}
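
For reference, the score mapping that this snapshot guards can be worked through by hand. The sketch below follows the commit description (cosine distance in [0, 2], similarity = 1 - distance, score = (similarity + 1) / 2, NaN coerced to 0); it is an illustration under those stated assumptions, not the crate's code, and `cosine_similarity`, `shifted_score` and `score_for` are hypothetical names.

```rust
/// Plain cosine similarity; yields NaN when either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

/// Shift a cosine distance in [0, 2] into a score in [0, 1] without
/// clamping, so negative similarities still rank below zero similarity.
fn shifted_score(distance: f32) -> f32 {
    if distance.is_nan() {
        return 0.0; // the commit message describes NaN being coerced to 0
    }
    let similarity = 1.0 - distance;
    (similarity + 1.0) / 2.0
}

/// Score a stored vector against a query under the mapping above.
fn score_for(query: &[f32], stored: &[f32]) -> f32 {
    let distance = 1.0 - cosine_similarity(query, stored);
    shifted_score(distance)
}
```

Under this mapping the snapshot corpus scores 1.0 for chunk 0, roughly 0.975 for chunk 1, and exactly 0.5 for both chunk 2 and chunk 3, which is why the third hit is a tie.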

@@ -0,0 +1,374 @@
//! Integration tests for `LanceVectorStore` covering ensure_table,
//! upsert, search, dimension mismatch, filters, model isolation, and
//! determinism.
//!
//! Every test in this file is `#[ignore]` and requires an AVX-capable
//! x86_64 host. Run with:
//!
//! ```text
//! cargo test -p kb-store-vector -- --ignored
//! ```
//!
//! See `tests/common/mod.rs` for the full rationale.
use kb_core::{EmbeddingModelId, SearchFilters, VectorStore};
use kb_store_sqlite::EmbeddingRecordRow;
use rusqlite::params;
use time::OffsetDateTime;
mod common;
use common::{TestEnv, make_record, require_avx_or_panic};
const MODEL: &str = "test-model";
/// Helper: produce a unit-norm 4-D vector along one of the four
/// coordinate axes (any `idx` above 2 maps to the fourth axis). The
/// axes are mutually orthogonal, so cosine similarity is 1 on the
/// same axis and 0 otherwise, and search-ordering tests don't depend
/// on float jitter.
fn dir(idx: u8) -> Vec<f32> {
match idx {
0 => vec![1.0, 0.0, 0.0, 0.0],
1 => vec![0.0, 1.0, 0.0, 0.0],
2 => vec![0.0, 0.0, 1.0, 0.0],
_ => vec![0.0, 0.0, 0.0, 1.0],
}
}
#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn ensure_table_idempotent_returns_same_index_id() {
require_avx_or_panic();
let env = TestEnv::new();
let model = EmbeddingModelId(MODEL.to_string());
let id1 = env.vector.ensure_table(&model, 4).unwrap();
let id2 = env.vector.ensure_table(&model, 4).unwrap();
assert_eq!(id1, id2);
}
#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn search_before_upsert_returns_empty() {
require_avx_or_panic();
let env = TestEnv::new();
let hits = env
.vector
.search(&dir(0), 5, &SearchFilters::default())
.unwrap();
assert!(hits.is_empty());
}
#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn upsert_ten_then_search_returns_five() {
require_avx_or_panic();
let env = TestEnv::new();
let mut recs = Vec::new();
for i in 0..10u8 {
// 4-D vectors clustered near dir(0) for the first half, dir(1)
// for the rest, with small per-row jitter so they stay
// distinct in the index.
let mut v = if i < 5 { dir(0) } else { dir(1) };
v[3] = (i as f32) * 0.001;
let rec = make_record(i, i, v, &format!("text-{i}"), &["A"], MODEL);
env.seed_chunk(
&rec.chunk_id.0,
&rec.doc_id.0,
&format!("notes/{i}.md"),
"en",
&[],
"primary",
);
recs.push(rec);
}
env.vector.upsert(&recs).unwrap();
// 1:1 alignment check: every record has a committed embedding row.
{
let conn = env.sqlite.read_conn();
let count: i64 = conn
.query_row(
"SELECT COUNT(*) FROM embedding_records WHERE status = 'committed'",
[],
|r| r.get(0),
)
.unwrap();
assert_eq!(count, 10);
}
let hits = env
.vector
.search(&dir(0), 5, &SearchFilters::default())
.unwrap();
assert_eq!(hits.len(), 5, "expected 5 hits, got {}", hits.len());
// Top hits should be from the first half (clustered around dir(0)).
// make_record lays chunk_idx into the low bits of `0x1100 + i`, so
// `chunk_idx = u32::from_str_radix(last4, 16) - 0x1100`. The first
// half (chunk_idx < 5) lives in 0x1100..=0x1104.
for h in &hits {
let suffix_hex = &h.chunk_id.0[h.chunk_id.0.len() - 4..];
let idx = u32::from_str_radix(suffix_hex, 16).unwrap();
let chunk_idx = idx - 0x1100;
assert!(
chunk_idx < 5,
"top-5 hit unexpectedly came from second cluster: idx={chunk_idx}"
);
}
}
#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn dimension_mismatch_errors_and_writes_nothing() {
require_avx_or_panic();
let env = TestEnv::new();
let model = EmbeddingModelId(MODEL.to_string());
// First populate a 4-D table with one row so it exists on disk.
let r0 = make_record(0, 0, dir(0), "first", &[], MODEL);
env.seed_chunk(&r0.chunk_id.0, &r0.doc_id.0, "notes/0.md", "en", &[], "primary");
env.vector.upsert(&[r0]).unwrap();
// Re-opening the now-populated table must still be idempotent.
let id_a = env.vector.ensure_table(&model, 4).unwrap();
let id_b = env.vector.ensure_table(&model, 4).unwrap();
assert_eq!(id_a, id_b);
// Now try to push an 8-D vector into the existing 4-D table via
// `upsert`. The table-name function bakes the dimension into the
// name, so the only way to provoke a real record-vs-table mismatch
// is to corrupt `dimensions`: the record then resolves to the
// existing 4-D table while its embedded vector is 8-D.
// Spec line 94: must error, write nothing extra.
let mut bad = make_record(1, 1, vec![0.1_f32; 8], "second", &[], MODEL);
// Pretend this is a 4-D vector for table-name purposes;
// `build_batch` then enforces vector.len() == dim and bails.
bad.dimensions = 4;
env.seed_chunk(&bad.chunk_id.0, &bad.doc_id.0, "notes/1.md", "en", &[], "primary");
let bad_chunk = bad.chunk_id.0.clone();
let err = env.vector.upsert(&[bad]).unwrap_err();
let msg = format!("{err:#}");
assert!(
msg.to_lowercase().contains("dim"),
"unexpected error message: {msg}"
);
// The phase-1 row may have landed before phase 2 detected the
// mismatch, but the bad record must never become searchable. This
// test does not inspect the Lance table directly; as a proxy it
// asserts that no `committed` row exists for the bad record's
// chunk_id.
let conn = env.sqlite.read_conn();
let committed: i64 = conn
.query_row(
"SELECT COUNT(*) FROM embedding_records WHERE chunk_id = ? AND status = 'committed'",
rusqlite::params![bad_chunk],
|r| r.get(0),
)
.unwrap();
assert_eq!(committed, 0, "bad record reached committed state despite dim mismatch");
}
#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn filter_tags_any_drops_non_matching_docs() {
require_avx_or_panic();
let env = TestEnv::new();
// Two docs: one tagged "ko-style", one tagged "other".
let r_a = make_record(0xaa, 0xaa, dir(0), "alpha", &[], MODEL);
let r_b = make_record(0xbb, 0xbb, dir(0), "beta", &[], MODEL);
env.seed_chunk(
&r_a.chunk_id.0,
&r_a.doc_id.0,
"notes/a.md",
"en",
&["ko-style"],
"primary",
);
env.seed_chunk(
&r_b.chunk_id.0,
&r_b.doc_id.0,
"notes/b.md",
"en",
&["other"],
"primary",
);
let expected_doc_id = r_a.doc_id.0.clone();
env.vector.upsert(&[r_a, r_b]).unwrap();
let filters = SearchFilters {
tags_any: vec!["ko-style".to_string()],
..Default::default()
};
let hits = env.vector.search(&dir(0), 10, &filters).unwrap();
assert_eq!(hits.len(), 1, "expected only the tagged doc to match");
let payload = &hits[0].payload;
assert_eq!(payload["doc_id"], expected_doc_id);
}
#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn model_isolation_two_models_two_directories() {
require_avx_or_panic();
let env = TestEnv::new();
let r1 = make_record(0xaa, 0xaa, dir(0), "alpha", &[], "model-A");
env.seed_chunk(
&r1.chunk_id.0,
&r1.doc_id.0,
"notes/a.md",
"en",
&[],
"primary",
);
let chunk_id = r1.chunk_id.0.clone();
env.vector.upsert(&[r1]).unwrap();
// Same chunk_id, different model — should land in a separate table.
let mut r2 = make_record(0xaa, 0xaa, dir(0), "alpha", &[], "model-B");
r2.embedding_id = kb_core::EmbeddingId(
"ee01ee01ee01ee01ee01ee01ee01ee01".to_string(),
);
env.vector.upsert(&[r2]).unwrap();
// Two on-disk Lance directories, distinguished by table name.
let lancedb_root = env.data_dir().join("lancedb");
let entries: Vec<_> = std::fs::read_dir(&lancedb_root)
.unwrap()
.filter_map(Result::ok)
.map(|e| e.file_name().to_string_lossy().into_owned())
.collect();
let a_count = entries
.iter()
.filter(|e| e.contains("model-A"))
.count();
let b_count = entries
.iter()
.filter(|e| e.contains("model-B"))
.count();
assert!(a_count >= 1, "model-A table missing: {entries:?}");
assert!(b_count >= 1, "model-B table missing: {entries:?}");
// Two embedding_records rows for the same chunk_id, one per model.
let conn = env.sqlite.read_conn();
let count: i64 = conn
.query_row(
"SELECT COUNT(*) FROM embedding_records WHERE chunk_id = ?",
params![chunk_id],
|r| r.get(0),
)
.unwrap();
assert_eq!(count, 2);
}
#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn determinism_same_query_same_top_k() {
require_avx_or_panic();
let env = TestEnv::new();
let recs: Vec<_> = (0..6u8)
.map(|i| {
let mut v = dir(i % 4);
v[3] = (i as f32) * 0.001;
let rec = make_record(i, i, v, &format!("t-{i}"), &[], MODEL);
env.seed_chunk(
&rec.chunk_id.0,
&rec.doc_id.0,
&format!("notes/{i}.md"),
"en",
&[],
"primary",
);
rec
})
.collect();
env.vector.upsert(&recs).unwrap();
let q = dir(0);
let h1 = env.vector.search(&q, 4, &SearchFilters::default()).unwrap();
let h2 = env.vector.search(&q, 4, &SearchFilters::default()).unwrap();
let ids1: Vec<_> = h1.iter().map(|h| h.chunk_id.0.clone()).collect();
let ids2: Vec<_> = h2.iter().map(|h| h.chunk_id.0.clone()).collect();
assert_eq!(ids1, ids2);
}
#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn upsert_retry_promotes_pending_to_committed() {
// Crash-recovery contract: a phase-1 row that was already
// committed by a prior batch is left alone by phase-3, but a
// pending row gets retried and reaches committed once Lance
// accepts it.
//
// Construction of the "crash" state:
//
// 1. Stage a row directly via the SQLite phase-1 helper
// (`put_embedding_records_pending`). NO Lance write happens
// here — this is exactly the on-disk state after a crash
// between phase 1 and phase 2. Confirm the row is at
// `status='pending'` before doing anything else.
//
// 2. Run `LanceVectorStore::upsert` with a `VectorRecord` whose
// `embedding_id` matches the pending row. Phase 1's
// `INSERT OR REPLACE` is idempotent here (same row payload),
// phase 2 actually writes to Lance for the first time, and
// phase 3 flips the row to 'committed'.
//
// 3. Verify status='committed' and vector_committed=1.
//
// This actually exercises the "rows stuck at pending get promoted
// on next upsert" semantics — the previous version pre-seeded via
// raw SQL but then the same upsert call overwrote the seed via
// INSERT OR REPLACE before phase 2 ran, so the recovery path
// never executed.
require_avx_or_panic();
let env = TestEnv::new();
let rec = make_record(0xaa, 0xaa, dir(0), "alpha", &[], MODEL);
let chunk_id = rec.chunk_id.0.clone();
let doc_id = rec.doc_id.0.clone();
let embedding_id = rec.embedding_id.0.clone();
env.seed_chunk(&chunk_id, &doc_id, "notes/a.md", "en", &[], "primary");
// Phase 1 only — go through the same kb-store-sqlite helper that
// `LanceVectorStore::upsert` uses internally. No Lance write
// happens, so this models "crashed between phase 1 and phase 2".
let pending_row = EmbeddingRecordRow {
embedding_id: embedding_id.clone(),
chunk_id: chunk_id.clone(),
model_id: MODEL.to_string(),
model_version: "v1".to_string(),
dimensions: 4,
lance_table: format!("chunk_embeddings_{MODEL}_4"),
created_at: OffsetDateTime::UNIX_EPOCH,
};
env.sqlite
.put_embedding_records_pending(std::slice::from_ref(&pending_row))
.unwrap();
// Sanity: the row is staged but NOT yet committed. Lance has no
// record of it either, since nothing has written to Lance yet; only
// the SQLite side is asserted here.
{
let conn = env.sqlite.read_conn();
let (status, committed): (String, i64) = conn
.query_row(
"SELECT status, vector_committed FROM embedding_records WHERE embedding_id = ?",
params![embedding_id],
|r| Ok((r.get(0)?, r.get(1)?)),
)
.unwrap();
assert_eq!(status, "pending", "row should be at status=pending after phase-1-only");
assert_eq!(committed, 0);
}
// Now run upsert with the matching record. Phase 1's INSERT OR
// REPLACE is effectively a no-op (same row payload), phase 2 lands
// the Lance row for the first time, and phase 3 promotes the row to
// status='committed'.
env.vector.upsert(&[rec]).unwrap();
let conn = env.sqlite.read_conn();
let (status, committed): (String, i64) = conn
.query_row(
"SELECT status, vector_committed FROM embedding_records WHERE embedding_id = ?",
params![embedding_id],
|r| Ok((r.get(0)?, r.get(1)?)),
)
.unwrap();
assert_eq!(status, "committed");
assert_eq!(committed, 1);
}
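
The pending-to-committed contract exercised by `upsert_retry_promotes_pending_to_committed` can be modelled without SQLite or Lance. The following is a toy in-memory sketch under stated assumptions: `ToyStore`, its fields, and the `vector_write_ok` switch are all hypothetical stand-ins (a map for `embedding_records`, a set for the Lance table), and it only mirrors the three phases as the commit message describes them.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Pending,
    Committed,
}

#[derive(Default)]
struct ToyStore {
    /// Stand-in for `embedding_records` (chunk_id -> status).
    records: HashMap<String, Status>,
    /// Stand-in for the Lance table, keyed on chunk_id.
    vectors: HashSet<String>,
}

impl ToyStore {
    /// Three-phase upsert; `vector_write_ok = false` simulates a
    /// failure (or crash) between phase 1 and phase 3.
    fn upsert(&mut self, chunk_ids: &[&str], vector_write_ok: bool) -> Result<(), String> {
        // Phase 1: stage every row as pending (insert-or-replace).
        for id in chunk_ids {
            self.records.insert(id.to_string(), Status::Pending);
        }
        // Phase 2: keyed write, so re-running it is idempotent.
        if !vector_write_ok {
            return Err("simulated vector-store failure".to_string());
        }
        for id in chunk_ids {
            self.vectors.insert(id.to_string());
        }
        // Phase 3: promote the staged rows.
        for id in chunk_ids {
            self.records.insert(id.to_string(), Status::Committed);
        }
        Ok(())
    }

    /// A row is searchable only when both sides agree it is committed.
    fn visible(&self, chunk_id: &str) -> bool {
        self.vectors.contains(chunk_id)
            && self.records.get(chunk_id) == Some(&Status::Committed)
    }
}
```

A failed first call leaves the row at `Pending` and invisible; calling `upsert` again with the same id promotes it to `Committed` without duplicating the vector entry.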

@@ -0,0 +1,46 @@
-- V003__embedding_status.sql — additive embedding lifecycle markers (§5.6).
--
-- P3-3 introduces a two-phase write to `embedding_records` paired with
-- a Lance MergeInsert. Phase 1 inserts the row at `status='pending'`;
-- phase 2 issues the Lance write; phase 3 flips the row to
-- `status='committed'`. `search` joins back through this table with
-- `WHERE status='committed'` so partial-write Lance rows never surface
-- to callers, and a retry after a crashed phase 2 simply re-runs
-- against the still-pending row (Lance MergeInsert dedupes on `chunk_id`).
--
-- The third state, `tombstone`, is reserved for the deletion pipeline:
-- when a chunk row goes away, the matching Lance row should also be
-- garbage-collected, but the GC scheduler is out of P3-3 scope. The
-- BEFORE DELETE trigger below stages the marker so a future GC has a
-- well-defined claim; see the comment block on the trigger for why
-- it currently coexists with V001's `ON DELETE CASCADE` FK rather than
-- replacing it.
ALTER TABLE embedding_records ADD COLUMN status TEXT NOT NULL DEFAULT 'pending'
CHECK (status IN ('pending','committed','tombstone'));
ALTER TABLE embedding_records ADD COLUMN vector_committed INTEGER NOT NULL DEFAULT 0;
CREATE INDEX idx_embed_status ON embedding_records(status);
-- Tombstone trigger.
--
-- Intent: when a `chunks` row is about to be deleted, mark its
-- dependent `embedding_records` rows as `status='tombstone'` so a later
-- GC pass can drop the matching Lance rows in lockstep.
--
-- Caveat (carried into a future migration): V001 declared the FK as
-- `chunk_id REFERENCES chunks(chunk_id) ON DELETE CASCADE`. SQLite's
-- documented order is "BEFORE-DELETE trigger fires first, then CASCADE
-- runs", so this UPDATE will land a `tombstone` value that is
-- immediately followed by the CASCADE removing the row. The trigger is
-- therefore best-effort under the current FK; the only path that
-- actually preserves the tombstone is to drop the CASCADE (table
-- recreation, since SQLite has no DROP CONSTRAINT) — that is queued
-- for a P+ migration once the GC scheduler exists and we have actual
-- production rows to migrate. Keeping the trigger here documents the
-- design intent and gives the deletion-pipeline observer a stable hook
-- to wire into.
CREATE TRIGGER chunks_bd_tombstone_embeddings BEFORE DELETE ON chunks BEGIN
UPDATE embedding_records SET status='tombstone' WHERE chunk_id = old.chunk_id;
END;