feat(p3-3): kb-store-vector — LanceDB VectorStore + V003 embedding status
First VectorStore implementation. Per-model Lance tables under
config.storage.vector_dir, two-phase upsert (SQLite-pending → Lance
MergeInsert → SQLite-committed) with crash-safe retry, search via
cosine distance with the spec's score-shift (preserves negative
similarity ranking signal that clamping would crush).

V003 migration:
- Adds status (CHECK constraint pending|committed|tombstone, default
  pending) and vector_committed columns to embedding_records.
- BEFORE DELETE trigger on chunks flips dependent rows to tombstone.
  Currently overshadowed by V001's ON DELETE CASCADE FK; trigger UPDATE
  runs first, then the row vanishes via CASCADE. Spec-faithful tombstone
  preservation requires recreating embedding_records to drop the
  CASCADE; deferred to a P+ migration since no production rows exist
  yet (P3-3 is the first writer). V003 SQL comment explains.

LanceVectorStore:
- ensure_table is idempotent: opens existing or creates with the Arrow
  schema (chunk_id, doc_id, embedding FixedSizeList<Float32, dim>,
  model_id, embedding_version, text, heading_path, created_at).
- IndexId computed via id_for_index with collection="chunk_embeddings",
  index_kind="flat", params_hash = blake3(descriptor JSON). Schema bumps
  automatically rotate the IndexId.
- upsert: phase-1 INSERT OR REPLACE INTO embedding_records
  (status='pending') in a single SQLite tx; phase-2 Lance MergeInsert
  keyed on chunk_id (idempotent re-run); phase-3 UPDATE
  status='committed', vector_committed=1. If phase-2 fails the rows stay
  'pending' and the next upsert call retries idempotently.
- search joins embedding_records WHERE status='committed' so
  partial-write rows never surface. Cosine distance from Lance ∈ [0, 2]
  → similarity = 1 - distance ∈ [-1, 1] → score = (similarity + 1)/2
  ∈ [0, 1]. NaN coerced to 0 with tracing::warn. Filter by SearchFilters
  via SqliteStore::filter_chunks (added in this commit).
- Sync trait + async LanceDB bridged by an embedded current-thread tokio
  runtime. Doc-comment on the struct flags the "do NOT call from inside
  another tokio runtime" panic (block_on cannot nest). kb-app's job
  scheduler is sync today.

kb-store-sqlite additions:
- pub fn put_embedding_records_pending(&[EmbeddingRecordRow]): phase-1
  INSERT OR REPLACE (status='pending', vector_committed=0).
- pub fn mark_embedding_records_committed(&[EmbeddingId]): phase-3
  single UPDATE … WHERE embedding_id IN (?, ?, …) via params_from_iter,
  guarded by WHERE status='pending' so tombstones don't get clobbered.
- pub fn filter_chunks(&[ChunkId], &SearchFilters) → Vec<ChunkId>
  consolidates the JOIN against documents/document_tags/
  embedding_records + path_glob via globset. Lets kb-store-vector honor
  SearchFilters without depending on rusqlite or globset directly.
  (kb-search's filter logic is structurally different, interleaved with
  the FTS5 SELECT, so it stays as-is for now; consolidation is a P+
  refactor.)
- 4 new unit tests cover the phase-1 round-trip, empty batch, replay
  reset of pending rows, and the WHERE-status-pending guard.

Tests:
- 9 lib unit tests in kb-store-vector covering paths/sanitization,
  arrow_batch dim validation + descriptor hash, bm25-style cosine score
  shift math.
- 4 new kb-store-sqlite unit tests on filter_chunks (committed-only,
  tags/lang/trust/path_glob, order preservation, empty input).
- 4 new kb-store-sqlite unit tests on the embedding_records helpers.
- 8 integration tests in upsert_search.rs and 1 snapshot test marked
  #[ignore = "requires AVX-capable hardware (LanceDB)"]. They invoke
  require_avx_or_panic() at the top of each body so a missing-AVX
  --ignored run fails loudly instead of silently passing. This dev host
  (qemu64 model) lacks AVX so these were NOT exercised end-to-end here;
  first CI lane on AVX hardware will validate them.
- Snapshot fixture tests/fixtures/vector/run-1.json is a placeholder
  with an _comment marker. Snapshot test panics until the placeholder is
  replaced via KB_UPDATE_SNAPSHOTS=1 on AVX hardware.
- Workspace 241 passed, 19 ignored, 0 failed; cargo clippy --workspace
  --all-targets -- -D warnings clean.

Allowed deps respected (kb-core, kb-config, kb-store-sqlite, lancedb,
arrow + arrow-array + arrow-schema, serde, serde_json, tracing,
thiserror) plus forced waivers: anyhow (trait return type), tokio +
futures (LanceDB async-only API), blake3 (params_hash). rusqlite and
globset are NOT direct deps of kb-store-vector (confirmed via cargo
metadata --no-deps). rusqlite stays in [dev-dependencies] for the test
fixture seeder only.

Out of scope: IVF/PQ index tuning (P+), image vectors (P6), kb-app
embed_index orchestration (P3-4 facade).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
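
The score shift described in the commit message can be sketched in a few lines. This is a minimal illustration of the stated arithmetic only; the function names `score_from_cosine_distance` and `clamped_score` are hypothetical and are not taken from kb-store-vector, and the `tracing::warn` call on NaN is left out.

```rust
/// Map a cosine distance in [0, 2] to a score in [0, 1] using the shift
/// from the commit message: similarity = 1 - distance, score = (similarity + 1) / 2.
/// NaN is coerced to 0.0 (the real code reportedly also logs a warning).
/// Hypothetical name, for illustration only.
fn score_from_cosine_distance(distance: f32) -> f32 {
    if distance.is_nan() {
        return 0.0;
    }
    let similarity = 1.0 - distance; // in [-1, 1]
    (similarity + 1.0) / 2.0 // in [0, 1]
}

/// Clamping alternative, shown only for contrast: every negative
/// similarity collapses to 0.0, so ranking among those hits is lost.
fn clamped_score(distance: f32) -> f32 {
    (1.0 - distance).clamp(0.0, 1.0)
}
```

With the shift, two hits at distances 1.2 and 1.8 keep distinct scores (the closer one ranks higher), whereas clamping maps both to 0.0.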
3953
Cargo.lock
generated
File diff suppressed because it is too large
11
Cargo.toml
@@ -9,6 +9,7 @@ members = [
    "crates/kb-normalize",
    "crates/kb-chunk",
    "crates/kb-store-sqlite",
    "crates/kb-store-vector",
    "crates/kb-search",
    "crates/kb-embed",
    "crates/kb-embed-local",
@@ -43,3 +44,13 @@ proptest = "1"
# downloads). Pinned to the 4.x line per task p3-2 (current 5.x release
# remains untested for this workspace).
fastembed = "4.9"
# LanceDB embedded vector store (P3-3). 0.23.x pulls arrow / arrow-array /
# arrow-schema 56.x transitively (via lance 1.0); the kb-store-vector
# crate matches that major to share the same Arrow types without a
# re-export adapter.
lancedb = { version = "0.23", default-features = false }
arrow = "56"
arrow-array = "56"
arrow-schema = "56"
tokio = { version = "1", features = ["rt", "macros"] }
futures = "0.3"

@@ -14,6 +14,11 @@ kb-config = { path = "../kb-config" }
# Explicitly NOT `bundled-sqlcipher` per task allowed-deps list.
rusqlite = { version = "0.32", features = ["bundled"] }
refinery = { version = "0.8", features = ["rusqlite"] }
# Used by `filter_chunks` for the optional `path_glob` post-filter.
# The SQL prefilter handles tags / lang / trust / committed-status; the
# Rust-side glob keeps the SQL surface small (no LIKE-vs-glob impedance
# mismatch) and matches the pattern kb-search/src/lexical.rs uses.
globset = { workspace = true }
serde_json = { workspace = true }
time = { workspace = true }
blake3 = { workspace = true }

317
crates/kb-store-sqlite/src/embeddings.rs
Normal file
@@ -0,0 +1,317 @@
//! Embedding-records writers used by `kb-store-vector` (P3-3).
//!
//! The `VectorStore` impl in `kb-store-vector` performs a two-phase write:
//! phase 1 stages an `embedding_records` row at `status='pending'` before
//! issuing the Lance write, and phase 3 promotes those same rows to
//! `status='committed'` after the Lance commit lands. We surface those
//! two SQL statements here (rather than expose a generic write
//! connection) so the SQL stays inside the crate that owns the schema:
//! kb-store-vector consumes a typed, narrowly-scoped API and never
//! touches the connection mutex itself.
//!
//! Each helper runs inside a single SQLite transaction (phase 1 executes
//! one `INSERT OR REPLACE` per row, phase 3 a single batched `UPDATE`),
//! so a failure rolls the whole batch back: phase 1 never leaves a
//! partially staged batch and phase 3 never leaves a partially
//! committed one.

use anyhow::{Context, Result};
use rusqlite::{params, params_from_iter};
use time::OffsetDateTime;
use time::format_description::well_known::Rfc3339;

use crate::error::StoreError;
use crate::store::SqliteStore;

/// Row payload for [`SqliteStore::put_embedding_records_pending`].
///
/// Mirrors the columns of `embedding_records` minus the lifecycle markers
/// (`status` and `vector_committed`); those are forced to `'pending'`
/// and `0` by phase 1.
///
/// `created_at` is `OffsetDateTime` rather than a pre-formatted string so
/// the helper owns the RFC3339 formatting (the same formatting choice
/// the asset / document / job writers make).
#[derive(Clone, Debug)]
pub struct EmbeddingRecordRow {
    pub embedding_id: String,
    pub chunk_id: String,
    pub model_id: String,
    pub model_version: String,
    pub dimensions: usize,
    pub lance_table: String,
    pub created_at: OffsetDateTime,
}

impl SqliteStore {
    /// Phase 1 of the kb-store-vector two-phase write: stage every
    /// `embedding_records` row with `status='pending'`,
    /// `vector_committed=0`. `INSERT OR REPLACE` (rather than UPSERT) is
    /// the right shape here because re-running phase 1 for an
    /// already-pending row resets `vector_committed` to 0 and the
    /// `created_at` to the new attempt's timestamp. Both are desired,
    /// because a retry should look like a fresh attempt to the GC pass.
    ///
    /// All rows are written in a single transaction; if any row fails
    /// the entire batch is rolled back and the caller can retry without
    /// worrying about partial pending state.
    pub fn put_embedding_records_pending(
        &self,
        rows: &[EmbeddingRecordRow],
    ) -> Result<()> {
        if rows.is_empty() {
            return Ok(());
        }
        let mut conn = self.lock_conn();
        let tx = conn.transaction().map_err(StoreError::from)?;
        {
            let mut stmt = tx
                .prepare(
                    "INSERT OR REPLACE INTO embedding_records (
                        embedding_id, chunk_id, model_id, model_version,
                        dimensions, lance_table, created_at,
                        status, vector_committed
                     ) VALUES (?, ?, ?, ?, ?, ?, ?, 'pending', 0)",
                )
                .map_err(StoreError::from)?;
            for row in rows {
                let created_at = row
                    .created_at
                    .format(&Rfc3339)
                    .context("format embedding_records.created_at")?;
                stmt.execute(params![
                    row.embedding_id,
                    row.chunk_id,
                    row.model_id,
                    row.model_version,
                    row.dimensions as i64,
                    row.lance_table,
                    created_at,
                ])
                .map_err(StoreError::from)?;
            }
        }
        tx.commit().map_err(StoreError::from)?;
        Ok(())
    }

    /// Phase 3 of the kb-store-vector two-phase write: after the Lance
    /// MergeInsert commits, flip the listed embedding rows to
    /// `status='committed'`, `vector_committed=1`. Rows that aren't
    /// currently `pending` (e.g. already committed by a duplicate batch,
    /// or tombstoned by a chunks DELETE between phase 1 and phase 3)
    /// are deliberately left alone via `WHERE status='pending'`: we
    /// never resurrect a tombstone, and we never blindly re-mark a
    /// committed row.
    ///
    /// All updates run in a single statement (single SQL `UPDATE …
    /// WHERE embedding_id IN (?, ?, …)`) inside one transaction, which
    /// avoids the per-row `execute()` round-trip the previous
    /// implementation paid.
    pub fn mark_embedding_records_committed(
        &self,
        embedding_ids: &[String],
    ) -> Result<()> {
        if embedding_ids.is_empty() {
            return Ok(());
        }
        let mut conn = self.lock_conn();
        let tx = conn.transaction().map_err(StoreError::from)?;
        {
            let placeholders = std::iter::repeat_n("?", embedding_ids.len())
                .collect::<Vec<_>>()
                .join(",");
            let sql = format!(
                "UPDATE embedding_records
                    SET status='committed', vector_committed=1
                  WHERE status='pending'
                    AND embedding_id IN ({placeholders})"
            );
            tx.execute(&sql, params_from_iter(embedding_ids.iter()))
                .map_err(StoreError::from)?;
        }
        tx.commit().map_err(StoreError::from)?;
        Ok(())
    }
}
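
The batched phase-3 statement above relies on generating one `?` placeholder per id. A minimal, std-only sketch of that string building follows; `committed_update_sql` is a hypothetical helper name used for illustration, not a function in this crate, and it returns `None` for an empty batch to mirror the early return above.

```rust
/// Build the text of a batched UPDATE with one `?` placeholder per id.
/// Returns None for an empty batch, since `IN ()` is not valid SQL.
/// Hypothetical helper, for illustration only.
fn committed_update_sql(id_count: usize) -> Option<String> {
    if id_count == 0 {
        return None;
    }
    // "?,?,?" for id_count == 3. The values themselves are bound
    // separately, so no id text is ever spliced into the SQL string.
    let placeholders = vec!["?"; id_count].join(",");
    Some(format!(
        "UPDATE embedding_records SET status='committed', vector_committed=1 \
         WHERE status='pending' AND embedding_id IN ({placeholders})"
    ))
}
```

SQLite caps the number of bound parameters per statement, so a caller with very large batches would presumably need to split the id list into chunks before using a statement shaped like this.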

#[cfg(test)]
mod tests {
    use super::*;
    use kb_config::Config;
    use tempfile::TempDir;
    use time::OffsetDateTime;

    /// Minimal config pointing at a tempdir for the SQLite file.
    fn config_for(tmp: &TempDir) -> Config {
        let mut c = Config::defaults();
        c.storage.data_dir = tmp.path().to_string_lossy().into_owned();
        c
    }

    /// Seed a chunks row + the doc / asset rows it FKs to. The minimum
    /// needed for embedding_records inserts not to fail the FK to
    /// chunks.
    fn seed_chunk(store: &SqliteStore, chunk_id: &str) {
        let conn = store.lock_conn();
        // Asset, document, chunk: all hand-rolled at the SQL layer to
        // keep the test self-contained (no kb-parse/kb-chunk dep).
        conn.execute(
            "INSERT INTO assets (
                asset_id, source_uri, workspace_path, media_type, byte_len,
                checksum, storage_kind, storage_path, discovered_at
             ) VALUES (?, ?, ?, ?, ?, ?, 'reference', '/tmp/x', ?)",
            params![
                "0123456789abcdef0123456789abcdef",
                "file:///tmp/x",
                "x.md",
                "{}",
                0_i64,
                "deadbeef",
                "1970-01-01T00:00:00Z",
            ],
        )
        .unwrap();
        conn.execute(
            "INSERT INTO documents (
                doc_id, asset_id, workspace_path, title, lang, source_type,
                trust_level, parser_version, doc_version, schema_version,
                metadata_json, provenance_json, created_at, updated_at
             ) VALUES (?, ?, ?, NULL, NULL, 'fs', 'unverified', 'v1', 1, 1, '{}', '{}', ?, ?)",
            params![
                "fedcba9876543210fedcba9876543210",
                "0123456789abcdef0123456789abcdef",
                "x.md",
                "1970-01-01T00:00:00Z",
                "1970-01-01T00:00:00Z",
            ],
        )
        .unwrap();
        conn.execute(
            "INSERT INTO chunks (
                chunk_id, doc_id, text, heading_path_json, section_label,
                source_spans_json, token_estimate, chunker_version,
                policy_hash, block_ids_json, created_at
             ) VALUES (?, ?, 'hi', '[]', NULL, '[]', 1, 'v1', 'hash', '[]', ?)",
            params![chunk_id, "fedcba9876543210fedcba9876543210", "1970-01-01T00:00:00Z"],
        )
        .unwrap();
    }

    fn open_store(tmp: &TempDir) -> SqliteStore {
        let cfg = config_for(tmp);
        let store = SqliteStore::open(&cfg).unwrap();
        store.run_migrations().unwrap();
        store
    }

    #[test]
    fn pending_then_committed_round_trip() {
        let tmp = TempDir::new().unwrap();
        let store = open_store(&tmp);
        let chunk = "11112222333344445555666677778888";
        seed_chunk(&store, chunk);

        let row = EmbeddingRecordRow {
            embedding_id: "aaaa1111bbbb2222cccc3333dddd4444".to_string(),
            chunk_id: chunk.to_string(),
            model_id: "test-model".to_string(),
            model_version: "v1".to_string(),
            dimensions: 4,
            lance_table: "chunk_embeddings_test_model_4".to_string(),
            created_at: OffsetDateTime::now_utc(),
        };
        store
            .put_embedding_records_pending(std::slice::from_ref(&row))
            .unwrap();

        // Inspect: the row exists at status='pending'.
        {
            let conn = store.read_conn();
            let (status, committed): (String, i64) = conn
                .query_row(
                    "SELECT status, vector_committed FROM embedding_records WHERE embedding_id = ?",
                    params![row.embedding_id],
                    |r| Ok((r.get(0)?, r.get(1)?)),
                )
                .unwrap();
            assert_eq!(status, "pending");
            assert_eq!(committed, 0);
        }

        store
            .mark_embedding_records_committed(std::slice::from_ref(&row.embedding_id))
            .unwrap();
        {
            let conn = store.read_conn();
            let (status, committed): (String, i64) = conn
                .query_row(
                    "SELECT status, vector_committed FROM embedding_records WHERE embedding_id = ?",
                    params![row.embedding_id],
                    |r| Ok((r.get(0)?, r.get(1)?)),
                )
                .unwrap();
            assert_eq!(status, "committed");
            assert_eq!(committed, 1);
        }
    }

    #[test]
    fn empty_batches_are_noops() {
        let tmp = TempDir::new().unwrap();
        let store = open_store(&tmp);
        store.put_embedding_records_pending(&[]).unwrap();
        store.mark_embedding_records_committed(&[]).unwrap();
    }

    #[test]
    fn replay_phase_one_resets_vector_committed() {
        // INSERT OR REPLACE: a phase-1 retry on a row that briefly
        // reached `committed` (in some adversarial out-of-order replay)
        // resets it to `pending`. Confirms the documented semantics.
        let tmp = TempDir::new().unwrap();
        let store = open_store(&tmp);
        let chunk = "11112222333344445555666677778888";
        seed_chunk(&store, chunk);

        let row = EmbeddingRecordRow {
            embedding_id: "aaaa1111bbbb2222cccc3333dddd4444".to_string(),
            chunk_id: chunk.to_string(),
            model_id: "test-model".to_string(),
            model_version: "v1".to_string(),
            dimensions: 4,
            lance_table: "chunk_embeddings_test_model_4".to_string(),
            created_at: OffsetDateTime::now_utc(),
        };
        store
            .put_embedding_records_pending(std::slice::from_ref(&row))
            .unwrap();
        store
            .mark_embedding_records_committed(std::slice::from_ref(&row.embedding_id))
            .unwrap();
        store
            .put_embedding_records_pending(std::slice::from_ref(&row))
            .unwrap();

        let conn = store.read_conn();
        let status: String = conn
            .query_row(
                "SELECT status FROM embedding_records WHERE embedding_id = ?",
                params![row.embedding_id],
                |r| r.get(0),
            )
            .unwrap();
        assert_eq!(status, "pending");
    }

    #[test]
    fn mark_committed_skips_non_pending() {
        // The phase-3 UPDATE explicitly filters `status='pending'`, so
        // calling it on an embedding_id that was never staged (or that
        // already became a tombstone) is a no-op rather than an error.
        let tmp = TempDir::new().unwrap();
        let store = open_store(&tmp);
        store
            .mark_embedding_records_committed(&["does-not-exist".to_string()])
            .unwrap();
    }
}
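
The status lifecycle these helpers and tests exercise can be modeled without a database. The sketch below is an in-memory stand-in, assuming only the three statuses named in the commit message; the `Records` type and its methods are hypothetical and exist purely to illustrate the guard, not to mirror the crate's API.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Status {
    Pending,
    Committed,
    Tombstone,
}

/// In-memory stand-in for the embedding_records status column.
/// Hypothetical type, for illustration only.
#[derive(Default)]
struct Records {
    rows: HashMap<String, Status>,
}

impl Records {
    /// Phase 1: like INSERT OR REPLACE, always (re)sets the row to Pending,
    /// whatever its previous status was.
    fn stage(&mut self, id: &str) {
        self.rows.insert(id.to_string(), Status::Pending);
    }

    /// Phase 3: promote only rows that are currently Pending, like the
    /// `WHERE status='pending'` guard. Returns how many rows changed.
    fn mark_committed(&mut self, ids: &[&str]) -> usize {
        let mut changed = 0;
        for id in ids {
            if let Some(status) = self.rows.get_mut(*id) {
                if *status == Status::Pending {
                    *status = Status::Committed;
                    changed += 1;
                }
            }
        }
        changed
    }

    /// Models the trigger that flips a row to Tombstone on chunk delete.
    fn tombstone(&mut self, id: &str) {
        if let Some(status) = self.rows.get_mut(id) {
            *status = Status::Tombstone;
        }
    }

    fn status(&self, id: &str) -> Option<Status> {
        self.rows.get(id).copied()
    }
}
```

In this model, staging an already committed row resets it to `Pending`, while marking an unknown or tombstoned id as committed changes nothing, which matches the behavior the unit tests above assert against SQLite.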
452
crates/kb-store-sqlite/src/filters.rs
Normal file
@@ -0,0 +1,452 @@
//! Chunk-level filter helpers shared between retrievers.
//!
//! `kb-store-vector::search` post-filters its Lance candidate set
//! against the SQLite-side metadata (committed-status / lang / tags /
//! trust / path_glob). Rather than open a private SQL surface in
//! `kb-store-vector`, the JOIN logic lives here so:
//!
//! - The schema (and CHECK / FK invariants) stays owned by the crate
//!   that ships the migrations.
//! - `kb-store-vector` doesn't need its own `rusqlite` / `globset`
//!   direct deps; both are forbidden by the P3-3 spec's allowed-dep
//!   list.
//! - Future retrievers (e.g. a hybrid blender) can reuse the same
//!   helper without re-deriving the SQL.
//!
//! `kb-search::lexical` already has a similar `tags / lang / trust /
//! path_glob` filter pass for FTS5 results; we deliberately do *not*
//! refactor that one in this PR because its SQL is interleaved with the
//! `bm25 + snippet()` SELECT, so sharing would force an awkward
//! trait split. P3-3 spec line 27 only mandates the move for
//! `kb-store-vector`'s usage.

use std::collections::{HashMap, HashSet};

use anyhow::{Context, Result};
use rusqlite::{params_from_iter, ToSql};

use crate::store::SqliteStore;

impl SqliteStore {
    /// Filter `chunk_ids` down to those whose owning document passes
    /// `filters` AND whose embedding row is at `status='committed'`.
    ///
    /// The result preserves the input order so the caller can feed it
    /// back to a Lance distance-asc result list and `take(k)` directly.
    ///
    /// `filters` semantics mirror `kb_core::SearchFilters`:
    ///
    /// - `tags_any`: doc must own at least one of the listed tags
    ///   (empty vec ⇒ no tag constraint).
    /// - `lang`: exact match against `documents.lang`.
    /// - `trust_min`: doc trust ≥ the supplied level (Generated <
    ///   Secondary < Primary, mirroring `list_documents` and
    ///   `kb-search::lexical`).
    /// - `path_glob`: shell-style glob (`*` does **not** cross `/`)
    ///   against `documents.workspace_path`. Compiled in Rust via
    ///   `globset` rather than translated to SQLite GLOB so the
    ///   semantics match `kb-search::lexical` exactly.
    ///
    /// The `embedding_records.status='committed'` predicate is always
    /// applied: tombstoned and pending rows must never surface to
    /// search callers (spec §5.6).
    pub fn filter_chunks(
        &self,
        chunk_ids: &[kb_core::ChunkId],
        filters: &kb_core::SearchFilters,
    ) -> Result<Vec<kb_core::ChunkId>> {
        if chunk_ids.is_empty() {
            return Ok(Vec::new());
        }

        // Deduplicate the IN-list so a pathological caller passing
        // `[c1, c1, c1]` doesn't blow the SQL placeholder count.
        let unique_ids: Vec<String> = {
            let mut seen = HashSet::new();
            chunk_ids
                .iter()
                .filter_map(|c| {
                    if seen.insert(c.0.as_str()) {
                        Some(c.0.clone())
                    } else {
                        None
                    }
                })
                .collect()
        };

        let placeholders = std::iter::repeat_n("?", unique_ids.len())
            .collect::<Vec<_>>()
            .join(",");
        let mut sql = format!(
            "SELECT er.chunk_id, d.workspace_path
               FROM embedding_records er
               JOIN chunks c ON c.chunk_id = er.chunk_id
               JOIN documents d ON d.doc_id = c.doc_id
              WHERE er.status = 'committed'
                AND er.chunk_id IN ({placeholders})"
        );

        let mut bind: Vec<Box<dyn ToSql>> = unique_ids
            .iter()
            .map(|s| {
                let b: Box<dyn ToSql> = Box::new(s.clone());
                b
            })
            .collect();

        if let Some(lang) = &filters.lang {
            sql.push_str(" AND d.lang = ?");
            bind.push(Box::new(lang.0.clone()));
        }
        if let Some(min) = &filters.trust_min {
            // Mirror `list_documents` / `kb-search::lexical`: rank
            // Generated=1 < Secondary=2 < Primary=3.
            sql.push_str(
                " AND CASE d.trust_level
                    WHEN 'primary' THEN 3
                    WHEN 'secondary' THEN 2
                    WHEN 'generated' THEN 1
                    ELSE 0 END >= ?",
            );
            let rank: i64 = match min {
                kb_core::TrustLevel::Primary => 3,
                kb_core::TrustLevel::Secondary => 2,
                kb_core::TrustLevel::Generated => 1,
            };
            bind.push(Box::new(rank));
        }
        if !filters.tags_any.is_empty() {
            let tag_ph = std::iter::repeat_n("?", filters.tags_any.len())
                .collect::<Vec<_>>()
                .join(",");
            sql.push_str(&format!(
                " AND EXISTS (SELECT 1 FROM document_tags t \
                 WHERE t.doc_id = d.doc_id AND t.tag IN ({tag_ph}))"
            ));
            for tag in &filters.tags_any {
                bind.push(Box::new(tag.clone()));
            }
        }

        // Optional path_glob: applied in Rust on the rows we get back,
        // not in SQL, matching `kb-search::lexical`'s post-filter so
        // the glob semantics are byte-identical between retrievers.
        let path_matcher = match filters.path_glob.as_deref() {
            Some(pat) => Some(
                globset::GlobBuilder::new(pat)
                    .literal_separator(true)
                    .build()
                    .with_context(|| {
                        format!("kb-store-sqlite::filter_chunks: invalid path_glob {pat:?}")
                    })?
                    .compile_matcher(),
            ),
            None => None,
        };

        let conn = self.read_conn();
        let mut stmt = conn
            .prepare(&sql)
            .context("kb-store-sqlite::filter_chunks: prepare SQL")?;
        let rows = stmt
            .query_map(
                params_from_iter(bind.iter().map(|b| b.as_ref())),
                |row| {
                    let chunk_id: String = row.get(0)?;
                    let workspace_path: String = row.get(1)?;
                    Ok((chunk_id, workspace_path))
                },
            )
            .context("kb-store-sqlite::filter_chunks: execute SQL")?;

        let mut allowed: HashMap<String, String> = HashMap::new();
        for r in rows {
            let (chunk_id, workspace_path) =
                r.context("kb-store-sqlite::filter_chunks: read row")?;
            allowed.insert(chunk_id, workspace_path);
        }

        let mut out = Vec::with_capacity(chunk_ids.len());
        for cand in chunk_ids {
            let workspace_path = match allowed.get(&cand.0) {
                Some(p) => p,
                None => continue,
            };
            if let Some(m) = &path_matcher {
                if !m.is_match(workspace_path) {
                    continue;
                }
            }
            out.push(cand.clone());
        }
        Ok(out)
    }
}
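
Two small ideas carry `filter_chunks`: deduplicating ids before building the IN-list, and walking the original candidate list so the output keeps the caller's ranking order. A minimal std-only sketch of both follows; the function names and the use of plain `&str` ids are assumptions made for illustration and do not appear in the crate.

```rust
use std::collections::HashSet;

/// Drop repeated ids while keeping the first occurrence of each,
/// in input order. Hypothetical helper, for illustration only.
fn unique_in_order<'a>(ids: &[&'a str]) -> Vec<&'a str> {
    let mut seen = HashSet::new();
    // `insert` returns false for an id that was already seen.
    ids.iter().copied().filter(|id| seen.insert(*id)).collect()
}

/// Keep only candidates present in `allowed`, preserving the order of
/// `candidates` (for example a distance-ascending result list).
/// Hypothetical helper, for illustration only.
fn filter_preserving_order(candidates: &[&str], allowed: &HashSet<&str>) -> Vec<String> {
    candidates
        .iter()
        .filter(|c| allowed.contains(*c))
        .map(|c| c.to_string())
        .collect()
}
```

Because the second function iterates the candidates rather than the lookup set, the result order never depends on hash iteration order, which is what lets a caller apply `take(k)` to the filtered list directly.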

#[cfg(test)]
mod tests {
    use super::*;
    use kb_config::Config;
    use kb_core::{ChunkId, Lang, SearchFilters, TrustLevel};
    use rusqlite::params;
    use tempfile::TempDir;
    use time::OffsetDateTime;

    use crate::EmbeddingRecordRow;

    fn open_store(tmp: &TempDir) -> SqliteStore {
        let mut c = Config::defaults();
        c.storage.data_dir = tmp.path().to_string_lossy().into_owned();
        let store = SqliteStore::open(&c).unwrap();
        store.run_migrations().unwrap();
        store
    }

    /// Seed (asset, document, document_tags, chunk) rows + a
    /// committed embedding_records row for a single chunk_id. Mirrors
    /// the shape `kb-store-vector` builds in production.
    fn seed_committed(
        store: &SqliteStore,
        chunk_id: &str,
        doc_id: &str,
        workspace_path: &str,
        lang: &str,
        tags: &[&str],
        trust: &str,
    ) {
        let asset_id = format!("a{}", &doc_id[..31]);
        {
            let conn = store.lock_conn();
            conn.execute(
                "INSERT INTO assets (
                    asset_id, source_uri, workspace_path, media_type, byte_len,
                    checksum, storage_kind, storage_path, discovered_at
                 ) VALUES (?, ?, ?, '{}', 0, 'deadbeefdeadbeefdeadbeefdeadbeef',
                           'reference', ?, '1970-01-01T00:00:00Z')",
                params![
                    asset_id,
                    format!("file://{workspace_path}"),
                    workspace_path,
                    workspace_path,
                ],
            )
            .unwrap();
            conn.execute(
                "INSERT INTO documents (
                    doc_id, asset_id, workspace_path, title, lang, source_type,
                    trust_level, parser_version, doc_version, schema_version,
                    metadata_json, provenance_json, created_at, updated_at
                 ) VALUES (?, ?, ?, NULL, ?, 'markdown', ?, 'v1', 1, 1,
                           '{}', '{}', '1970-01-01T00:00:00Z', '1970-01-01T00:00:00Z')",
                params![doc_id, asset_id, workspace_path, lang, trust],
            )
            .unwrap();
            for t in tags {
                conn.execute(
                    "INSERT INTO document_tags (doc_id, tag) VALUES (?, ?)",
                    params![doc_id, t],
                )
                .unwrap();
            }
            conn.execute(
                "INSERT INTO chunks (
                    chunk_id, doc_id, text, heading_path_json, section_label,
                    source_spans_json, token_estimate, chunker_version,
                    policy_hash, block_ids_json, created_at
                 ) VALUES (?, ?, 'hi', '[]', NULL, '[]', 1, 'v1', 'h', '[]',
                           '1970-01-01T00:00:00Z')",
                params![chunk_id, doc_id],
            )
            .unwrap();
        }

        let embed_row = EmbeddingRecordRow {
            embedding_id: format!("e{}", &chunk_id[..31]),
            chunk_id: chunk_id.to_string(),
            model_id: "m".to_string(),
            model_version: "v1".to_string(),
            dimensions: 4,
            lance_table: "t".to_string(),
            created_at: OffsetDateTime::UNIX_EPOCH,
        };
        store
            .put_embedding_records_pending(std::slice::from_ref(&embed_row))
            .unwrap();
        store
            .mark_embedding_records_committed(std::slice::from_ref(
                &embed_row.embedding_id,
            ))
            .unwrap();
    }

    fn cid(s: &str) -> ChunkId {
        ChunkId(s.to_string())
    }

    #[test]
    fn filter_chunks_drops_uncommitted_rows() {
        let tmp = TempDir::new().unwrap();
        let store = open_store(&tmp);
        let c1 = "11111111111111111111111111111111";
        let c2 = "22222222222222222222222222222222";
        let d1 = "d1d1d1d1d1d1d1d1d1d1d1d1d1d1d1d1";
        let d2 = "d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2";
        seed_committed(&store, c1, d1, "a.md", "en", &[], "primary");

        // c2: chunk + doc but no committed embedding row.
        let asset_id = format!("a{}", &d2[..31]);
        let conn = store.lock_conn();
        conn.execute(
            "INSERT INTO assets (
                asset_id, source_uri, workspace_path, media_type, byte_len,
                checksum, storage_kind, storage_path, discovered_at
             ) VALUES (?, 'file://b.md', 'b.md', '{}', 0,
                       'deadbeefdeadbeefdeadbeefdeadbeef',
                       'reference', 'b.md', '1970-01-01T00:00:00Z')",
            params![asset_id],
        )
        .unwrap();
        conn.execute(
            "INSERT INTO documents (
                doc_id, asset_id, workspace_path, title, lang, source_type,
                trust_level, parser_version, doc_version, schema_version,
                metadata_json, provenance_json, created_at, updated_at
             ) VALUES (?, ?, 'b.md', NULL, 'en', 'markdown', 'primary', 'v1',
                       1, 1, '{}', '{}',
                       '1970-01-01T00:00:00Z', '1970-01-01T00:00:00Z')",
            params![d2, asset_id],
        )
        .unwrap();
        conn.execute(
            "INSERT INTO chunks (
                chunk_id, doc_id, text, heading_path_json, section_label,
                source_spans_json, token_estimate, chunker_version,
                policy_hash, block_ids_json, created_at
             ) VALUES (?, ?, 'hi', '[]', NULL, '[]', 1, 'v1', 'h', '[]',
                       '1970-01-01T00:00:00Z')",
            params![c2, d2],
        )
        .unwrap();
        drop(conn);

        let out = store
            .filter_chunks(&[cid(c1), cid(c2)], &SearchFilters::default())
            .unwrap();
        assert_eq!(out, vec![cid(c1)]);
    }

    #[test]
    fn filter_chunks_tags_any_lang_trust_path_glob() {
        let tmp = TempDir::new().unwrap();
        let store = open_store(&tmp);
        // c1: tags=[ko-style], lang=en, primary, notes/a.md
        // c2: tags=[other], lang=en, primary, notes/b.md
        // c3: tags=[ko-style], lang=ko, secondary, notes/c.md
        // c4: tags=[ko-style], lang=en, generated, src/d.md
        let chunks = [
            ("11111111111111111111111111111111", "d1d1d1d1d1d1d1d1d1d1d1d1d1d1d1d1", "notes/a.md", "en", "primary", &["ko-style"][..]),
            ("22222222222222222222222222222222", "d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2", "notes/b.md", "en", "primary", &["other"][..]),
            ("33333333333333333333333333333333", "d3d3d3d3d3d3d3d3d3d3d3d3d3d3d3d3", "notes/c.md", "ko", "secondary", &["ko-style"][..]),
            ("44444444444444444444444444444444", "d4d4d4d4d4d4d4d4d4d4d4d4d4d4d4d4", "src/d.md", "en", "generated", &["ko-style"][..]),
        ];
        for (c, d, p, l, t, tags) in &chunks {
            seed_committed(&store, c, d, p, l, tags, t);
        }

        // tags_any=[ko-style] → c1, c3, c4 (drop c2).
        let f = SearchFilters {
            tags_any: vec!["ko-style".to_string()],
            ..Default::default()
        };
        let out = store
            .filter_chunks(
                &chunks.iter().map(|c| cid(c.0)).collect::<Vec<_>>(),
                &f,
            )
            .unwrap();
        let mut got: Vec<&str> = out.iter().map(|c| c.0.as_str()).collect();
        got.sort();
        assert_eq!(got, vec![chunks[0].0, chunks[2].0, chunks[3].0]);

        // + lang=en → drops c3.
        let f = SearchFilters {
            tags_any: vec!["ko-style".to_string()],
            lang: Some(Lang("en".to_string())),
            ..Default::default()
        };
        let out = store
            .filter_chunks(
                &chunks.iter().map(|c| cid(c.0)).collect::<Vec<_>>(),
                &f,
            )
            .unwrap();
        let mut got: Vec<&str> = out.iter().map(|c| c.0.as_str()).collect();
        got.sort();
        assert_eq!(got, vec![chunks[0].0, chunks[3].0]);

        // + trust_min=Secondary → drops c4 (generated < secondary).
        let f = SearchFilters {
            tags_any: vec!["ko-style".to_string()],
            lang: Some(Lang("en".to_string())),
            trust_min: Some(TrustLevel::Secondary),
            ..Default::default()
        };
        let out = store
            .filter_chunks(
                &chunks.iter().map(|c| cid(c.0)).collect::<Vec<_>>(),
                &f,
            )
            .unwrap();
        let got: Vec<&str> = out.iter().map(|c| c.0.as_str()).collect();
        assert_eq!(got, vec![chunks[0].0]);

        // path_glob = "notes/*.md" with no other constraint → c1, c2, c3.
        let f = SearchFilters {
            path_glob: Some("notes/*.md".to_string()),
            ..Default::default()
        };
        let out = store
            .filter_chunks(
                &chunks.iter().map(|c| cid(c.0)).collect::<Vec<_>>(),
                &f,
            )
            .unwrap();
        let mut got: Vec<&str> = out.iter().map(|c| c.0.as_str()).collect();
        got.sort();
        assert_eq!(got, vec![chunks[0].0, chunks[1].0, chunks[2].0]);
    }

#[test]
|
||||
fn filter_chunks_preserves_input_order_and_dedupes() {
|
||||
let tmp = TempDir::new().unwrap();
|
||||
let store = open_store(&tmp);
|
||||
let c1 = "11111111111111111111111111111111";
|
||||
let c2 = "22222222222222222222222222222222";
|
||||
let c3 = "33333333333333333333333333333333";
|
||||
seed_committed(&store, c1, "d1d1d1d1d1d1d1d1d1d1d1d1d1d1d1d1", "a.md", "en", &[], "primary");
|
||||
seed_committed(&store, c2, "d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2", "b.md", "en", &[], "primary");
|
||||
seed_committed(&store, c3, "d3d3d3d3d3d3d3d3d3d3d3d3d3d3d3d3", "c.md", "en", &[], "primary");
|
||||
|
||||
// Ask in the order c3, c1, c2; result must preserve that order.
|
||||
let out = store
|
||||
.filter_chunks(&[cid(c3), cid(c1), cid(c2)], &SearchFilters::default())
|
||||
.unwrap();
|
||||
assert_eq!(out, vec![cid(c3), cid(c1), cid(c2)]);
|
||||
|
||||
// Duplicates in the input survive in the output (dedup is for
|
||||
// the SQL IN-list only — caller may want repeats for ranking).
|
||||
let out = store
|
||||
.filter_chunks(&[cid(c1), cid(c1), cid(c2)], &SearchFilters::default())
|
||||
.unwrap();
|
||||
assert_eq!(out, vec![cid(c1), cid(c1), cid(c2)]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn filter_chunks_empty_input_short_circuits() {
|
||||
let tmp = TempDir::new().unwrap();
|
||||
let store = open_store(&tmp);
|
||||
let out = store.filter_chunks(&[], &SearchFilters::default()).unwrap();
|
||||
assert!(out.is_empty());
|
||||
}
|
||||
}
|
||||
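The order and duplicate semantics these tests pin down can be illustrated with a minimal std-only sketch. It assumes the store first computes the set of surviving ids (here a plain `HashSet`) and then replays the caller's input against it; `keep_in_input_order` and `unique_for_in_list` are hypothetical names for illustration, not the crate's `filter_chunks` implementation.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not the crate's `filter_chunks`): keep every input
/// id that is present in `survivors`, in input order, repeats included.
fn keep_in_input_order(input: &[String], survivors: &HashSet<String>) -> Vec<String> {
    input
        .iter()
        .filter(|id| survivors.contains(*id))
        .cloned()
        .collect()
}

/// Hypothetical helper: the deduplicated id list a store could bind into a
/// SQL `IN (...)` clause. First occurrence wins, so order stays stable.
fn unique_for_in_list(input: &[String]) -> Vec<String> {
    let mut seen = HashSet::new();
    input
        .iter()
        .filter(|id| seen.insert((*id).clone()))
        .cloned()
        .collect()
}
```

Deduplicating only the query side keeps the `IN` list small, while replaying the original input keeps any repeats the caller passed in for ranking.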
@@ -8,18 +8,25 @@
//!
//! Allowed deps per task spec: `kb-core`, `kb-config`, `rusqlite`,
//! `refinery`, `serde_json`, `time`, `blake3`, `tracing`, `anyhow`,
//! `thiserror`. NOT allowed: `kb-parse-*`, `kb-normalize`, `kb-chunk`,
//! `kb-store-vector`, `kb-source-fs`, etc. (`kb-parse-md`, `kb-normalize`,
//! `kb-chunk` may appear as **dev-deps** — see `Cargo.toml` — to drive
//! the contract round-trip test off a real Markdown fixture.)
//! `thiserror`. `globset` was added in P3-3 to back the
//! `filter_chunks` helper (used by `kb-store-vector`'s post-filter
//! pass — moving the SQL JOIN into this crate kept `kb-store-vector`
//! from needing its own `rusqlite` / `globset` direct deps). NOT
//! allowed: `kb-parse-*`, `kb-normalize`, `kb-chunk`, `kb-store-vector`,
//! `kb-source-fs`, etc. (`kb-parse-md`, `kb-normalize`, `kb-chunk` may
//! appear as **dev-deps** — see `Cargo.toml` — to drive the contract
//! round-trip test off a real Markdown fixture.)

mod documents;
mod embeddings;
mod error;
mod filters;
mod fts;
mod jobs;
mod schema;
mod store;

pub use embeddings::EmbeddingRecordRow;
pub use error::StoreError;
pub use fts::rebuild_chunks_fts;
pub use store::SqliteStore;
55
crates/kb-store-vector/Cargo.toml
Normal file
@@ -0,0 +1,55 @@
[package]
name = "kb-store-vector"
version = { workspace = true }
edition = { workspace = true }
rust-version = { workspace = true }
license = { workspace = true }
repository = { workspace = true }
description = "LanceDB-backed VectorStore for kb (§5.6 embedding_records, §6.3 lancedb tables, §7.2 VectorStore)"

[dependencies]
kb-core = { path = "../kb-core" }
kb-config = { path = "../kb-config" }
# kb-store-sqlite is allowed for the embedding_records writers only
# (P3-3 spec: "Allowed dep `kb-store-sqlite` for writing/reading rows in
# embedding_records"). The two-phase upsert flow uses
# `put_embedding_records_pending` + `mark_embedding_records_committed`.
kb-store-sqlite = { path = "../kb-store-sqlite" }

# LanceDB embedded vector store. `default-features=false` opts out of
# the cloud object-store integrations (aws / gcs / azure / dynamodb /
# oss); kb is always-local for v1, so dragging in those SDKs would just
# inflate the build.
lancedb = { workspace = true }
arrow = { workspace = true }
arrow-array = { workspace = true }
arrow-schema = { workspace = true }
# Embedded async runtime. The VectorStore trait is sync (§7.2) but
# LanceDB's Rust API is async-only; we own a current-thread
# tokio::Runtime and `block_on` per trait method. current-thread avoids
# the worker threads a multi-thread runtime would spawn (one per CPU
# core by default) — kb-app already serializes vector ops behind its
# own job scheduler so the extra parallelism wouldn't be exploited.
tokio = { workspace = true }
# `try_collect` for streaming Lance query results into a Vec<RecordBatch>.
futures = { workspace = true }

serde = { workspace = true }
serde_json = { workspace = true }
tracing = { workspace = true }
thiserror = { workspace = true }
anyhow = { workspace = true }
blake3 = { workspace = true }
time = { workspace = true }

[dev-dependencies]
tempfile = { workspace = true }
serde_json = { workspace = true }
# Integration tests seed `documents` / `chunks` fixtures by raw SQL
# (no kb-parse-md / kb-normalize / kb-chunk dep) so they can construct
# adversarial filter / dim-mismatch states. rusqlite is a
# `[dev-dependencies]` entry only — the runtime crate uses
# kb-store-sqlite's typed surface (`filter_chunks`,
# `put_embedding_records_pending`, …) and does not touch rusqlite
# directly (P3-3 spec: kb-store-vector must not list rusqlite/globset
# as direct deps).
rusqlite = { workspace = true }
232
crates/kb-store-vector/src/arrow_batch.rs
Normal file
@@ -0,0 +1,232 @@
//! Arrow schema + RecordBatch builder for the per-model Lance table.
//!
//! Per design §6.3 the per-row layout is:
//!
//! ```text
//! chunk_id          : Utf8 (primary)
//! doc_id            : Utf8
//! embedding         : FixedSizeList<Float32, dim>
//! model_id          : Utf8
//! embedding_version : Utf8
//! text              : Utf8
//! heading_path      : Utf8 (JSON-encoded Vec<String>)
//! created_at        : Timestamp(Microsecond, UTC)
//! ```
//!
//! `heading_path` is encoded as a JSON string rather than a Lance
//! `List<Utf8>` to keep the `only_if` SQL filter surface clean — Lance
//! exposes scalar columns to its query DSL trivially, but list columns
//! need `array_contains`-style helpers that aren't required by the
//! current `SearchFilters` shape.

use std::sync::Arc;

use anyhow::{Context, Result};
use arrow_array::{
    ArrayRef, FixedSizeListArray, Float32Array, RecordBatch, StringArray,
    TimestampMicrosecondArray,
};
use arrow_schema::{DataType, Field, Schema, SchemaRef, TimeUnit};
use kb_core::VectorRecord;
use time::OffsetDateTime;

/// Arrow schema for a Lance table whose vector column is FixedSizeList
/// of `dim` Float32. All non-vector columns are non-nullable; the
/// vector column itself is non-nullable but the inner Float32 slot is
/// nullable per Arrow convention (Lance ignores the inner-nullable
/// flag when the outer field is non-null).
pub(crate) fn schema_for(dim: usize) -> SchemaRef {
    Arc::new(Schema::new(vec![
        Field::new("chunk_id", DataType::Utf8, false),
        Field::new("doc_id", DataType::Utf8, false),
        Field::new(
            "embedding",
            DataType::FixedSizeList(
                Arc::new(Field::new("item", DataType::Float32, true)),
                dim as i32,
            ),
            false,
        ),
        Field::new("model_id", DataType::Utf8, false),
        Field::new("embedding_version", DataType::Utf8, false),
        Field::new("text", DataType::Utf8, false),
        Field::new("heading_path", DataType::Utf8, false),
        Field::new(
            "created_at",
            DataType::Timestamp(TimeUnit::Microsecond, Some("UTC".into())),
            false,
        ),
    ]))
}

/// Build a `RecordBatch` from `recs`. All records must share `dim`;
/// callers are expected to pre-bucket per-table batches before reaching
/// here. The batch carries `recs.len()` rows; `now` is folded into
/// `created_at` for every row to match design §6.3.
pub(crate) fn build_batch(
    recs: &[VectorRecord],
    dim: usize,
    now: OffsetDateTime,
) -> Result<RecordBatch> {
    let schema = schema_for(dim);

    let chunk_ids = StringArray::from(
        recs.iter().map(|r| r.chunk_id.0.as_str()).collect::<Vec<_>>(),
    );
    let doc_ids = StringArray::from(
        recs.iter().map(|r| r.doc_id.0.as_str()).collect::<Vec<_>>(),
    );
    let model_ids = StringArray::from(
        recs.iter().map(|r| r.model_id.0.as_str()).collect::<Vec<_>>(),
    );
    let model_versions = StringArray::from(
        recs.iter()
            .map(|r| r.model_version.0.as_str())
            .collect::<Vec<_>>(),
    );
    let texts =
        StringArray::from(recs.iter().map(|r| r.text.as_str()).collect::<Vec<_>>());

    // heading_path: serde_json::Value::Array of strings, then to_string.
    let heading_paths: Vec<String> = recs
        .iter()
        .map(|r| serde_json::to_string(&r.heading_path))
        .collect::<std::result::Result<_, _>>()
        .context("serialize heading_path JSON")?;
    let heading_path_arr = StringArray::from(
        heading_paths.iter().map(String::as_str).collect::<Vec<_>>(),
    );

    // Embedding: FixedSizeList<Float32, dim>. Build from the flat
    // contiguous f32 buffer.
    let mut flat: Vec<f32> = Vec::with_capacity(recs.len() * dim);
    for r in recs {
        if r.vector.len() != dim {
            anyhow::bail!(
                "vector length {} does not match table dim {} for chunk {}",
                r.vector.len(),
                dim,
                r.chunk_id.0
            );
        }
        flat.extend_from_slice(&r.vector);
    }
    let values = Float32Array::from(flat);
    let embedding_field =
        Arc::new(Field::new("item", DataType::Float32, true));
    let embedding = FixedSizeListArray::try_new(
        embedding_field,
        dim as i32,
        Arc::new(values),
        None,
    )
    .context("build FixedSizeList embedding column")?;

    // created_at: microseconds since Unix epoch, UTC.
    let micros: Vec<i64> = std::iter::repeat_n(
        (now.unix_timestamp_nanos() / 1_000) as i64,
        recs.len(),
    )
    .collect();
    let created_at = TimestampMicrosecondArray::from(micros).with_timezone("UTC");

    let arrays: Vec<ArrayRef> = vec![
        Arc::new(chunk_ids) as ArrayRef,
        Arc::new(doc_ids),
        Arc::new(embedding),
        Arc::new(model_ids),
        Arc::new(model_versions),
        Arc::new(texts),
        Arc::new(heading_path_arr),
        Arc::new(created_at),
    ];

    RecordBatch::try_new(schema, arrays).context("assemble RecordBatch")
}

/// blake3-hex of the canonical JSON of the schema. Used as
/// `params_hash` for `id_for_index` so the `IndexId` stays stable
/// across invocations with the same `dim`.
pub(crate) fn schema_params_hash(dim: usize) -> String {
    // Keep the hash input shape self-describing so a future schema
    // tweak (extra column, type change, …) bumps the hash and produces
    // a different `IndexId` automatically.
    let descriptor = serde_json::json!({
        "version": 1,
        "dim": dim,
        "columns": [
            {"name": "chunk_id", "type": "Utf8"},
            {"name": "doc_id", "type": "Utf8"},
            {"name": "embedding", "type": "FixedSizeList<Float32>", "size": dim},
            {"name": "model_id", "type": "Utf8"},
            {"name": "embedding_version", "type": "Utf8"},
            {"name": "text", "type": "Utf8"},
            {"name": "heading_path", "type": "Utf8"},
            {"name": "created_at", "type": "Timestamp<us, UTC>"},
        ],
    });
    let bytes = descriptor_bytes(&descriptor);
    blake3::hash(&bytes).to_hex().to_string()
}

/// Serialize the schema descriptor to bytes for hashing. Plain
/// `serde_json::to_vec` rather than a canonical-JSON crate is fine
/// here because the descriptor is built from a fixed `serde_json::json!`
/// literal in `schema_params_hash` — `serde_json` walks the object's
/// keys in a deterministic order (sorted by key with the default
/// `BTreeMap`-backed `Map`, insertion order when the `preserve_order`
/// feature is enabled), so the byte output is stable across runs
/// without a canonicalizer as long as that feature setting does not
/// change. The empty-vec fallback on the (unreachable, given our
/// literal input) error path keeps the function infallible.
fn descriptor_bytes(v: &serde_json::Value) -> Vec<u8> {
    serde_json::to_vec(v).unwrap_or_default()
}

#[cfg(test)]
mod tests {
    use super::*;
    use kb_core::{ChunkId, DocumentId, EmbeddingId, EmbeddingModelId, EmbeddingVersion};
    use time::OffsetDateTime;

    fn make_rec(chunk_idx: u8, dim: usize) -> VectorRecord {
        VectorRecord {
            chunk_id: ChunkId(format!("{:032x}", chunk_idx)),
            embedding_id: EmbeddingId(format!("{:032x}", 0xeeeeu16 + chunk_idx as u16)),
            vector: vec![0.1_f32; dim],
            doc_id: DocumentId("aaaa".repeat(8)),
            text: format!("text-{chunk_idx}"),
            heading_path: vec!["A".to_string(), "B".to_string()],
            model_id: EmbeddingModelId("test".to_string()),
            model_version: EmbeddingVersion("v1".to_string()),
            dimensions: dim,
        }
    }

    #[test]
    fn build_batch_round_trip_basic() {
        let recs = vec![make_rec(1, 4), make_rec(2, 4)];
        let batch = build_batch(&recs, 4, OffsetDateTime::UNIX_EPOCH).unwrap();
        assert_eq!(batch.num_rows(), 2);
        assert_eq!(batch.num_columns(), 8);
        let schema = batch.schema();
        assert_eq!(schema.field(0).name(), "chunk_id");
        assert_eq!(schema.field(2).name(), "embedding");
    }

    #[test]
    fn build_batch_dim_mismatch_errors() {
        let mut rec = make_rec(1, 4);
        rec.vector = vec![0.0_f32; 3];
        let err = build_batch(&[rec], 4, OffsetDateTime::UNIX_EPOCH).unwrap_err();
        let msg = format!("{err}");
        assert!(msg.contains("does not match table dim"), "msg={msg}");
    }

    #[test]
    fn schema_params_hash_is_stable_for_dim() {
        let h1 = schema_params_hash(384);
        let h2 = schema_params_hash(384);
        assert_eq!(h1, h2);
        let h3 = schema_params_hash(512);
        assert_ne!(h1, h3);
    }
}
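The flat-buffer layout that `build_batch` feeds into the `FixedSizeList` column can be pictured with a small std-only sketch. It is not Arrow code: `flatten_rows` and `row_at` are hypothetical helpers that only mirror the length check and the row-to-slice mapping, under the assumption that rows are stored back to back with no padding.

```rust
/// Hypothetical helper: concatenate equal-length rows into one flat buffer,
/// rejecting any row whose length differs from `dim`.
fn flatten_rows(rows: &[Vec<f32>], dim: usize) -> Result<Vec<f32>, String> {
    let mut flat = Vec::with_capacity(rows.len() * dim);
    for (i, row) in rows.iter().enumerate() {
        if row.len() != dim {
            return Err(format!("row {i} has length {}, expected {dim}", row.len()));
        }
        flat.extend_from_slice(row);
    }
    Ok(flat)
}

/// Hypothetical helper: row `i` of the fixed-size list is the slice
/// `flat[i * dim..(i + 1) * dim]`. Panics if `i` is out of range.
fn row_at(flat: &[f32], dim: usize, i: usize) -> &[f32] {
    &flat[i * dim..(i + 1) * dim]
}
```

Checking every row before extending the buffer is what lets a dimension mismatch surface as an error instead of silently shifting later rows.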
31
crates/kb-store-vector/src/lib.rs
Normal file
@@ -0,0 +1,31 @@
//! `kb-store-vector` — LanceDB-backed [`kb_core::VectorStore`] for kb.
//!
//! Stores per-model Lance tables under `config.storage.vector_dir/`
//! (`chunk_embeddings_<model>_<dim>.lance/`). `upsert` runs the
//! SQLite-first / Lance-second two-phase write described in design
//! §5.6: phase 1 stages `embedding_records` rows at `status='pending'`,
//! phase 2 issues a Lance `MergeInsert` keyed on `chunk_id`, phase 3
//! flips the rows to `status='committed'`. `search` joins against
//! `embedding_records WHERE status='committed'` so partial-write Lance
//! rows never surface to callers; if the process crashes between phase
//! 2 and phase 3 (or phase 2 itself fails), the next `upsert` call
//! retries the still-pending rows idempotently because Lance MergeInsert
//! dedupes on `chunk_id`.
//!
//! Sync / async bridge: `VectorStore` is a sync trait (§7.2) and
//! LanceDB's Rust API is async-only. We own a private current-thread
//! `tokio::runtime::Runtime` and `block_on` per trait method. The
//! tradeoff is documented inline; a multi-thread runtime would let two
//! upserts run concurrently but kb-app's job scheduler already
//! serializes vector ops, and current-thread avoids the worker threads
//! a multi-thread runtime spawns by default (one per CPU core).
//!
//! See `docs/superpowers/specs/2026-04-27-kb-final-form-design.md`
//! §5.6 (embedding_records DDL), §6.3 (lancedb table naming),
//! §7.2 (VectorStore), §9 (versioning).

mod arrow_batch;
mod paths;
mod store;

pub use store::LanceVectorStore;
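The three-phase write described in the module docs can be modeled with an in-memory toy. This is a sketch under loud assumptions: `ToyStore`, its `ledger` map (standing in for `embedding_records`) and its `vectors` map (standing in for the Lance table keyed on `chunk_id`) are hypothetical and involve neither SQLite nor LanceDB; the `crash_before_commit` flag simulates a crash between phase 2 and phase 3.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Pending,
    Committed,
}

/// Hypothetical in-memory stand-ins for the two stores.
#[derive(Default)]
struct ToyStore {
    ledger: HashMap<String, Status>,    // plays `embedding_records`
    vectors: HashMap<String, Vec<f32>>, // plays the Lance table
}

impl ToyStore {
    fn upsert(
        &mut self,
        recs: &[(String, Vec<f32>)],
        crash_before_commit: bool,
    ) -> Result<(), String> {
        // Phase 1: stage every row as pending.
        for (id, _) in recs {
            self.ledger.insert(id.clone(), Status::Pending);
        }
        // Phase 2: keyed merge. Re-running overwrites, never duplicates.
        for (id, v) in recs {
            self.vectors.insert(id.clone(), v.clone());
        }
        if crash_before_commit {
            return Err("simulated crash before phase 3".to_string());
        }
        // Phase 3: flip the rows to committed.
        for (id, _) in recs {
            self.ledger.insert(id.clone(), Status::Committed);
        }
        Ok(())
    }

    /// Search-side visibility: only ids whose ledger row is committed.
    fn visible(&self) -> Vec<String> {
        let mut ids: Vec<String> = self
            .vectors
            .keys()
            .filter(|id| self.ledger.get(*id) == Some(&Status::Committed))
            .cloned()
            .collect();
        ids.sort();
        ids
    }
}
```

In this model a failed call leaves a vector behind but hides it from `visible`, and calling `upsert` again with the same records converges to one committed row per key.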
119
crates/kb-store-vector/src/paths.rs
Normal file
@@ -0,0 +1,119 @@
//! Path expansion + table-name sanitization.
//!
//! Mirrors `kb-store-sqlite::store::expand_data_dir` and
//! `kb-embed-local::expand_path` so the three crates resolve
//! `${XDG_DATA_HOME:-…}` / leading `~` / `{data_dir}` identically. A
//! shared helper would live in `kb-config`, but the task spec forbids
//! adding new types to `kb-config`, so we keep a private clone.

use std::path::PathBuf;

/// Expand `{data_dir}` → `data_dir`, `${XDG_DATA_HOME:-…}` → env or
/// default, leading `~` → `$HOME`. Pass an empty `data_dir` when
/// resolving `data_dir` itself (the `{data_dir}` substitution is a
/// no-op in that case).
pub(crate) fn expand_path(raw: &str, data_dir: &str) -> PathBuf {
    let mut s = raw.to_string();

    if !data_dir.is_empty() {
        s = s.replace("{data_dir}", data_dir);
    }

    // ${XDG_DATA_HOME:-~/.local/share}: env override, else default after `:-`.
    if let Some(start) = s.find("${XDG_DATA_HOME") {
        if let Some(rel_end) = s[start..].find('}') {
            let end = start + rel_end + 1;
            let inner = &s[start + 2..end - 1];
            let replacement = match std::env::var("XDG_DATA_HOME") {
                Ok(v) if !v.is_empty() => v,
                _ => {
                    if let Some((_, default)) = inner.split_once(":-") {
                        default.to_string()
                    } else {
                        String::new()
                    }
                }
            };
            s.replace_range(start..end, &replacement);
        }
    }

    if let Some(rest) = s.strip_prefix('~') {
        if let Some(home) = std::env::var_os("HOME").map(PathBuf::from) {
            return home.join(rest.trim_start_matches('/'));
        }
    }

    PathBuf::from(s)
}

/// Build the per-model Lance table name. Per design §6.3:
/// `chunk_embeddings_<model>_<dim>.lance`. Model IDs may contain
/// characters that are illegal in directory names on some filesystems
/// (Windows reserved chars, `/`, …) — squash anything outside
/// `[A-Za-z0-9-]` to `_` so the name is portable.
///
/// LanceDB's `connect(uri).open_table(name)` resolves `name` against
/// the connection root; the trailing `.lance` is part of the directory
/// LanceDB itself appends when it materializes the table, so we pass
/// the bare logical name (`chunk_embeddings_<model>_<dim>`) and let
/// Lance manage the suffix. Spec text uses the suffixed form for the
/// on-disk path; the bare name and the suffixed directory refer to the
/// same table.
pub(crate) fn lance_table_name(model_id: &str, dim: usize) -> String {
    let sanitized = sanitize_model_id(model_id);
    format!("chunk_embeddings_{sanitized}_{dim}")
}

/// Replace anything outside `[A-Za-z0-9-]` with `_`. Idempotent.
pub(crate) fn sanitize_model_id(model_id: &str) -> String {
    model_id
        .chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() || c == '-' {
                c
            } else {
                '_'
            }
        })
        .collect()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn sanitize_replaces_path_separators() {
        assert_eq!(sanitize_model_id("BAAI/bge-small-en"), "BAAI_bge-small-en");
    }

    #[test]
    fn sanitize_keeps_dash_and_alpha_num() {
        assert_eq!(sanitize_model_id("e5-small-v2"), "e5-small-v2");
    }

    #[test]
    fn sanitize_squashes_dot_and_colon() {
        assert_eq!(sanitize_model_id("model.v1:fast"), "model_v1_fast");
    }

    #[test]
    fn lance_table_name_format() {
        assert_eq!(
            lance_table_name("BAAI/bge-small-en", 384),
            "chunk_embeddings_BAAI_bge-small-en_384"
        );
    }

    #[test]
    fn expand_path_substitutes_data_dir() {
        let p = expand_path("{data_dir}/lancedb", "/tmp/kbtest");
        assert_eq!(p, PathBuf::from("/tmp/kbtest/lancedb"));
    }

    #[test]
    fn expand_path_passthrough_absolute() {
        let p = expand_path("/abs/dir", "/ignored");
        assert_eq!(p, PathBuf::from("/abs/dir"));
    }
}
551
crates/kb-store-vector/src/store.rs
Normal file
@@ -0,0 +1,551 @@
//! `LanceVectorStore` — `kb_core::VectorStore` impl over LanceDB.
//!
//! See module-level docs in `lib.rs` for the high-level shape (two-phase
//! upsert, sync/async bridge, table layout).

use std::collections::HashSet;
use std::path::PathBuf;
use std::sync::Arc;

use anyhow::{Context, Result};
use arrow_array::{Array, Float32Array, RecordBatch, StringArray};
use arrow_schema::SchemaRef;
use futures::TryStreamExt;
use kb_core::{
    ChunkId, DocumentId, EmbeddingModelId, IndexId, SearchFilters,
    VectorHit, VectorRecord, VectorStore,
};
use kb_store_sqlite::{EmbeddingRecordRow, SqliteStore};
use lancedb::Connection;
use lancedb::query::{ExecutableQuery, QueryBase};
use serde_json::json;
use time::OffsetDateTime;
use tokio::runtime::{Builder as RuntimeBuilder, Runtime};

use crate::arrow_batch::{build_batch, schema_for, schema_params_hash};
use crate::paths::{expand_path, lance_table_name};

/// Overfetch multiplier: when post-filtering Lance results against
/// SQLite-side filters we ask for `2 * k` candidates so a moderately
/// selective filter still returns `k` hits. P3-3 spec line 138 caps
/// the doubling at this multiplier; deeper retries are out of scope.
const OVERFETCH_MULTIPLIER: usize = 2;

/// `IndexId` collection label per design §4.2.
const INDEX_COLLECTION: &str = "chunk_embeddings";

/// `IndexId` kind label — flat cosine for v1 (§7.2 + spec line 85).
const INDEX_KIND: &str = "flat";

/// `IndexVersion` token. The schema doesn't expose IndexVersion as a
/// dimension we vary per call, but `id_for_index` requires one; pin to
/// `v1` so re-runs produce stable IDs.
const INDEX_VERSION: &str = "v1";

/// Lance VectorStore.
///
/// Holds a single `lancedb::Connection` opened against
/// `config.storage.vector_dir/`. The connection is cheap to clone via
/// `Arc` internally and is reused across `ensure_table` / `upsert` /
/// `search`. The `tokio::Runtime` is current-thread; multi-thread
/// would buy concurrency we don't currently exploit (kb-app job
/// scheduler serializes vector ops) at the cost of extra worker
/// threads (one per CPU core by default).
///
/// # Async context
///
/// `LanceVectorStore` owns a private `tokio::runtime::Runtime` and
/// drives every `VectorStore` trait method through `runtime.block_on`.
/// **Do NOT construct or call any of these methods from inside another
/// tokio runtime context** — `block_on` panics with `"Cannot start a
/// runtime from within a runtime"` in that case. `kb-app`'s job
/// scheduler is synchronous so this is safe today; if a future caller
/// wants to embed `LanceVectorStore` inside an async server they must
/// wrap calls in `tokio::task::spawn_blocking` (or move to an
/// async-native `VectorStore` impl).
pub struct LanceVectorStore {
    runtime: Runtime,
    connection: Connection,
    sqlite: Arc<SqliteStore>,
    /// Resolved absolute path to the Lance root. Kept for diagnostics
    /// only — the `Connection` already knows it.
    #[allow(dead_code)]
    vector_dir: PathBuf,
}

impl LanceVectorStore {
    /// Open (or create) the Lance directory under
    /// `config.storage.vector_dir`, build a current-thread tokio
    /// runtime, and return a ready-to-use store. Migrations on the
    /// SQLite side must already have been applied (`run_migrations`)
    /// — this constructor does not touch the SQLite schema.
    ///
    /// **Caveat:** internally calls `runtime.block_on` to open the
    /// Lance connection. Calling this from inside another tokio
    /// runtime context will panic with `"Cannot start a runtime from
    /// within a runtime"`. See the struct-level `# Async context`
    /// section.
    pub fn new(config: &kb_config::Config, sqlite: Arc<SqliteStore>) -> Result<Self> {
        let data_dir = expand_path(&config.storage.data_dir, "");
        let vector_dir =
            expand_path(&config.storage.vector_dir, &data_dir.to_string_lossy());
        std::fs::create_dir_all(&vector_dir)
            .with_context(|| format!("create vector_dir {}", vector_dir.display()))?;

        // current-thread runtime: see module docs. Multi-thread would
        // spawn worker threads we don't use.
        let runtime = RuntimeBuilder::new_current_thread()
            .enable_all()
            .build()
            .context("build tokio runtime for kb-store-vector")?;

        let uri = vector_dir.to_string_lossy().into_owned();
        let connection = runtime
            .block_on(async {
                lancedb::connect(&uri)
                    .execute()
                    .await
                    .context("lancedb::connect")
            })?;

        tracing::debug!(
            target: "kb-store-vector",
            vector_dir = %vector_dir.display(),
            "opened LanceVectorStore"
        );

        Ok(Self {
            runtime,
            connection,
            sqlite,
            vector_dir,
        })
    }

    /// Open or create the Lance table with the current schema. Returns
    /// a handle the caller can use for queries.
    async fn ensure_table_async(
        connection: &Connection,
        table_name: &str,
        dim: usize,
    ) -> Result<lancedb::Table> {
        match connection.open_table(table_name).execute().await {
            Ok(t) => Ok(t),
            Err(lancedb::Error::TableNotFound { .. }) => {
                let schema = schema_for(dim);
                let table = connection
                    .create_empty_table(table_name, schema)
                    .execute()
                    .await
                    .context("create_empty_table")?;
                tracing::info!(
                    target: "kb-store-vector",
                    table = table_name,
                    dim,
                    "created Lance table"
                );
                Ok(table)
            }
            Err(e) => Err(anyhow::Error::from(e)).context("open_table"),
        }
    }

    /// Validate that the on-disk Lance table's schema matches what
    /// `schema_for(dim)` produces. Used by `upsert` to fail fast on a
    /// dim mismatch BEFORE any phase-1 SQLite write lands.
    fn check_dim(table_schema: &SchemaRef, dim: usize) -> Result<()> {
        let field = table_schema
            .field_with_name("embedding")
            .context("table missing 'embedding' column")?;
        match field.data_type() {
            arrow_schema::DataType::FixedSizeList(_, table_dim) => {
                if (*table_dim as usize) != dim {
                    anyhow::bail!(
                        "dimension mismatch: table has dim {}, records have dim {}",
                        table_dim,
                        dim
                    );
                }
                Ok(())
            }
            other => anyhow::bail!(
                "embedding column has unexpected Arrow type {:?}",
                other
            ),
        }
    }
}

impl VectorStore for LanceVectorStore {
fn ensure_table(
|
||||
&self,
|
||||
model: &EmbeddingModelId,
|
||||
dim: usize,
|
||||
) -> Result<IndexId> {
|
||||
let table_name = lance_table_name(&model.0, dim);
|
||||
// The trait method only needs the IndexId — we don't return the
|
||||
// Lance handle. Open (or create) the table to enforce idempotence
|
||||
// (a second call with the same params must succeed and yield
|
||||
// the same IndexId).
|
||||
self.runtime.block_on(async {
|
||||
Self::ensure_table_async(&self.connection, &table_name, dim).await
|
||||
})?;
|
||||
|
||||
let params_hash = schema_params_hash(dim);
|
||||
let id = kb_core::id_for_index(
|
||||
INDEX_COLLECTION,
|
||||
model,
|
||||
dim,
|
||||
&kb_core::IndexVersion(INDEX_VERSION.to_string()),
|
||||
INDEX_KIND,
|
||||
¶ms_hash,
|
||||
);
|
||||
Ok(id)
|
||||
}
|
||||
|
||||
    fn upsert(&self, recs: &[VectorRecord]) -> Result<()> {
        if recs.is_empty() {
            return Ok(());
        }

        // All records in a single upsert call must share (model_id,
        // model_version, dimensions). Callers (kb-app indexer) already
        // batch by model; we enforce here so a misuse fails loudly.
        let model_id = recs[0].model_id.clone();
        let model_version = recs[0].model_version.clone();
        let dim = recs[0].dimensions;
        for r in recs {
            if r.model_id != model_id
                || r.model_version != model_version
                || r.dimensions != dim
            {
                anyhow::bail!(
                    "kb-store-vector::upsert called with mixed (model_id, model_version, dim) — caller must bucket per table"
                );
            }
        }

        let table_name = lance_table_name(&model_id.0, dim);

        // Open (or create) the Lance table FIRST and check its on-disk
        // dim against what the records claim. A mismatch must error
        // before any phase-1 SQLite write. Spec line 94: "Dimension
        // mismatch returns Error from upsert and writes nothing."
        // This guards the claimed `dimensions` only; a record whose
        // `vector.len()` disagrees is rejected later by `build_batch`,
        // after phase 1 has already staged its pending row.
        let table = self.runtime.block_on(async {
            Self::ensure_table_async(&self.connection, &table_name, dim).await
        })?;
        let table_schema = self
            .runtime
            .block_on(async { table.schema().await.context("read table schema") })?;
        Self::check_dim(&table_schema, dim)?;

        // Phase 1: stage embedding_records rows at status='pending'.
        let now = OffsetDateTime::now_utc();
        let pending_rows: Vec<EmbeddingRecordRow> = recs
            .iter()
            .map(|r| EmbeddingRecordRow {
                embedding_id: r.embedding_id.0.clone(),
                chunk_id: r.chunk_id.0.clone(),
                model_id: r.model_id.0.clone(),
                model_version: r.model_version.0.clone(),
                dimensions: r.dimensions,
                lance_table: table_name.clone(),
                created_at: now,
            })
            .collect();
        self.sqlite
            .put_embedding_records_pending(&pending_rows)
            .context("phase 1: stage pending embedding_records")?;

        // Phase 2: Lance MergeInsert keyed on chunk_id.
        let batch = build_batch(recs, dim, now)?;
        merge_insert_batch(&self.runtime, &table, batch)
            .context("phase 2: Lance MergeInsert")?;

        // Phase 3: flip rows to status='committed'. If we crashed
        // between phase 2 and phase 3, the rows stay 'pending' and a
        // future upsert call retries them (Lance MergeInsert dedupes
        // on chunk_id, so the retry is a no-op on the Lance side).
        let embedding_ids: Vec<String> =
            recs.iter().map(|r| r.embedding_id.0.clone()).collect();
        self.sqlite
            .mark_embedding_records_committed(&embedding_ids)
            .context("phase 3: mark embedding_records committed")?;

        tracing::info!(
            target: "kb-store-vector",
            table = %table_name,
            rows = recs.len(),
            "upsert committed"
        );
        Ok(())
    }
|
||||
    fn search(
        &self,
        query_vec: &[f32],
        k: usize,
        filters: &SearchFilters,
    ) -> Result<Vec<VectorHit>> {
        if k == 0 {
            return Ok(Vec::new());
        }

        // We need to know which table to query. SearchFilters doesn't
        // carry a model_id (the trait doesn't expose one to the
        // caller), so we scan known tables on disk and pick the one
        // matching `query_vec.len()`. In v1 there's typically one
        // model in play; if there are several we pick the first match.
        let dim = query_vec.len();
        let table_name = match self
            .runtime
            .block_on(async { find_matching_table(&self.connection, dim).await })?
        {
            Some(name) => name,
            None => {
                tracing::debug!(
                    target: "kb-store-vector",
                    dim,
                    "search: no Lance table matches query dim — returning empty"
                );
                return Ok(Vec::new());
            }
        };

        // Pre-fetch k * OVERFETCH_MULTIPLIER Lance rows; we'll filter
        // against SQLite afterwards. If filters are empty we still
        // over-fetch to exclude tombstoned / pending rows.
        let overfetch = k.saturating_mul(OVERFETCH_MULTIPLIER).max(k);
        let raw_hits = self.runtime.block_on(async {
            let table = match self.connection.open_table(&table_name).execute().await
            {
                Ok(t) => t,
                Err(lancedb::Error::TableNotFound { .. }) => return Ok(Vec::new()),
                Err(e) => return Err(anyhow::Error::from(e)),
            };

            let stream = table
                .vector_search(query_vec)
                .context("vector_search")?
                .distance_type(lancedb::DistanceType::Cosine)
                .limit(overfetch)
                .execute()
                .await
                .context("execute vector query")?;
            let batches: Vec<RecordBatch> =
                stream.try_collect().await.context("collect batches")?;
            Result::<Vec<RecordBatch>>::Ok(batches)
        })?;

        let candidates = decode_lance_hits(&raw_hits)?;

        // Filter against embedding_records (status='committed') and
        // documents (tags / lang / path / trust). For the empty filter
        // case the join still excludes tombstoned / pending rows.
        // The `filter_chunks` helper lives in kb-store-sqlite (the
        // crate that owns the schema), so this crate doesn't need its
        // own rusqlite / globset direct deps.
        let candidate_ids: Vec<ChunkId> = {
            // Deduplicate: Lance result batches can in principle
            // repeat a chunk_id across batches; the JOIN is most
            // efficient if we ask once per id.
            let mut seen = HashSet::new();
            candidates
                .iter()
                .filter(|c| seen.insert(c.chunk_id.0.clone()))
                .map(|c| c.chunk_id.clone())
                .collect()
        };
        let allowed_set: HashSet<String> = self
            .sqlite
            .filter_chunks(&candidate_ids, filters)
            .context("post-filter chunks via kb-store-sqlite")?
            .into_iter()
            .map(|c| c.0)
            .collect();

        let mut hits: Vec<VectorHit> = candidates
            .into_iter()
            .filter(|c| allowed_set.contains(&c.chunk_id.0))
            .take(k)
            .map(LanceCandidate::into_hit)
            .collect();
        // Re-rank by score desc to give callers a consistent ordering
        // regardless of post-filter shuffling.
        hits.sort_by(|a, b| {
            b.score
                .partial_cmp(&a.score)
                .unwrap_or(std::cmp::Ordering::Equal)
        });
        Ok(hits)
    }
}
|
||||
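The over-fetch, post-filter, truncate flow in `search` can be illustrated with plain collections. This is a minimal sketch, not the crate's code: `Candidate`, `post_filter_top_k`, and the `allowed` set (standing in for the result of `filter_chunks`) are assumed names, and candidates are assumed to arrive nearest-first, as a vector query would return them.

```rust
use std::collections::HashSet;

/// Illustrative stand-in for a decoded vector-store row.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    chunk_id: String,
    score: f32,
}

/// Keep only candidates whose id is in `allowed`, cut to `k`, then
/// order by score descending. `allowed` plays the role of the id set
/// a metadata store would return for the active filters.
fn post_filter_top_k(
    candidates: Vec<Candidate>,
    allowed: &HashSet<String>,
    k: usize,
) -> Vec<Candidate> {
    let mut hits: Vec<Candidate> = candidates
        .into_iter()
        .filter(|c| allowed.contains(&c.chunk_id))
        .take(k)
        .collect();
    // NaN-safe descending sort: incomparable pairs are treated as equal.
    hits.sort_by(|a, b| {
        b.score
            .partial_cmp(&a.score)
            .unwrap_or(std::cmp::Ordering::Equal)
    });
    hits
}
```

If fewer than `k` candidates survive the filter, the caller gets a short result. A larger over-fetch multiplier trades query cost for fewer short results.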
|
||||
/// One Lance row decoded from a query batch, paired with the converted
|
||||
/// score and pre-built JSON payload. We keep `chunk_id` separately so
|
||||
/// the SQLite filter pass can JOIN against it without re-parsing the
|
||||
/// payload.
|
||||
struct LanceCandidate {
|
||||
chunk_id: ChunkId,
|
||||
doc_id: DocumentId,
|
||||
text: String,
|
||||
heading_path: Vec<String>,
|
||||
score: f32,
|
||||
}
|
||||
|
||||
impl LanceCandidate {
|
||||
fn into_hit(self) -> VectorHit {
|
||||
let payload = json!({
|
||||
"doc_id": self.doc_id.0,
|
||||
"text": self.text,
|
||||
"heading_path": self.heading_path,
|
||||
});
|
||||
VectorHit {
|
||||
chunk_id: self.chunk_id,
|
||||
score: self.score,
|
||||
payload,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Decode a list of Lance result batches into typed candidates.
|
||||
/// Lance's vector query attaches a `_distance: Float32` column; we
|
||||
/// convert to similarity via `1 - distance` then shift to `[0, 1]`
|
||||
/// via `(sim + 1) / 2` per spec line 96. NaN distances get score 0
|
||||
/// (with a warn log).
|
||||
fn decode_lance_hits(batches: &[RecordBatch]) -> Result<Vec<LanceCandidate>> {
|
||||
let mut out = Vec::new();
|
||||
for batch in batches {
|
||||
let chunk_ids = batch
|
||||
.column_by_name("chunk_id")
|
||||
.context("missing chunk_id col")?
|
||||
.as_any()
|
||||
.downcast_ref::<StringArray>()
|
||||
.context("chunk_id wrong type")?;
|
||||
let doc_ids = batch
|
||||
.column_by_name("doc_id")
|
||||
.context("missing doc_id col")?
|
||||
.as_any()
|
||||
.downcast_ref::<StringArray>()
|
||||
.context("doc_id wrong type")?;
|
||||
let texts = batch
|
||||
.column_by_name("text")
|
||||
.context("missing text col")?
|
||||
.as_any()
|
||||
.downcast_ref::<StringArray>()
|
||||
.context("text wrong type")?;
|
||||
let heading_path_str = batch
|
||||
.column_by_name("heading_path")
|
||||
.context("missing heading_path col")?
|
||||
.as_any()
|
||||
.downcast_ref::<StringArray>()
|
||||
.context("heading_path wrong type")?;
|
||||
let distances = batch
|
||||
.column_by_name("_distance")
|
||||
.context("missing _distance col")?
|
||||
.as_any()
|
||||
.downcast_ref::<Float32Array>()
|
||||
.context("_distance wrong type")?;
|
||||
|
||||
for i in 0..batch.num_rows() {
|
||||
let dist = distances.value(i);
|
||||
let score = score_from_distance(dist);
|
||||
let heading_path: Vec<String> = serde_json::from_str(
|
||||
heading_path_str.value(i),
|
||||
)
|
||||
.unwrap_or_default();
|
||||
out.push(LanceCandidate {
|
||||
chunk_id: ChunkId(chunk_ids.value(i).to_string()),
|
||||
doc_id: DocumentId(doc_ids.value(i).to_string()),
|
||||
text: texts.value(i).to_string(),
|
||||
heading_path,
|
||||
score,
|
||||
});
|
||||
}
|
||||
}
|
||||
Ok(out)
|
||||
}
|
||||
|
||||
/// Convert a cosine distance (LanceDB returns `1 - cosine_similarity`
/// in `[0, 2]` for L2-normalized vectors) to a `[0, 1]` score via
/// `score = ((1 - distance) + 1) / 2`. Per spec line 96 the shift
/// (rather than clamp) preserves ordering between unrelated and
/// opposite vectors. NaN, which Lance can produce when one side is
/// the all-zero vector, collapses to 0 with a warn.
fn score_from_distance(distance: f32) -> f32 {
    if distance.is_nan() {
        tracing::warn!(
            target: "kb-store-vector",
            "NaN cosine distance from Lance — coercing to score 0"
        );
        return 0.0;
    }
    let sim = 1.0 - distance;
    (sim + 1.0) / 2.0
}
|
||||
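The distance-to-score mapping in `score_from_distance` can be checked in isolation. The sketch below restates the same arithmetic without the `tracing` dependency; the names `shifted_score` and `clamped_score` are illustrative and not part of the crate, and the clamped variant exists only for comparison. It shows why the shift keeps opposite vectors ranked below orthogonal ones, whereas a clamp would tie them at 0.

```rust
/// Shift-based score, same arithmetic as `score_from_distance`:
/// distance in [0, 2] -> similarity in [-1, 1] -> score in [0, 1].
fn shifted_score(distance: f32) -> f32 {
    if distance.is_nan() {
        return 0.0; // NaN (e.g. an all-zero vector) collapses to 0
    }
    let similarity = 1.0 - distance;
    (similarity + 1.0) / 2.0
}

/// Hypothetical alternative, for comparison only: clamp negative
/// similarity to 0 instead of shifting the whole range.
fn clamped_score(distance: f32) -> f32 {
    if distance.is_nan() {
        return 0.0;
    }
    (1.0 - distance).max(0.0)
}
```

With the shift, an orthogonal vector (distance 1) scores 0.5 and an opposite vector (distance 2) scores 0; with the clamp both score 0, so their relative order is lost.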
/// Find a Lance table whose embedding column is FixedSizeList<Float32, dim>.
|
||||
async fn find_matching_table(
|
||||
connection: &Connection,
|
||||
dim: usize,
|
||||
) -> Result<Option<String>> {
|
||||
let names = connection
|
||||
.table_names()
|
||||
.execute()
|
||||
.await
|
||||
.context("table_names")?;
|
||||
for name in names {
|
||||
if !name.starts_with("chunk_embeddings_") {
|
||||
continue;
|
||||
}
|
||||
match connection.open_table(&name).execute().await {
|
||||
Ok(t) => {
|
||||
let schema = t.schema().await.context("schema for table")?;
|
||||
if let Ok(field) = schema.field_with_name("embedding") {
|
||||
if let arrow_schema::DataType::FixedSizeList(_, table_dim) =
|
||||
field.data_type()
|
||||
{
|
||||
if (*table_dim as usize) == dim {
|
||||
return Ok(Some(name));
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
Err(e) => {
|
||||
tracing::warn!(
|
||||
target: "kb-store-vector",
|
||||
table = %name,
|
||||
error = %e,
|
||||
"search: skipped unopenable table"
|
||||
);
|
||||
}
|
||||
}
|
||||
}
|
||||
Ok(None)
|
||||
}
|
||||
|
||||
/// Run the Lance MergeInsert under our embedded runtime. Pulled out
|
||||
/// of `upsert` so the trait method stays compact.
|
||||
fn merge_insert_batch(
|
||||
runtime: &Runtime,
|
||||
table: &lancedb::Table,
|
||||
batch: RecordBatch,
|
||||
) -> Result<()> {
|
||||
let schema = batch.schema();
|
||||
runtime.block_on(async move {
|
||||
let reader = arrow_array::RecordBatchIterator::new(
|
||||
vec![Ok(batch)].into_iter(),
|
||||
schema,
|
||||
);
|
||||
let mut builder = table.merge_insert(&["chunk_id"]);
|
||||
builder
|
||||
.when_matched_update_all(None)
|
||||
.when_not_matched_insert_all();
|
||||
builder
|
||||
.execute(Box::new(reader))
|
||||
.await
|
||||
.context("MergeInsert execute")?;
|
||||
Result::<()>::Ok(())
|
||||
})
|
||||
}
|
||||
|
||||
185
crates/kb-store-vector/tests/common/mod.rs
Normal file
@@ -0,0 +1,185 @@
|
||||
//! Shared scaffolding for kb-store-vector integration tests.
|
||||
//!
|
||||
//! # Test policy
|
||||
//!
|
||||
//! Integration tests in this crate are marked `#[ignore]` and require
|
||||
//! AVX-capable hardware. They are excluded from the default `cargo
|
||||
//! test -p kb-store-vector` lane and only run when explicitly opted
|
||||
//! in:
|
||||
//!
|
||||
//! ```text
|
||||
//! cargo test -p kb-store-vector -- --ignored
|
||||
//! ```
|
||||
//!
|
||||
//! The reason: LanceDB's f32 SIMD path uses unconditional AVX
|
||||
//! intrinsics (`__m256` in `lance-linalg::simd::f32`). On x86_64
|
||||
//! CPUs without AVX support — notably QEMU's default `qemu64` model
|
||||
//! in CI sandboxes and some bare-metal dev boxes — those instructions
|
||||
//! trigger `SIGILL: illegal instruction` at the first `vector_search`
|
||||
//! call. Rather than silently turn that into a "passing" test (which
|
||||
//! it isn't), we gate the integration suite behind `#[ignore]` and
|
||||
//! call [`require_avx_or_panic`] inside each test body so that an
|
||||
//! `--ignored` invocation on a non-AVX host fails loudly rather than
|
||||
//! crashing later inside a Lance kernel.
|
||||
//!
|
||||
//! This mirrors P3-2's `#[ignore]` policy on tests that require a
|
||||
//! model download — both are CI-lane decisions, not silent skips.
|
||||
//!
|
||||
//! Each test owns a `TempDir` (vector_dir + sqlite db live underneath
|
||||
//! it), a fully-migrated `SqliteStore`, and a `LanceVectorStore`
|
||||
//! pointed at both. We seed `documents` / `chunks` rows directly via
|
||||
//! SQL (rather than going through `DocumentStore::put_document`) so
|
||||
//! the tests stay independent of kb-parse-md / kb-normalize / kb-chunk
|
||||
//! and so we can construct adversarial fixtures (filtered tags,
|
||||
//! mismatched langs) without reproducing a Markdown round-trip.
|
||||
|
||||
#![allow(dead_code)]
|
||||
|
||||
use std::path::PathBuf;
|
||||
use std::sync::Arc;
|
||||
|
||||
/// Panic if the host CPU lacks AVX. Called from every `#[ignore]`-d
|
||||
/// integration test body so that `cargo test -- --ignored` on a
|
||||
/// non-AVX host fails loudly with a clear message instead of crashing
|
||||
/// later inside a Lance SIMD kernel with `SIGILL`.
|
||||
///
|
||||
/// On non-x86_64 hosts this is a no-op (Lance's AVX requirement is
|
||||
/// x86-only — ARM/Apple Silicon paths use different intrinsics that
|
||||
/// the workspace doesn't currently target).
|
||||
pub fn require_avx_or_panic() {
|
||||
#[cfg(target_arch = "x86_64")]
|
||||
{
|
||||
if !std::is_x86_feature_detected!("avx") {
|
||||
panic!(
|
||||
"kb-store-vector integration test requires AVX-capable hardware; \
|
||||
host CPU lacks AVX. Run on an AVX-capable machine. \
|
||||
See crates/kb-store-vector/tests/common/mod.rs."
|
||||
);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
use kb_config::Config;
|
||||
use kb_core::{
|
||||
ChunkId, DocumentId, EmbeddingId, EmbeddingModelId, EmbeddingVersion, VectorRecord,
|
||||
};
|
||||
use kb_store_sqlite::SqliteStore;
|
||||
use kb_store_vector::LanceVectorStore;
|
||||
use rusqlite::params;
|
||||
use tempfile::TempDir;
|
||||
|
||||
pub struct TestEnv {
|
||||
pub temp: TempDir,
|
||||
pub config: Config,
|
||||
pub sqlite: Arc<SqliteStore>,
|
||||
pub vector: LanceVectorStore,
|
||||
}
|
||||
|
||||
impl TestEnv {
|
||||
pub fn new() -> Self {
|
||||
let temp = tempfile::tempdir().expect("tempdir");
|
||||
let mut config = Config::defaults();
|
||||
config.storage.data_dir = temp.path().to_string_lossy().into_owned();
|
||||
let sqlite = SqliteStore::open(&config).unwrap();
|
||||
sqlite.run_migrations().unwrap();
|
||||
let sqlite = Arc::new(sqlite);
|
||||
let vector = LanceVectorStore::new(&config, sqlite.clone()).unwrap();
|
||||
Self {
|
||||
temp,
|
||||
config,
|
||||
sqlite,
|
||||
vector,
|
||||
}
|
||||
}
|
||||
|
||||
pub fn data_dir(&self) -> PathBuf {
|
||||
self.temp.path().to_path_buf()
|
||||
}
|
||||
|
||||
/// Insert minimum (asset, document, chunk) rows so phase-1
|
||||
/// embedding_records inserts don't trip the FK to chunks /
|
||||
/// documents.
|
||||
pub fn seed_chunk(
|
||||
&self,
|
||||
chunk_id: &str,
|
||||
doc_id: &str,
|
||||
workspace_path: &str,
|
||||
lang: &str,
|
||||
tags: &[&str],
|
||||
trust_level: &str,
|
||||
) {
|
||||
// Asset id derived from doc_id deterministically, so every
// document gets its own asset to keep things simple.
|
||||
let asset_id = format!("a{}", &doc_id[..31]);
|
||||
let conn = self.sqlite.read_conn();
|
||||
conn.execute(
|
||||
"INSERT OR IGNORE INTO assets (
|
||||
asset_id, source_uri, workspace_path, media_type, byte_len,
|
||||
checksum, storage_kind, storage_path, discovered_at
|
||||
) VALUES (?, ?, ?, ?, 0, ?, 'reference', ?, '1970-01-01T00:00:00Z')",
|
||||
params![
|
||||
asset_id,
|
||||
format!("file://{workspace_path}"),
|
||||
workspace_path,
|
||||
"{}",
|
||||
"deadbeefdeadbeefdeadbeefdeadbeef",
|
||||
workspace_path,
|
||||
],
|
||||
)
|
||||
.unwrap();
|
||||
conn.execute(
|
||||
"INSERT OR IGNORE INTO documents (
|
||||
doc_id, asset_id, workspace_path, title, lang, source_type,
|
||||
trust_level, parser_version, doc_version, schema_version,
|
||||
metadata_json, provenance_json, created_at, updated_at
|
||||
) VALUES (?, ?, ?, NULL, ?, 'markdown', ?, 'v1', 1, 1, '{}', '{}',
|
||||
'1970-01-01T00:00:00Z', '1970-01-01T00:00:00Z')",
|
||||
params![doc_id, asset_id, workspace_path, lang, trust_level],
|
||||
)
|
||||
.unwrap();
|
||||
for t in tags {
|
||||
conn.execute(
|
||||
"INSERT OR IGNORE INTO document_tags (doc_id, tag) VALUES (?, ?)",
|
||||
params![doc_id, t],
|
||||
)
|
||||
.unwrap();
|
||||
}
|
||||
conn.execute(
|
||||
"INSERT OR IGNORE INTO chunks (
|
||||
chunk_id, doc_id, text, heading_path_json, section_label,
|
||||
source_spans_json, token_estimate, chunker_version,
|
||||
policy_hash, block_ids_json, created_at
|
||||
) VALUES (?, ?, 'hi', '[]', NULL, '[]', 1, 'v1', 'h', '[]', '1970-01-01T00:00:00Z')",
|
||||
params![chunk_id, doc_id],
|
||||
)
|
||||
.unwrap();
|
||||
}
|
||||
}
|
||||
|
||||
/// Build a deterministic test VectorRecord from a few simple inputs.
|
||||
/// `vector` is taken verbatim, `dimensions` is set from `vector.len()`.
|
||||
pub fn make_record(
|
||||
chunk_idx: u8,
|
||||
doc_idx: u8,
|
||||
vector: Vec<f32>,
|
||||
text: &str,
|
||||
heading: &[&str],
|
||||
model: &str,
|
||||
) -> VectorRecord {
|
||||
let dim = vector.len();
|
||||
let chunk_id = ChunkId(format!("{:032x}", 0x1100u32 + chunk_idx as u32));
|
||||
let doc_id = DocumentId(format!("{:032x}", 0xd0c0u32 + doc_idx as u32));
|
||||
let embedding_id =
|
||||
EmbeddingId(format!("{:032x}", 0xeeee0000u32 + chunk_idx as u32));
|
||||
VectorRecord {
|
||||
chunk_id,
|
||||
embedding_id,
|
||||
vector,
|
||||
doc_id,
|
||||
text: text.to_string(),
|
||||
heading_path: heading.iter().map(|s| s.to_string()).collect(),
|
||||
model_id: EmbeddingModelId(model.to_string()),
|
||||
model_version: EmbeddingVersion("v1".to_string()),
|
||||
dimensions: dim,
|
||||
}
|
||||
}
|
||||
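The fixture ids built by `make_record` are fixed-width hex strings, which is what lets a test recover the index from the tail of a `chunk_id`. A small sketch of that round trip, using only the standard library; `fixture_chunk_id` and `chunk_idx_from_id` are hypothetical helper names, and the `0x1100` base is the one used above.

```rust
/// Same formatting as the chunk id in `make_record`: a 32-character,
/// zero-padded, lowercase hex string.
fn fixture_chunk_id(chunk_idx: u8) -> String {
    format!("{:032x}", 0x1100u32 + chunk_idx as u32)
}

/// Recover the index from the last four hex digits. Returns `None`
/// for ids that are too short, not hex, or below the 0x1100 base.
fn chunk_idx_from_id(chunk_id: &str) -> Option<u32> {
    let start = chunk_id.len().checked_sub(4)?;
    let tail = chunk_id.get(start..)?;
    u32::from_str_radix(tail, 16).ok()?.checked_sub(0x1100)
}
```

Since `chunk_idx` is a `u8`, the largest value is `0x1100 + 0xff = 0x11ff`, which still fits in four hex digits, so reading only the tail is lossless for these fixtures.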
4
crates/kb-store-vector/tests/fixtures/vector/run-1.json
vendored
Normal file
@@ -0,0 +1,4 @@
|
||||
{
|
||||
"_comment": "PLACEHOLDER — regenerate via `KB_UPDATE_SNAPSHOTS=1 cargo test -p kb-store-vector -- --ignored snapshot` on an AVX-capable host. Until then the snapshot test panics with a clear 'placeholder' message.",
|
||||
"hits": []
|
||||
}
|
||||
119
crates/kb-store-vector/tests/snapshot.rs
Normal file
@@ -0,0 +1,119 @@
|
||||
//! Snapshot test: a fixed corpus + fixed query produces a stable
|
||||
//! `Vec<VectorHit>` JSON. Pinning the snapshot here catches accidental
|
||||
//! drift in score scaling, payload shape, or top-k ordering.
|
||||
//!
|
||||
//! This test is `#[ignore]` and requires AVX-capable hardware. Run
|
||||
//! with `cargo test -p kb-store-vector -- --ignored snapshot`.
|
||||
//!
|
||||
//! The committed fixture at `tests/fixtures/vector/run-1.json` is a
|
||||
//! placeholder until first regenerated on AVX hardware. The test
|
||||
//! detects the placeholder via its `_comment` field and panics with
|
||||
//! a clear "regenerate me" message — see `assert_no_placeholder`
|
||||
//! below.
|
||||
|
||||
use std::path::PathBuf;
|
||||
|
||||
use kb_core::{SearchFilters, VectorStore};
|
||||
use serde_json::json;
|
||||
|
||||
mod common;
|
||||
use common::{TestEnv, make_record, require_avx_or_panic};
|
||||
|
||||
const MODEL: &str = "snapshot-model";
|
||||
|
||||
#[test]
|
||||
#[ignore = "requires AVX-capable hardware (LanceDB)"]
|
||||
fn vector_hits_snapshot_run_1() {
|
||||
require_avx_or_panic();
|
||||
let env = TestEnv::new();
|
||||
// Fixed deterministic corpus: 4 unit-norm vectors, each with a
|
||||
// known doc / chunk / heading. The query points squarely at
|
||||
// chunk 0 so the expected ordering is 0, then the others by
|
||||
// distance from dir(0).
|
||||
let corpus = vec![
|
||||
(0u8, vec![1.0_f32, 0.0, 0.0, 0.0], "alpha", &["A"][..]),
|
||||
(1u8, vec![0.95_f32, 0.31, 0.0, 0.0], "beta", &["A", "B"][..]),
|
||||
(2u8, vec![0.0_f32, 1.0, 0.0, 0.0], "gamma", &["B"][..]),
|
||||
(3u8, vec![0.0_f32, 0.0, 1.0, 0.0], "delta", &[][..]),
|
||||
];
|
||||
|
||||
let mut recs = Vec::new();
|
||||
for (i, vec, text, headings) in &corpus {
|
||||
let rec = make_record(*i, *i, vec.clone(), text, headings, MODEL);
|
||||
env.seed_chunk(
|
||||
&rec.chunk_id.0,
|
||||
&rec.doc_id.0,
|
||||
&format!("notes/{i}.md"),
|
||||
"en",
|
||||
&[],
|
||||
"primary",
|
||||
);
|
||||
recs.push(rec);
|
||||
}
|
||||
env.vector.upsert(&recs).unwrap();
|
||||
|
||||
let q = vec![1.0_f32, 0.0, 0.0, 0.0];
|
||||
let hits = env.vector.search(&q, 3, &SearchFilters::default()).unwrap();
|
||||
|
||||
// The snapshot pins:
|
||||
// - top-3 chunk_id ordering (by score desc)
|
||||
// - payload shape: { doc_id, text, heading_path }
|
||||
// - that scores live in [0, 1] and are sorted descending
|
||||
let actual = json!(
|
||||
hits.iter().map(|h| json!({
|
||||
"chunk_id": h.chunk_id.0,
|
||||
"score_in_unit_interval": (0.0..=1.0).contains(&h.score),
|
||||
"payload": h.payload,
|
||||
})).collect::<Vec<_>>()
|
||||
);
|
||||
|
||||
let fixture = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
|
||||
.join("tests")
|
||||
.join("fixtures")
|
||||
.join("vector")
|
||||
.join("run-1.json");
|
||||
|
||||
if std::env::var_os("KB_UPDATE_SNAPSHOTS").is_some() {
|
||||
std::fs::create_dir_all(fixture.parent().unwrap()).unwrap();
|
||||
std::fs::write(&fixture, serde_json::to_string_pretty(&actual).unwrap())
|
||||
.unwrap();
|
||||
return;
|
||||
}
|
||||
|
||||
let expected: serde_json::Value =
|
||||
serde_json::from_str(&std::fs::read_to_string(&fixture).unwrap_or_else(
|
||||
|_| panic!(
|
||||
"missing snapshot fixture at {}; run with KB_UPDATE_SNAPSHOTS=1 to create",
|
||||
fixture.display()
|
||||
),
|
||||
))
|
||||
.unwrap();
|
||||
|
||||
// Refuse to silently "pass" when the fixture is the committed
|
||||
// placeholder. The placeholder JSON carries a `_comment` field
|
||||
// with regeneration instructions; production fixtures (a captured
|
||||
// hits array) do not.
|
||||
if expected.get("_comment").is_some() {
|
||||
panic!(
|
||||
"snapshot fixture is a placeholder — regenerate on AVX hardware then commit. \
|
||||
Path: {}. To regenerate: \
|
||||
`KB_UPDATE_SNAPSHOTS=1 cargo test -p kb-store-vector -- --ignored snapshot`.",
|
||||
fixture.display()
|
||||
);
|
||||
}
|
||||
|
||||
assert_eq!(
|
||||
actual, expected,
|
||||
"snapshot drift; rerun with KB_UPDATE_SNAPSHOTS=1 to regenerate"
|
||||
);
|
||||
|
||||
// Independent guard: scores must be non-increasing.
|
||||
for w in hits.windows(2) {
|
||||
assert!(
|
||||
w[0].score >= w[1].score,
|
||||
"scores not in descending order: {} then {}",
|
||||
w[0].score,
|
||||
w[1].score
|
||||
);
|
||||
}
|
||||
}
|
||||
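The update-or-compare pattern in the snapshot test above generalises to a small helper. The sketch below is one possible shape, assuming a hypothetical `check_snapshot` function that takes the update decision as a `bool` (the test derives it from `KB_UPDATE_SNAPSHOTS`) and compares plain strings rather than `serde_json::Value`.

```rust
use std::fs;
use std::path::Path;

/// Write `actual` to `fixture` when `update` is set; otherwise compare
/// it against the stored fixture and report drift as an error.
fn check_snapshot(fixture: &Path, actual: &str, update: bool) -> Result<(), String> {
    if update {
        if let Some(parent) = fixture.parent() {
            fs::create_dir_all(parent).map_err(|e| e.to_string())?;
        }
        fs::write(fixture, actual).map_err(|e| e.to_string())?;
        return Ok(());
    }
    let expected = fs::read_to_string(fixture)
        .map_err(|e| format!("missing snapshot fixture at {}: {e}", fixture.display()))?;
    if expected == actual {
        Ok(())
    } else {
        Err(format!("snapshot drift at {}", fixture.display()))
    }
}
```

A caller would pass `std::env::var_os("KB_UPDATE_SNAPSHOTS").is_some()` as `update`. Keeping the environment lookup outside the helper makes the helper easy to exercise in both modes.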
374
crates/kb-store-vector/tests/upsert_search.rs
Normal file
@@ -0,0 +1,374 @@
|
||||
//! Integration tests for `LanceVectorStore` covering ensure_table,
|
||||
//! upsert, search, dimension mismatch, filters, model isolation, and
|
||||
//! determinism.
|
||||
//!
|
||||
//! Every test in this file is `#[ignore]` and requires an AVX-capable
|
||||
//! x86_64 host. Run with:
|
||||
//!
|
||||
//! ```text
|
||||
//! cargo test -p kb-store-vector -- --ignored
|
||||
//! ```
|
||||
//!
|
||||
//! See `tests/common/mod.rs` for the full rationale.
|
||||
|
||||
use kb_core::{EmbeddingModelId, SearchFilters, VectorStore};
|
||||
use kb_store_sqlite::EmbeddingRecordRow;
|
||||
use rusqlite::params;
|
||||
use time::OffsetDateTime;
|
||||
|
||||
mod common;
|
||||
use common::{TestEnv, make_record, require_avx_or_panic};
|
||||
|
||||
const MODEL: &str = "test-model";
|
||||
|
||||
/// Helper: produce a unit-norm 4-D vector pointing in one of four
|
||||
/// directions. The sign pattern keeps cosine similarities cleanly
|
||||
/// distinct so search ordering tests don't depend on float jitter.
|
||||
fn dir(idx: u8) -> Vec<f32> {
|
||||
match idx {
|
||||
0 => vec![1.0, 0.0, 0.0, 0.0],
|
||||
1 => vec![0.0, 1.0, 0.0, 0.0],
|
||||
2 => vec![0.0, 0.0, 1.0, 0.0],
|
||||
_ => vec![0.0, 0.0, 0.0, 1.0],
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
#[ignore = "requires AVX-capable hardware (LanceDB)"]
|
||||
fn ensure_table_idempotent_returns_same_index_id() {
|
||||
require_avx_or_panic();
|
||||
let env = TestEnv::new();
|
||||
let model = EmbeddingModelId(MODEL.to_string());
|
||||
let id1 = env.vector.ensure_table(&model, 4).unwrap();
|
||||
let id2 = env.vector.ensure_table(&model, 4).unwrap();
|
||||
assert_eq!(id1, id2);
|
||||
}
|
||||
|
||||
#[test]
|
||||
#[ignore = "requires AVX-capable hardware (LanceDB)"]
|
||||
fn search_before_upsert_returns_empty() {
|
||||
require_avx_or_panic();
|
||||
let env = TestEnv::new();
|
||||
let hits = env
|
||||
.vector
|
||||
.search(&dir(0), 5, &SearchFilters::default())
|
||||
.unwrap();
|
||||
assert!(hits.is_empty());
|
||||
}
|
||||
|
||||
#[test]
|
||||
#[ignore = "requires AVX-capable hardware (LanceDB)"]
|
||||
fn upsert_ten_then_search_returns_five() {
|
||||
require_avx_or_panic();
|
||||
let env = TestEnv::new();
|
||||
let mut recs = Vec::new();
|
||||
for i in 0..10u8 {
|
||||
// 4-D vectors clustered near dir(0) for the first half, dir(1)
|
||||
// for the rest, with small per-row jitter so they stay
|
||||
// distinct in the index.
|
||||
let mut v = if i < 5 { dir(0) } else { dir(1) };
|
||||
v[3] = (i as f32) * 0.001;
|
||||
let rec = make_record(i, i, v, &format!("text-{i}"), &["A"], MODEL);
|
||||
env.seed_chunk(
|
||||
&rec.chunk_id.0,
|
||||
&rec.doc_id.0,
|
||||
&format!("notes/{i}.md"),
|
||||
"en",
|
||||
&[],
|
||||
"primary",
|
||||
);
|
||||
recs.push(rec);
|
||||
}
|
||||
env.vector.upsert(&recs).unwrap();
|
||||
|
||||
// 1:1 alignment check: every record has a committed embedding row.
|
||||
{
|
||||
let conn = env.sqlite.read_conn();
|
||||
let count: i64 = conn
|
||||
.query_row(
|
||||
"SELECT COUNT(*) FROM embedding_records WHERE status = 'committed'",
|
||||
[],
|
||||
|r| r.get(0),
|
||||
)
|
||||
.unwrap();
|
||||
assert_eq!(count, 10);
|
||||
}
|
||||
|
||||
let hits = env
|
||||
.vector
|
||||
.search(&dir(0), 5, &SearchFilters::default())
|
||||
.unwrap();
|
||||
assert_eq!(hits.len(), 5, "expected 5 hits, got {}", hits.len());
|
||||
|
||||
// Top hits should be from the first half (clustered around dir(0)).
|
||||
// make_record lays chunk_idx into the low bits of `0x1100 + i`, so
|
||||
// `chunk_idx = u32::from_str_radix(last4, 16) - 0x1100`. The first
|
||||
// half (chunk_idx < 5) lives in 0x1100..=0x1104.
|
||||
for h in &hits {
|
||||
let suffix_hex = &h.chunk_id.0[h.chunk_id.0.len() - 4..];
|
||||
let idx = u32::from_str_radix(suffix_hex, 16).unwrap();
|
||||
let chunk_idx = idx - 0x1100;
|
||||
assert!(
|
||||
chunk_idx < 5,
|
||||
"top-5 hit unexpectedly came from second cluster: idx={chunk_idx}"
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
#[ignore = "requires AVX-capable hardware (LanceDB)"]
|
||||
fn dimension_mismatch_errors_and_writes_nothing() {
|
||||
require_avx_or_panic();
|
||||
let env = TestEnv::new();
|
||||
let model = EmbeddingModelId(MODEL.to_string());
|
||||
|
||||
// First populate a 4-D table with one row so it exists on disk.
|
||||
let r0 = make_record(0, 0, dir(0), "first", &[], MODEL);
|
||||
env.seed_chunk(&r0.chunk_id.0, &r0.doc_id.0, "notes/0.md", "en", &[], "primary");
|
||||
env.vector.upsert(&[r0]).unwrap();
|
||||
assert_eq!(env.vector.ensure_table(&model, 4).unwrap(), env.vector.ensure_table(&model, 4).unwrap());
|
||||
|
||||
// Drive a real record-vs-table mismatch through `upsert`. The table
// name bakes in dim, so an honest 8-D record would simply target a
// different table. Instead we build an 8-D vector and overwrite
// `dimensions` with 4, so the record resolves to the existing 4-D
// table while carrying an 8-D vector. Spec line 94: must error,
// write nothing extra.
|
||||
let mut bad = make_record(1, 1, vec![0.1_f32; 8], "second", &[], MODEL);
|
||||
// Pretend this is a 4-D vector for table-name purposes; the
|
||||
// build_batch then enforces that vector.len() == dim and bails.
|
||||
bad.dimensions = 4;
|
||||
env.seed_chunk(&bad.chunk_id.0, &bad.doc_id.0, "notes/1.md", "en", &[], "primary");
|
||||
|
||||
let bad_chunk = bad.chunk_id.0.clone();
|
||||
let err = env.vector.upsert(&[bad]).unwrap_err();
|
||||
let msg = format!("{err:#}");
|
||||
assert!(
|
||||
msg.to_lowercase().contains("dim")
|
||||
|| msg.contains("does not match table dim"),
|
||||
"unexpected error message: {msg}"
|
||||
);
|
||||
|
||||
// The phase-1 row may have landed before phase 2 detected the
|
||||
// mismatch — but the on-disk Lance table must NOT contain the
|
||||
// bad record. So we assert that no `committed` row corresponds
|
||||
// to chunk_id of the bad record.
|
||||
let conn = env.sqlite.read_conn();
|
||||
let committed: i64 = conn
|
||||
.query_row(
|
||||
"SELECT COUNT(*) FROM embedding_records WHERE chunk_id = ? AND status = 'committed'",
|
||||
rusqlite::params![bad_chunk],
|
||||
|r| r.get(0),
|
||||
)
|
||||
.unwrap();
|
||||
assert_eq!(committed, 0, "bad record reached committed state despite dim mismatch");
|
||||
}
|
||||
|
||||
#[test]
|
||||
#[ignore = "requires AVX-capable hardware (LanceDB)"]
|
||||
fn filter_tags_any_drops_non_matching_docs() {
|
||||
require_avx_or_panic();
|
||||
let env = TestEnv::new();
|
||||
|
||||
// Two docs: one with tag "ko-style", one without.
|
||||
let r_a = make_record(0xaa, 0xaa, dir(0), "alpha", &[], MODEL);
|
||||
let r_b = make_record(0xbb, 0xbb, dir(0), "beta", &[], MODEL);
|
||||
env.seed_chunk(
|
||||
&r_a.chunk_id.0,
|
||||
&r_a.doc_id.0,
|
||||
"notes/a.md",
|
||||
"en",
|
||||
&["ko-style"],
|
||||
"primary",
|
||||
);
|
||||
env.seed_chunk(
|
||||
&r_b.chunk_id.0,
|
||||
&r_b.doc_id.0,
|
||||
"notes/b.md",
|
||||
"en",
|
||||
&["other"],
|
||||
"primary",
|
||||
);
|
||||
let expected_doc_id = r_a.doc_id.0.clone();
|
||||
env.vector.upsert(&[r_a, r_b]).unwrap();
|
||||
|
||||
let filters = SearchFilters {
|
||||
tags_any: vec!["ko-style".to_string()],
|
||||
..Default::default()
|
||||
};
|
||||
let hits = env.vector.search(&dir(0), 10, &filters).unwrap();
|
||||
assert_eq!(hits.len(), 1, "expected only the tagged doc to match");
|
||||
let payload = &hits[0].payload;
|
||||
assert_eq!(payload["doc_id"], expected_doc_id);
|
||||
}
|
||||
|
||||
#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn model_isolation_two_models_two_directories() {
    require_avx_or_panic();
    let env = TestEnv::new();
    let r1 = make_record(0xaa, 0xaa, dir(0), "alpha", &[], "model-A");
    env.seed_chunk(
        &r1.chunk_id.0,
        &r1.doc_id.0,
        "notes/a.md",
        "en",
        &[],
        "primary",
    );
    let chunk_id = r1.chunk_id.0.clone();
    env.vector.upsert(&[r1]).unwrap();

    // Same chunk_id, different model — should land in a separate table.
    let mut r2 = make_record(0xaa, 0xaa, dir(0), "alpha", &[], "model-B");
    r2.embedding_id = kb_core::EmbeddingId(
        "ee01ee01ee01ee01ee01ee01ee01ee01".to_string(),
    );
    env.vector.upsert(&[r2]).unwrap();

    // Two on-disk Lance directories, distinguished by table name.
    let lancedb_root = env.data_dir().join("lancedb");
    let entries: Vec<_> = std::fs::read_dir(&lancedb_root)
        .unwrap()
        .filter_map(Result::ok)
        .map(|e| e.file_name().to_string_lossy().into_owned())
        .collect();
    let a_count = entries
        .iter()
        .filter(|e| e.contains("model-A"))
        .count();
    let b_count = entries
        .iter()
        .filter(|e| e.contains("model-B"))
        .count();
    assert!(a_count >= 1, "model-A table missing: {entries:?}");
    assert!(b_count >= 1, "model-B table missing: {entries:?}");

    // Two embedding_records rows for the same chunk_id, one per model.
    let conn = env.sqlite.read_conn();
    let count: i64 = conn
        .query_row(
            "SELECT COUNT(*) FROM embedding_records WHERE chunk_id = ?",
            params![chunk_id],
            |r| r.get(0),
        )
        .unwrap();
    assert_eq!(count, 2);
}

#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn determinism_same_query_same_top_k() {
    require_avx_or_panic();
    let env = TestEnv::new();
    let recs: Vec<_> = (0..6u8)
        .map(|i| {
            let mut v = dir(i % 4);
            v[3] = (i as f32) * 0.001;
            let rec = make_record(i, i, v, &format!("t-{i}"), &[], MODEL);
            env.seed_chunk(
                &rec.chunk_id.0,
                &rec.doc_id.0,
                &format!("notes/{i}.md"),
                "en",
                &[],
                "primary",
            );
            rec
        })
        .collect();
    env.vector.upsert(&recs).unwrap();

    let q = dir(0);
    let h1 = env.vector.search(&q, 4, &SearchFilters::default()).unwrap();
    let h2 = env.vector.search(&q, 4, &SearchFilters::default()).unwrap();
    let ids1: Vec<_> = h1.iter().map(|h| h.chunk_id.0.clone()).collect();
    let ids2: Vec<_> = h2.iter().map(|h| h.chunk_id.0.clone()).collect();
    assert_eq!(ids1, ids2);
}

#[test]
#[ignore = "requires AVX-capable hardware (LanceDB)"]
fn upsert_retry_promotes_pending_to_committed() {
    // Crash-recovery contract: a phase-1 row that was already
    // committed by a prior batch is left alone by phase-3, but a
    // pending row gets retried and reaches committed once Lance
    // accepts it.
    //
    // Construction of the "crash" state:
    //
    //   1. Stage a row directly via the SQLite phase-1 helper
    //      (`put_embedding_records_pending`). NO Lance write happens
    //      here — this is exactly the on-disk state after a crash
    //      between phase 1 and phase 2. Confirm the row is at
    //      `status='pending'` before doing anything else.
    //
    //   2. Run `LanceVectorStore::upsert` with a `VectorRecord` whose
    //      `embedding_id` matches the pending row. Phase 1's
    //      `INSERT OR REPLACE` is idempotent here (same row payload),
    //      phase 2 actually writes to Lance for the first time, and
    //      phase 3 flips the row to 'committed'.
    //
    //   3. Verify status='committed' and vector_committed=1.
    //
    // This actually exercises the "rows stuck at pending get promoted
    // on next upsert" semantics — the previous version pre-seeded via
    // raw SQL but then the same upsert call overwrote the seed via
    // INSERT OR REPLACE before phase 2 ran, so the recovery path
    // never executed.
    require_avx_or_panic();
    let env = TestEnv::new();
    let rec = make_record(0xaa, 0xaa, dir(0), "alpha", &[], MODEL);
    let chunk_id = rec.chunk_id.0.clone();
    let doc_id = rec.doc_id.0.clone();
    let embedding_id = rec.embedding_id.0.clone();
    env.seed_chunk(&chunk_id, &doc_id, "notes/a.md", "en", &[], "primary");

    // Phase 1 only — go through the same kb-store-sqlite helper that
    // `LanceVectorStore::upsert` uses internally. No Lance write
    // happens, so this models "crashed between phase 1 and phase 2".
    let pending_row = EmbeddingRecordRow {
        embedding_id: embedding_id.clone(),
        chunk_id: chunk_id.clone(),
        model_id: MODEL.to_string(),
        model_version: "v1".to_string(),
        dimensions: 4,
        lance_table: format!("chunk_embeddings_{MODEL}_4"),
        created_at: OffsetDateTime::UNIX_EPOCH,
    };
    env.sqlite
        .put_embedding_records_pending(std::slice::from_ref(&pending_row))
        .unwrap();

    // Sanity: the row is staged but NOT yet committed and Lance has
    // no record of it.
    {
        let conn = env.sqlite.read_conn();
        let (status, committed): (String, i64) = conn
            .query_row(
                "SELECT status, vector_committed FROM embedding_records WHERE embedding_id = ?",
                params![embedding_id],
                |r| Ok((r.get(0)?, r.get(1)?)),
            )
            .unwrap();
        assert_eq!(status, "pending", "row should be at status=pending after phase-1-only");
        assert_eq!(committed, 0);
    }

    // Now run upsert with the matching record. Phase 1's INSERT OR
    // REPLACE is a no-op equivalent (same row payload), phase 2 lands
    // the Lance row for the first time, phase 3 promotes
    // status='committed'.
    env.vector.upsert(&[rec]).unwrap();

    let conn = env.sqlite.read_conn();
    let (status, committed): (String, i64) = conn
        .query_row(
            "SELECT status, vector_committed FROM embedding_records WHERE embedding_id = ?",
            params![embedding_id],
            |r| Ok((r.get(0)?, r.get(1)?)),
        )
        .unwrap();
    assert_eq!(status, "committed");
    assert_eq!(committed, 1);
}
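
The pending-to-committed contract exercised by the test above can be modelled without SQLite or Lance. The following is a minimal sketch, not the crate's implementation: `MockStore`, `fail_phase2`, and `visible` are hypothetical names, a `HashMap` stands in for `embedding_records`, and a `HashSet` stands in for the Lance table.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Pending,
    Committed,
}

/// Hypothetical in-memory stand-in for the SQLite + Lance pair.
#[derive(Default)]
struct MockStore {
    /// embedding_id -> status (plays the role of `embedding_records`).
    records: HashMap<String, Status>,
    /// chunk ids present in the vector table (plays the role of Lance).
    vectors: HashSet<String>,
    /// When true, phase 2 fails, simulating a crash before the vector write.
    fail_phase2: bool,
}

impl MockStore {
    fn upsert(&mut self, embedding_id: &str, chunk_id: &str) -> Result<(), String> {
        // Phase 1: stage the row as pending (INSERT OR REPLACE semantics).
        self.records.insert(embedding_id.to_string(), Status::Pending);
        // Phase 2: idempotent vector write keyed on chunk_id.
        if self.fail_phase2 {
            return Err("simulated crash before vector write".to_string());
        }
        self.vectors.insert(chunk_id.to_string());
        // Phase 3: promote the row to committed.
        self.records.insert(embedding_id.to_string(), Status::Committed);
        Ok(())
    }

    /// Search-side visibility: only committed rows may surface.
    fn visible(&self, embedding_id: &str) -> bool {
        self.records.get(embedding_id) == Some(&Status::Committed)
    }
}

fn main() {
    // First attempt "crashes" between phase 1 and phase 2.
    let mut store = MockStore { fail_phase2: true, ..Default::default() };
    assert!(store.upsert("e1", "c1").is_err());
    assert_eq!(store.records.get("e1"), Some(&Status::Pending));
    assert!(!store.visible("e1"));

    // Retrying the same upsert promotes the pending row.
    store.fail_phase2 = false;
    store.upsert("e1", "c1").unwrap();
    assert!(store.visible("e1"));
    assert_eq!(store.vectors.len(), 1);
}
```

Because phase 1 overwrites and phase 2 is keyed on `chunk_id`, the retry needs no separate recovery path: calling `upsert` again is the recovery.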
46
migrations/V003__embedding_status.sql
Normal file
@@ -0,0 +1,46 @@
-- V003__embedding_status.sql — additive embedding lifecycle markers (§5.6).
--
-- P3-3 introduces a two-phase write to `embedding_records` paired with
-- a Lance MergeInsert. Phase 1 inserts the row at `status='pending'`;
-- phase 2 issues the Lance write; phase 3 flips the row to
-- `status='committed'`. `search` joins back through this table with
-- `WHERE status='committed'` so partial-write Lance rows never surface
-- to callers, and a crashed phase 2 retry simply re-runs against the
-- still-pending row (Lance MergeInsert dedupes on `chunk_id`).
--
-- The third state, `tombstone`, is reserved for the deletion pipeline:
-- when a chunk row goes away, the matching Lance row should also be
-- garbage-collected, but the GC scheduler is out of P3-3 scope. The
-- BEFORE DELETE trigger below stages the marker so a future GC has a
-- well-defined claim; see the comment block on the trigger for why
-- it currently coexists with V001's `ON DELETE CASCADE` FK rather than
-- replacing it.

ALTER TABLE embedding_records ADD COLUMN status TEXT NOT NULL DEFAULT 'pending'
    CHECK (status IN ('pending','committed','tombstone'));

ALTER TABLE embedding_records ADD COLUMN vector_committed INTEGER NOT NULL DEFAULT 0;

CREATE INDEX idx_embed_status ON embedding_records(status);

-- Tombstone trigger.
--
-- Intent: when a `chunks` row is about to be deleted, mark its
-- dependent `embedding_records` rows as `status='tombstone'` so a later
-- GC pass can drop the matching Lance rows in lockstep.
--
-- Caveat (carried into a future migration): V001 declared the FK as
-- `chunk_id REFERENCES chunks(chunk_id) ON DELETE CASCADE`. SQLite's
-- documented order is "BEFORE-DELETE trigger fires first, then CASCADE
-- runs", so this UPDATE will land a `tombstone` value that is
-- immediately followed by the CASCADE removing the row. The trigger is
-- therefore best-effort under the current FK; the only path that
-- actually preserves the tombstone is to drop the CASCADE (table
-- recreation, since SQLite has no DROP CONSTRAINT) — that is queued
-- for a P+ migration once the GC scheduler exists and we have actual
-- production rows to migrate. Keeping the trigger here documents the
-- design intent and gives the deletion-pipeline observer a stable hook
-- to wire into.
CREATE TRIGGER chunks_bd_tombstone_embeddings BEFORE DELETE ON chunks BEGIN
    UPDATE embedding_records SET status='tombstone' WHERE chunk_id = old.chunk_id;
END;
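
To make the caveat concrete, here is a small in-memory model of the delete path, written in Rust to match the rest of the change. It is a sketch that takes the ordering described in the comment (trigger first, then CASCADE) as given; it does not run SQLite. The `Db` type, its `cascade` flag, and `seeded` are hypothetical names, and the `cascade = false` case assumes a future schema in which dependent rows are allowed to outlive their chunk.

```rust
use std::collections::HashMap;

/// Mirrors the CHECK constraint on `embedding_records.status`.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Pending,
    Committed,
    Tombstone,
}

/// Hypothetical in-memory model of the two tables involved.
struct Db {
    chunks: Vec<String>,
    /// embedding_id -> (chunk_id, status)
    embedding_records: HashMap<String, (String, Status)>,
    /// Whether the FK carries ON DELETE CASCADE (true under V001).
    cascade: bool,
}

impl Db {
    /// One chunk `c1` with one committed embedding `e1`.
    fn seeded(cascade: bool) -> Self {
        let mut embedding_records = HashMap::new();
        embedding_records.insert("e1".to_string(), ("c1".to_string(), Status::Committed));
        Db { chunks: vec!["c1".to_string()], embedding_records, cascade }
    }

    fn delete_chunk(&mut self, chunk_id: &str) {
        // Step 1: the BEFORE DELETE trigger marks dependents as tombstone.
        for (cid, status) in self.embedding_records.values_mut() {
            if cid.as_str() == chunk_id {
                *status = Status::Tombstone;
            }
        }
        // Step 2: the chunk row itself is removed.
        self.chunks.retain(|c| c != chunk_id);
        // Step 3: with CASCADE, the dependent rows are removed as well,
        // taking the freshly written tombstone with them.
        if self.cascade {
            self.embedding_records
                .retain(|_, (cid, _)| cid.as_str() != chunk_id);
        }
    }
}

fn main() {
    // Current schema (V001 FK + V003 trigger): nothing is left for a GC pass.
    let mut current = Db::seeded(true);
    current.delete_chunk("c1");
    assert!(current.embedding_records.is_empty());

    // Assumed future schema without CASCADE: the tombstone survives.
    let mut future = Db::seeded(false);
    future.delete_chunk("c1");
    assert_eq!(future.embedding_records["e1"].1, Status::Tombstone);
}
```

Under this model the trigger's write is observable only when the cascade is absent, which is why the tombstone state stays best-effort until the table is recreated.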