First VectorStore implementation. Per-model Lance tables under
config.storage.vector_dir, two-phase upsert (SQLite-pending → Lance
MergeInsert → SQLite-committed) with crash-safe retry, search via
cosine distance with the spec's score-shift (preserves negative
similarity ranking signal that clamping would crush).

V003 migration:
- Adds status (CHECK constraint pending|committed|tombstone, default
  pending) and vector_committed columns to embedding_records.
- BEFORE DELETE trigger on chunks flips dependent rows to tombstone.
  Currently overshadowed by V001's ON DELETE CASCADE FK; trigger UPDATE
  runs first then row vanishes via CASCADE. Spec-faithful tombstone
  preservation requires recreating embedding_records to drop the
  CASCADE; deferred to a P+ migration since no production rows exist
  yet (P3-3 is the first writer). V003 SQL comment explains.

LanceVectorStore:
- ensure_table is idempotent: opens existing or creates with the Arrow
  schema (chunk_id, doc_id, embedding FixedSizeList<Float32, dim>,
  model_id, embedding_version, text, heading_path, created_at).
- IndexId computed via id_for_index with collection="chunk_embeddings",
  index_kind="flat", params_hash = blake3(descriptor JSON). Schema bumps
  automatically rotate the IndexId.
- upsert: phase-1 INSERT OR REPLACE INTO embedding_records
  (status='pending') in a single SQLite tx; phase-2 Lance MergeInsert
  keyed on chunk_id (idempotent re-run); phase-3 UPDATE
  status='committed', vector_committed=1. If phase-2 fails the rows
  stay 'pending' and the next upsert call retries idempotently.
- search joins embedding_records WHERE status='committed' so
  partial-write rows never surface. Cosine distance from Lance ∈ [0, 2]
  → similarity = 1 - distance ∈ [-1, 1] → score = (similarity + 1)/2
  ∈ [0, 1]. NaN coerced to 0 with tracing::warn. Filter by
  SearchFilters via SqliteStore::filter_chunks (added in this commit).
- Sync trait + async LanceDB bridged by an embedded current-thread
  tokio runtime. Doc-comment on the struct flags the "do NOT call from
  inside another tokio runtime" panic (block_on cannot nest). kb-app's
  job scheduler is sync today.

kb-store-sqlite additions:
- pub fn put_embedding_records_pending(&[EmbeddingRecordRow]): phase-1
  INSERT OR REPLACE (status='pending', vector_committed=0).
- pub fn mark_embedding_records_committed(&[EmbeddingId]): phase-3
  single UPDATE … WHERE embedding_id IN (?, ?, …) via params_from_iter,
  guarded by WHERE status='pending' so tombstones don't get clobbered.
- pub fn filter_chunks(&[ChunkId], &SearchFilters) → Vec<ChunkId>
  consolidates the JOIN against documents/document_tags/
  embedding_records + path_glob via globset. Lets kb-store-vector honor
  SearchFilters without depending on rusqlite or globset directly.
  (kb-search's filter logic is structurally different, interleaved with
  the FTS5 SELECT, so it stays as-is for now; consolidation is a P+
  refactor.)
- 4 new unit tests cover the phase-1 round-trip, empty batch, replay
  reset of pending rows, and the WHERE-status-pending guard.

Tests:
- 9 lib unit tests in kb-store-vector covering paths/sanitization,
  arrow_batch dim validation + descriptor hash, bm25-style cosine score
  shift math.
- 4 new kb-store-sqlite unit tests on filter_chunks (committed-only,
  tags/lang/trust/path_glob, order preservation, empty input).
- 4 new kb-store-sqlite unit tests on the embedding_records helpers.
- 8 integration tests in upsert_search.rs and 1 snapshot test marked
  #[ignore = "requires AVX-capable hardware (LanceDB)"]. They invoke
  require_avx_or_panic() at the top of each body so a missing-AVX
  --ignored run fails loudly instead of silently passing. This dev host
  (qemu64 model) lacks AVX so these were NOT exercised end-to-end here;
  first CI lane on AVX hardware will validate them.
- Snapshot fixture tests/fixtures/vector/run-1.json is a placeholder
  with an _comment marker. Snapshot test panics until the placeholder
  is replaced via KB_UPDATE_SNAPSHOTS=1 on AVX hardware.
- Workspace 241 passed, 19 ignored, 0 failed; cargo clippy --workspace
  --all-targets -- -D warnings clean.

Allowed deps respected (kb-core, kb-config, kb-store-sqlite, lancedb,
arrow + arrow-array + arrow-schema, serde, serde_json, tracing,
thiserror) plus forced waivers: anyhow (trait return type), tokio +
futures (LanceDB async-only API), blake3 (params_hash). rusqlite and
globset are NOT direct deps of kb-store-vector, confirmed via cargo
metadata --no-deps. rusqlite stays in [dev-dependencies] for the test
fixture seeder only.

Out of scope: IVF/PQ index tuning (P+), image vectors (P6), kb-app
embed_index orchestration (P3-4 facade).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
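
The score shift described in the commit message can be illustrated with a small std-only sketch. The function names `shifted_score` and `clamped_score` are hypothetical and are not taken from kb-store-vector; the real code also emits a `tracing::warn` on NaN, which is omitted here to avoid external dependencies.

```rust
/// Map a cosine distance in [0, 2] to a score in [0, 1].
/// Hypothetical helper: similarity = 1 - distance, score = (similarity + 1) / 2.
/// NaN is coerced to 0.0 (the commit also logs a warning; omitted here).
fn shifted_score(distance: f32) -> f32 {
    if distance.is_nan() {
        return 0.0;
    }
    let similarity = 1.0 - distance; // in [-1, 1] for distance in [0, 2]
    (similarity + 1.0) / 2.0 // in [0, 1]
}

/// The alternative the commit message argues against: clamp negative
/// similarity to zero, which makes all "opposed" vectors tie.
fn clamped_score(distance: f32) -> f32 {
    (1.0 - distance).max(0.0)
}

fn main() {
    // Two results with negative similarity (-0.2 and -0.8).
    let (near, far) = (1.2_f32, 1.8_f32);

    // The shift keeps them ordered; clamping collapses both to 0.
    assert!(shifted_score(near) > shifted_score(far));
    assert_eq!(clamped_score(near), clamped_score(far));

    for d in [0.0_f32, 1.0, 2.0, f32::NAN] {
        println!("distance={d} shifted={}", shifted_score(d));
    }
}
```

Under these assumptions the shift is a monotone map, so the ranking of results is unchanged and only the range moves from [-1, 1] to [0, 1].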
318 lines
12 KiB
Rust
//! Embedding-records writers used by `kb-store-vector` (P3-3).
//!
//! The `VectorStore` impl in `kb-store-vector` performs a two-phase write:
//! phase 1 stages an `embedding_records` row at `status='pending'` before
//! issuing the Lance write, and phase 3 promotes those same rows to
//! `status='committed'` after the Lance commit lands. We surface those
//! two SQL statements here (rather than expose a generic write
//! connection) so the SQL stays inside the crate that owns the schema:
//! kb-store-vector consumes a typed, narrowly-scoped API and never
//! touches the connection mutex itself.
//!
//! Each helper runs inside a single SQLite transaction: phase 1 issues
//! one `INSERT OR REPLACE` per row and phase 3 a single batched
//! `UPDATE`. A failure rolls the whole batch back, so a batch is never
//! left half-staged (phase 1) or half-committed (phase 3).

use anyhow::{Context, Result};
use rusqlite::{params, params_from_iter};
use time::OffsetDateTime;
use time::format_description::well_known::Rfc3339;

use crate::error::StoreError;
use crate::store::SqliteStore;

/// Row payload for [`SqliteStore::put_embedding_records_pending`].
///
/// Mirrors the columns of `embedding_records` minus the lifecycle markers
/// (`status` and `vector_committed`); those are forced to `'pending'`
/// and `0` by phase 1.
///
/// `created_at` is `OffsetDateTime` rather than a pre-formatted string so
/// the helper owns the RFC3339 formatting (the same formatting choice
/// the asset / document / job writers make).
#[derive(Clone, Debug)]
pub struct EmbeddingRecordRow {
    pub embedding_id: String,
    pub chunk_id: String,
    pub model_id: String,
    pub model_version: String,
    pub dimensions: usize,
    pub lance_table: String,
    pub created_at: OffsetDateTime,
}

impl SqliteStore {
    /// Phase 1 of the kb-store-vector two-phase write: stage every
    /// `embedding_records` row with `status='pending'`,
    /// `vector_committed=0`. `INSERT OR REPLACE` (rather than UPSERT) is
    /// the right shape here because re-running phase 1 for an
    /// already-pending row resets `vector_committed` to 0 and the
    /// `created_at` to the new attempt's timestamp. Both are desired,
    /// because a retry should look like a fresh attempt to the GC pass.
    ///
    /// All rows are written in a single transaction; if any row fails
    /// the entire batch is rolled back and the caller can retry without
    /// worrying about partial pending state.
    pub fn put_embedding_records_pending(
        &self,
        rows: &[EmbeddingRecordRow],
    ) -> Result<()> {
        if rows.is_empty() {
            return Ok(());
        }
        let mut conn = self.lock_conn();
        let tx = conn.transaction().map_err(StoreError::from)?;
        {
            let mut stmt = tx
                .prepare(
                    "INSERT OR REPLACE INTO embedding_records (
                        embedding_id, chunk_id, model_id, model_version,
                        dimensions, lance_table, created_at,
                        status, vector_committed
                    ) VALUES (?, ?, ?, ?, ?, ?, ?, 'pending', 0)",
                )
                .map_err(StoreError::from)?;
            for row in rows {
                let created_at = row
                    .created_at
                    .format(&Rfc3339)
                    .context("format embedding_records.created_at")?;
                stmt.execute(params![
                    row.embedding_id,
                    row.chunk_id,
                    row.model_id,
                    row.model_version,
                    row.dimensions as i64,
                    row.lance_table,
                    created_at,
                ])
                .map_err(StoreError::from)?;
            }
        }
        tx.commit().map_err(StoreError::from)?;
        Ok(())
    }

    /// Phase 3 of the kb-store-vector two-phase write: after the Lance
    /// MergeInsert commits, flip the listed embedding rows to
    /// `status='committed'`, `vector_committed=1`. Rows that aren't
    /// currently `pending` (e.g. already committed by a duplicate batch,
    /// or tombstoned by a chunks DELETE between phase 1 and phase 3)
    /// are deliberately left alone via `WHERE status='pending'`: we
    /// never resurrect a tombstone, and we never blindly re-mark a
    /// committed row.
    ///
    /// All updates run in a single statement (single SQL `UPDATE …
    /// WHERE embedding_id IN (?, ?, …)`) inside one transaction, which
    /// avoids the per-row `execute()` round-trip the previous
    /// implementation paid.
    pub fn mark_embedding_records_committed(
        &self,
        embedding_ids: &[String],
    ) -> Result<()> {
        if embedding_ids.is_empty() {
            return Ok(());
        }
        let mut conn = self.lock_conn();
        let tx = conn.transaction().map_err(StoreError::from)?;
        {
            let placeholders = std::iter::repeat_n("?", embedding_ids.len())
                .collect::<Vec<_>>()
                .join(",");
            let sql = format!(
                "UPDATE embedding_records
                 SET status='committed', vector_committed=1
                 WHERE status='pending'
                 AND embedding_id IN ({placeholders})"
            );
            tx.execute(&sql, params_from_iter(embedding_ids.iter()))
                .map_err(StoreError::from)?;
        }
        tx.commit().map_err(StoreError::from)?;
        Ok(())
    }
}

#[cfg(test)]
mod tests {
    use super::*;
    use kb_config::Config;
    use tempfile::TempDir;
    use time::OffsetDateTime;

    /// Minimal config pointing at a tempdir for the SQLite file.
    fn config_for(tmp: &TempDir) -> Config {
        let mut c = Config::defaults();
        c.storage.data_dir = tmp.path().to_string_lossy().into_owned();
        c
    }

    /// Seed a chunks row + the doc / asset rows it FKs to. The minimum
    /// needed for embedding_records inserts not to fail the FK to
    /// chunks.
    fn seed_chunk(store: &SqliteStore, chunk_id: &str) {
        let conn = store.lock_conn();
        // Asset, document, chunk: all hand-rolled at the SQL layer to
        // keep the test self-contained (no kb-parse/kb-chunk dep).
        conn.execute(
            "INSERT INTO assets (
                asset_id, source_uri, workspace_path, media_type, byte_len,
                checksum, storage_kind, storage_path, discovered_at
            ) VALUES (?, ?, ?, ?, ?, ?, 'reference', '/tmp/x', ?)",
            params![
                "0123456789abcdef0123456789abcdef",
                "file:///tmp/x",
                "x.md",
                "{}",
                0_i64,
                "deadbeef",
                "1970-01-01T00:00:00Z",
            ],
        )
        .unwrap();
        conn.execute(
            "INSERT INTO documents (
                doc_id, asset_id, workspace_path, title, lang, source_type,
                trust_level, parser_version, doc_version, schema_version,
                metadata_json, provenance_json, created_at, updated_at
            ) VALUES (?, ?, ?, NULL, NULL, 'fs', 'unverified', 'v1', 1, 1, '{}', '{}', ?, ?)",
            params![
                "fedcba9876543210fedcba9876543210",
                "0123456789abcdef0123456789abcdef",
                "x.md",
                "1970-01-01T00:00:00Z",
                "1970-01-01T00:00:00Z",
            ],
        )
        .unwrap();
        conn.execute(
            "INSERT INTO chunks (
                chunk_id, doc_id, text, heading_path_json, section_label,
                source_spans_json, token_estimate, chunker_version,
                policy_hash, block_ids_json, created_at
            ) VALUES (?, ?, 'hi', '[]', NULL, '[]', 1, 'v1', 'hash', '[]', ?)",
            params![chunk_id, "fedcba9876543210fedcba9876543210", "1970-01-01T00:00:00Z"],
        )
        .unwrap();
    }

    fn open_store(tmp: &TempDir) -> SqliteStore {
        let cfg = config_for(tmp);
        let store = SqliteStore::open(&cfg).unwrap();
        store.run_migrations().unwrap();
        store
    }

    #[test]
    fn pending_then_committed_round_trip() {
        let tmp = TempDir::new().unwrap();
        let store = open_store(&tmp);
        let chunk = "11112222333344445555666677778888";
        seed_chunk(&store, chunk);

        let row = EmbeddingRecordRow {
            embedding_id: "aaaa1111bbbb2222cccc3333dddd4444".to_string(),
            chunk_id: chunk.to_string(),
            model_id: "test-model".to_string(),
            model_version: "v1".to_string(),
            dimensions: 4,
            lance_table: "chunk_embeddings_test_model_4".to_string(),
            created_at: OffsetDateTime::now_utc(),
        };
        store
            .put_embedding_records_pending(std::slice::from_ref(&row))
            .unwrap();

        // Inspect: the row exists at status='pending'.
        {
            let conn = store.read_conn();
            let (status, committed): (String, i64) = conn
                .query_row(
                    "SELECT status, vector_committed FROM embedding_records WHERE embedding_id = ?",
                    params![row.embedding_id],
                    |r| Ok((r.get(0)?, r.get(1)?)),
                )
                .unwrap();
            assert_eq!(status, "pending");
            assert_eq!(committed, 0);
        }

        store
            .mark_embedding_records_committed(std::slice::from_ref(&row.embedding_id))
            .unwrap();
        {
            let conn = store.read_conn();
            let (status, committed): (String, i64) = conn
                .query_row(
                    "SELECT status, vector_committed FROM embedding_records WHERE embedding_id = ?",
                    params![row.embedding_id],
                    |r| Ok((r.get(0)?, r.get(1)?)),
                )
                .unwrap();
            assert_eq!(status, "committed");
            assert_eq!(committed, 1);
        }
    }

    #[test]
    fn empty_batches_are_noops() {
        let tmp = TempDir::new().unwrap();
        let store = open_store(&tmp);
        store.put_embedding_records_pending(&[]).unwrap();
        store.mark_embedding_records_committed(&[]).unwrap();
    }

    #[test]
    fn replay_phase_one_resets_vector_committed() {
        // INSERT OR REPLACE: a phase-1 retry on a row that briefly
        // reached `committed` (in some adversarial out-of-order replay)
        // resets it to `pending`. Confirms the documented semantics.
        let tmp = TempDir::new().unwrap();
        let store = open_store(&tmp);
        let chunk = "11112222333344445555666677778888";
        seed_chunk(&store, chunk);

        let row = EmbeddingRecordRow {
            embedding_id: "aaaa1111bbbb2222cccc3333dddd4444".to_string(),
            chunk_id: chunk.to_string(),
            model_id: "test-model".to_string(),
            model_version: "v1".to_string(),
            dimensions: 4,
            lance_table: "chunk_embeddings_test_model_4".to_string(),
            created_at: OffsetDateTime::now_utc(),
        };
        store
            .put_embedding_records_pending(std::slice::from_ref(&row))
            .unwrap();
        store
            .mark_embedding_records_committed(std::slice::from_ref(&row.embedding_id))
            .unwrap();
        store
            .put_embedding_records_pending(std::slice::from_ref(&row))
            .unwrap();

        let conn = store.read_conn();
        let status: String = conn
            .query_row(
                "SELECT status FROM embedding_records WHERE embedding_id = ?",
                params![row.embedding_id],
                |r| r.get(0),
            )
            .unwrap();
        assert_eq!(status, "pending");
    }

    #[test]
    fn mark_committed_skips_non_pending() {
        // The phase-3 UPDATE explicitly filters `status='pending'`, so
        // calling it on an embedding_id that was never staged (or that
        // already became a tombstone) is a no-op rather than an error.
        let tmp = TempDir::new().unwrap();
        let store = open_store(&tmp);
        store
            .mark_embedding_records_committed(&["does-not-exist".to_string()])
            .unwrap();
    }
}
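
The row lifecycle that the two helpers enforce can be modelled without SQLite. The sketch below is an in-memory stand-in, not the crate's code: `Records`, `stage_pending`, `tombstone`, and `placeholders` are hypothetical names, and `mark_committed` returns a changed-row count purely for illustration (the real helper returns `Result<()>`). It assumes phase 1 behaves like `INSERT OR REPLACE` and phase 3 only touches rows that are still pending.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Status {
    Pending,
    Committed,
    Tombstone,
}

/// Hypothetical in-memory stand-in for the `embedding_records` table.
#[derive(Default)]
struct Records {
    rows: HashMap<String, Status>,
}

impl Records {
    /// Phase 1: like INSERT OR REPLACE, (re)stage every id as pending,
    /// overwriting whatever status the row had before.
    fn stage_pending(&mut self, ids: &[&str]) {
        for id in ids {
            self.rows.insert(id.to_string(), Status::Pending);
        }
    }

    /// Phase 3: promote only rows that are still pending. Unknown,
    /// committed, and tombstoned ids are skipped. Returns rows changed.
    fn mark_committed(&mut self, ids: &[&str]) -> usize {
        let mut changed = 0;
        for id in ids {
            if let Some(status) = self.rows.get_mut(*id) {
                if *status == Status::Pending {
                    *status = Status::Committed;
                    changed += 1;
                }
            }
        }
        changed
    }

    /// Stand-in for the BEFORE DELETE trigger on chunks.
    fn tombstone(&mut self, id: &str) {
        if let Some(status) = self.rows.get_mut(id) {
            *status = Status::Tombstone;
        }
    }

    fn status(&self, id: &str) -> Option<Status> {
        self.rows.get(id).copied()
    }
}

/// Build the "?,?,?" list for an `IN (...)` clause with `n` parameters.
fn placeholders(n: usize) -> String {
    vec!["?"; n].join(",")
}

fn main() {
    let mut records = Records::default();
    records.stage_pending(&["a", "b"]);
    records.tombstone("b"); // chunk deleted between phase 1 and phase 3

    let changed = records.mark_committed(&["a", "b", "missing"]);
    println!("changed={changed} a={:?} b={:?}", records.status("a"), records.status("b"));
    println!("IN ({})", placeholders(3));
}
```

The file above builds its placeholder list with `std::iter::repeat_n`; the `vec!` form here is an equivalent that also compiles on toolchains older than the one that stabilised `repeat_n`.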