kebab/tasks/p3/p3-3-lancedb-store.md
altair823 f9714aa5cb docs(rename): kb → kebab — README, tasks/, docs/, design doc, report
Final commit. Bulk-updates every occurrence of `kb` in all .md files.

- 19 crate names (`kb-core`, `kb-app`, …) → `kebab-*` (including the
  Rust module path notation `kb_*` → `kebab_*`).
- Future components (`kb-tui`, `kb-desktop`, `kb-asr-whisper`, `kb-ocr`,
  `kb-mcp`, `kb-vlm`, `kb-rerank`, `kb-vision-ocr`, `kb-index`,
  `kb-smoke`, `kb-architecture`) → `kebab-*` (the same prefix will be
  used when P6+ starts).
- CLI command examples: `kb ingest` / `kb search` / `kb ask` / `kb init` /
  `kb doctor` / `kb inspect` / `kb list` / `kb eval` →
  `kebab <verb>`, in both fenced code blocks and inline backticks.
- XDG paths, env vars, and binary paths (`target/release/kb` →
  `target/release/kebab`) synchronized.
- All references unified across the design doc, initial report, SMOKE,
  HOTFIXES, phase epics, and task specs.
- `git -c user.name=kb` in task-decomposition.md is left unchanged
  because it is author information recording past git history (the
  author in the actual git history cannot be changed).
- `status: planned` → `completed` in `tasks/phase-5-evaluation.md` as
  well (not yet reflected after the P5-1 + P5-2 PRs were merged).

## Verification

- `grep -rEn "\bkb-[a-z]|\bkb_[a-z]|\.config/kb\b|kb\.sqlite|\bKB_[A-Z]"
   --include="*.md"` 0 hits (excluding the git author in
  task-decomposition.md).
- All file path references are live (renamed files are all updated to
  their new paths).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 04:01:55 +00:00

7.7 KiB

phase: P3
component: kebab-store-vector (LanceDB)
task_id: p3-3
title: LanceDB VectorStore + embedding_records writer
status: completed
depends_on: p3-2, p1-6
unblocks: p3-4
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
contract_sections: §5.6 embedding_records; §6.3 lancedb table naming; §7.2 VectorStore; §9 versioning

p3-3 — LanceDB VectorStore

Goal

Implement VectorStore over LanceDB (embedded). Stores per-model tables (chunk_embeddings_<model>_<dim>.lance), upserts vectors transactionally with a row in embedding_records (SQLite), and serves search for the vector retrieval mode.
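
The per-model table name can be derived mechanically from the model ID and dimension. The following is a minimal sketch, assuming the sanitization rule stated in the behavior contract of this spec (every character outside [A-Za-z0-9-] becomes _). The function names `sanitize_model_id` and `table_name` are hypothetical and not part of the public surface.

```rust
/// Replace every character outside [A-Za-z0-9-] with '_'.
/// Hypothetical helper; the name is not taken from the spec.
fn sanitize_model_id(model_id: &str) -> String {
    model_id
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '-' { c } else { '_' })
        .collect()
}

/// Build the per-model table directory name
/// `chunk_embeddings_<model>_<dim>.lance`.
fn table_name(model_id: &str, dim: usize) -> String {
    format!("chunk_embeddings_{}_{}.lance", sanitize_model_id(model_id), dim)
}
```

Under this rule two distinct model IDs can map to the same table name (for example `a/b` and `a_b`), so a caller that needs strict isolation would have to detect that case separately.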

Why now / why this size

Closes the loop chunk → vector. Splits cleanly from kebab-search so hybrid (p3-4) can compose lexical + vector retrievers without leaking storage details.

Allowed dependencies

  • kebab-core
  • kebab-config
  • kebab-store-sqlite (only for writing/reading rows in embedding_records)
  • lancedb
  • arrow (and arrow-array, arrow-schema)
  • serde, serde_json
  • tracing
  • thiserror

Forbidden dependencies

  • kebab-source-fs, kebab-parse-md, kebab-normalize, kebab-chunk, kebab-embed* (consumes Vec<f32> via input only — no embedding logic here), kebab-search, kebab-llm*, kebab-rag, kebab-tui, kebab-desktop

Inputs

| input | type | source |
|---|---|---|
| VectorRecord[..] | kebab_core::VectorRecord | kebab-app::embed_index (P3 facade) |
| query vector | &[f32] | kebab-embed-local (Embedder::embed for query) |
| filters | kebab_core::SearchFilters | SearchQuery |
| kebab-config::Config.storage.vector_dir | path | runtime |

Outputs

| output | type | downstream |
|---|---|---|
| Lance tables under vector_dir/chunk_embeddings_<model>_<dim>.lance/ | filesystem | future searches |
| embedding_records rows | SQLite | reverse lookup, reindex bookkeeping |
| Vec<VectorHit> | kebab_core::VectorHit | hybrid retriever (p3-4) |
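
Each VectorHit carries a score that the behavior contract derives from the Lance cosine distance by shifting rather than clamping. A minimal sketch of that conversion follows. The function name `distance_to_score` is hypothetical, and the final clamp against tiny floating-point overshoot is an assumption of this sketch, not something the spec states.

```rust
/// Map a cosine distance in [0, 2] to a score in [0, 1] by shifting.
/// Hypothetical helper name; not part of the spec's public surface.
fn distance_to_score(distance: f32) -> f32 {
    if distance.is_nan() {
        // Sentinel case from the contract: NaN becomes score 0.
        // The real store would also log a warning here.
        return 0.0;
    }
    let similarity = 1.0 - distance; // in [-1, 1]
    let score = (similarity + 1.0) / 2.0; // in [0, 1], order preserved
    // Assumption: guard against rounding that lands just outside [0, 1].
    score.clamp(0.0, 1.0)
}
```

Because the mapping is monotonic, ranking by score gives the same order as ranking by ascending distance.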

Public surface (signatures only — no new types)

pub struct LanceVectorStore { /* internal: connection + sqlite handle */ }

impl LanceVectorStore {
    pub fn new(config: &kebab_config::Config, sqlite: std::sync::Arc<kebab_store_sqlite::SqliteStore>) -> anyhow::Result<Self>;
}

impl kebab_core::VectorStore for LanceVectorStore {
    fn ensure_table(&self, model: &kebab_core::EmbeddingModelId, dim: usize) -> anyhow::Result<kebab_core::IndexId>;
    fn upsert(&self, recs: &[kebab_core::VectorRecord]) -> anyhow::Result<()>;
    fn search(&self, query_vec: &[f32], k: usize, filters: &kebab_core::SearchFilters) -> anyhow::Result<Vec<kebab_core::VectorHit>>;
}

Behavior contract

  • Table naming: chunk_embeddings_<model_id>_<dim>.lance. Model IDs must be sanitized (replace every character outside [A-Za-z0-9-] with _) so the name is safe to use as a filesystem path.
  • ensure_table is idempotent: opens existing or creates with explicit Arrow schema:
    chunk_id : Utf8 (primary)
    doc_id   : Utf8
    embedding: FixedSizeList<Float32, dim>
    model_id : Utf8
    embedding_version : Utf8
    text     : Utf8
    heading_path : Utf8
    created_at : Timestamp(Microsecond, UTC)
    
  • For corpora < 100k rows, no IVF index — flat cosine. Above that threshold, the next migration task (P+) introduces IVF; this task does not.
  • upsert ordering: SQLite-first, Lance-second with an explicit 3-state marker so reconciliation is unambiguous (no "best-effort 2PC" hand-wave).
    1. INSERT OR REPLACE INTO embedding_records (..., status='pending', vector_committed=0) for every input row (single SQLite tx).
    2. Issue Lance upsert (MergeInsert keyed on chunk_id).
    3. On Lance success: UPDATE embedding_records SET status='committed', vector_committed=1 WHERE embedding_id IN (...).
    4. On Lance failure or process crash: rows stay at status='pending'. Next upsert re-tries them automatically (idempotent — Lance MergeInsert dedupes on chunk_id).
  • embedding_records.status is the single source of truth: search joins embedding_records and filters WHERE status='committed', so partial-write Lance rows are never returned even if they exist on disk. This guarantees search results' embedding_id always points at a committed Lance row.
  • Adds two columns to embedding_records (additive — V003__embedding_status.sql migration, not a v1 wire schema change): status TEXT NOT NULL CHECK (status IN ('pending','committed','tombstone')) default 'pending', and vector_committed INTEGER NOT NULL DEFAULT 0.
  • Tombstones: when a chunk is deleted (CASCADE from chunks), a BEFORE DELETE trigger flips status='tombstone' instead of letting the row be deleted, so a later GC can drop the matching Lance row in lockstep. GC scheduling itself is out of scope for v1; reserving the slot here keeps the schema honest.
  • Dimension mismatch (record dim ≠ table dim) returns anyhow::Error from upsert and writes nothing.
  • search ranks by cosine similarity and applies SearchFilters after the fetch. Because rows are filtered before the limit is applied, it over-fetches internally: fetch 2 * k, filter, then trim to k.
  • VectorHit { chunk_id, score, doc_id, text, heading_path }. LanceDB returns cosine distance = 1 - cosine_similarity, so a similarity in [-1, 1] maps to a distance in [0, 2]. Convert with similarity = 1.0 - distance, then shift to [0, 1] via score = (similarity + 1.0) / 2.0 rather than clamping. Clamping would crush every negative similarity to 0 and discard the ranking signal between "unrelated" (sim ≈ 0) and "opposite" (sim ≈ -1); the shift preserves order. Clamping is reserved for floating-point sentinels (NaN → score 0, log a warning).
  • search returns empty Vec (not error) when table absent.
  • index_id for ensure_table per design §4.2 with collection = "chunk_embeddings", index_kind = "flat", params_hash = blake3(serde_json(table_schema)).
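
The SQLite-first, Lance-second ordering above can be sketched as an in-memory state machine. This is an illustration under stated assumptions, not the store's implementation: `FakeStore`, its fields, and the `fail_after` switch are hypothetical, the tombstone state is omitted, and the sketch ignores where the vectors for retried rows come from.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Pending,
    Committed,
}

/// In-memory stand-ins: `records` plays embedding_records,
/// `lance` plays the Lance table (chunk_ids only).
#[derive(Default)]
struct FakeStore {
    records: BTreeMap<String, Status>,
    lance: BTreeSet<String>,
}

impl FakeStore {
    /// `fail_after = Some(n)` simulates a Lance failure after n rows.
    fn upsert(&mut self, chunk_ids: &[&str], fail_after: Option<usize>) -> Result<(), String> {
        // Step 1: mark every input row pending.
        for id in chunk_ids {
            self.records.insert(id.to_string(), Status::Pending);
        }
        // Pending rows left over from earlier failed batches are retried too.
        let batch: Vec<String> = self
            .records
            .iter()
            .filter(|(_, status)| **status == Status::Pending)
            .map(|(id, _)| id.clone())
            .collect();
        // Step 2: merge-insert keyed on chunk_id (a set insert is idempotent).
        for (i, id) in batch.iter().enumerate() {
            if fail_after == Some(i) {
                // Step 4: rows stay pending, partial Lance rows may remain.
                return Err(format!("simulated Lance failure at row {i}"));
            }
            self.lance.insert(id.clone());
        }
        // Step 3: flip to committed only after Lance succeeded.
        for id in batch {
            self.records.insert(id, Status::Committed);
        }
        Ok(())
    }

    /// Search only sees rows whose status is committed.
    fn visible(&self) -> Vec<String> {
        self.records
            .iter()
            .filter(|(_, status)| **status == Status::Committed)
            .map(|(id, _)| id.clone())
            .collect()
    }
}
```

After a failed batch the partially written row exists in `lance` but stays invisible; the next successful upsert commits it along with the new rows.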

Storage / wire effects

  • Writes Lance tables under data_dir/lancedb/.
  • Writes/reads embedding_records rows.
  • Does not read chunks/documents itself: the caller pre-fetches text + heading and passes them in via VectorRecord.

Test plan

| kind | description | fixture / data |
|---|---|---|
| unit | ensure_table creates dir; second call returns same IndexId | tmp data_dir |
| unit | upsert of 10 records makes them retrievable via search (k=5) | tmp data_dir |
| unit | dimension mismatch → error, no Lance row written | tmp data_dir |
| unit | filter tags_any removes non-matching docs | tmp data_dir + seeded sqlite tags |
| unit | model isolation: two models live in two directories with same chunk_id | tmp data_dir |
| unit | search before any upsert returns empty Vec | tmp data_dir |
| determinism | same query vector + same data → same top-k order | tmp data_dir |
| snapshot | Vec<VectorHit> JSON for fixed corpus stable | fixtures/vector/run-1.json |

All tests under cargo test -p kebab-store-vector.
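
The determinism row of the test plan needs a total order over hits, including ties on score. One way to get it, sketched here as an assumption rather than the spec's stated method, is to sort by score descending and break ties on chunk_id. `Hit` is a hypothetical stand-in for kebab_core::VectorHit with only the two fields the ordering needs.

```rust
/// Hypothetical stand-in for kebab_core::VectorHit.
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    chunk_id: String,
    score: f32,
}

/// Sort by score descending, break ties by chunk_id ascending,
/// then keep the first k hits.
fn top_k(mut hits: Vec<Hit>, k: usize) -> Vec<Hit> {
    hits.sort_by(|a, b| {
        // total_cmp gives a total order on f32, so the sort never
        // depends on the input order of equal or unusual values.
        b.score
            .total_cmp(&a.score)
            .then_with(|| a.chunk_id.cmp(&b.chunk_id))
    });
    hits.truncate(k);
    hits
}
```

With this ordering, two runs over the same rows return the same top-k even if the underlying scan yields the rows in a different sequence.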

Definition of Done

  • cargo check -p kebab-store-vector passes
  • cargo test -p kebab-store-vector passes
  • No imports outside Allowed dependencies
  • embedding_records rows align 1:1 with Lance rows after a successful upsert batch
  • PR links design §5.6, §6.3, §7.2

Out of scope

  • IVF / PQ index tuning (P+).
  • Image / multimodal vector tables (P6).
  • kebab-app orchestration of indexing jobs (embed_index facade method body).

Risks / notes

  • LanceDB's Rust API requires Arrow batches; constructing them per upsert is allocation-heavy — batch by configurable chunk size to avoid memory spikes.
  • Filter-then-limit can starve k results; over-fetch by 2 * k initially and double on retry up to a cap.
  • WAL stability: ensure Lance commits before the SQLite UPDATE that flips embedding_records to status='committed', so no committed SQLite row is left without a Lance vector.
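
The over-fetch note above (start at 2 * k, double on retry up to a cap) can be sketched as a small generic loop. The function `search_with_overfetch` and its `fetch` and `keep` parameters are hypothetical: `fetch(n)` stands in for a nearest-neighbour query returning up to n rows in rank order, and `keep` stands in for the SearchFilters check.

```rust
/// Fetch 2 * k candidates, filter them, and double the fetch size until
/// k rows survive, the source is exhausted, or `cap` is reached.
fn search_with_overfetch<T, F, P>(k: usize, cap: usize, mut fetch: F, keep: P) -> Vec<T>
where
    F: FnMut(usize) -> Vec<T>,
    P: Fn(&T) -> bool,
{
    if k == 0 {
        return Vec::new();
    }
    // Never ask for zero rows, and never start above the cap.
    let mut n = (2 * k).min(cap.max(1));
    loop {
        let candidates = fetch(n);
        // Fewer rows than requested means the table has no more to give.
        let exhausted = candidates.len() < n;
        let mut kept: Vec<T> = candidates.into_iter().filter(|c| keep(c)).collect();
        if kept.len() >= k || exhausted || n >= cap {
            kept.truncate(k);
            return kept;
        }
        n = (n * 2).min(cap);
    }
}
```

Each retry re-runs the whole query with a larger limit, so the cap bounds the worst-case cost when a filter matches very few rows.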