tasks: add P3 component specs (embedder, fastembed, lancedb, hybrid)

kb
2026-04-27 11:59:46 +00:00
parent c044b97a34
commit 5b813ce39e
4 changed files with 496 additions and 0 deletions

@@ -0,0 +1,100 @@
---
phase: P3
component: kb-embed (trait crate)
task_id: p3-1
title: "Embedder trait + EmbeddingInput/Kind validation"
status: planned
depends_on: [p0-1]
unblocks: [p3-2, p3-3, p3-4]
contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
contract_sections: [§3.7 SearchHit.embedding_model, §7.1 EmbeddingInput/Kind, §7.2 Embedder, §11 LLM/embedding split]
---
# p3-1 — Embedder trait crate
## Goal
Provide the `kb-embed` crate, which re-exports the `Embedder` trait and `EmbeddingInput`/`EmbeddingKind` from `kb-core` and offers a mock implementation for downstream tests. This task is **trait-only**; concrete adapters live in p3-2.
## Why now / why this size
Concrete adapters (fastembed, ollama-embed, candle) need a stable trait surface. Owning the trait + a mock implementation in a tiny crate keeps `kb-store-vector` and `kb-search` testable without touching real models.
## Allowed dependencies
- `kb-core`
- `kb-config`
- `serde`
- `thiserror`
- `tracing`
## Forbidden dependencies
- `fastembed`, `ort`, `tokenizers`, `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-store-*`, `kb-search`, `kb-llm*`, `kb-rag`, `kb-tui`, `kb-desktop`
## Inputs
| input | type | source |
|-------|------|--------|
| `EmbeddingInput` | `kb_core::EmbeddingInput<'_>` | callers (parser-side or query-side) |
| model identity | `(EmbeddingModelId, EmbeddingVersion, dimensions)` | adapter at construction |
## Outputs
| output | type | downstream |
|--------|------|------------|
| `Vec<Vec<f32>>` | row-aligned with input | `kb-store-vector`, `kb-search` (vector mode) |
## Public surface (signatures only — no new types)
```rust
pub use kb_core::{EmbeddingInput, EmbeddingKind, EmbeddingModelId, EmbeddingVersion, Embedder};

/// Test-only mock that produces deterministic vectors.
pub struct MockEmbedder { /* internal: model_id, dims, seed */ }

impl MockEmbedder {
    pub fn new(model_id: kb_core::EmbeddingModelId, version: kb_core::EmbeddingVersion, dimensions: usize) -> Self;
}

impl kb_core::Embedder for MockEmbedder { /* per §7.2 */ }
```
## Behavior contract
- `MockEmbedder::embed` produces vectors deterministically from `(text, kind)`: e.g., `vector[i] = hash_to_unit_float(text, kind, i, seed)` so two identical inputs produce identical vectors and different inputs produce nearly-orthogonal vectors. Used by downstream tests.
- `MockEmbedder` must respect `EmbeddingKind::Document` vs `Query`: a different prefix is mixed into the hash for each kind, so query embeddings differ from document embeddings of the same text (mirrors real e5 behavior).
- `dimensions()` returns the value passed at construction; callers may rely on it without re-checking.
- Real adapters (p3-2) MUST NOT be implemented in this crate; the only `Embedder` impl here is `MockEmbedder`.
- The crate may expose a tiny helper `pub fn assert_vector_shape(vecs: &[Vec<f32>], expected_dims: usize)` for downstream tests.
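The deterministic-vector rule above can be sketched as follows. This is a minimal illustration, not the crate's implementation: the local `EmbeddingKind` enum stands in for the `kb-core` type, and `mock_embed`, the FNV-1a hash, and the `"passage: "`/`"query: "` prefixes are assumptions chosen for the example. Only the names `hash_to_unit_float` and `assert_vector_shape` come from the spec.

```rust
// Local stand-in so the sketch compiles on its own; the real type lives in kb-core.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum EmbeddingKind {
    Document,
    Query,
}

/// FNV-1a over `bytes`, continuing from `state` (assumed hash; any stable hash works).
fn fnv1a(mut state: u64, bytes: &[u8]) -> u64 {
    for &b in bytes {
        state ^= u64::from(b);
        state = state.wrapping_mul(0x0000_0100_0000_01b3);
    }
    state
}

/// Maps (text, kind, component index, seed) to a float in [-1, 1).
pub fn hash_to_unit_float(text: &str, kind: EmbeddingKind, i: usize, seed: u64) -> f32 {
    // A kind-specific prefix makes Query and Document vectors differ for the same text.
    let prefix: &[u8] = match kind {
        EmbeddingKind::Document => b"passage: ",
        EmbeddingKind::Query => b"query: ",
    };
    let mut h = fnv1a(0xcbf2_9ce4_8422_2325 ^ seed, prefix);
    h = fnv1a(h, text.as_bytes());
    h = fnv1a(h, &(i as u64).to_le_bytes());
    // Keep the top 24 bits so the value is exactly representable as an f32.
    let unit = (h >> 40) as f32 / (1u64 << 24) as f32; // in [0, 1)
    unit * 2.0 - 1.0
}

/// Hypothetical free-function form of `MockEmbedder::embed` for a single input.
pub fn mock_embed(text: &str, kind: EmbeddingKind, dims: usize, seed: u64) -> Vec<f32> {
    (0..dims).map(|i| hash_to_unit_float(text, kind, i, seed)).collect()
}

/// Panics if any row has the wrong length or contains a non-finite value.
pub fn assert_vector_shape(vecs: &[Vec<f32>], expected_dims: usize) {
    for (row, v) in vecs.iter().enumerate() {
        assert_eq!(v.len(), expected_dims, "row {row} has wrong length");
        assert!(v.iter().all(|x| x.is_finite()), "row {row} has a non-finite value");
    }
}
```

A hand-rolled hash is used here rather than `std::collections::hash_map::DefaultHasher` because the standard library does not promise that its algorithm stays the same across releases, which would make snapshot fixtures fragile.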
## Storage / wire effects
- None.
## Test plan
| kind | description | fixture / data |
|------|-------------|----------------|
| unit | trait dyn dispatch via `Box<dyn Embedder>` works | inline |
| unit | `MockEmbedder` produces identical vector for identical input | inline |
| unit | `EmbeddingKind::Document` vs `Query` for same text yield different vectors | inline |
| unit | dimensions match construction-time value | inline |
| contract | property test: 100 random inputs, each vector has length == dimensions, all finite floats | inline (proptest) |
All tests under `cargo test -p kb-embed`.
## Definition of Done
- [ ] `cargo check -p kb-embed` passes
- [ ] `cargo test -p kb-embed` passes
- [ ] No external embedding dep present
- [ ] PR links design §7.2 Embedder, §11
## Out of scope
- Real adapter (`kb-embed-local` is p3-2).
- Reranker traits (P+).
## Risks / notes
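- `MockEmbedder` is for tests; do not let it leak into release builds via default features. Gate it behind a non-default `mock` feature flag. `cfg(test)` alone is not enough, because it only applies to this crate's own tests and would hide the mock from the downstream crates that need it.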
- Trait re-exports keep the call site stable even if `kb-core` reorganizes; downstream crates should `use kb_embed::Embedder` rather than `use kb_core::Embedder`.

@@ -0,0 +1,119 @@
---
phase: P3
component: kb-embed-local (fastembed adapter)
task_id: p3-2
title: "fastembed-rs Embedder for multilingual-e5-small"
status: planned
depends_on: [p3-1]
unblocks: [p3-3, p3-4]
contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
contract_sections: [§7.2 Embedder, §11.3 local embedding, §6.4 [models.embedding], §9 versioning]
---
# p3-2 — fastembed adapter
## Goal
Provide `FastembedEmbedder` implementing `Embedder` for `multilingual-e5-small` (default) using `fastembed-rs` (ONNX runtime). Apply Document/Query prefix per §11.3. Honor `batch_size` from config.
## Why now / why this size
First real `Embedder`. Drives `EmbeddingId` recipe (model_id + model_version + dims) downstream. Isolated from store/search so model swaps remain config-only.
## Allowed dependencies
- `kb-core`
- `kb-config`
- `kb-embed`
- `fastembed = "4"` (or current stable)
- `tokenizers`
- `ort` (transitive via fastembed)
- `tracing`
- `thiserror`
## Forbidden dependencies
- `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-store-*`, `kb-search`, `kb-llm*`, `kb-rag`, `kb-tui`, `kb-desktop`, network HTTP libs (model download is fastembed's responsibility)
## Inputs
| input | type | source |
|-------|------|--------|
| `kb-config::Config.models.embedding` | settings | runtime |
| `EmbeddingInput` slice | `&[kb_core::EmbeddingInput<'_>]` | callers |
| model cache | `data_dir/models/fastembed/` | filesystem |
## Outputs
| output | type | downstream |
|--------|------|------------|
| `Vec<Vec<f32>>` | row-aligned, `dimensions = 384` | `kb-store-vector`, query vectors for hybrid search |
| model identity | `(EmbeddingModelId, EmbeddingVersion, usize)` | record fields, `embedding_id` recipe |
## Public surface (signatures only — no new types)
```rust
pub struct FastembedEmbedder { /* internal: TextEmbedding instance + model meta */ }

impl FastembedEmbedder {
    pub fn new(config: &kb_config::Config) -> anyhow::Result<Self>;
}

impl kb_core::Embedder for FastembedEmbedder {
    fn model_id(&self) -> kb_core::EmbeddingModelId;
    fn model_version(&self) -> kb_core::EmbeddingVersion;
    fn dimensions(&self) -> usize;
    fn embed(&self, inputs: &[kb_core::EmbeddingInput<'_>]) -> anyhow::Result<Vec<Vec<f32>>>;
}
```
## Behavior contract
- Default model `multilingual-e5-small` (384 dims). `model_id()` returns `EmbeddingModelId("multilingual-e5-small")`.
- `model_version()` returns `EmbeddingVersion("v1")` initially. Bump per §9 if fastembed upgrades the bundled weights.
- Apply e5 prefix per §11.3: input prefixed with `"passage: "` for `EmbeddingKind::Document`, `"query: "` for `EmbeddingKind::Query` BEFORE tokenization.
- Batch processing respects `config.models.embedding.batch_size`. An input slice longer than `batch_size` is split into multiple inference calls, and the results are concatenated in input order.
- L2-normalize each vector before returning (e5 convention).
- Dimensions must equal `config.models.embedding.dimensions` AND the model's actual dim. Mismatch returns `anyhow::Error` at construction (not at first `embed`).
- Model files cached under `config.storage.model_dir/fastembed/` (downloaded on first use).
- Determinism: identical input + identical model version → identical vectors (per-component tolerance < 1e-6; snapshot tests compare an aggregate hash).
- No async runtime: the trait is synchronous. fastembed is sync internally.
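The prefix, batching, and normalization rules above can be sketched without touching the model. The snippet below is an illustration under stated assumptions: `EmbeddingKind` and `EmbeddingInput` are local stand-ins for the `kb-core` types, and `with_e5_prefix`, `l2_normalize`, and `embed_batched` are hypothetical helper names. The `infer` closure stands in for the fastembed inference call, whose real API is deliberately not reproduced here.

```rust
// Local stand-ins so the sketch compiles on its own; the real types live in kb-core.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum EmbeddingKind {
    Document,
    Query,
}

pub struct EmbeddingInput<'a> {
    pub text: &'a str,
    pub kind: EmbeddingKind,
}

/// Prepends the e5 role prefix; this must happen before tokenization.
pub fn with_e5_prefix(input: &EmbeddingInput<'_>) -> String {
    let prefix = match input.kind {
        EmbeddingKind::Document => "passage: ",
        EmbeddingKind::Query => "query: ",
    };
    format!("{prefix}{}", input.text)
}

/// Scales `v` to unit L2 norm in place; an all-zero vector is left unchanged.
pub fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

/// Calls `infer` once per batch of at most `batch_size` prefixed texts and
/// returns the normalized rows concatenated in input order.
pub fn embed_batched<F>(
    inputs: &[EmbeddingInput<'_>],
    batch_size: usize,
    mut infer: F,
) -> Vec<Vec<f32>>
where
    F: FnMut(&[String]) -> Vec<Vec<f32>>,
{
    let mut out = Vec::with_capacity(inputs.len());
    // `chunks(0)` panics, so a batch size of 0 is treated as 1.
    for batch in inputs.chunks(batch_size.max(1)) {
        let texts: Vec<String> = batch.iter().map(|i| with_e5_prefix(i)).collect();
        let mut rows = infer(&texts);
        for row in rows.iter_mut() {
            l2_normalize(row);
        }
        out.extend(rows);
    }
    out
}
```

Passing inference in as a closure keeps the batching logic testable with a stub, which is one way to cover the batch-splitting rule without loading the ONNX model.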
## Storage / wire effects
- Reads/writes `data_dir/models/fastembed/` (model cache).
- Otherwise no DB or wire effects.
## Test plan
| kind | description | fixture / data |
|------|-------------|----------------|
| unit | construction with default config returns dims=384 | tmp config |
| unit | construction with mismatched dims returns error | tmp config |
| unit | `EmbeddingKind::Query` vs `Document` for same text yield different vectors (cosine < 1.0) | inline |
| unit | output vectors are L2-normalized (norm ~= 1.0 ± 1e-3) | inline |
| determinism | identical input twice → identical output (hash-of-floats compare) | inline |
| performance | batch of 64 short inputs completes in < 5s on CI host | tmp config (skip on slow CI via `#[ignore]`) |
| snapshot | aggregate hash of vectors for 5 known sentences stable across runs | `fixtures/embed/known-sentences.json` |
All tests under `cargo test -p kb-embed-local`. Mark slow tests `#[ignore]` and run them via `cargo test -- --ignored` in a dedicated CI lane.
## Definition of Done
- [ ] `cargo check -p kb-embed-local` passes
- [ ] `cargo test -p kb-embed-local` passes (excluding `#[ignore]`)
- [ ] First-run model download works under `data_dir/models/fastembed/`
- [ ] No imports outside Allowed dependencies
- [ ] PR links design §11.3, §6.4, §9
## Out of scope
- Reranker (P+).
- Other model providers (Ollama embedding endpoint, candle) — separate adapter crates.
- Visual / image embeddings (P6).
## Risks / notes
- ONNX runtime first-load latency on M-series Macs (Metal) can be 1-2 s, so tests share a single `OnceCell<FastembedEmbedder>`.
- Forgetting the e5 prefix silently degrades retrieval quality. Tests must assert query/document yield distinct vectors.
- Bumping `EmbeddingVersion` invalidates every `embedding_id`. Treat it as a versioning event per §9 and provide justification in the PR body.

@@ -0,0 +1,132 @@
---
phase: P3
component: kb-store-vector (LanceDB)
task_id: p3-3
title: "LanceDB VectorStore + embedding_records writer"
status: planned
depends_on: [p3-2, p1-6]
unblocks: [p3-4]
contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
contract_sections: [§5.6 embedding_records, §6.3 lancedb table naming, §7.2 VectorStore, §9 versioning]
---
# p3-3 — LanceDB VectorStore
## Goal
Implement `VectorStore` over LanceDB (embedded). It stores per-model tables (`chunk_embeddings_<model>_<dim>.lance`), upserts each vector together with a matching row in `embedding_records` (SQLite) using the best-effort two-step write described in the behavior contract, and serves `search` for the vector retrieval mode.
## Why now / why this size
Closes the loop chunk → vector. Splits cleanly from `kb-search` so hybrid (p3-4) can compose lexical + vector retrievers without leaking storage details.
## Allowed dependencies
- `kb-core`
- `kb-config`
- `kb-store-sqlite` (only for writing/reading rows in `embedding_records`)
- `lancedb`
- `arrow` (and `arrow-array`, `arrow-schema`)
- `serde`, `serde_json`
- `tracing`
- `thiserror`
## Forbidden dependencies
- `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-embed*` (consumes `Vec<f32>` via input only — no embedding logic here), `kb-search`, `kb-llm*`, `kb-rag`, `kb-tui`, `kb-desktop`
## Inputs
| input | type | source |
|-------|------|--------|
| `VectorRecord` slice | `&[kb_core::VectorRecord]` | `kb-app::embed_index` (P3 facade) |
| query vector | `&[f32]` | `kb-embed-local` (`Embedder::embed` for query) |
| filters | `kb_core::SearchFilters` | `SearchQuery` |
| `kb-config::Config.storage.vector_dir` | path | runtime |
## Outputs
| output | type | downstream |
|--------|------|------------|
| Lance tables under `vector_dir/chunk_embeddings_<model>_<dim>.lance/` | filesystem | future searches |
| `embedding_records` rows | SQLite | reverse lookup, reindex bookkeeping |
| `Vec<VectorHit>` | `kb_core::VectorHit` | hybrid retriever (p3-4) |
## Public surface (signatures only — no new types)
```rust
pub struct LanceVectorStore { /* internal: connection + sqlite handle */ }

impl LanceVectorStore {
    pub fn new(config: &kb_config::Config, sqlite: std::sync::Arc<kb_store_sqlite::SqliteStore>) -> anyhow::Result<Self>;
}

impl kb_core::VectorStore for LanceVectorStore {
    fn ensure_table(&self, model: &kb_core::EmbeddingModelId, dim: usize) -> anyhow::Result<kb_core::IndexId>;
    fn upsert(&self, recs: &[kb_core::VectorRecord]) -> anyhow::Result<()>;
    fn search(&self, query_vec: &[f32], k: usize, filters: &kb_core::SearchFilters) -> anyhow::Result<Vec<kb_core::VectorHit>>;
}
```
## Behavior contract
- Table naming: `chunk_embeddings_<model_id>_<dim>.lance`. Model IDs must be sanitized (replace every character outside `[A-Za-z0-9-]` with `_`) to avoid filesystem issues.
- `ensure_table` is idempotent: opens existing or creates with explicit Arrow schema:
```
chunk_id          : Utf8 (primary)
doc_id            : Utf8
embedding         : FixedSizeList<Float32, dim>
model_id          : Utf8
embedding_version : Utf8
text              : Utf8
heading_path      : Utf8
created_at        : Timestamp(Microsecond, UTC)
```
- For corpora < 100k rows, no IVF index — flat cosine. Above that threshold, the next migration task (P+) introduces IVF; this task does not.
- `upsert` is best-effort 2-step (Lance commit, then SQLite `INSERT OR REPLACE INTO embedding_records`). On SQLite failure after Lance commit, log a warning; the next `upsert` reconciles via the `UNIQUE(chunk_id, model_id, model_version, dimensions)` constraint.
- Dimension mismatch (record dim ≠ table dim) returns `anyhow::Error` from `upsert` and writes nothing.
- `search` ranks by cosine similarity and applies `SearchFilters` after the fetch. Because filtering can drop hits, it fetches `2 * k` candidates internally, filters them, then trims to `k`.
- `VectorHit { chunk_id, score, doc_id, text, heading_path }`; score in [0, 1] (cosine similarity, clamped).
- `search` returns empty `Vec` (not error) when table absent.
- `index_id` for `ensure_table` per design §4.2 with `collection = "chunk_embeddings"`, `index_kind = "flat"`, `params_hash = blake3(serde_json(table_schema))`.
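Two of the rules above, table naming and the clamped cosine score, are small enough to sketch directly. This is an illustration only: `sanitize_model_id`, `table_name`, and `clamped_cosine` are hypothetical helper names, and returning `None` on a dimension mismatch is an assumption made for the sketch (the contract itself specifies an `anyhow::Error` from `upsert`).

```rust
/// Replaces every character outside `[A-Za-z0-9-]` with `_`.
pub fn sanitize_model_id(model_id: &str) -> String {
    model_id
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '-' { c } else { '_' })
        .collect()
}

/// Builds the per-model table name `chunk_embeddings_<model_id>_<dim>.lance`.
pub fn table_name(model_id: &str, dim: usize) -> String {
    format!("chunk_embeddings_{}_{}.lance", sanitize_model_id(model_id), dim)
}

/// Cosine similarity clamped to [0, 1].
/// Returns `None` when the two vectors have different dimensions.
pub fn clamped_cosine(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        // A zero vector has no direction; score it as no similarity.
        return Some(0.0);
    }
    Some((dot / (na * nb)).clamp(0.0, 1.0))
}
```

Note that sanitization is lossy: two model IDs that differ only in replaced characters (for example `a/b` and `a.b`) map to the same table name, so the raw `model_id` column in the schema remains the authoritative identity.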
## Storage / wire effects
- Writes Lance tables under `data_dir/lancedb/`.
- Writes/reads `embedding_records` rows.
- Reads chunks/documents not from this crate (the caller pre-fetches text + heading via `VectorRecord`).
## Test plan
| kind | description | fixture / data |
|------|-------------|----------------|
| unit | `ensure_table` creates dir; second call returns same `IndexId` | tmp data_dir |
| unit | `upsert` of 10 records makes them retrievable via `search` (k=5) | tmp data_dir |
| unit | dimension mismatch → error, no Lance row written | tmp data_dir |
| unit | filter `tags_any` removes non-matching docs | tmp data_dir + seeded sqlite tags |
| unit | model isolation: two models holding the same `chunk_id` live in separate table directories and do not overwrite each other | tmp data_dir |
| unit | search before any upsert returns empty Vec | tmp data_dir |
| determinism | same query vector + same data → same top-k order | tmp data_dir |
| snapshot | `Vec<VectorHit>` JSON for fixed corpus stable | `fixtures/vector/run-1.json` |
All tests under `cargo test -p kb-store-vector`.
## Definition of Done
- [ ] `cargo check -p kb-store-vector` passes
- [ ] `cargo test -p kb-store-vector` passes
- [ ] No imports outside Allowed dependencies
- [ ] `embedding_records` rows align 1:1 with Lance rows after a successful upsert batch
- [ ] PR links design §5.6, §6.3, §7.2
## Out of scope
- IVF / PQ index tuning (P+).
- Image / multimodal vector tables (P6).
- `kb-app` orchestration of indexing jobs (`embed_index` facade method body).
## Risks / notes
- LanceDB's Rust API requires Arrow batches; constructing them per upsert is allocation-heavy — batch by configurable chunk size to avoid memory spikes.
- Filter-then-limit can starve `k` results; over-fetch by `2 * k` initially and double on retry up to a cap.
- WAL stability: ensure the Lance commit completes before the SQLite `INSERT OR REPLACE INTO embedding_records` to avoid orphan SQLite rows.
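The over-fetch-and-double strategy from the notes above might look like the following. It is a sketch under assumptions: `search_with_overfetch` and `max_fetch` are hypothetical names, `fetch(n)` stands in for a nearest-neighbour query that returns up to `n` candidates best-first, and `keep` stands in for the `SearchFilters` check.

```rust
/// Fetches `2 * k` candidates, filters them, and doubles the fetch size until
/// `k` hits survive, the store is exhausted, or `max_fetch` is reached.
pub fn search_with_overfetch<T, F, P>(k: usize, max_fetch: usize, mut fetch: F, keep: P) -> Vec<T>
where
    F: FnMut(usize) -> Vec<T>, // returns up to `n` nearest candidates, best first
    P: Fn(&T) -> bool,         // true when a candidate passes the filters
{
    if k == 0 {
        return Vec::new();
    }
    // Start at 2 * k, but never above the cap (and never below k itself).
    let mut n = (2 * k).min(max_fetch.max(k));
    loop {
        let candidates = fetch(n);
        // Fewer rows than requested means the table has nothing more to give.
        let exhausted = candidates.len() < n;
        let mut hits: Vec<T> = candidates.into_iter().filter(|c| keep(c)).collect();
        if hits.len() >= k || exhausted || n >= max_fetch {
            hits.truncate(k);
            return hits;
        }
        n = (n * 2).min(max_fetch);
    }
}
```

Each retry re-runs the whole query with a larger limit, which is simple but repeats work; that cost is one reason to keep the cap small relative to the flat-scan threshold.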

@@ -0,0 +1,145 @@
---
phase: P3
component: kb-search (hybrid)
task_id: p3-4
title: "Hybrid Retriever (RRF) over lexical + vector"
status: planned
depends_on: [p2-2, p3-3]
unblocks: [p4-3]
contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
contract_sections: [§3.7 RetrievalDetail, §0 Q3, §1.6 search --explain, §6.4 [search] rrf settings]
---
# p3-4 — Hybrid Retriever (RRF)
## Goal
Compose `LexicalRetriever` (p2-2) and a vector retriever wrapper around `LanceVectorStore` (p3-3) into a single `Retriever` that dispatches by `SearchMode`. For `Hybrid`, fuse via Reciprocal Rank Fusion (RRF) and populate full `RetrievalDetail` per `SearchHit`.
## Why now / why this size
Single mediator. Keeps the lexical and vector retrievers focused; only this task knows how to fuse. RAG (p4-3) consumes hybrid output without caring about the underlying retrievers.
## Allowed dependencies
- `kb-core`
- `kb-config`
- `kb-store-sqlite` (for `LexicalRetriever`)
- `kb-store-vector` (for `LanceVectorStore`)
- `kb-embed` (trait only — for query embedding via `Embedder`)
- `tracing`
- `thiserror`
## Forbidden dependencies
- `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-llm*`, `kb-rag`, `kb-tui`, `kb-desktop`. (`kb-embed-local` is a runtime-injected `dyn Embedder`; this crate must not depend on the concrete adapter directly.)
## Inputs
| input | type | source |
|-------|------|--------|
| `LexicalRetriever` | `Arc<dyn Retriever>` | constructed elsewhere |
| `LanceVectorStore` | `Arc<dyn VectorStore>` | constructed elsewhere |
| `Embedder` (for query embedding) | `Arc<dyn Embedder>` | runtime-injected |
| `kb-config::Config.search` | `default_k`, `hybrid_fusion`, `rrf_k` | runtime |
| `SearchQuery` | `kb_core::SearchQuery` | `kb-app::search` |
## Outputs
| output | type | downstream |
|--------|------|------------|
| `Vec<SearchHit>` (with full `RetrievalDetail`) | `kb_core::SearchHit` | `kb-cli` printer, `kb-rag` packer |
## Public surface (signatures only — no new types)
```rust
pub struct HybridRetriever {
    lexical: std::sync::Arc<dyn kb_core::Retriever>,
    vector: std::sync::Arc<dyn kb_core::Retriever>, // wrapper over LanceVectorStore + Embedder
    fusion: FusionPolicy,
    k: usize,
}

pub enum FusionPolicy { Rrf { k_rrf: u32 } }

impl HybridRetriever {
    pub fn new(
        config: &kb_config::Config,
        lexical: std::sync::Arc<dyn kb_core::Retriever>,
        vector: std::sync::Arc<dyn kb_core::Retriever>,
    ) -> Self;
}

impl kb_core::Retriever for HybridRetriever {
    fn search(&self, query: &kb_core::SearchQuery) -> anyhow::Result<Vec<kb_core::SearchHit>>;
    fn index_version(&self) -> kb_core::IndexVersion;
}

/// Wrapper that turns a VectorStore + Embedder into a Retriever.
pub struct VectorRetriever {
    store: std::sync::Arc<dyn kb_core::VectorStore>,
    embed: std::sync::Arc<dyn kb_core::Embedder>,
    /* heading_path/snippet enrichment hits SQLite via kb-store-sqlite read accessor */
}

impl VectorRetriever {
    pub fn new(
        store: std::sync::Arc<dyn kb_core::VectorStore>,
        embed: std::sync::Arc<dyn kb_core::Embedder>,
        sqlite: std::sync::Arc<kb_store_sqlite::SqliteStore>,
    ) -> Self;
}

impl kb_core::Retriever for VectorRetriever { /* per §7.2 */ }
```
## Behavior contract
- `SearchMode::Lexical` dispatches solely to `lexical`. `RetrievalDetail.method = Lexical`, `vector_*` fields are `None`.
- `SearchMode::Vector` dispatches solely to `vector`. `RetrievalDetail.method = Vector`, `lexical_*` fields are `None`.
- `SearchMode::Hybrid`:
  - run `lexical.search(query)` and `vector.search(query)` sequentially (parallel fan-out is allowed but not required).
  - fuse with RRF: `score(c) = Σ_{m ∈ {lex, vec}} 1 / (k_rrf + rank_m(c))`, where `k_rrf` comes from config (default 60). `rank_m` is 1-based; a chunk that does not appear in retriever `m` contributes 0 for that term.
  - sort by fused score DESC, take top `query.k`.
  - populate every `SearchHit.retrieval`: `method = Hybrid`, `lexical_score` / `lexical_rank` / `vector_score` / `vector_rank` from each retriever's hit (or `None` if absent), `fusion_score` = computed fused score.
  - if a chunk appears in only one retriever, its `RetrievalDetail` still gets populated with `Some(...)` from that side and `None` for the other.
  - tie-break by `lexical_rank` ascending, then `chunk_id` ascending (deterministic).
- `VectorRetriever`:
  - embeds the query via `embed.embed(&[EmbeddingInput { text: query.text, kind: Query }])`.
  - calls `VectorStore::search(&query_vec, query.k * 2, &query.filters)` (over-fetch for filter losses), trims to `k`.
  - hydrates `doc_path` / `heading_path` / `section_label` / `chunker_version` / `embedding_model` from SQLite by joining on `chunk_id`.
  - builds `Citation` from chunk's first source span (same logic as p2-2).
- `index_version()` returns the lexical index version in pure lexical mode, the vector index version in pure vector mode, and `"hybrid:<lex_iv>+<vec_iv>"` in hybrid mode.
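The fusion and tie-break rules above can be sketched as a standalone function. This is a minimal illustration, not the crate's code: `Fused` and `rrf_fuse` are hypothetical names, each input slice is assumed to list chunk IDs best-first, and placing chunks with no lexical rank after ranked ones during tie-breaking is an assumption the contract does not spell out.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the rank and score fields of `RetrievalDetail`.
#[derive(Debug, Clone, PartialEq)]
pub struct Fused {
    pub chunk_id: String,
    pub lexical_rank: Option<u32>,
    pub vector_rank: Option<u32>,
    pub fusion_score: f64,
}

/// Reciprocal Rank Fusion over two ranked lists of chunk IDs (best first).
pub fn rrf_fuse(lexical: &[&str], vector: &[&str], k_rrf: u32, k: usize) -> Vec<Fused> {
    // chunk_id -> (lexical rank, vector rank); ranks are 1-based.
    let mut ranks: BTreeMap<&str, (Option<u32>, Option<u32>)> = BTreeMap::new();
    for (i, id) in lexical.iter().enumerate() {
        let entry = ranks.entry(*id).or_insert((None, None));
        if entry.0.is_none() {
            entry.0 = Some(i as u32 + 1); // keep the best rank if an ID repeats
        }
    }
    for (i, id) in vector.iter().enumerate() {
        let entry = ranks.entry(*id).or_insert((None, None));
        if entry.1.is_none() {
            entry.1 = Some(i as u32 + 1);
        }
    }
    // A retriever that did not return the chunk contributes 0.
    let term = |rank: Option<u32>| rank.map_or(0.0, |r| 1.0 / f64::from(k_rrf + r));
    let mut fused: Vec<Fused> = ranks
        .into_iter()
        .map(|(id, (lex, vec))| Fused {
            chunk_id: id.to_string(),
            lexical_rank: lex,
            vector_rank: vec,
            fusion_score: term(lex) + term(vec),
        })
        .collect();
    // Fused score DESC, then lexical_rank ASC (missing rank sorts last: an
    // assumption of this sketch), then chunk_id ASC.
    fused.sort_by(|a, b| {
        b.fusion_score
            .total_cmp(&a.fusion_score)
            .then_with(|| {
                let ra = a.lexical_rank.unwrap_or(u32::MAX);
                let rb = b.lexical_rank.unwrap_or(u32::MAX);
                ra.cmp(&rb)
            })
            .then_with(|| a.chunk_id.cmp(&b.chunk_id))
    });
    fused.truncate(k);
    fused
}
```

As a worked check with `k_rrf = 60`: a chunk ranked 1st lexically and 2nd by vector scores 1/61 + 1/62 ≈ 0.0325, while a chunk found only at lexical rank 2 scores 1/62 ≈ 0.0161, so appearing in both lists roughly doubles the fused score.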
## Storage / wire effects
- Reads only. No mutations.
- Output JSON conforms to `search_hit.v1`.
## Test plan
| kind | description | fixture / data |
|------|-------------|----------------|
| unit | pure lexical mode delegates 1:1 to `lexical.search` | mock retrievers |
| unit | pure vector mode delegates 1:1 to `vector.search` | mock retrievers |
| unit | hybrid: chunk only in lexical receives `vector_*: None`, but still has a fused score | mock retrievers |
| unit | RRF formula matches expected with `k_rrf=60` | inline math test |
| unit | tie-break deterministic (same fused score → stable order) | inline |
| unit | hybrid recall ≥ max(lexical recall, vector recall) on a tiny corpus where each mode finds disjoint hits | tmp DB + Lance + MockEmbedder |
| determinism | identical query twice → byte-identical `Vec<SearchHit>` | tmp DB |
| snapshot | hybrid output JSON stable | `fixtures/search/hybrid/run-1.json` |
All tests under `cargo test -p kb-search hybrid`.
## Definition of Done
- [ ] `cargo check -p kb-search` passes
- [ ] `cargo test -p kb-search hybrid` passes
- [ ] No imports outside Allowed dependencies
- [ ] PR links design §3.7, §6.4 search, §0 Q3
## Out of scope
- Reranker (P+).
- Multimodal retrieval (image/audio) — P6+.
- Score calibration across modes (RRF makes scores rank-comparable; absolute calibration is P+).
## Risks / notes
- Mismatched `index_version` between lexical and vector should be flagged at construction so users notice stale indexes.
- Over-fetching at the vector retriever (`2 * k`) is conservative; if filters reject everything, the hybrid `k` may shrink. Document this in CLI `--explain`.
- RRF is rank-based, so absolute lexical BM25 normalization (p2-2) doesn't affect the fused order; still keep normalization for `--explain` readability.