diff --git a/tasks/p3/p3-1-embedder-trait.md b/tasks/p3/p3-1-embedder-trait.md new file mode 100644 index 0000000..4305515 --- /dev/null +++ b/tasks/p3/p3-1-embedder-trait.md @@ -0,0 +1,100 @@ +--- +phase: P3 +component: kb-embed (trait crate) +task_id: p3-1 +title: "Embedder trait + EmbeddingInput/Kind validation" +status: planned +depends_on: [p0-1] +unblocks: [p3-2, p3-3, p3-4] +contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md +contract_sections: [§3.7 SearchHit.embedding_model, §7.1 EmbeddingInput/Kind, §7.2 Embedder, §11 LLM/embedding split] +--- + +# p3-1 — Embedder trait crate + +## Goal + +Provide the `kb-embed` crate that re-exports `Embedder` trait, `EmbeddingInput`/`EmbeddingKind`, and offers a mock implementation for downstream tests. This task is **trait-only**; concrete adapters live in p3-2. + +## Why now / why this size + +Concrete adapters (fastembed, ollama-embed, candle) need a stable trait surface. Owning the trait + a mock implementation in a tiny crate keeps `kb-store-vector` and `kb-search` testable without touching real models. + +## Allowed dependencies + +- `kb-core` +- `kb-config` +- `serde` +- `thiserror` +- `tracing` + +## Forbidden dependencies + +- `fastembed`, `ort`, `tokenizers`, `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-store-*`, `kb-search`, `kb-llm*`, `kb-rag`, `kb-tui`, `kb-desktop` + +## Inputs + +| input | type | source | +|-------|------|--------| +| `EmbeddingInput` | `kb_core::EmbeddingInput<'_>` | callers (parser-side or query-side) | +| model identity | `(EmbeddingModelId, EmbeddingVersion, dimensions)` | adapter at construction | + +## Outputs + +| output | type | downstream | +|--------|------|------------| +| `Vec>` | row-aligned with input | `kb-store-vector`, `kb-search` (vector mode) | + +## Public surface (signatures only — no new types) + +```rust +pub use kb_core::{EmbeddingInput, EmbeddingKind, EmbeddingModelId, EmbeddingVersion, Embedder}; + +/// Test-only mock that produces deterministic vectors. +pub struct MockEmbedder { /* internal: model_id, dims, seed */ } +impl MockEmbedder { + pub fn new(model_id: kb_core::EmbeddingModelId, version: kb_core::EmbeddingVersion, dimensions: usize) -> Self; +} +impl kb_core::Embedder for MockEmbedder { /* per §7.2 */ } +``` + +## Behavior contract + +- `MockEmbedder::embed` produces vectors deterministically from `(text, kind)`: e.g., `vector[i] = hash_to_unit_float(text, kind, i, seed)` so two identical inputs produce identical vectors and different inputs produce nearly-orthogonal vectors. Used by downstream tests. +- `MockEmbedder` must respect `EmbeddingKind::Document` vs `Query` — different prefix mixed into the hash so query embeddings differ from document embeddings of the same text (mirrors real e5 behavior). +- `dimensions()` returns the value passed at construction; callers must trust it. +- Real adapters (p3-2) MUST NOT implement `Embedder` here. +- The crate may expose a tiny helper `pub fn assert_vector_shape(vecs: &[Vec], expected_dims: usize)` for downstream tests. + +## Storage / wire effects + +- None. + +## Test plan + +| kind | description | fixture / data | +|------|-------------|----------------| +| unit | trait dyn dispatch via `Box` works | inline | +| unit | `MockEmbedder` produces identical vector for identical input | inline | +| unit | `EmbeddingKind::Document` vs `Query` for same text yield different vectors | inline | +| unit | dimensions match construction-time value | inline | +| contract | property test: 100 random inputs, each vector has length == dimensions, all finite floats | inline (proptest) | + +All tests under `cargo test -p kb-embed`. + +## Definition of Done + +- [ ] `cargo check -p kb-embed` passes +- [ ] `cargo test -p kb-embed` passes +- [ ] No external embedding dep present +- [ ] PR links design §7.2 Embedder, §11 + +## Out of scope + +- Real adapter (`kb-embed-local` is p3-2). +- Reranker traits (P+). + +## Risks / notes + +- `MockEmbedder` is for tests; do not let it leak into release builds via default features. Gate behind `cfg(test)` or a `mock` feature flag. +- Trait re-exports keep the call site stable even if `kb-core` reorganizes; downstream crates should `use kb_embed::Embedder` rather than `use kb_core::Embedder`. diff --git a/tasks/p3/p3-2-fastembed-adapter.md b/tasks/p3/p3-2-fastembed-adapter.md new file mode 100644 index 0000000..a1cae38 --- /dev/null +++ b/tasks/p3/p3-2-fastembed-adapter.md @@ -0,0 +1,119 @@ +--- +phase: P3 +component: kb-embed-local (fastembed adapter) +task_id: p3-2 +title: "fastembed-rs Embedder for multilingual-e5-small" +status: planned +depends_on: [p3-1] +unblocks: [p3-3, p3-4] +contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md +contract_sections: [§7.2 Embedder, §11.3 local embedding, §6.4 [models.embedding], §9 versioning] +--- + +# p3-2 — fastembed adapter + +## Goal + +Provide `FastembedEmbedder` implementing `Embedder` for `multilingual-e5-small` (default) using `fastembed-rs` (ONNX runtime). Apply Document/Query prefix per §11.3. Honor `batch_size` from config. + +## Why now / why this size + +First real `Embedder`. Drives `EmbeddingId` recipe (model_id + model_version + dims) downstream. Isolated from store/search so model swaps remain config-only. + +## Allowed dependencies + +- `kb-core` +- `kb-config` +- `kb-embed` +- `fastembed = "4"` (or current stable) +- `tokenizers` +- `ort` (transitive via fastembed) +- `tracing` +- `thiserror` + +## Forbidden dependencies + +- `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-store-*`, `kb-search`, `kb-llm*`, `kb-rag`, `kb-tui`, `kb-desktop`, network HTTP libs (model download is fastembed's responsibility) + +## Inputs + +| input | type | source | +|-------|------|--------| +| `kb-config::Config.models.embedding` | settings | runtime | +| `EmbeddingInput[..]` | `kb_core::EmbeddingInput<'_>[]` | callers | +| model cache | `data_dir/models/fastembed/` | filesystem | + +## Outputs + +| output | type | downstream | +|--------|------|------------| +| `Vec>` | row-aligned, `dimensions = 384` | `kb-store-vector`, query vectors for hybrid search | +| model identity | `(EmbeddingModelId, EmbeddingVersion, usize)` | record fields, `embedding_id` recipe | + +## Public surface (signatures only — no new types) + +```rust +pub struct FastembedEmbedder { /* internal: TextEmbedding instance + model meta */ } + +impl FastembedEmbedder { + pub fn new(config: &kb_config::Config) -> anyhow::Result; +} + +impl kb_core::Embedder for FastembedEmbedder { + fn model_id(&self) -> kb_core::EmbeddingModelId; + fn model_version(&self) -> kb_core::EmbeddingVersion; + fn dimensions(&self) -> usize; + fn embed(&self, inputs: &[kb_core::EmbeddingInput<'_>]) -> anyhow::Result>>; +} +``` + +## Behavior contract + +- Default model `multilingual-e5-small` (384 dims). `model_id()` returns `EmbeddingModelId("multilingual-e5-small")`. +- `model_version()` returns `EmbeddingVersion("v1")` initially. Bump per §9 if fastembed upgrades the bundled weights. +- Apply e5 prefix per §11.3: input prefixed with `"passage: "` for `EmbeddingKind::Document`, `"query: "` for `EmbeddingKind::Query` BEFORE tokenization. +- Batch processing respects `config.models.embedding.batch_size`. Inputs longer than the batch are split into multiple inference calls and concatenated. +- L2-normalize each vector before returning (e5 convention). +- Dimensions must equal `config.models.embedding.dimensions` AND the model's actual dim. Mismatch returns `anyhow::Error` at construction (not at first `embed`). +- Model files cached under `config.storage.model_dir/fastembed/` (downloaded on first use). +- Determinism: identical input + identical model version → identical vectors (tolerance < 1e-6 on aggregate hash for snapshot tests). +- No async runtime: the trait is synchronous. fastembed is sync internally. + +## Storage / wire effects + +- Reads/writes `data_dir/models/fastembed/` (model cache). +- Otherwise no DB or wire effects. + +## Test plan + +| kind | description | fixture / data | +|------|-------------|----------------| +| unit | construction with default config returns dims=384 | tmp config | +| unit | construction with mismatched dims returns error | tmp config | +| unit | `EmbeddingKind::Query` vs `Document` for same text yield different vectors (cosine < 1.0) | inline | +| unit | output vectors are L2-normalized (norm ~= 1.0 ± 1e-3) | inline | +| determinism | identical input twice → identical output (hash-of-floats compare) | inline | +| performance | batch of 64 short inputs completes in < 5s on CI host | tmp config (skip on slow CI via `#[ignore]`) | +| snapshot | aggregate hash of vectors for 5 known sentences stable across runs | `fixtures/embed/known-sentences.json` | + +All tests under `cargo test -p kb-embed-local`. Mark slow tests `#[ignore]` and run via `cargo test -- --ignored` in dedicated CI lane. + +## Definition of Done + +- [ ] `cargo check -p kb-embed-local` passes +- [ ] `cargo test -p kb-embed-local` passes (excluding `#[ignore]`) +- [ ] First-run model download works under `data_dir/models/fastembed/` +- [ ] No imports outside Allowed dependencies +- [ ] PR links design §11.3, §6.4, §9 + +## Out of scope + +- Reranker (P+). +- Other model providers (Ollama embedding endpoint, candle) — separate adapter crates. +- Visual / image embeddings (P6). + +## Risks / notes + +- ONNX runtime first-load latency on M-series Macs (Metal) can be 1-2 s; tests share a `OnceCell`. +- Forgetting the e5 prefix silently degrades retrieval quality. Tests must assert query/document yield distinct vectors. +- Bumping `EmbeddingVersion` invalidates every `embedding_id`. Treat as a versioning event per §9 — provides justification in PR body. diff --git a/tasks/p3/p3-3-lancedb-store.md b/tasks/p3/p3-3-lancedb-store.md new file mode 100644 index 0000000..717e4fd --- /dev/null +++ b/tasks/p3/p3-3-lancedb-store.md @@ -0,0 +1,132 @@ +--- +phase: P3 +component: kb-store-vector (LanceDB) +task_id: p3-3 +title: "LanceDB VectorStore + embedding_records writer" +status: planned +depends_on: [p3-2, p1-6] +unblocks: [p3-4] +contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md +contract_sections: [§5.6 embedding_records, §6.3 lancedb table naming, §7.2 VectorStore, §9 versioning] +--- + +# p3-3 — LanceDB VectorStore + +## Goal + +Implement `VectorStore` over LanceDB (embedded). Stores per-model tables (`chunk_embeddings__.lance`), upserts vectors transactionally with a row in `embedding_records` (SQLite), and serves `search` for the vector retrieval mode. + +## Why now / why this size + +Closes the loop chunk → vector. Splits cleanly from `kb-search` so hybrid (p3-4) can compose lexical + vector retrievers without leaking storage details. + +## Allowed dependencies + +- `kb-core` +- `kb-config` +- `kb-store-sqlite` (only for writing/reading rows in `embedding_records`) +- `lancedb` +- `arrow` (and `arrow-array`, `arrow-schema`) +- `serde`, `serde_json` +- `tracing` +- `thiserror` + +## Forbidden dependencies + +- `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-embed*` (consumes `Vec` via input only — no embedding logic here), `kb-search`, `kb-llm*`, `kb-rag`, `kb-tui`, `kb-desktop` + +## Inputs + +| input | type | source | +|-------|------|--------| +| `VectorRecord[..]` | `kb_core::VectorRecord` | `kb-app::embed_index` (P3 facade) | +| query vector | `&[f32]` | `kb-embed-local` (`Embedder::embed` for query) | +| filters | `kb_core::SearchFilters` | `SearchQuery` | +| `kb-config::Config.storage.vector_dir` | path | runtime | + +## Outputs + +| output | type | downstream | +|--------|------|------------| +| Lance tables under `vector_dir/chunk_embeddings__.lance/` | filesystem | future searches | +| `embedding_records` rows | SQLite | reverse lookup, reindex bookkeeping | +| `Vec` | `kb_core::VectorHit` | hybrid retriever (p3-4) | + +## Public surface (signatures only — no new types) + +```rust +pub struct LanceVectorStore { /* internal: connection + sqlite handle */ } + +impl LanceVectorStore { + pub fn new(config: &kb_config::Config, sqlite: std::sync::Arc) -> anyhow::Result; +} + +impl kb_core::VectorStore for LanceVectorStore { + fn ensure_table(&self, model: &kb_core::EmbeddingModelId, dim: usize) -> anyhow::Result; + fn upsert(&self, recs: &[kb_core::VectorRecord]) -> anyhow::Result<()>; + fn search(&self, query_vec: &[f32], k: usize, filters: &kb_core::SearchFilters) -> anyhow::Result>; +} +``` + +## Behavior contract + +- Table naming: `chunk_embeddings__.lance`. Model IDs must be sanitized (replace non `[A-Za-z0-9-]` with `_`) to avoid filesystem issues. +- `ensure_table` is idempotent: opens existing or creates with explicit Arrow schema: + ``` + chunk_id : Utf8 (primary) + doc_id : Utf8 + embedding: FixedSizeList + model_id : Utf8 + embedding_version : Utf8 + text : Utf8 + heading_path : Utf8 + created_at : Timestamp(Microsecond, UTC) + ``` +- For corpora < 100k rows, no IVF index — flat cosine. Above that threshold, the next migration task (P+) introduces IVF; this task does not. +- `upsert` is best-effort 2-step (Lance commit, then SQLite `INSERT OR REPLACE INTO embedding_records`). On SQLite failure after Lance commit, log a warning; the next `upsert` reconciles via the `UNIQUE(chunk_id, model_id, model_version, dimensions)` constraint. +- Dimension mismatch (record dim ≠ table dim) returns `anyhow::Error` from `upsert` and writes nothing. +- `search` performs cosine similarity, applies `SearchFilters` post-fetch (filter-then-limit may over-fetch internally — fetch `2 * k` then trim). +- `VectorHit { chunk_id, score, doc_id, text, heading_path }`; score in [0, 1] (cosine similarity, clamped). +- `search` returns empty `Vec` (not error) when table absent. +- `index_id` for `ensure_table` per design §4.2 with `collection = "chunk_embeddings"`, `index_kind = "flat"`, `params_hash = blake3(serde_json(table_schema))`. + +## Storage / wire effects + +- Writes Lance tables under `data_dir/lancedb/`. +- Writes/reads `embedding_records` rows. +- Reads chunks/documents not from this crate (the caller pre-fetches text + heading via `VectorRecord`). + +## Test plan + +| kind | description | fixture / data | +|------|-------------|----------------| +| unit | `ensure_table` creates dir; second call returns same `IndexId` | tmp data_dir | +| unit | `upsert` of 10 records makes them retrievable via `search` (k=5) | tmp data_dir | +| unit | dimension mismatch → error, no Lance row written | tmp data_dir | +| unit | filter `tags_any` removes non-matching docs | tmp data_dir + seeded sqlite tags | +| unit | model isolation: two models live in two directories with same `chunk_id` | tmp data_dir | +| unit | search before any upsert returns empty Vec | tmp data_dir | +| determinism | same query vector + same data → same top-k order | tmp data_dir | +| snapshot | `Vec` JSON for fixed corpus stable | `fixtures/vector/run-1.json` | + +All tests under `cargo test -p kb-store-vector`. + +## Definition of Done + +- [ ] `cargo check -p kb-store-vector` passes +- [ ] `cargo test -p kb-store-vector` passes +- [ ] No imports outside Allowed dependencies +- [ ] `embedding_records` rows align 1:1 with Lance rows after a successful upsert batch +- [ ] PR links design §5.6, §6.3, §7.2 + +## Out of scope + +- IVF / PQ index tuning (P+). +- Image / multimodal vector tables (P6). +- `kb-app` orchestration of indexing jobs (`embed_index` facade method body). + +## Risks / notes + +- LanceDB's Rust API requires Arrow batches; constructing them per upsert is allocation-heavy — batch by configurable chunk size to avoid memory spikes. +- Filter-then-limit can starve `k` results; over-fetch by `2 * k` initially and double on retry up to a cap. +- WAL stability: ensure Lance commits before SQLite `INSERT INTO embedding_records` to avoid orphan SQLite rows. diff --git a/tasks/p3/p3-4-hybrid-fusion.md b/tasks/p3/p3-4-hybrid-fusion.md new file mode 100644 index 0000000..95ec8e6 --- /dev/null +++ b/tasks/p3/p3-4-hybrid-fusion.md @@ -0,0 +1,145 @@ +--- +phase: P3 +component: kb-search (hybrid) +task_id: p3-4 +title: "Hybrid Retriever (RRF) over lexical + vector" +status: planned +depends_on: [p2-2, p3-3] +unblocks: [p4-3] +contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md +contract_sections: [§3.7 RetrievalDetail, §0 Q3, §1.6 search --explain, §6.4 [search] rrf settings] +--- + +# p3-4 — Hybrid Retriever (RRF) + +## Goal + +Compose `LexicalRetriever` (p2-2) and a vector retriever wrapper around `LanceVectorStore` (p3-3) into a single `Retriever` that dispatches by `SearchMode`. For `Hybrid`, fuse via Reciprocal Rank Fusion (RRF) and populate full `RetrievalDetail` per `SearchHit`. + +## Why now / why this size + +Single mediator. Keeps the lexical and vector retrievers focused; only this task knows how to fuse. RAG (p4-3) consumes hybrid output without caring about the underlying retrievers. + +## Allowed dependencies + +- `kb-core` +- `kb-config` +- `kb-store-sqlite` (for `LexicalRetriever`) +- `kb-store-vector` (for `LanceVectorStore`) +- `kb-embed` (trait only — for query embedding via `Embedder`) +- `tracing` +- `thiserror` + +## Forbidden dependencies + +- `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-llm*`, `kb-rag`, `kb-tui`, `kb-desktop`. (`kb-embed-local` is a runtime-injected `dyn Embedder`; this crate must not depend on the concrete adapter directly.) + +## Inputs + +| input | type | source | +|-------|------|--------| +| `LexicalRetriever` | trait object | constructed elsewhere | +| `LanceVectorStore` | trait object | constructed elsewhere | +| `Box` | for query embedding | runtime-injected | +| `kb-config::Config.search` | `default_k`, `hybrid_fusion`, `rrf_k` | runtime | +| `SearchQuery` | `kb_core::SearchQuery` | `kb-app::search` | + +## Outputs + +| output | type | downstream | +|--------|------|------------| +| `Vec` (with full `RetrievalDetail`) | `kb_core::SearchHit` | `kb-cli` printer, `kb-rag` packer | + +## Public surface (signatures only — no new types) + +```rust +pub struct HybridRetriever { + lexical: std::sync::Arc, + vector: std::sync::Arc, // wrapper over LanceVectorStore + Embedder + fusion: FusionPolicy, + k: usize, +} + +pub enum FusionPolicy { Rrf { k_rrf: u32 } } + +impl HybridRetriever { + pub fn new( + config: &kb_config::Config, + lexical: std::sync::Arc, + vector: std::sync::Arc, + ) -> Self; +} + +impl kb_core::Retriever for HybridRetriever { + fn search(&self, query: &kb_core::SearchQuery) -> anyhow::Result>; + fn index_version(&self) -> kb_core::IndexVersion; +} + +/// Wrapper that turns a VectorStore + Embedder into a Retriever. +pub struct VectorRetriever { + store: std::sync::Arc, + embed: std::sync::Arc, + /* heading_path/snippet enrichment hits SQLite via kb-store-sqlite read accessor */ +} +impl VectorRetriever { + pub fn new(store: std::sync::Arc, embed: std::sync::Arc, sqlite: std::sync::Arc) -> Self; +} +impl kb_core::Retriever for VectorRetriever { /* per §7.2 */ } +``` + +## Behavior contract + +- `SearchMode::Lexical` dispatches solely to `lexical`. `RetrievalDetail.method = Lexical`, `vector_*` fields are `None`. +- `SearchMode::Vector` dispatches solely to `vector`. `RetrievalDetail.method = Vector`, `lexical_*` fields are `None`. +- `SearchMode::Hybrid`: + - run `lexical.search(query)` and `vector.search(query)` in sequence (fan-out is fine; not required). + - fuse with RRF: `score(c) = Σ_{m ∈ {lex, vec}} 1 / (k_rrf + rank_m(c))` where `k_rrf` from config (default 60). `rank_m` is 1-based; chunks not appearing in retriever `m` contribute 0. + - sort by fused score DESC, take top `query.k`. + - populate every `SearchHit.retrieval`: `method = Hybrid`, `lexical_score` / `lexical_rank` / `vector_score` / `vector_rank` from each retriever's hit (or `None` if absent), `fusion_score` = computed fused score. + - if a chunk appears in only one retriever, its `RetrievalDetail` still gets populated with `Some(...)` from that side and `None` for the other. + - tie-break by `lexical_rank` ascending, then `chunk_id` ascending (deterministic). +- `VectorRetriever`: + - embeds the query via `embed.embed(&[EmbeddingInput { text: query.text, kind: Query }])`. + - calls `VectorStore::search(query_vec, query.k * 2, query.filters)` (over-fetch for filter losses), trims to `k`. + - hydrates `doc_path` / `heading_path` / `section_label` / `chunker_version` / `embedding_model` from SQLite by joining on `chunk_id`. + - builds `Citation` from chunk's first source span (same logic as p2-2). +- `index_version()` returns the lexical index version when in pure lexical mode, else the vector index version, else "hybrid:+". + +## Storage / wire effects + +- Reads only. No mutations. +- Output JSON conforms to `search_hit.v1`. + +## Test plan + +| kind | description | fixture / data | +|------|-------------|----------------| +| unit | pure lexical mode delegates 1:1 to `lexical.search` | mock retrievers | +| unit | pure vector mode delegates 1:1 to `vector.search` | mock retrievers | +| unit | hybrid: chunk only in lexical receives `vector_*: None`, but still has a fused score | mock retrievers | +| unit | RRF formula matches expected with `k_rrf=60` | inline math test | +| unit | tie-break deterministic (same fused score → stable order) | inline | +| unit | hybrid recall ≥ max(lexical recall, vector recall) on a tiny corpus where each mode finds disjoint hits | tmp DB + Lance + MockEmbedder | +| determinism | identical query twice → byte-identical `Vec` | tmp DB | +| snapshot | hybrid output JSON stable | `fixtures/search/hybrid/run-1.json` | + +All tests under `cargo test -p kb-search hybrid`. + +## Definition of Done + +- [ ] `cargo check -p kb-search` passes +- [ ] `cargo test -p kb-search hybrid` passes +- [ ] No imports outside Allowed dependencies +- [ ] PR links design §3.7, §6.4 search, §0 Q3 + +## Out of scope + +- Reranker (P+). +- Multimodal retrieval (image/audio) — P6+. +- Score calibration across modes (RRF makes scores rank-comparable; absolute calibration is P+). + +## Risks / notes + +- Mismatched `index_version` between lexical and vector should be flagged at construction so users notice stale indexes. +- Over-fetching at the vector retriever (`2 * k`) is conservative; if filters reject everything, the hybrid `k` may shrink. Document this in CLI `--explain`. +- RRF is rank-based, so absolute lexical bm25 normalization (p2-2) doesn't affect fused order; still keep normalization for `--explain` readability.