---
phase: P3
component: kebab-search (hybrid)
task_id: p3-4
title: "Hybrid Retriever (RRF) over lexical + vector"
status: completed
depends_on: [p2-2, p3-3]
unblocks: [p4-3]
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
contract_sections: [§3.7 RetrievalDetail, §0 Q3, §1.6 search --explain, §6.4 [search] rrf settings]
---

# p3-4 — Hybrid Retriever (RRF)

## Goal

Compose `LexicalRetriever` (p2-2) and a vector retriever wrapper around `LanceVectorStore` (p3-3) into a single `Retriever` that dispatches by `SearchMode`. For `Hybrid`, fuse via Reciprocal Rank Fusion (RRF) and populate full `RetrievalDetail` per `SearchHit`.
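
The dispatch described above can be sketched as follows. This is a minimal illustration, not the crate's code: `SearchMode`, `Retriever`, `Fixed`, and `Mediator` are simplified, hypothetical stand-ins for the real `kebab_core` types, results are bare chunk ids, and the `Hybrid` arm only concatenates and deduplicates (real fusion is RRF, specified under the behavior contract).

```rust
/// Hypothetical, simplified stand-in for the real search mode enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SearchMode {
    Lexical,
    Vector,
    Hybrid,
}

/// Hypothetical, simplified retriever trait: returns chunk ids, best first.
pub trait Retriever {
    fn search(&self, text: &str, k: usize) -> Vec<String>;
}

/// Test double that returns a fixed ranking regardless of the query text.
pub struct Fixed(pub Vec<&'static str>);

impl Retriever for Fixed {
    fn search(&self, _text: &str, k: usize) -> Vec<String> {
        self.0.iter().take(k).map(|s| s.to_string()).collect()
    }
}

/// The single place that knows about both retrievers.
pub struct Mediator<L, V> {
    pub lexical: L,
    pub vector: V,
}

impl<L: Retriever, V: Retriever> Mediator<L, V> {
    pub fn search(&self, mode: SearchMode, text: &str, k: usize) -> Vec<String> {
        match mode {
            // Pure modes delegate 1:1 to the underlying retriever.
            SearchMode::Lexical => self.lexical.search(text, k),
            SearchMode::Vector => self.vector.search(text, k),
            // Placeholder only: union of both result lists without ranking.
            // The real hybrid path fuses by rank and trims to k.
            SearchMode::Hybrid => {
                let mut merged = self.lexical.search(text, k);
                for id in self.vector.search(text, k) {
                    if !merged.contains(&id) {
                        merged.push(id);
                    }
                }
                merged
            }
        }
    }
}
```

Callers such as the RAG packer only see the mediator, so swapping the fusion strategy would not touch either underlying retriever.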

## Why now / why this size

Single mediator. Keeps the lexical and vector retrievers focused; only this task knows how to fuse. RAG (p4-3) consumes hybrid output without caring about the underlying retrievers.

## Allowed dependencies

- `kebab-core`
- `kebab-config`
- `kebab-store-sqlite` (for `LexicalRetriever`)
- `kebab-store-vector` (for `LanceVectorStore`)
- `kebab-embed` (trait only, for query embedding via `Embedder`)
- `tracing`
- `thiserror`

## Forbidden dependencies

- `kebab-source-fs`, `kebab-parse-md`, `kebab-normalize`, `kebab-chunk`, `kebab-llm*`, `kebab-rag`, `kebab-tui`, `kebab-desktop`. (`kebab-embed-local` is a runtime-injected `dyn Embedder`; this crate must not depend on the concrete adapter directly.)

## Inputs

| input | type | source |
|-------|------|--------|
| `LexicalRetriever` | trait object (`Arc<dyn kebab_core::Retriever>`) | constructed elsewhere |
| `LanceVectorStore` | trait object (`Arc<dyn kebab_core::VectorStore>`) | constructed elsewhere |
| `Embedder` (for query embedding) | trait object (`Arc<dyn kebab_core::Embedder>`) | runtime-injected |
| `kebab-config::Config.search` | `default_k`, `hybrid_fusion`, `rrf_k` | runtime |
| `SearchQuery` | `kebab_core::SearchQuery` | `kebab-app::search` |

## Outputs

| output | type | downstream |
|--------|------|------------|
| `Vec<SearchHit>` (with full `RetrievalDetail`) | `kebab_core::SearchHit` | `kebab-cli` printer, `kebab-rag` packer |

## Public surface (signatures only — no new types)

```rust
pub struct HybridRetriever {
    lexical: std::sync::Arc<dyn kebab_core::Retriever>,
    vector: std::sync::Arc<dyn kebab_core::Retriever>, // wrapper over LanceVectorStore + Embedder
    fusion: FusionPolicy,
    k: usize,
}

pub enum FusionPolicy { Rrf { k_rrf: u32 } }

impl HybridRetriever {
    pub fn new(
        config: &kebab_config::Config,
        lexical: std::sync::Arc<dyn kebab_core::Retriever>,
        vector: std::sync::Arc<dyn kebab_core::Retriever>,
    ) -> Self;
}

impl kebab_core::Retriever for HybridRetriever {
    fn search(&self, query: &kebab_core::SearchQuery) -> anyhow::Result<Vec<kebab_core::SearchHit>>;
    fn index_version(&self) -> kebab_core::IndexVersion;
}

/// Wrapper that turns a VectorStore + Embedder into a Retriever.
pub struct VectorRetriever {
    store: std::sync::Arc<dyn kebab_core::VectorStore>,
    embed: std::sync::Arc<dyn kebab_core::Embedder>,
    /* heading_path/snippet enrichment hits SQLite via kebab-store-sqlite read accessor */
}
impl VectorRetriever {
    pub fn new(
        store: std::sync::Arc<dyn kebab_core::VectorStore>,
        embed: std::sync::Arc<dyn kebab_core::Embedder>,
        sqlite: std::sync::Arc<kebab_store_sqlite::SqliteStore>,
    ) -> Self;
}
impl kebab_core::Retriever for VectorRetriever { /* per §7.2 */ }
```

## Behavior contract

- `SearchMode::Lexical` dispatches solely to `lexical`. `RetrievalDetail.method = Lexical`, and the `vector_*` fields are `None`.
- `SearchMode::Vector` dispatches solely to `vector`. `RetrievalDetail.method = Vector`, and the `lexical_*` fields are `None`.
- `SearchMode::Hybrid`:
  - run `lexical.search(query)` and `vector.search(query)` in sequence (running them concurrently is fine but not required).
  - fuse with RRF: `raw(c) = Σ_{m ∈ {lex, vec}} 1 / (k_rrf + rank_m(c))`, where `k_rrf` comes from config (default 60). `rank_m` is 1-based; a chunk that does not appear in retriever `m` contributes 0 for that term.
  - **normalize fusion_score to [0, 1]** (post-merge fix, 2026-05): divide by `num_retrievers / (k_rrf + 1)` so that a chunk ranked first by every retriever maps to `1.0` and a chunk found by only one retriever caps at `0.5`. Without this, raw RRF tops out at `≈ 0.033` (`2 / 61` with the default `k_rrf`), which is not comparable with the `[0, 1]` `fusion_score` of the lexical and vector modes and falls below the `config.rag.score_gate` default of `0.05`, so every hybrid query was refused. RRF's rank ordering is preserved, because every score is divided by the same positive constant. See [HOTFIXES.md](../HOTFIXES.md).
  - sort by fused score DESC, take top `query.k`.
  - populate every `SearchHit.retrieval`: `method = Hybrid`, `lexical_score` / `lexical_rank` / `vector_score` / `vector_rank` from each retriever's hit (or `None` if absent), `fusion_score` = normalized fused score.
  - if a chunk appears in only one retriever, its `RetrievalDetail` still gets populated with `Some(...)` from that side and `None` for the other.
  - tie-break by `lexical_rank` ascending, then `chunk_id` ascending (deterministic).
- `VectorRetriever`:
  - embeds the query via `embed.embed(&[EmbeddingInput { text: query.text, kind: Query }])`.
  - calls `VectorStore::search(query_vec, query.k * 2, query.filters)` (over-fetch to compensate for filter losses), then trims to `k`.
  - hydrates `doc_path` / `heading_path` / `section_label` / `chunker_version` / `embedding_model` from SQLite by joining on `chunk_id`.
  - builds `Citation` from the chunk's first source span (same logic as p2-2).
- `index_version()` returns the lexical index version in pure lexical mode, the vector index version in pure vector mode, and `"hybrid:<lex_iv>+<vec_iv>"` otherwise.
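
The hybrid fusion steps above (RRF sum, normalization, sort, tie-break, trim) can be sketched as below. This is a minimal illustration under stated assumptions, not the crate's implementation: `FusedHit` and `rrf_fuse` are hypothetical names, inputs are bare chunk ids ordered best first with no duplicates inside a list, and a missing `lexical_rank` is assumed to sort last on ties (the contract does not say how `None` orders).

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for a hit plus its retrieval detail.
#[derive(Debug, Clone, PartialEq)]
pub struct FusedHit {
    pub chunk_id: String,
    pub lexical_rank: Option<u32>,
    pub vector_rank: Option<u32>,
    pub fusion_score: f64,
}

/// Fuse two ranked id lists (best first) with normalized RRF and keep the top `k`.
pub fn rrf_fuse(lexical: &[&str], vector: &[&str], k_rrf: u32, k: usize) -> Vec<FusedHit> {
    // chunk_id -> (lexical_rank, vector_rank); ranks are 1-based.
    let mut ranks: BTreeMap<&str, (Option<u32>, Option<u32>)> = BTreeMap::new();
    for (i, id) in lexical.iter().enumerate() {
        ranks.entry(*id).or_insert((None, None)).0 = Some(i as u32 + 1);
    }
    for (i, id) in vector.iter().enumerate() {
        ranks.entry(*id).or_insert((None, None)).1 = Some(i as u32 + 1);
    }

    // A retriever that did not return the chunk contributes 0.
    let term = |rank: Option<u32>| -> f64 { rank.map_or(0.0, |r| 1.0 / f64::from(k_rrf + r)) };
    // Normalizer: num_retrievers / (k_rrf + 1), i.e. the raw score of rank 1 everywhere.
    let max_raw = 2.0 / f64::from(k_rrf + 1);

    let mut fused: Vec<FusedHit> = ranks
        .into_iter()
        .map(|(id, (lex, vecr))| FusedHit {
            chunk_id: id.to_string(),
            lexical_rank: lex,
            vector_rank: vecr,
            fusion_score: (term(lex) + term(vecr)) / max_raw,
        })
        .collect();

    // Fused score DESC, then lexical_rank ASC (None last, an assumption), then chunk_id ASC.
    fused.sort_by(|a, b| {
        b.fusion_score
            .total_cmp(&a.fusion_score)
            .then_with(|| {
                a.lexical_rank
                    .unwrap_or(u32::MAX)
                    .cmp(&b.lexical_rank.unwrap_or(u32::MAX))
            })
            .then_with(|| a.chunk_id.cmp(&b.chunk_id))
    });
    fused.truncate(k);
    fused
}
```

With `k_rrf = 60`, a chunk ranked first by both lists scores `1.0`, and a chunk ranked first by only one list scores `0.5`, matching the normalization bullet.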

## Storage / wire effects

- Reads only. No mutations.
- Output JSON conforms to `search_hit.v1`.

## Test plan

| kind | description | fixture / data |
|------|-------------|----------------|
| unit | pure lexical mode delegates 1:1 to `lexical.search` | mock retrievers |
| unit | pure vector mode delegates 1:1 to `vector.search` | mock retrievers |
| unit | hybrid: chunk only in lexical receives `vector_*: None`, but still has a fused score | mock retrievers |
| unit | RRF formula matches expected with `k_rrf=60` | inline math test |
| unit | tie-break deterministic (same fused score → stable order) | inline |
| unit | hybrid recall ≥ max(lexical recall, vector recall) on a tiny corpus where each mode finds disjoint hits | tmp DB + Lance + MockEmbedder |
| determinism | identical query twice → byte-identical `Vec<SearchHit>` | tmp DB |
| snapshot | hybrid output JSON stable | `fixtures/search/hybrid/run-1.json` |

All tests under `cargo test -p kebab-search hybrid`.
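
As an illustration of the recall row in the table, the check could be built on a small helper like the one below. It is only a sketch: `recall` is a hypothetical helper, the chunk ids are made up, and the real test runs against a tmp DB, Lance, and `MockEmbedder` rather than literal id lists.

```rust
use std::collections::HashSet;

/// Fraction of relevant chunk ids that appear in `hits`.
/// Returns 1.0 when there is nothing relevant to find.
pub fn recall(hits: &[&str], relevant: &[&str]) -> f64 {
    if relevant.is_empty() {
        return 1.0;
    }
    let found: HashSet<&str> = hits.iter().copied().collect();
    let matched = relevant.iter().filter(|id| found.contains(*id)).count();
    matched as f64 / relevant.len() as f64
}
```

When each mode finds a disjoint half of the relevant set and `k` is large enough to hold both halves, the hybrid list contains every hit from both, so its recall cannot be lower than either mode's.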

## Definition of Done

- [ ] `cargo check -p kebab-search` passes
- [ ] `cargo test -p kebab-search hybrid` passes
- [ ] No imports outside Allowed dependencies
- [ ] PR links design §3.7, §6.4 search, §0 Q3

## Out of scope

- Reranker (P+).
- Multimodal retrieval (image/audio): P6+.
- Score calibration across modes (RRF makes scores rank-comparable; absolute calibration is P+).

## Risks / notes

- Mismatched `index_version` between lexical and vector should be flagged at construction so users notice stale indexes.
- Over-fetching at the vector retriever (`2 * k`) is conservative; if filters reject most of the over-fetched candidates, the hybrid result may contain fewer than `k` hits. Document this in CLI `--explain`.
- RRF is rank-based, so absolute lexical BM25 normalization (p2-2) doesn't affect fused order; still keep normalization for `--explain` readability.
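
The over-fetch risk can be made concrete with a small sketch. It assumes filtering happens after the nearest-neighbor fetch, which the spec implies ("over-fetch for filter losses") but does not spell out; `Candidate` and `over_fetch_then_trim` are hypothetical names, not part of `kebab-search`.

```rust
/// Hypothetical, simplified vector-search candidate, already ordered best first.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub chunk_id: &'static str,
    pub lang: &'static str,
}

/// Look at the top `2 * k` candidates, keep those passing `keep`, return at most `k`.
pub fn over_fetch_then_trim<F>(ranked: &[Candidate], k: usize, keep: F) -> Vec<Candidate>
where
    F: Fn(&Candidate) -> bool,
{
    ranked
        .iter()
        .take(k.saturating_mul(2)) // over-fetch window
        .filter(|c| keep(c))       // filter losses happen here
        .take(k)                   // trim back to k
        .cloned()
        .collect()
}
```

If only one of the top `2 * k` candidates passes the filter, the caller gets one hit even when further matches exist beyond the window, which is the shrinkage the `--explain` output should mention.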