Files
kebab/tasks/p2/p2-2-lexical-retriever.md
altair823 f9714aa5cb docs(rename): kb → kebab — README, tasks/, docs/, design doc, report
마지막 commit. 모든 .md 안의 `kb` 단어 일괄 갱신.

- 19 개 crate 이름 (`kb-core`, `kb-app`, …) → `kebab-*` (Rust 모듈
  path 표기 `kb_*` → `kebab_*` 포함).
- 미래 component (`kb-tui`, `kb-desktop`, `kb-asr-whisper`, `kb-ocr`,
  `kb-mcp`, `kb-vlm`, `kb-rerank`, `kb-vision-ocr`, `kb-index`,
  `kb-smoke`, `kb-architecture`) → `kebab-*` (P6+ 가 시작될 때
  같은 prefix 사용).
- CLI 명령 예제: `kb ingest` / `kb search` / `kb ask` / `kb init` /
  `kb doctor` / `kb inspect` / `kb list` / `kb eval` →
  `kebab <verb>`. fenced code block + 인라인 backtick 모두.
- XDG paths + env vars + binary 경로 (`target/release/kb` →
  `target/release/kebab`) 동기화.
- design doc / 최초 보고서 / SMOKE / HOTFIXES / phase epic / task
  spec 모든 reference 통일.
- task-decomposition.md 의 `git -c user.name=kb` 는 과거 git history
  기록용 author 정보라 그대로 유지 (실제 git history 의 author 는
  변경 불가).
- `tasks/phase-5-evaluation.md` 의 `status: planned` →
  `completed` 도 같이 (P5-1 + P5-2 PR 머지 후 미반영분).

## 검증

- `grep -rEn "\bkb-[a-z]|\bkb_[a-z]|\.config/kb\b|kb\.sqlite|\bKB_[A-Z]"
   --include="*.md"` 0 hits (task-decomposition.md 의 git author
  제외).
- 모든 file path reference 살아있음 (renamed file 들 모두 새 path
  로 update).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 04:01:55 +00:00

139 lines
6.0 KiB
Markdown

---
phase: P2
component: kebab-search (lexical mode)
task_id: p2-2
title: "Lexical Retriever via SQLite FTS5 + bm25 + citation"
status: completed
depends_on: [p2-1]
unblocks: [p3-4, p4-3]
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
contract_sections: [§3.7 SearchQuery/Hit, §0 Q3 citation (URI fragment), §1.5/1.6 search output, §2.2 wire schema, §6.4 search settings]
---
# p2-2 — Lexical Retriever (FTS5 + bm25)
## Goal
Implement `kebab_core::Retriever` for `SearchMode::Lexical` using SQLite FTS5. Returns `SearchHit` with `bm25` ranking, `snippet()`-derived preview, and proper W3C-fragment citation.
## Why now / why this size
First concrete `Retriever`. Lets `kebab search --mode lexical` work without any embedding/LLM infrastructure. Establishes the SearchHit construction contract that hybrid (p3-4) reuses.
## Allowed dependencies
- `kebab-core`
- `kebab-config`
- `kebab-store-sqlite` (read access to `chunks_fts` + `chunks` + `documents`)
- `rusqlite`
- `tracing`
- `thiserror`
## Forbidden dependencies
- `kebab-source-fs`, `kebab-parse-md`, `kebab-normalize`, `kebab-chunk`, `kebab-store-vector`, `kebab-embed*`, `kebab-llm*`, `kebab-rag`, `kebab-tui`, `kebab-desktop`
## Inputs
| input | type | source |
|-------|------|--------|
| `SearchQuery` (mode=Lexical) | `kebab_core::SearchQuery` | `kebab-app::search` |
| `kebab-config::search` settings (`default_k`, `snippet_chars`) | `kebab_config::Config` | runtime |
| SQLite connection (read) | `rusqlite::Connection` | `kebab-store-sqlite` |
## Outputs
| output | type | downstream |
|--------|------|------------|
| `Vec<SearchHit>` | `kebab_core::SearchHit` | `kebab-cli` printer, `kebab-rag` packer (P4), hybrid (p3-4) |
## Public surface (signatures only — no new types)
```rust
pub struct LexicalRetriever { /* internal: holds an Arc<rusqlite::Connection> + IndexVersion */ }
impl LexicalRetriever {
pub fn new(store: std::sync::Arc<kebab_store_sqlite::SqliteStore>, index_version: kebab_core::IndexVersion) -> Self;
}
impl kebab_core::Retriever for LexicalRetriever {
fn search(&self, query: &kebab_core::SearchQuery) -> anyhow::Result<Vec<kebab_core::SearchHit>>;
fn index_version(&self) -> kebab_core::IndexVersion;
}
```
## Behavior contract
- SQL pattern (read-only):
```sql
SELECT
f.chunk_id, f.doc_id,
bm25(chunks_fts) AS score,
snippet(chunks_fts, 3, '', '', '…', :snippet_words) AS snippet,
c.heading_path_json, c.section_label, c.source_spans_json, c.chunker_version,
d.workspace_path, d.title
FROM chunks_fts f
JOIN chunks c ON c.chunk_id = f.chunk_id
JOIN documents d ON d.doc_id = f.doc_id
WHERE chunks_fts MATCH :match
ORDER BY score
LIMIT :k
```
with `score` ASC because SQLite FTS5 returns negative bm25 (lower = better). Convert to a positive normalized score for `SearchHit.retrieval.fusion_score`: `score = -bm25_raw / (1 + abs(bm25_raw))` (bounded ~[0,1]).
- `:match` building: tokenize the query string conservatively (split on whitespace, escape FTS5 special chars, default to AND of terms; if the user supplied an explicit FTS5 expression, pass it through when wrapped in single quotes).
- `:snippet_words` derived from `config.search.snippet_chars / 4` (~chars-per-token estimate). Snippet length must not exceed `snippet_chars` characters.
- `SearchHit.citation` constructed from `chunks.source_spans_json` first span:
- `Line` → `Citation::Line { path, start, end, section: section_label }`
- `Page` → `Citation::Page { path, page, section: section_label }`
- other variants → forwarded as-is.
- `SearchHit.retrieval` = `RetrievalDetail { method: SearchMode::Lexical, lexical_score: Some(normalized), vector_score: None, fusion_score: normalized, lexical_rank: Some(rank), vector_rank: None }`.
- `index_version()` returns the `IndexVersion` configured at construction (e.g., `"v1.0"`).
- Filters (`SearchFilters`):
- `tags_any` → join `document_tags` and add `IN (:tags)` condition
- `lang` → `documents.lang = :lang`
- `path_glob` → SQL `LIKE` with glob translated via `globset`
- `trust_min` → ordered enum compare
- Empty match string returns `Ok(vec![])` (no error).
- Determinism: same DB + same query → same `Vec<SearchHit>` order.
## Storage / wire effects
- Reads only. Never mutates `kebab.sqlite`.
- Wire: `Vec<SearchHit>` serialized via wire schema `search_hit.v1` when `kebab-cli --json` is used.
## Test plan
| kind | description | fixture / data |
|------|-------------|----------------|
| unit | empty corpus → empty `Vec<SearchHit>` | tmp DB |
| unit | single-doc corpus matches keyword and returns 1 hit with citation | tmp DB seeded from `fixtures/markdown/code-and-table.md` |
| unit | snippet length ≤ `snippet_chars` | tmp DB |
| unit | filter `tags_any=["rust"]` excludes docs without that tag | tmp DB |
| unit | citation line range round-trip equals chunk's `source_spans` first span | tmp DB |
| unit | bm25 normalization keeps top-1 score in (0, 1] | tmp DB with 3 ranked chunks |
| determinism | identical query twice produces identical hit order and scores | tmp DB |
| snapshot | `Vec<SearchHit>` JSON for fixed corpus stable | `fixtures/search/lexical/run-1.json` |
All tests under `cargo test -p kebab-search lexical`.
## Definition of Done
- [ ] `cargo check -p kebab-search` passes
- [ ] `cargo test -p kebab-search lexical` passes
- [ ] No imports outside Allowed dependencies (`cargo tree -p kebab-search` audit)
- [ ] Output JSON conforms to `docs/wire-schema/v1/search_hit.schema.json`
- [ ] PR links design §3.7, §0 Q3, §2.2
## Out of scope
- Vector search (p3-3).
- Hybrid fusion (p3-4).
- Reranker (P+).
- Korean morphological tokenizer (P+).
## Risks / notes
- bm25 raw scores depend on FTS5 internals; the normalization formula chosen here is for display + RRF input. Avoid leaking raw bm25 to wire schema.
- `globset` translation of `path_glob`: ensure `*` does not match `/` to avoid surprising matches.
- SQLite FTS5 query string is sensitive to special characters (`"`, `^`, `*`, `:`, `(`, `)`); always escape unless the caller explicitly opted into FTS5 syntax.