kebab/tasks/p2/p2-2-lexical-retriever.md
altair823 f9714aa5cb docs(rename): kb → kebab — README, tasks/, docs/, design doc, report
Final commit. Bulk-updates the word `kb` in every .md file.

- The 19 crate names (`kb-core`, `kb-app`, …) → `kebab-*` (including
  the Rust module path notation `kb_*` → `kebab_*`).
- Future components (`kb-tui`, `kb-desktop`, `kb-asr-whisper`, `kb-ocr`,
  `kb-mcp`, `kb-vlm`, `kb-rerank`, `kb-vision-ocr`, `kb-index`,
  `kb-smoke`, `kb-architecture`) → `kebab-*` (the same prefix will be
  used when P6+ starts).
- CLI command examples: `kb ingest` / `kb search` / `kb ask` / `kb init` /
  `kb doctor` / `kb inspect` / `kb list` / `kb eval` →
  `kebab <verb>`, in both fenced code blocks and inline backticks.
- XDG paths + env vars + binary path (`target/release/kb` →
  `target/release/kebab`) synchronized.
- All references unified across the design doc / initial report /
  SMOKE / HOTFIXES / phase epics / task specs.
- `git -c user.name=kb` in task-decomposition.md is kept as-is because
  it records author information from past git history (the author in
  the actual git history cannot be changed).
- `tasks/phase-5-evaluation.md`: `status: planned` → `completed` as
  well (not yet reflected after the P5-1 + P5-2 PRs were merged).

## Verification

- `grep -rEn "\bkb-[a-z]|\bkb_[a-z]|\.config/kb\b|kb\.sqlite|\bKB_[A-Z]"
   --include="*.md"` 0 hits (excluding the git author in
  task-decomposition.md).
- All file path references are live (every renamed file is updated to
  its new path).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 04:01:55 +00:00


phase: P2
component: kebab-search (lexical mode)
task_id: p2-2
title: Lexical Retriever via SQLite FTS5 + bm25 + citation
status: completed
depends_on: p2-1
unblocks: p3-4, p4-3
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
contract_sections:
  • §3.7 SearchQuery/Hit
  • §0 Q3 citation (URI fragment)
  • §1.5/1.6 search output
  • §2.2 wire schema
  • §6.4 search settings

p2-2 — Lexical Retriever (FTS5 + bm25)

Goal

Implement kebab_core::Retriever for SearchMode::Lexical using SQLite FTS5. Returns SearchHit with bm25 ranking, snippet()-derived preview, and proper W3C-fragment citation.

Why now / why this size

First concrete Retriever. Lets kebab search --mode lexical work without any embedding/LLM infrastructure. Establishes the SearchHit construction contract that hybrid (p3-4) reuses.

Allowed dependencies

  • kebab-core
  • kebab-config
  • kebab-store-sqlite (read access to chunks_fts + chunks + documents)
  • rusqlite
  • tracing
  • thiserror

Forbidden dependencies

  • kebab-source-fs, kebab-parse-md, kebab-normalize, kebab-chunk, kebab-store-vector, kebab-embed*, kebab-llm*, kebab-rag, kebab-tui, kebab-desktop

Inputs

| input | type | source |
|---|---|---|
| SearchQuery (mode=Lexical) | kebab_core::SearchQuery | kebab-app::search |
| kebab-config::search settings (default_k, snippet_chars) | kebab_config::Config | runtime |
| SQLite connection (read) | rusqlite::Connection | kebab-store-sqlite |

Outputs

| output | type | downstream |
|---|---|---|
| Vec<SearchHit> | kebab_core::SearchHit | kebab-cli printer, kebab-rag packer (P4), hybrid (p3-4) |

Public surface (signatures only — no new types)

pub struct LexicalRetriever { /* internal: holds an Arc<rusqlite::Connection> + IndexVersion */ }

impl LexicalRetriever {
    pub fn new(store: std::sync::Arc<kebab_store_sqlite::SqliteStore>, index_version: kebab_core::IndexVersion) -> Self;
}

impl kebab_core::Retriever for LexicalRetriever {
    fn search(&self, query: &kebab_core::SearchQuery) -> anyhow::Result<Vec<kebab_core::SearchHit>>;
    fn index_version(&self) -> kebab_core::IndexVersion;
}

Behavior contract

  • SQL pattern (read-only):
    SELECT
      f.chunk_id, f.doc_id,
      bm25(chunks_fts) AS score,
      snippet(chunks_fts, 3, '', '', '…', :snippet_words) AS snippet,
      c.heading_path_json, c.section_label, c.source_spans_json, c.chunker_version,
      d.workspace_path, d.title
    FROM chunks_fts f
    JOIN chunks c   ON c.chunk_id = f.chunk_id
    JOIN documents d ON d.doc_id = f.doc_id
    WHERE chunks_fts MATCH :match
    ORDER BY score
    LIMIT :k
    
    Rows are ordered by score ascending because SQLite FTS5's bm25() returns negative values (lower = better). Convert the raw value to a positive normalized score for SearchHit.retrieval.fusion_score: score = -bm25_raw / (1 + abs(bm25_raw)), which lies in [0, 1) for bm25_raw ≤ 0 and grows as the match improves.
  • :match building: tokenize the query string conservatively (split on whitespace, escape FTS5 special chars, default to AND of terms; if the user supplied an explicit FTS5 expression, pass it through when wrapped in single quotes).
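  • :snippet_words is derived from config.search.snippet_chars / 4 (a rough chars-per-token estimate) and must stay within the 1 to 64 token range that FTS5 snippet() accepts. Snippet length must not exceed snippet_chars characters, so truncate the returned text if the estimate overshoots.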
  • SearchHit.citation constructed from chunks.source_spans_json first span:
    • Line → Citation::Line { path, start, end, section: section_label }
    • Page → Citation::Page { path, page, section: section_label }
    • other variants → forwarded as-is.
  • SearchHit.retrieval = RetrievalDetail { method: SearchMode::Lexical, lexical_score: Some(normalized), vector_score: None, fusion_score: normalized, lexical_rank: Some(rank), vector_rank: None }.
  • index_version() returns the IndexVersion configured at construction (e.g., "v1.0").
  • Filters (SearchFilters):
    • tags_any → join document_tags and add IN (:tags) condition
    • lang → documents.lang = :lang
    • path_glob → SQL LIKE with glob translated via globset
    • trust_min → ordered enum compare
  • Empty match string returns Ok(vec![]) (no error).
  • Determinism: same DB + same query → same Vec<SearchHit> order.
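
The :match and :snippet_words rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the crate's actual code: build_match, snippet_words, and cap_snippet are hypothetical helper names; quoting each term as an FTS5 string (with embedded double quotes doubled) is one assumed way to "escape FTS5 special chars"; and the clamp to 1..=64 reflects the token limit SQLite documents for snippet().

```rust
/// Hypothetical helper: build the FTS5 MATCH string from raw user input.
/// Returns None for an empty query so the caller can return Ok(vec![]).
fn build_match(raw: &str) -> Option<String> {
    let trimmed = raw.trim();
    if trimmed.is_empty() {
        return None;
    }
    // Explicit FTS5 expression: passed through when wrapped in single quotes.
    if trimmed.len() >= 2 && trimmed.starts_with('\'') && trimmed.ends_with('\'') {
        let inner = trimmed[1..trimmed.len() - 1].trim();
        return if inner.is_empty() {
            None
        } else {
            Some(inner.to_string())
        };
    }
    // Default: quote every whitespace-separated term (doubling any embedded
    // double quote) and join the quoted terms with AND.
    let terms: Vec<String> = trimmed
        .split_whitespace()
        .map(|t| format!("\"{}\"", t.replace('"', "\"\"")))
        .collect();
    Some(terms.join(" AND "))
}

/// Hypothetical helper: token budget for snippet(), derived from snippet_chars / 4
/// and clamped to the 1..=64 range that FTS5 accepts.
fn snippet_words(snippet_chars: usize) -> usize {
    (snippet_chars / 4).clamp(1, 64)
}

/// Hypothetical helper: enforce the hard character cap on the returned snippet.
/// Counts Unicode scalar values, not bytes, so multi-byte text is never split.
fn cap_snippet(snippet: &str, snippet_chars: usize) -> String {
    snippet.chars().take(snippet_chars).collect()
}
```

With this sketch, the input `rust fts5` becomes `"rust" AND "fts5"`, a whitespace-only query yields None, and `'title: rust*'` is forwarded as `title: rust*`. Quoting every term is deliberately conservative: it disables prefix and column syntax unless the caller opts in through the single-quote form.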

Storage / wire effects

  • Reads only. Never mutates kebab.sqlite.
  • Wire: Vec<SearchHit> serialized via wire schema search_hit.v1 when kebab-cli --json is used.

Test plan

| kind | description | fixture / data |
|---|---|---|
| unit | empty corpus → empty Vec<SearchHit> | tmp DB |
| unit | single-doc corpus matches keyword and returns 1 hit with citation | tmp DB seeded from fixtures/markdown/code-and-table.md |
| unit | snippet length ≤ snippet_chars | tmp DB |
| unit | filter tags_any=["rust"] excludes docs without that tag | tmp DB |
| unit | citation line range round-trip equals chunk's source_spans first span | tmp DB |
| unit | bm25 normalization keeps top-1 score in (0, 1] | tmp DB with 3 ranked chunks |
| determinism | identical query twice produces identical hit order and scores | tmp DB |
| snapshot | Vec<SearchHit> JSON for fixed corpus stable | fixtures/search/lexical/run-1.json |

All tests under cargo test -p kebab-search lexical.
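
The bm25 normalization test can be reasoned about with a small sketch of the formula from the behavior contract. The function names are hypothetical, and 1-based ranks are an assumption; the spec only says lexical_rank is Some(rank).

```rust
/// Map a raw FTS5 bm25 value (negative, lower = better) to a positive score
/// using score = -bm25_raw / (1 + abs(bm25_raw)).
fn normalize_bm25(bm25_raw: f64) -> f64 {
    -bm25_raw / (1.0 + bm25_raw.abs())
}

/// Turn raw bm25 values, already sorted ascending by the SQL ORDER BY,
/// into (rank, normalized score) pairs. Ranks start at 1 (assumption).
fn rank_and_normalize(raw_sorted: &[f64]) -> Vec<(u32, f64)> {
    raw_sorted
        .iter()
        .enumerate()
        .map(|(i, &raw)| (i as u32 + 1, normalize_bm25(raw)))
        .collect()
}
```

A raw value of -3.0 maps to 0.75 and -1.0 maps to 0.5, so the better (more negative) raw score keeps the higher normalized score. The result approaches but never reaches 1, which is consistent with the (0, 1] bound in the test plan.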

Definition of Done

  • cargo check -p kebab-search passes
  • cargo test -p kebab-search lexical passes
  • No imports outside Allowed dependencies (cargo tree -p kebab-search audit)
  • Output JSON conforms to docs/wire-schema/v1/search_hit.schema.json
  • PR links design §3.7, §0 Q3, §2.2

Out of scope

  • Vector search (p3-3).
  • Hybrid fusion (p3-4).
  • Reranker (P+).
  • Korean morphological tokenizer (P+).

Risks / notes

  • bm25 raw scores depend on FTS5 internals; the normalization formula chosen here is for display + RRF input. Avoid leaking raw bm25 to wire schema.
  • globset translation of path_glob: ensure * does not match / to avoid surprising matches.
  • SQLite FTS5 query string is sensitive to special characters (", ^, *, :, (, )); always escape unless the caller explicitly opted into FTS5 syntax.
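
The path_glob risk above comes down to one semantic: `*` must stop at a path separator. A std-only sketch of that semantic is shown below so the expected behavior can be pinned down in tests. It is an illustration, not the planned implementation: glob_match_no_sep and match_chars are hypothetical names, only `*` and `?` are handled, and the real code is expected to use globset (whose GlobBuilder has a literal_separator option for this purpose, to the best of my knowledge). A plain SQL LIKE `%` matches `/` as well, so LIKE alone cannot express this rule.

```rust
/// Hypothetical matcher: `*` matches any run of characters except '/',
/// `?` matches exactly one character except '/', everything else is literal.
fn glob_match_no_sep(pattern: &str, path: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let s: Vec<char> = path.chars().collect();
    match_chars(&p, &s)
}

// Simple backtracking matcher; fine for short patterns, not tuned for speed.
fn match_chars(p: &[char], s: &[char]) -> bool {
    match p.split_first() {
        // Pattern exhausted: succeed only if the path is exhausted too.
        None => s.is_empty(),
        Some((&'*', rest)) => {
            // Let `*` consume 0..n characters, never stepping over a '/'.
            let mut i = 0;
            loop {
                if match_chars(rest, &s[i..]) {
                    return true;
                }
                if i == s.len() || s[i] == '/' {
                    return false;
                }
                i += 1;
            }
        }
        Some((&'?', rest)) => match s.split_first() {
            Some((&c, tail)) if c != '/' => match_chars(rest, tail),
            _ => false,
        },
        Some((&c, rest)) => match s.split_first() {
            Some((&d, tail)) if c == d => match_chars(rest, tail),
            _ => false,
        },
    }
}
```

Under these rules `notes/*.md` matches `notes/a.md` but not `notes/sub/a.md`, which is the behavior the risk note asks for.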