Files

altair823 f9714aa5cb docs(rename): kb → kebab — README, tasks/, docs/, design doc, report

마지막 commit. 모든 .md 안의 `kb` 단어 일괄 갱신.

- 19 개 crate 이름 (`kb-core`, `kb-app`, …) → `kebab-*` (Rust 모듈
  path 표기 `kb_*` → `kebab_*` 포함).
- 미래 component (`kb-tui`, `kb-desktop`, `kb-asr-whisper`, `kb-ocr`,
  `kb-mcp`, `kb-vlm`, `kb-rerank`, `kb-vision-ocr`, `kb-index`,
  `kb-smoke`, `kb-architecture`) → `kebab-*` (P6+ 가 시작될 때
  같은 prefix 사용).
- CLI 명령 예제: `kb ingest` / `kb search` / `kb ask` / `kb init` /
  `kb doctor` / `kb inspect` / `kb list` / `kb eval` →
  `kebab <verb>`. fenced code block + 인라인 backtick 모두.
- XDG paths + env vars + binary 경로 (`target/release/kb` →
  `target/release/kebab`) 동기화.
- design doc / 최초 보고서 / SMOKE / HOTFIXES / phase epic / task
  spec 모든 reference 통일.
- task-decomposition.md 의 `git -c user.name=kb` 는 과거 git history
  기록용 author 정보라 그대로 유지 (실제 git history 의 author 는
  변경 불가).
- `tasks/phase-5-evaluation.md` 의 `status: planned` →
  `completed` 도 같이 (P5-1 + P5-2 PR 머지 후 미반영분).

## 검증

- `grep -rEn "\bkb-[a-z]|\bkb_[a-z]|\.config/kb\b|kb\.sqlite|\bKB_[A-Z]"
   --include="*.md"` 0 hits (task-decomposition.md 의 git author
  제외).
- 모든 file path reference 살아있음 (renamed file 들 모두 새 path
  로 update).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-02 04:01:55 +00:00

6.0 KiB

Raw Permalink Blame History

phase, component, task_id, title, status, depends_on, unblocks, contract_source, contract_sections

phase

component

task_id

title

status

depends_on

unblocks

contract_source

contract_sections

kebab-search (lexical mode)

p2-2

Lexical Retriever via SQLite FTS5 + bm25 + citation

completed

p2-1

p3-4

p4-3

../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md

§3.7 SearchQuery/Hit

§0 Q3 citation (URI fragment)

§1.5/1.6 search output

§2.2 wire schema

§6.4 search settings

p2-2 — Lexical Retriever (FTS5 + bm25)

Goal

Implement kebab_core::Retriever for SearchMode::Lexical using SQLite FTS5. Returns SearchHit with bm25 ranking, snippet()-derived preview, and proper W3C-fragment citation.

Why now / why this size

First concrete Retriever. Lets kebab search --mode lexical work without any embedding/LLM infrastructure. Establishes the SearchHit construction contract that hybrid (p3-4) reuses.

Allowed dependencies

kebab-core
kebab-config
kebab-store-sqlite (read access to chunks_fts + chunks + documents)
rusqlite
tracing
thiserror

Forbidden dependencies

kebab-source-fs, kebab-parse-md, kebab-normalize, kebab-chunk, kebab-store-vector, kebab-embed*, kebab-llm*, kebab-rag, kebab-tui, kebab-desktop

Inputs

input	type	source
`SearchQuery` (mode=Lexical)	`kebab_core::SearchQuery`	`kebab-app::search`
`kebab-config::search` settings (`default_k`, `snippet_chars`)	`kebab_config::Config`	runtime
SQLite connection (read)	`rusqlite::Connection`	`kebab-store-sqlite`

Outputs

output	type	downstream
`Vec<SearchHit>`	`kebab_core::SearchHit`	`kebab-cli` printer, `kebab-rag` packer (P4), hybrid (p3-4)

Public surface (signatures only — no new types)

pub struct LexicalRetriever { /* internal: holds an Arc<rusqlite::Connection> + IndexVersion */ }

impl LexicalRetriever {
    pub fn new(store: std::sync::Arc<kebab_store_sqlite::SqliteStore>, index_version: kebab_core::IndexVersion) -> Self;
}

impl kebab_core::Retriever for LexicalRetriever {
    fn search(&self, query: &kebab_core::SearchQuery) -> anyhow::Result<Vec<kebab_core::SearchHit>>;
    fn index_version(&self) -> kebab_core::IndexVersion;
}

Behavior contract

SQL pattern (read-only):

SELECT
  f.chunk_id, f.doc_id,
  bm25(chunks_fts) AS score,
  snippet(chunks_fts, 3, '', '', '…', :snippet_words) AS snippet,
  c.heading_path_json, c.section_label, c.source_spans_json, c.chunker_version,
  d.workspace_path, d.title
FROM chunks_fts f
JOIN chunks c   ON c.chunk_id = f.chunk_id
JOIN documents d ON d.doc_id = f.doc_id
WHERE chunks_fts MATCH :match
ORDER BY score
LIMIT :k

with score ASC because SQLite FTS5 returns negative bm25 (lower = better). Convert to a positive normalized score for SearchHit.retrieval.fusion_score: score = -bm25_raw / (1 + abs(bm25_raw)) (bounded ~[0,1]).

:match building: tokenize the query string conservatively (split on whitespace, escape FTS5 special chars, default to AND of terms; if the user supplied an explicit FTS5 expression, pass it through when wrapped in single quotes).
:snippet_words derived from config.search.snippet_chars / 4 (~chars-per-token estimate). Snippet length must not exceed snippet_chars characters.
SearchHit.citation constructed from chunks.source_spans_json first span:
- Line → Citation::Line { path, start, end, section: section_label }
- Page → Citation::Page { path, page, section: section_label }
- other variants → forwarded as-is.
SearchHit.retrieval = RetrievalDetail { method: SearchMode::Lexical, lexical_score: Some(normalized), vector_score: None, fusion_score: normalized, lexical_rank: Some(rank), vector_rank: None }.
index_version() returns the IndexVersion configured at construction (e.g., "v1.0").
Filters (SearchFilters):
- tags_any → join document_tags and add IN (:tags) condition
- lang → documents.lang = :lang
- path_glob → SQL LIKE with glob translated via globset
- trust_min → ordered enum compare
Empty match string returns Ok(vec![]) (no error).
Determinism: same DB + same query → same Vec<SearchHit> order.

Storage / wire effects

Reads only. Never mutates kebab.sqlite.
Wire: Vec<SearchHit> serialized via wire schema search_hit.v1 when kebab-cli --json is used.

Test plan

kind	description	fixture / data
unit	empty corpus → empty `Vec<SearchHit>`	tmp DB
unit	single-doc corpus matches keyword and returns 1 hit with citation	tmp DB seeded from `fixtures/markdown/code-and-table.md`
unit	snippet length ≤ `snippet_chars`	tmp DB
unit	filter `tags_any=["rust"]` excludes docs without that tag	tmp DB
unit	citation line range round-trip equals chunk's `source_spans` first span	tmp DB
unit	bm25 normalization keeps top-1 score in (0, 1]	tmp DB with 3 ranked chunks
determinism	identical query twice produces identical hit order and scores	tmp DB
snapshot	`Vec<SearchHit>` JSON for fixed corpus stable	`fixtures/search/lexical/run-1.json`

All tests under cargo test -p kebab-search lexical.

Definition of Done

cargo check -p kebab-search passes
cargo test -p kebab-search lexical passes
No imports outside Allowed dependencies (cargo tree -p kebab-search audit)
Output JSON conforms to docs/wire-schema/v1/search_hit.schema.json
PR links design §3.7, §0 Q3, §2.2

Out of scope

Vector search (p3-3).
Hybrid fusion (p3-4).
Reranker (P+).
Korean morphological tokenizer (P+).

Risks / notes

bm25 raw scores depend on FTS5 internals; the normalization formula chosen here is for display + RRF input. Avoid leaking raw bm25 to wire schema.
globset translation of path_glob: ensure * does not match / to avoid surprising matches.
SQLite FTS5 query string is sensitive to special characters (", ^, *, :, (, )); always escape unless the caller explicitly opted into FTS5 syntax.

6.0 KiB Raw Permalink Blame History