tasks: add P2 component specs (fts-schema, lexical-retriever)
100
tasks/p2/p2-1-fts-schema.md
Normal file
@@ -0,0 +1,100 @@
---
phase: P2
component: kb-store-sqlite (FTS5 migration)
task_id: p2-1
title: "FTS5 virtual table + triggers (V002 migration)"
status: planned
depends_on: [p1-6]
unblocks: [p2-2]
contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
contract_sections: [§5.5 chunks_fts + triggers, §9 versioning]
---

# p2-1 — FTS5 virtual table + triggers

## Goal

Add the `chunks_fts` virtual table and three sync triggers via migration `V002__fts.sql`. Backfill any existing chunks.

## Why now / why this size

`chunks_fts` is the lexical index for `kb-search`. Splitting it from p1-6 keeps P1 focused on relational data; shipping it as `V002` lets users upgrade an existing P1 DB without re-ingesting.

## Allowed dependencies

- `kb-core`
- `kb-config`
- `kb-store-sqlite` (extends migrations)
- `rusqlite`
- `refinery`

## Forbidden dependencies

- `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-store-vector`, `kb-embed*`, `kb-search` (consumer is p2-2), `kb-llm*`, `kb-rag`, `kb-tui`, `kb-desktop`

## Inputs

| input | type | source |
|-------|------|--------|
| existing `chunks` rows | SQLite | from p1-6 |
| migration runner | `refinery` | from p1-6 |

## Outputs

| output | type | downstream |
|--------|------|------------|
| `chunks_fts` virtual table populated | SQLite | p2-2 lexical retriever |
| three triggers synced with `chunks` | SQLite | every later chunk write |

## Public surface (signatures only — no new types)

```rust
pub fn rebuild_chunks_fts(conn: &rusqlite::Connection) -> anyhow::Result<()>;
```

(Used by `kb index --rebuild-fts`. Runs `DELETE FROM chunks_fts;`, then re-runs `INSERT INTO chunks_fts SELECT ... FROM chunks`.)

## Behavior contract

- Migration file `migrations/V002__fts.sql` ships exactly the SQL in design §5.5 (FTS5 virtual table with `unicode61 remove_diacritics 2` tokenizer + `chunks_ai` / `chunks_ad` / `chunks_au` triggers).
- On migration apply, backfill: `INSERT INTO chunks_fts(chunk_id, doc_id, heading_path, text) SELECT chunk_id, doc_id, heading_path_json, text FROM chunks;`.
- `rebuild_chunks_fts` is idempotent: full delete then re-insert from `chunks`.
- Triggers ensure that every future `INSERT`/`UPDATE`/`DELETE` on `chunks` keeps `chunks_fts` in sync within the same transaction.
- `chunks_fts` row count must equal `chunks` row count after any successful migration / rebuild.

## Storage / wire effects

- Writes: `chunks_fts` virtual table inside `kb.sqlite`.
- Reads: existing `chunks` rows for backfill.

## Test plan

| kind | description | fixture / data |
|------|-------------|----------------|
| migration | apply `V002` to a DB seeded with N chunks; `chunks_fts` contains exactly N rows | tmp DB seeded |
| trigger | INSERT into `chunks` propagates to `chunks_fts` | tmp DB |
| trigger | DELETE from `chunks` removes the corresponding `chunks_fts` row | tmp DB |
| trigger | UPDATE of `chunks.text` updates `chunks_fts` text | tmp DB |
| function | `rebuild_chunks_fts` produces deterministic content equal to fresh backfill | tmp DB |
| migration | running `V002` twice is a no-op (refinery handles idempotency) | tmp DB |

All tests under `cargo test -p kb-store-sqlite fts`.

## Definition of Done

- [ ] `cargo check -p kb-store-sqlite` passes
- [ ] `cargo test -p kb-store-sqlite fts` passes
- [ ] `migrations/V002__fts.sql` matches design §5.5 verbatim (CI diff check)
- [ ] No imports outside Allowed dependencies
- [ ] PR links design §5.5

## Out of scope

- Search query implementation (p2-2).
- Vector / hybrid search (P3).
- Korean morphological tokenizer (kept as P+ note; default `unicode61 remove_diacritics 2`).

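## Risks / notes

- FTS5 triggers run inside the same transaction as their host `chunks` mutation, so bulk ingest may later need batching to stay fast.
- `chunks_fts` is a **content-less** FTS5 table per §5.5 (with UNINDEXED `chunk_id`/`doc_id`). Tests should assert on the rank order produced by `bm25(chunks_fts)`, not on raw score values. Note that an FTS5 table declared with `content=''` returns NULL for column values and rejects plain `DELETE` unless `contentless_delete=1` is set, so check the exact §5.5 declaration before relying on `DELETE FROM chunks_fts` here or on the `f.chunk_id` join and `snippet()` call in p2-2.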
138
tasks/p2/p2-2-lexical-retriever.md
Normal file
@@ -0,0 +1,138 @@
---
phase: P2
component: kb-search (lexical mode)
task_id: p2-2
title: "Lexical Retriever via SQLite FTS5 + bm25 + citation"
status: planned
depends_on: [p2-1]
unblocks: [p3-4, p4-3]
contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
contract_sections: [§3.7 SearchQuery/Hit, §0 Q3 citation (URI fragment), §1.5/1.6 search output, §2.2 wire schema, §6.4 search settings]
---

# p2-2 — Lexical Retriever (FTS5 + bm25)

## Goal

Implement `kb_core::Retriever` for `SearchMode::Lexical` using SQLite FTS5. Returns `SearchHit`s with `bm25` ranking, a `snippet()`-derived preview, and a W3C-fragment citation.

## Why now / why this size

First concrete `Retriever`. Lets `kb search --mode lexical` work without any embedding/LLM infrastructure. Establishes the `SearchHit` construction contract that hybrid (p3-4) reuses.

## Allowed dependencies

- `kb-core`
- `kb-config`
- `kb-store-sqlite` (read access to `chunks_fts` + `chunks` + `documents`)
- `rusqlite`
- `tracing`
- `thiserror`

## Forbidden dependencies

- `kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-store-vector`, `kb-embed*`, `kb-llm*`, `kb-rag`, `kb-tui`, `kb-desktop`

## Inputs

| input | type | source |
|-------|------|--------|
| `SearchQuery` (mode=Lexical) | `kb_core::SearchQuery` | `kb-app::search` |
| `kb-config::search` settings (`default_k`, `snippet_chars`) | `kb_config::Config` | runtime |
| SQLite connection (read) | `rusqlite::Connection` | `kb-store-sqlite` |

## Outputs

| output | type | downstream |
|--------|------|------------|
| `Vec<SearchHit>` | `kb_core::SearchHit` | `kb-cli` printer, `kb-rag` packer (P4), hybrid (p3-4) |

## Public surface (signatures only — no new types)

```rust
pub struct LexicalRetriever { /* internal: holds an Arc<kb_store_sqlite::SqliteStore> + IndexVersion */ }

impl LexicalRetriever {
    pub fn new(store: std::sync::Arc<kb_store_sqlite::SqliteStore>, index_version: kb_core::IndexVersion) -> Self;
}

impl kb_core::Retriever for LexicalRetriever {
    fn search(&self, query: &kb_core::SearchQuery) -> anyhow::Result<Vec<kb_core::SearchHit>>;
    fn index_version(&self) -> kb_core::IndexVersion;
}
```

## Behavior contract

- SQL pattern (read-only):
  ```sql
  SELECT
    f.chunk_id, f.doc_id,
    bm25(chunks_fts) AS score,
    snippet(chunks_fts, 3, '', '', '…', :snippet_words) AS snippet,
    c.heading_path_json, c.section_label, c.source_spans_json, c.chunker_version,
    d.workspace_path, d.title
  FROM chunks_fts f
  JOIN chunks c ON c.chunk_id = f.chunk_id
  JOIN documents d ON d.doc_id = f.doc_id
  WHERE chunks_fts MATCH :match
  ORDER BY score
  LIMIT :k
  ```
  `ORDER BY score` is ascending because SQLite FTS5 returns negative bm25 values (lower = better). Convert to a positive normalized score for `SearchHit.retrieval.fusion_score`: `score = -bm25_raw / (1 + abs(bm25_raw))`, which lies in [0, 1) for any non-positive `bm25_raw`.
- `:match` building: tokenize the query string conservatively. Split on whitespace, quote each term so FTS5 special characters are treated literally, and AND the terms together by default. If the user supplied an explicit FTS5 expression, signalled by wrapping the whole query in single quotes, pass it through unescaped.
- `:snippet_words` derived from `config.search.snippet_chars / 4` (a rough chars-per-token estimate) and clamped to 1..=64, the range FTS5 `snippet()` accepts for its token-count argument. Because that argument counts tokens, not characters, truncate the result where needed: snippet length must not exceed `snippet_chars` characters.
- `SearchHit.citation` constructed from the first span in `chunks.source_spans_json`:
  - `Line` → `Citation::Line { path, start, end, section: section_label }`
  - `Page` → `Citation::Page { path, page, section: section_label }`
  - other variants → forwarded as-is.
- `SearchHit.retrieval` = `RetrievalDetail { method: SearchMode::Lexical, lexical_score: Some(normalized), vector_score: None, fusion_score: normalized, lexical_rank: Some(rank), vector_rank: None }`.
- `index_version()` returns the `IndexVersion` configured at construction (e.g., `"v1.0"`).
- Filters (`SearchFilters`):
  - `tags_any` → join `document_tags` and add an `IN (...)` condition with one bound placeholder per tag
  - `lang` → `documents.lang = :lang`
  - `path_glob` → SQL `LIKE` with glob translated via `globset`
  - `trust_min` → ordered enum compare
- Empty match string returns `Ok(vec![])` (no error).
- Determinism: same DB + same query → same `Vec<SearchHit>` order. `ORDER BY score` alone leaves the order of tied rows unspecified, so ties need a stable secondary key (e.g. `chunk_id`).
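
The `:match` rule above can be sketched with the standard library alone. This is a minimal illustration, not the crate's actual code: `build_match` is a hypothetical helper name, and the single-quote opt-in is one reading of the contract. Each term is wrapped as an FTS5 string literal (double quotes, with embedded double quotes doubled), so characters such as `*`, `:` or `(` lose their operator meaning.

```rust
/// Hypothetical helper: builds the text bound to `:match` from raw user input.
/// Returns `None` when there is nothing to search for.
fn build_match(query: &str) -> Option<String> {
    let trimmed = query.trim();

    // Opt-in pass-through (assumed convention): a query wrapped in single
    // quotes is treated as a raw FTS5 expression and is not escaped.
    if trimmed.len() >= 2 && trimmed.starts_with('\'') && trimmed.ends_with('\'') {
        let inner = trimmed[1..trimmed.len() - 1].trim();
        return if inner.is_empty() {
            None
        } else {
            Some(inner.to_string())
        };
    }

    // Default path: quote every whitespace-separated term as an FTS5 string,
    // doubling any embedded double quote, then AND the terms together.
    let terms: Vec<String> = trimmed
        .split_whitespace()
        .map(|t| format!("\"{}\"", t.replace('"', "\"\"")))
        .collect();

    if terms.is_empty() {
        None
    } else {
        Some(terms.join(" AND "))
    }
}
```

A caller would bind the returned string as `:match` and map `None` straight to `Ok(vec![])`, which covers the empty-match rule without touching SQLite.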

## Storage / wire effects

- Reads only. Never mutates `kb.sqlite`.
- Wire: `Vec<SearchHit>` serialized via wire schema `search_hit.v1` when `kb-cli --json` is used.

## Test plan

| kind | description | fixture / data |
|------|-------------|----------------|
| unit | empty corpus → empty `Vec<SearchHit>` | tmp DB |
| unit | single-doc corpus matches keyword and returns 1 hit with citation | tmp DB seeded from `fixtures/markdown/code-and-table.md` |
| unit | snippet length ≤ `snippet_chars` | tmp DB |
| unit | filter `tags_any=["rust"]` excludes docs without that tag | tmp DB |
| unit | citation line range round-trip equals chunk's `source_spans` first span | tmp DB |
| unit | bm25 normalization keeps top-1 score in (0, 1) | tmp DB with 3 ranked chunks |
| determinism | identical query twice produces identical hit order and scores | tmp DB |
| snapshot | `Vec<SearchHit>` JSON for fixed corpus stable | `fixtures/search/lexical/run-1.json` |

All tests under `cargo test -p kb-search lexical`.
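
The two numeric rules these tests exercise (bm25 normalization and the snippet budget) are small enough to sketch without a database. The helper names `normalize_bm25`, `snippet_words` and `truncate_chars` are hypothetical and only illustrate the formulas from the behavior contract. The 1..=64 clamp reflects the range FTS5 `snippet()` accepts for its token-count argument.

```rust
/// Maps a raw FTS5 bm25 value (non-positive, lower = better) into [0, 1).
/// A better (more negative) raw value yields a larger normalized score.
fn normalize_bm25(bm25_raw: f64) -> f64 {
    -bm25_raw / (1.0 + bm25_raw.abs())
}

/// Token budget for `snippet()`: roughly four characters per token,
/// clamped to the 1..=64 range that FTS5 accepts.
fn snippet_words(snippet_chars: usize) -> usize {
    (snippet_chars / 4).clamp(1, 64)
}

/// Cuts on a character boundary so the preview never exceeds `max_chars`
/// characters, even when the text contains multi-byte characters.
fn truncate_chars(s: &str, max_chars: usize) -> String {
    s.chars().take(max_chars).collect()
}
```

Since the normalized value approaches 1 but never reaches it, a test on the top-1 score can assert a strict upper bound.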

## Definition of Done

- [ ] `cargo check -p kb-search` passes
- [ ] `cargo test -p kb-search lexical` passes
- [ ] No imports outside Allowed dependencies (`cargo tree -p kb-search` audit)
- [ ] Output JSON conforms to `docs/wire-schema/v1/search_hit.schema.json`
- [ ] PR links design §3.7, §0 Q3, §2.2

## Out of scope

- Vector search (p3-3).
- Hybrid fusion (p3-4).
- Reranker (P+).
- Korean morphological tokenizer (P+).

## Risks / notes

- Raw bm25 scores depend on FTS5 internals; the normalization formula chosen here is meant for display and as RRF input. Do not leak raw bm25 values into the wire schema.
- `globset` translation of `path_glob`: ensure `*` does not match `/` to avoid surprising matches. SQL `LIKE`'s `%` does match `/`, so a plain `*` → `%` translation is not enough by itself. `globset` is also not yet listed under Allowed dependencies; add it there or translate the glob by hand.
- SQLite FTS5 query string is sensitive to special characters (`"`, `^`, `*`, `:`, `(`, `)`); always escape unless the caller explicitly opted into FTS5 syntax.
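
One way to keep `*` from matching `/` is a two-step filter: a coarse `LIKE` pattern in SQL, then an exact check in Rust on the rows that come back. The sketch below is hand-rolled and std-only. It is not `globset`, and the names `glob_to_like`, `glob_matches` and `matches_from` are hypothetical. It assumes a reduced glob dialect in which `*` and `?` never match `/` and `**` matches any run of characters.

```rust
/// Coarse SQL LIKE pattern for a glob; pair it with `ESCAPE '\'` in the query.
/// Both `*` and `**` become `%`, so LIKE alone can over-match across `/`.
fn glob_to_like(glob: &str) -> String {
    let mut out = String::new();
    let mut last_was_star = false;
    for ch in glob.chars() {
        let is_star = ch == '*';
        match ch {
            // Collapse runs of `*` into a single `%`.
            '*' => {
                if !last_was_star {
                    out.push('%');
                }
            }
            '?' => out.push('_'),
            // Escape LIKE metacharacters that appear literally in the glob.
            '%' | '_' | '\\' => {
                out.push('\\');
                out.push(ch);
            }
            _ => out.push(ch),
        }
        last_was_star = is_star;
    }
    out
}

/// Exact check for rows that passed the LIKE prefilter.
fn glob_matches(glob: &str, path: &str) -> bool {
    let g: Vec<char> = glob.chars().collect();
    let p: Vec<char> = path.chars().collect();
    matches_from(&g, &p)
}

fn matches_from(glob: &[char], path: &[char]) -> bool {
    match glob {
        [] => path.is_empty(),
        // `**` may consume any number of characters, `/` included.
        ['*', '*', rest @ ..] => (0..=path.len()).any(|i| matches_from(rest, &path[i..])),
        ['*', rest @ ..] => {
            // `*` may consume characters only up to the next `/`.
            let limit = path.iter().position(|&c| c == '/').unwrap_or(path.len());
            (0..=limit).any(|i| matches_from(rest, &path[i..]))
        }
        ['?', rest @ ..] => match path.split_first() {
            Some((&c, tail)) if c != '/' => matches_from(rest, tail),
            _ => false,
        },
        [g, rest @ ..] => match path.split_first() {
            Some((c, tail)) if c == g => matches_from(rest, tail),
            _ => false,
        },
    }
}
```

The `LIKE` pattern only narrows the candidate set; the Rust check decides the final result, so the separator rule holds regardless of how `%` behaves.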