Adds the first concrete kb_core::Retriever, exercising chunks_fts (P2-1)
to answer SearchMode::Lexical queries. Returns Vec<SearchHit> with
bm25-derived ranking, snippet() previews, and W3C-fragment-style
Citation built from the chunk's first source_spans entry.
New crate kb-search:
- LexicalRetriever::new(Arc<SqliteStore>, IndexVersion).
- search() builds an FTS5 MATCH expression by escaping every whitespace
token into a quoted literal (inner " doubled); single-quote-wrapped
text passes through verbatim as raw FTS5 syntax. Empty query
short-circuits to Ok(vec![]).
- bm25 normalization: score = -bm25 / (1 + |bm25|), bounded (0, 1] for
any FTS5-returned negative bm25.
- Snippet via snippet(chunks_fts, 3, '', '', '…', word_budget) where
word_budget = snippet_chars / 4 clamped to [1, 64]; trim_snippet
enforces the char cap on the way out (chars per design §6.4 — accepts
the combining-mark trade-off).
- Citation from chunks.source_spans_json first span: Line / Page /
Region / Time forwarded; Byte / empty array fall back to Line{1,1}
with a tracing::warn so forward-compat regressions surface.
- Filters: tags_any (subquery on document_tags), lang (= column),
trust_min (CASE-rank in SQL) all applied at SQL level. path_glob
uses globset with literal_separator(true) — guarantees '*' does not
cross '/' per spec Risks/notes — applied as Rust post-filter with
+128 row over-fetch when set, then rank reassigned 1..k contiguously.
- Determinism: ORDER BY score, f.chunk_id (lexicographic blake3 hex
tiebreaker on identical bm25). Tested explicitly with two chunks of
identical text content.
- RetrievalDetail: method=Lexical, both lexical_score and fusion_score
set, vector_* None.
kb-store-sqlite:
- Adds pub fn read_conn(&self) -> MutexGuard<'_, Connection>.
Read-only contract is doc-only — kb-search needs MutexGuard for
prepare_cached + iter, which a closure-scoped wrapper would awkwardly
constrain. Closure variant left as a P3 follow-up.
Tests (26 new): empty corpus, empty query, single hit + citation
round-trip, snippet length cap, tags_any exclusion, lang+trust
composition, path_glob with '*' not crossing '/', citation line round-
trip, bm25 top-1 ∈ (0, 1], determinism (varied scores AND identical-
score tiebreaker), index_version passthrough, snapshot
(crates/kb-search/tests/fixtures/search/lexical/run-1.json — stable
under bundled SQLite; KB_UPDATE_SNAPSHOTS=1 to regenerate). Workspace:
211 tests pass, cargo clippy --workspace --all-targets -D warnings
clean.
Allowed deps respected: kb-core, kb-config, kb-store-sqlite, rusqlite,
tracing, thiserror, anyhow (forced by trait return type), serde_json
(parses *_json TEXT columns), globset (path_glob '*' boundary).
Out of scope (deferred): vector retriever (p3-3), hybrid fusion (p3-4),
reranker (P+), Korean morphological tokenizer (P+).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds FTS5 lexical index for chunks per design §5.5: chunks_fts virtual
table (unicode61 remove_diacritics 2 tokenizer, contentless w/ UNINDEXED
chunk_id+doc_id) plus chunks_ai/chunks_ad/chunks_au triggers that mirror
every chunks mutation into chunks_fts inside the host transaction.
V002 ships the verbatim §5.5 SQL block plus a one-shot backfill INSERT
so existing P1 databases gain searchability without re-ingest. Refinery
bookkeeping makes double-apply naturally idempotent.
Adds rebuild_chunks_fts(&Connection) escape hatch for kb index
--rebuild-fts (CLI wiring deferred to a later task). Uses SAVEPOINT
instead of Transaction so callers can invoke from inside an outer
transaction; WAL serializes writers so no DELETE/INSERT race vs.
concurrent chunks mutators is possible.
Tests (10): V001-only → V002 cold-upgrade backfill (literal path),
chunks_ai/ad/au trigger sync, MATCH-token verification, rebuild
idempotency, drift recovery, double-run no-op, V002 ↔ design §5.5
verbatim diff guard (anchored extraction from both files), WAL/SHM
release on store drop. All 185 workspace tests pass.
Allowed deps respected (kb-core, kb-config, rusqlite, refinery — no
new deps). FTS query implementation deferred to p2-2 (lexical-retriever).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
M9: add a `TODO(P2/P3)` comment near the NULL persistence at
documents.rs (put_chunks). The `section_label` column exists in the §5.5
DDL but neither the in-memory Chunk struct nor the §2.6 wire schema
carries the field, so NULL is the correct canonical value today —
flag the future-bump intent in-line rather than leaving it implicit.
M10: add a one-line invariant comment near the i64 -> u32 narrowing for
`doc_version` in `get_document`. The invariant is documented at the
write site (UPSERT bumps by 1 per re-ingest) — restate it at the read
site so the cast is not silently load-bearing.
No behaviour change. No tests touched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
M1: `Sqlx` is a misleading leftover — this crate uses `rusqlite`, not
sqlx. Rename the variant (and the doc reference to it) to `Sqlite`. No
external pattern matches; the variant is reached only via `#[from]`.
M11: `assets_root` was an `#[allow(dead_code)]` helper introduced for a
test that never landed. Delete it so the dead-code allow goes with it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three Important review fixes on the kb-store-sqlite write path:
I1. Atomic asset write. put_asset_with_bytes now stages bytes to
`<final>.tmp.<pid>.<n>`, fsyncs, UPSERTs the row, then `rename`s into
place (atomic on POSIX same-fs). On any failure between staging and
rename we best-effort `remove_file` the temp so the previous orphan
risk on UPSERT failure is gone. Reference mode is unchanged (no I/O,
no orphan risk).
I2. Poison-tolerant mutex. New `lock_conn` helper does
`.lock().unwrap_or_else(|p| p.into_inner())`, so a single panic mid-
transaction no longer poisons every subsequent store call. The
rusqlite Transaction Drop already rolls back on panic, leaving the
Connection state safe to reuse. All 13 prior `.expect("sqlite mutex
poisoned")` sites in store.rs / documents.rs / jobs.rs now route
through `lock_conn`.
I3. AssetId shape validation. `kb_core::AssetId(pub String)` lets a
hand-construction bypass the `FromStr` 32-hex invariant. Added
`validate_asset_id` (32 ASCII hex chars) at every store entry that
turns an AssetId into a path: `put_asset_with_bytes` and
`DocumentStore::put_asset`. This shuts a potential path-traversal via
`assets_path_for`'s `&id[..2]` shard slice.
Tests:
- `put_asset_with_bytes_orphan_cleanup_on_upsert_failure` — pre-seeds a
row that takes the same `workspace_path` (UNIQUE), so the UPSERT trips
a constraint not covered by `ON CONFLICT(asset_id)`. Asserts no final
file and no leaked `*.tmp.*`.
- `put_asset_with_bytes_rejects_invalid_asset_id` — passes
`AssetId("../etc/passwd_padded_to_xx_xxxxx")` (32 chars, contains `/`).
Asserts error and zero filesystem artifacts under `data_dir/assets/`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All 8 test categories from the task plan, plus a JobRepo subset:
migration — tests/migration.rs: fresh DB after run_migrations
exposes every required §5 table + index.
unit (copy) — tests/asset_writer.rs: copy mode writes file with
mode 0o644 + correct bytes.
unit (ref) — tests/asset_writer.rs: reference mode does not write
file; row records source path.
unit (cs) — tests/asset_writer.rs: tampered checksum returns a
Conflict-flavoured anyhow error.
unit (idem) — tests/idempotency.rs: same put_document twice → 1 row,
doc_version 1→2; tags re-derived.
unit (rb) — tests/idempotency.rs: put_blocks with FK violation
rolls back; pre-existing rows unchanged.
contract — tests/contract_roundtrip.rs: drives kb-parse-md +
kb-normalize + kb-chunk on
fixtures/markdown/code-and-table.md, persists, then
reloads via DocumentStore::get_document /
get_chunk and asserts byte-equal round-trip.
snapshot — tests/ingest_report_snapshot.rs +
snapshots/ingest_report.snapshot.json: pin the wire
JSON form of kb_core::IngestReport for an inline
fixture run.
jobs — tests/jobs.rs: create → progress → finish flow;
error message round-trip; list filters on status/kind.
Drops the unused `serde` direct dep from Cargo.toml; serde_json brings
its own. Dev-deps confirmed via `cargo tree -p kb-store-sqlite --depth 1`
to live only in the dev tree.