Commit Graph

10 Commits

Author SHA1 Message Date
17d52461b2 feat(p3-5): wire kb-app facade — ingest / search / list / inspect
Replaces the P0 `bail!("not yet wired")` stubs in kb-app with real
bodies that compose the libraries shipped through P3-4. After this
commit, `cargo run -p kb-cli -- index` actually walks the workspace
and persists chunks (SQLite + optionally LanceDB), and
`cargo run -p kb-cli -- search --mode {lexical,vector,hybrid}` returns
real SearchHits with citations. `kb-app::ask` stays stubbed; P4-3
owns it.

App lifecycle (crates/kb-app/src/app.rs):
- Internal pub(crate) struct App holds the Config plus
  Arc<SqliteStore> eagerly, with embedder + LanceVectorStore behind
  OnceLock<Arc<...>> for memoization. First call pays the ~470MB
  fastembed init / Lance open; subsequent calls return the cached
  Arc::clone. OnceLock::set race losers fall back to get().cloned()
  so the lazy-init is concurrent-safe.
- One-shot CLI invocations pay the cost once at most. The P9 TUI
  (which holds an App for the session) gets memoization for free.
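
Illustrative sketch of the lazy-init pattern described above, not the
kb-app source. `Embedder` and its `model` field are hypothetical
stand-ins for the real embedder / LanceVectorStore handles. The manual
get / set / get sequence is one way to get the "race loser falls back
to get().cloned()" behaviour with std only:

```rust
use std::sync::{Arc, OnceLock};

// Hypothetical stand-in for an expensive resource (embedder, Lance handle).
struct Embedder {
    model: String,
}

struct App {
    embedder: OnceLock<Arc<Embedder>>,
}

impl App {
    fn new() -> Self {
        App { embedder: OnceLock::new() }
    }

    fn embedder(&self) -> Arc<Embedder> {
        // Fast path: a previous call already paid the init cost.
        if let Some(existing) = self.embedder.get() {
            return Arc::clone(existing);
        }
        // Slow path: build outside the cell. Two threads may both get here.
        let fresh = Arc::new(Embedder { model: "demo-model".to_string() });
        match self.embedder.set(Arc::clone(&fresh)) {
            // We won the race: our value is now the cached one.
            Ok(()) => fresh,
            // We lost the race: drop ours and hand out the winner's value.
            Err(_discarded) => self
                .embedder
                .get()
                .cloned()
                .expect("a failed set() means the cell is already populated"),
        }
    }
}

fn main() {
    let app = App::new();
    let first = app.embedder();
    let second = app.embedder();
    // Both calls share one allocation.
    assert!(Arc::ptr_eq(&first, &second));
    println!("embedder '{}' initialised once", first.model);
}
```

A race loser builds a value that is thrown away, so the expensive init
can run more than once under contention; only the cached result is
guaranteed to be unique.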

ingest pipeline (lib.rs):
- FsSourceConnector::scan(&scope) → per asset:
  parse_frontmatter → parse_blocks → build_canonical_document →
  MdHeadingV1Chunker.chunk → put_asset_with_bytes → put_document →
  put_blocks → put_chunks. One transaction per document per design
  §5.8 (kb-store-sqlite's put_* methods own the transactions).
- When provider != "none" and dimensions > 0: build embedder once,
  embed each doc's chunks as Document kind, ensure_table once at the
  top of the run, then upsert the VectorRecord batch. Lexical-only
  config (provider == "none") skips both — verified by
  ingest_provider_none_skips_lance test.
- Per-asset parse failures recorded as IngestItemKind::Error with
  the warning attached; the run continues. Only structural failures
  (DB unreachable etc.) abort.
- Aggregate counts (assets_scanned / new / updated / skipped /
  errors / chunks_indexed / embeddings_indexed / duration_ms) flow
  into both the JobRepo progress_json AND a dedicated ingest_runs
  row written via SqliteStore::record_ingest_run (new
  pub(crate) helper added to kb-store-sqlite — see below).
  summary_only=true writes items_json=NULL but still populates the
  count columns.
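
Illustrative sketch of the "record the error, keep going" loop and the
summary_only behaviour, under simplifying assumptions: `parse_asset`,
`ItemOutcome`, and the reduced `IngestCounts` below are hypothetical
and much smaller than the real IngestReport types.

```rust
#[derive(Debug, Default, PartialEq)]
struct IngestCounts {
    assets_scanned: u32,
    new_assets: u32,
    errors: u32,
    chunks_indexed: u32,
}

#[derive(Debug, PartialEq)]
enum ItemOutcome {
    Indexed { path: String, chunks: usize },
    Error { path: String, warning: String },
}

// Hypothetical parser: rejects unterminated frontmatter, otherwise
// treats each non-empty paragraph as one chunk.
fn parse_asset(body: &str) -> Result<usize, String> {
    if let Some(rest) = body.strip_prefix("---\n") {
        if !rest.contains("\n---") {
            return Err("unterminated frontmatter".to_string());
        }
    }
    Ok(body.split("\n\n").filter(|p| !p.trim().is_empty()).count())
}

// Returns the aggregate counts plus the per-item list, which is dropped
// (None) when summary_only is set. The counts are populated either way.
fn ingest(assets: &[(&str, &str)], summary_only: bool) -> (IngestCounts, Option<Vec<ItemOutcome>>) {
    let mut counts = IngestCounts::default();
    let mut items = Vec::new();
    for &(path, body) in assets {
        counts.assets_scanned += 1;
        match parse_asset(body) {
            Ok(chunks) => {
                counts.new_assets += 1;
                counts.chunks_indexed += chunks as u32;
                items.push(ItemOutcome::Indexed { path: path.to_string(), chunks });
            }
            // A per-asset failure is recorded and the run continues.
            Err(warning) => {
                counts.errors += 1;
                items.push(ItemOutcome::Error { path: path.to_string(), warning });
            }
        }
    }
    (counts, if summary_only { None } else { Some(items) })
}

fn main() {
    let assets = [("a.md", "# A\n\ntext"), ("b.md", "---\ntitle: x"), ("c.md", "one")];
    let (counts, items) = ingest(&assets, false);
    println!("{counts:?}");
    println!("{items:?}");
}
```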

search dispatch:
- SearchMode::Lexical → LexicalRetriever directly.
- SearchMode::Vector → VectorRetriever with embedder + LanceVectorStore.
- SearchMode::Hybrid → HybridRetriever composing the two.
- Vector / Hybrid with provider=none returns a clear error naming the
  config key to flip ("models.embedding.provider").
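
Illustrative sketch of the dispatch rule above. The retrievers are
reduced to their names, and `pick_retriever` plus the "local" provider
string are hypothetical; only the mode-to-retriever mapping and the
config key in the error come from this commit.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum SearchMode {
    Lexical,
    Vector,
    Hybrid,
}

// Returns the name of the retriever that would serve the query, or an
// error that names the config key the user has to change.
fn pick_retriever(mode: SearchMode, provider: &str) -> Result<&'static str, String> {
    match mode {
        SearchMode::Lexical => Ok("LexicalRetriever"),
        // Both embedding-backed modes are rejected when embeddings are off.
        SearchMode::Vector | SearchMode::Hybrid if provider == "none" => Err(format!(
            "{mode:?} search needs an embedding provider; set models.embedding.provider"
        )),
        SearchMode::Vector => Ok("VectorRetriever"),
        SearchMode::Hybrid => Ok("HybridRetriever"),
    }
}

fn main() {
    for mode in [SearchMode::Lexical, SearchMode::Vector, SearchMode::Hybrid] {
        println!("{mode:?} / none  -> {:?}", pick_retriever(mode, "none"));
        println!("{mode:?} / local -> {:?}", pick_retriever(mode, "local"));
    }
}
```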

list_docs / inspect_doc / inspect_chunk delegate straight to
DocumentStore trait methods. Each returns Err with an actionable
message on not-found.

Test seam: each public free function has a matching
#[doc(hidden)] pub fn *_with_config(cfg, ...) companion that
integration tests invoke directly (the public form internally calls
load_config()). pub(crate) would not reach across the integration-
tests crate boundary; #[doc(hidden)] keeps it out of rustdoc and the
function comment flags it as test-only.
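
Illustrative sketch of the seam's shape. `Config`, its
`workspace_root` field, and the body of `list_docs_with_config` are
hypothetical placeholders; only the public-wrapper plus
`#[doc(hidden)]` companion layout is taken from the description above.

```rust
#[derive(Debug, Clone)]
pub struct Config {
    pub workspace_root: String,
}

// Hypothetical loader; the real one reads the user's config file.
fn load_config() -> Config {
    Config { workspace_root: "/home/user/notes".to_string() }
}

/// Public entry point: resolves the config itself, then delegates.
pub fn list_docs() -> Result<Vec<String>, String> {
    list_docs_with_config(&load_config())
}

/// Test-only seam. Public so an integration-tests crate can reach it,
/// hidden from rustdoc so it does not look like supported API.
#[doc(hidden)]
pub fn list_docs_with_config(cfg: &Config) -> Result<Vec<String>, String> {
    // Placeholder body: pretend the workspace holds a single document.
    Ok(vec![format!("{}/a.md", cfg.workspace_root)])
}

fn main() {
    println!("{:?}", list_docs());
    // A test injects its own config instead of touching the user's files.
    let cfg = Config { workspace_root: "/tmp/ws".to_string() };
    println!("{:?}", list_docs_with_config(&cfg));
}
```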

kb-store-sqlite additions:
- pub struct IngestRunRow + pub fn record_ingest_run on SqliteStore
  for the kb-app aggregate-counts persistence path. Helper writes
  the ingest_runs row directly with all aggregate columns; jobs
  table still gets a JobRepo create/update_progress/finish trio in
  parallel.

Tests (11 default, 2 #[ignore] AVX-gated):
- ingest_lexical: round-trip, idempotent, summary_only_drops_items,
  provider_none_skips_lance (asserts no .lance dir on disk),
  records_ingest_runs_row_with_aggregate_counts, tags_any filter,
  inspect_doc_not_found, inspect_chunk_not_found.
- search_lexical: lexical hits with embedding_model=None,
  empty_query_returns_empty, vector_mode_with_provider_none returns
  clear error.
- search_vector: hybrid mode end-to-end (#[ignore], AVX), Vector
  mode embedding_model assertion (#[ignore], AVX). Both run on the
  AVX VM in ~21s combined (first run pays the model download).
- TestEnv pins workspace.root + storage.{data_dir,model_dir} to a
  TempDir so tests don't touch the user's $HOME/.local/share.
- Fixture workspace at crates/kb-app/tests/fixtures/workspace/ has
  three small markdown files with varied frontmatter (rust+cargo+
  python tags) so the tags_any filter test exercises a non-trivial
  predicate.

Workspace 269 passed / 24 ignored / 0 failed (was 261/22). cargo
clippy --workspace --all-targets -- -D warnings clean. CLI smoke
verified manually: `cargo run -p kb-cli -- index` returns a real
IngestReport JSON; `cargo run -p kb-cli -- search "..."` returns
hits with citations; `cargo run -p kb-cli -- list docs` lists the
indexed documents.

Allowed deps respected: kb-source-fs, kb-parse-md, kb-parse-types,
kb-normalize, kb-chunk, kb-store-sqlite, kb-search, kb-store-vector,
kb-embed, kb-embed-local plus existing tracing / anyhow / serde /
toml / dirs and now blake3 (run_id) + time. Forbidden (kb-llm*,
kb-rag, kb-tui, kb-desktop, kb-parse-{pdf,image,audio}) absent from
cargo tree -p kb-app.

Out of scope per spec: ask body (P4-3), --rebuild-fts wiring,
--resume checkpointing (P+), --watch (P+), TUI / desktop integration
(P9 consumes this facade).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 12:11:21 +00:00
3cd5117a7e feat(p3-3): kb-store-vector — LanceDB VectorStore + V003 embedding status
First VectorStore implementation. Per-model Lance tables under
config.storage.vector_dir, two-phase upsert (SQLite-pending → Lance
MergeInsert → SQLite-committed) with crash-safe retry, search via
cosine distance with the spec's score-shift (preserves negative
similarity ranking signal that clamping would crush).
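
Illustrative in-memory model of the pending → write → committed
sequence, assuming two HashMaps in place of the embedding_records
table and the Lance table. `Store`, `upsert`, `searchable`, and the
`fail_vector_write` switch are hypothetical names used only to show
why a failed middle phase is safe to retry:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Pending,
    Committed,
}

#[derive(Default)]
struct Store {
    records: HashMap<String, Status>,   // stands in for embedding_records
    vectors: HashMap<String, Vec<f32>>, // stands in for the Lance table
}

impl Store {
    fn upsert(&mut self, batch: &[(String, Vec<f32>)], fail_vector_write: bool) -> Result<(), String> {
        // Phase 1: mark every row pending before touching the vector side.
        for (id, _) in batch {
            self.records.insert(id.clone(), Status::Pending);
        }
        // Phase 2: keyed merge, so replaying the same batch is harmless.
        if fail_vector_write {
            return Err("vector write failed".to_string());
        }
        for (id, vector) in batch {
            self.vectors.insert(id.clone(), vector.clone());
        }
        // Phase 3: flip only rows that are still pending.
        for (id, _) in batch {
            if let Some(status) = self.records.get_mut(id) {
                if *status == Status::Pending {
                    *status = Status::Committed;
                }
            }
        }
        Ok(())
    }

    // Search only ever sees committed rows.
    fn searchable(&self) -> Vec<&str> {
        let mut ids: Vec<&str> = self
            .records
            .iter()
            .filter(|(_, status)| **status == Status::Committed)
            .map(|(id, _)| id.as_str())
            .collect();
        ids.sort();
        ids
    }
}

fn main() {
    let mut store = Store::default();
    let batch = vec![("c1".to_string(), vec![0.1, 0.2])];
    println!("first try: {:?}", store.upsert(&batch, true));
    println!("visible after failure: {:?}", store.searchable());
    println!("retry: {:?}", store.upsert(&batch, false));
    println!("visible after retry: {:?}", store.searchable());
}
```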

V003 migration:
- Adds status (CHECK constraint pending|committed|tombstone, default
  pending) and vector_committed columns to embedding_records.
- BEFORE DELETE trigger on chunks flips dependent rows to tombstone.
  Currently overshadowed by V001's ON DELETE CASCADE FK; trigger UPDATE
  runs first then row vanishes via CASCADE. Spec-faithful tombstone
  preservation requires recreating embedding_records to drop the
  CASCADE — deferred to a P+ migration since no production rows exist
  yet (P3-3 is the first writer). V003 SQL comment explains.

LanceVectorStore:
- ensure_table is idempotent: opens existing or creates with the
  Arrow schema (chunk_id, doc_id, embedding FixedSizeList<Float32,
  dim>, model_id, embedding_version, text, heading_path, created_at).
- IndexId computed via id_for_index with collection="chunk_embeddings",
  index_kind="flat", params_hash = blake3(descriptor JSON). Schema
  bumps automatically rotate the IndexId.
- upsert: phase-1 INSERT OR REPLACE INTO embedding_records (status=
  'pending') in a single SQLite tx; phase-2 Lance MergeInsert keyed
  on chunk_id (idempotent re-run); phase-3 UPDATE status='committed',
  vector_committed=1. If phase-2 fails the rows stay 'pending' and
  the next upsert call retries idempotently.
- search joins embedding_records WHERE status='committed' so partial-
  write rows never surface. Cosine distance from Lance ∈ [0, 2] →
  similarity = 1 - distance ∈ [-1, 1] → score = (similarity + 1)/2 ∈
  [0, 1]. NaN coerced to 0 with tracing::warn. Filter by SearchFilters
  via SqliteStore::filter_chunks (added in this commit).
- Sync trait + async LanceDB bridged by an embedded current-thread
  tokio runtime. Doc-comment on the struct flags the "do NOT call
  from inside another tokio runtime" panic (block_on cannot nest).
  kb-app's job scheduler is sync today.
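
Illustrative sketch of the score shift from the search bullet above.
`score_from_cosine_distance` and `clamped_score` are hypothetical
names; the arithmetic and the NaN → 0 rule follow the text (the real
code also emits a tracing::warn on NaN, omitted here to stay
std-only). The clamped variant is included only to show what the
shift avoids.

```rust
// distance ∈ [0, 2] → similarity ∈ [-1, 1] → score ∈ [0, 1].
fn score_from_cosine_distance(distance: f32) -> f32 {
    if distance.is_nan() {
        return 0.0;
    }
    let similarity = 1.0 - distance;
    (similarity + 1.0) / 2.0
}

// Alternative the commit rejects: clamping collapses every negative
// similarity to the same score and loses their relative order.
fn clamped_score(distance: f32) -> f32 {
    (1.0 - distance).clamp(0.0, 1.0)
}

fn main() {
    for distance in [0.0_f32, 1.0, 1.5, 2.0] {
        println!(
            "distance {distance}: shifted {}, clamped {}",
            score_from_cosine_distance(distance),
            clamped_score(distance)
        );
    }
}
```

Distances 1.5 and 2.0 both clamp to 0, while the shift keeps them
apart at 0.25 and 0.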

kb-store-sqlite additions:
- pub fn put_embedding_records_pending(&[EmbeddingRecordRow]) — phase-1
  INSERT OR REPLACE (status='pending', vector_committed=0).
- pub fn mark_embedding_records_committed(&[EmbeddingId]) — phase-3
  single UPDATE … WHERE embedding_id IN (?, ?, …) via
  params_from_iter, guarded by WHERE status='pending' so tombstones
  don't get clobbered.
- pub fn filter_chunks(&[ChunkId], &SearchFilters) → Vec<ChunkId>
  consolidates the JOIN against documents/document_tags/
  embedding_records + path_glob via globset. Lets kb-store-vector
  honor SearchFilters without depending on rusqlite or globset
  directly. (kb-search's filter logic is structurally different —
  interleaved with the FTS5 SELECT — so it stays as-is for now;
  consolidation is a P+ refactor.)
- 4 new unit tests cover the phase-1 round-trip, empty batch,
  replay reset of pending rows, and the WHERE-status-pending guard.

Tests:
- 9 lib unit tests in kb-store-vector covering paths/sanitization,
  arrow_batch dim validation + descriptor hash, bm25-style cosine
  score shift math.
- 4 new kb-store-sqlite unit tests on filter_chunks (committed-only,
  tags/lang/trust/path_glob, order preservation, empty input).
- 4 new kb-store-sqlite unit tests on the embedding_records helpers.
- 8 integration tests in upsert_search.rs and 1 snapshot test marked
  #[ignore = "requires AVX-capable hardware (LanceDB)"]. They invoke
  require_avx_or_panic() at the top of each body so a missing-AVX
  --ignored run fails loudly instead of silently passing. This dev
  host (qemu64 model) lacks AVX so these were NOT exercised end-to-
  end here — first CI lane on AVX hardware will validate them.
- Snapshot fixture tests/fixtures/vector/run-1.json is a placeholder
  with an _comment marker. Snapshot test panics until the placeholder
  is replaced via KB_UPDATE_SNAPSHOTS=1 on AVX hardware.
- Workspace 241 passed, 19 ignored, 0 failed; cargo clippy --workspace
  --all-targets -- -D warnings clean.

Allowed deps respected (kb-core, kb-config, kb-store-sqlite, lancedb,
arrow + arrow-array + arrow-schema, serde, serde_json, tracing,
thiserror) plus forced waivers — anyhow (trait return type), tokio
+ futures (LanceDB async-only API), blake3 (params_hash). rusqlite
and globset are NOT direct deps of kb-store-vector — confirmed via
cargo metadata --no-deps. rusqlite stays in [dev-dependencies] for
the test fixture seeder only.

Out of scope: IVF/PQ index tuning (P+), image vectors (P6), kb-app
embed_index orchestration (P3-4 facade).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:01:31 +00:00
b335151d18 feat(p2-2): kb-search crate + LexicalRetriever (FTS5 + bm25)
Adds the first concrete kb_core::Retriever, exercising chunks_fts (P2-1)
to answer SearchMode::Lexical queries. Returns Vec<SearchHit> with
bm25-derived ranking, snippet() previews, and W3C-fragment-style
Citation built from the chunk's first source_spans entry.

New crate kb-search:
- LexicalRetriever::new(Arc<SqliteStore>, IndexVersion).
- search() builds an FTS5 MATCH expression by escaping every whitespace
  token into a quoted literal (inner " doubled); single-quote-wrapped
  text passes through verbatim as raw FTS5 syntax. Empty query
  short-circuits to Ok(vec![]).
- bm25 normalization: score = -bm25 / (1 + |bm25|), bounded (0, 1] for
  any FTS5-returned negative bm25.
- Snippet via snippet(chunks_fts, 3, '', '', '…', word_budget) where
  word_budget = snippet_chars / 4 clamped to [1, 64]; trim_snippet
  enforces the char cap on the way out (chars per design §6.4 — accepts
  the combining-mark trade-off).
- Citation from chunks.source_spans_json first span: Line / Page /
  Region / Time forwarded; Byte / empty array fall back to Line{1,1}
  with a tracing::warn so forward-compat regressions surface.
- Filters: tags_any (subquery on document_tags), lang (= column),
  trust_min (CASE-rank in SQL) all applied at SQL level. path_glob
  uses globset with literal_separator(true) — guarantees '*' does not
  cross '/' per spec Risks/notes — applied as Rust post-filter with
  +128 row over-fetch when set, then rank reassigned 1..k contiguously.
- Determinism: ORDER BY score, f.chunk_id (lexicographic blake3 hex
  tiebreaker on identical bm25). Tested explicitly with two chunks of
  identical text content.
- RetrievalDetail: method=Lexical, both lexical_score and fusion_score
  set, vector_* None.
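
Illustrative sketch of three pure helpers implied by the bullets above:
query escaping, bm25 normalization, and the snippet word budget. The
function names are hypothetical, and the raw-syntax rule is an
assumption: the sketch treats a query wrapped as a whole in single
quotes as raw FTS5, which may differ from the real matching rule.

```rust
// Builds an FTS5 MATCH expression, or None for an empty query.
fn build_match_expr(query: &str) -> Option<String> {
    let q = query.trim();
    if q.is_empty() {
        return None;
    }
    // Assumed raw mode: '...' around the whole query passes through.
    if q.len() >= 2 && q.starts_with('\'') && q.ends_with('\'') {
        return Some(q[1..q.len() - 1].to_string());
    }
    // Otherwise every whitespace token becomes a quoted literal, with
    // inner double quotes doubled.
    let quoted: Vec<String> = q
        .split_whitespace()
        .map(|token| format!("\"{}\"", token.replace('"', "\"\"")))
        .collect();
    Some(quoted.join(" "))
}

// FTS5 bm25() is negative for matches and more negative means better.
// For bm25 = -x with x > 0 this yields x / (1 + x).
fn normalize_bm25(bm25: f64) -> f64 {
    -bm25 / (1.0 + bm25.abs())
}

// Word budget passed to snippet(): chars / 4, clamped to [1, 64].
fn word_budget(snippet_chars: usize) -> usize {
    (snippet_chars / 4).clamp(1, 64)
}

fn main() {
    println!("{:?}", build_match_expr("foo bar"));
    println!("{:?}", build_match_expr("say \"hi\""));
    println!("{:?}", build_match_expr("'foo OR bar'"));
    println!("{}", normalize_bm25(-3.0));
    println!("{}", word_budget(200));
}
```

In exact arithmetic x / (1 + x) stays strictly below 1 for finite x,
so the documented (0, 1] bound is safe but not tight.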

kb-store-sqlite:
- Adds pub fn read_conn(&self) -> MutexGuard<'_, Connection>.
  Read-only contract is doc-only — kb-search needs MutexGuard for
  prepare_cached + iter, which a closure-scoped wrapper would awkwardly
  constrain. Closure variant left as a P3 follow-up.

Tests (26 new): empty corpus, empty query, single hit + citation
round-trip, snippet length cap, tags_any exclusion, lang+trust
composition, path_glob with '*' not crossing '/', citation line round-
trip, bm25 top-1 ∈ (0, 1], determinism (varied scores AND identical-
score tiebreaker), index_version passthrough, snapshot
(crates/kb-search/tests/fixtures/search/lexical/run-1.json — stable
under bundled SQLite; KB_UPDATE_SNAPSHOTS=1 to regenerate). Workspace:
211 tests pass, cargo clippy --workspace --all-targets -D warnings
clean.

Allowed deps respected: kb-core, kb-config, kb-store-sqlite, rusqlite,
tracing, thiserror, anyhow (forced by trait return type), serde_json
(parses *_json TEXT columns), globset (path_glob '*' boundary).

Out of scope (deferred): vector retriever (p3-3), hybrid fusion (p3-4),
reranker (P+), Korean morphological tokenizer (P+).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 05:20:35 +00:00
94bfc50efd feat(p2-1): chunks_fts virtual table + sync triggers (V002 migration)
Adds FTS5 lexical index for chunks per design §5.5: chunks_fts virtual
table (unicode61 remove_diacritics 2 tokenizer, contentless w/ UNINDEXED
chunk_id+doc_id) plus chunks_ai/chunks_ad/chunks_au triggers that mirror
every chunks mutation into chunks_fts inside the host transaction.

V002 ships the verbatim §5.5 SQL block plus a one-shot backfill INSERT
so existing P1 databases gain searchability without re-ingest. Refinery
bookkeeping makes double-apply naturally idempotent.

Adds rebuild_chunks_fts(&Connection) escape hatch for kb index
--rebuild-fts (CLI wiring deferred to a later task). Uses SAVEPOINT
instead of Transaction so callers can invoke from inside an outer
transaction; WAL serializes writers so no DELETE/INSERT race vs.
concurrent chunks mutators is possible.

Tests (10): V001-only → V002 cold-upgrade backfill (literal path),
chunks_ai/ad/au trigger sync, MATCH-token verification, rebuild
idempotency, drift recovery, double-run no-op, V002 ↔ design §5.5
verbatim diff guard (anchored extraction from both files), WAL/SHM
release on store drop. All 185 workspace tests pass.

Allowed deps respected (kb-core, kb-config, rusqlite, refinery — no
new deps). FTS query implementation deferred to p2-2 (lexical-retriever).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 04:42:15 +00:00
b7367dedfe p1-6: doc-only TODO markers (section_label, doc_version invariant)
M9: add a `TODO(P2/P3)` comment near the NULL persistence at
documents.rs (put_chunks). The `section_label` column exists in the §5.5
DDL but neither the in-memory Chunk struct nor the §2.6 wire schema
carries the field, so NULL is the correct canonical value today —
flag the future-bump intent in-line rather than leaving it implicit.

M10: add a one-line invariant comment near the i64 -> u32 narrowing for
`doc_version` in `get_document`. The invariant is documented at the
write site (UPSERT bumps by 1 per re-ingest) — restate it at the read
site so the cast is not silently load-bearing.

No behaviour change. No tests touched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:34:17 +00:00
15b4d80efc p1-6: rename StoreError::Sqlx -> Sqlite, drop dead assets_root helper
M1: `Sqlx` is a misleading leftover — this crate uses `rusqlite`, not
sqlx. Rename the variant (and the doc reference to it) to `Sqlite`. No
external pattern matches; the variant is reached only via `#[from]`.

M11: `assets_root` was an `#[allow(dead_code)]` helper introduced for a
test that never landed. Delete it so the dead-code allow goes with it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:33:50 +00:00
e41279de96 p1-6: harden store boundary (atomic asset write, poison-tolerant mutex, AssetId validation)
Three Important review fixes on the kb-store-sqlite write path:

I1. Atomic asset write. put_asset_with_bytes now stages bytes to
    `<final>.tmp.<pid>.<n>`, fsyncs, UPSERTs the row, then `rename`s into
    place (atomic on POSIX same-fs). On any failure between staging and
    rename we best-effort `remove_file` the temp so the previous orphan
    risk on UPSERT failure is gone. Reference mode is unchanged (no I/O,
    no orphan risk).

I2. Poison-tolerant mutex. New `lock_conn` helper does
    `.lock().unwrap_or_else(|p| p.into_inner())`, so a single panic mid-
    transaction no longer poisons every subsequent store call. The
    rusqlite Transaction Drop already rolls back on panic, leaving the
    Connection state safe to reuse. All 13 prior `.expect("sqlite mutex
    poisoned")` sites in store.rs / documents.rs / jobs.rs now route
    through `lock_conn`.

I3. AssetId shape validation. `kb_core::AssetId(pub String)` lets a
    hand-construction bypass the `FromStr` 32-hex invariant. Added
    `validate_asset_id` (32 ASCII hex chars) at every store entry that
    turns an AssetId into a path: `put_asset_with_bytes` and
    `DocumentStore::put_asset`. This shuts a potential path-traversal via
    `assets_path_for`'s `&id[..2]` shard slice.
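
Illustrative sketch of I2 and I3 using std only. `lock_tolerant` is a
hypothetical generic version of `lock_conn` (the real helper guards a
rusqlite Connection), and this `validate_asset_id` takes a plain &str
and returns a String error rather than the crate's own types.

```rust
use std::sync::{Arc, Mutex, MutexGuard};
use std::thread;

// Recover the guard even if another thread panicked while holding it.
fn lock_tolerant<T>(mutex: &Mutex<T>) -> MutexGuard<'_, T> {
    mutex.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}

// 32 ASCII hex chars, so the id cannot carry '/', '.', or other
// path-significant bytes into the shard path.
fn validate_asset_id(id: &str) -> Result<(), String> {
    if id.len() == 32 && id.bytes().all(|b| b.is_ascii_hexdigit()) {
        Ok(())
    } else {
        Err(format!("invalid AssetId: expected 32 ASCII hex chars, got {id:?}"))
    }
}

fn main() {
    let conn = Arc::new(Mutex::new(0u32));
    let worker = Arc::clone(&conn);
    // Poison the mutex by panicking while the guard is held.
    let _ = thread::spawn(move || {
        let _guard = worker.lock().unwrap();
        panic!("simulated panic mid-transaction");
    })
    .join();
    assert!(conn.is_poisoned());
    *lock_tolerant(&conn) += 1;
    println!("value after recovery: {}", *lock_tolerant(&conn));

    println!("{:?}", validate_asset_id("0123456789abcdef0123456789abcdef"));
    println!("{:?}", validate_asset_id("../etc/passwd_padded_to_xx_xxxxx"));
}
```

Taking the inner value after a poison is only sound when the guarded
state is still consistent, which the commit argues from the rollback
in rusqlite's Transaction Drop.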

Tests:
- `put_asset_with_bytes_orphan_cleanup_on_upsert_failure` — pre-seeds a
  row that takes the same `workspace_path` (UNIQUE), so the UPSERT trips
  a constraint not covered by `ON CONFLICT(asset_id)`. Asserts no final
  file and no leaked `*.tmp.*`.
- `put_asset_with_bytes_rejects_invalid_asset_id` — passes
  `AssetId("../etc/passwd_padded_to_xx_xxxxx")` (32 chars, contains `/`).
  Asserts error and zero filesystem artifacts under `data_dir/assets/`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:33:19 +00:00
efdb71b1c3 p1-6: list_documents filter coverage test
Round-trip three docs (en/ko, varied tags, varied trust) and exercise
each DocFilter axis: default (all), lang, path_glob (workspace_path
GLOB), tags_any (intersection via document_tags subquery + per-row tag
hydration), and trust_min (Primary > Secondary > Generated rank gate).
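
Illustrative sketch of the trust_min rank gate. The numeric ranks and
the `passes_trust_min` helper are hypothetical; only the ordering
Primary > Secondary > Generated comes from the text above.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Trust {
    Primary,
    Secondary,
    Generated,
}

// Higher rank means more trusted.
fn rank(trust: Trust) -> u8 {
    match trust {
        Trust::Primary => 3,
        Trust::Secondary => 2,
        Trust::Generated => 1,
    }
}

// A document passes when it is at least as trusted as the minimum.
fn passes_trust_min(doc: Trust, min: Trust) -> bool {
    rank(doc) >= rank(min)
}

fn main() {
    for doc in [Trust::Primary, Trust::Secondary, Trust::Generated] {
        println!("{doc:?} passes trust_min=Secondary: {}", passes_trust_min(doc, Trust::Secondary));
    }
}
```
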
2026-04-30 17:14:17 +00:00
111f40ddf0 p1-6: kb-store-sqlite test suite (8 categories)
All 8 test categories from the task plan, plus a JobRepo subset:

  migration   — tests/migration.rs: fresh DB after run_migrations
                exposes every required §5 table + index.
  unit (copy) — tests/asset_writer.rs: copy mode writes file with
                mode 0o644 + correct bytes.
  unit (ref)  — tests/asset_writer.rs: reference mode does not write
                file; row records source path.
  unit (cs)   — tests/asset_writer.rs: tampered checksum returns a
                Conflict-flavoured anyhow error.
  unit (idem) — tests/idempotency.rs: same put_document twice → 1 row,
                doc_version 1→2; tags re-derived.
  unit (rb)   — tests/idempotency.rs: put_blocks with FK violation
                rolls back; pre-existing rows unchanged.
  contract    — tests/contract_roundtrip.rs: drives kb-parse-md +
                kb-normalize + kb-chunk on
                fixtures/markdown/code-and-table.md, persists, then
                reloads via DocumentStore::get_document /
                get_chunk and asserts byte-equal round-trip.
  snapshot    — tests/ingest_report_snapshot.rs +
                snapshots/ingest_report.snapshot.json: pin the wire
                JSON form of kb_core::IngestReport for an inline
                fixture run.
  jobs        — tests/jobs.rs: create → progress → finish flow;
                error message round-trip; list filters on status/kind.

Drops the unused `serde` direct dep from Cargo.toml; serde_json brings
its own. Dev-deps confirmed via `cargo tree -p kb-store-sqlite --depth 1`
to live only in the dev tree.
2026-04-30 17:13:03 +00:00
a3390d5171 p1-6: scaffold kb-store-sqlite crate + V001 full §5 DDL
New workspace member crate `kb-store-sqlite` (allowed deps only:
kb-core, kb-config, rusqlite[bundled], refinery, serde, serde_json,
time, blake3, tracing, anyhow, thiserror; dev-deps add kb-parse-md /
kb-normalize / kb-chunk for the contract round-trip test).

Migration V001 replaces the P0-1 stub with the full §5 DDL (assets,
documents, document_tags, blocks, chunks with policy_hash,
embedding_records, jobs, ingest_runs, answers, eval_runs,
eval_query_results) plus the §5 indexes. FTS5 virtual table + triggers
remain deferred to V002 (P2-1).

Public surface per task spec:
  SqliteStore::open / run_migrations / put_asset_with_bytes
  impl DocumentStore for SqliteStore (7 trait methods)
  impl JobRepo for SqliteStore (4 trait methods)
  StoreError { Sqlx, Migration, Conflict }

Behavior:
- Pragmas at open: foreign_keys=ON, journal_mode=WAL,
  synchronous=NORMAL, temp_store=MEMORY.
- Asset writer: byte_len ≤ copy_threshold_mb * 1MiB → copy to
  data_dir/assets/<aa>/<asset_id> (mode 0o644 on Unix), else
  reference. blake3(bytes) verified against asset.checksum; mismatch →
  Conflict.
- Idempotency: put_document UPSERTs and bumps doc_version by 1 on
  conflict; put_blocks / put_chunks DELETE-then-INSERT; document_tags
  re-derived per put_document.
- get_document rehydrates blocks via payload_json ordered by stream
  ordinal.
- list_documents builds dynamic WHERE from DocFilter (lang / trust_min
  / path_glob via GLOB / tags_any via document_tags subquery).
- JobRepo: jobs.kind/status are stored as lowercase enum tags; create
  mints a 32-hex JobId via blake3(kind || payload || nanos).

Tests follow in subsequent commits.
2026-04-30 17:08:36 +00:00