Replaces the P0 `bail!("not yet wired")` stubs in kb-app with real
bodies that compose the libraries shipped through P3-4. After this
commit, `cargo run -p kb-cli -- index` actually walks the workspace
and persists chunks (SQLite + optionally LanceDB), and
`cargo run -p kb-cli -- search --mode {lexical,vector,hybrid}` returns
real SearchHits with citations. `kb-app::ask` stays stubbed; P4-3
owns it.
App lifecycle (crates/kb-app/src/app.rs):
- Internal pub(crate) struct App holds the Config plus
Arc<SqliteStore> eagerly, with embedder + LanceVectorStore behind
OnceLock<Arc<...>> for memoization. First call pays the ~470MB
fastembed init / Lance open; subsequent calls return the cached
Arc::clone. OnceLock::set race losers fall back to get().cloned()
so the lazy-init is concurrent-safe.
- One-shot CLI invocations pay the init cost at most once. The P9 TUI
(which holds an App for the session) gets memoization for free.
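
A minimal sketch of the OnceLock memoization described above, using only
std. `Embedder`, `load_embedder`, and the `String` error type are
hypothetical stand-ins; the real App in crates/kb-app/src/app.rs may be
shaped differently.

```rust
use std::sync::{Arc, OnceLock};

// Hypothetical stand-in for the real embedder; only the caching shape matters.
struct Embedder {
    dims: usize,
}

// Hypothetical loader standing in for the expensive model init.
fn load_embedder() -> Result<Embedder, String> {
    Ok(Embedder { dims: 384 })
}

struct App {
    embedder: OnceLock<Arc<Embedder>>,
}

impl App {
    fn new() -> Self {
        App { embedder: OnceLock::new() }
    }

    // Fallible lazy init: build outside the cell, then try to publish.
    fn embedder(&self) -> Result<Arc<Embedder>, String> {
        // Fast path: already initialized, hand out a cheap Arc clone.
        if let Some(cached) = self.embedder.get() {
            return Ok(Arc::clone(cached));
        }
        let built = Arc::new(load_embedder()?);
        match self.embedder.set(Arc::clone(&built)) {
            Ok(()) => Ok(built),
            // Lost the race: another thread published first, so use its value.
            Err(_) => Ok(self
                .embedder
                .get()
                .cloned()
                .expect("set failed, so a value is present")),
        }
    }
}
```

Two threads racing on first use may both run the loader, but only one
value is ever published, so every caller ends up sharing the same Arc.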
ingest pipeline (lib.rs):
- FsSourceConnector::scan(&scope) → per asset:
parse_frontmatter → parse_blocks → build_canonical_document →
MdHeadingV1Chunker.chunk → put_asset_with_bytes → put_document →
put_blocks → put_chunks. One transaction per document per design
§5.8 (kb-store-sqlite's put_* methods own the transactions).
- When provider != "none" and dimensions > 0: build the embedder once,
call ensure_table once at the top of the run, then for each document
embed its chunks as Document kind and upsert the VectorRecord batch.
Lexical-only config (provider == "none") skips both the embedder and
LanceDB; verified by the ingest_provider_none_skips_lance test.
- Per-asset parse failures are recorded as IngestItemKind::Error with
the warning attached, and the run continues. Only structural failures
(DB unreachable, etc.) abort.
- Aggregate counts (assets_scanned / new / updated / skipped /
errors / chunks_indexed / embeddings_indexed / duration_ms) flow
into both the JobRepo progress_json and a dedicated ingest_runs
row written via SqliteStore::record_ingest_run (new pub helper
added to kb-store-sqlite; see below). summary_only=true writes
items_json=NULL but still populates the count columns.
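
The error-handling and counting behaviour above can be sketched roughly
as follows. All names here (`IngestCounts`, `ItemKind`, `process_asset`,
the paragraph-based "chunking") are illustrative assumptions and not the
kb-app types; the point is only that a per-asset failure becomes an
Error item while the loop keeps going, and that summary_only drops the
items but keeps the counts.

```rust
#[derive(Debug, Default, PartialEq)]
struct IngestCounts {
    assets_scanned: u64,
    new: u64,
    errors: u64,
    chunks_indexed: u64,
}

#[derive(Debug, PartialEq)]
enum ItemKind {
    New,
    Error,
}

#[derive(Debug)]
struct IngestItem {
    path: String,
    kind: ItemKind,
    warning: Option<String>,
}

// Hypothetical per-asset step: returns the chunk count, or a parse error.
fn process_asset(path: &str, body: &str) -> Result<u64, String> {
    if body.trim().is_empty() {
        return Err(format!("{path}: empty document"));
    }
    // Pretend each blank-line-separated paragraph is one chunk.
    Ok(body.split("\n\n").filter(|p| !p.trim().is_empty()).count() as u64)
}

fn ingest(
    assets: &[(&str, &str)],
    summary_only: bool,
) -> (IngestCounts, Option<Vec<IngestItem>>) {
    let mut counts = IngestCounts::default();
    let mut items = Vec::new();
    for (path, body) in assets {
        counts.assets_scanned += 1;
        match process_asset(path, body) {
            Ok(chunks) => {
                counts.new += 1;
                counts.chunks_indexed += chunks;
                items.push(IngestItem {
                    path: path.to_string(),
                    kind: ItemKind::New,
                    warning: None,
                });
            }
            // A per-asset failure is recorded; the run continues.
            Err(warning) => {
                counts.errors += 1;
                items.push(IngestItem {
                    path: path.to_string(),
                    kind: ItemKind::Error,
                    warning: Some(warning),
                });
            }
        }
    }
    // summary_only keeps the aggregate counts but drops the item list.
    let items = if summary_only { None } else { Some(items) };
    (counts, items)
}
```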
search dispatch:
- SearchMode::Lexical → LexicalRetriever directly.
- SearchMode::Vector → VectorRetriever with embedder + LanceVectorStore.
- SearchMode::Hybrid → HybridRetriever composing the two.
- Vector / Hybrid with provider=none returns a clear error naming the
config key to flip ("models.embedding.provider").
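
As a rough illustration of the dispatch rules above, assuming a
simplified `Retriever` enum in place of the real retriever types and a
hypothetical `select_retriever` helper (neither is taken from kb-app):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum SearchMode {
    Lexical,
    Vector,
    Hybrid,
}

// Simplified stand-in for the concrete retriever types.
#[derive(Debug, PartialEq)]
enum Retriever {
    Lexical,
    Vector,
    Hybrid,
}

// Hypothetical helper: pick a retriever, or explain which config key to flip.
fn select_retriever(mode: SearchMode, provider: &str) -> Result<Retriever, String> {
    match mode {
        // Lexical search never needs an embedder.
        SearchMode::Lexical => Ok(Retriever::Lexical),
        // Vector and Hybrid both require an embedding provider.
        SearchMode::Vector | SearchMode::Hybrid if provider == "none" => Err(format!(
            "{mode:?} search needs an embedding provider; \
             set `models.embedding.provider` in the config"
        )),
        SearchMode::Vector => Ok(Retriever::Vector),
        SearchMode::Hybrid => Ok(Retriever::Hybrid),
    }
}
```

Checking the provider before constructing anything means a lexical-only
setup fails fast with the config key in the message instead of
attempting a model load.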
list_docs / inspect_doc / inspect_chunk delegate straight to
DocumentStore trait methods and return Err with an actionable message
when the target is not found.
Test seam: each public free function has a matching
#[doc(hidden)] pub fn *_with_config(cfg, ...) companion that
integration tests invoke directly (the public form internally calls
load_config()). pub(crate) would not reach across the integration-
tests crate boundary; #[doc(hidden)] keeps it out of rustdoc and the
function comment flags it as test-only.
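
A small sketch of the seam pattern, with a made-up `Config` field, a
placeholder `load_config`, and a fake result; only the public wrapper
plus `#[doc(hidden)]` companion shape reflects the text above:

```rust
#[derive(Debug, Clone)]
pub struct Config {
    // Hypothetical field; the real Config has more in it.
    pub workspace_root: String,
}

// Placeholder loader; a real one would read the user's config file.
fn load_config() -> Result<Config, String> {
    Ok(Config { workspace_root: ".".to_string() })
}

/// Public entry point: resolves the config itself, then delegates.
pub fn list_docs() -> Result<Vec<String>, String> {
    let cfg = load_config()?;
    list_docs_with_config(&cfg)
}

/// Test-only seam: integration tests pass a Config pinned to a temp dir.
#[doc(hidden)]
pub fn list_docs_with_config(cfg: &Config) -> Result<Vec<String>, String> {
    // Fake result so the sketch stays self-contained.
    Ok(vec![format!("{}/README.md", cfg.workspace_root)])
}
```

An integration test can then build its own Config and call the
`_with_config` form without touching the real config location.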
kb-store-sqlite additions:
- pub struct IngestRunRow + pub fn record_ingest_run on SqliteStore
for the kb-app aggregate-counts persistence path. The helper writes
the ingest_runs row directly with all aggregate columns; the jobs
table still gets its JobRepo create/update_progress/finish trio
alongside it.
Tests (11 default, 2 #[ignore] AVX-gated):
- ingest_lexical: round-trip, idempotent, summary_only_drops_items,
provider_none_skips_lance (asserts no .lance dir on disk),
records_ingest_runs_row_with_aggregate_counts, tags_any filter,
inspect_doc_not_found, inspect_chunk_not_found.
- search_lexical: lexical hits with embedding_model=None,
empty_query_returns_empty, vector_mode_with_provider_none returns
clear error.
- search_vector: hybrid mode end-to-end (#[ignore], AVX), Vector
mode embedding_model assertion (#[ignore], AVX). Both run on the
AVX VM in ~21s combined (first run pays the model download).
- TestEnv pins workspace.root + storage.{data_dir,model_dir} to a
TempDir so tests don't touch the user's $HOME/.local/share.
- Fixture workspace at crates/kb-app/tests/fixtures/workspace/ has
three small markdown files with varied frontmatter (rust+cargo+
python tags) so the tags_any filter test exercises a non-trivial
predicate.
Workspace tests: 269 passed / 24 ignored / 0 failed (was 261/22).
`cargo clippy --workspace --all-targets -- -D warnings` is clean. CLI
smoke
verified manually: `cargo run -p kb-cli -- index` returns a real
IngestReport JSON; `cargo run -p kb-cli -- search "..."` returns
hits with citations; `cargo run -p kb-cli -- list docs` lists the
indexed documents.
Allowed deps respected: kb-source-fs, kb-parse-md, kb-parse-types,
kb-normalize, kb-chunk, kb-store-sqlite, kb-search, kb-store-vector,
kb-embed, kb-embed-local plus existing tracing / anyhow / serde /
toml / dirs and now blake3 (run_id) + time. Forbidden (kb-llm*,
kb-rag, kb-tui, kb-desktop, kb-parse-{pdf,image,audio}) absent from
cargo tree -p kb-app.
Out of scope per spec: ask body (P4-3), --rebuild-fts wiring,
--resume checkpointing (P+), --watch (P+), TUI / desktop integration
(P9 consumes this facade).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>