feat(p3-5): app-wiring — kb-app facade bodies for ingest / search / list / inspect #19

Merged

altair823 merged 1 commit from feat/p3-5-app-wiring into main

2026-05-01 12:32:45 +00:00


Author

SHA1

Message

Date

altair823

17d52461b2

feat(p3-5): wire kb-app facade — ingest / search / list / inspect

Replaces the P0 `bail!("not yet wired")` stubs in kb-app with real
bodies that compose the libraries shipped through P3-4. After this
commit, `cargo run -p kb-cli -- index` actually walks the workspace
and persists chunks (SQLite + optionally LanceDB), and
`cargo run -p kb-cli -- search --mode {lexical,vector,hybrid}` returns
real SearchHits with citations. `kb-app::ask` stays stubbed; P4-3
owns it.

App lifecycle (crates/kb-app/src/app.rs):
- Internal pub(crate) struct App holds the Config plus
  Arc<SqliteStore> eagerly, with embedder + LanceVectorStore behind
  OnceLock<Arc<...>> for memoization. First call pays the ~470MB
  fastembed init / Lance open; subsequent calls return the cached
  Arc::clone. OnceLock::set race losers fall back to get().cloned()
  so the lazy-init is concurrent-safe.
- One-shot CLI invocations pay the cost once at most. The P9 TUI
  (which holds an App for the session) gets memoization for free.
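
A minimal sketch of the lazy-init pattern described above, using only the
standard library. `Embedder`, `load_embedder`, and the field layout are
hypothetical stand-ins, not the actual kb-app types; the real App also holds
the Config, the SqliteStore, and the LanceVectorStore.

```rust
use std::sync::{Arc, OnceLock};

/// Hypothetical stand-in for an expensive resource such as the embedder.
#[derive(Debug)]
pub struct Embedder {
    pub model: String,
}

/// Hypothetical, simplified App: only the lazily initialized slot is shown.
pub struct App {
    embedder: OnceLock<Arc<Embedder>>,
}

impl App {
    pub fn new() -> Self {
        Self { embedder: OnceLock::new() }
    }

    /// Returns the cached embedder, building it on first use.
    pub fn embedder(&self) -> Result<Arc<Embedder>, String> {
        // Fast path: already initialized, hand out another Arc.
        if let Some(existing) = self.embedder.get() {
            return Ok(Arc::clone(existing));
        }
        // Slow path: the fallible, expensive init runs before we try to set.
        let fresh = Arc::new(load_embedder()?);
        match self.embedder.set(Arc::clone(&fresh)) {
            Ok(()) => Ok(fresh),
            // Another thread won the race: discard ours and return theirs.
            Err(_) => Ok(self
                .embedder
                .get()
                .cloned()
                .expect("set failed, so a value must be present")),
        }
    }
}

/// Hypothetical loader; the real one would initialize the embedding model.
fn load_embedder() -> Result<Embedder, String> {
    Ok(Embedder { model: "example-model".to_string() })
}
```

Two calls to `app.embedder()` on the same App return Arcs that point at the
same allocation. In a race, more than one thread may run the init, but only
one result is kept.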

ingest pipeline (lib.rs):
- FsSourceConnector::scan(&scope) → per asset:
  parse_frontmatter → parse_blocks → build_canonical_document →
  MdHeadingV1Chunker.chunk → put_asset_with_bytes → put_document →
  put_blocks → put_chunks. One transaction per document per design
  §5.8 (kb-store-sqlite's put_* methods own the transactions).
- When provider != "none" and dimensions > 0: build embedder once,
  embed each doc's chunks as Document kind, ensure_table once at the
  top of the run, then upsert the VectorRecord batch. Lexical-only
  config (provider == "none") skips both — verified by
  ingest_provider_none_skips_lance test.
- Per-asset parse failures recorded as IngestItemKind::Error with
  the warning attached; the run continues. Only structural failures
  (DB unreachable etc.) abort.
- Aggregate counts (assets_scanned / new / updated / skipped /
  errors / chunks_indexed / embeddings_indexed / duration_ms) flow
  into both the JobRepo progress_json AND a dedicated ingest_runs
  row written via SqliteStore::record_ingest_run (new pub helper
  added to kb-store-sqlite; see below).
  summary_only=true writes items_json=NULL but still populates the
  count columns.
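
The error-handling and counting behavior above can be sketched roughly as
follows. This is an illustration under assumptions, not the kb-app code:
`parse_and_chunk`, `IngestCounts`, `IngestItem`, and the `New` variant are
hypothetical, only a subset of the counters is shown, and storage and
embedding are left out.

```rust
/// Hypothetical subset of the aggregate counters.
#[derive(Debug, Default, PartialEq)]
pub struct IngestCounts {
    pub assets_scanned: u64,
    pub new: u64,
    pub errors: u64,
    pub chunks_indexed: u64,
}

#[derive(Debug, PartialEq)]
pub enum IngestItemKind {
    New,
    Error,
}

#[derive(Debug)]
pub struct IngestItem {
    pub path: String,
    pub kind: IngestItemKind,
    pub warning: Option<String>,
}

/// Hypothetical parse + chunk step: one chunk per blank-line-separated block.
fn parse_and_chunk(body: &str) -> Result<Vec<String>, String> {
    if body.trim().is_empty() {
        return Err("empty document".to_string());
    }
    Ok(body
        .split("\n\n")
        .map(|s| s.trim().to_string())
        .filter(|s| !s.is_empty())
        .collect())
}

/// Processes every asset; a per-asset failure is recorded and the run continues.
/// With `summary_only`, the per-item list is dropped but the counts are kept.
pub fn ingest(
    assets: &[(&str, &str)],
    summary_only: bool,
) -> (IngestCounts, Option<Vec<IngestItem>>) {
    let mut counts = IngestCounts::default();
    let mut items = Vec::new();
    for (path, body) in assets {
        counts.assets_scanned += 1;
        match parse_and_chunk(body) {
            Ok(chunks) => {
                counts.new += 1;
                counts.chunks_indexed += chunks.len() as u64;
                items.push(IngestItem {
                    path: (*path).to_string(),
                    kind: IngestItemKind::New,
                    warning: None,
                });
            }
            Err(warning) => {
                // Record the failure with its warning and keep going.
                counts.errors += 1;
                items.push(IngestItem {
                    path: (*path).to_string(),
                    kind: IngestItemKind::Error,
                    warning: Some(warning),
                });
            }
        }
    }
    let items = if summary_only { None } else { Some(items) };
    (counts, items)
}
```

Here `None` plays the role of items_json=NULL: the summary still carries every
counter even when the per-item detail is omitted.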

search dispatch:
- SearchMode::Lexical → LexicalRetriever directly.
- SearchMode::Vector → VectorRetriever with embedder + LanceVectorStore.
- SearchMode::Hybrid → HybridRetriever composing the two.
- Vector / Hybrid with provider=none returns a clear error naming the
  config key to flip ("models.embedding.provider").
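
As a rough sketch of the dispatch rules above: the `Retriever` enum and
`select_retriever` function are hypothetical simplifications (the real code
constructs LexicalRetriever / VectorRetriever / HybridRetriever values), and
the error wording is illustrative only.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SearchMode {
    Lexical,
    Vector,
    Hybrid,
}

/// Hypothetical tag for which retriever would be built.
#[derive(Debug, PartialEq)]
pub enum Retriever {
    Lexical,
    Vector,
    Hybrid,
}

/// Chooses a retriever for `mode`, rejecting embedding-backed modes when the
/// embedding provider is disabled.
pub fn select_retriever(mode: SearchMode, provider: &str) -> Result<Retriever, String> {
    let embeddings_enabled = provider != "none";
    match mode {
        SearchMode::Lexical => Ok(Retriever::Lexical),
        // The guard applies to both alternatives of the or-pattern.
        SearchMode::Vector | SearchMode::Hybrid if !embeddings_enabled => Err(format!(
            "{mode:?} search needs embeddings; set `models.embedding.provider` \
             to a value other than \"none\""
        )),
        SearchMode::Vector => Ok(Retriever::Vector),
        SearchMode::Hybrid => Ok(Retriever::Hybrid),
    }
}
```

Calling `select_retriever(SearchMode::Vector, "none")` yields an Err whose
message names the config key, while Lexical mode works with any provider
value.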

list_docs / inspect_doc / inspect_chunk delegate straight to
DocumentStore trait methods. Each returns Err with an actionable
message on not-found.
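
A minimal illustration of the not-found behavior, assuming a hypothetical
in-memory `DocStore` in place of the DocumentStore trait; the message text is
an example, not the actual wording.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the document store (id -> title).
pub struct DocStore {
    pub docs: HashMap<String, String>,
}

/// Looks up a document and turns a miss into an error that says what to do next.
pub fn inspect_doc(store: &DocStore, doc_id: &str) -> Result<String, String> {
    store.docs.get(doc_id).cloned().ok_or_else(|| {
        format!("document `{doc_id}` not found; run `kb-cli list docs` to see indexed ids")
    })
}
```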

Test seam: each public free function has a matching
#[doc(hidden)] pub fn *_with_config(cfg, ...) companion that
integration tests invoke directly (the public form internally calls
load_config()). pub(crate) would not reach across the integration-
tests crate boundary; #[doc(hidden)] keeps it out of rustdoc and the
function comment flags it as test-only.
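
The seam might look roughly like this. `Config`, `load_config`, and the body
of `list_docs_with_config` are hypothetical placeholders; only the shape (a
public wrapper that loads config, plus a hidden `*_with_config` companion)
comes from the description above.

```rust
/// Hypothetical, pared-down config.
#[derive(Debug, Clone)]
pub struct Config {
    pub workspace_root: String,
}

/// Hypothetical loader; the real one would read the user's config file.
fn load_config() -> Result<Config, String> {
    Ok(Config { workspace_root: ".".to_string() })
}

/// Public entry point: loads the config, then delegates to the seam.
pub fn list_docs() -> Result<Vec<String>, String> {
    let cfg = load_config()?;
    list_docs_with_config(&cfg)
}

/// Test-only seam. It must be `pub` so integration tests (a separate crate)
/// can call it; `#[doc(hidden)]` keeps it out of the rendered docs.
#[doc(hidden)]
pub fn list_docs_with_config(cfg: &Config) -> Result<Vec<String>, String> {
    // Placeholder body: the real function would query the document store.
    Ok(vec![format!("{}/README.md", cfg.workspace_root)])
}
```

An integration test can then build a Config that points at a temporary
directory and call `list_docs_with_config(&cfg)` without touching the user's
real configuration.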

kb-store-sqlite additions:
- pub struct IngestRunRow + pub fn record_ingest_run on SqliteStore
  for the kb-app aggregate-counts persistence path. Helper writes
  the ingest_runs row directly with all aggregate columns; jobs
  table still gets a JobRepo create/update_progress/finish trio in
  parallel.

Tests (11 default, 2 #[ignore] AVX-gated):
- ingest_lexical: round-trip, idempotent, summary_only_drops_items,
  provider_none_skips_lance (asserts no .lance dir on disk),
  records_ingest_runs_row_with_aggregate_counts, tags_any filter,
  inspect_doc_not_found, inspect_chunk_not_found.
- search_lexical: lexical hits with embedding_model=None,
  empty_query_returns_empty, vector_mode_with_provider_none returns
  clear error.
- search_vector: hybrid mode end-to-end (#[ignore], AVX), Vector
  mode embedding_model assertion (#[ignore], AVX). Both run on the
  AVX VM in ~21s combined (first run pays the model download).
- TestEnv pins workspace.root + storage.{data_dir,model_dir} to a
  TempDir so tests don't touch the user's $HOME/.local/share.
- Fixture workspace at crates/kb-app/tests/fixtures/workspace/ has
  three small markdown files with varied frontmatter (rust+cargo+
  python tags) so the tags_any filter test exercises a non-trivial
  predicate.

Workspace 269 passed / 24 ignored / 0 failed (was 261/22). cargo
clippy --workspace --all-targets -- -D warnings clean. CLI smoke
verified manually: `cargo run -p kb-cli -- index` returns a real
IngestReport JSON; `cargo run -p kb-cli -- search "..."` returns
hits with citations; `cargo run -p kb-cli -- list docs` lists the
indexed documents.

Allowed deps respected: kb-source-fs, kb-parse-md, kb-parse-types,
kb-normalize, kb-chunk, kb-store-sqlite, kb-search, kb-store-vector,
kb-embed, kb-embed-local plus existing tracing / anyhow / serde /
toml / dirs and now blake3 (run_id) + time. Forbidden (kb-llm*,
kb-rag, kb-tui, kb-desktop, kb-parse-{pdf,image,audio}) absent from
cargo tree -p kb-app.

Out of scope per spec: ask body (P4-3), --rebuild-fts wiring,
--resume checkpointing (P+), --watch (P+), TUI / desktop integration
(P9 consumes this facade).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-01 12:11:21 +00:00
