kebab

Author	SHA1	Message	Date
altair823	27c669fbf9	feat(p4-1): kb-llm crate — LanguageModel trait re-export + MockLanguageModel Establishes the kb-llm trait crate so concrete LLM adapters (p4-2 Ollama, future llama.cpp / candle) target a stable surface. Pure re- export of kb_core::{LanguageModel, GenerateRequest, TokenChunk, FinishReason, TokenUsage, ModelRef} plus a feature-gated deterministic mock for downstream RAG tests (p4-3) that need an LLM trait object without an Ollama dependency. MockLanguageModel (cfg(feature = "mock"), default OFF): - Holds canned_response + canned_finish + canned_usage + (model_id, provider, context_tokens). Pure in-memory; no I/O. - generate_stream() honors GenerateRequest.stop: scans every non-empty stop string against the canned response, takes the earliest byte position (Iterator::min returns the first equal element on ties so declaration order in req.stop wins), truncates with a direct byte- slice (str::find returns a UTF-8 char boundary by contract). - When a stop matches, finish_reason is overridden to Stop (matches OpenAI / Ollama real-world behaviour); otherwise the caller's canned_finish passes through verbatim. - Emits one TokenChunk::Token per Unicode scalar value (char), NOT per grapheme cluster — Hangul jamo, emoji ZWJ sequences, combining marks split. Acceptable for trait-shape testing; real adapters MAY combine. Documented in module docs. - Always terminates with TokenChunk::Done { finish_reason, usage } even if the canned response is empty. The returned iterator is a boxed Vec<TokenChunk>::into_iter().map(Ok), trivially Send. - Real adapters MAY return Err from generate_stream itself (e.g. connection refused) before any chunk is yielded; the mock never does. Documented for the trait re-exporter consumer audience. Helpers: - assert_finish_chunk(chunks) — asserts the last chunk is a Done. Useful for proptests asserting trait contract over random inputs. Tests: - cargo test -p kb-llm (no features): 2 reexport / dyn-dispatch tests. - cargo test -p kb-llm --features mock: 9 tests including 100-case proptest over random Unicode strings asserting Done terminator, char-count == streamed Token chunks, concat == canned (truncated by stop), plus explicit cases for stop-string truncation, first-stop- match precedence, model_ref dimensions=None invariant, finish reason pass-through. - All 271 workspace tests pass; clippy clean for both default and mock-on feature configurations. Symbol gating verified: - cargo build --release -p kb-llm (default): nm shows zero MockLanguageModel symbols. - cargo build --release -p kb-llm --features mock: three trait-impl symbols present. Spec invariant "release builds MUST NOT include MockLanguageModel" enforced at the symbol level. Allowed deps respected: only kb-core (path) and anyhow (workspace, forced by trait return type). Dropped kb-config / serde / thiserror / tracing from the spec's allowed list — they are listed as Allowed but nothing in this skeleton crate references them, and dropping them keeps the dependency graph slim for downstream consumers. p4-2/p4-3 will add what they need at their own dep sites. Forbidden deps (reqwest, ureq, tokio, whisper-rs, kb-source-fs, kb-parse-md, kb-normalize, kb-chunk, kb-store-, kb-embed, kb-search, kb-rag, kb-tui, kb-desktop) all absent from cargo tree -p kb-llm. Out of scope: real adapter (p4-2 Ollama), token counting against the real tokenizer, server-side cancellation / abort signals (P+). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 13:37:46 +00:00
altair823	17d52461b2	feat(p3-5): wire kb-app facade — ingest / search / list / inspect Replaces the P0 `bail!("not yet wired")` stubs in kb-app with real bodies that compose the libraries shipped through P3-4. After this commit, `cargo run -p kb-cli -- index` actually walks the workspace and persists chunks (SQLite + optionally LanceDB), and `cargo run -p kb-cli -- search --mode {lexical,vector,hybrid}` returns real SearchHits with citations. `kb-app::ask` stays stubbed; P4-3 owns it. App lifecycle (crates/kb-app/src/app.rs): - Internal pub(crate) struct App holds the Config plus Arc<SqliteStore> eagerly, with embedder + LanceVectorStore behind OnceLock<Arc<...>> for memoization. First call pays the ~470MB fastembed init / Lance open; subsequent calls return the cached Arc::clone. OnceLock::set race losers fall back to get().cloned() so the lazy-init is concurrent-safe. - One-shot CLI invocations pay the cost once at most. The P9 TUI (which holds an App for the session) gets memoization for free. ingest pipeline (lib.rs): - FsSourceConnector::scan(&scope) → per asset: parse_frontmatter → parse_blocks → build_canonical_document → MdHeadingV1Chunker.chunk → put_asset_with_bytes → put_document → put_blocks → put_chunks. One transaction per document per design §5.8 (kb-store-sqlite's put_* methods own the transactions). - When provider != "none" and dimensions > 0: build embedder once, embed each doc's chunks as Document kind, ensure_table once at the top of the run, then upsert the VectorRecord batch. Lexical-only config (provider == "none") skips both — verified by ingest_provider_none_skips_lance test. - Per-asset parse failures recorded as IngestItemKind::Error with the warning attached; the run continues. Only structural failures (DB unreachable etc.) abort. - Aggregate counts (assets_scanned / new / updated / skipped / errors / chunks_indexed / embeddings_indexed / duration_ms) flow into both the JobRepo progress_json AND a dedicated ingest_runs row written via SqliteStore::record_ingest_run (new pub(crate) helper added to kb-store-sqlite — see below). summary_only=true writes items_json=NULL but still populates the count columns. search dispatch: - SearchMode::Lexical → LexicalRetriever directly. - SearchMode::Vector → VectorRetriever with embedder + LanceVectorStore. - SearchMode::Hybrid → HybridRetriever composing the two. - Vector / Hybrid with provider=none returns a clear error naming the config key to flip ("models.embedding.provider"). list_docs / inspect_doc / inspect_chunk delegate straight to DocumentStore trait methods. Returns Err with actionable message on not-found. Test seam: each public free function has a matching #[doc(hidden)] pub fn _with_config(cfg, ...) companion that integration tests invoke directly (the public form internally calls load_config()). pub(crate) would not reach across the integration- tests crate boundary; #[doc(hidden)] keeps it out of rustdoc and the function comment flags it as test-only. kb-store-sqlite additions: - pub struct IngestRunRow + pub fn record_ingest_run on SqliteStore for the kb-app aggregate-counts persistence path. Helper writes the ingest_runs row directly with all aggregate columns; jobs table still gets a JobRepo create/update_progress/finish trio in parallel. Tests (11 default, 2 #[ignore] AVX-gated): - ingest_lexical: round-trip, idempotent, summary_only_drops_items, provider_none_skips_lance (asserts no .lance dir on disk), records_ingest_runs_row_with_aggregate_counts, tags_any filter, inspect_doc_not_found, inspect_chunk_not_found. - search_lexical: lexical hits with embedding_model=None, empty_query_returns_empty, vector_mode_with_provider_none returns clear error. - search_vector: hybrid mode end-to-end (#[ignore], AVX), Vector mode embedding_model assertion (#[ignore], AVX). Both run on the AVX VM in ~21s combined (first run pays the model download). - TestEnv pins workspace.root + storage.{data_dir,model_dir} to a TempDir so tests don't touch the user's $HOME/.local/share. - Fixture workspace at crates/kb-app/tests/fixtures/workspace/ has three small markdown files with varied frontmatter (rust+cargo+ python tags) so the tags_any filter test exercises a non-trivial predicate. Workspace 269 passed / 24 ignored / 0 failed (was 261/22). cargo clippy --workspace --all-targets -- -D warnings clean. CLI smoke verified manually: `cargo run -p kb-cli -- index` returns a real IngestReport JSON; `cargo run -p kb-cli -- search "..."` returns hits with citations; `cargo run -p kb-cli -- list docs` lists the indexed documents. Allowed deps respected: kb-source-fs, kb-parse-md, kb-parse-types, kb-normalize, kb-chunk, kb-store-sqlite, kb-search, kb-store-vector, kb-embed, kb-embed-local plus existing tracing / anyhow / serde / toml / dirs and now blake3 (run_id) + time. Forbidden (kb-llm, kb-rag, kb-tui, kb-desktop, kb-parse-{pdf,image,audio}) absent from cargo tree -p kb-app. Out of scope per spec: ask body (P4-3), --rebuild-fts wiring, --resume checkpointing (P+), --watch (P+), TUI / desktop integration (P9 consumes this facade). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 12:11:21 +00:00
altair823	ccd49ef546	feat(p3-4): hybrid-fusion — VectorRetriever + HybridRetriever (RRF) Composes the existing LexicalRetriever (P2-2) with a new VectorRetriever wrapper around LanceVectorStore (P3-3) into a single Retriever that dispatches by SearchMode. For SearchMode::Hybrid, fuses lexical and vector candidates via Reciprocal Rank Fusion and populates the full RetrievalDetail per SearchHit so kb search --explain can attribute scores back to each side. Public surface (kb-search crate): - pub struct VectorRetriever — Arc<dyn VectorStore + Send + Sync>, Arc<dyn Embedder>, Arc<SqliteStore>, IndexVersion at construction. - pub struct HybridRetriever { lexical, vector, fusion, k }. - pub enum FusionPolicy { Rrf { k_rrf: u32 } }. VectorRetriever: - Embeds query.text as EmbeddingKind::Query before delegating to VectorStore::search(query_vec, query.k * 2, &query.filters). Over- fetches by ×2 for filter losses; LanceVectorStore applies the filters internally so they propagate naturally. - Hydrates each VectorHit into a full SearchHit by joining on chunk_id in a single IN-clause batch (no N+1): doc_path, section_label, chunker_version, source_spans for citation, plus embedding_model from embedder.model_id(). - Snippet trimmed to config.search.snippet_chars (vector mode lacks FTS5 highlighting; chunk text prefix is the next-best signal). - Citation built from the chunk's first source span via the shared citation_helper module — extracted from lexical.rs so both retrievers compute citations identically (Byte/empty fallback to Line{1,1} preserved with tracing::warn). - RetrievalDetail.method = Vector for standalone calls; both fusion_score and vector_score set to the LanceVectorStore-shifted cosine score; lexical_* None. HybridRetriever: - Lexical / Vector modes delegate 1:1 — no rebuild of RetrievalDetail. - Hybrid mode runs both retrievers with k * 2 fanout, fuses with RRF (score(c) = Σ 1/(k_rrf + rank_m(c))), sorts fused-score DESC with deterministic tiebreaker (lex_rank ASC then chunk_id ASC), takes top query.k. Fusion math runs in f64 throughout; cast to f32 only at the SearchHit boundary where bounded magnitude (≤ ~0.033 at k_rrf=60) makes f32 precision sufficient for ranking. - Per-hit lexical preferred for snippet/citation/heading_path/ chunker_version/embedding_model when the chunk appears in both retrievers — FTS5 highlighting is more user-relevant than vector's truncated text. Vector-only chunks fall through to vector hit data. - index_version returns format!("hybrid:{}+{}", lex_iv, vec_iv) at construction; mismatched lex/vec versions trigger a tracing::warn so users notice stale indexes (spec line 143). kb-search additions: - citation_helper.rs — pub(crate) citation_from_first_span shared between lexical and vector retrievers. Extracted from lexical.rs; no behavior drift. Tests (38 default + 3 ignored): - 12 unit tests in hybrid.rs covering RRF math (1/61 + 1/62 within f32 epsilon × 10 tolerance), lexical/vector mode delegation, hybrid preserves single-side hits with the missing side's RetrievalDetail None, deterministic tiebreaker on identical fused scores, composite index_version, mismatched-version warn at construction. - 2 unit tests in vector.rs covering the snippet-prefix and citation fallback paths. - 11 unit tests in lexical.rs (unchanged from P2-2). - 13 lexical integration tests (unchanged). - 3 #[ignore] AVX-gated hybrid integration tests: disjoint-corpus recall (lex returns A,B; vec returns C,D; hybrid returns all 4), determinism over two queries, snapshot stability against tests/fixtures/search/hybrid/run-1.json. Snapshot fixture was regenerated against this branch on an AVX-enabled VM and contains 4 real chunks (c1/c2 lex+vec, c3/c4 vec-only). - KB_UPDATE_SNAPSHOTS=1 path now panics after writing instead of silently passing — matches the P3-2/P3-3 fail-loud-instead-of- silent-pass philosophy. Allowed deps respected (kb-core, kb-config, kb-store-sqlite, kb-store-vector, kb-embed, tracing, thiserror) plus pre-existing kb-search deps from P2-2 (rusqlite, globset, serde_json, anyhow). kb-embed-local does NOT appear — VectorRetriever takes Arc<dyn Embedder> trait object; the concrete adapter is runtime-injected by kb-app. Out of scope: reranker (P+), score calibration across modes (RRF is rank-comparable so absolute calibration is P+), multimodal retrieval (P6+). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:22:21 +00:00
altair823	3cd5117a7e	feat(p3-3): kb-store-vector — LanceDB VectorStore + V003 embedding status First VectorStore implementation. Per-model Lance tables under config.storage.vector_dir, two-phase upsert (SQLite-pending → Lance MergeInsert → SQLite-committed) with crash-safe retry, search via cosine distance with the spec's score-shift (preserves negative similarity ranking signal that clamping would crush). V003 migration: - Adds status (CHECK constraint pending\|committed\|tombstone, default pending) and vector_committed columns to embedding_records. - BEFORE DELETE trigger on chunks flips dependent rows to tombstone. Currently overshadowed by V001's ON DELETE CASCADE FK; trigger UPDATE runs first then row vanishes via CASCADE. Spec-faithful tombstone preservation requires recreating embedding_records to drop the CASCADE — deferred to a P+ migration since no production rows exist yet (P3-3 is the first writer). V003 SQL comment explains. LanceVectorStore: - ensure_table is idempotent: opens existing or creates with the Arrow schema (chunk_id, doc_id, embedding FixedSizeList<Float32, dim>, model_id, embedding_version, text, heading_path, created_at). - IndexId computed via id_for_index with collection="chunk_embeddings", index_kind="flat", params_hash = blake3(descriptor JSON). Schema bumps automatically rotate the IndexId. - upsert: phase-1 INSERT OR REPLACE INTO embedding_records (status= 'pending') in a single SQLite tx; phase-2 Lance MergeInsert keyed on chunk_id (idempotent re-run); phase-3 UPDATE status='committed', vector_committed=1. If phase-2 fails the rows stay 'pending' and the next upsert call retries idempotently. - search joins embedding_records WHERE status='committed' so partial- write rows never surface. Cosine distance from Lance ∈ [0, 2] → similarity = 1 - distance ∈ [-1, 1] → score = (similarity + 1)/2 ∈ [0, 1]. NaN coerced to 0 with tracing::warn. Filter by SearchFilters via SqliteStore::filter_chunks (added in this commit). - Sync trait + async LanceDB bridged by an embedded current-thread tokio runtime. Doc-comment on the struct flags the "do NOT call from inside another tokio runtime" panic (block_on cannot nest). kb-app's job scheduler is sync today. kb-store-sqlite additions: - pub fn put_embedding_records_pending(&[EmbeddingRecordRow]) — phase-1 INSERT OR REPLACE (status='pending', vector_committed=0). - pub fn mark_embedding_records_committed(&[EmbeddingId]) — phase-3 single UPDATE … WHERE embedding_id IN (?, ?, …) via params_from_iter, guarded by WHERE status='pending' so tombstones don't get clobbered. - pub fn filter_chunks(&[ChunkId], &SearchFilters) → Vec<ChunkId> consolidates the JOIN against documents/document_tags/ embedding_records + path_glob via globset. Lets kb-store-vector honor SearchFilters without depending on rusqlite or globset directly. (kb-search's filter logic is structurally different — interleaved with the FTS5 SELECT — so it stays as-is for now; consolidation is a P+ refactor.) - 4 new unit tests cover the phase-1 round-trip, empty batch, replay reset of pending rows, and the WHERE-status-pending guard. Tests: - 9 lib unit tests in kb-store-vector covering paths/sanitization, arrow_batch dim validation + descriptor hash, bm25-style cosine score shift math. - 4 new kb-store-sqlite unit tests on filter_chunks (committed-only, tags/lang/trust/path_glob, order preservation, empty input). - 4 new kb-store-sqlite unit tests on the embedding_records helpers. - 8 integration tests in upsert_search.rs and 1 snapshot test marked #[ignore = "requires AVX-capable hardware (LanceDB)"]. They invoke require_avx_or_panic() at the top of each body so a missing-AVX --ignored run fails loudly instead of silently passing. This dev host (qemu64 model) lacks AVX so these were NOT exercised end-to- end here — first CI lane on AVX hardware will validate them. - Snapshot fixture tests/fixtures/vector/run-1.json is a placeholder with an _comment marker. Snapshot test panics until the placeholder is replaced via KB_UPDATE_SNAPSHOTS=1 on AVX hardware. - Workspace 241 passed, 19 ignored, 0 failed; cargo clippy --workspace --all-targets -- -D warnings clean. Allowed deps respected (kb-core, kb-config, kb-store-sqlite, lancedb, arrow + arrow-array + arrow-schema, serde, serde_json, tracing, thiserror) plus forced waivers — anyhow (trait return type), tokio + futures (LanceDB async-only API), blake3 (params_hash). rusqlite and globset are NOT direct deps of kb-store-vector — confirmed via cargo metadata --no-deps. rusqlite stays in [dev-dependencies] for the test fixture seeder only. Out of scope: IVF/PQ index tuning (P+), image vectors (P6), kb-app embed_index orchestration (P3-4 facade). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 10:01:31 +00:00
altair823	bcbe2b8531	feat(p3-2): kb-embed-local crate — fastembed adapter for multilingual-e5-small First real Embedder implementation. Wraps fastembed-rs (ONNX runtime) with the e5 prefix convention, batching, and {data_dir}/${XDG_DATA_HOME} template expansion so model files land under config.storage.model_dir/ fastembed/ without polluting kb-config's public API. Public surface: - pub struct FastembedEmbedder - pub fn new(config: &Config) -> Result<Self> - impl kb_core::Embedder (via kb-embed re-export) Behavior: - Default model multilingual-e5-small (384 dims). model_id and model_version come from config.models.embedding.{model,version}. - Pre-load dim check via TextEmbedding::get_model_info: dim mismatch bails before paying the ~470MB ONNX init cost. - e5 prefix applied BEFORE tokenization: "passage: " for EmbeddingKind::Document, "query: " for EmbeddingKind::Query. Pinned by prefix_input unit tests. - Batches inputs into chunks of config.models.embedding.batch_size, concatenates results in input order. - L2 normalization is performed by fastembed 4.9's default transformer pipeline (verified at fastembed/src/text_embedding/output.rs:43); we skip re-normalization. Integration test pins ‖v‖ ≈ 1.0 ± 1e-3 so a future fastembed bump that drops this invariant fails loudly. - Synchronous (no async runtime). Mutex serializes calls into the underlying ONNX session — conservative; ORT Session is Send+Sync but callers (kb-app indexer) batch sequentially anyway. Revisit if profiling shows contention. - First-run model download surfaces via tracing::info before/after TextEmbedding::try_new — users no longer stare at a silent 30-60s pause during the 470MB pull. Tests: - 11 default-lane tests covering: check_dim match/mismatch (no model load), prefix_input Document/Query/empty, resolve_model known/unknown, expand_path substitution + no-op + XDG_DATA_HOME set + XDG_DATA_HOME unset (falls back to ~/.local/share with recursive ~ expansion). XDG tests serialize on a Mutex + RAII guard since edition 2024 makes set_var/remove_var unsafe. - 7 #[ignore] integration tests covering: full construction with default config, dim-mismatch belt-and-braces, Document vs Query cosine differential, L2 unit norm, byte-equal determinism, batch-64 performance under 5s, snapshot-hash stability over a 5-sentence multilingual fixture. - Snapshot test fails LOUDLY when SNAPSHOT_HASH_BASELINE is 0 — prints the captured hash and panics with paste-back instructions, so first --ignored run forces the maintainer to pin the baseline rather than silently passing. - Workspace: 222 tests pass (default lane); clippy clean. Allowed deps respected: kb-config, kb-embed (re-exports kb-core trait surface), fastembed = "4.9", tracing, anyhow. tokenizers and ort enter transitively through fastembed; reqwest/hyper/hf-hub also transitive (model download is fastembed's responsibility per spec carve-out). No direct kb-core dep needed — re-exports cover it. Pinned to fastembed 4.x rather than the recent 5.x to limit blast radius; consider bump when p3-3 (lancedb-store) consumes the embedder output shape. Out of scope: reranker (P+), Ollama embedding endpoint, candle adapter, image embeddings (P6). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 08:39:38 +00:00
altair823	2e3eb8f437	feat(p3-1): kb-embed crate — Embedder trait re-export + MockEmbedder Establishes the kb-embed trait crate so concrete embedding adapters (p3-2 fastembed, future ollama-embed/candle) target a stable surface. Pure re-export of kb_core::{Embedder, EmbeddingInput, EmbeddingKind, EmbeddingModelId, EmbeddingVersion} plus a feature-gated deterministic mock for downstream tests. MockEmbedder (cfg(feature = "mock"), default OFF): - Per-component hash recipe: blake3(seed_le8 \|\| kind_byte \|\| text_len_le8 \|\| text \|\| i_le8). Length-prefixed text avoids the domain-separation ambiguity where two (text, i) pairs could shift bytes between text tail and the i field. - Document = 0u8, Query = 1u8 — same text different kind yields different vectors (mirrors e5 prefix behaviour). - Per component: blake3 first 8 bytes → u64 → reinterpret as i64 → f64/i64::MAX → f32. i64::MIN gives -1.0000000000000002 which f32 rounds to -1.0; range [-1, 1] holds. - L2 unit-normalised. Norm sums in f64 (avoid catastrophic precision loss) before f32 cast. Zero-norm guard skips the divide. - with_seed(...) constructor lets two embedders share identity but produce different vectors — useful for downstream parametric tests. Helpers: - assert_vector_shape(vecs, dims) — len + finite check. - assert_unit_norm(vecs, tolerance) — caller-supplied tolerance; 5e-4 documented as safe for dims=384 under f32 epsilon × √dims. Tests: - cargo test -p kb-embed (no features): 2 reexport/dyn-dispatch tests. - cargo test -p kb-embed --features mock: 7 tests including 100-case proptest asserting len == dims, all finite, ‖v‖ ≈ 1.0 within tolerance, Doc(text) byte-equal Doc(text), Doc(text) ≠ Query(text), Doc(text1) ≠ Doc(text2). - All 220 workspace tests pass; clippy clean for both default and mock-on feature configurations. Symbol gating: nm on the release rlib confirms zero MockEmbedder symbols under default features; three trait impl symbols under --features mock. Spec invariant "release builds MUST NOT include MockEmbedder" verified at the symbol level. Allowed deps respected: kb-core, kb-config, serde, thiserror, tracing, plus anyhow (forced by trait return type) and blake3 (justified by the determinism contract; already in workspace lockfile via kb-core). No fastembed/ort/tokenizers anywhere. Out of scope: real adapter (p3-2), reranker traits (P+). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 08:15:44 +00:00
altair823	b335151d18	feat(p2-2): kb-search crate + LexicalRetriever (FTS5 + bm25) Adds the first concrete kb_core::Retriever, exercising chunks_fts (P2-1) to answer SearchMode::Lexical queries. Returns Vec<SearchHit> with bm25-derived ranking, snippet() previews, and W3C-fragment-style Citation built from the chunk's first source_spans entry. New crate kb-search: - LexicalRetriever::new(Arc<SqliteStore>, IndexVersion). - search() builds an FTS5 MATCH expression by escaping every whitespace token into a quoted literal (inner " doubled); single-quote-wrapped text passes through verbatim as raw FTS5 syntax. Empty query short-circuits to Ok(vec![]). - bm25 normalization: score = -bm25 / (1 + \|bm25\|), bounded (0, 1] for any FTS5-returned negative bm25. - Snippet via snippet(chunks_fts, 3, '', '', '…', word_budget) where word_budget = snippet_chars / 4 clamped to [1, 64]; trim_snippet enforces the char cap on the way out (chars per design §6.4 — accepts the combining-mark trade-off). - Citation from chunks.source_spans_json first span: Line / Page / Region / Time forwarded; Byte / empty array fall back to Line{1,1} with a tracing::warn so forward-compat regressions surface. - Filters: tags_any (subquery on document_tags), lang (= column), trust_min (CASE-rank in SQL) all applied at SQL level. path_glob uses globset with literal_separator(true) — guarantees '' does not cross '/' per spec Risks/notes — applied as Rust post-filter with +128 row over-fetch when set, then rank reassigned 1..k contiguously. - Determinism: ORDER BY score, f.chunk_id (lexicographic blake3 hex tiebreaker on identical bm25). Tested explicitly with two chunks of identical text content. - RetrievalDetail: method=Lexical, both lexical_score and fusion_score set, vector_ None. kb-store-sqlite: - Adds pub fn read_conn(&self) -> MutexGuard<'_, Connection>. Read-only contract is doc-only — kb-search needs MutexGuard for prepare_cached + iter, which a closure-scoped wrapper would awkwardly constrain. Closure variant left as a P3 follow-up. Tests (26 new): empty corpus, empty query, single hit + citation round-trip, snippet length cap, tags_any exclusion, lang+trust composition, path_glob with '' not crossing '/', citation line round- trip, bm25 top-1 ∈ (0, 1], determinism (varied scores AND identical- score tiebreaker), index_version passthrough, snapshot (crates/kb-search/tests/fixtures/search/lexical/run-1.json — stable under bundled SQLite; KB_UPDATE_SNAPSHOTS=1 to regenerate). Workspace: 211 tests pass, cargo clippy --workspace --all-targets -D warnings clean. Allowed deps respected: kb-core, kb-config, kb-store-sqlite, rusqlite, tracing, thiserror, anyhow (forced by trait return type), serde_json (parses _json TEXT columns), globset (path_glob '*' boundary). Out of scope (deferred): vector retriever (p3-3), hybrid fusion (p3-4), reranker (P+), Korean morphological tokenizer (P+). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 05:20:35 +00:00
altair823	111f40ddf0	p1-6: kb-store-sqlite test suite (8 categories) All 8 test categories from the task plan, plus a JobRepo subset: migration — tests/migration.rs: fresh DB after run_migrations exposes every required §5 table + index. unit (copy) — tests/asset_writer.rs: copy mode writes file with mode 0o644 + correct bytes. unit (ref) — tests/asset_writer.rs: reference mode does not write file; row records source path. unit (cs) — tests/asset_writer.rs: tampered checksum returns a Conflict-flavoured anyhow error. unit (idem) — tests/idempotency.rs: same put_document twice → 1 row, doc_version 1→2; tags re-derived. unit (rb) — tests/idempotency.rs: put_blocks with FK violation rolls back; pre-existing rows unchanged. contract — tests/contract_roundtrip.rs: drives kb-parse-md + kb-normalize + kb-chunk on fixtures/markdown/code-and-table.md, persists, then reloads via DocumentStore::get_document / get_chunk and asserts byte-equal round-trip. snapshot — tests/ingest_report_snapshot.rs + snapshots/ingest_report.snapshot.json: pin the wire JSON form of kb_core::IngestReport for an inline fixture run. jobs — tests/jobs.rs: create → progress → finish flow; error message round-trip; list filters on status/kind. Drops the unused `serde` direct dep from Cargo.toml; serde_json brings its own. Dev-deps confirmed via `cargo tree -p kb-store-sqlite --depth 1` to live only in the dev tree.	2026-04-30 17:13:03 +00:00
altair823	a3390d5171	p1-6: scaffold kb-store-sqlite crate + V001 full §5 DDL New workspace member crate `kb-store-sqlite` (allowed deps only: kb-core, kb-config, rusqlite[bundled], refinery, serde, serde_json, time, blake3, tracing, anyhow, thiserror; dev-deps add kb-parse-md / kb-normalize / kb-chunk for the contract round-trip test). Migration V001 replaces the P0-1 stub with the full §5 DDL (assets, documents, document_tags, blocks, chunks with policy_hash, embedding_records, jobs, ingest_runs, answers, eval_runs, eval_query_results) plus the §5 indexes. FTS5 virtual table + triggers remain deferred to V002 (P2-1). Public surface per task spec: SqliteStore::open / run_migrations / put_asset_with_bytes impl DocumentStore for SqliteStore (7 trait methods) impl JobRepo for SqliteStore (4 trait methods) StoreError { Sqlx, Migration, Conflict } Behavior: - Pragmas at open: foreign_keys=ON, journal_mode=WAL, synchronous=NORMAL, temp_store=MEMORY. - Asset writer: byte_len ≤ copy_threshold_mb * 1MiB → copy to data_dir/assets/<aa>/<asset_id> (mode 0o644 on Unix), else reference. blake3(bytes) verified against asset.checksum; mismatch → Conflict. - Idempotency: put_document UPSERTs and bumps doc_version + 1 on conflict; put_blocks / put_chunks DELETE-then-INSERT; document_tags re-derived per put_document. - get_document rehydrates blocks via payload_json ordered by stream ordinal. - list_documents builds dynamic WHERE from DocFilter (lang / trust_min / path_glob via GLOB / tags_any via document_tags subquery). - JobRepo: jobs.kind/status are stored as lowercase enum tags; create mints a 32-hex JobId via blake3(kind \|\| payload \|\| nanos). Tests follow in subsequent commits.	2026-04-30 17:08:36 +00:00
altair823	58f7b8573d	p1-5: add long-section fixture + Vec<Chunk> snapshot test Bakes the chunker output for fixtures/markdown/long-section.md (3 H1s, nested H2 under Alpha, a 50-line code block, a 3-col x 4-row table, and a multi-paragraph Gamma section) into the JSON snapshot baseline. Confirms the priority rules end-to-end: - Heading boundaries hold across H1 → H2 → H1 transitions - The code block emits one chunk at 427 tokens > target=200 - The table stays single-chunk - Gamma's paragraph stream splits with one block of overlap seed A second test runs the full parse → normalize → chunk pipeline 5 times and asserts identical chunk_ids each pass. Drops the unused `kb-config` and `serde` from regular dependencies — they were declared but no source path imports them; `serde` flows in transitively via `kb-core` as a public API requirement, and `ChunkingCfg` lives in `kb-config` but the chunker takes `ChunkPolicy` directly. Production deps are now exactly the allowed set actually used: anyhow, blake3, kb-core, serde_json_canonicalizer, tracing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:33:29 +00:00
altair823	8142449eb7	p1-5: scaffold kb-chunk crate with MdHeadingV1Chunker skeleton Adds the new workspace member with the bare Chunker impl shape: chunker_version() returns "md-heading-v1"; policy_hash() blake3-hashes canonical JSON of ChunkPolicy and truncates to 16 hex chars; chunk() is an empty stub the next commits fill in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:27:42 +00:00
altair823	e0df42984e	p1-4: address review I1-I3 + minors (extract attribution, audio-ref skip, NFC heading_path) I1: warning_agent maps ExtractFailed → "kb-parse-md" (the panic-recovery emitter in kb-parse-md/src/blocks.rs). Lift-stage warnings from build_canonical_document are tracked separately and attributed to "kb-normalize", so the I1 mapping change does not lie about kb-normalize-originated drops. I2: ParsedPayload::AudioRef no longer synthesizes Block::AudioRef with an invalid empty AssetId (would violate AssetId::from_str's 32-hex invariant). Block is dropped, Warning surfaces in Provenance with src mention, attributed to kb-normalize (lift-stage decision). TODO(P8) comment marks this as a placeholder until the audio extractor lands. I3: NFC-normalize each heading_path string in lift_block before feeding into id_for_block AND into CommonBlock.heading_path. pulldown-cmark does not NFC heading text and serde_json_canonicalizer v0.3 does not either, so canonically-equivalent NFD/NFC inputs would produce different block_ids without this normalization. Mirrors the existing doc_id NFC handling via to_posix. Minors: - M4: trim Cargo.toml — drop kb-config, serde_json_canonicalizer, blake3 (unused); keep tracing (now wired) + unicode-normalization (now used by I3). - M5: determinism_1000_iterations_under_1s now uses the same 5-block fixture as block_ordinals_scoped_per_heading_and_kind (extracted into fixture_blocks_five helper) so the determinism property is exercised on a real lift_block path, not just an empty Vec. Still < 1s. - M6: snapshot integration test now passes BodyHints { first_h1: Some("Code And Table"), .. } and asserts doc.title == "Code And Table" end-to-end. Baseline JSON updated. - M7: title/lang edge-case unit tests pin policy: empty string lifts to empty string; non-stringy values silently drop. Rustdoc updated. - M10: provenance_contains_stage_events_in_order asserts events[1].at == events[2].at to pin the shared-now_utc invariant. New tests (unit, kb-normalize): - provenance_with_extract_failed_warning_attributes_to_kb_parse_md (I1) - audio_ref_block_skipped_with_warning (I2) - nfc_nfd_korean_heading_path_same_block_id (I3) - title_empty_string_in_user_map_falls_back_to_default (M7) - title_non_string_in_user_map_silently_drops (M7) - lang_invalid_shape_silently_drops (M7) kb-normalize unit tests: 9 → 14. Integration snapshot: 1 (unchanged). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:41:50 +00:00
altair823	c0096ce44b	p1-4: scaffold kb-normalize crate Add the workspace member, `Cargo.toml` with the §8-allowed dep set (kb-core, kb-parse-types, kb-config, serde, serde_json_canonicalizer, blake3, unicode-normalization, time, anyhow, tracing) and a stubbed `build_canonical_document` that pins the public signature plus `doc_id` derivation. `kb-parse-md` is permitted only as a dev-dep so the integration snapshot test (added later in this series) can drive a fixture through the real parser without violating the production boundary — `cargo tree -p kb-normalize --depth 1 --edges normal` confirms no parser implementation appears in the regular dep tree. `id_for_doc` and `id_for_block` are re-exported from kb-core (which holds the canonical recipe per §4.2); kb-normalize is the canonical entry point per design §8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:16:53 +00:00
altair823	4e7e9cad87	p1-3: add parse_blocks (pulldown-cmark walker) submodule Implements `kb_parse_md::parse_blocks(body, body_offset_lines)` returning a flat `Vec<ParsedBlock>` plus warnings. Walks pulldown-cmark events through a small frame-based state machine that tracks heading paths, accumulates inline buffers (Text/Code/Link/Strong/Emph only — design §3.4), and reports SourceSpan::Line spans in 1-indexed file-line coordinates. Covers headings, paragraphs, code blocks (lang from info string), GFM tables (with malformed fallback to paragraph + MalformedTable warning), lists (nested sub-lists flattened into parent item), and block-level image references. Inline images are dropped silently per the inline filter. Adversarial inputs are caught with `catch_unwind` and degrade to an empty output + ExtractFailed warning. 15 unit tests cover heading-path correctness, code lang, table parsing, malformed-table fallback (driven via synthetic events since pulldown-cmark auto-normalizes table widths), LF/CRLF line-range parity, image refs, nested-list flattening, inline filter, and 100-iteration random-bytes plus hand-crafted adversarial-input no-panic guards.	2026-04-30 14:14:34 +00:00
altair823	a86b463fc4	p1-2: scaffold kb-parse-md crate Add the workspace member with the dep allow-list pinned by design §0 Q9 and the task spec. P1-2 will land the frontmatter submodule in the next commit; P1-3 will add the block parser as a sibling. Notable choice: serde_yaml (dtolnay) was archived as unmaintained in 2024 so we use serde_yaml_ng, the maintained fork. lingua's per-language features are explicitly enabled (default-features=false) to keep build time + binary size sane — only the languages we need at parse time.	2026-04-30 12:55:20 +00:00
altair823	7c75e10b2c	p1-1: scaffold kb-source-fs crate (FsSourceConnector) Walk config.workspace.root, apply gitignore-style filters (config.workspace.exclude ∪ .kbignore ∪ baked-in defaults for .DS_Store / ._*), stream BLAKE3 over each file, and emit a deterministic Vec<RawAsset> sorted by workspace_path. Modules: - hash: streaming blake3::Hasher + 64 KiB read buffer (no whole-file loads); pinned digests for empty input and "hello world". - media: extension → MediaType (markdown/pdf/image/audio/other). - walker: ignore::OverrideBuilder for filter union; walkdir with manual visited-set cycle protection on top of follow_links. - connector: public FsSourceConnector::new(&Config) + SourceConnector::scan(&SourceScope) impl. Uses kb_core::to_posix for WorkspacePath construction (carries P0-1 # rejection through unchanged) and kb_core::id_for_asset for AssetId derivation. Storage variant signals intent only; actual byte copy is P1-6's responsibility. Per design §3.3, §6.2, §6.6, §7.1, §7.2, §8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 12:27:34 +00:00
altair823	d2c8728095	p0-1: address review (drop unused thiserror dep, document kb-core reserve) - Cargo.toml: remove `thiserror` from kb-config, kb-parse-types, kb-app (unused — none of those crates' src trees reference thiserror; CoreError in kb-core is the only consumer). - kb-config keeps the `kb-core` dep with a one-line comment marking CoreError reserved for P1-* config-error wiring per the review thread. - ids.rs: switch `validate_hex32` from a hand-rolled `matches!` byte range to `is_ascii_hexdigit()` so the hex check is the canonical idiom (and satisfies `clippy::manual_is_ascii_check` under `-D warnings`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 08:55:39 +00:00
altair823	f86df99fe9	p0-1: workspace + kb-core domain types, traits, and ID recipe Stand up the Cargo workspace (Rust 2024 / resolver=3) with the kb-core crate per the frozen design (§3, §4, §7, §10). kb-core has zero deps on other kb-* crates and exposes: - Newtype IDs (AssetId / DocumentId / BlockId / ChunkId / EmbeddingId / IndexId) with Display + FromStr that reject anything but 32 lower-hex. - id_from + id_for_{asset,doc,block,chunk,embedding,index} per §4.2; pinned hex test values computed via an independent JCS+blake3 tool. - CanonicalDocument, Block (8 variants), SourceSpan, Inline (§3.4). - Citation (5 variants) with W3C Media Fragments to_uri / parse; round-trip property holds for every variant. - Metadata + Provenance (§3.6); SearchQuery / SearchHit / RetrievalDetail (§3.7); DocFilter / DocSummary mirrors of wire §2.5. - Answer / AnswerCitation / RefusalReason / ModelRef (§3.8). - IngestReport, JobRepo support types, VectorRecord / VectorHit. - Component traits (SourceConnector / Extractor / Chunker / Embedder / Retriever / LanguageModel / DocumentStore / VectorStore / JobRepo) plus their input helpers (SourceScope / ExtractContext / ChunkPolicy / EmbeddingInput / GenerateRequest / TokenChunk / FinishReason). - CoreError (§10). - nfc + to_posix helpers (§4.1, §6.6). 20 unit tests cover ID determinism (1000-run regression), key-order invariance, two pinned hex values, newtype rejection of bad input, Citation round-trip for all 5 variants, and to_posix collapsing + Korean NFC. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 05:16:37 +00:00

1 2 3

118 Commits