kebab

Author	SHA1	Message	Date
altair823	58a11cc2b8	feat(p5-1): kb-eval crate — golden-fixture runner + eval persistence - new kb-eval crate: load_golden_set (YAML) + run_eval (per-query search/ask + persistence) - new kb-store-sqlite::eval module: record_eval_run_with_results (transactional), document_exists / chunk_exists probes - fixtures/golden_queries.yaml: 5-entry KO+EN template - tests: 13 pass (loader: parse, dup-id, missing chunk_id; runner: elapsed, snapshot, error capture, JSONL, determinism, persistence, config_snapshot) - per_query.jsonl mirror written to runs_dir/<run_id>/ - temperature=0 + fixed seed → byte-identical per_query.jsonl (lexical) deviations from spec (documented in code): - run_id uses uuid::Uuid::now_v7().simple() (timestamp-ordered hex) instead of ULID — uuid already in workspace deps - load_golden_set_validated kept #[cfg(test)] pub(crate) — production inlines validate_against_db - snapshot fixture uses normalized projection (id/query/mode/first_hit) — full byte-determinism covered by separate test - index_version in config_snapshot left null (composed per call by kb-app, not config-level) deferred to follow-up: - App reuse across queries (currently rebuilds App per query) - expand_path hoist to kb-config (3 crate clones now) - --max-queries flag (deferred to P5-2 per updated spec) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 18:01:09 +00:00
altair823	e35b06d0d0	feat(p4-3): kb-rag crate — full RAG pipeline + kb-app::ask wired P4 terminal task. Implements the user-facing payoff: retrieve → score gate → pack → render → generate → cite-validate → persist. After this commit, `kb ask` actually works against an Ollama backend; the pipeline grounds the answer in retrieved chunks and refuses cleanly when the gate trips or the model self-judges. New crate kb-rag: - pub struct RagPipeline { retriever, llm, docs, config } — all Arc<dyn Trait + Send + Sync> so the pipeline shares + Sync. - pub fn ask(query, opts) -> Result<Answer> drives the nine-stage flow per spec §1. - pub struct AskOpts { k, explain, mode, temperature, seed, stream_sink: Option<mpsc::Sender<String>> }. k acts as a floor over config.search.default_k so a low-k caller can't starve retrieval (documented in field doc). Pipeline stages: 1. Retrieve via the injected dyn Retriever. 2. Score gate: empty hits → NoChunks refusal (no LLM call); top-1 < config.rag.score_gate → ScoreGate refusal (no LLM call) with top-3 candidates listed in the synthesized answer text. 3. Pack: budget = config.rag.max_context_tokens.saturating_sub (prompt overhead). Per-hit `[#n] doc=… heading=… span=…\n<text>` with deterministic enumeration. If every hit's chunk is unfetchable from the store (deleted between search and pack), fall back to NoChunks refusal with a tracing::warn rather than feeding an empty [근거] to the LLM. 4. Render rag-v1 prompt with the spec's verbatim Korean system string + `[질문]/[근거]` user template. 5. Generate via dyn LanguageModel. Single-thread token loop owns the iterator; tokens optionally forward to opts.stream_sink (a `mpsc::Sender<String>`). SendError silently dropped — caller cancellation never panics the pipeline. After Done the loop reads (acc, finish_reason, usage) in lockstep with no race. 
max_completion = llm.context_tokens().saturating_sub (used_for_input).max(64) — explicitly NOT capped by config.rag.max_context_tokens (that's the packing budget for [근거], not the LM completion ceiling). 6. Citation extract via STRICT regex `\[#(\d{1,3})\]` (compiled once via OnceLock). Loose forms `[1]`, `[ #1 ]`, `[#foo]`, `[#1234]`, `vec![1]` are all rejected to prevent prose false-positives. 7. Citation validate covers four cases: - unknown marker (e.g. `[#7]` when only 3 packed) → LlmSelfJudge refusal. - empty answer with hits → LlmSelfJudge. - non-empty + no marker + matches `근거 (가\|이) 부족` regex → LlmSelfJudge (model self-refused with the canonical phrase; phrase match logged via tracing::debug for observability). - non-empty + no marker + no refusal phrase → LlmSelfJudge (silent ungrounded answers are still refusals). - non-empty + ≥1 valid marker → grounded = true. 8. Build Answer per kb_core::Answer shape: - citations: filter packed list to exactly the markers cited. Wire format `marker: Some("[1]")` (square-bracketed bare index) per design §2.3, distinct from the prompt-side `[#n]` grammar. - embedding ModelRef: read from config.models.embedding for Vector/Hybrid; None for Lexical. Documented deviation since the Retriever trait doesn't expose the embedder. For ScoreGate/NoChunks refusals on Vector/Hybrid the embedding model is still recorded — the vector retriever WAS consulted even when the gate tripped. - TraceId minted as `ret_<8-hex>` from blake3(query, top_score, model_id, ns). - retrieval AnswerRetrievalSummary populated. - usage from the final Done chunk; latency_ms wall-clock fallback when the LLM reports zero. - created_at OffsetDateTime::now_utc(). 9. Persist via SqliteStore::put_answer (new inherent method on SqliteStore, not on the DocumentStore trait — answers aren't documents and adding to kb-core was forbidden). Always inserts, refusals included. packed_chunks_json is null unless opts.explain == true. 
kb-store-sqlite extension: - pub fn put_answer(&Answer, query, packed_chunks_json) -> Result<AnswerId>. Maps all 22 fields of the answers table per V001 schema in a single INSERT under a transaction. kb-app::ask wired: - bail!("not yet wired (P4-3)") replaced with a real body that builds the retriever per opts.mode (Lexical \| Vector \| Hybrid), instantiates OllamaLanguageModel from config, constructs RagPipeline, calls pipeline.ask. AskOpts moves to kb-rag and is re-exported via `pub use kb_rag::AskOpts` so kb-cli's `use kb_app::AskOpts` keeps working. - kb-app/Cargo.toml gains kb-rag, kb-llm, kb-llm-local. P3-5's forbids on these are lifted by P4-3 spec — kb-app is the orchestrator and ask requires both the trait crate and the Ollama adapter. - kb-cli/main.rs's AskOpts literal updated with stream_sink: None for the CLI path (TUI in P9 will plumb a real sink). Tests (kb-rag: 18; kb-app: 1 ignored): - 3 unit in src/pipeline.rs: marker regex strictness (rejects all loose forms with byte-equal expectations), Send+Sync compile check, embedding_ref_for behavior across modes. - 15 integration in tests/pipeline.rs covering every spec test row + the new "all chunks unfetchable falls back to NoChunks" guard: empty-hits, score-gate, grounded happy path, unknown-marker, prose-`[1]` rejection, `vec![1]` rejection, refusal-phrase, packing-budget overflow, streaming-forwards-to-mpsc, dropped- receiver-no-panic, usage-from-final-Done, answers-row-inserted- for-each-refusal-kind, determinism temp=0 seed=0, Answer JSON shape, unfetchable-chunks-fall-back-to-no-chunks (the new M3 test). - kb-app/tests/ask_smoke.rs: 1 #[ignore]'d real-Ollama smoke that drives the wired ask end-to-end against `localhost:11434`. Workspace: 319 passed / 26 ignored / 0 failed. cargo clippy --workspace --all-targets -- -D warnings clean. 
Allowed deps respected (kb-core, kb-config, kb-search, kb-llm, kb-store-sqlite, serde, serde_json, regex, time, tracing, thiserror) plus forced waivers anyhow (Retriever / LanguageModel trait return types) and blake3 (TraceId minting). Forbidden (kb-source-fs, kb-parse-md, kb-normalize, kb-chunk, kb-store-vector direct, kb-embed* direct, kb-llm-local direct, kb-tui, kb-desktop) all absent from `cargo tree -p kb-rag` — concrete adapters reach the pipeline only through trait objects. Out of scope: reranker between retrieve and pack (P+), multi-turn chat memory (P+), LLM-as-judge eval (P5 uses rule-based must_contain), --json streaming (buffers per §0 Q5 hybrid). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 15:06:10 +00:00
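The strict marker grammar and the citation-validation cases described in the p4-3 message can be sketched as follows. This is a std-only illustration, not the kb-rag code: the hand-rolled scanner stands in for the compiled regex `\[#(\d{1,3})\]`, `Verdict` is a hypothetical stand-in for the real refusal types, and treating markers as 1-based indices into the packed list is an assumption.

```rust
/// Extract strict `[#n]` markers (1 to 3 ASCII digits) from an answer.
/// Hand-rolled equivalent of the regex `\[#(\d{1,3})\]`; loose forms such as
/// `[1]`, `[ #1 ]`, `[#foo]`, `[#1234]` and `vec![1]` yield nothing.
fn extract_markers(answer: &str) -> Vec<u32> {
    let bytes = answer.as_bytes();
    let mut markers = Vec::new();
    let mut i = 0;
    while i + 1 < bytes.len() {
        if bytes[i] == b'[' && bytes[i + 1] == b'#' {
            let start = i + 2;
            let mut end = start;
            // Consume at most three digits.
            while end < bytes.len() && end - start < 3 && bytes[end].is_ascii_digit() {
                end += 1;
            }
            // At least one digit, immediately followed by the closing bracket.
            if end > start && end < bytes.len() && bytes[end] == b']' {
                markers.push(answer[start..end].parse::<u32>().unwrap());
                i = end + 1;
                continue;
            }
        }
        i += 1;
    }
    markers
}

/// Hypothetical verdict type; the real pipeline builds an `Answer`.
#[derive(Debug, PartialEq)]
enum Verdict {
    Grounded(Vec<u32>),
    LlmSelfJudge,
}

/// `packed` is the number of chunks packed into the prompt.
/// Assumption: markers are 1-based, so valid values are 1..=packed.
fn validate(answer: &str, packed: u32) -> Verdict {
    let markers = extract_markers(answer);
    let all_known = markers.iter().all(|&n| n >= 1 && n <= packed);
    if answer.trim().is_empty() || markers.is_empty() || !all_known {
        // Empty answer, no marker (with or without the refusal phrase),
        // or an unknown marker: every case is a refusal.
        Verdict::LlmSelfJudge
    } else {
        Verdict::Grounded(markers)
    }
}
```

Under these assumptions, `validate("see [#7]", 3)` refuses because only three chunks were packed, while `validate("see [#2]", 3)` is grounded.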
altair823	3e38a9bcb4	feat(p4-2): kb-llm-local crate — Ollama HTTP adapter (reqwest::blocking) First real LanguageModel implementation. Wraps Ollama's local HTTP API at POST {endpoint}/api/generate with stream:true, parses the NDJSON streaming response into TokenChunk events, and maps Ollama error states to a thiserror-derived LlmError with actionable hints. Synchronous trait surface; reqwest::blocking handles the HTTP I/O. Public surface: - pub struct OllamaLanguageModel - pub fn new(config: &Config) -> Result<Self> — lazy connect; never hits the network. Spec line 96. - pub enum LlmError { Unreachable, ModelNotPulled, Timeout, Stream, Malformed }. Lives in this crate per spec — kb-core / kb-llm stay free of error taxonomy. - impl kb_core::LanguageModel via re-export from kb-llm. Streaming: - POST body shape per spec §11.2: model, prompt = system + "\n\n" + user, stream: true, options { temperature, seed, num_ctx, stop }. - OllamaStream owns BufReader<reqwest::blocking::Response>, reads NDJSON lines via read_until(b'\n'), parses each as {response, done, done_reason?, prompt_eval_count?, eval_count?, total_duration?}. Token frame → TokenChunk::Token; done frame → TokenChunk::Done { finish_reason, usage }. - done_reason mapping: "length" → Length, "abort" → Aborted, "stop" / missing / unknown → Stop (forward-compat with future Ollama tags). - Missing prompt_eval_count / eval_count default to 0 + tracing::warn (do NOT fail). Spec line 135. - EOF without a done line synthesizes Done { Aborted, zeros } so downstream pipelines never deadlock waiting for a terminal frame. - UTF-8: line-delimited framing means each JSON line is a complete UTF-8 sequence — no cross-HTTP-chunk codepoint splits to worry about. read_until accumulates whole lines regardless of how the underlying reqwest body chunks. Error mapping (LlmError): - reqwest::Error::is_connect() → Unreachable { endpoint, source } with hint "ensure `ollama serve` is running and reachable at <endpoint>". 
- reqwest::Error::is_timeout() → Timeout. - 200 with non-NDJSON first line (e.g., transparent-proxy HTML error page) → Stream(truncated body) — distinguished from Malformed by the iterator's has_emitted flag. - 404 with body containing model_id (case-insensitive) OR English "model" + "not found" → ModelNotPulled(model_id) with hint "ollama pull <model_id>". Tightened beyond spec to survive Ollama localizing the error message (Korean / Japanese / etc.) while keeping the original English-substring fallback. - Other 4xx/5xx → Stream(truncated body). - Mid-stream JSON parse failure (after at least one valid line) → Malformed(line). Truncate all error bodies to 512 chars (chars-based, multibyte safe) so an nginx 500 page can't blow up the diagnostic. - Trailing slash in endpoint stripped before formatting the URL — endpoint = "http://x:1234/" produces .../api/generate, not .../api//generate. Pinned by trailing-slash test. Tokio note: reqwest 0.12's blocking feature internally wraps a private current-thread tokio runtime, so cargo tree --edges normal shows tokio. The auditable invariant is "no top-level tokio dep + no async surface exposed to callers" — verified: src/ has zero async/await/tokio::*. default-features = false drops default-tls (rustls only) but does NOT drop tokio. Documented honestly in Cargo.toml + lib.rs. Switching to ureq would remove tokio entirely; deferred since reqwest is the spec's allowed dep. Tests (24 total: 23 default + 1 ignored): - 7 unit in src/ollama.rs: prompt-build, options-build, finish- reason mapping, truncate_body bounds (under_cap / over_cap_marker / multibyte_chars_not_bytes), 404+model-id heuristic. - 3 in tests/construction.rs: ModelRef shape, context_tokens passthrough, lazy-connect proven via port-1 pointing. 
- 13 in tests/streaming.rs: streamed tokens then Done, multibyte chars within a line round-trip (renamed from "split across chunks" to honestly reflect what's tested), Unreachable-with-hint, 4xx→Stream, 404→ModelNotPulled, concat-equals-canned, done_reason length / abort, missing eval counts default to zero, missing done_reason defaults to Stop, determinism-by-mock, trailing-slash endpoint, non-NDJSON 200 body → Stream not Malformed. - 1 #[ignore] in tests/integration.rs: real Ollama on localhost:11434 with the configured model. Opt-in via cargo test -p kb-llm-local -- --ignored after `ollama serve` + `ollama pull`. Workspace: 288 passed / 25 ignored / 0 failed. cargo clippy --workspace --all-targets -- -D warnings clean. No native-tls, no openssl in the dep graph. Allowed deps respected: kb-core, kb-config, kb-llm, reqwest 0.12 (default-features=false; blocking, json, rustls-tls), serde, serde_json, tracing, thiserror plus anyhow (forced by trait return type). wiremock + tokio in [dev-dependencies] only. Out of scope: llama.cpp / candle adapters (P+), Ollama embed endpoint (separate adapter inside kb-embed-local if requested), cancellation / abort tokens (P+), connection-pool tuning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 14:28:34 +00:00
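Three of the small rules in the p4-2 message (done_reason mapping, trailing-slash handling, char-based body truncation) are easy to show in isolation. The sketch below is std-only and illustrative: `FinishReason` is a local stand-in for the kb-core type, the function names are hypothetical, and the over-cap marker mentioned in the tests is not modeled.

```rust
/// Local stand-in for the finish-reason enum re-exported by kb-llm.
#[derive(Debug, PartialEq, Clone, Copy)]
enum FinishReason {
    Stop,
    Length,
    Aborted,
}

/// "length" -> Length, "abort" -> Aborted; "stop", a missing field, or any
/// unknown tag falls back to Stop.
fn map_done_reason(reason: Option<&str>) -> FinishReason {
    match reason {
        Some("length") => FinishReason::Length,
        Some("abort") => FinishReason::Aborted,
        _ => FinishReason::Stop,
    }
}

/// Build the generate URL without producing `/api//generate`.
/// Note: this strips every trailing slash, not just one.
fn generate_url(endpoint: &str) -> String {
    format!("{}/api/generate", endpoint.trim_end_matches('/'))
}

/// Truncate by chars rather than bytes so multibyte text is never cut
/// inside a code point.
fn truncate_body(body: &str, cap: usize) -> String {
    body.chars().take(cap).collect()
}
```

With the 512-char cap from the commit, `truncate_body(&body, 512)` keeps an error page short enough for a diagnostic.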
altair823	27c669fbf9	feat(p4-1): kb-llm crate — LanguageModel trait re-export + MockLanguageModel Establishes the kb-llm trait crate so concrete LLM adapters (p4-2 Ollama, future llama.cpp / candle) target a stable surface. Pure re-export of kb_core::{LanguageModel, GenerateRequest, TokenChunk, FinishReason, TokenUsage, ModelRef} plus a feature-gated deterministic mock for downstream RAG tests (p4-3) that need an LLM trait object without an Ollama dependency. MockLanguageModel (cfg(feature = "mock"), default OFF): - Holds canned_response + canned_finish + canned_usage + (model_id, provider, context_tokens). Pure in-memory; no I/O. - generate_stream() honors GenerateRequest.stop: scans every non-empty stop string against the canned response, takes the earliest byte position (Iterator::min returns the first equal element on ties so declaration order in req.stop wins), truncates with a direct byte-slice (str::find returns a UTF-8 char boundary by contract). - When a stop matches, finish_reason is overridden to Stop (matches OpenAI / Ollama real-world behaviour); otherwise the caller's canned_finish passes through verbatim. - Emits one TokenChunk::Token per Unicode scalar value (char), NOT per grapheme cluster — Hangul jamo, emoji ZWJ sequences, combining marks split. Acceptable for trait-shape testing; real adapters MAY combine. Documented in module docs. - Always terminates with TokenChunk::Done { finish_reason, usage } even if the canned response is empty. The returned iterator is a boxed Vec<TokenChunk>::into_iter().map(Ok), trivially Send. - Real adapters MAY return Err from generate_stream itself (e.g. connection refused) before any chunk is yielded; the mock never does. Documented for the trait re-exporter consumer audience. Helpers: - assert_finish_chunk(chunks) — asserts the last chunk is a Done. Useful for proptests asserting trait contract over random inputs. Tests: - cargo test -p kb-llm (no features): 2 reexport / dyn-dispatch tests. 
- cargo test -p kb-llm --features mock: 9 tests including 100-case proptest over random Unicode strings asserting Done terminator, char-count == streamed Token chunks, concat == canned (truncated by stop), plus explicit cases for stop-string truncation, first-stop-match precedence, model_ref dimensions=None invariant, finish reason pass-through. - All 271 workspace tests pass; clippy clean for both default and mock-on feature configurations. Symbol gating verified: - cargo build --release -p kb-llm (default): nm shows zero MockLanguageModel symbols. - cargo build --release -p kb-llm --features mock: three trait-impl symbols present. Spec invariant "release builds MUST NOT include MockLanguageModel" enforced at the symbol level. Allowed deps respected: only kb-core (path) and anyhow (workspace, forced by trait return type). Dropped kb-config / serde / thiserror / tracing from the spec's allowed list — they are listed as Allowed but nothing in this skeleton crate references them, and dropping them keeps the dependency graph slim for downstream consumers. p4-2/p4-3 will add what they need at their own dep sites. Forbidden deps (reqwest, ureq, tokio, whisper-rs, kb-source-fs, kb-parse-md, kb-normalize, kb-chunk, kb-store-*, kb-embed*, kb-search, kb-rag, kb-tui, kb-desktop) all absent from cargo tree -p kb-llm. Out of scope: real adapter (p4-2 Ollama), token counting against the real tokenizer, server-side cancellation / abort signals (P+). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 13:37:46 +00:00
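The stop-string rule of the mock (earliest byte position wins, empty stops ignored, truncation by direct byte slice) can be sketched without the trait machinery. `truncate_at_stop` is a hypothetical helper name; the boolean it returns models "a stop matched, so finish_reason becomes Stop".

```rust
/// Return the canned response truncated at the earliest stop string, plus a
/// flag saying whether any stop matched.
fn truncate_at_stop<'a>(canned: &'a str, stops: &[&str]) -> (&'a str, bool) {
    let earliest = stops
        .iter()
        .filter(|s| !s.is_empty()) // empty stop strings never match
        .filter_map(|s| canned.find(*s)) // byte offset of the first occurrence
        .min();
    match earliest {
        // `str::find` returns a char-boundary offset, so slicing is safe.
        Some(pos) => (&canned[..pos], true),
        None => (canned, false),
    }
}
```

For example, `truncate_at_stop("hello world END tail", &["END", "world"])` cuts at `world` because it occurs earlier in the text, even though `END` is declared first.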
altair823	3cd5117a7e	feat(p3-3): kb-store-vector — LanceDB VectorStore + V003 embedding status First VectorStore implementation. Per-model Lance tables under config.storage.vector_dir, two-phase upsert (SQLite-pending → Lance MergeInsert → SQLite-committed) with crash-safe retry, search via cosine distance with the spec's score-shift (preserves negative similarity ranking signal that clamping would crush). V003 migration: - Adds status (CHECK constraint pending\|committed\|tombstone, default pending) and vector_committed columns to embedding_records. - BEFORE DELETE trigger on chunks flips dependent rows to tombstone. Currently overshadowed by V001's ON DELETE CASCADE FK; trigger UPDATE runs first then row vanishes via CASCADE. Spec-faithful tombstone preservation requires recreating embedding_records to drop the CASCADE — deferred to a P+ migration since no production rows exist yet (P3-3 is the first writer). V003 SQL comment explains. LanceVectorStore: - ensure_table is idempotent: opens existing or creates with the Arrow schema (chunk_id, doc_id, embedding FixedSizeList<Float32, dim>, model_id, embedding_version, text, heading_path, created_at). - IndexId computed via id_for_index with collection="chunk_embeddings", index_kind="flat", params_hash = blake3(descriptor JSON). Schema bumps automatically rotate the IndexId. - upsert: phase-1 INSERT OR REPLACE INTO embedding_records (status= 'pending') in a single SQLite tx; phase-2 Lance MergeInsert keyed on chunk_id (idempotent re-run); phase-3 UPDATE status='committed', vector_committed=1. If phase-2 fails the rows stay 'pending' and the next upsert call retries idempotently. - search joins embedding_records WHERE status='committed' so partial- write rows never surface. Cosine distance from Lance ∈ [0, 2] → similarity = 1 - distance ∈ [-1, 1] → score = (similarity + 1)/2 ∈ [0, 1]. NaN coerced to 0 with tracing::warn. Filter by SearchFilters via SqliteStore::filter_chunks (added in this commit). 
- Sync trait + async LanceDB bridged by an embedded current-thread tokio runtime. Doc-comment on the struct flags the "do NOT call from inside another tokio runtime" panic (block_on cannot nest). kb-app's job scheduler is sync today. kb-store-sqlite additions: - pub fn put_embedding_records_pending(&[EmbeddingRecordRow]) — phase-1 INSERT OR REPLACE (status='pending', vector_committed=0). - pub fn mark_embedding_records_committed(&[EmbeddingId]) — phase-3 single UPDATE … WHERE embedding_id IN (?, ?, …) via params_from_iter, guarded by WHERE status='pending' so tombstones don't get clobbered. - pub fn filter_chunks(&[ChunkId], &SearchFilters) → Vec<ChunkId> consolidates the JOIN against documents/document_tags/ embedding_records + path_glob via globset. Lets kb-store-vector honor SearchFilters without depending on rusqlite or globset directly. (kb-search's filter logic is structurally different — interleaved with the FTS5 SELECT — so it stays as-is for now; consolidation is a P+ refactor.) - 4 new unit tests cover the phase-1 round-trip, empty batch, replay reset of pending rows, and the WHERE-status-pending guard. Tests: - 9 lib unit tests in kb-store-vector covering paths/sanitization, arrow_batch dim validation + descriptor hash, bm25-style cosine score shift math. - 4 new kb-store-sqlite unit tests on filter_chunks (committed-only, tags/lang/trust/path_glob, order preservation, empty input). - 4 new kb-store-sqlite unit tests on the embedding_records helpers. - 8 integration tests in upsert_search.rs and 1 snapshot test marked #[ignore = "requires AVX-capable hardware (LanceDB)"]. They invoke require_avx_or_panic() at the top of each body so a missing-AVX --ignored run fails loudly instead of silently passing. This dev host (qemu64 model) lacks AVX so these were NOT exercised end-to- end here — first CI lane on AVX hardware will validate them. - Snapshot fixture tests/fixtures/vector/run-1.json is a placeholder with an _comment marker. 
Snapshot test panics until the placeholder is replaced via KB_UPDATE_SNAPSHOTS=1 on AVX hardware. - Workspace 241 passed, 19 ignored, 0 failed; cargo clippy --workspace --all-targets -- -D warnings clean. Allowed deps respected (kb-core, kb-config, kb-store-sqlite, lancedb, arrow + arrow-array + arrow-schema, serde, serde_json, tracing, thiserror) plus forced waivers — anyhow (trait return type), tokio + futures (LanceDB async-only API), blake3 (params_hash). rusqlite and globset are NOT direct deps of kb-store-vector — confirmed via cargo metadata --no-deps. rusqlite stays in [dev-dependencies] for the test fixture seeder only. Out of scope: IVF/PQ index tuning (P+), image vectors (P6), kb-app embed_index orchestration (P3-4 facade). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 10:01:31 +00:00
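The score shift described for search in the p3-3 message is a two-step affine map. A minimal sketch, assuming a hypothetical function name and omitting the tracing::warn the commit attaches to the NaN case:

```rust
/// Map a cosine distance in [0, 2] to a score in [0, 1]:
/// similarity = 1 - distance, score = (similarity + 1) / 2.
/// NaN is coerced to 0.
fn score_from_cosine_distance(distance: f32) -> f32 {
    if distance.is_nan() {
        return 0.0;
    }
    let similarity = 1.0 - distance;
    (similarity + 1.0) / 2.0
}
```

Because the map is strictly decreasing in distance, two hits with negative similarity (say distances 1.5 and 1.75) keep distinct scores (0.25 and 0.125), whereas clamping similarity at zero would give both the same score.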
altair823	bcbe2b8531	feat(p3-2): kb-embed-local crate — fastembed adapter for multilingual-e5-small First real Embedder implementation. Wraps fastembed-rs (ONNX runtime) with the e5 prefix convention, batching, and {data_dir}/${XDG_DATA_HOME} template expansion so model files land under config.storage.model_dir/fastembed/ without polluting kb-config's public API. Public surface: - pub struct FastembedEmbedder - pub fn new(config: &Config) -> Result<Self> - impl kb_core::Embedder (via kb-embed re-export) Behavior: - Default model multilingual-e5-small (384 dims). model_id and model_version come from config.models.embedding.{model,version}. - Pre-load dim check via TextEmbedding::get_model_info: dim mismatch bails before paying the ~470MB ONNX init cost. - e5 prefix applied BEFORE tokenization: "passage: " for EmbeddingKind::Document, "query: " for EmbeddingKind::Query. Pinned by prefix_input unit tests. - Batches inputs into chunks of config.models.embedding.batch_size, concatenates results in input order. - L2 normalization is performed by fastembed 4.9's default transformer pipeline (verified at fastembed/src/text_embedding/output.rs:43); we skip re-normalization. Integration test pins ‖v‖ ≈ 1.0 ± 1e-3 so a future fastembed bump that drops this invariant fails loudly. - Synchronous (no async runtime). Mutex serializes calls into the underlying ONNX session — conservative; ORT Session is Send+Sync but callers (kb-app indexer) batch sequentially anyway. Revisit if profiling shows contention. - First-run model download surfaces via tracing::info before/after TextEmbedding::try_new — users no longer stare at a silent 30-60s pause during the 470MB pull. Tests: - 11 default-lane tests covering: check_dim match/mismatch (no model load), prefix_input Document/Query/empty, resolve_model known/unknown, expand_path substitution + no-op + XDG_DATA_HOME set + XDG_DATA_HOME unset (falls back to ~/.local/share with recursive ~ expansion). 
XDG tests serialize on a Mutex + RAII guard since edition 2024 makes set_var/remove_var unsafe. - 7 #[ignore] integration tests covering: full construction with default config, dim-mismatch belt-and-braces, Document vs Query cosine differential, L2 unit norm, byte-equal determinism, batch-64 performance under 5s, snapshot-hash stability over a 5-sentence multilingual fixture. - Snapshot test fails LOUDLY when SNAPSHOT_HASH_BASELINE is 0 — prints the captured hash and panics with paste-back instructions, so first --ignored run forces the maintainer to pin the baseline rather than silently passing. - Workspace: 222 tests pass (default lane); clippy clean. Allowed deps respected: kb-config, kb-embed (re-exports kb-core trait surface), fastembed = "4.9", tracing, anyhow. tokenizers and ort enter transitively through fastembed; reqwest/hyper/hf-hub also transitive (model download is fastembed's responsibility per spec carve-out). No direct kb-core dep needed — re-exports cover it. Pinned to fastembed 4.x rather than the recent 5.x to limit blast radius; consider bump when p3-3 (lancedb-store) consumes the embedder output shape. Out of scope: reranker (P+), Ollama embedding endpoint, candle adapter, image embeddings (P6). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 08:39:38 +00:00
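The e5 prefix convention and the order-preserving batching from the p3-2 message can be illustrated without fastembed. In this sketch `EmbeddingKind` is a local stand-in for the kb-core enum, `embed_batched` is a hypothetical helper, and the `embed` closure stands in for the real ONNX call.

```rust
/// Local stand-in for kb_core::EmbeddingKind.
#[derive(Clone, Copy)]
enum EmbeddingKind {
    Document,
    Query,
}

/// Apply the e5 prefix before the text reaches the tokenizer.
fn prefix_input(kind: EmbeddingKind, text: &str) -> String {
    let prefix = match kind {
        EmbeddingKind::Document => "passage: ",
        EmbeddingKind::Query => "query: ",
    };
    format!("{prefix}{text}")
}

/// Feed `inputs` to `embed` in chunks of `batch_size` and concatenate the
/// results in input order. A batch size of 0 is treated as 1 (assumption)
/// because `chunks(0)` would panic.
fn embed_batched<F>(inputs: &[String], batch_size: usize, mut embed: F) -> Vec<Vec<f32>>
where
    F: FnMut(&[String]) -> Vec<Vec<f32>>,
{
    inputs
        .chunks(batch_size.max(1))
        .flat_map(|batch| embed(batch))
        .collect()
}
```

Three inputs with a batch size of 2 produce two calls to `embed` (sizes 2 and 1), and the output vectors line up one-to-one with the inputs.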
altair823	2e3eb8f437	feat(p3-1): kb-embed crate — Embedder trait re-export + MockEmbedder Establishes the kb-embed trait crate so concrete embedding adapters (p3-2 fastembed, future ollama-embed/candle) target a stable surface. Pure re-export of kb_core::{Embedder, EmbeddingInput, EmbeddingKind, EmbeddingModelId, EmbeddingVersion} plus a feature-gated deterministic mock for downstream tests. MockEmbedder (cfg(feature = "mock"), default OFF): - Per-component hash recipe: blake3(seed_le8 \|\| kind_byte \|\| text_len_le8 \|\| text \|\| i_le8). Length-prefixed text avoids the domain-separation ambiguity where two (text, i) pairs could shift bytes between text tail and the i field. - Document = 0u8, Query = 1u8 — same text different kind yields different vectors (mirrors e5 prefix behaviour). - Per component: blake3 first 8 bytes → u64 → reinterpret as i64 → f64/i64::MAX → f32. i64::MIN gives -1.0000000000000002 which f32 rounds to -1.0; range [-1, 1] holds. - L2 unit-normalised. Norm sums in f64 (avoid catastrophic precision loss) before f32 cast. Zero-norm guard skips the divide. - with_seed(...) constructor lets two embedders share identity but produce different vectors — useful for downstream parametric tests. Helpers: - assert_vector_shape(vecs, dims) — len + finite check. - assert_unit_norm(vecs, tolerance) — caller-supplied tolerance; 5e-4 documented as safe for dims=384 under f32 epsilon × √dims. Tests: - cargo test -p kb-embed (no features): 2 reexport/dyn-dispatch tests. - cargo test -p kb-embed --features mock: 7 tests including 100-case proptest asserting len == dims, all finite, ‖v‖ ≈ 1.0 within tolerance, Doc(text) byte-equal Doc(text), Doc(text) ≠ Query(text), Doc(text1) ≠ Doc(text2). - All 220 workspace tests pass; clippy clean for both default and mock-on feature configurations. Symbol gating: nm on the release rlib confirms zero MockEmbedder symbols under default features; three trait impl symbols under --features mock. 
Spec invariant "release builds MUST NOT include MockEmbedder" verified at the symbol level. Allowed deps respected: kb-core, kb-config, serde, thiserror, tracing, plus anyhow (forced by trait return type) and blake3 (justified by the determinism contract; already in workspace lockfile via kb-core). No fastembed/ort/tokenizers anywhere. Out of scope: real adapter (p3-2), reranker traits (P+). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 08:15:44 +00:00
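The numeric half of the MockEmbedder recipe (u64 to a value in [-1, 1], then L2 normalization with an f64 accumulator and a zero-norm guard) is sketched below. The blake3 hashing step is left out so the example stays std-only; both function names are hypothetical.

```rust
/// Reinterpret 8 hash bytes (already read as a u64) as i64 and scale into
/// [-1, 1] by dividing by i64::MAX in f64 before the final f32 cast.
fn component_from_u64(bits: u64) -> f32 {
    let signed = bits as i64; // two's-complement reinterpretation
    (signed as f64 / i64::MAX as f64) as f32
}

/// L2-normalize in place. The norm is summed in f64 to limit precision
/// loss; a zero vector is left untouched instead of dividing by zero.
fn l2_normalize(v: &mut [f32]) {
    let norm = v
        .iter()
        .map(|&x| (x as f64) * (x as f64))
        .sum::<f64>()
        .sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x = (*x as f64 / norm) as f32;
        }
    }
}
```

Applying `component_from_u64` to each component's hash prefix and then `l2_normalize` to the whole vector gives the unit-norm property that `assert_unit_norm` checks.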
altair823	b335151d18	feat(p2-2): kb-search crate + LexicalRetriever (FTS5 + bm25) Adds the first concrete kb_core::Retriever, exercising chunks_fts (P2-1) to answer SearchMode::Lexical queries. Returns Vec<SearchHit> with bm25-derived ranking, snippet() previews, and W3C-fragment-style Citation built from the chunk's first source_spans entry. New crate kb-search: - LexicalRetriever::new(Arc<SqliteStore>, IndexVersion). - search() builds an FTS5 MATCH expression by escaping every whitespace token into a quoted literal (inner " doubled); single-quote-wrapped text passes through verbatim as raw FTS5 syntax. Empty query short-circuits to Ok(vec![]). - bm25 normalization: score = -bm25 / (1 + \|bm25\|), bounded (0, 1] for any FTS5-returned negative bm25. - Snippet via snippet(chunks_fts, 3, '', '', '…', word_budget) where word_budget = snippet_chars / 4 clamped to [1, 64]; trim_snippet enforces the char cap on the way out (chars per design §6.4 — accepts the combining-mark trade-off). - Citation from chunks.source_spans_json first span: Line / Page / Region / Time forwarded; Byte / empty array fall back to Line{1,1} with a tracing::warn so forward-compat regressions surface. - Filters: tags_any (subquery on document_tags), lang (= column), trust_min (CASE-rank in SQL) all applied at SQL level. path_glob uses globset with literal_separator(true) — guarantees '*' does not cross '/' per spec Risks/notes — applied as Rust post-filter with +128 row over-fetch when set, then rank reassigned 1..k contiguously. - Determinism: ORDER BY score, f.chunk_id (lexicographic blake3 hex tiebreaker on identical bm25). Tested explicitly with two chunks of identical text content. - RetrievalDetail: method=Lexical, both lexical_score and fusion_score set, vector_* None. kb-store-sqlite: - Adds pub fn read_conn(&self) -> MutexGuard<'_, Connection>. Read-only contract is doc-only — kb-search needs MutexGuard for prepare_cached + iter, which a closure-scoped wrapper would awkwardly constrain. 
Closure variant left as a P3 follow-up. Tests (26 new): empty corpus, empty query, single hit + citation round-trip, snippet length cap, tags_any exclusion, lang+trust composition, path_glob with '*' not crossing '/', citation line round-trip, bm25 top-1 ∈ (0, 1], determinism (varied scores AND identical-score tiebreaker), index_version passthrough, snapshot (crates/kb-search/tests/fixtures/search/lexical/run-1.json — stable under bundled SQLite; KB_UPDATE_SNAPSHOTS=1 to regenerate). Workspace: 211 tests pass, cargo clippy --workspace --all-targets -D warnings clean. Allowed deps respected: kb-core, kb-config, kb-store-sqlite, rusqlite, tracing, thiserror, anyhow (forced by trait return type), serde_json (parses *_json TEXT columns), globset (path_glob '*' boundary). Out of scope (deferred): vector retriever (p3-3), hybrid fusion (p3-4), reranker (P+), Korean morphological tokenizer (P+). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 05:20:35 +00:00
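Two pieces of the p2-2 message lend themselves to a small example: quoting each whitespace token for FTS5 MATCH, and the bm25 normalization. The sketch is std-only and the function names are hypothetical. Joining the quoted tokens with a space is an assumption (the commit does not say how tokens are combined), and the raw single-quote passthrough is not modeled.

```rust
/// Quote every whitespace-separated token as an FTS5 string literal,
/// doubling inner double quotes. Returns None for an empty query so the
/// caller can short-circuit to an empty result.
fn build_match_expr(query: &str) -> Option<String> {
    let tokens: Vec<String> = query
        .split_whitespace()
        .map(|t| format!("\"{}\"", t.replace('"', "\"\"")))
        .collect();
    if tokens.is_empty() {
        None
    } else {
        Some(tokens.join(" ")) // assumption: space-joined tokens
    }
}

/// score = -bm25 / (1 + |bm25|). FTS5 reports better matches as more
/// negative bm25, so negative inputs map to positive scores below 1.
fn normalize_bm25(bm25: f64) -> f64 {
    -bm25 / (1.0 + bm25.abs())
}
```

A bm25 of -1.0 maps to 0.5 and -3.0 maps to 0.75, so a stronger match still ranks higher after normalization.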
altair823	a3390d5171	p1-6: scaffold kb-store-sqlite crate + V001 full §5 DDL New workspace member crate `kb-store-sqlite` (allowed deps only: kb-core, kb-config, rusqlite[bundled], refinery, serde, serde_json, time, blake3, tracing, anyhow, thiserror; dev-deps add kb-parse-md / kb-normalize / kb-chunk for the contract round-trip test). Migration V001 replaces the P0-1 stub with the full §5 DDL (assets, documents, document_tags, blocks, chunks with policy_hash, embedding_records, jobs, ingest_runs, answers, eval_runs, eval_query_results) plus the §5 indexes. FTS5 virtual table + triggers remain deferred to V002 (P2-1). Public surface per task spec: SqliteStore::open / run_migrations / put_asset_with_bytes impl DocumentStore for SqliteStore (7 trait methods) impl JobRepo for SqliteStore (4 trait methods) StoreError { Sqlx, Migration, Conflict } Behavior: - Pragmas at open: foreign_keys=ON, journal_mode=WAL, synchronous=NORMAL, temp_store=MEMORY. - Asset writer: byte_len ≤ copy_threshold_mb * 1MiB → copy to data_dir/assets/<aa>/<asset_id> (mode 0o644 on Unix), else reference. blake3(bytes) verified against asset.checksum; mismatch → Conflict. - Idempotency: put_document UPSERTs and bumps doc_version + 1 on conflict; put_blocks / put_chunks DELETE-then-INSERT; document_tags re-derived per put_document. - get_document rehydrates blocks via payload_json ordered by stream ordinal. - list_documents builds dynamic WHERE from DocFilter (lang / trust_min / path_glob via GLOB / tags_any via document_tags subquery). - JobRepo: jobs.kind/status are stored as lowercase enum tags; create mints a 32-hex JobId via blake3(kind \|\| payload \|\| nanos). Tests follow in subsequent commits.	2026-04-30 17:08:36 +00:00
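The asset writer's copy-or-reference decision from the p1-6 message reduces to one comparison. In the sketch below `Storage`, `choose_storage`, and `asset_path` are hypothetical names, and deriving the `<aa>` shard directory from the first two characters of the asset id is an assumption; the commit only gives the layout `data_dir/assets/<aa>/<asset_id>`.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for the storage variant recorded on an asset.
#[derive(Debug, PartialEq)]
enum Storage {
    Copy,
    Reference,
}

/// byte_len <= copy_threshold_mb * 1 MiB -> copy into the data dir,
/// otherwise keep a reference to the original file.
fn choose_storage(byte_len: u64, copy_threshold_mb: u64) -> Storage {
    let limit = copy_threshold_mb.saturating_mul(1024 * 1024);
    if byte_len <= limit {
        Storage::Copy
    } else {
        Storage::Reference
    }
}

/// Build data_dir/assets/<aa>/<asset_id>.
/// Assumption: <aa> is the first two characters of the asset id.
fn asset_path(data_dir: &Path, asset_id: &str) -> Option<PathBuf> {
    let shard = asset_id.get(..2)?; // None for ids shorter than two bytes
    Some(data_dir.join("assets").join(shard).join(asset_id))
}
```

With a 1 MiB threshold, a file of exactly 1,048,576 bytes is copied and one byte more is referenced.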
altair823	8142449eb7	p1-5: scaffold kb-chunk crate with MdHeadingV1Chunker skeleton Adds the new workspace member with the bare Chunker impl shape: chunker_version() returns "md-heading-v1"; policy_hash() blake3-hashes canonical JSON of ChunkPolicy and truncates to 16 hex chars; chunk() is an empty stub the next commits fill in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:27:42 +00:00
altair823	c0096ce44b	p1-4: scaffold kb-normalize crate Add the workspace member, `Cargo.toml` with the §8-allowed dep set (kb-core, kb-parse-types, kb-config, serde, serde_json_canonicalizer, blake3, unicode-normalization, time, anyhow, tracing) and a stubbed `build_canonical_document` that pins the public signature plus `doc_id` derivation. `kb-parse-md` is permitted only as a dev-dep so the integration snapshot test (added later in this series) can drive a fixture through the real parser without violating the production boundary — `cargo tree -p kb-normalize --depth 1 --edges normal` confirms no parser implementation appears in the regular dep tree. `id_for_doc` and `id_for_block` are re-exported from kb-core (which holds the canonical recipe per §4.2); kb-normalize is the canonical entry point per design §8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:16:53 +00:00
altair823	a86b463fc4	p1-2: scaffold kb-parse-md crate Add the workspace member with the dep allow-list pinned by design §0 Q9 and the task spec. P1-2 will land the frontmatter submodule in the next commit; P1-3 will add the block parser as a sibling. Notable choice: serde_yaml (dtolnay) was archived as unmaintained in 2024 so we use serde_yaml_ng, the maintained fork. lingua's per-language features are explicitly enabled (default-features=false) to keep build time + binary size sane — only the languages we need at parse time.	2026-04-30 12:55:20 +00:00
altair823	7c75e10b2c	p1-1: scaffold kb-source-fs crate (FsSourceConnector) Walk config.workspace.root, apply gitignore-style filters (config.workspace.exclude ∪ .kbignore ∪ baked-in defaults for .DS_Store / ._*), stream BLAKE3 over each file, and emit a deterministic Vec<RawAsset> sorted by workspace_path. Modules: - hash: streaming blake3::Hasher + 64 KiB read buffer (no whole-file loads); pinned digests for empty input and "hello world". - media: extension → MediaType (markdown/pdf/image/audio/other). - walker: ignore::OverrideBuilder for filter union; walkdir with manual visited-set cycle protection on top of follow_links. - connector: public FsSourceConnector::new(&Config) + SourceConnector::scan(&SourceScope) impl. Uses kb_core::to_posix for WorkspacePath construction (carries P0-1 # rejection through unchanged) and kb_core::id_for_asset for AssetId derivation. Storage variant signals intent only; actual byte copy is P1-6's responsibility. Per design §3.3, §6.2, §6.6, §7.1, §7.2, §8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 12:27:34 +00:00
altair823	f86df99fe9	p0-1: workspace + kb-core domain types, traits, and ID recipe Stand up the Cargo workspace (Rust 2024 / resolver=3) with the kb-core crate per the frozen design (§3, §4, §7, §10). kb-core has zero deps on other kb-* crates and exposes: - Newtype IDs (AssetId / DocumentId / BlockId / ChunkId / EmbeddingId / IndexId) with Display + FromStr that reject anything but 32 lower-hex. - id_from + id_for_{asset,doc,block,chunk,embedding,index} per §4.2; pinned hex test values computed via an independent JCS+blake3 tool. - CanonicalDocument, Block (8 variants), SourceSpan, Inline (§3.4). - Citation (5 variants) with W3C Media Fragments to_uri / parse; round-trip property holds for every variant. - Metadata + Provenance (§3.6); SearchQuery / SearchHit / RetrievalDetail (§3.7); DocFilter / DocSummary mirrors of wire §2.5. - Answer / AnswerCitation / RefusalReason / ModelRef (§3.8). - IngestReport, JobRepo support types, VectorRecord / VectorHit. - Component traits (SourceConnector / Extractor / Chunker / Embedder / Retriever / LanguageModel / DocumentStore / VectorStore / JobRepo) plus their input helpers (SourceScope / ExtractContext / ChunkPolicy / EmbeddingInput / GenerateRequest / TokenChunk / FinishReason). - CoreError (§10). - nfc + to_posix helpers (§4.1, §6.6). 20 unit tests cover ID determinism (1000-run regression), key-order invariance, two pinned hex values, newtype rejection of bad input, Citation round-trip for all 5 variants, and to_posix collapsing + Korean NFC. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 05:16:37 +00:00
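
The p0-1 entry above describes newtype IDs whose Display + FromStr reject anything but 32 lower-hex characters. A minimal sketch of that contract, using only the standard library, might look like the following. It is an illustration under stated assumptions, not kb-core's actual code: the error type `IdParseError` and its variants are hypothetical names, and the real crate may validate or store the value differently.

```rust
use std::fmt;
use std::str::FromStr;

/// Stand-in for one of the ID newtypes named in the commit (DocumentId).
/// The inner representation here is an assumption made for illustration.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct DocumentId(String);

/// Hypothetical error type; the commit does not say how rejection is reported.
#[derive(Debug, PartialEq, Eq)]
pub enum IdParseError {
    WrongLength(usize),
    InvalidChar(char),
}

impl FromStr for DocumentId {
    type Err = IdParseError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Only 0-9 and lowercase a-f are accepted; uppercase hex is rejected.
        if let Some(c) = s.chars().find(|c| !matches!(c, '0'..='9' | 'a'..='f')) {
            return Err(IdParseError::InvalidChar(c));
        }
        // Every character is ASCII at this point, so byte length equals
        // character count.
        if s.len() != 32 {
            return Err(IdParseError::WrongLength(s.len()));
        }
        Ok(DocumentId(s.to_owned()))
    }
}

impl fmt::Display for DocumentId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Display emits the same 32 lower-hex characters that FromStr accepts,
        // so parse followed by to_string round-trips.
        f.write_str(&self.0)
    }
}
```

Parsing a valid 32-character lower-hex string and calling `to_string()` on the result returns the input unchanged, while uppercase or short inputs fail at the `parse::<DocumentId>()` call rather than deeper in the pipeline.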

14 Commits