f9714aa5cbf06ce2166d9301b646f706d5deea3e
23 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
| 911fb49550 |
refactor(rename): kb crates → kebab — Cargo packages, folders, Rust modules
First step of renaming the project from `kb` to `kebab`.
- workspace `Cargo.toml`: members `crates/kb-*` → `crates/kebab-*`, repository URL `altair823/kb` → `altair823/kebab`.
- Renamed the 18 crate folders via `git mv` (preserves history).
- Each crate's `Cargo.toml`: `name = "kb-*"` → `"kebab-*"`, path deps `../kb-*` → `../kebab-*`.
- All `.rs` files: the 18 snake-case `kb_<id>` module paths (`kb_core`, `kb_config`, `kb_app`, `kb_cli`, `kb_eval`, `kb_search`, `kb_chunk`, `kb_normalize`, `kb_source_fs`, `kb_parse_md`, `kb_parse_types`, `kb_store_sqlite`, `kb_store_vector`, `kb_embed`, `kb_embed_local`, `kb_llm`, `kb_llm_local`, `kb_rag`) → `kebab_<id>` in one bulk sed pass (using the \\b word boundary so the "kb" abbreviation inside English sentences is left untouched).
The CLI binary name (`[[bin]] name = "kb"`), the `KB_*` environment variables, XDG paths, tracing targets, and the docs sweep follow in the next commit.
## Verification
- `cargo check --workspace` clean; committed only after every crate built.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| d9a5b88d27 |
feat(p5-2): kb-eval metrics + compare — AggregateMetrics, CompareReport, kb eval CLI
Implements P5-2. On top of the stored eval_runs / eval_query_results:
- `kb_eval::metrics`: computes hit@k / MRR / recall@k_doc /
citation_coverage / groundedness / empty_result_rate /
refusal_correctness. NaN metrics (zero denominator) serialize as JSON
null. Values are rounded to 4 decimals, and Deserialize is added so
aggregate_json round-trips.
- `kb_eval::compare`: compares two runs → CompareReport (per-metric Δ +
per-query Win/Loss/Draw/Regression). On chunker_version drift it falls
back gracefully to doc-id matching (chunker_version_match:
"fallback_doc"); with the `strict` option it refuses instead.
- `render_report_md`: human-readable Markdown (aggregates +
Wins/Losses/Regressions tables).
- Adds `SqliteStore::{load_eval_run, load_eval_query_results,
update_eval_run_aggregate}` plus owned `EvalRunRecord` /
`EvalQueryResultRecord`; the write-side borrow shape is unchanged.
- `kb eval` CLI: `run` (delegates to P5-1), `aggregate <id>`,
`compare <a> <b> [--strict-chunker-version] [--write-report]`. `--json`
emits the raw CompareReport; the default output is Markdown.
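The metric rules above (zero denominator becomes null, values rounded to 4 decimals) can be sketched roughly as follows. This is only an illustration: `QueryOutcome`, `hit_at_k`, `mrr`, and `round4` are hypothetical names, not the crate's actual API, and ranks are assumed to be 1-based.

```rust
/// Hypothetical per-query outcome: 1-based rank of the first relevant hit,
/// or None when no relevant hit was returned.
pub struct QueryOutcome {
    pub first_relevant_rank: Option<usize>,
}

/// Round to 4 decimal places before storing an aggregate.
fn round4(x: f64) -> f64 {
    (x * 10_000.0).round() / 10_000.0
}

/// hit@k: fraction of queries whose first relevant hit is within the top k.
/// Returns None for an empty run (zero denominator) so that a serializer
/// can emit JSON null instead of NaN.
pub fn hit_at_k(outcomes: &[QueryOutcome], k: usize) -> Option<f64> {
    if outcomes.is_empty() {
        return None;
    }
    let hits = outcomes
        .iter()
        .filter(|o| matches!(o.first_relevant_rank, Some(r) if r <= k))
        .count();
    Some(round4(hits as f64 / outcomes.len() as f64))
}

/// MRR: mean of 1/rank over all queries, counting a miss as 0.
pub fn mrr(outcomes: &[QueryOutcome]) -> Option<f64> {
    if outcomes.is_empty() {
        return None;
    }
    let sum: f64 = outcomes
        .iter()
        .map(|o| o.first_relevant_rank.map_or(0.0, |r| 1.0 / r as f64))
        .sum();
    Some(round4(sum / outcomes.len() as f64))
}
```

Returning `Option<f64>` rather than `f64::NAN` keeps the "no data" case explicit at the type level, which maps directly onto a JSON null.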
## Spec deviations (intentional, documented)
- Graceful matching is doc-id-only (chunker_version_match:
"fallback_doc"). The 50% span-overlap rule is deferred to P6+ because
keeping both sides' chunks around after a chunker re-index is not
realistic.
- Adds `*_with_config` helpers: integration tests drive them with a
TempDir Config. The no-arg forms delegate to Config::load(None).
- The CLI wires kb-cli → kb-eval directly (avoids a kb-app cycle). The
DoD's "via kb-app" wording was meant to keep a single facade, but it
would create a cycle.
- Adds `AggregateMetrics: Deserialize` for the aggregate_json round-trip.
## Verification
- `cargo test -p kb-eval` 30/30 (15 unit + 2 loader + 8 metrics+compare
integration + 7 runner). One of the 8 integration tests is a snapshot
(`compare-1.json`).
- `cargo test -p kb-store-sqlite` 33/33.
- `cargo clippy --workspace --all-targets -- -D warnings` clean.
- Forbidden imports absent (`kb-source-fs|kb-parse|kb-normalize|kb-chunk|
kb-store-vector|kb-embed|kb-search|kb-llm|kb-rag|kb-tui|kb-desktop|
kb-app`; kb-app is absent from the metrics/compare modules and is used
only by the runner).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| 58a11cc2b8 |
feat(p5-1): kb-eval crate — golden-fixture runner + eval persistence
- new kb-eval crate: load_golden_set (YAML) + run_eval (per-query
search/ask + persistence)
- new kb-store-sqlite::eval module: record_eval_run_with_results
(transactional), document_exists / chunk_exists probes
- fixtures/golden_queries.yaml: 5-entry KO+EN template
- tests: 13 pass (loader: parse, dup-id, missing chunk_id; runner:
elapsed, snapshot, error capture, JSONL, determinism, persistence,
config_snapshot)
- per_query.jsonl mirror written to runs_dir/<run_id>/
- temperature=0 + fixed seed → byte-identical per_query.jsonl (lexical)
deviations from spec (documented in code):
- run_id uses uuid::Uuid::now_v7().simple() (timestamp-ordered hex)
instead of ULID, since uuid is already in workspace deps
- load_golden_set_validated kept #[cfg(test)] pub(crate); production
inlines validate_against_db
- snapshot fixture uses normalized projection
(id/query/mode/first_hit); full byte-determinism covered by separate
test
- index_version in config_snapshot left null (composed per call by
kb-app, not config-level)
deferred to follow-up:
- App reuse across queries (currently rebuilds App per query)
- expand_path hoist to kb-config (3 crate clones now)
- --max-queries flag (deferred to P5-2 per updated spec)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| e35b06d0d0 |
feat(p4-3): kb-rag crate — full RAG pipeline + kb-app::ask wired
P4 terminal task. Implements the user-facing payoff: retrieve →
score gate → pack → render → generate → cite-validate → persist.
After this commit, `kb ask` actually works against an Ollama
backend; the pipeline grounds the answer in retrieved chunks and
refuses cleanly when the gate trips or the model self-judges.
New crate kb-rag:
- pub struct RagPipeline { retriever, llm, docs, config }: all
Arc<dyn Trait + Send + Sync>, so the pipeline itself is Send + Sync.
- pub fn ask(query, opts) -> Result<Answer> drives the nine-stage
flow per spec §1.
- pub struct AskOpts { k, explain, mode, temperature, seed,
stream_sink: Option<mpsc::Sender<String>> }. k acts as a floor
over config.search.default_k so a low-k caller can't starve
retrieval (documented in field doc).
Pipeline stages:
1. Retrieve via the injected dyn Retriever.
2. Score gate: empty hits → NoChunks refusal (no LLM call); top-1 <
config.rag.score_gate → ScoreGate refusal (no LLM call) with
top-3 candidates listed in the synthesized answer text.
3. Pack: budget = config.rag.max_context_tokens.saturating_sub
(prompt overhead). Per-hit `[#n] doc=… heading=… span=…\n<text>`
with deterministic enumeration. If every hit's chunk is
unfetchable from the store (deleted between search and pack),
fall back to NoChunks refusal with a tracing::warn rather than
feeding an empty [근거] to the LLM.
4. Render rag-v1 prompt with the spec's verbatim Korean system
string + `[질문]/[근거]` user template.
5. Generate via dyn LanguageModel. Single-thread token loop owns
the iterator; tokens optionally forward to opts.stream_sink (a
`mpsc::Sender<String>`). SendError silently dropped — caller
cancellation never panics the pipeline. After Done the loop
reads (acc, finish_reason, usage) in lockstep with no race.
max_completion = llm.context_tokens().saturating_sub
(used_for_input).max(64) — explicitly NOT capped by
config.rag.max_context_tokens (that's the packing budget for
[근거], not the LM completion ceiling).
6. Citation extract via STRICT regex `\[#(\d{1,3})\]` (compiled
once via OnceLock). Loose forms `[1]`, `[ #1 ]`, `[#foo]`,
`[#1234]`, `vec![1]` are all rejected to prevent prose
false-positives.
7. Citation validate covers five cases:
- unknown marker (e.g. `[#7]` when only 3 packed) →
LlmSelfJudge refusal.
- empty answer with hits → LlmSelfJudge.
- non-empty + no marker + matches `근거 (가|이) 부족` regex →
LlmSelfJudge (model self-refused with the canonical phrase;
phrase match logged via tracing::debug for observability).
- non-empty + no marker + no refusal phrase → LlmSelfJudge
(silent ungrounded answers are still refusals).
- non-empty + ≥1 valid marker → grounded = true.
8. Build Answer per kb_core::Answer shape:
- citations: filter packed list to exactly the markers cited.
Wire format `marker: Some("[1]")` (square-bracketed bare
index) per design §2.3, distinct from the prompt-side
`[#n]` grammar.
- embedding ModelRef: read from config.models.embedding for
Vector/Hybrid; None for Lexical. Documented deviation since
the Retriever trait doesn't expose the embedder. For
ScoreGate/NoChunks refusals on Vector/Hybrid the embedding
model is still recorded — the vector retriever WAS consulted
even when the gate tripped.
- TraceId minted as `ret_<8-hex>` from blake3(query, top_score,
model_id, ns).
- retrieval AnswerRetrievalSummary populated.
- usage from the final Done chunk; latency_ms wall-clock
fallback when the LLM reports zero.
- created_at OffsetDateTime::now_utc().
9. Persist via SqliteStore::put_answer (new inherent method on
SqliteStore, not on the DocumentStore trait — answers aren't
documents and adding to kb-core was forbidden). Always inserts,
refusals included. packed_chunks_json is null unless
opts.explain == true.
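Stages 6 and 7 can be illustrated with a small std-only sketch. It is not the crate's code: the commit uses a compiled regex, whereas this version swaps in a hand-rolled scanner that accepts ASCII digits only. `extract_markers`, `validate`, and `Verdict` are hypothetical names, and treating markers as 1-based (so `[#0]` counts as unknown) is an assumption.

```rust
/// Extract strict `[#n]` markers (1 to 3 ASCII digits) from model output.
/// Hand-rolled stand-in for the regex `\[#(\d{1,3})\]`.
pub fn extract_markers(text: &str) -> Vec<u32> {
    let bytes = text.as_bytes();
    let mut out = Vec::new();
    let mut i = 0;
    while i + 1 < bytes.len() {
        if bytes[i] == b'[' && bytes[i + 1] == b'#' {
            let start = i + 2;
            let mut end = start;
            // Consume at most three digits.
            while end < bytes.len() && end - start < 3 && bytes[end].is_ascii_digit() {
                end += 1;
            }
            // Require at least one digit followed immediately by `]`.
            if end > start && end < bytes.len() && bytes[end] == b']' {
                // The slice holds only ASCII digits, so parsing cannot fail.
                out.push(text[start..end].parse().unwrap());
                i = end + 1;
                continue;
            }
        }
        i += 1;
    }
    out
}

/// Hypothetical outcome of citation validation.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    /// At least one valid marker; carries the distinct cited indices.
    Grounded(Vec<u32>),
    /// Any refusal case from stage 7.
    LlmSelfJudge,
}

/// Validate an answer against `packed`, the number of packed chunks.
pub fn validate(answer: &str, packed: u32) -> Verdict {
    let mut markers = extract_markers(answer);
    // Unknown marker, e.g. `[#7]` when only 3 chunks were packed.
    // Assumption: markers are 1-based, so `[#0]` is also unknown.
    if markers.iter().any(|&m| m == 0 || m > packed) {
        return Verdict::LlmSelfJudge;
    }
    // Empty answer, refusal phrase, or silent ungrounded answer:
    // all of them carry no valid marker and are refusals.
    if markers.is_empty() {
        return Verdict::LlmSelfJudge;
    }
    markers.sort_unstable();
    markers.dedup();
    Verdict::Grounded(markers)
}
```

Because every no-marker case ends in the same refusal, the refusal-phrase match described in stage 7 only matters for logging, which is why this sketch omits it.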
kb-store-sqlite extension:
- pub fn put_answer(&Answer, query, packed_chunks_json) ->
Result<AnswerId>. Maps all 22 fields of the answers table per
V001 schema in a single INSERT under a transaction.
kb-app::ask wired:
- bail!("not yet wired (P4-3)") replaced with a real body that
builds the retriever per opts.mode (Lexical | Vector | Hybrid),
instantiates OllamaLanguageModel from config, constructs
RagPipeline, calls pipeline.ask. AskOpts moves to kb-rag and is
re-exported via `pub use kb_rag::AskOpts` so kb-cli's
`use kb_app::AskOpts` keeps working.
- kb-app/Cargo.toml gains kb-rag, kb-llm, kb-llm-local. P3-5's
forbids on these are lifted by P4-3 spec — kb-app is the
orchestrator and ask requires both the trait crate and the
Ollama adapter.
- kb-cli/main.rs's AskOpts literal updated with stream_sink: None
for the CLI path (TUI in P9 will plumb a real sink).
Tests (kb-rag: 18; kb-app: 1 ignored):
- 3 unit in src/pipeline.rs: marker regex strictness (rejects all
loose forms with byte-equal expectations), Send+Sync compile
check, embedding_ref_for behavior across modes.
- 15 integration in tests/pipeline.rs covering every spec test row
+ the new "all chunks unfetchable falls back to NoChunks" guard:
empty-hits, score-gate, grounded happy path, unknown-marker,
prose-`[1]` rejection, `vec![1]` rejection, refusal-phrase,
packing-budget overflow, streaming-forwards-to-mpsc, dropped-
receiver-no-panic, usage-from-final-Done, answers-row-inserted-
for-each-refusal-kind, determinism temp=0 seed=0, Answer JSON
shape, unfetchable-chunks-fall-back-to-no-chunks (the new
M3 test).
- kb-app/tests/ask_smoke.rs: 1 #[ignore]'d real-Ollama smoke that
drives the wired ask end-to-end against `localhost:11434`.
Workspace: 319 passed / 26 ignored / 0 failed. cargo clippy
--workspace --all-targets -- -D warnings clean.
Allowed deps respected (kb-core, kb-config, kb-search, kb-llm,
kb-store-sqlite, serde, serde_json, regex, time, tracing,
thiserror) plus forced waivers anyhow (Retriever / LanguageModel
trait return types) and blake3 (TraceId minting). Forbidden
(kb-source-fs, kb-parse-md, kb-normalize, kb-chunk, kb-store-
vector direct, kb-embed* direct, kb-llm-local direct, kb-tui,
kb-desktop) all absent from `cargo tree -p kb-rag` — concrete
adapters reach the pipeline only through trait objects.
Out of scope: reranker between retrieve and pack (P+), multi-turn
chat memory (P+), LLM-as-judge eval (P5 uses rule-based
must_contain), --json streaming (buffers per §0 Q5 hybrid).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| 3e38a9bcb4 |
feat(p4-2): kb-llm-local crate — Ollama HTTP adapter (reqwest::blocking)
First real LanguageModel implementation. Wraps Ollama's local HTTP
API at POST {endpoint}/api/generate with stream:true, parses the
NDJSON streaming response into TokenChunk events, and maps Ollama
error states to a thiserror-derived LlmError with actionable hints.
Synchronous trait surface; reqwest::blocking handles the HTTP I/O.
Public surface:
- pub struct OllamaLanguageModel
- pub fn new(config: &Config) -> Result<Self> — lazy connect; never
hits the network. Spec line 96.
- pub enum LlmError { Unreachable, ModelNotPulled, Timeout, Stream,
Malformed }. Lives in this crate per spec — kb-core / kb-llm stay
free of error taxonomy.
- impl kb_core::LanguageModel via re-export from kb-llm.
Streaming:
- POST body shape per spec §11.2: model, prompt = system + "\n\n" +
user, stream: true, options { temperature, seed, num_ctx, stop }.
- OllamaStream owns BufReader<reqwest::blocking::Response>, reads
NDJSON lines via read_until(b'\n'), parses each as
{response, done, done_reason?, prompt_eval_count?, eval_count?,
total_duration?}. Token frame → TokenChunk::Token; done frame →
TokenChunk::Done { finish_reason, usage }.
- done_reason mapping: "length" → Length, "abort" → Aborted,
"stop" / missing / unknown → Stop (forward-compat with future
Ollama tags).
- Missing prompt_eval_count / eval_count default to 0 + tracing::warn
(do NOT fail). Spec line 135.
- EOF without a done line synthesizes Done { Aborted, zeros } so
downstream pipelines never deadlock waiting for a terminal frame.
- UTF-8: line-delimited framing means each JSON line is a complete
UTF-8 sequence — no cross-HTTP-chunk codepoint splits to worry
about. read_until accumulates whole lines regardless of how the
underlying reqwest body chunks.
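The done_reason mapping and the synthesized terminal frame can be sketched as below. This is a simplified illustration under stated assumptions: `Frame`, `TokenChunk`, and `FinishReason` here are local stand-ins for the real kb-core types (usage counters omitted), frames are assumed to be already parsed from NDJSON, and `to_chunks` is a hypothetical helper.

```rust
/// Local stand-in for the real finish-reason type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FinishReason {
    Stop,
    Length,
    Aborted,
}

/// Local stand-in for one parsed NDJSON line.
pub struct Frame {
    pub response: String,
    pub done: bool,
    pub done_reason: Option<String>,
}

/// Local stand-in for the streamed event type (usage omitted).
#[derive(Debug, PartialEq)]
pub enum TokenChunk {
    Token(String),
    Done { finish_reason: FinishReason },
}

/// "length" and "abort" map to their own variants; "stop", a missing
/// value, or any unknown tag maps to Stop for forward compatibility.
pub fn map_done_reason(done_reason: Option<&str>) -> FinishReason {
    match done_reason {
        Some("length") => FinishReason::Length,
        Some("abort") => FinishReason::Aborted,
        _ => FinishReason::Stop,
    }
}

/// Convert frames to chunks, always ending with exactly one Done.
pub fn to_chunks(frames: impl IntoIterator<Item = Frame>) -> Vec<TokenChunk> {
    let mut out = Vec::new();
    for frame in frames {
        if frame.done {
            let finish_reason = map_done_reason(frame.done_reason.as_deref());
            out.push(TokenChunk::Done { finish_reason });
            return out;
        }
        out.push(TokenChunk::Token(frame.response));
    }
    // EOF without a done frame: synthesize a terminal chunk so that
    // consumers never wait forever for the end of the stream.
    out.push(TokenChunk::Done { finish_reason: FinishReason::Aborted });
    out
}
```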
Error mapping (LlmError):
- reqwest::Error::is_connect() → Unreachable { endpoint, source }
with hint "ensure `ollama serve` is running and reachable at
<endpoint>".
- reqwest::Error::is_timeout() → Timeout.
- 200 with non-NDJSON first line (e.g., transparent-proxy HTML
error page) → Stream(truncated body) — distinguished from
Malformed by the iterator's has_emitted flag.
- 404 with body containing model_id (case-insensitive) OR English
"model" + "not found" → ModelNotPulled(model_id) with hint
"ollama pull <model_id>". Tightened beyond spec to survive
Ollama localizing the error message (Korean / Japanese / etc.)
while keeping the original English-substring fallback.
- Other 4xx/5xx → Stream(truncated body).
- Mid-stream JSON parse failure (after at least one valid line) →
Malformed(line). Truncate all error bodies to 512 chars
(chars-based, multibyte safe) so an nginx 500 page can't blow up
the diagnostic.
- Trailing slash in endpoint stripped before formatting the URL —
endpoint = "http://x:1234/" produces .../api/generate, not
.../api//generate. Pinned by trailing-slash test.
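Two of the smaller rules above, char-based truncation of error bodies and trailing-slash stripping, are easy to get subtly wrong, so a minimal sketch may help. The function names and the `…` truncation marker are assumptions for illustration; the commit only says that a marker exists.

```rust
/// Maximum number of characters kept from an error body.
const MAX_BODY_CHARS: usize = 512;

/// Truncate by characters, not bytes, so a multibyte body is never
/// cut in the middle of a code point.
pub fn truncate_body(body: &str) -> String {
    // `nth(MAX_BODY_CHARS)` yields the byte offset of the first char
    // past the cap, if the body is longer than the cap.
    match body.char_indices().nth(MAX_BODY_CHARS) {
        Some((byte_idx, _)) => format!("{}…", &body[..byte_idx]),
        None => body.to_string(),
    }
}

/// Build the generate URL without producing a double slash.
pub fn generate_url(endpoint: &str) -> String {
    format!("{}/api/generate", endpoint.trim_end_matches('/'))
}
```

Slicing at an offset taken from `char_indices` is always a valid UTF-8 boundary, which is what makes the truncation multibyte safe.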
Tokio note: reqwest 0.12's blocking feature internally wraps a
private current-thread tokio runtime, so cargo tree --edges normal
shows tokio. The auditable invariant is "no top-level tokio dep +
no async surface exposed to callers" — verified: src/ has zero
async/await/tokio::*. default-features = false drops default-tls
(rustls only) but does NOT drop tokio. Documented honestly in
Cargo.toml + lib.rs. Switching to ureq would remove tokio
entirely; deferred since reqwest is the spec's allowed dep.
Tests (24 total: 23 default + 1 ignored):
- 7 unit in src/ollama.rs: prompt-build, options-build, finish-
reason mapping, truncate_body bounds (under_cap / over_cap_marker
/ multibyte_chars_not_bytes), 404+model-id heuristic.
- 3 in tests/construction.rs: ModelRef shape, context_tokens
passthrough, lazy-connect proven via port-1 pointing.
- 13 in tests/streaming.rs: streamed tokens then Done, multibyte
chars within a line round-trip (renamed from "split across
chunks" to honestly reflect what's tested), Unreachable-with-
hint, 4xx→Stream, 404→ModelNotPulled, concat-equals-canned,
done_reason length / abort, missing eval counts default to zero,
missing done_reason defaults to Stop, determinism-by-mock,
trailing-slash endpoint, non-NDJSON 200 body → Stream not
Malformed.
- 1 #[ignore] in tests/integration.rs: real Ollama on
localhost:11434 with the configured model. Opt-in via
cargo test -p kb-llm-local -- --ignored after `ollama serve`
+ `ollama pull`.
Workspace: 288 passed / 25 ignored / 0 failed. cargo clippy
--workspace --all-targets -- -D warnings clean. No native-tls,
no openssl in the dep graph.
Allowed deps respected: kb-core, kb-config, kb-llm, reqwest 0.12
(default-features=false; blocking, json, rustls-tls), serde,
serde_json, tracing, thiserror plus anyhow (forced by trait return
type). wiremock + tokio in [dev-dependencies] only.
Out of scope: llama.cpp / candle adapters (P+), Ollama embed
endpoint (separate adapter inside kb-embed-local if requested),
cancellation / abort tokens (P+), connection-pool tuning.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| 27c669fbf9 |
feat(p4-1): kb-llm crate — LanguageModel trait re-export + MockLanguageModel
Establishes the kb-llm trait crate so concrete LLM adapters (p4-2
Ollama, future llama.cpp / candle) target a stable surface. Pure re-
export of kb_core::{LanguageModel, GenerateRequest, TokenChunk,
FinishReason, TokenUsage, ModelRef} plus a feature-gated deterministic
mock for downstream RAG tests (p4-3) that need an LLM trait object
without an Ollama dependency.
MockLanguageModel (cfg(feature = "mock"), default OFF):
- Holds canned_response + canned_finish + canned_usage + (model_id,
provider, context_tokens). Pure in-memory; no I/O.
- generate_stream() honors GenerateRequest.stop: scans every non-empty
stop string against the canned response, takes the earliest byte
position (Iterator::min returns the first equal element on ties so
declaration order in req.stop wins), truncates with a direct byte-
slice (str::find returns a UTF-8 char boundary by contract).
- When a stop matches, finish_reason is overridden to Stop (matches
OpenAI / Ollama real-world behaviour); otherwise the caller's
canned_finish passes through verbatim.
- Emits one TokenChunk::Token per Unicode scalar value (char), NOT per
grapheme cluster — Hangul jamo, emoji ZWJ sequences, combining
marks split. Acceptable for trait-shape testing; real adapters MAY
combine. Documented in module docs.
- Always terminates with TokenChunk::Done { finish_reason, usage } even
if the canned response is empty. The returned iterator is a boxed
Vec<TokenChunk>::into_iter().map(Ok), trivially Send.
- Real adapters MAY return Err from generate_stream itself (e.g.
connection refused) before any chunk is yielded; the mock never does.
Documented for the trait re-exporter consumer audience.
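The stop-string handling and per-char tokenization described for the mock can be sketched as follows. `truncate_at_stop` and `tokens` are hypothetical helper names, not the crate's API; the boolean in the return value signals that finish_reason should be overridden to Stop.

```rust
/// Cut `canned` at the earliest occurrence of any non-empty stop string.
/// Returns the (possibly truncated) text and whether a stop matched.
pub fn truncate_at_stop<'a>(canned: &'a str, stops: &[&str]) -> (&'a str, bool) {
    let cut = stops
        .iter()
        .filter(|s| !s.is_empty())
        .filter_map(|s| canned.find(*s))
        .min();
    match cut {
        // `str::find` returns a char-boundary offset, so slicing is safe.
        Some(pos) => (&canned[..pos], true),
        None => (canned, false),
    }
}

/// One token per Unicode scalar value (char), not per grapheme cluster.
pub fn tokens(text: &str) -> Vec<String> {
    text.chars().map(|c| c.to_string()).collect()
}
```

Concatenating the output of `tokens` reproduces the truncated text exactly, which is the property the proptest in this commit asserts.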
Helpers:
- assert_finish_chunk(chunks) — asserts the last chunk is a Done.
Useful for proptests asserting trait contract over random inputs.
Tests:
- cargo test -p kb-llm (no features): 2 reexport / dyn-dispatch tests.
- cargo test -p kb-llm --features mock: 9 tests including 100-case
proptest over random Unicode strings asserting Done terminator,
char-count == streamed Token chunks, concat == canned (truncated by
stop), plus explicit cases for stop-string truncation, first-stop-
match precedence, model_ref dimensions=None invariant, finish reason
pass-through.
- All 271 workspace tests pass; clippy clean for both default and
mock-on feature configurations.
Symbol gating verified:
- cargo build --release -p kb-llm (default): nm shows zero
MockLanguageModel symbols.
- cargo build --release -p kb-llm --features mock: three trait-impl
symbols present. Spec invariant "release builds MUST NOT include
MockLanguageModel" enforced at the symbol level.
Allowed deps respected: only kb-core (path) and anyhow (workspace,
forced by trait return type). Dropped kb-config / serde / thiserror /
tracing from the spec's allowed list — they are listed as Allowed but
nothing in this skeleton crate references them, and dropping them
keeps the dependency graph slim for downstream consumers. p4-2/p4-3
will add what they need at their own dep sites.
Forbidden deps (reqwest, ureq, tokio, whisper-rs, kb-source-fs,
kb-parse-md, kb-normalize, kb-chunk, kb-store-*, kb-embed*, kb-search,
kb-rag, kb-tui, kb-desktop) all absent from cargo tree -p kb-llm.
Out of scope: real adapter (p4-2 Ollama), token counting against the
real tokenizer, server-side cancellation / abort signals (P+).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| 17d52461b2 |
feat(p3-5): wire kb-app facade — ingest / search / list / inspect
Replaces the P0 `bail!("not yet wired")` stubs in kb-app with real
bodies that compose the libraries shipped through P3-4. After this
commit, `cargo run -p kb-cli -- index` actually walks the workspace
and persists chunks (SQLite + optionally LanceDB), and
`cargo run -p kb-cli -- search --mode {lexical,vector,hybrid}` returns
real SearchHits with citations. `kb-app::ask` stays stubbed; P4-3
owns it.
App lifecycle (crates/kb-app/src/app.rs):
- Internal pub(crate) struct App holds the Config plus
Arc<SqliteStore> eagerly, with embedder + LanceVectorStore behind
OnceLock<Arc<...>> for memoization. First call pays the ~470MB
fastembed init / Lance open; subsequent calls return the cached
Arc::clone. OnceLock::set race losers fall back to get().cloned()
so the lazy-init is concurrent-safe.
- One-shot CLI invocations pay the cost once at most. The P9 TUI
(which holds an App for the session) gets memoization for free.
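The lazy-init pattern described above (set, and on a lost race fall back to `get().cloned()`) might look roughly like this. `Lazy` and `get_or_try_init` are hypothetical names for illustration; the real `App` holds its embedder and vector store in separate fields.

```rust
use std::sync::{Arc, OnceLock};

/// Hypothetical wrapper around a lazily built, shareable resource.
pub struct Lazy<T> {
    cell: OnceLock<Arc<T>>,
}

impl<T> Lazy<T> {
    pub fn new() -> Self {
        Self { cell: OnceLock::new() }
    }

    /// Return the cached value, or build it with a fallible initializer.
    pub fn get_or_try_init<E>(
        &self,
        init: impl FnOnce() -> Result<T, E>,
    ) -> Result<Arc<T>, E> {
        // Fast path: already initialized.
        if let Some(existing) = self.cell.get() {
            return Ok(Arc::clone(existing));
        }
        // Slow path: pay the init cost, then try to publish the result.
        let fresh = Arc::new(init()?);
        match self.cell.set(Arc::clone(&fresh)) {
            Ok(()) => Ok(fresh),
            // Another thread won the race: drop ours and use the winner's.
            Err(_rejected) => Ok(self
                .cell
                .get()
                .cloned()
                .expect("set failed, so the cell must be initialized")),
        }
    }
}
```

Under a race, two threads may both run the expensive initializer, but every caller ends up holding the same `Arc`, so the cached state stays consistent. Errors from `init` are not cached, so a later call can retry.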
ingest pipeline (lib.rs):
- FsSourceConnector::scan(&scope) → per asset:
parse_frontmatter → parse_blocks → build_canonical_document →
MdHeadingV1Chunker.chunk → put_asset_with_bytes → put_document →
put_blocks → put_chunks. One transaction per document per design
§5.8 (kb-store-sqlite's put_* methods own the transactions).
- When provider != "none" and dimensions > 0: build embedder once,
embed each doc's chunks as Document kind, ensure_table once at the
top of the run, then upsert the VectorRecord batch. Lexical-only
config (provider == "none") skips both — verified by
ingest_provider_none_skips_lance test.
- Per-asset parse failures recorded as IngestItemKind::Error with
the warning attached; the run continues. Only structural failures
(DB unreachable etc.) abort.
- Aggregate counts (assets_scanned / new / updated / skipped /
errors / chunks_indexed / embeddings_indexed / duration_ms) flow
into both the JobRepo progress_json AND a dedicated ingest_runs
row written via SqliteStore::record_ingest_run (new
pub(crate) helper added to kb-store-sqlite — see below).
summary_only=true writes items_json=NULL but still populates the
count columns.
search dispatch:
- SearchMode::Lexical → LexicalRetriever directly.
- SearchMode::Vector → VectorRetriever with embedder + LanceVectorStore.
- SearchMode::Hybrid → HybridRetriever composing the two.
- Vector / Hybrid with provider=none returns a clear error naming the
config key to flip ("models.embedding.provider").
list_docs / inspect_doc / inspect_chunk delegate straight to
DocumentStore trait methods. Returns Err with actionable message on
not-found.
Test seam: each public free function has a matching
#[doc(hidden)] pub fn *_with_config(cfg, ...) companion that
integration tests invoke directly (the public form internally calls
load_config()). pub(crate) would not reach across the integration-
tests crate boundary; #[doc(hidden)] keeps it out of rustdoc and the
function comment flags it as test-only.
kb-store-sqlite additions:
- pub struct IngestRunRow + pub fn record_ingest_run on SqliteStore
for the kb-app aggregate-counts persistence path. Helper writes
the ingest_runs row directly with all aggregate columns; jobs
table still gets a JobRepo create/update_progress/finish trio in
parallel.
Tests (11 default, 2 #[ignore] AVX-gated):
- ingest_lexical: round-trip, idempotent, summary_only_drops_items,
provider_none_skips_lance (asserts no .lance dir on disk),
records_ingest_runs_row_with_aggregate_counts, tags_any filter,
inspect_doc_not_found, inspect_chunk_not_found.
- search_lexical: lexical hits with embedding_model=None,
empty_query_returns_empty, vector_mode_with_provider_none returns
clear error.
- search_vector: hybrid mode end-to-end (#[ignore], AVX), Vector
mode embedding_model assertion (#[ignore], AVX). Both run on the
AVX VM in ~21s combined (first run pays the model download).
- TestEnv pins workspace.root + storage.{data_dir,model_dir} to a
TempDir so tests don't touch the user's $HOME/.local/share.
- Fixture workspace at crates/kb-app/tests/fixtures/workspace/ has
three small markdown files with varied frontmatter (rust+cargo+
python tags) so the tags_any filter test exercises a non-trivial
predicate.
Workspace 269 passed / 24 ignored / 0 failed (was 261/22). cargo
clippy --workspace --all-targets -- -D warnings clean. CLI smoke
verified manually: `cargo run -p kb-cli -- index` returns a real
IngestReport JSON; `cargo run -p kb-cli -- search "..."` returns
hits with citations; `cargo run -p kb-cli -- list docs` lists the
indexed documents.
Allowed deps respected: kb-source-fs, kb-parse-md, kb-parse-types,
kb-normalize, kb-chunk, kb-store-sqlite, kb-search, kb-store-vector,
kb-embed, kb-embed-local plus existing tracing / anyhow / serde /
toml / dirs and now blake3 (run_id) + time. Forbidden (kb-llm*,
kb-rag, kb-tui, kb-desktop, kb-parse-{pdf,image,audio}) absent from
cargo tree -p kb-app.
Out of scope per spec: ask body (P4-3), --rebuild-fts wiring,
--resume checkpointing (P+), --watch (P+), TUI / desktop integration
(P9 consumes this facade).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| ccd49ef546 |
feat(p3-4): hybrid-fusion — VectorRetriever + HybridRetriever (RRF)
Composes the existing LexicalRetriever (P2-2) with a new VectorRetriever
wrapper around LanceVectorStore (P3-3) into a single Retriever that
dispatches by SearchMode. For SearchMode::Hybrid, fuses lexical and
vector candidates via Reciprocal Rank Fusion and populates the full
RetrievalDetail per SearchHit so kb search --explain can attribute
scores back to each side.
Public surface (kb-search crate):
- pub struct VectorRetriever — Arc<dyn VectorStore + Send + Sync>,
Arc<dyn Embedder>, Arc<SqliteStore>, IndexVersion at construction.
- pub struct HybridRetriever { lexical, vector, fusion, k }.
- pub enum FusionPolicy { Rrf { k_rrf: u32 } }.
VectorRetriever:
- Embeds query.text as EmbeddingKind::Query before delegating to
VectorStore::search(query_vec, query.k * 2, &query.filters). Over-
fetches by ×2 for filter losses; LanceVectorStore applies the
filters internally so they propagate naturally.
- Hydrates each VectorHit into a full SearchHit by joining on
chunk_id in a single IN-clause batch (no N+1): doc_path,
section_label, chunker_version, source_spans for citation, plus
embedding_model from embedder.model_id().
- Snippet trimmed to config.search.snippet_chars (vector mode lacks
FTS5 highlighting; chunk text prefix is the next-best signal).
- Citation built from the chunk's first source span via the shared
citation_helper module — extracted from lexical.rs so both
retrievers compute citations identically (Byte/empty fallback to
Line{1,1} preserved with tracing::warn).
- RetrievalDetail.method = Vector for standalone calls; both
fusion_score and vector_score set to the LanceVectorStore-shifted
cosine score; lexical_* None.
HybridRetriever:
- Lexical / Vector modes delegate 1:1 — no rebuild of RetrievalDetail.
- Hybrid mode runs both retrievers with k * 2 fanout, fuses with
RRF (score(c) = Σ 1/(k_rrf + rank_m(c))), sorts fused-score DESC
with deterministic tiebreaker (lex_rank ASC then chunk_id ASC),
takes top query.k. Fusion math runs in f64 throughout; cast to
f32 only at the SearchHit boundary where bounded magnitude (≤
~0.033 at k_rrf=60) makes f32 precision sufficient for ranking.
- Per-hit lexical preferred for snippet/citation/heading_path/
chunker_version/embedding_model when the chunk appears in both
retrievers — FTS5 highlighting is more user-relevant than vector's
truncated text. Vector-only chunks fall through to vector hit data.
- index_version returns format!("hybrid:{}+{}", lex_iv, vec_iv) at
construction; mismatched lex/vec versions trigger a tracing::warn
so users notice stale indexes (spec line 143).
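The fusion step can be sketched with plain ranked id lists. This is an illustration, not the crate's implementation: `rrf_fuse` and `Fused` are hypothetical names, ranks are assumed to be 1-based (consistent with the 1/61 + 1/62 test case at k_rrf = 60), and sorting chunks without a lexical rank after those that have one is an assumption about the tiebreaker.

```rust
use std::collections::HashMap;

/// Hypothetical fused hit: id, final score, and per-side 1-based ranks.
#[derive(Debug, Clone, PartialEq)]
pub struct Fused {
    pub chunk_id: String,
    pub score: f32,
    pub lex_rank: Option<usize>,
    pub vec_rank: Option<usize>,
}

/// Reciprocal Rank Fusion: score(c) = sum over lists of 1 / (k_rrf + rank).
pub fn rrf_fuse(lexical: &[&str], vector: &[&str], k_rrf: u32, k: usize) -> Vec<Fused> {
    // chunk_id -> (lexical rank, vector rank); first occurrence wins.
    let mut ranks: HashMap<&str, (Option<usize>, Option<usize>)> = HashMap::new();
    for (i, id) in lexical.iter().enumerate() {
        let entry = ranks.entry(*id).or_default();
        if entry.0.is_none() {
            entry.0 = Some(i + 1);
        }
    }
    for (i, id) in vector.iter().enumerate() {
        let entry = ranks.entry(*id).or_default();
        if entry.1.is_none() {
            entry.1 = Some(i + 1);
        }
    }

    // Fusion math stays in f64; the cast to f32 happens only at the end.
    let term = |rank: Option<usize>| rank.map_or(0.0_f64, |r| 1.0 / (k_rrf as f64 + r as f64));
    let mut fused: Vec<(f64, Option<usize>, Option<usize>, &str)> = ranks
        .into_iter()
        .map(|(id, (lex, vec))| (term(lex) + term(vec), lex, vec, id))
        .collect();

    // Score DESC, then lexical rank ASC (missing rank last), then id ASC.
    fused.sort_by(|a, b| {
        b.0.total_cmp(&a.0)
            .then_with(|| a.1.unwrap_or(usize::MAX).cmp(&b.1.unwrap_or(usize::MAX)))
            .then_with(|| a.3.cmp(b.3))
    });

    fused
        .into_iter()
        .take(k)
        .map(|(score, lex_rank, vec_rank, id)| Fused {
            chunk_id: id.to_string(),
            score: score as f32,
            lex_rank,
            vec_rank,
        })
        .collect()
}
```

The explicit tiebreaker matters because `HashMap` iteration order is unspecified; without it, chunks with identical fused scores could come back in a different order on each run.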
kb-search additions:
- citation_helper.rs — pub(crate) citation_from_first_span shared
between lexical and vector retrievers. Extracted from lexical.rs;
no behavior drift.
Tests (38 default + 3 ignored):
- 12 unit tests in hybrid.rs covering RRF math (1/61 + 1/62 within
f32 epsilon × 10 tolerance), lexical/vector mode delegation, hybrid
preserves single-side hits with the missing side's RetrievalDetail
None, deterministic tiebreaker on identical fused scores, composite
index_version, mismatched-version warn at construction.
- 2 unit tests in vector.rs covering the snippet-prefix and citation
fallback paths.
- 11 unit tests in lexical.rs (unchanged from P2-2).
- 13 lexical integration tests (unchanged).
- 3 #[ignore] AVX-gated hybrid integration tests: disjoint-corpus
recall (lex returns A,B; vec returns C,D; hybrid returns all 4),
determinism over two queries, snapshot stability against
tests/fixtures/search/hybrid/run-1.json. Snapshot fixture was
regenerated against this branch on an AVX-enabled VM and contains
4 real chunks (c1/c2 lex+vec, c3/c4 vec-only).
- KB_UPDATE_SNAPSHOTS=1 path now panics after writing instead of
silently passing — matches the P3-2/P3-3 fail-loud-instead-of-
silent-pass philosophy.
Allowed deps respected (kb-core, kb-config, kb-store-sqlite,
kb-store-vector, kb-embed, tracing, thiserror) plus pre-existing
kb-search deps from P2-2 (rusqlite, globset, serde_json, anyhow).
kb-embed-local does NOT appear — VectorRetriever takes Arc<dyn Embedder>
trait object; the concrete adapter is runtime-injected by kb-app.
Out of scope: reranker (P+), score calibration across modes (RRF is
rank-comparable so absolute calibration is P+), multimodal retrieval
(P6+).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| 3cd5117a7e |
feat(p3-3): kb-store-vector — LanceDB VectorStore + V003 embedding status
First VectorStore implementation. Per-model Lance tables under
config.storage.vector_dir, two-phase upsert (SQLite-pending → Lance
MergeInsert → SQLite-committed) with crash-safe retry, search via
cosine distance with the spec's score-shift (preserves negative
similarity ranking signal that clamping would crush).
V003 migration:
- Adds status (CHECK constraint pending|committed|tombstone, default
pending) and vector_committed columns to embedding_records.
- BEFORE DELETE trigger on chunks flips dependent rows to tombstone.
Currently overshadowed by V001's ON DELETE CASCADE FK; trigger UPDATE
runs first then row vanishes via CASCADE. Spec-faithful tombstone
preservation requires recreating embedding_records to drop the
CASCADE; deferred to a P+ migration since no production rows exist
yet (P3-3 is the first writer). V003 SQL comment explains.
LanceVectorStore:
- ensure_table is idempotent: opens existing or creates with the Arrow
schema (chunk_id, doc_id, embedding FixedSizeList<Float32, dim>,
model_id, embedding_version, text, heading_path, created_at).
- IndexId computed via id_for_index with
collection="chunk_embeddings", index_kind="flat", params_hash =
blake3(descriptor JSON). Schema bumps automatically rotate the
IndexId.
- upsert: phase-1 INSERT OR REPLACE INTO embedding_records
(status='pending') in a single SQLite tx; phase-2 Lance MergeInsert
keyed on chunk_id (idempotent re-run); phase-3 UPDATE
status='committed', vector_committed=1. If phase-2 fails the rows
stay 'pending' and the next upsert call retries idempotently.
- search joins embedding_records WHERE status='committed' so partial-
write rows never surface. Cosine distance from Lance ∈ [0, 2] →
similarity = 1 - distance ∈ [-1, 1] → score = (similarity + 1)/2 ∈
[0, 1]. NaN coerced to 0 with tracing::warn. Filter by SearchFilters
via SqliteStore::filter_chunks (added in this commit).
- Sync trait + async LanceDB bridged by an embedded current-thread
tokio runtime. Doc-comment on the struct flags the "do NOT call from
inside another tokio runtime" panic (block_on cannot nest). kb-app's
job scheduler is sync today.
kb-store-sqlite additions:
- pub fn put_embedding_records_pending(&[EmbeddingRecordRow]): phase-1
INSERT OR REPLACE (status='pending', vector_committed=0).
- pub fn mark_embedding_records_committed(&[EmbeddingId]): phase-3
single UPDATE … WHERE embedding_id IN (?, ?, …) via params_from_iter,
guarded by WHERE status='pending' so tombstones don't get clobbered.
- pub fn filter_chunks(&[ChunkId], &SearchFilters) → Vec<ChunkId>
consolidates the JOIN against documents/document_tags/
embedding_records + path_glob via globset. Lets kb-store-vector honor
SearchFilters without depending on rusqlite or globset directly.
(kb-search's filter logic is structurally different, interleaved
with the FTS5 SELECT, so it stays as-is for now; consolidation is a
P+ refactor.)
- 4 new unit tests cover the phase-1 round-trip, empty batch, replay
reset of pending rows, and the WHERE-status-pending guard.
Tests:
- 9 lib unit tests in kb-store-vector covering paths/sanitization,
arrow_batch dim validation + descriptor hash, bm25-style cosine score
shift math.
- 4 new kb-store-sqlite unit tests on filter_chunks (committed-only,
tags/lang/trust/path_glob, order preservation, empty input).
- 4 new kb-store-sqlite unit tests on the embedding_records helpers.
- 8 integration tests in upsert_search.rs and 1 snapshot test marked
#[ignore = "requires AVX-capable hardware (LanceDB)"]. They invoke
require_avx_or_panic() at the top of each body so a missing-AVX
--ignored run fails loudly instead of silently passing. This dev host
(qemu64 model) lacks AVX so these were NOT exercised end-to-end here;
first CI lane on AVX hardware will validate them.
- Snapshot fixture tests/fixtures/vector/run-1.json is a placeholder
with an _comment marker. Snapshot test panics until the placeholder
is replaced via KB_UPDATE_SNAPSHOTS=1 on AVX hardware.
- Workspace 241 passed, 19 ignored, 0 failed; cargo clippy --workspace
--all-targets -- -D warnings clean.
Allowed deps respected (kb-core, kb-config, kb-store-sqlite, lancedb,
arrow + arrow-array + arrow-schema, serde, serde_json, tracing,
thiserror) plus forced waivers: anyhow (trait return type), tokio +
futures (LanceDB async-only API), blake3 (params_hash). rusqlite and
globset are NOT direct deps of kb-store-vector, confirmed via cargo
metadata --no-deps. rusqlite stays in [dev-dependencies] for the test
fixture seeder only.
Out of scope: IVF/PQ index tuning (P+), image vectors (P6), kb-app
embed_index orchestration (P3-4 facade).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
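The score shift described in the p3-3 commit above (cosine distance in [0, 2] mapped to a score in [0, 1], with NaN coerced to 0) can be sketched with the standard library alone. The function name `distance_to_score` is hypothetical, and the `tracing::warn` call the commit mentions is left out so the sketch has no external dependencies.

```rust
/// Map a cosine distance in [0, 2] to a ranking score in [0, 1].
///
/// Hypothetical helper name; only the formula comes from the commit message.
pub fn distance_to_score(distance: f32) -> f32 {
    // similarity = 1 - distance, which lies in [-1, 1].
    let similarity = 1.0 - distance;
    // Shift and scale into [0, 1]. Negative similarities keep their
    // relative order instead of all collapsing to 0 as clamping would do.
    let score = (similarity + 1.0) / 2.0;
    // The commit coerces NaN to 0 (and logs a warning, omitted here).
    if score.is_nan() { 0.0 } else { score }
}
```

Two hits with negative similarity (distances 1.5 and 1.8) still rank in the expected order after the shift, which is the property the commit says clamping would lose.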
| bcbe2b8531 |
feat(p3-2): kb-embed-local crate — fastembed adapter for multilingual-e5-small
First real Embedder implementation. Wraps fastembed-rs (ONNX runtime)
with the e5 prefix convention, batching, and {data_dir}/${XDG_DATA_HOME}
template expansion so model files land under config.storage.model_dir/
fastembed/ without polluting kb-config's public API.
Public surface:
- pub struct FastembedEmbedder
- pub fn new(config: &Config) -> Result<Self>
- impl kb_core::Embedder (via kb-embed re-export)
Behavior:
- Default model multilingual-e5-small (384 dims). model_id and
model_version come from config.models.embedding.{model,version}.
- Pre-load dim check via TextEmbedding::get_model_info: dim mismatch
bails before paying the ~470MB ONNX init cost.
- e5 prefix applied BEFORE tokenization: "passage: " for
EmbeddingKind::Document, "query: " for EmbeddingKind::Query. Pinned
by prefix_input unit tests.
- Batches inputs into chunks of config.models.embedding.batch_size,
concatenates results in input order.
- L2 normalization is performed by fastembed 4.9's default transformer
pipeline (verified at fastembed/src/text_embedding/output.rs:43);
we skip re-normalization. Integration test pins ‖v‖ ≈ 1.0 ± 1e-3 so
a future fastembed bump that drops this invariant fails loudly.
- Synchronous (no async runtime). Mutex serializes calls into the
underlying ONNX session — conservative; ORT Session is Send+Sync but
callers (kb-app indexer) batch sequentially anyway. Revisit if
profiling shows contention.
- First-run model download surfaces via tracing::info before/after
TextEmbedding::try_new — users no longer stare at a silent 30-60s
pause during the 470MB pull.
Tests:
- 11 default-lane tests covering: check_dim match/mismatch (no model
load), prefix_input Document/Query/empty, resolve_model
known/unknown, expand_path substitution + no-op + XDG_DATA_HOME set
+ XDG_DATA_HOME unset (falls back to ~/.local/share with recursive
~ expansion). XDG tests serialize on a Mutex + RAII guard since
edition 2024 makes set_var/remove_var unsafe.
- 7 #[ignore] integration tests covering: full construction with
default config, dim-mismatch belt-and-braces, Document vs Query
cosine differential, L2 unit norm, byte-equal determinism, batch-64
performance under 5s, snapshot-hash stability over a 5-sentence
multilingual fixture.
- Snapshot test fails LOUDLY when SNAPSHOT_HASH_BASELINE is 0 — prints
the captured hash and panics with paste-back instructions, so first
--ignored run forces the maintainer to pin the baseline rather than
silently passing.
- Workspace: 222 tests pass (default lane); clippy clean.
Allowed deps respected: kb-config, kb-embed (re-exports kb-core
trait surface), fastembed = "4.9", tracing, anyhow. tokenizers and
ort enter transitively through fastembed; reqwest/hyper/hf-hub also
transitive (model download is fastembed's responsibility per spec
carve-out). No direct kb-core dep needed — re-exports cover it.
Pinned to fastembed 4.x rather than the recent 5.x to limit blast
radius; consider bump when p3-3 (lancedb-store) consumes the embedder
output shape.
Out of scope: reranker (P+), Ollama embedding endpoint, candle
adapter, image embeddings (P6).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
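The e5 prefix convention and the ordered batching described in the p3-2 commit above can be sketched as follows. The `prefix_input` name appears in the commit message, but its signature here is an assumption; `EmbeddingKind` is a local stand-in for the kb-core type, and `embed_in_batches` with its closure parameter is a hypothetical helper standing in for the real ONNX call.

```rust
/// Local stand-in for kb_core::EmbeddingKind (assumed shape).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum EmbeddingKind {
    Document,
    Query,
}

/// Apply the e5 prefix before tokenization:
/// "passage: " for documents, "query: " for queries.
pub fn prefix_input(kind: EmbeddingKind, text: &str) -> String {
    let prefix = match kind {
        EmbeddingKind::Document => "passage: ",
        EmbeddingKind::Query => "query: ",
    };
    format!("{prefix}{text}")
}

/// Split inputs into batches of `batch_size` and concatenate the
/// results in input order. `embed_batch` is a hypothetical callback
/// that stands in for the model invocation.
pub fn embed_in_batches<F>(
    inputs: &[String],
    batch_size: usize,
    mut embed_batch: F,
) -> Vec<Vec<f32>>
where
    F: FnMut(&[String]) -> Vec<Vec<f32>>,
{
    // `chunks` panics on a size of 0, so fall back to 1.
    let batch_size = batch_size.max(1);
    let mut out = Vec::with_capacity(inputs.len());
    for batch in inputs.chunks(batch_size) {
        out.extend(embed_batch(batch));
    }
    out
}
```

Because `chunks` walks the slice front to back, the concatenated output keeps the input order without any extra bookkeeping.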
| 2e3eb8f437 |
feat(p3-1): kb-embed crate — Embedder trait re-export + MockEmbedder
Establishes the kb-embed trait crate so concrete embedding adapters
(p3-2 fastembed, future ollama-embed/candle) target a stable surface.
Pure re-export of kb_core::{Embedder, EmbeddingInput, EmbeddingKind,
EmbeddingModelId, EmbeddingVersion} plus a feature-gated deterministic
mock for downstream tests.
MockEmbedder (cfg(feature = "mock"), default OFF):
- Per-component hash recipe: blake3(seed_le8 || kind_byte ||
text_len_le8 || text || i_le8). Length-prefixed text avoids the
domain-separation ambiguity where two (text, i) pairs could shift
bytes between text tail and the i field.
- Document = 0u8, Query = 1u8 — same text different kind yields
different vectors (mirrors e5 prefix behaviour).
- Per component: blake3 first 8 bytes → u64 → reinterpret as i64 →
f64/i64::MAX → f32. i64::MIN gives -1.0000000000000002 which f32
rounds to -1.0; range [-1, 1] holds.
- L2 unit-normalised. Norm sums in f64 (avoid catastrophic precision
loss) before f32 cast. Zero-norm guard skips the divide.
- with_seed(...) constructor lets two embedders share identity but
produce different vectors — useful for downstream parametric tests.
Helpers:
- assert_vector_shape(vecs, dims) — len + finite check.
- assert_unit_norm(vecs, tolerance) — caller-supplied tolerance;
5e-4 documented as safe for dims=384 under f32 epsilon × √dims.
Tests:
- cargo test -p kb-embed (no features): 2 reexport/dyn-dispatch tests.
- cargo test -p kb-embed --features mock: 7 tests including 100-case
proptest asserting len == dims, all finite, ‖v‖ ≈ 1.0 within
tolerance, Doc(text) byte-equal Doc(text), Doc(text) ≠ Query(text),
Doc(text1) ≠ Doc(text2).
- All 220 workspace tests pass; clippy clean for both default and
mock-on feature configurations.
Symbol gating: nm on the release rlib confirms zero MockEmbedder
symbols under default features; three trait impl symbols under
--features mock. Spec invariant "release builds MUST NOT include
MockEmbedder" verified at the symbol level.
Allowed deps respected: kb-core, kb-config, serde, thiserror, tracing,
plus anyhow (forced by trait return type) and blake3 (justified by
the determinism contract; already in workspace lockfile via kb-core).
No fastembed/ort/tokenizers anywhere.
Out of scope: real adapter (p3-2), reranker traits (P+).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
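The numeric half of the MockEmbedder recipe in the p3-1 commit above (u64 reinterpreted as i64, scaled into [-1, 1], then L2-normalised with an f64 accumulator and a zero-norm guard) can be sketched without the blake3 step. Both function names are hypothetical, and the raw `u64` is assumed to come from the first 8 bytes of the per-component hash.

```rust
/// Map a raw u64 (assumed: first 8 hash bytes) to a component in [-1, 1].
pub fn component_from_u64(raw: u64) -> f32 {
    // `as i64` reinterprets the bits, so the top bit becomes the sign.
    let signed = raw as i64;
    // Divide in f64, then narrow to f32 as the commit describes.
    (signed as f64 / i64::MAX as f64) as f32
}

/// L2-normalise a vector, summing squares in f64 before the f32 cast.
/// A zero-norm input is returned unchanged (the divide is skipped).
pub fn l2_normalize(components: &[f32]) -> Vec<f32> {
    let norm_sq: f64 = components
        .iter()
        .map(|&c| (c as f64) * (c as f64))
        .sum();
    let norm = norm_sq.sqrt();
    if norm == 0.0 {
        return components.to_vec();
    }
    components
        .iter()
        .map(|&c| (c as f64 / norm) as f32)
        .collect()
}
```

Accumulating in f64 keeps the rounding error of the norm well below the f32 tolerance that `assert_unit_norm` callers are expected to pass.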
| b335151d18 |
feat(p2-2): kb-search crate + LexicalRetriever (FTS5 + bm25)
Adds the first concrete kb_core::Retriever, exercising chunks_fts (P2-1)
to answer SearchMode::Lexical queries. Returns Vec<SearchHit> with
bm25-derived ranking, snippet() previews, and W3C-fragment-style
Citation built from the chunk's first source_spans entry.
New crate kb-search:
- LexicalRetriever::new(Arc<SqliteStore>, IndexVersion).
- search() builds an FTS5 MATCH expression by escaping every whitespace
token into a quoted literal (inner " doubled); single-quote-wrapped
text passes through verbatim as raw FTS5 syntax. Empty query
short-circuits to Ok(vec![]).
- bm25 normalization: score = -bm25 / (1 + |bm25|), bounded (0, 1] for
any FTS5-returned negative bm25.
- Snippet via snippet(chunks_fts, 3, '', '', '…', word_budget) where
word_budget = snippet_chars / 4 clamped to [1, 64]; trim_snippet
enforces the char cap on the way out (chars per design §6.4 — accepts
the combining-mark trade-off).
- Citation from chunks.source_spans_json first span: Line / Page /
Region / Time forwarded; Byte / empty array fall back to Line{1,1}
with a tracing::warn so forward-compat regressions surface.
- Filters: tags_any (subquery on document_tags), lang (= column),
trust_min (CASE-rank in SQL) all applied at SQL level. path_glob
uses globset with literal_separator(true) — guarantees '*' does not
cross '/' per spec Risks/notes — applied as Rust post-filter with
+128 row over-fetch when set, then rank reassigned 1..k contiguously.
- Determinism: ORDER BY score, f.chunk_id (lexicographic blake3 hex
tiebreaker on identical bm25). Tested explicitly with two chunks of
identical text content.
- RetrievalDetail: method=Lexical, both lexical_score and fusion_score
set, vector_* None.
kb-store-sqlite:
- Adds pub fn read_conn(&self) -> MutexGuard<'_, Connection>.
Read-only contract is doc-only — kb-search needs MutexGuard for
prepare_cached + iter, which a closure-scoped wrapper would awkwardly
constrain. Closure variant left as a P3 follow-up.
Tests (26 new): empty corpus, empty query, single hit + citation
round-trip, snippet length cap, tags_any exclusion, lang+trust
composition, path_glob with '*' not crossing '/', citation line round-
trip, bm25 top-1 ∈ (0, 1], determinism (varied scores AND identical-
score tiebreaker), index_version passthrough, snapshot
(crates/kb-search/tests/fixtures/search/lexical/run-1.json — stable
under bundled SQLite; KB_UPDATE_SNAPSHOTS=1 to regenerate). Workspace:
211 tests pass, cargo clippy --workspace --all-targets -D warnings
clean.
Allowed deps respected: kb-core, kb-config, kb-store-sqlite, rusqlite,
tracing, thiserror, anyhow (forced by trait return type), serde_json
(parses *_json TEXT columns), globset (path_glob '*' boundary).
Out of scope (deferred): vector retriever (p3-3), hybrid fusion (p3-4),
reranker (P+), Korean morphological tokenizer (P+).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
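Three of the rules in the p2-2 commit above are small enough to sketch directly: the FTS5 MATCH escaping, the bm25 normalization, and the snippet word budget. All three function names are hypothetical. The raw pass-through is interpreted here as "the whole query is wrapped in single quotes", which is an assumption about the commit's wording, not a confirmed detail.

```rust
/// Build an FTS5 MATCH expression, or None for an empty query.
pub fn build_match_expr(query: &str) -> Option<String> {
    let trimmed = query.trim();
    if trimmed.is_empty() {
        return None; // caller short-circuits to an empty result
    }
    // Assumption: a query fully wrapped in single quotes is raw FTS5 syntax.
    if trimmed.len() > 2 && trimmed.starts_with('\'') && trimmed.ends_with('\'') {
        return Some(trimmed[1..trimmed.len() - 1].to_string());
    }
    // Otherwise quote every whitespace token, doubling inner double quotes.
    let quoted: Vec<String> = trimmed
        .split_whitespace()
        .map(|tok| format!("\"{}\"", tok.replace('"', "\"\"")))
        .collect();
    Some(quoted.join(" "))
}

/// score = -bm25 / (1 + |bm25|); FTS5 returns negative bm25 for matches,
/// so better matches (more negative) get scores closer to 1.
pub fn normalize_bm25(bm25: f64) -> f64 {
    -bm25 / (1.0 + bm25.abs())
}

/// word_budget = snippet_chars / 4, clamped to [1, 64].
pub fn snippet_word_budget(snippet_chars: usize) -> usize {
    (snippet_chars / 4).clamp(1, 64)
}
```

Quoting each token turns FTS5 operators such as `OR` or `*` typed by a user into plain literals, which is why the raw pass-through exists as an explicit opt-in.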
| 111f40ddf0 |
p1-6: kb-store-sqlite test suite (8 categories)
All 8 test categories from the task plan, plus a JobRepo subset:
migration — tests/migration.rs: fresh DB after run_migrations
exposes every required §5 table + index.
unit (copy) — tests/asset_writer.rs: copy mode writes file with
mode 0o644 + correct bytes.
unit (ref) — tests/asset_writer.rs: reference mode does not write
file; row records source path.
unit (cs) — tests/asset_writer.rs: tampered checksum returns a
Conflict-flavoured anyhow error.
unit (idem) — tests/idempotency.rs: same put_document twice → 1 row,
doc_version 1→2; tags re-derived.
unit (rb) — tests/idempotency.rs: put_blocks with FK violation
rolls back; pre-existing rows unchanged.
contract — tests/contract_roundtrip.rs: drives kb-parse-md +
kb-normalize + kb-chunk on
fixtures/markdown/code-and-table.md, persists, then
reloads via DocumentStore::get_document /
get_chunk and asserts byte-equal round-trip.
snapshot — tests/ingest_report_snapshot.rs +
snapshots/ingest_report.snapshot.json: pin the wire
JSON form of kb_core::IngestReport for an inline
fixture run.
jobs — tests/jobs.rs: create → progress → finish flow;
error message round-trip; list filters on status/kind.
Drops the unused `serde` direct dep from Cargo.toml; serde_json brings
its own. Dev-deps confirmed via `cargo tree -p kb-store-sqlite --depth 1`
to live only in the dev tree.
|
|||
| a3390d5171 |
p1-6: scaffold kb-store-sqlite crate + V001 full §5 DDL
New workspace member crate `kb-store-sqlite` (allowed deps only:
kb-core, kb-config, rusqlite[bundled], refinery, serde, serde_json,
time, blake3, tracing, anyhow, thiserror; dev-deps add kb-parse-md /
kb-normalize / kb-chunk for the contract round-trip test).
Migration V001 replaces the P0-1 stub with the full §5 DDL (assets,
documents, document_tags, blocks, chunks with policy_hash,
embedding_records, jobs, ingest_runs, answers, eval_runs,
eval_query_results) plus the §5 indexes. FTS5 virtual table + triggers
remain deferred to V002 (P2-1).
Public surface per task spec:
SqliteStore::open / run_migrations / put_asset_with_bytes
impl DocumentStore for SqliteStore (7 trait methods)
impl JobRepo for SqliteStore (4 trait methods)
StoreError { Sqlx, Migration, Conflict }
Behavior:
- Pragmas at open: foreign_keys=ON, journal_mode=WAL,
synchronous=NORMAL, temp_store=MEMORY.
- Asset writer: byte_len ≤ copy_threshold_mb * 1MiB → copy to
data_dir/assets/<aa>/<asset_id> (mode 0o644 on Unix), else
reference. blake3(bytes) verified against asset.checksum; mismatch →
Conflict.
- Idempotency: put_document UPSERTs and bumps doc_version + 1 on
conflict; put_blocks / put_chunks DELETE-then-INSERT; document_tags
re-derived per put_document.
- get_document rehydrates blocks via payload_json ordered by stream
ordinal.
- list_documents builds dynamic WHERE from DocFilter (lang / trust_min
/ path_glob via GLOB / tags_any via document_tags subquery).
- JobRepo: jobs.kind/status are stored as lowercase enum tags; create
mints a 32-hex JobId via blake3(kind || payload || nanos).
Tests follow in subsequent commits.
|
|||
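The copy-versus-reference decision in the asset writer described above can be sketched as a pure function. `AssetStorage` and `plan_asset_storage` are hypothetical names, and reading `<aa>` as the first two characters of the asset id is an assumption; the blake3 checksum verification is omitted to keep the sketch dependency-free.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical result type: where the asset bytes will live.
#[derive(Debug, PartialEq, Eq)]
pub enum AssetStorage {
    /// Bytes are copied under data_dir/assets/<aa>/<asset_id>.
    Copied(PathBuf),
    /// Bytes stay in place; only the source path is recorded.
    Referenced(PathBuf),
}

/// Decide storage mode: byte_len <= copy_threshold_mb * 1 MiB means copy.
pub fn plan_asset_storage(
    data_dir: &Path,
    asset_id: &str,
    source_path: &Path,
    byte_len: u64,
    copy_threshold_mb: u64,
) -> AssetStorage {
    const MIB: u64 = 1024 * 1024;
    let threshold = copy_threshold_mb.saturating_mul(MIB);
    if byte_len <= threshold {
        // Assumption: <aa> is the first two hex characters of the id.
        let shard = asset_id.get(..2).unwrap_or(asset_id);
        AssetStorage::Copied(data_dir.join("assets").join(shard).join(asset_id))
    } else {
        AssetStorage::Referenced(source_path.to_path_buf())
    }
}
```

Keeping the decision separate from the file write makes the threshold boundary easy to unit-test without touching the filesystem.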
| 58f7b8573d |
p1-5: add long-section fixture + Vec<Chunk> snapshot test
Bakes the chunker output for fixtures/markdown/long-section.md (3 H1s,
nested H2 under Alpha, a 50-line code block, a 3-col x 4-row table, and
a multi-paragraph Gamma section) into the JSON snapshot baseline.
Confirms the priority rules end-to-end:
- Heading boundaries hold across H1 → H2 → H1 transitions
- The code block emits one chunk at 427 tokens > target=200
- The table stays single-chunk
- Gamma's paragraph stream splits with one block of overlap seed
A second test runs the full parse → normalize → chunk pipeline 5 times
and asserts identical chunk_ids each pass.
Drops the unused `kb-config` and `serde` from regular dependencies:
they were declared but no source path imports them; `serde` flows in
transitively via `kb-core` as a public API requirement, and
`ChunkingCfg` lives in `kb-config` but the chunker takes `ChunkPolicy`
directly. Production deps are now exactly the allowed set actually
used: anyhow, blake3, kb-core, serde_json_canonicalizer, tracing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| 8142449eb7 |
p1-5: scaffold kb-chunk crate with MdHeadingV1Chunker skeleton
Adds the new workspace member with the bare Chunker impl shape:
chunker_version() returns "md-heading-v1"; policy_hash() blake3-hashes
canonical JSON of ChunkPolicy and truncates to 16 hex chars; chunk() is
an empty stub the next commits fill in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| e0df42984e |
p1-4: address review I1-I3 + minors (extract attribution, audio-ref skip, NFC heading_path)
I1: warning_agent maps ExtractFailed → "kb-parse-md" (the panic-recovery
emitter in kb-parse-md/src/blocks.rs). Lift-stage warnings from
build_canonical_document are tracked separately and attributed to
"kb-normalize", so the I1 mapping change does not lie about
kb-normalize-originated drops.
I2: ParsedPayload::AudioRef no longer synthesizes Block::AudioRef with
an invalid empty AssetId (would violate AssetId::from_str's 32-hex
invariant). Block is dropped, Warning surfaces in Provenance with src
mention, attributed to kb-normalize (lift-stage decision). TODO(P8)
comment marks this as a placeholder until the audio extractor lands.
I3: NFC-normalize each heading_path string in lift_block before feeding
into id_for_block AND into CommonBlock.heading_path. pulldown-cmark does
not NFC heading text and serde_json_canonicalizer v0.3 does not either,
so canonically-equivalent NFD/NFC inputs would produce different
block_ids without this normalization. Mirrors the existing doc_id NFC
handling via to_posix.
Minors:
- M4: trim Cargo.toml — drop kb-config, serde_json_canonicalizer,
blake3 (unused); keep tracing (now wired) + unicode-normalization
(now used by I3).
- M5: determinism_1000_iterations_under_1s now uses the same 5-block
fixture as block_ordinals_scoped_per_heading_and_kind (extracted into
fixture_blocks_five helper) so the determinism property is exercised
on a real lift_block path, not just an empty Vec. Still < 1s.
- M6: snapshot integration test now passes BodyHints { first_h1:
Some("Code And Table"), .. } and asserts doc.title == "Code And Table"
end-to-end. Baseline JSON updated.
- M7: title/lang edge-case unit tests pin policy: empty string lifts to
empty string; non-stringy values silently drop. Rustdoc updated.
- M10: provenance_contains_stage_events_in_order asserts events[1].at
== events[2].at to pin the shared-now_utc invariant.
New tests (unit, kb-normalize):
- provenance_with_extract_failed_warning_attributes_to_kb_parse_md (I1)
- audio_ref_block_skipped_with_warning (I2)
- nfc_nfd_korean_heading_path_same_block_id (I3)
- title_empty_string_in_user_map_falls_back_to_default (M7)
- title_non_string_in_user_map_silently_drops (M7)
- lang_invalid_shape_silently_drops (M7)
kb-normalize unit tests: 9 → 14. Integration snapshot: 1 (unchanged).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| c0096ce44b |
p1-4: scaffold kb-normalize crate
Add the workspace member, `Cargo.toml` with the §8-allowed dep set
(kb-core, kb-parse-types, kb-config, serde, serde_json_canonicalizer,
blake3, unicode-normalization, time, anyhow, tracing) and a stubbed
`build_canonical_document` that pins the public signature plus `doc_id`
derivation.
`kb-parse-md` is permitted only as a *dev*-dep so the integration
snapshot test (added later in this series) can drive a fixture through
the real parser without violating the production boundary;
`cargo tree -p kb-normalize --depth 1 --edges normal` confirms no
parser implementation appears in the regular dep tree.
`id_for_doc` and `id_for_block` are re-exported from kb-core (which
holds the canonical recipe per §4.2); kb-normalize is the canonical
*entry point* per design §8.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| 4e7e9cad87 |
p1-3: add parse_blocks (pulldown-cmark walker) submodule
Implements `kb_parse_md::parse_blocks(body, body_offset_lines)`
returning a flat `Vec<ParsedBlock>` plus warnings. Walks pulldown-cmark
events through a small frame-based state machine that tracks heading
paths, accumulates inline buffers (Text/Code/Link/Strong/Emph only, per
design §3.4), and reports SourceSpan::Line spans in 1-indexed file-line
coordinates.
Covers headings, paragraphs, code blocks (lang from info string), GFM
tables (with malformed fallback to paragraph + MalformedTable warning),
lists (nested sub-lists flattened into parent item), and block-level
image references. Inline images are dropped silently per the inline
filter.
Adversarial inputs are caught with `catch_unwind` and degrade to an
empty output + ExtractFailed warning.
15 unit tests cover heading-path correctness, code lang, table parsing,
malformed-table fallback (driven via synthetic events since
pulldown-cmark auto-normalizes table widths), LF/CRLF line-range
parity, image refs, nested-list flattening, inline filter, and
100-iteration random-bytes plus hand-crafted adversarial-input no-panic
guards.
|
|||
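The panic-recovery behaviour described in the p1-3 commit above (adversarial input degrades to an empty output plus an ExtractFailed warning) can be sketched with `std::panic::catch_unwind`. `parse_guarded` and this `Warning` enum are hypothetical stand-ins for the crate's own types, and the parser is passed in as a closure so the sketch does not depend on pulldown-cmark.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical stand-in for the crate's warning type.
#[derive(Debug, PartialEq, Eq)]
pub enum Warning {
    ExtractFailed(String),
}

/// Run `parse` over `body`; if it panics, return no blocks and a single
/// ExtractFailed warning carrying the panic message when one is available.
pub fn parse_guarded<T, F>(body: &str, parse: F) -> (Vec<T>, Vec<Warning>)
where
    F: FnOnce(&str) -> Vec<T>,
{
    // AssertUnwindSafe: the closure's state is discarded on panic,
    // so no half-updated data can be observed afterwards.
    match panic::catch_unwind(AssertUnwindSafe(|| parse(body))) {
        Ok(blocks) => (blocks, Vec::new()),
        Err(payload) => {
            // Panic payloads are usually &'static str or String.
            let msg = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown panic".to_string());
            (Vec::new(), vec![Warning::ExtractFailed(msg)])
        }
    }
}
```

This only works when the binary is built with the default `panic = "unwind"` strategy; under `panic = "abort"` the process terminates before the guard can run.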
| a86b463fc4 |
p1-2: scaffold kb-parse-md crate
Add the workspace member with the dep allow-list pinned by design §0 Q9
and the task spec. P1-2 will land the frontmatter submodule in the next
commit; P1-3 will add the block parser as a sibling.
Notable choice: serde_yaml (dtolnay) was archived as unmaintained in
2024 so we use serde_yaml_ng, the maintained fork. lingua's
per-language features are explicitly enabled (default-features=false)
to keep build time + binary size sane: only the languages we need at
parse time.
|
|||
| 7c75e10b2c |
p1-1: scaffold kb-source-fs crate (FsSourceConnector)
Walk config.workspace.root, apply gitignore-style filters
(config.workspace.exclude ∪ .kbignore ∪ baked-in defaults for
.DS_Store / ._*), stream BLAKE3 over each file, and emit a
deterministic Vec<RawAsset> sorted by workspace_path.
Modules:
- hash: streaming blake3::Hasher + 64 KiB read buffer (no whole-file
loads); pinned digests for empty input and "hello world".
- media: extension → MediaType (markdown/pdf/image/audio/other).
- walker: ignore::OverrideBuilder for filter union; walkdir with
manual visited-set cycle protection on top of follow_links.
- connector: public FsSourceConnector::new(&Config) +
SourceConnector::scan(&SourceScope) impl. Uses
kb_core::to_posix for WorkspacePath construction (carries
P0-1 # rejection through unchanged) and kb_core::id_for_asset
for AssetId derivation. Storage variant signals intent only;
actual byte copy is P1-6's responsibility.
Per design §3.3, §6.2, §6.6, §7.1, §7.2, §8.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
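The `media` module in the p1-1 commit above maps a file extension to one of five media types. A minimal sketch of that mapping follows; the `MediaType` enum mirrors the five categories named in the commit, while `media_type_for` and the specific extension lists are assumptions made for illustration.

```rust
use std::path::Path;

/// The five categories named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MediaType {
    Markdown,
    Pdf,
    Image,
    Audio,
    Other,
}

/// Classify a path by its extension, case-insensitively.
/// The extension lists below are illustrative assumptions.
pub fn media_type_for(path: &Path) -> MediaType {
    let ext = path
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());
    match ext.as_deref() {
        Some("md") | Some("markdown") => MediaType::Markdown,
        Some("pdf") => MediaType::Pdf,
        Some("png") | Some("jpg") | Some("jpeg") | Some("gif") | Some("webp") => {
            MediaType::Image
        }
        Some("mp3") | Some("wav") | Some("flac") | Some("m4a") => MediaType::Audio,
        // No extension, non-UTF-8 extension, or anything unrecognised.
        _ => MediaType::Other,
    }
}
```

Dotfiles such as `.DS_Store` have no extension according to `Path::extension`, so they fall through to `Other` even if the walker's default filters were bypassed.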
| d2c8728095 |
p0-1: address review (drop unused thiserror dep, document kb-core reserve)
- Cargo.toml: remove `thiserror` from kb-config, kb-parse-types, kb-app
(unused: none of those crates' src trees reference thiserror;
CoreError in kb-core is the only consumer).
- kb-config keeps the `kb-core` dep with a one-line comment marking
CoreError reserved for P1-* config-error wiring per the review thread.
- ids.rs: switch `validate_hex32` from a hand-rolled `matches!` byte
range to `is_ascii_hexdigit()` so the hex check is the canonical idiom
(and satisfies `clippy::manual_is_ascii_check` under `-D warnings`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
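The `validate_hex32` switch to `is_ascii_hexdigit()` mentioned above interacts with the "32 lower-hex" rule stated for the ID newtypes in the p0-1 commit: `is_ascii_hexdigit()` also accepts `A` through `F`. A minimal sketch of one way to combine the two follows; the `bool` return type and the explicit uppercase rejection are assumptions, since the commit does not show the function body.

```rust
/// Return true only for exactly 32 lowercase hexadecimal characters.
///
/// Assumed shape: the real function may return a Result instead.
pub fn validate_hex32(s: &str) -> bool {
    s.len() == 32
        && s.bytes()
            // is_ascii_hexdigit accepts 0-9, a-f and A-F, so uppercase
            // is rejected separately to keep the lower-hex invariant.
            .all(|b| b.is_ascii_hexdigit() && !b.is_ascii_uppercase())
}
```

Checking `len()` in bytes is sufficient here because any non-ASCII byte fails the hex-digit test anyway.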
| f86df99fe9 |
p0-1: workspace + kb-core domain types, traits, and ID recipe
Stand up the Cargo workspace (Rust 2024 / resolver=3) with the kb-core
crate per the frozen design (§3, §4, §7, §10). kb-core has zero
deps on other kb-* crates and exposes:
- Newtype IDs (AssetId / DocumentId / BlockId / ChunkId / EmbeddingId /
IndexId) with Display + FromStr that reject anything but 32 lower-hex.
- id_from + id_for_{asset,doc,block,chunk,embedding,index} per §4.2;
pinned hex test values computed via an independent JCS+blake3 tool.
- CanonicalDocument, Block (8 variants), SourceSpan, Inline (§3.4).
- Citation (5 variants) with W3C Media Fragments to_uri / parse;
round-trip property holds for every variant.
- Metadata + Provenance (§3.6); SearchQuery / SearchHit / RetrievalDetail
(§3.7); DocFilter / DocSummary mirrors of wire §2.5.
- Answer / AnswerCitation / RefusalReason / ModelRef (§3.8).
- IngestReport, JobRepo support types, VectorRecord / VectorHit.
- Component traits (SourceConnector / Extractor / Chunker / Embedder /
Retriever / LanguageModel / DocumentStore / VectorStore / JobRepo)
plus their input helpers (SourceScope / ExtractContext / ChunkPolicy
/ EmbeddingInput / GenerateRequest / TokenChunk / FinishReason).
- CoreError (§10).
- nfc + to_posix helpers (§4.1, §6.6).
20 unit tests cover ID determinism (1000-run regression), key-order
invariance, two pinned hex values, newtype rejection of bad input,
Citation round-trip for all 5 variants, and to_posix collapsing +
Korean NFC.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|