1125 Commits

Author SHA1 Message Date
33ec13bad7 Merge pull request 'feat(p5-1): kb-eval crate — golden-fixture runner + eval persistence' (#27) from feat/p5-1-golden-fixture-runner into main
Reviewed-on: altair823-org/kb#27
2026-05-02 02:47:21 +00:00
e6ff9c412c fix(p5-1): apply deferred review items — App reuse + expand_path hoist + nits
- kb-app: promote App to pub, add open_with_config / search / ask methods
  so kb-eval (and future TUI) can amortize embedder + vector store + LLM
  cold-start across many queries on one App instance. Memoization is
  per-instance via OnceLock; *_with_config free functions delegate.
- kb-config: add canonical expand_path helper + 8 unit tests; drop the
  4 duplicate copies in kb-store-sqlite, kb-store-vector, kb-embed-local,
  kb-eval (net: -6 duplicate tests, +8 canonical tests).
- kb-eval: extract elapsed_ms_u32 helper, drop redundant tracing debug
  log (with_context already names path on error), replace dead-port :1
  test with bind-then-release ephemeral port.
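
A minimal sketch of the two kb-eval helpers named above, using only the standard library. The saturating conversion and the function name `released_ephemeral_port` are assumptions for illustration; the real helpers may differ.

```rust
use std::net::TcpListener;
use std::time::Duration;

/// Convert an elapsed duration to whole milliseconds, saturating at
/// `u32::MAX` instead of wrapping (assumed behaviour of `elapsed_ms_u32`).
fn elapsed_ms_u32(elapsed: Duration) -> u32 {
    elapsed.as_millis().min(u32::MAX as u128) as u32
}

/// Hypothetical helper: bind to port 0 so the OS picks a free port, record
/// it, then release it. A connection to the returned port is expected to be
/// refused unless another process grabs the port in the meantime.
fn released_ephemeral_port() -> std::io::Result<u16> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let port = listener.local_addr()?.port();
    drop(listener); // release the port again
    Ok(port)
}
```

Compared with a hard-coded port such as `:1`, the bind-then-release approach does not depend on what happens to be listening on the test machine.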

Verified: cargo clippy --workspace --all-targets -- -D warnings clean,
all crate tests green (kb-app 12+3 ignored, kb-eval 11, kb-config 17,
kb-store-sqlite 33, kb-store-vector 7+8 AVX-gated, kb-embed-local 7+7).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 18:55:23 +00:00
58a11cc2b8 feat(p5-1): kb-eval crate — golden-fixture runner + eval persistence
- new kb-eval crate: load_golden_set (YAML) + run_eval (per-query search/ask + persistence)
- new kb-store-sqlite::eval module: record_eval_run_with_results (transactional), document_exists / chunk_exists probes
- fixtures/golden_queries.yaml: 5-entry KO+EN template
- tests: 13 pass (loader: parse, dup-id, missing chunk_id; runner: elapsed, snapshot, error capture, JSONL, determinism, persistence, config_snapshot)
- per_query.jsonl mirror written to runs_dir/<run_id>/
- temperature=0 + fixed seed → byte-identical per_query.jsonl (lexical)

deviations from spec (documented in code):
- run_id uses uuid::Uuid::now_v7().simple() (timestamp-ordered hex) instead of ULID — uuid already in workspace deps
- load_golden_set_validated kept #[cfg(test)] pub(crate) — production inlines validate_against_db
- snapshot fixture uses normalized projection (id/query/mode/first_hit) — full byte-determinism covered by separate test
- index_version in config_snapshot left null (composed per call by kb-app, not config-level)

deferred to follow-up:
- App reuse across queries (currently rebuilds App per query)
- expand_path hoist to kb-config (3 crate clones now)
- --max-queries flag (deferred to P5-2 per updated spec)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 18:01:09 +00:00
2c0607ae95 Merge pull request 'docs: mark P0–P4 done, add SMOKE recipe, refresh README' (#26) from docs/post-p4-doc-updates into main
Reviewed-on: altair823-org/kb#26
2026-05-01 16:37:42 +00:00
d1b99b2994 docs: mark P0–P4 done, add SMOKE recipe, refresh README
State drift after P0–P4 completion + 3 post-merge hotfixes (PR #20
--config across subcommands, PR #24 --config in kb ask, PR #25 RRF
fusion_score normalization). README still framed the project as
"spec frozen, code 0 lines"; phase docs and task specs all carried
status: planned. Sweep:

- README.md: top banner now "P0–P4 done (17/31 tasks) + 3 hotfixes
  applied"; command table marks each subcommand's owning phase and
  current status (kb ask = done via P4-3, kb eval = pending in P5);
  phase roadmap table grew a Status column (P0–P4 completed, P5
  next, P6–P9 pending); component count bumped 30 → 31 to reflect
  P3-5 (app-wiring, post-spec); core decisions table notes the
  RRF [0,1] normalization invariant; build+실행 ("build + run") section drops the
  "P0 미시작" ("P0 not started") caveat; new pointers to HOTFIXES.md and SMOKE.md.
- docs/SMOKE.md (new): ~80-line recipe for running the full
  pipeline against an isolated /tmp/kb-smoke/ workspace via
  --config, without polluting ~/.config/kb/ or
  ~/.local/share/kb/. Covers fixture seeding, sample config.toml
  with the post-merge defaults, doctor → ingest → list →
  search × 3 modes → inspect → ask sequence, verification
  checklist, and known-behaviour notes (fastembed model
  download, RAG response time, --config hard-fail on missing
  path).
- tasks/phase-{0..4}-*.md: status frontmatter flipped planned →
  completed.
- tasks/p0/, tasks/p1/, tasks/p2/, tasks/p3/, tasks/p4/: same
  status flip across all 17 component task specs (1+6+2+5+3).
  P5–P9 stay planned.

cargo test --workspace: 319 passed; clippy clean (no source
changes in this commit, just docs + frontmatter).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:32:28 +00:00
e8c295910d Merge pull request 'fix(rag): normalize RRF fusion_score to [0,1] + add post-merge hotfix log' (#25) from fix/rrf-fusion-score-normalize-and-docs into main
Reviewed-on: altair823-org/kb#25
2026-05-01 16:19:05 +00:00
9dde01eb9f fix(rag): normalize RRF fusion_score to [0,1] + log post-merge hotfixes
## Bug

config.rag.score_gate default 0.05 was incompatible with hybrid RRF
fusion_score: raw RRF tops out at num_retrievers / (k_rrf + 1) ≈
0.0328 at the default k_rrf=60, so every hybrid `kb ask` tripped
ScoreGate refusal even when the top hit was perfectly aligned across
both retrievers. Symptomatic on the post-P4-3 manual smoke at
/tmp/kb-smoke/ pointed at 192.168.0.47 Ollama:

    $ kb ask "Rust ownership 모델의 핵심 규칙은 뭐야?" --mode hybrid
    근거 부족. KB에 해당 내용 없음.        # top fusion_score = 0.0164

Per-mode score_gate (lexical_score_gate / vector_score_gate /
hybrid_score_gate) was rejected because it forces every consumer
(CLI, eval, TUI) to know which mode picks which threshold. Score
normalization solves it at the source.

## Fix

crates/kb-search/src/hybrid.rs divides every fused score by
2 / (k_rrf + 1), the theoretical RRF maximum with two retrievers
each contributing rank 1. After normalization:

- both retrievers agree on rank 1 → fusion_score = 1.0
- only one retriever finds the chunk → caps at 0.5 (rank 1 in that retriever)
- lower or mixed ranks → fall below those caps, always inside (0, 1)

RRF's rank-ordering invariants are preserved (every score divides
by the same positive constant), so sort + tiebreak behaviour is
unchanged. Wire schema label `fusion_score` keeps its slot in
RetrievalDetail; only the magnitude shifts, and only for hybrid
mode (lexical / vector were already in [0, 1]).
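
The normalization arithmetic can be sketched as follows. This is an illustration under stated assumptions, not the code in hybrid.rs: the function names are hypothetical, ranks are 1-based, and each retriever contributes 1 / (k_rrf + rank).

```rust
/// Raw reciprocal-rank-fusion score for one chunk. `ranks` holds the chunk's
/// 1-based rank in each retriever that returned it.
fn rrf_raw(ranks: &[u32], k_rrf: f64) -> f64 {
    ranks.iter().map(|&r| 1.0 / (k_rrf + f64::from(r))).sum()
}

/// Divide by the two-retriever maximum 2 / (k_rrf + 1) so the score lands
/// in [0, 1]. Dividing every score by the same positive constant keeps the
/// ordering unchanged.
fn rrf_normalized(ranks: &[u32], k_rrf: f64) -> f64 {
    rrf_raw(ranks, k_rrf) / (2.0 / (k_rrf + 1.0))
}
```

With k_rrf = 60, a chunk found at rank 1 by a single retriever has a raw score of 1/61 ≈ 0.0164, below the 0.05 gate; normalized, the same chunk scores 0.5 and passes.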

Verification: re-ran the four-scenario smoke at /tmp/kb-smoke/ with
default score_gate = 0.05 — all four (Korean→Korean, English→
English, cross-language Korean↔English, out-of-corpus) succeed
with the expected grounded / refusal classification, top
fusion_score now ≈ 0.5.

## Tests

One unit test (rrf_formula_matches_known_value) updated to expect
the normalized value `(1/61 + 1/62) / (2/61) ≈ 0.9919` instead of
the raw `1/61 + 1/62 ≈ 0.0325`. The integration snapshot fixture
crates/kb-search/tests/fixtures/search/hybrid/run-1.json already
used presence checks (fusion_score_positive: true) rather than
absolute values, so it doesn't need regeneration. Workspace 319
tests pass; clippy clean across both feature configs.

## Docs

This commit also adds tasks/HOTFIXES.md as a dated post-merge log
covering this fix and the two earlier --config-flag regressions
(P3-5 hotfix #20 across ingest/search/list/inspect/doctor; P4-3
follow-up #24 for kb ask). Original task specs in tasks/p<N>/
*.md stay frozen as the historical contract; HOTFIXES.md is the
live source of truth for post-merge deltas. Each affected task
spec gets a "Risks/notes" addendum pointing back to HOTFIXES.md
so a reader landing on the spec sees the active behaviour:

- tasks/INDEX.md gains a "Post-merge 핫픽스" (post-merge hotfixes) section.
- tasks/phase-3-vector-hybrid.md updates the RRF formula text to
  show the normalized form.
- tasks/p3/p3-4-hybrid-fusion.md "Behavior contract" RRF bullet
  notes the normalization and reason.
- tasks/p3/p3-5-app-wiring.md "Risks/notes" notes the --config
  fix.
- tasks/p4/p4-3-rag-pipeline.md "Risks/notes" notes the kb-ask
  --config fix and the score_gate-RRF incompatibility (closed by
  the normalization in p3-4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:16:01 +00:00
d7f3d38d48 Merge pull request 'fix(cli): honor --config flag in kb ask (P4-3 follow-up)' (#24) from fix/cli-ask-honor-config-flag into main
Reviewed-on: altair823-org/kb#24
2026-05-01 16:08:24 +00:00
ed8bf87c65 fix(cli): honor --config flag in kb ask (P4-3 follow-up)
The earlier P3-5 hotfix wired --config through ingest / search / list /
inspect / doctor by switching kb-cli to call the *_with_config
companions. P4-3 added the ask body but kb-cli's Cmd::Ask arm still
called bare kb_app::ask(query, opts) — same bug as before, ask
silently fell back to ~/.config/kb/config.toml regardless of what the
user passed.

Caught during the post-P4-3 smoke against /tmp/kb-smoke/ pointed at
192.168.0.47 Ollama with gemma4:26b: the answer's wire JSON reported
`model.id = "qwen2.5:14b-instruct"` (the user's XDG default) instead
of `gemma4:26b` from the explicit --config, plus the score_gate /
data_dir / model fields all reflected XDG defaults. After this fix
the same invocation correctly returns model.id=gemma4:26b,
embedding=multilingual-e5-small (from the smoke config), grounded=true
with `[#2]` citation pointing at rust/ownership.md.

Same minimal pattern as the P3-5 hotfix:
- Build the Config once via Config::load(cli.config.as_deref()).
- Call kb_app::ask_with_config(cfg, query, opts) instead of
  kb_app::ask(query, opts).
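
The bug and the fix can be illustrated with a stand-in `Config`. Everything here is hypothetical apart from the `as_deref()` threading pattern: the struct fields, the default path handling, and the `ask_buggy` / `ask_fixed` names exist only for this sketch.

```rust
use std::path::{Path, PathBuf};

/// Stand-in for the real Config; it only remembers where it was loaded from.
#[derive(Debug, PartialEq)]
struct Config {
    source: PathBuf,
}

impl Config {
    /// Use the explicit path when given, else fall back to the XDG default.
    fn load(explicit: Option<&Path>) -> Config {
        let source = explicit
            .map(Path::to_path_buf)
            .unwrap_or_else(|| PathBuf::from("~/.config/kb/config.toml"));
        Config { source }
    }
}

/// Stand-in for the parsed command line.
struct Cli {
    config: Option<PathBuf>,
}

/// Before the fix: the flag is parsed but never reaches the loader.
fn ask_buggy(_cli: &Cli) -> Config {
    Config::load(None)
}

/// After the fix: `Option<PathBuf>::as_deref` yields `Option<&Path>`.
fn ask_fixed(cli: &Cli) -> Config {
    Config::load(cli.config.as_deref())
}
```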

Workspace 319 tests pass; cargo clippy --workspace --all-targets --
-D warnings clean.

Smoke verified across four scenarios:
- Korean→Korean-body lookup: grounded with rust/ownership.md citation.
- English→Korean-body cross-language: grounded with arch/rag-
  architecture.md citation.
- Korean→English-body cross-language: grounded with arch/embeddings.md
  citation.
- Out-of-corpus query: LlmSelfJudge refusal with "근거가 부족하다."

Out of scope (filed for follow-up):
- config.rag.score_gate default 0.05 is incompatible with hybrid
  RRF scores. RRF top score is bounded by 2/(k_rrf + 1) (≈0.033 at
  k_rrf=60), so the spec default trips ScoreGate on every hybrid query.
  Workaround: lower the gate to 0.005 in the user's config; long-
  term fix needs either per-mode gate config or RRF score
  normalization to [0,1]. Tracked separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 15:56:57 +00:00
6210c52efd Merge pull request 'feat(p4-3): rag-pipeline - kb-rag crate + kb-app::ask wiring' (#23) from feat/p4-3-rag-pipeline into main
Reviewed-on: altair823-org/kb#23
2026-05-01 15:17:08 +00:00
e35b06d0d0 feat(p4-3): kb-rag crate — full RAG pipeline + kb-app::ask wired
P4 terminal task. Implements the user-facing payoff: retrieve →
score gate → pack → render → generate → cite-validate → persist.
After this commit, `kb ask` actually works against an Ollama
backend; the pipeline grounds the answer in retrieved chunks and
refuses cleanly when the gate trips or the model self-judges.

New crate kb-rag:
- pub struct RagPipeline { retriever, llm, docs, config } — all
  Arc<dyn Trait + Send + Sync> so the pipeline itself is Send + Sync.
- pub fn ask(query, opts) -> Result<Answer> drives the nine-stage
  flow per spec §1.
- pub struct AskOpts { k, explain, mode, temperature, seed,
  stream_sink: Option<mpsc::Sender<String>> }. k acts as a floor
  over config.search.default_k so a low-k caller can't starve
  retrieval (documented in field doc).

Pipeline stages:
1. Retrieve via the injected dyn Retriever.
2. Score gate: empty hits → NoChunks refusal (no LLM call); top-1 <
   config.rag.score_gate → ScoreGate refusal (no LLM call) with
   top-3 candidates listed in the synthesized answer text.
3. Pack: budget = config.rag.max_context_tokens.saturating_sub
   (prompt overhead). Per-hit `[#n] doc=… heading=… span=…\n<text>`
   with deterministic enumeration. If every hit's chunk is
   unfetchable from the store (deleted between search and pack),
   fall back to NoChunks refusal with a tracing::warn rather than
   feeding an empty [근거] to the LLM.
4. Render rag-v1 prompt with the spec's verbatim Korean system
   string + `[질문]/[근거]` user template.
5. Generate via dyn LanguageModel. Single-thread token loop owns
   the iterator; tokens optionally forward to opts.stream_sink (a
   `mpsc::Sender<String>`). SendError silently dropped — caller
   cancellation never panics the pipeline. After Done the loop
   reads (acc, finish_reason, usage) in lockstep with no race.
   max_completion = llm.context_tokens().saturating_sub
   (used_for_input).max(64) — explicitly NOT capped by
   config.rag.max_context_tokens (that's the packing budget for
   [근거], not the LM completion ceiling).
6. Citation extract via STRICT regex `\[#(\d{1,3})\]` (compiled
   once via OnceLock). Loose forms `[1]`, `[ #1 ]`, `[#foo]`,
   `[#1234]`, `vec![1]` are all rejected to prevent prose
   false-positives.
7. Citation validate covers five cases:
   - unknown marker (e.g. `[#7]` when only 3 packed) →
     LlmSelfJudge refusal.
   - empty answer with hits → LlmSelfJudge.
   - non-empty + no marker + matches `근거 (가|이) 부족` regex →
     LlmSelfJudge (model self-refused with the canonical phrase;
     phrase match logged via tracing::debug for observability).
   - non-empty + no marker + no refusal phrase → LlmSelfJudge
     (silent ungrounded answers are still refusals).
   - non-empty + ≥1 valid marker → grounded = true.
8. Build Answer per kb_core::Answer shape:
   - citations: filter packed list to exactly the markers cited.
     Wire format `marker: Some("[1]")` (square-bracketed bare
     index) per design §2.3, distinct from the prompt-side
     `[#n]` grammar.
   - embedding ModelRef: read from config.models.embedding for
     Vector/Hybrid; None for Lexical. Documented deviation since
     the Retriever trait doesn't expose the embedder. For
     ScoreGate/NoChunks refusals on Vector/Hybrid the embedding
     model is still recorded — the vector retriever WAS consulted
     even when the gate tripped.
   - TraceId minted as `ret_<8-hex>` from blake3(query, top_score,
     model_id, ns).
   - retrieval AnswerRetrievalSummary populated.
   - usage from the final Done chunk; latency_ms wall-clock
     fallback when the LLM reports zero.
   - created_at OffsetDateTime::now_utc().
9. Persist via SqliteStore::put_answer (new inherent method on
   SqliteStore, not on the DocumentStore trait — answers aren't
   documents and adding to kb-core was forbidden). Always inserts,
   refusals included. packed_chunks_json is null unless
   opts.explain == true.
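
Stages 6 and 7 can be sketched without the regex crate. This is a hand-rolled, ASCII-digit-only equivalent of the strict pattern `\[#(\d{1,3})\]`, not the pipeline's actual code. The `Verdict` type, the 1-based marker range, and the rule that any unknown marker refuses the whole answer are assumptions.

```rust
/// Collect every strict `[#n]` marker (1 to 3 ASCII digits) in `text`.
fn extract_markers(text: &str) -> Vec<u32> {
    let bytes = text.as_bytes();
    let mut out = Vec::new();
    let mut i = 0;
    while i + 1 < bytes.len() {
        if bytes[i] == b'[' && bytes[i + 1] == b'#' {
            let start = i + 2;
            let mut end = start;
            // Read up to four digits so that `[#1234]` is seen and rejected.
            while end < bytes.len() && end - start < 4 && bytes[end].is_ascii_digit() {
                end += 1;
            }
            let digits = end - start;
            if (1..=3).contains(&digits) && end < bytes.len() && bytes[end] == b']' {
                // The slice is pure ASCII digits, so it is a valid str slice.
                out.push(text[start..end].parse().unwrap());
                i = end + 1;
                continue;
            }
        }
        i += 1;
    }
    out
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Grounded(Vec<u32>),
    LlmSelfJudge,
}

/// `packed` is the number of chunks enumerated in the prompt as [#1]..[#packed].
fn validate(answer: &str, packed: u32) -> Verdict {
    let markers = extract_markers(answer);
    let all_known = markers.iter().all(|&m| m >= 1 && m <= packed);
    if markers.is_empty() || !all_known {
        Verdict::LlmSelfJudge
    } else {
        Verdict::Grounded(markers)
    }
}
```

An empty answer, an answer with no marker, and an answer citing an unpacked marker all collapse into the same refusal branch here; the real pipeline additionally distinguishes the refusal-phrase case for logging.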

kb-store-sqlite extension:
- pub fn put_answer(&Answer, query, packed_chunks_json) ->
  Result<AnswerId>. Maps all 22 fields of the answers table per
  V001 schema in a single INSERT under a transaction.

kb-app::ask wired:
- bail!("not yet wired (P4-3)") replaced with a real body that
  builds the retriever per opts.mode (Lexical | Vector | Hybrid),
  instantiates OllamaLanguageModel from config, constructs
  RagPipeline, calls pipeline.ask. AskOpts moves to kb-rag and is
  re-exported via `pub use kb_rag::AskOpts` so kb-cli's
  `use kb_app::AskOpts` keeps working.
- kb-app/Cargo.toml gains kb-rag, kb-llm, kb-llm-local. P3-5's
  forbids on these are lifted by P4-3 spec — kb-app is the
  orchestrator and ask requires both the trait crate and the
  Ollama adapter.
- kb-cli/main.rs's AskOpts literal updated with stream_sink: None
  for the CLI path (TUI in P9 will plumb a real sink).

Tests (kb-rag: 18; kb-app: 1 ignored):
- 3 unit in src/pipeline.rs: marker regex strictness (rejects all
  loose forms with byte-equal expectations), Send+Sync compile
  check, embedding_ref_for behavior across modes.
- 15 integration in tests/pipeline.rs covering every spec test row
  + the new "all chunks unfetchable falls back to NoChunks" guard:
  empty-hits, score-gate, grounded happy path, unknown-marker,
  prose-`[1]` rejection, `vec![1]` rejection, refusal-phrase,
  packing-budget overflow, streaming-forwards-to-mpsc, dropped-
  receiver-no-panic, usage-from-final-Done, answers-row-inserted-
  for-each-refusal-kind, determinism temp=0 seed=0, Answer JSON
  shape, unfetchable-chunks-fall-back-to-no-chunks (the new
  M3 test).
- kb-app/tests/ask_smoke.rs: 1 #[ignore]'d real-Ollama smoke that
  drives the wired ask end-to-end against `localhost:11434`.

Workspace: 319 passed / 26 ignored / 0 failed. cargo clippy
--workspace --all-targets -- -D warnings clean.

Allowed deps respected (kb-core, kb-config, kb-search, kb-llm,
kb-store-sqlite, serde, serde_json, regex, time, tracing,
thiserror) plus forced waivers anyhow (Retriever / LanguageModel
trait return types) and blake3 (TraceId minting). Forbidden
(kb-source-fs, kb-parse-md, kb-normalize, kb-chunk, kb-store-
vector direct, kb-embed* direct, kb-llm-local direct, kb-tui,
kb-desktop) all absent from `cargo tree -p kb-rag` — concrete
adapters reach the pipeline only through trait objects.

Out of scope: reranker between retrieve and pack (P+), multi-turn
chat memory (P+), LLM-as-judge eval (P5 uses rule-based
must_contain), --json streaming (buffers per §0 Q5 hybrid).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 15:06:10 +00:00
5db4aa6e1e Merge pull request 'feat(p4-2): ollama-adapter - kb-llm-local crate (reqwest::blocking)' (#22) from feat/p4-2-ollama-adapter into main
Reviewed-on: altair823-org/kb#22
2026-05-01 14:32:33 +00:00
3e38a9bcb4 feat(p4-2): kb-llm-local crate — Ollama HTTP adapter (reqwest::blocking)
First real LanguageModel implementation. Wraps Ollama's local HTTP
API at POST {endpoint}/api/generate with stream:true, parses the
NDJSON streaming response into TokenChunk events, and maps Ollama
error states to a thiserror-derived LlmError with actionable hints.
Synchronous trait surface; reqwest::blocking handles the HTTP I/O.

Public surface:
- pub struct OllamaLanguageModel
- pub fn new(config: &Config) -> Result<Self> — lazy connect; never
  hits the network. Spec line 96.
- pub enum LlmError { Unreachable, ModelNotPulled, Timeout, Stream,
  Malformed }. Lives in this crate per spec — kb-core / kb-llm stay
  free of error taxonomy.
- impl kb_core::LanguageModel via re-export from kb-llm.

Streaming:
- POST body shape per spec §11.2: model, prompt = system + "\n\n" +
  user, stream: true, options { temperature, seed, num_ctx, stop }.
- OllamaStream owns BufReader<reqwest::blocking::Response>, reads
  NDJSON lines via read_until(b'\n'), parses each as
  {response, done, done_reason?, prompt_eval_count?, eval_count?,
  total_duration?}. Token frame → TokenChunk::Token; done frame →
  TokenChunk::Done { finish_reason, usage }.
- done_reason mapping: "length" → Length, "abort" → Aborted,
  "stop" / missing / unknown → Stop (forward-compat with future
  Ollama tags).
- Missing prompt_eval_count / eval_count default to 0 + tracing::warn
  (do NOT fail). Spec line 135.
- EOF without a done line synthesizes Done { Aborted, zeros } so
  downstream pipelines never deadlock waiting for a terminal frame.
- UTF-8: line-delimited framing means each JSON line is a complete
  UTF-8 sequence — no cross-HTTP-chunk codepoint splits to worry
  about. read_until accumulates whole lines regardless of how the
  underlying reqwest body chunks.
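
A rough sketch of the framing and the done_reason mapping described above, with JSON parsing left out so it needs only the standard library. The `FinishReason` enum here is a local stand-in for the kb-core type, and `read_ndjson_lines` is a hypothetical name.

```rust
use std::io::BufRead;

#[derive(Debug, PartialEq, Clone, Copy)]
enum FinishReason {
    Stop,
    Length,
    Aborted,
}

/// "length" and "abort" map to their own variants; "stop", a missing field,
/// and any unknown tag all fall back to Stop.
fn map_done_reason(reason: Option<&str>) -> FinishReason {
    match reason {
        Some("length") => FinishReason::Length,
        Some("abort") => FinishReason::Aborted,
        _ => FinishReason::Stop,
    }
}

/// Accumulate whole lines with `read_until`, however the underlying reader
/// chunks the bytes. Each returned line is one complete JSON frame.
fn read_ndjson_lines<R: BufRead>(mut reader: R) -> std::io::Result<Vec<String>> {
    let mut lines = Vec::new();
    let mut buf = Vec::new();
    loop {
        buf.clear();
        if reader.read_until(b'\n', &mut buf)? == 0 {
            break; // EOF
        }
        let line = String::from_utf8_lossy(&buf).trim_end().to_string();
        if !line.is_empty() {
            lines.push(line);
        }
    }
    Ok(lines)
}
```

Because a line is decoded only after its terminating newline has arrived, a multibyte character split across two reads of the underlying body is reassembled before UTF-8 decoding.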

Error mapping (LlmError):
- reqwest::Error::is_connect() → Unreachable { endpoint, source }
  with hint "ensure `ollama serve` is running and reachable at
  <endpoint>".
- reqwest::Error::is_timeout() → Timeout.
- 200 with non-NDJSON first line (e.g., transparent-proxy HTML
  error page) → Stream(truncated body) — distinguished from
  Malformed by the iterator's has_emitted flag.
- 404 with body containing model_id (case-insensitive) OR English
  "model" + "not found" → ModelNotPulled(model_id) with hint
  "ollama pull <model_id>". Tightened beyond spec to survive
  Ollama localizing the error message (Korean / Japanese / etc.)
  while keeping the original English-substring fallback.
- Other 4xx/5xx → Stream(truncated body).
- Mid-stream JSON parse failure (after at least one valid line) →
  Malformed(line). Truncate all error bodies to 512 chars
  (chars-based, multibyte safe) so an nginx 500 page can't blow up
  the diagnostic.
- Trailing slash in endpoint stripped before formatting the URL —
  endpoint = "http://x:1234/" produces .../api/generate, not
  .../api//generate. Pinned by trailing-slash test.
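
Three of the small rules above, sketched with the standard library only. The function names and the truncation marker text are hypothetical; the 512-char cap, the trailing-slash rule, and the 404 heuristic follow the description.

```rust
/// Keep at most `cap` characters (not bytes), so multibyte text is never cut
/// mid-codepoint. The appended marker text is an assumption.
fn truncate_body(body: &str, cap: usize) -> String {
    let mut out: String = body.chars().take(cap).collect();
    if body.chars().count() > cap {
        out.push_str("...[truncated]");
    }
    out
}

/// Strip trailing slashes before appending the path, avoiding `//`.
fn generate_url(endpoint: &str) -> String {
    format!("{}/api/generate", endpoint.trim_end_matches('/'))
}

/// 404 plus either the model id anywhere in the body (case-insensitive) or
/// the English "model" + "not found" fallback.
fn looks_like_model_not_pulled(status: u16, body: &str, model_id: &str) -> bool {
    let body = body.to_lowercase();
    status == 404
        && (body.contains(&model_id.to_lowercase())
            || (body.contains("model") && body.contains("not found")))
}
```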

Tokio note: reqwest 0.12's blocking feature internally wraps a
private current-thread tokio runtime, so cargo tree --edges normal
shows tokio. The auditable invariant is "no top-level tokio dep +
no async surface exposed to callers" — verified: src/ has zero
async/await/tokio::*. default-features = false drops default-tls
(rustls only) but does NOT drop tokio. Documented honestly in
Cargo.toml + lib.rs. Switching to ureq would remove tokio
entirely; deferred since reqwest is the spec's allowed dep.

Tests (24 total: 23 default + 1 ignored):
- 7 unit in src/ollama.rs: prompt-build, options-build, finish-
  reason mapping, truncate_body bounds (under_cap / over_cap_marker
  / multibyte_chars_not_bytes), 404+model-id heuristic.
- 3 in tests/construction.rs: ModelRef shape, context_tokens
  passthrough, lazy-connect proven via port-1 pointing.
- 13 in tests/streaming.rs: streamed tokens then Done, multibyte
  chars within a line round-trip (renamed from "split across
  chunks" to honestly reflect what's tested), Unreachable-with-
  hint, 4xx→Stream, 404→ModelNotPulled, concat-equals-canned,
  done_reason length / abort, missing eval counts default to zero,
  missing done_reason defaults to Stop, determinism-by-mock,
  trailing-slash endpoint, non-NDJSON 200 body → Stream not
  Malformed.
- 1 #[ignore] in tests/integration.rs: real Ollama on
  localhost:11434 with the configured model. Opt-in via
  cargo test -p kb-llm-local -- --ignored after `ollama serve`
  + `ollama pull`.

Workspace: 288 passed / 25 ignored / 0 failed. cargo clippy
--workspace --all-targets -- -D warnings clean. No native-tls,
no openssl in the dep graph.

Allowed deps respected: kb-core, kb-config, kb-llm, reqwest 0.12
(default-features=false; blocking, json, rustls-tls), serde,
serde_json, tracing, thiserror plus anyhow (forced by trait return
type). wiremock + tokio in [dev-dependencies] only.

Out of scope: llama.cpp / candle adapters (P+), Ollama embed
endpoint (separate adapter inside kb-embed-local if requested),
cancellation / abort tokens (P+), connection-pool tuning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 14:28:34 +00:00
9ceabebf38 Merge pull request 'feat(p4-1): llm-trait - kb-llm crate + MockLanguageModel' (#21) from feat/p4-1-llm-trait into main
Reviewed-on: altair823-org/kb#21
2026-05-01 13:44:39 +00:00
27c669fbf9 feat(p4-1): kb-llm crate — LanguageModel trait re-export + MockLanguageModel
Establishes the kb-llm trait crate so concrete LLM adapters (p4-2
Ollama, future llama.cpp / candle) target a stable surface. Pure re-
export of kb_core::{LanguageModel, GenerateRequest, TokenChunk,
FinishReason, TokenUsage, ModelRef} plus a feature-gated deterministic
mock for downstream RAG tests (p4-3) that need an LLM trait object
without an Ollama dependency.

MockLanguageModel (cfg(feature = "mock"), default OFF):
- Holds canned_response + canned_finish + canned_usage + (model_id,
  provider, context_tokens). Pure in-memory; no I/O.
- generate_stream() honors GenerateRequest.stop: scans every non-empty
  stop string against the canned response, takes the earliest byte
  position (Iterator::min returns the first equal element on ties so
  declaration order in req.stop wins), truncates with a direct byte-
  slice (str::find returns a UTF-8 char boundary by contract).
- When a stop matches, finish_reason is overridden to Stop (matches
  OpenAI / Ollama real-world behaviour); otherwise the caller's
  canned_finish passes through verbatim.
- Emits one TokenChunk::Token per Unicode scalar value (char), NOT per
  grapheme cluster — Hangul jamo, emoji ZWJ sequences, combining
  marks split. Acceptable for trait-shape testing; real adapters MAY
  combine. Documented in module docs.
- Always terminates with TokenChunk::Done { finish_reason, usage } even
  if the canned response is empty. The returned iterator is a boxed
  Vec<TokenChunk>::into_iter().map(Ok), trivially Send.
- Real adapters MAY return Err from generate_stream itself (e.g.
  connection refused) before any chunk is yielded; the mock never does.
  Documented for consumers of the re-exported trait.
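
The stop-string handling and per-char streaming can be sketched as below. The function names are hypothetical and the real mock works on `GenerateRequest` and `TokenChunk` values rather than plain strings.

```rust
/// Cut `canned` at the earliest occurrence of any non-empty stop string.
/// Returns the (possibly truncated) text and whether a stop matched, in
/// which case the caller would override the finish reason to Stop.
fn truncate_at_first_stop<'a>(canned: &'a str, stops: &[&str]) -> (&'a str, bool) {
    let cut = stops
        .iter()
        .filter(|s| !s.is_empty())
        .filter_map(|s| canned.find(*s)) // byte offset on a char boundary
        .min(); // on ties, `min` keeps the first element
    match cut {
        Some(pos) => (&canned[..pos], true),
        None => (canned, false),
    }
}

/// One token per Unicode scalar value, so combining sequences are split.
fn stream_tokens(text: &str) -> Vec<String> {
    text.chars().map(|c| c.to_string()).collect()
}
```

For example, `"e\u{301}"` (an `e` followed by a combining acute accent) streams as two tokens even though it renders as a single grapheme.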

Helpers:
- assert_finish_chunk(chunks) — asserts the last chunk is a Done.
  Useful for proptests asserting trait contract over random inputs.

Tests:
- cargo test -p kb-llm (no features): 2 reexport / dyn-dispatch tests.
- cargo test -p kb-llm --features mock: 9 tests including 100-case
  proptest over random Unicode strings asserting Done terminator,
  char-count == streamed Token chunks, concat == canned (truncated by
  stop), plus explicit cases for stop-string truncation, first-stop-
  match precedence, model_ref dimensions=None invariant, finish reason
  pass-through.
- All 271 workspace tests pass; clippy clean for both default and
  mock-on feature configurations.

Symbol gating verified:
- cargo build --release -p kb-llm (default): nm shows zero
  MockLanguageModel symbols.
- cargo build --release -p kb-llm --features mock: three trait-impl
  symbols present. Spec invariant "release builds MUST NOT include
  MockLanguageModel" enforced at the symbol level.

Allowed deps respected: only kb-core (path) and anyhow (workspace,
forced by trait return type). Dropped kb-config / serde / thiserror /
tracing from the spec's allowed list — they are listed as Allowed but
nothing in this skeleton crate references them, and dropping them
keeps the dependency graph slim for downstream consumers. p4-2/p4-3
will add what they need at their own dep sites.

Forbidden deps (reqwest, ureq, tokio, whisper-rs, kb-source-fs,
kb-parse-md, kb-normalize, kb-chunk, kb-store-*, kb-embed*, kb-search,
kb-rag, kb-tui, kb-desktop) all absent from cargo tree -p kb-llm.

Out of scope: real adapter (p4-2 Ollama), token counting against the
real tokenizer, server-side cancellation / abort signals (P+).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 13:37:46 +00:00
38ff886c37 Merge pull request 'fix(cli): honor --config flag + improve search output legibility' (#20) from fix/cli-config-flag-and-search-output into main
Reviewed-on: altair823-org/kb#20
2026-05-01 13:25:41 +00:00
a08f61a242 fix(cli): honor --config flag + improve search output legibility
Two issues surfaced during the post-P3-5 manual smoke test against a
six-document workspace:

1. --config flag was silently ignored. kb-cli read cli.config only
   while building SourceScope inside the Ingest arm, then called
   kb_app::ingest(scope, summary_only) which internally re-loads
   Config::load(None) — falling back to ~/.config/kb/config.toml
   regardless of what the user passed. Same pattern in search,
   list, inspect, doctor. Users had to rely on KB_* env vars to
   point at a non-default config.

2. Search output collapsed RRF hybrid scores to "0.02" because
   `{:.2}` rounded the (0, 0.033]-bounded fused score to two decimals, and
   chunks from the same document showed up as identical lines
   ("3. 0.02  arch/rag-architecture.md") since heading_path was
   never printed.

Fix:

- kb-app: doctor/ingest/search/list/inspect already had
  *_with_config(Config, ...) seams introduced for integration tests
  (#[doc(hidden)] pub). Repurpose them as the official "config-explicit"
  API — kb-cli now builds the Config once via
  Config::load(cli.config.as_deref()) at the top of every subcommand
  and threads it into the *_with_config variant. Module doc-comment
  updated to reflect three callers (CLI --config, integration tests,
  TUI session) instead of "test-only seam".
- kb-app: doctor() rewritten as doctor_with_config_path(Option<&Path>)
  that respects an explicit path. config_loaded probe now reports the
  actual path checked, returning a clear hard error if --config points
  at a non-existent or malformed file (defaults would silently mask
  user intent). data_dir_writable resolves storage.data_dir from the
  loaded config (with env overrides applied via Config::apply_env) so
  --config users see their custom paths reflected. Original doctor()
  signature kept as a None-passing wrapper.
- kb-cli: ingest/search/list/inspect/doctor each call the
  *_with_config* companion. Search printer switches to {:.4} score
  formatting (RRF hybrid range bounded by ~2/k_rrf ≈ 0.033 at k_rrf=60
  default) and appends `> head1 / head2` when heading_path is non-
  empty so chunks from the same document are visually distinguishable.
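
The printer change can be sketched as a single formatting function. `format_hit` and its parameters are hypothetical; only the `{:.4}` precision and the `> head1 / head2` suffix come from the description above.

```rust
/// Render one search hit: rank, four-decimal score, document path, and the
/// heading path when there is one.
fn format_hit(rank: usize, score: f64, doc_path: &str, heading_path: &[&str]) -> String {
    let mut line = format!("{}. {:.4}  {}", rank, score, doc_path);
    if !heading_path.is_empty() {
        line.push_str(" > ");
        line.push_str(&heading_path.join(" / "));
    }
    line
}
```

At two decimals a raw RRF score of 0.0164 prints as `0.02`, which is why neighbouring hits looked identical before the change.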

Verified manually:
- `kb --config /tmp/kb-smoke/config.toml doctor` reports the
  custom config path + custom data_dir, not the XDG defaults.
- `kb --config /tmp/kb-smoke/config.toml search "..." --mode hybrid`
  returns hits with distinct 4-digit scores and heading paths
  ("rust/ownership.md > Rust 소유권 모델 / Borrow checker").

Workspace 269 passed / 24 ignored / 0 failed; cargo clippy
--workspace --all-targets -- -D warnings clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 12:46:37 +00:00
18bb32caef Merge pull request 'feat(p3-5): app-wiring — kb-app facade bodies for ingest / search / list / inspect' (#19) from feat/p3-5-app-wiring into main
Reviewed-on: altair823-org/kb#19
2026-05-01 12:32:07 +00:00
17d52461b2 feat(p3-5): wire kb-app facade — ingest / search / list / inspect
Replaces the P0 `bail!("not yet wired")` stubs in kb-app with real
bodies that compose the libraries shipped through P3-4. After this
commit, `cargo run -p kb-cli -- index` actually walks the workspace
and persists chunks (SQLite + optionally LanceDB), and
`cargo run -p kb-cli -- search --mode {lexical,vector,hybrid}` returns
real SearchHits with citations. `kb-app::ask` stays stubbed; P4-3
owns it.

App lifecycle (crates/kb-app/src/app.rs):
- Internal pub(crate) struct App holds the Config plus
  Arc<SqliteStore> eagerly, with embedder + LanceVectorStore behind
  OnceLock<Arc<...>> for memoization. First call pays the ~470MB
  fastembed init / Lance open; subsequent calls return the cached
  Arc::clone. OnceLock::set race losers fall back to get().cloned()
  so the lazy-init is concurrent-safe.
- One-shot CLI invocations pay the cost once at most. The P9 TUI
  (which holds an App for the session) gets memoization for free.
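
The memoization pattern described above, including the race-loser fallback, might look roughly like this. `Memo` and `get_or_try_init` are hypothetical names for illustration; the real App stores its embedder and vector store in dedicated fields.

```rust
use std::sync::{Arc, OnceLock};

/// A lazily initialized, shareable value.
struct Memo<T> {
    cell: OnceLock<Arc<T>>,
}

impl<T> Memo<T> {
    fn new() -> Self {
        Memo { cell: OnceLock::new() }
    }

    /// Return the cached Arc, or run the (expensive, fallible) initializer.
    /// If another thread wins the `set` race, discard our value and return
    /// the winner's, so every caller shares one instance.
    fn get_or_try_init<E>(&self, init: impl FnOnce() -> Result<T, E>) -> Result<Arc<T>, E> {
        if let Some(existing) = self.cell.get() {
            return Ok(Arc::clone(existing));
        }
        let fresh = Arc::new(init()?);
        match self.cell.set(Arc::clone(&fresh)) {
            Ok(()) => Ok(fresh),
            Err(_rejected) => Ok(self
                .cell
                .get()
                .cloned()
                .expect("set failed, so the cell is populated")),
        }
    }
}
```

A failed initializer leaves the cell empty, so a later call can retry; the trade-off is that two racing threads may both pay the initialization cost once.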

ingest pipeline (lib.rs):
- FsSourceConnector::scan(&scope) → per asset:
  parse_frontmatter → parse_blocks → build_canonical_document →
  MdHeadingV1Chunker.chunk → put_asset_with_bytes → put_document →
  put_blocks → put_chunks. One transaction per document per design
  §5.8 (kb-store-sqlite's put_* methods own the transactions).
- When provider != "none" and dimensions > 0: build embedder once,
  embed each doc's chunks as Document kind, ensure_table once at the
  top of the run, then upsert the VectorRecord batch. Lexical-only
  config (provider == "none") skips both — verified by
  ingest_provider_none_skips_lance test.
- Per-asset parse failures recorded as IngestItemKind::Error with
  the warning attached; the run continues. Only structural failures
  (DB unreachable etc.) abort.
- Aggregate counts (assets_scanned / new / updated / skipped /
  errors / chunks_indexed / embeddings_indexed / duration_ms) flow
  into both the JobRepo progress_json AND a dedicated ingest_runs
  row written via SqliteStore::record_ingest_run (new pub
  helper added to kb-store-sqlite; see below).
  summary_only=true writes items_json=NULL but still populates the
  count columns.

search dispatch:
- SearchMode::Lexical → LexicalRetriever directly.
- SearchMode::Vector → VectorRetriever with embedder + LanceVectorStore.
- SearchMode::Hybrid → HybridRetriever composing the two.
- Vector / Hybrid with provider=none returns a clear error naming the
  config key to flip ("models.embedding.provider").
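
A small sketch of the dispatch rule above. `Backend` and `pick_backend` are hypothetical stand-ins for the retriever construction; only the mode names and the config key come from the commit message.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SearchMode {
    Lexical,
    Vector,
    Hybrid,
}

/// Hypothetical marker for which retriever would be built.
#[derive(Debug, PartialEq)]
pub enum Backend {
    Lexical,
    Vector,
    Hybrid,
}

/// Chooses a backend, refusing embedding-dependent modes when the
/// embedding provider is "none".
pub fn pick_backend(mode: SearchMode, embedding_provider: &str) -> Result<Backend, String> {
    let needs_embedder = matches!(mode, SearchMode::Vector | SearchMode::Hybrid);
    if needs_embedder && embedding_provider == "none" {
        return Err(format!(
            "{mode:?} search needs an embedder; set models.embedding.provider \
             to something other than \"none\""
        ));
    }
    Ok(match mode {
        SearchMode::Lexical => Backend::Lexical,
        SearchMode::Vector => Backend::Vector,
        SearchMode::Hybrid => Backend::Hybrid,
    })
}
```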

list_docs / inspect_doc / inspect_chunk delegate straight to
DocumentStore trait methods. Returns Err with actionable message on
not-found.

Test seam: each public free function has a matching
#[doc(hidden)] pub fn *_with_config(cfg, ...) companion that
integration tests invoke directly (the public form internally calls
load_config()). pub(crate) would not reach across the integration-
tests crate boundary; #[doc(hidden)] keeps it out of rustdoc and the
function comment flags it as test-only.

kb-store-sqlite additions:
- pub struct IngestRunRow + pub fn record_ingest_run on SqliteStore
  for the kb-app aggregate-counts persistence path. Helper writes
  the ingest_runs row directly with all aggregate columns; jobs
  table still gets a JobRepo create/update_progress/finish trio in
  parallel.

Tests (11 default, 2 #[ignore] AVX-gated):
- ingest_lexical: round-trip, idempotent, summary_only_drops_items,
  provider_none_skips_lance (asserts no .lance dir on disk),
  records_ingest_runs_row_with_aggregate_counts, tags_any filter,
  inspect_doc_not_found, inspect_chunk_not_found.
- search_lexical: lexical hits with embedding_model=None,
  empty_query_returns_empty, vector_mode_with_provider_none returns
  clear error.
- search_vector: hybrid mode end-to-end (#[ignore], AVX), Vector
  mode embedding_model assertion (#[ignore], AVX). Both run on the
  AVX VM in ~21s combined (first run pays the model download).
- TestEnv pins workspace.root + storage.{data_dir,model_dir} to a
  TempDir so tests don't touch the user's $HOME/.local/share.
- Fixture workspace at crates/kb-app/tests/fixtures/workspace/ has
  three small markdown files with varied frontmatter (rust+cargo+
  python tags) so the tags_any filter test exercises a non-trivial
  predicate.

Workspace 269 passed / 24 ignored / 0 failed (was 261/22). cargo
clippy --workspace --all-targets -- -D warnings clean. CLI smoke
verified manually: `cargo run -p kb-cli -- index` returns a real
IngestReport JSON; `cargo run -p kb-cli -- search "..."` returns
hits with citations; `cargo run -p kb-cli -- list docs` lists the
indexed documents.

Allowed deps respected: kb-source-fs, kb-parse-md, kb-parse-types,
kb-normalize, kb-chunk, kb-store-sqlite, kb-search, kb-store-vector,
kb-embed, kb-embed-local plus existing tracing / anyhow / serde /
toml / dirs and now blake3 (run_id) + time. Forbidden (kb-llm*,
kb-rag, kb-tui, kb-desktop, kb-parse-{pdf,image,audio}) absent from
cargo tree -p kb-app.

Out of scope per spec: ask body (P4-3), --rebuild-fts wiring,
--resume checkpointing (P+), --watch (P+), TUI / desktop integration
(P9 consumes this facade).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 12:11:21 +00:00
f782db5e51 Merge pull request 'docs(p3-5): add app-wiring task spec; INDEX + phase-3 updates' (#18) from docs/p3-5-app-wiring-spec into main
Reviewed-on: altair823-org/kb#18
2026-05-01 11:33:33 +00:00
78bf78c8be docs(p3-5): add app-wiring task spec; INDEX + phase-3 updates
P1–P3 shipped libraries but kb-app facade is still all `bail!("not yet
wired")` stubs, so the CLI is structurally complete but unusable.
Insert p3-5 between P3-4 and P4-1 to swap the facade bodies — ingest,
search, list_docs, inspect_doc, inspect_chunk — into real
compositions of the libraries shipped through P3-4. `ask` stays
stubbed; P4-3 owns it.

After p3-5 merges:
- `kb index` walks a workspace and persists chunks (SQLite +
  optionally LanceDB).
- `kb search --mode {lexical,vector,hybrid}` returns real SearchHits
  with citations.
- `kb list` / `kb inspect doc|chunk` round-trip from the store.

Updates:
- New task spec at tasks/p3/p3-5-app-wiring.md (depends_on
  p1-6/p2-2/p3-2/p3-3/p3-4; unblocks p4-3/p9-1/p9-2/p9-4).
- tasks/INDEX.md bumps P3 component count 4 → 5 and adds the link.
- tasks/phase-3-vector-hybrid.md replaces the speculative
  `embed_index` facade signature with the actual frozen kb-app
  surface and updates the phase completion checklist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 11:32:42 +00:00
d32e54622c Merge pull request 'feat(p3-4): hybrid-fusion — VectorRetriever + HybridRetriever (RRF)' (#17) from feat/p3-4-hybrid-fusion into main
Reviewed-on: altair823-org/kb#17
2026-05-01 11:26:17 +00:00
ccd49ef546 feat(p3-4): hybrid-fusion — VectorRetriever + HybridRetriever (RRF)
Composes the existing LexicalRetriever (P2-2) with a new VectorRetriever
wrapper around LanceVectorStore (P3-3) into a single Retriever that
dispatches by SearchMode. For SearchMode::Hybrid, fuses lexical and
vector candidates via Reciprocal Rank Fusion and populates the full
RetrievalDetail per SearchHit so kb search --explain can attribute
scores back to each side.

Public surface (kb-search crate):
- pub struct VectorRetriever — Arc<dyn VectorStore + Send + Sync>,
  Arc<dyn Embedder>, Arc<SqliteStore>, IndexVersion at construction.
- pub struct HybridRetriever { lexical, vector, fusion, k }.
- pub enum FusionPolicy { Rrf { k_rrf: u32 } }.

VectorRetriever:
- Embeds query.text as EmbeddingKind::Query before delegating to
  VectorStore::search(query_vec, query.k * 2, &query.filters). Over-
  fetches by ×2 for filter losses; LanceVectorStore applies the
  filters internally so they propagate naturally.
- Hydrates each VectorHit into a full SearchHit by joining on
  chunk_id in a single IN-clause batch (no N+1): doc_path,
  section_label, chunker_version, source_spans for citation, plus
  embedding_model from embedder.model_id().
- Snippet trimmed to config.search.snippet_chars (vector mode lacks
  FTS5 highlighting; chunk text prefix is the next-best signal).
- Citation built from the chunk's first source span via the shared
  citation_helper module — extracted from lexical.rs so both
  retrievers compute citations identically (Byte/empty fallback to
  Line{1,1} preserved with tracing::warn).
- RetrievalDetail.method = Vector for standalone calls; both
  fusion_score and vector_score set to the LanceVectorStore-shifted
  cosine score; lexical_* None.

HybridRetriever:
- Lexical / Vector modes delegate 1:1 — no rebuild of RetrievalDetail.
- Hybrid mode runs both retrievers with k * 2 fanout, fuses with
  RRF (score(c) = Σ 1/(k_rrf + rank_m(c))), sorts fused-score DESC
  with deterministic tiebreaker (lex_rank ASC then chunk_id ASC),
  takes top query.k. Fusion math runs in f64 throughout; cast to
  f32 only at the SearchHit boundary where bounded magnitude (≤
  ~0.033 at k_rrf=60) makes f32 precision sufficient for ranking.
- Per-hit lexical preferred for snippet/citation/heading_path/
  chunker_version/embedding_model when the chunk appears in both
  retrievers — FTS5 highlighting is more user-relevant than vector's
  truncated text. Vector-only chunks fall through to vector hit data.
- index_version returns format!("hybrid:{}+{}", lex_iv, vec_iv) at
  construction; mismatched lex/vec versions trigger a tracing::warn
  so users notice stale indexes (spec line 143).
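
The fusion step above can be sketched in std-only Rust. `FusedHit` and `rrf_fuse` are hypothetical names; the sketch assumes each input list is already ranked best-first with no duplicate ids, and it treats a chunk with no lexical rank as sorting after any chunk that has one (an assumption about how the tiebreaker handles vector-only hits).

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct FusedHit {
    pub chunk_id: String,
    pub fused_score: f64,
    pub lex_rank: Option<usize>,
    pub vec_rank: Option<usize>,
}

/// Reciprocal Rank Fusion: score(c) = sum over sides of 1 / (k_rrf + rank(c)),
/// with 1-based ranks. All math stays in f64.
pub fn rrf_fuse(lexical: &[&str], vector: &[&str], k_rrf: u32, top_k: usize) -> Vec<FusedHit> {
    let mut by_id: HashMap<String, FusedHit> = HashMap::new();
    let sides: [(&[&str], bool); 2] = [(lexical, true), (vector, false)];
    for (ranked, is_lexical) in sides {
        for (idx, id) in ranked.iter().enumerate() {
            let rank = idx + 1;
            let hit = by_id.entry(id.to_string()).or_insert_with(|| FusedHit {
                chunk_id: id.to_string(),
                fused_score: 0.0,
                lex_rank: None,
                vec_rank: None,
            });
            hit.fused_score += 1.0 / (f64::from(k_rrf) + rank as f64);
            if is_lexical {
                hit.lex_rank = Some(rank);
            } else {
                hit.vec_rank = Some(rank);
            }
        }
    }
    let mut fused: Vec<FusedHit> = by_id.into_values().collect();
    // Fused score DESC, then lexical rank ASC, then chunk_id ASC.
    fused.sort_by(|a, b| {
        b.fused_score
            .total_cmp(&a.fused_score)
            .then_with(|| {
                a.lex_rank
                    .unwrap_or(usize::MAX)
                    .cmp(&b.lex_rank.unwrap_or(usize::MAX))
            })
            .then_with(|| a.chunk_id.cmp(&b.chunk_id))
    });
    fused.truncate(top_k);
    fused
}
```

With k_rrf = 60 the largest possible fused score is 2/61 (rank 1 on both sides), which is where the roughly 0.033 bound comes from.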

kb-search additions:
- citation_helper.rs — pub(crate) citation_from_first_span shared
  between lexical and vector retrievers. Extracted from lexical.rs;
  no behavior drift.

Tests (38 default + 3 ignored):
- 12 unit tests in hybrid.rs covering RRF math (1/61 + 1/62 within
  f32 epsilon × 10 tolerance), lexical/vector mode delegation, hybrid
  preserves single-side hits with the missing side's RetrievalDetail
  None, deterministic tiebreaker on identical fused scores, composite
  index_version, mismatched-version warn at construction.
- 2 unit tests in vector.rs covering the snippet-prefix and citation
  fallback paths.
- 11 unit tests in lexical.rs (unchanged from P2-2).
- 13 lexical integration tests (unchanged).
- 3 #[ignore] AVX-gated hybrid integration tests: disjoint-corpus
  recall (lex returns A,B; vec returns C,D; hybrid returns all 4),
  determinism over two queries, snapshot stability against
  tests/fixtures/search/hybrid/run-1.json. Snapshot fixture was
  regenerated against this branch on an AVX-enabled VM and contains
  4 real chunks (c1/c2 lex+vec, c3/c4 vec-only).
- KB_UPDATE_SNAPSHOTS=1 path now panics after writing instead of
  silently passing — matches the P3-2/P3-3 fail-loud-instead-of-
  silent-pass philosophy.

Allowed deps respected (kb-core, kb-config, kb-store-sqlite,
kb-store-vector, kb-embed, tracing, thiserror) plus pre-existing
kb-search deps from P2-2 (rusqlite, globset, serde_json, anyhow).
kb-embed-local does NOT appear — VectorRetriever takes Arc<dyn Embedder>
trait object; the concrete adapter is runtime-injected by kb-app.

Out of scope: reranker (P+), score calibration across modes (RRF is
rank-comparable so absolute calibration is P+), multimodal retrieval
(P6+).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 11:22:21 +00:00
60a0290d85 Merge pull request 'feat(p3-3): lancedb-store — kb-store-vector + V003 embedding status migration' (#16) from feat/p3-3-lancedb-store into main
Reviewed-on: altair823-org/kb#16
2026-05-01 11:01:32 +00:00
dd42740cc0 test(p3-3): pin LanceDB snapshot fixture from AVX-capable host
Replaces the placeholder run-1.json with the captured Vec<VectorHit>
from `cargo test -p kb-store-vector --test snapshot -- --ignored` on
an AVX2-capable VM (host-passthrough CPU model). Verified by re-
running the same ignored lane and asserting against the pinned
fixture.

Full ignored lane on AVX hardware:
- upsert_search.rs: 8 / 8 pass (ensure_table idempotent, search-empty,
  upsert+search, dim-mismatch, tags filter, model isolation,
  determinism, crash-recovery promotes pending → committed).
- snapshot.rs: 1 / 1 pass against the pinned fixture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:58:10 +00:00
3cd5117a7e feat(p3-3): kb-store-vector — LanceDB VectorStore + V003 embedding status
First VectorStore implementation. Per-model Lance tables under
config.storage.vector_dir, two-phase upsert (SQLite-pending → Lance
MergeInsert → SQLite-committed) with crash-safe retry, search via
cosine distance with the spec's score-shift (preserves negative
similarity ranking signal that clamping would crush).
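
To illustrate the crash-safe retry property of the two-phase upsert, here is an in-memory simulation. It is only a sketch: two `HashMap`s stand in for the SQLite `embedding_records` table and the Lance table, and `TwoPhaseStore` with its `simulate_vector_failure` flag is hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Status {
    Pending,
    Committed,
}

#[derive(Default)]
pub struct TwoPhaseStore {
    /// Stand-in for the SQLite bookkeeping rows, keyed by chunk id.
    pub records: HashMap<String, Status>,
    /// Stand-in for the vector table, keyed by chunk id.
    pub vectors: HashMap<String, Vec<f32>>,
}

impl TwoPhaseStore {
    pub fn upsert(
        &mut self,
        batch: &[(String, Vec<f32>)],
        simulate_vector_failure: bool,
    ) -> Result<(), String> {
        // Phase 1: mark every row pending (re-running resets it to pending).
        for (id, _) in batch {
            self.records.insert(id.clone(), Status::Pending);
        }
        // Phase 2: keyed write into the vector table (idempotent on retry).
        if simulate_vector_failure {
            return Err("vector write failed; rows stay pending".to_string());
        }
        for (id, vector) in batch {
            self.vectors.insert(id.clone(), vector.clone());
        }
        // Phase 3: promote only rows that are still pending.
        for (id, _) in batch {
            if let Some(status) = self.records.get_mut(id) {
                if *status == Status::Pending {
                    *status = Status::Committed;
                }
            }
        }
        Ok(())
    }

    /// Search-side view: only committed rows are ever visible.
    pub fn visible_ids(&self) -> Vec<String> {
        let mut ids: Vec<String> = self
            .records
            .iter()
            .filter(|(_, status)| **status == Status::Committed)
            .map(|(id, _)| id.clone())
            .collect();
        ids.sort();
        ids
    }
}
```

A failed phase 2 leaves the batch pending and invisible; calling `upsert` again with the same batch completes it.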

V003 migration:
- Adds status (CHECK constraint pending|committed|tombstone, default
  pending) and vector_committed columns to embedding_records.
- BEFORE DELETE trigger on chunks flips dependent rows to tombstone.
  Currently overshadowed by V001's ON DELETE CASCADE FK; trigger UPDATE
  runs first then row vanishes via CASCADE. Spec-faithful tombstone
  preservation requires recreating embedding_records to drop the
  CASCADE — deferred to a P+ migration since no production rows exist
  yet (P3-3 is the first writer). V003 SQL comment explains.

LanceVectorStore:
- ensure_table is idempotent: opens existing or creates with the
  Arrow schema (chunk_id, doc_id, embedding FixedSizeList<Float32,
  dim>, model_id, embedding_version, text, heading_path, created_at).
- IndexId computed via id_for_index with collection="chunk_embeddings",
  index_kind="flat", params_hash = blake3(descriptor JSON). Schema
  bumps automatically rotate the IndexId.
- upsert: phase-1 INSERT OR REPLACE INTO embedding_records (status=
  'pending') in a single SQLite tx; phase-2 Lance MergeInsert keyed
  on chunk_id (idempotent re-run); phase-3 UPDATE status='committed',
  vector_committed=1. If phase-2 fails the rows stay 'pending' and
  the next upsert call retries idempotently.
- search joins embedding_records WHERE status='committed' so partial-
  write rows never surface. Cosine distance from Lance ∈ [0, 2] →
  similarity = 1 - distance ∈ [-1, 1] → score = (similarity + 1)/2 ∈
  [0, 1]. NaN coerced to 0 with tracing::warn. Filter by SearchFilters
  via SqliteStore::filter_chunks (added in this commit).
- Sync trait + async LanceDB bridged by an embedded current-thread
  tokio runtime. Doc-comment on the struct flags the "do NOT call
  from inside another tokio runtime" panic (block_on cannot nest).
  kb-app's job scheduler is sync today.
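
The score shift in the search bullet above is small enough to write out. `shifted_score` is a hypothetical name, and the sketch leaves out the `tracing::warn` on NaN.

```rust
/// Maps a cosine distance in [0, 2] to a score in [0, 1]:
/// similarity = 1 - distance, score = (similarity + 1) / 2.
/// NaN is coerced to 0.0.
pub fn shifted_score(cosine_distance: f32) -> f32 {
    let similarity = 1.0 - cosine_distance;
    let score = (similarity + 1.0) / 2.0;
    if score.is_nan() {
        0.0
    } else {
        score
    }
}
```

Because the mapping is affine, two hits with negative similarity keep their relative order, whereas clamping the similarity at zero would give both the same score.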

kb-store-sqlite additions:
- pub fn put_embedding_records_pending(&[EmbeddingRecordRow]) — phase-1
  INSERT OR REPLACE (status='pending', vector_committed=0).
- pub fn mark_embedding_records_committed(&[EmbeddingId]) — phase-3
  single UPDATE … WHERE embedding_id IN (?, ?, …) via
  params_from_iter, guarded by WHERE status='pending' so tombstones
  don't get clobbered.
- pub fn filter_chunks(&[ChunkId], &SearchFilters) → Vec<ChunkId>
  consolidates the JOIN against documents/document_tags/
  embedding_records + path_glob via globset. Lets kb-store-vector
  honor SearchFilters without depending on rusqlite or globset
  directly. (kb-search's filter logic is structurally different —
  interleaved with the FTS5 SELECT — so it stays as-is for now;
  consolidation is a P+ refactor.)
- 4 new unit tests cover the phase-1 round-trip, empty batch,
  replay reset of pending rows, and the WHERE-status-pending guard.

Tests:
- 9 lib unit tests in kb-store-vector covering paths/sanitization,
  arrow_batch dim validation + descriptor hash, bm25-style cosine
  score shift math.
- 4 new kb-store-sqlite unit tests on filter_chunks (committed-only,
  tags/lang/trust/path_glob, order preservation, empty input).
- 4 new kb-store-sqlite unit tests on the embedding_records helpers.
- 8 integration tests in upsert_search.rs and 1 snapshot test marked
  #[ignore = "requires AVX-capable hardware (LanceDB)"]. They invoke
  require_avx_or_panic() at the top of each body so a missing-AVX
  --ignored run fails loudly instead of silently passing. This dev
  host (qemu64 model) lacks AVX so these were NOT exercised end-to-
  end here — first CI lane on AVX hardware will validate them.
- Snapshot fixture tests/fixtures/vector/run-1.json is a placeholder
  with an _comment marker. Snapshot test panics until the placeholder
  is replaced via KB_UPDATE_SNAPSHOTS=1 on AVX hardware.
- Workspace 241 passed, 19 ignored, 0 failed; cargo clippy --workspace
  --all-targets -- -D warnings clean.

Allowed deps respected (kb-core, kb-config, kb-store-sqlite, lancedb,
arrow + arrow-array + arrow-schema, serde, serde_json, tracing,
thiserror) plus forced waivers — anyhow (trait return type), tokio
+ futures (LanceDB async-only API), blake3 (params_hash). rusqlite
and globset are NOT direct deps of kb-store-vector — confirmed via
cargo metadata --no-deps. rusqlite stays in [dev-dependencies] for
the test fixture seeder only.

Out of scope: IVF/PQ index tuning (P+), image vectors (P6), kb-app
embed_index orchestration (P3-4 facade).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:01:31 +00:00
9beef930b4 Merge pull request 'feat(p3-2): fastembed-adapter — kb-embed-local crate + FastembedEmbedder' (#15) from feat/p3-2-fastembed-adapter into main
Reviewed-on: altair823-org/kb#15
2026-05-01 08:42:06 +00:00
bcbe2b8531 feat(p3-2): kb-embed-local crate — fastembed adapter for multilingual-e5-small
First real Embedder implementation. Wraps fastembed-rs (ONNX runtime)
with the e5 prefix convention, batching, and {data_dir}/${XDG_DATA_HOME}
template expansion so model files land under config.storage.model_dir/
fastembed/ without polluting kb-config's public API.

Public surface:
- pub struct FastembedEmbedder
- pub fn new(config: &Config) -> Result<Self>
- impl kb_core::Embedder (via kb-embed re-export)

Behavior:
- Default model multilingual-e5-small (384 dims). model_id and
  model_version come from config.models.embedding.{model,version}.
- Pre-load dim check via TextEmbedding::get_model_info: dim mismatch
  bails before paying the ~470MB ONNX init cost.
- e5 prefix applied BEFORE tokenization: "passage: " for
  EmbeddingKind::Document, "query: " for EmbeddingKind::Query. Pinned
  by prefix_input unit tests.
- Batches inputs into chunks of config.models.embedding.batch_size,
  concatenates results in input order.
- L2 normalization is performed by fastembed 4.9's default transformer
  pipeline (verified at fastembed/src/text_embedding/output.rs:43);
  we skip re-normalization. Integration test pins ‖v‖ ≈ 1.0 ± 1e-3 so
  a future fastembed bump that drops this invariant fails loudly.
- Synchronous (no async runtime). Mutex serializes calls into the
  underlying ONNX session — conservative; ORT Session is Send+Sync but
  callers (kb-app indexer) batch sequentially anyway. Revisit if
  profiling shows contention.
- First-run model download surfaces via tracing::info before/after
  TextEmbedding::try_new — users no longer stare at a silent 30-60s
  pause during the 470MB pull.
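
A sketch of the prefixing and batching behavior above. The two prefixes are the ones named in the commit; `embed_batched` and its `embed_batch` callback are hypothetical stand-ins for the call into the real model, and the local `EmbeddingKind` enum only mirrors the kb-core one for self-containment.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum EmbeddingKind {
    Document,
    Query,
}

/// Applies the e5 prefix before the text reaches the tokenizer.
pub fn prefix_input(kind: EmbeddingKind, text: &str) -> String {
    match kind {
        EmbeddingKind::Document => format!("passage: {text}"),
        EmbeddingKind::Query => format!("query: {text}"),
    }
}

/// Splits `inputs` into batches of `batch_size`, embeds each batch with the
/// supplied callback, and concatenates the results in input order.
pub fn embed_batched<F>(
    inputs: &[(EmbeddingKind, &str)],
    batch_size: usize,
    mut embed_batch: F,
) -> Vec<Vec<f32>>
where
    F: FnMut(&[String]) -> Vec<Vec<f32>>,
{
    let mut out = Vec::with_capacity(inputs.len());
    // `chunks` panics on a size of 0, so treat 0 as 1.
    for batch in inputs.chunks(batch_size.max(1)) {
        let prefixed: Vec<String> = batch
            .iter()
            .map(|(kind, text)| prefix_input(*kind, text))
            .collect();
        out.extend(embed_batch(&prefixed));
    }
    out
}
```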

Tests:
- 11 default-lane tests covering: check_dim match/mismatch (no model
  load), prefix_input Document/Query/empty, resolve_model
  known/unknown, expand_path substitution + no-op + XDG_DATA_HOME set
  + XDG_DATA_HOME unset (falls back to ~/.local/share with recursive
  ~ expansion). XDG tests serialize on a Mutex + RAII guard since
  edition 2024 makes set_var/remove_var unsafe.
- 7 #[ignore] integration tests covering: full construction with
  default config, dim-mismatch belt-and-braces, Document vs Query
  cosine differential, L2 unit norm, byte-equal determinism, batch-64
  performance under 5s, snapshot-hash stability over a 5-sentence
  multilingual fixture.
- Snapshot test fails LOUDLY when SNAPSHOT_HASH_BASELINE is 0 — prints
  the captured hash and panics with paste-back instructions, so first
  --ignored run forces the maintainer to pin the baseline rather than
  silently passing.
- Workspace: 222 tests pass (default lane); clippy clean.

Allowed deps respected: kb-config, kb-embed (re-exports kb-core
trait surface), fastembed = "4.9", tracing, anyhow. tokenizers and
ort enter transitively through fastembed; reqwest/hyper/hf-hub also
transitive (model download is fastembed's responsibility per spec
carve-out). No direct kb-core dep needed — re-exports cover it.

Pinned to fastembed 4.x rather than the recent 5.x to limit blast
radius; consider bump when p3-3 (lancedb-store) consumes the embedder
output shape.

Out of scope: reranker (P+), Ollama embedding endpoint, candle
adapter, image embeddings (P6).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 08:39:38 +00:00
9f2afc73dc Merge pull request 'feat(p3-1): embedder-trait — kb-embed crate + MockEmbedder' (#14) from feat/p3-1-embedder-trait into main
Reviewed-on: altair823-org/kb#14
2026-05-01 08:21:37 +00:00
2e3eb8f437 feat(p3-1): kb-embed crate — Embedder trait re-export + MockEmbedder
Establishes the kb-embed trait crate so concrete embedding adapters
(p3-2 fastembed, future ollama-embed/candle) target a stable surface.
Pure re-export of kb_core::{Embedder, EmbeddingInput, EmbeddingKind,
EmbeddingModelId, EmbeddingVersion} plus a feature-gated deterministic
mock for downstream tests.

MockEmbedder (cfg(feature = "mock"), default OFF):
- Per-component hash recipe: blake3(seed_le8 || kind_byte ||
  text_len_le8 || text || i_le8). Length-prefixed text avoids the
  domain-separation ambiguity where two (text, i) pairs could shift
  bytes between text tail and the i field.
- Document = 0u8, Query = 1u8 — same text different kind yields
  different vectors (mirrors e5 prefix behaviour).
- Per component: blake3 first 8 bytes → u64 → reinterpret as i64 →
  f64/i64::MAX → f32. i64::MIN gives -1.0000000000000002 which f32
  rounds to -1.0; range [-1, 1] holds.
- L2 unit-normalised. Norm sums in f64 (avoid catastrophic precision
  loss) before f32 cast. Zero-norm guard skips the divide.
- with_seed(...) constructor lets two embedders share identity but
  produce different vectors — useful for downstream parametric tests.
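
The byte layout and numeric mapping above can be sketched without the hash itself (blake3 is an external crate, so the hashing step is omitted here). Function names are hypothetical, and reading the 8 hash bytes as little-endian is an assumption; the commit does not state the byte order.

```rust
/// Builds the per-component hash input:
/// seed_le8 || kind_byte || text_len_le8 || text || i_le8.
pub fn preimage(seed: u64, kind_byte: u8, text: &str, i: u64) -> Vec<u8> {
    let mut buf = Vec::with_capacity(8 + 1 + 8 + text.len() + 8);
    buf.extend_from_slice(&seed.to_le_bytes());
    buf.push(kind_byte); // 0 = Document, 1 = Query
    buf.extend_from_slice(&(text.len() as u64).to_le_bytes());
    buf.extend_from_slice(text.as_bytes());
    buf.extend_from_slice(&i.to_le_bytes());
    buf
}

/// Maps the first 8 hash bytes to a component in [-1, 1]:
/// u64 -> reinterpret as i64 -> divide by i64::MAX in f64 -> f32.
pub fn component_from_prefix(first8: [u8; 8]) -> f32 {
    let signed = u64::from_le_bytes(first8) as i64; // assumed byte order
    (signed as f64 / i64::MAX as f64) as f32
}

/// L2-normalises in place. The norm is accumulated in f64;
/// a zero vector is left untouched.
pub fn l2_normalize(v: &mut [f32]) {
    let norm = v
        .iter()
        .map(|x| f64::from(*x) * f64::from(*x))
        .sum::<f64>()
        .sqrt();
    if norm == 0.0 {
        return;
    }
    for x in v.iter_mut() {
        *x = (f64::from(*x) / norm) as f32;
    }
}
```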

Helpers:
- assert_vector_shape(vecs, dims) — len + finite check.
- assert_unit_norm(vecs, tolerance) — caller-supplied tolerance;
  5e-4 documented as safe for dims=384 under f32 epsilon × √dims.

Tests:
- cargo test -p kb-embed (no features): 2 reexport/dyn-dispatch tests.
- cargo test -p kb-embed --features mock: 7 tests including 100-case
  proptest asserting len == dims, all finite, ‖v‖ ≈ 1.0 within
  tolerance, Doc(text) byte-equal Doc(text), Doc(text) ≠ Query(text),
  Doc(text1) ≠ Doc(text2).
- All 220 workspace tests pass; clippy clean for both default and
  mock-on feature configurations.

Symbol gating: nm on the release rlib confirms zero MockEmbedder
symbols under default features; three trait impl symbols under
--features mock. Spec invariant "release builds MUST NOT include
MockEmbedder" verified at the symbol level.

Allowed deps respected: kb-core, kb-config, serde, thiserror, tracing,
plus anyhow (forced by trait return type) and blake3 (justified by
the determinism contract; already in workspace lockfile via kb-core).
No fastembed/ort/tokenizers anywhere.

Out of scope: real adapter (p3-2), reranker traits (P+).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 08:15:44 +00:00
ed6a595672 Merge pull request 'feat(p2-2): lexical-retriever — kb-search crate + LexicalRetriever (FTS5 + bm25)' (#13) from feat/p2-2-lexical-retriever into main
Reviewed-on: altair823-org/kb#13
2026-05-01 08:03:22 +00:00
b335151d18 feat(p2-2): kb-search crate + LexicalRetriever (FTS5 + bm25)
Adds the first concrete kb_core::Retriever, exercising chunks_fts (P2-1)
to answer SearchMode::Lexical queries. Returns Vec<SearchHit> with
bm25-derived ranking, snippet() previews, and W3C-fragment-style
Citation built from the chunk's first source_spans entry.

New crate kb-search:
- LexicalRetriever::new(Arc<SqliteStore>, IndexVersion).
- search() builds an FTS5 MATCH expression by escaping every whitespace
  token into a quoted literal (inner " doubled); single-quote-wrapped
  text passes through verbatim as raw FTS5 syntax. Empty query
  short-circuits to Ok(vec![]).
- bm25 normalization: score = -bm25 / (1 + |bm25|), bounded in (0, 1) for
  any FTS5-returned negative bm25.
- Snippet via snippet(chunks_fts, 3, '', '', '…', word_budget) where
  word_budget = snippet_chars / 4 clamped to [1, 64]; trim_snippet
  enforces the char cap on the way out (chars per design §6.4 — accepts
  the combining-mark trade-off).
- Citation from chunks.source_spans_json first span: Line / Page /
  Region / Time forwarded; Byte / empty array fall back to Line{1,1}
  with a tracing::warn so forward-compat regressions surface.
- Filters: tags_any (subquery on document_tags), lang (= column),
  trust_min (CASE-rank in SQL) all applied at SQL level. path_glob
  uses globset with literal_separator(true) — guarantees '*' does not
  cross '/' per spec Risks/notes — applied as Rust post-filter with
  +128 row over-fetch when set, then rank reassigned 1..k contiguously.
- Determinism: ORDER BY score, f.chunk_id (lexicographic blake3 hex
  tiebreaker on identical bm25). Tested explicitly with two chunks of
  identical text content.
- RetrievalDetail: method=Lexical, both lexical_score and fusion_score
  set, vector_* None.
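
The query escaping and score normalization above can be sketched as two pure functions. `build_match_expr` and `normalize_bm25` are hypothetical names; stripping the wrapping single quotes from a raw query (and rejecting an empty raw body) is an assumption, since the commit only says such text passes through as raw FTS5 syntax.

```rust
/// Builds an FTS5 MATCH expression from user input.
/// Returns None for an empty query so the caller can short-circuit.
pub fn build_match_expr(query: &str) -> Option<String> {
    let trimmed = query.trim();
    if trimmed.is_empty() {
        return None;
    }
    // Raw mode: text wrapped in single quotes is treated as FTS5 syntax.
    if trimmed.len() >= 2 && trimmed.starts_with('\'') && trimmed.ends_with('\'') {
        let inner = &trimmed[1..trimmed.len() - 1];
        return if inner.is_empty() {
            None
        } else {
            Some(inner.to_string())
        };
    }
    // Literal mode: quote every whitespace-separated token, doubling inner quotes.
    let quoted: Vec<String> = trimmed
        .split_whitespace()
        .map(|token| format!("\"{}\"", token.replace('"', "\"\"")))
        .collect();
    Some(quoted.join(" "))
}

/// score = -bm25 / (1 + |bm25|). FTS5 reports better matches as more
/// negative bm25 values, so a negative input yields a positive score below 1.
pub fn normalize_bm25(bm25: f64) -> f64 {
    -bm25 / (1.0 + bm25.abs())
}
```

Quoting each token keeps characters such as `-`, `:` or `*` from being parsed as FTS5 operators, while the raw mode remains available for callers who want them.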

kb-store-sqlite:
- Adds pub fn read_conn(&self) -> MutexGuard<'_, Connection>.
  Read-only contract is doc-only — kb-search needs MutexGuard for
  prepare_cached + iter, which a closure-scoped wrapper would awkwardly
  constrain. Closure variant left as a P3 follow-up.

Tests (26 new): empty corpus, empty query, single hit + citation
round-trip, snippet length cap, tags_any exclusion, lang+trust
composition, path_glob with '*' not crossing '/', citation line round-
trip, bm25 top-1 ∈ (0, 1], determinism (varied scores AND identical-
score tiebreaker), index_version passthrough, snapshot
(crates/kb-search/tests/fixtures/search/lexical/run-1.json — stable
under bundled SQLite; KB_UPDATE_SNAPSHOTS=1 to regenerate). Workspace:
211 tests pass, cargo clippy --workspace --all-targets -D warnings
clean.

Allowed deps respected: kb-core, kb-config, kb-store-sqlite, rusqlite,
tracing, thiserror, anyhow (forced by trait return type), serde_json
(parses *_json TEXT columns), globset (path_glob '*' boundary).

Out of scope (deferred): vector retriever (p3-3), hybrid fusion (p3-4),
reranker (P+), Korean morphological tokenizer (P+).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 05:20:35 +00:00
5aef478b96 Merge pull request 'feat(p2-1): fts-schema — chunks_fts + sync triggers + V002 migration' (#12) from feat/p2-1-fts-schema into main
Reviewed-on: altair823-org/kb#12
2026-05-01 04:58:55 +00:00
94bfc50efd feat(p2-1): chunks_fts virtual table + sync triggers (V002 migration)
Adds FTS5 lexical index for chunks per design §5.5: chunks_fts virtual
table (unicode61 remove_diacritics 2 tokenizer, contentless w/ UNINDEXED
chunk_id+doc_id) plus chunks_ai/chunks_ad/chunks_au triggers that mirror
every chunks mutation into chunks_fts inside the host transaction.

V002 ships the verbatim §5.5 SQL block plus a one-shot backfill INSERT
so existing P1 databases gain searchability without re-ingest. Refinery
bookkeeping makes double-apply naturally idempotent.

Adds rebuild_chunks_fts(&Connection) escape hatch for kb index
--rebuild-fts (CLI wiring deferred to a later task). Uses SAVEPOINT
instead of Transaction so callers can invoke from inside an outer
transaction; WAL serializes writers so no DELETE/INSERT race vs.
concurrent chunks mutators is possible.

Tests (10): V001-only → V002 cold-upgrade backfill (literal path),
chunks_ai/ad/au trigger sync, MATCH-token verification, rebuild
idempotency, drift recovery, double-run no-op, V002 ↔ design §5.5
verbatim diff guard (anchored extraction from both files), WAL/SHM
release on store drop. All 185 workspace tests pass.

Allowed deps respected (kb-core, kb-config, rusqlite, refinery — no
new deps). FTS query implementation deferred to p2-2 (lexical-retriever).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 04:42:15 +00:00
fd11dd054b Merge pull request 'feat(p1-6): kb-store-sqlite — P1 terminal task' (#11) from feat/p1-6-store-sqlite into main
Reviewed-on: altair823-org/kb#11
2026-04-30 17:47:45 +00:00
b7367dedfe p1-6: doc-only TODO markers (section_label, doc_version invariant)
M9: add a `TODO(P2/P3)` comment near the NULL persistence at
documents.rs (put_chunks). The `section_label` column exists in the §5.5
DDL but neither the in-memory Chunk struct nor the §2.6 wire schema
carries the field, so NULL is the correct canonical value today —
flag the future-bump intent in-line rather than leaving it implicit.

M10: add a one-line invariant comment near the i64 -> u32 narrowing for
`doc_version` in `get_document`. The invariant is documented at the
write site (UPSERT bumps by 1 per re-ingest) — restate it at the read
site so the cast is not silently load-bearing.

No behaviour change. No tests touched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:34:17 +00:00
15b4d80efc p1-6: rename StoreError::Sqlx -> Sqlite, drop dead assets_root helper
M1: `Sqlx` is a misleading leftover — this crate uses `rusqlite`, not
sqlx. Rename the variant (and the doc reference to it) to `Sqlite`. No
external pattern matches; the variant is reached only via `#[from]`.

M11: `assets_root` was an `#[allow(dead_code)]` helper introduced for a
test that never landed. Delete it so the dead-code allow goes with it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:33:50 +00:00
e41279de96 p1-6: harden store boundary (atomic asset write, poison-tolerant mutex, AssetId validation)
Three Important review fixes on the kb-store-sqlite write path:

I1. Atomic asset write. put_asset_with_bytes now stages bytes to
    `<final>.tmp.<pid>.<n>`, fsyncs, UPSERTs the row, then `rename`s into
    place (atomic on POSIX same-fs). On any failure between staging and
    rename we best-effort `remove_file` the temp so the previous orphan
    risk on UPSERT failure is gone. Reference mode is unchanged (no I/O,
    no orphan risk).

I2. Poison-tolerant mutex. New `lock_conn` helper does
    `.lock().unwrap_or_else(|p| p.into_inner())`, so a single panic mid-
    transaction no longer poisons every subsequent store call. The
    rusqlite Transaction Drop already rolls back on panic, leaving the
    Connection state safe to reuse. All 13 prior `.expect("sqlite mutex
    poisoned")` sites in store.rs / documents.rs / jobs.rs now route
    through `lock_conn`.

I3. AssetId shape validation. `kb_core::AssetId(pub String)` lets a
    hand-construction bypass the `FromStr` 32-hex invariant. Added
    `validate_asset_id` (32 ASCII hex chars) at every store entry that
    turns an AssetId into a path: `put_asset_with_bytes` and
    `DocumentStore::put_asset`. This shuts a potential path-traversal via
    `assets_path_for`'s `&id[..2]` shard slice.
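
The poison-tolerant lock from I2 is a one-liner over std. The sketch below is generic over the guarded value rather than tied to a rusqlite `Connection`, so the signature is illustrative rather than the crate's own.

```rust
use std::sync::{Mutex, MutexGuard};

/// Locks the mutex, recovering the guard even if a previous holder panicked.
/// This is only sound when the protected value is still usable after a panic,
/// which the commit argues holds because the transaction rolls back on drop.
pub fn lock_conn<T>(mutex: &Mutex<T>) -> MutexGuard<'_, T> {
    mutex
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner())
}
```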

Tests:
- `put_asset_with_bytes_orphan_cleanup_on_upsert_failure` — pre-seeds a
  row that takes the same `workspace_path` (UNIQUE), so the UPSERT trips
  a constraint not covered by `ON CONFLICT(asset_id)`. Asserts no final
  file and no leaked `*.tmp.*`.
- `put_asset_with_bytes_rejects_invalid_asset_id` — passes
  `AssetId("../etc/passwd_padded_to_xx_xxxxx")` (32 chars, contains `/`).
  Asserts error and zero filesystem artifacts under `data_dir/assets/`.
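
The shape check exercised by the second test can be sketched as below. Function names follow the commit message, but the bodies, signatures and error type are assumptions made for illustration.

```rust
use std::path::{Path, PathBuf};

/// Accepts exactly 32 ASCII hex characters; anything else is rejected
/// before the id is ever used to build a filesystem path.
pub fn validate_asset_id(id: &str) -> Result<(), String> {
    if id.len() == 32 && id.bytes().all(|b| b.is_ascii_hexdigit()) {
        Ok(())
    } else {
        Err(format!(
            "invalid asset id {id:?}: expected 32 ASCII hex characters"
        ))
    }
}

/// Builds data_dir/assets/<first two chars>/<asset_id>, validating first
/// so the shard slice can never contain `.` or `/`.
pub fn assets_path_for(data_dir: &Path, id: &str) -> Result<PathBuf, String> {
    validate_asset_id(id)?;
    Ok(data_dir.join("assets").join(&id[..2]).join(id))
}
```

Validating before slicing also means `&id[..2]` cannot panic on a short id or on a multi-byte character boundary.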

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:33:19 +00:00
efdb71b1c3 p1-6: list_documents filter coverage test
Round-trip three docs (en/ko, varied tags, varied trust) and exercise
each DocFilter axis: default (all), lang, path_glob (workspace_path
GLOB), tags_any (intersection via document_tags subquery + per-row tag
hydration), and trust_min (Primary > Secondary > Generated rank gate).
2026-04-30 17:14:17 +00:00
111f40ddf0 p1-6: kb-store-sqlite test suite (8 categories)
All 8 test categories from the task plan, plus a JobRepo subset:

  migration   — tests/migration.rs: fresh DB after run_migrations
                exposes every required §5 table + index.
  unit (copy) — tests/asset_writer.rs: copy mode writes file with
                mode 0o644 + correct bytes.
  unit (ref)  — tests/asset_writer.rs: reference mode does not write
                file; row records source path.
  unit (cs)   — tests/asset_writer.rs: tampered checksum returns a
                Conflict-flavoured anyhow error.
  unit (idem) — tests/idempotency.rs: same put_document twice → 1 row,
                doc_version 1→2; tags re-derived.
  unit (rb)   — tests/idempotency.rs: put_blocks with FK violation
                rolls back; pre-existing rows unchanged.
  contract    — tests/contract_roundtrip.rs: drives kb-parse-md +
                kb-normalize + kb-chunk on
                fixtures/markdown/code-and-table.md, persists, then
                reloads via DocumentStore::get_document /
                get_chunk and asserts byte-equal round-trip.
  snapshot    — tests/ingest_report_snapshot.rs +
                snapshots/ingest_report.snapshot.json: pin the wire
                JSON form of kb_core::IngestReport for an inline
                fixture run.
  jobs        — tests/jobs.rs: create → progress → finish flow;
                error message round-trip; list filters on status/kind.

Drops the unused `serde` direct dep from Cargo.toml; serde_json brings
its own. Dev-deps confirmed via `cargo tree -p kb-store-sqlite --depth 1`
to live only in the dev tree.
2026-04-30 17:13:03 +00:00
a3390d5171 p1-6: scaffold kb-store-sqlite crate + V001 full §5 DDL
New workspace member crate `kb-store-sqlite` (allowed deps only:
kb-core, kb-config, rusqlite[bundled], refinery, serde, serde_json,
time, blake3, tracing, anyhow, thiserror; dev-deps add kb-parse-md /
kb-normalize / kb-chunk for the contract round-trip test).

Migration V001 replaces the P0-1 stub with the full §5 DDL (assets,
documents, document_tags, blocks, chunks with policy_hash,
embedding_records, jobs, ingest_runs, answers, eval_runs,
eval_query_results) plus the §5 indexes. FTS5 virtual table + triggers
remain deferred to V002 (P2-1).

Public surface per task spec:
  SqliteStore::open / run_migrations / put_asset_with_bytes
  impl DocumentStore for SqliteStore (7 trait methods)
  impl JobRepo for SqliteStore (4 trait methods)
  StoreError { Sqlx, Migration, Conflict }

Behavior:
- Pragmas at open: foreign_keys=ON, journal_mode=WAL,
  synchronous=NORMAL, temp_store=MEMORY.
- Asset writer: byte_len ≤ copy_threshold_mb * 1MiB → copy to
  data_dir/assets/<aa>/<asset_id> (mode 0o644 on Unix), else
  reference. blake3(bytes) verified against asset.checksum; mismatch →
  Conflict.
- Idempotency: put_document UPSERTs and increments doc_version by 1 on
  conflict; put_blocks / put_chunks DELETE-then-INSERT; document_tags
  re-derived per put_document.
- get_document rehydrates blocks via payload_json ordered by stream
  ordinal.
- list_documents builds dynamic WHERE from DocFilter (lang / trust_min
  / path_glob via GLOB / tags_any via document_tags subquery).
- JobRepo: jobs.kind/status are stored as lowercase enum tags; create
  mints a 32-hex JobId via blake3(kind || payload || nanos).
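
The copy-versus-reference rule in the asset writer bullet can be
sketched as below. This is a minimal illustration, not the crate's
code: `AssetStorage` and `choose_asset_storage` are hypothetical names,
and the checksum verification and on-disk layout are left out.

```rust
/// Hypothetical outcome of the asset writer's size check.
#[derive(Debug, PartialEq, Eq)]
enum AssetStorage {
    /// Bytes are copied under the data directory.
    Copy,
    /// Only a reference to the original source is recorded.
    Reference,
}

/// Assumed shape of the decision: copy when
/// byte_len <= copy_threshold_mb * 1 MiB, otherwise reference.
fn choose_asset_storage(byte_len: u64, copy_threshold_mb: u64) -> AssetStorage {
    const MIB: u64 = 1024 * 1024;
    // saturating_mul avoids overflow on absurdly large thresholds.
    if byte_len <= copy_threshold_mb.saturating_mul(MIB) {
        AssetStorage::Copy
    } else {
        AssetStorage::Reference
    }
}
```

With a 1 MiB threshold, a file of exactly 1,048,576 bytes is copied and
a file one byte larger is referenced, because the comparison is
inclusive.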

Tests follow in subsequent commits.
2026-04-30 17:08:36 +00:00
207a0ff61e kb-chunk: regenerate long-section.chunks.snapshot.json baseline
The snapshot now includes the policy_hash field on every Chunk per the
preceding kb-core schema change. chunk_ids are unchanged because the
chunk_id recipe (§4.2) already incorporated policy_hash via the chunker —
the field is simply now visible in the wire form.

Regenerated via:
  UPDATE_SNAPSHOTS=1 cargo test -p kb-chunk long_section_chunks_snapshot
2026-04-30 17:02:53 +00:00
094c4641ba kb-chunk: populate Chunk.policy_hash field
Set the new policy_hash field on every emitted Chunk to the same hex
already computed for the chunk_id recipe (§4.2). No recipe / chunk_id
change — only the field on the struct is now populated.

Pairs with the kb-core hotfix (preceding commit) and unblocks P1-6's
DocumentStore::put_chunks to read chunk.policy_hash directly per §5.5.
2026-04-30 17:02:17 +00:00
16b2a5c150 kb-core: add policy_hash field to Chunk struct (P1-6 schema reconcile)
Add policy_hash: String to kb_core::Chunk to align with the §5.5 SQLite
schema (chunks.policy_hash NOT NULL), so kb-store-sqlite persistence is a
straight field copy rather than a recompute.

This is a §9 schema migration:
- §5.5 (the persistence schema) is authoritative.
- §3.5 (the domain model) must accommodate.

The chunker already computed policy_hash for the chunk_id recipe (§4.2);
P1-5 stored it implicitly. We now hold it explicitly on the Chunk so any
DocumentStore::put_chunks impl can read it directly.

Follow-up commits update kb-chunk to populate the field and refresh the
P1-5 snapshot baseline accordingly.
2026-04-30 17:02:11 +00:00
b46e69b9c0 Merge pull request 'feat(p1-5): kb-chunk md-heading-v1 chunker' (#10) from feat/p1-5-chunk into main
Reviewed-on: altair823-org/kb#10
2026-04-30 16:58:33 +00:00
ceeac9f974 p1-5: doc rationale for respect_markdown_headings, policy_hash panic, overlap accounting
Doc-only follow-ups for review minors I1, M3, M4. No behavior change.

* I1: rustdoc on `MdHeadingV1Chunker` now records that
  `policy.respect_markdown_headings` flows into `policy_hash` only;
  the `md-heading-v1` variant unconditionally treats headings as
  boundaries by name. To disable heading awareness, ship a different
  `chunker_version` (none in P1-5).

* M3: `# Panics` rustdoc on `policy_hash` documents the
  unreachable-in-practice failure mode of
  `serde_json_canonicalizer::to_vec` and explains why the `expect` is
  retained as future-proofing.

* M4: Inline comment at the `would_exceed` decision noting that
  `acc.text_tokens` already includes the prior chunk's overlap seed,
  and that the I3 clamp guarantees a flush here never produces a
  chunk smaller than the seed budget.

* Heading-path bullet in the behavior contract updated to reflect the
  I2 fix wording.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:52:18 +00:00
a81460f9d0 p1-5: clamp overlap seed budget to target_tokens / 2
A pathological `ChunkPolicy { overlap_tokens >= target_tokens }`
caused `md-heading-v1` to degenerate into 1-block-per-chunk: the
seeded `acc.text_tokens` already exceeded `target_tokens` before any
fresh content landed, so the next paragraph immediately tripped the
`would_exceed` flush.

Clamp the seed budget in `collect_overlap_seed` to
`min(overlap_tokens, target_tokens / 2)`. This guarantees that after
seeding, the chunk has at least `target/2` worth of room for new
content before flushing, restoring the intended paragraph-overlap
behavior on every reasonable and unreasonable policy.
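
The clamp itself is small enough to sketch. The function name below is
hypothetical (the real logic lives inside `collect_overlap_seed`), and
the `u32` token type is an assumption.

```rust
/// Hypothetical helper mirroring the clamp described above: the overlap
/// seed may never exceed half of the target, so a freshly seeded chunk
/// always has at least target / 2 tokens of room for new content.
fn clamped_seed_budget(overlap_tokens: u32, target_tokens: u32) -> u32 {
    overlap_tokens.min(target_tokens / 2)
}
```

For the pathological policy with `target_tokens = 50` and
`overlap_tokens = 200`, the seed budget drops to 25; a sane policy such
as overlap 20 against target 200 is left untouched.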

Adds a regression test pinning a target_tokens=50 / overlap_tokens=200
(overlap = 4× target) policy and asserting every emitted chunk holds
≥2 blocks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:51:00 +00:00
f780c71ce0 p1-5: heading-only chunk carries self in heading_path
When a Heading block opens a chunk and is followed by another Heading
or an atomic block (Code, Table, ImageRef, AudioRef) with no
intervening prose, the prior fallback used `common.heading_path` from
the heading itself — which per kb-normalize convention does NOT
include the heading's own text. Result: heading-only and heading-led
chunks for `# Alpha\n## Beta\n...` patterns landed with
`heading_path = []`, losing citation context.

Synthesize the leading heading into the chunk's heading_path when
blocks[0] is a Heading: parent path + heading.text. The first
non-Heading branch (existing logic for normal mid-section chunks) is
unchanged.
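
A simplified sketch of the rule, assuming a cut-down `Block` type that
carries only the fields needed here (the real kb-core types are richer,
and `chunk_heading_path` is a hypothetical name):

```rust
/// Simplified stand-in for the real block type. `heading_path` holds
/// the parent headings only, matching the kb-normalize convention that
/// a heading's own text is not part of its path.
#[derive(Debug, Clone)]
enum Block {
    Heading { text: String, heading_path: Vec<String> },
    Other { heading_path: Vec<String> },
}

/// Derive a chunk's heading_path from its first block.
fn chunk_heading_path(blocks: &[Block]) -> Vec<String> {
    match blocks.first() {
        // Leading heading: parent path + the heading's own text.
        Some(Block::Heading { text, heading_path }) => {
            let mut path = heading_path.clone();
            path.push(text.clone());
            path
        }
        // Normal mid-section chunk: the block already carries the path.
        Some(Block::Other { heading_path }) => heading_path.clone(),
        None => Vec::new(),
    }
}
```

For the `# Alpha\n## Beta\n[code]` pattern, the heading-only chunk for
Alpha gets `["Alpha"]` and the chunk opened by Beta gets
`["Alpha", "Beta"]`, instead of the empty path the old fallback
produced for Alpha.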

`chunk_id` recipe is `(doc_id, chunker_version, block_ids,
policy_hash)` — `heading_path` is not in the recipe, so this fix does
NOT shift chunk_ids. Snapshot baseline `long-section.chunks.snapshot.json`
also unchanged because every heading in that fixture is followed by a
paragraph (the bug only triggers on direct heading→heading or
heading→atomic adjacency).

Adds `heading_with_parents` test helper and a regression test pinning
the `# Alpha\n## Beta\n[code]` pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:50:12 +00:00
58f7b8573d p1-5: add long-section fixture + Vec<Chunk> snapshot test
Bakes the chunker output for fixtures/markdown/long-section.md (3 H1s,
nested H2 under Alpha, a 50-line code block, a 3-col x 4-row table,
and a multi-paragraph Gamma section) into the JSON snapshot baseline.
Confirms the priority rules end-to-end:

  - Heading boundaries hold across H1 → H2 → H1 transitions
  - The code block emits one chunk at 427 tokens > target=200
  - The table stays single-chunk
  - Gamma's paragraph stream splits with one block of overlap seed

A second test runs the full parse → normalize → chunk pipeline 5
times and asserts identical chunk_ids each pass.

Drops the unused `kb-config` and `serde` from regular dependencies —
they were declared but no source path imports them; `serde` flows in
transitively via `kb-core` as a public API requirement, and
`ChunkingCfg` lives in `kb-config` but the chunker takes
`ChunkPolicy` directly. Production deps are now exactly the allowed
set actually used: anyhow, blake3, kb-core, serde_json_canonicalizer,
tracing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:33:29 +00:00
0237022d0e p1-5: implement md-heading-v1 chunking rules
Fills in MdHeadingV1Chunker::chunk with the priority-ordered ruleset
from the design (§0 / §14):

  1. Heading is a hard boundary; the heading starts a new chunk and
     is included in it so heading text is retrievable.
  2. Code blocks never split, regardless of size.
  3. Tables stay single-chunk (row-split deferred per task spec).
  4. Long sections split at target_tokens with paragraph-level
     overlap_tokens worth of seeded tail blocks.
  5. ImageRef / AudioRef each become their own chunk with
     token_estimate = 0.
  6. heading_path lifts from the first contributing non-Heading
     block; source_spans concatenate in document order.
  7. chunk_id derives from id_for_chunk(doc_id, chunker_version,
     block_ids, policy_hash) per §4.2.
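
The boundary rules (1 to 5) can be illustrated with a toy grouping
pass. This is a sketch under stated assumptions, not the
`MdHeadingV1Chunker` implementation: `Kind`, `Block`, and
`group_blocks` are hypothetical, every atomic block is assumed to get
its own chunk, and overlap seeding, heading_path, and chunk_id
derivation are omitted.

```rust
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Kind {
    Heading,
    Paragraph,
    Code,
    Table,
    ImageRef,
}

struct Block {
    kind: Kind,
    tokens: u32,
}

/// Group blocks into chunks, returned as lists of block indices.
fn group_blocks(blocks: &[Block], target_tokens: u32) -> Vec<Vec<usize>> {
    let mut chunks: Vec<Vec<usize>> = Vec::new();
    let mut cur: Vec<usize> = Vec::new();
    let mut cur_tokens = 0u32;

    for (i, block) in blocks.iter().enumerate() {
        match block.kind {
            // Rule 1: a heading closes the open chunk and starts the next.
            Kind::Heading => {
                if !cur.is_empty() {
                    chunks.push(std::mem::take(&mut cur));
                }
                cur.push(i);
                cur_tokens = block.tokens;
            }
            // Rules 2, 3, 5: atomic blocks are never split, whatever
            // their size (assumed here to always stand alone).
            Kind::Code | Kind::Table | Kind::ImageRef => {
                if !cur.is_empty() {
                    chunks.push(std::mem::take(&mut cur));
                }
                chunks.push(vec![i]);
                cur_tokens = 0;
            }
            // Rule 4 (without overlap): flush when the next paragraph
            // would push the chunk past target_tokens.
            Kind::Paragraph => {
                if !cur.is_empty() && cur_tokens + block.tokens > target_tokens {
                    chunks.push(std::mem::take(&mut cur));
                    cur_tokens = 0;
                }
                cur.push(i);
                cur_tokens += block.tokens;
            }
        }
    }
    if !cur.is_empty() {
        chunks.push(cur);
    }
    chunks
}
```

With `target_tokens = 200`, a heading, two 120-token paragraphs, an
800-token code block, and a second heading with a short paragraph group
into four chunks: the heading with the first paragraph, the second
paragraph alone, the code block alone, and the second heading with its
paragraph.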

Covers the unit + determinism rows of the P1-5 test plan: heading
boundary respected, 800-token code block stays single, small table
stays single, long paragraph chain splits with overlap, ImageRef
chunk has token_estimate=0, and 1000-iter chunk_id determinism.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:30:26 +00:00