Establishes the kb-embed trait crate so concrete embedding adapters
(p3-2 fastembed, future ollama-embed/candle) target a stable surface.
Pure re-export of kb_core::{Embedder, EmbeddingInput, EmbeddingKind,
EmbeddingModelId, EmbeddingVersion} plus a feature-gated deterministic
mock for downstream tests.
MockEmbedder (cfg(feature = "mock"), default OFF):
- Per-component hash recipe: blake3(seed_le8 || kind_byte ||
text_len_le8 || text || i_le8). Length-prefixed text avoids the
domain-separation ambiguity where two (text, i) pairs could shift
bytes between text tail and the i field.
- Document = 0u8, Query = 1u8 — same text different kind yields
different vectors (mirrors e5 prefix behaviour).
- Per component: blake3 first 8 bytes → u64 → reinterpret as i64 →
f64/i64::MAX → f32. i64::MAX rounds up to 2^63 as an f64, so i64::MIN
maps to exactly -1.0; range [-1, 1] holds.
- L2 unit-normalised. The norm is accumulated in f64 (to limit precision
loss in the sum of squares) before the f32 cast. A zero-norm guard skips
the divide.
- with_seed(...) constructor lets two embedders share identity but
produce different vectors — useful for downstream parametric tests.
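The recipe above can be sketched with the standard library alone. This is
an illustration under stated assumptions, not the crate's code: the names
preimage, component_from_hash_prefix and l2_normalize are hypothetical, the
blake3 call is left out (the first 8 hash bytes are passed in), and reading
those bytes as little-endian is an assumption.

```rust
/// Builds the per-component hash input:
/// seed_le8 || kind_byte || text_len_le8 || text || i_le8.
/// Hypothetical helper; the real crate would feed these bytes to blake3.
fn preimage(seed: u64, kind_byte: u8, text: &str, i: u64) -> Vec<u8> {
    let mut buf = Vec::with_capacity(8 + 1 + 8 + text.len() + 8);
    buf.extend_from_slice(&seed.to_le_bytes());
    buf.push(kind_byte); // 0 = Document, 1 = Query
    // The length prefix fixes where `text` ends and `i` begins.
    buf.extend_from_slice(&(text.len() as u64).to_le_bytes());
    buf.extend_from_slice(text.as_bytes());
    buf.extend_from_slice(&i.to_le_bytes());
    buf
}

/// Maps the first 8 hash bytes to one component in [-1, 1].
/// Assumption: the bytes are interpreted as a little-endian u64.
fn component_from_hash_prefix(first8: [u8; 8]) -> f32 {
    let signed = u64::from_le_bytes(first8) as i64; // bit reinterpretation
    (signed as f64 / i64::MAX as f64) as f32
}

/// L2-normalises in place. The norm is accumulated in f64, and a
/// zero-norm vector is left untouched instead of dividing by zero.
fn l2_normalize(v: &mut [f32]) {
    let norm = v
        .iter()
        .map(|&x| (x as f64) * (x as f64))
        .sum::<f64>()
        .sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x = (*x as f64 / norm) as f32;
        }
    }
}
```

Because the kind byte is part of the preimage, the same text hashed as
Document and as Query yields different bytes, and therefore different
components.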
Helpers:
- assert_vector_shape(vecs, dims) — len + finite check.
- assert_unit_norm(vecs, tolerance) — caller-supplied tolerance;
5e-4 documented as safe for dims=384 under f32 epsilon × √dims.
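A minimal sketch of what the two helpers might look like, plus the
arithmetic behind the 5e-4 figure. The signatures (slice of Vec<f32>, f32
tolerance) are assumptions; only the helper names and their stated checks
come from this commit. rough_error_bound is a hypothetical name used for
illustration.

```rust
/// Panics unless every vector has exactly `dims` finite components.
fn assert_vector_shape(vecs: &[Vec<f32>], dims: usize) {
    for (idx, v) in vecs.iter().enumerate() {
        assert_eq!(v.len(), dims, "vector {idx} has the wrong length");
        assert!(
            v.iter().all(|x| x.is_finite()),
            "vector {idx} has a non-finite component"
        );
    }
}

/// Panics unless every vector's L2 norm is within `tolerance` of 1.0.
/// The norm is accumulated in f64, as in the mock itself.
fn assert_unit_norm(vecs: &[Vec<f32>], tolerance: f32) {
    for (idx, v) in vecs.iter().enumerate() {
        let norm = v
            .iter()
            .map(|&x| (x as f64) * (x as f64))
            .sum::<f64>()
            .sqrt();
        assert!(
            (norm - 1.0).abs() <= tolerance as f64,
            "vector {idx} has norm {norm}"
        );
    }
}

/// Rough rounding-error scale quoted in the commit: f32 epsilon x sqrt(dims).
fn rough_error_bound(dims: usize) -> f32 {
    f32::EPSILON * (dims as f32).sqrt()
}
```

For dims=384 the rough bound is on the order of 2e-6, so a caller-supplied
tolerance of 5e-4 leaves roughly two orders of magnitude of headroom.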
Tests:
- cargo test -p kb-embed (no features): 2 reexport/dyn-dispatch tests.
- cargo test -p kb-embed --features mock: 7 tests including 100-case
proptest asserting len == dims, all finite, ‖v‖ ≈ 1.0 within
tolerance, Doc(text) byte-equal Doc(text), Doc(text) ≠ Query(text),
Doc(text1) ≠ Doc(text2).
- All 220 workspace tests pass; clippy clean for both default and
mock-on feature configurations.
Symbol gating: nm on the release rlib confirms zero MockEmbedder
symbols under default features; three trait impl symbols under
--features mock. Spec invariant "release builds MUST NOT include
MockEmbedder" verified at the symbol level.
Allowed deps respected: kb-core, kb-config, serde, thiserror, tracing,
plus anyhow (forced by trait return type) and blake3 (justified by
the determinism contract; already in workspace lockfile via kb-core).
No fastembed/ort/tokenizers anywhere.
Out of scope: real adapter (p3-2), reranker traits (P+).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the first concrete kb_core::Retriever, exercising chunks_fts (P2-1)
to answer SearchMode::Lexical queries. Returns Vec<SearchHit> with
bm25-derived ranking, snippet() previews, and W3C-fragment-style
Citation built from the chunk's first source_spans entry.
New crate kb-search:
- LexicalRetriever::new(Arc<SqliteStore>, IndexVersion).
- search() builds an FTS5 MATCH expression by escaping every whitespace
token into a quoted literal (inner " doubled); single-quote-wrapped
text passes through verbatim as raw FTS5 syntax. Empty query
short-circuits to Ok(vec![]).
- bm25 normalization: score = -bm25 / (1 + |bm25|), bounded (0, 1] for
any FTS5-returned negative bm25.
- Snippet via snippet(chunks_fts, 3, '', '', '…', word_budget) where
word_budget = snippet_chars / 4 clamped to [1, 64]; trim_snippet
enforces the char cap on the way out (chars per design §6.4 — accepts
the combining-mark trade-off).
- Citation from chunks.source_spans_json first span: Line / Page /
Region / Time forwarded; Byte / empty array fall back to Line{1,1}
with a tracing::warn so forward-compat regressions surface.
- Filters: tags_any (subquery on document_tags), lang (= column),
trust_min (CASE-rank in SQL) all applied at SQL level. path_glob
uses globset with literal_separator(true) — guarantees '*' does not
cross '/' per spec Risks/notes — applied as Rust post-filter with
+128 row over-fetch when set, then rank reassigned 1..k contiguously.
- Determinism: ORDER BY score, f.chunk_id (lexicographic blake3 hex
tiebreaker on identical bm25). Tested explicitly with two chunks of
identical text content.
- RetrievalDetail: method=Lexical, both lexical_score and fusion_score
set, vector_* None.
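The query escaping, score normalization, and snippet budget described
above can be sketched as follows. This is an illustration, not the crate's
code: build_match_expr, normalize_bm25 and word_budget are hypothetical
names (trim_snippet is named in this commit, but its signature here is
assumed), and stripping the outer single quotes from raw FTS5 input is an
assumption.

```rust
/// Builds an FTS5 MATCH expression; returns None for an empty query so
/// the caller can short-circuit to an empty result.
fn build_match_expr(query: &str) -> Option<String> {
    let trimmed = query.trim();
    if trimmed.is_empty() {
        return None;
    }
    // Single-quote-wrapped input is passed through as raw FTS5 syntax.
    if trimmed.len() >= 2 && trimmed.starts_with('\'') && trimmed.ends_with('\'') {
        let inner = trimmed[1..trimmed.len() - 1].trim();
        return if inner.is_empty() { None } else { Some(inner.to_string()) };
    }
    // Otherwise every whitespace token becomes a quoted literal,
    // with any inner double quote doubled.
    let quoted: Vec<String> = trimmed
        .split_whitespace()
        .map(|tok| format!("\"{}\"", tok.replace('"', "\"\"")))
        .collect();
    Some(quoted.join(" "))
}

/// score = -bm25 / (1 + |bm25|); positive for the negative bm25 values
/// FTS5 returns, and larger for better matches.
fn normalize_bm25(bm25: f64) -> f64 {
    -bm25 / (1.0 + bm25.abs())
}

/// word_budget = snippet_chars / 4, clamped to [1, 64].
fn word_budget(snippet_chars: usize) -> usize {
    (snippet_chars / 4).clamp(1, 64)
}

/// Enforces the cap in chars (Unicode scalar values), not bytes.
fn trim_snippet(snippet: &str, max_chars: usize) -> String {
    snippet.chars().take(max_chars).collect()
}
```

For example, the query `say "hi"` becomes `"say" """hi"""`, so a user
cannot inject FTS5 operators without opting in via single quotes.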
kb-store-sqlite:
- Adds pub fn read_conn(&self) -> MutexGuard<'_, Connection>.
Read-only contract is doc-only — kb-search needs MutexGuard for
prepare_cached + iter, which a closure-scoped wrapper would awkwardly
constrain. Closure variant left as a P3 follow-up.
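To make the trade-off concrete, here is a std-only sketch of the two
accessor shapes. Connection is a stand-in struct so the example compiles
on its own (the real type is rusqlite's), with_read_conn is a hypothetical
name for the deferred closure variant, and panicking on a poisoned mutex
is an assumption.

```rust
use std::sync::{Mutex, MutexGuard};

/// Stand-in for the real connection type, for illustration only.
struct Connection {
    queries_run: u32,
}

struct SqliteStore {
    conn: Mutex<Connection>,
}

impl SqliteStore {
    fn new() -> Self {
        SqliteStore {
            conn: Mutex::new(Connection { queries_run: 0 }),
        }
    }

    /// Guard-returning accessor: the caller holds the lock for as long
    /// as the guard lives, which suits prepare-then-iterate call sites.
    /// Nothing in the type stops the caller from mutating through it.
    fn read_conn(&self) -> MutexGuard<'_, Connection> {
        self.conn.lock().expect("connection mutex poisoned")
    }

    /// Closure-scoped alternative: the lock is released when `f` returns,
    /// and the closure only ever sees a shared reference.
    fn with_read_conn<T>(&self, f: impl FnOnce(&Connection) -> T) -> T {
        let guard = self.conn.lock().expect("connection mutex poisoned");
        f(&guard)
    }
}
```

The guard form keeps borrowed statements and row iterators in one scope
without threading them through a closure, at the cost of a read-only
contract that lives only in documentation.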
Tests (26 new): empty corpus, empty query, single hit + citation
round-trip, snippet length cap, tags_any exclusion, lang+trust
composition, path_glob with '*' not crossing '/', citation line round-
trip, bm25 top-1 ∈ (0, 1], determinism (varied scores AND identical-
score tiebreaker), index_version passthrough, snapshot
(crates/kb-search/tests/fixtures/search/lexical/run-1.json — stable
under bundled SQLite; KB_UPDATE_SNAPSHOTS=1 to regenerate). Workspace:
211 tests pass, cargo clippy --workspace --all-targets -D warnings
clean.
Allowed deps respected: kb-core, kb-config, kb-store-sqlite, rusqlite,
tracing, thiserror, anyhow (forced by trait return type), serde_json
(parses *_json TEXT columns), globset (path_glob '*' boundary).
Out of scope (deferred): vector retriever (p3-3), hybrid fusion (p3-4),
reranker (P+), Korean morphological tokenizer (P+).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All 8 test categories from the task plan, plus a JobRepo subset:

  migration    tests/migration.rs: fresh DB after run_migrations
               exposes every required §5 table + index.
  unit (copy)  tests/asset_writer.rs: copy mode writes file with
               mode 0o644 + correct bytes.
  unit (ref)   tests/asset_writer.rs: reference mode does not write
               file; row records source path.
  unit (cs)    tests/asset_writer.rs: tampered checksum returns a
               Conflict-flavoured anyhow error.
  unit (idem)  tests/idempotency.rs: same put_document twice → 1 row,
               doc_version 1→2; tags re-derived.
  unit (rb)    tests/idempotency.rs: put_blocks with FK violation
               rolls back; pre-existing rows unchanged.
  contract     tests/contract_roundtrip.rs: drives kb-parse-md +
               kb-normalize + kb-chunk on
               fixtures/markdown/code-and-table.md, persists, then
               reloads via DocumentStore::get_document /
               get_chunk and asserts byte-equal round-trip.
  snapshot     tests/ingest_report_snapshot.rs +
               snapshots/ingest_report.snapshot.json: pin the wire
               JSON form of kb_core::IngestReport for an inline
               fixture run.
  jobs         tests/jobs.rs: create → progress → finish flow;
               error message round-trip; list filters on status/kind.

Drops the unused `serde` direct dep from Cargo.toml; serde_json brings
its own. Dev-deps confirmed via `cargo tree -p kb-store-sqlite --depth 1`
to live only in the dev tree.
Bakes the chunker output for fixtures/markdown/long-section.md (3 H1s,
nested H2 under Alpha, a 50-line code block, a 3-col x 4-row table,
and a multi-paragraph Gamma section) into the JSON snapshot baseline.
Confirms the priority rules end-to-end:
- Heading boundaries hold across H1 → H2 → H1 transitions
- The code block emits one chunk at 427 tokens > target=200
- The table stays single-chunk
- Gamma's paragraph stream splits with one block of overlap seed
A second test runs the full parse → normalize → chunk pipeline 5
times and asserts identical chunk_ids each pass.
Drops the unused `kb-config` and `serde` from regular dependencies —
they were declared but no source path imports them; `serde` flows in
transitively via `kb-core` as a public API requirement, and
`ChunkingCfg` lives in `kb-config` but the chunker takes
`ChunkPolicy` directly. Production deps are now exactly the allowed
set actually used: anyhow, blake3, kb-core, serde_json_canonicalizer,
tracing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the new workspace member with the bare Chunker impl shape:
chunker_version() returns "md-heading-v1"; policy_hash() blake3-hashes
canonical JSON of ChunkPolicy and truncates to 16 hex chars; chunk()
is an empty stub the next commits fill in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
I1: warning_agent maps ExtractFailed → "kb-parse-md" (the panic-recovery
emitter in kb-parse-md/src/blocks.rs). Lift-stage warnings from
build_canonical_document are tracked separately and attributed to
"kb-normalize", so the I1 mapping change does not misattribute drops
that originate in kb-normalize.
I2: ParsedPayload::AudioRef no longer synthesizes Block::AudioRef with
an invalid empty AssetId (which would violate AssetId::from_str's 32-hex
invariant). The block is dropped and a Warning that mentions the src
surfaces in Provenance, attributed to kb-normalize (lift-stage decision).
A TODO(P8) comment marks this as a placeholder until the audio extractor
lands.
I3: NFC-normalize each heading_path string in lift_block before feeding
it into id_for_block AND into CommonBlock.heading_path. Neither
pulldown-cmark nor serde_json_canonicalizer v0.3 applies NFC, so
canonically equivalent NFD/NFC inputs would produce different block_ids
without this normalization. Mirrors the existing doc_id NFC handling via
to_posix.
Minors:
- M4: trim Cargo.toml — drop kb-config, serde_json_canonicalizer,
blake3 (unused); keep tracing (now wired) + unicode-normalization
(now used by I3).
- M5: determinism_1000_iterations_under_1s now uses the same 5-block
fixture as block_ordinals_scoped_per_heading_and_kind (extracted into
fixture_blocks_five helper) so the determinism property is exercised
on a real lift_block path, not just an empty Vec. Still < 1s.
- M6: snapshot integration test now passes BodyHints { first_h1:
Some("Code And Table"), .. } and asserts doc.title == "Code And Table"
end-to-end. Baseline JSON updated.
- M7: title/lang edge-case unit tests pin policy: empty string lifts to
empty string; non-stringy values silently drop. Rustdoc updated.
- M10: provenance_contains_stage_events_in_order asserts events[1].at
== events[2].at to pin the shared-now_utc invariant.
New tests (unit, kb-normalize):
- provenance_with_extract_failed_warning_attributes_to_kb_parse_md (I1)
- audio_ref_block_skipped_with_warning (I2)
- nfc_nfd_korean_heading_path_same_block_id (I3)
- title_empty_string_in_user_map_falls_back_to_default (M7)
- title_non_string_in_user_map_silently_drops (M7)
- lang_invalid_shape_silently_drops (M7)
kb-normalize unit tests: 9 → 14. Integration snapshot: 1 (unchanged).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add the workspace member, `Cargo.toml` with the §8-allowed dep set
(kb-core, kb-parse-types, kb-config, serde, serde_json_canonicalizer,
blake3, unicode-normalization, time, anyhow, tracing) and a stubbed
`build_canonical_document` that pins the public signature plus
`doc_id` derivation. `kb-parse-md` is permitted only as a *dev*-dep so
the integration snapshot test (added later in this series) can drive
a fixture through the real parser without violating the production
boundary — `cargo tree -p kb-normalize --depth 1 --edges normal`
confirms no parser implementation appears in the regular dep tree.
`id_for_doc` and `id_for_block` are re-exported from kb-core (which
holds the canonical recipe per §4.2); kb-normalize is the canonical
*entry point* per design §8.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements `kb_parse_md::parse_blocks(body, body_offset_lines)` returning a
flat `Vec<ParsedBlock>` plus warnings. Walks pulldown-cmark events through a
small frame-based state machine that tracks heading paths, accumulates
inline buffers (Text/Code/Link/Strong/Emph only — design §3.4), and
reports SourceSpan::Line spans in 1-indexed file-line coordinates.
Covers headings, paragraphs, code blocks (lang from info string), GFM
tables (with malformed fallback to paragraph + MalformedTable warning),
lists (nested sub-lists flattened into parent item), and block-level image
references. Inline images are dropped silently per the inline filter.
Adversarial inputs are caught with `catch_unwind` and degrade to an empty
output + ExtractFailed warning.
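The degrade-on-panic behaviour can be sketched with std::panic::catch_unwind.
The Warning enum, the String block type and the parse_guarded name are
stand-ins for illustration; only the pattern (panic becomes empty output
plus an ExtractFailed warning) comes from this commit.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Stand-in for the crate's warning type, for illustration only.
#[derive(Debug, PartialEq)]
enum Warning {
    ExtractFailed,
}

/// Runs `parse`; if it panics, returns no blocks and one ExtractFailed
/// warning instead of unwinding into the caller.
fn parse_guarded<F>(parse: F) -> (Vec<String>, Vec<Warning>)
where
    F: FnOnce() -> (Vec<String>, Vec<Warning>),
{
    // AssertUnwindSafe is acceptable here because nothing captured by
    // the closure is observed again after a panic.
    match panic::catch_unwind(AssertUnwindSafe(parse)) {
        Ok(output) => output,
        Err(_) => (Vec::new(), vec![Warning::ExtractFailed]),
    }
}
```

Note that catch_unwind only helps when the binary is built with unwinding
panics; under panic = "abort" the process still terminates.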
15 unit tests cover heading-path correctness, code lang, table parsing,
malformed-table fallback (driven via synthetic events since pulldown-cmark
auto-normalizes table widths), LF/CRLF line-range parity, image refs,
nested-list flattening, inline filter, and 100-iteration random-bytes plus
hand-crafted adversarial-input no-panic guards.
Add the workspace member with the dep allow-list pinned by design §0 Q9
and the task spec. P1-2 will land the frontmatter submodule in the next
commit; P1-3 will add the block parser as a sibling.
Notable choice: serde_yaml (dtolnay) was archived as unmaintained in 2024
so we use serde_yaml_ng, the maintained fork. lingua's per-language
features are explicitly enabled (default-features=false) to keep build
time + binary size sane — only the languages we need at parse time.
Walk config.workspace.root, apply gitignore-style filters
(config.workspace.exclude ∪ .kbignore ∪ baked-in defaults for
.DS_Store / ._*), stream BLAKE3 over each file, and emit a
deterministic Vec<RawAsset> sorted by workspace_path.
Modules:
- hash: streaming blake3::Hasher + 64 KiB read buffer (no whole-file
loads); pinned digests for empty input and "hello world".
- media: extension → MediaType (markdown/pdf/image/audio/other).
- walker: ignore::OverrideBuilder for filter union; walkdir with
manual visited-set cycle protection on top of follow_links.
- connector: public FsSourceConnector::new(&Config) +
SourceConnector::scan(&SourceScope) impl. Uses
kb_core::to_posix for WorkspacePath construction (carries the
P0-1 '#' rejection through unchanged) and kb_core::id_for_asset
for AssetId derivation. Storage variant signals intent only;
actual byte copy is P1-6's responsibility.
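A std-only sketch of the hash and media modules' core ideas. The hasher is
abstracted as an update callback so the example needs no external crate
(the commit uses blake3::Hasher); the concrete extension list, the
case-insensitive match, and the names media_type_for and stream_chunks are
assumptions for illustration.

```rust
use std::io::{self, Read};
use std::path::Path;

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum MediaType {
    Markdown,
    Pdf,
    Image,
    Audio,
    Other,
}

/// Maps a file extension to a media type (extension list is assumed).
fn media_type_for(path: &Path) -> MediaType {
    let ext = path
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());
    match ext.as_deref() {
        Some("md") | Some("markdown") => MediaType::Markdown,
        Some("pdf") => MediaType::Pdf,
        Some("png") | Some("jpg") | Some("jpeg") | Some("gif") => MediaType::Image,
        Some("mp3") | Some("wav") | Some("flac") => MediaType::Audio,
        _ => MediaType::Other,
    }
}

/// Feeds `reader` to `update` in chunks of at most 64 KiB, so the whole
/// file is never held in memory. Returns the total number of bytes read.
fn stream_chunks<R: Read>(mut reader: R, mut update: impl FnMut(&[u8])) -> io::Result<u64> {
    let mut buf = vec![0u8; 64 * 1024];
    let mut total = 0u64;
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => break, // end of input
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        update(&buf[..n]);
        total += n as u64;
    }
    Ok(total)
}
```

With a real hasher, the callback would be something like
`|chunk| { hasher.update(chunk); }`, followed by finalizing the digest
once the stream is exhausted.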
Per design §3.3, §6.2, §6.6, §7.1, §7.2, §8.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Cargo.toml: remove `thiserror` from kb-config, kb-parse-types, kb-app
(unused — none of those crates' src trees reference thiserror; CoreError
in kb-core is the only consumer).
- kb-config keeps the `kb-core` dep with a one-line comment marking
CoreError reserved for P1-* config-error wiring per the review thread.
- ids.rs: switch `validate_hex32` from a hand-rolled `matches!` byte range
to `is_ascii_hexdigit()` so the hex check is the canonical idiom (and
satisfies `clippy::manual_is_ascii_check` under `-D warnings`).
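For reference, a sketch of the before/after shape of the hex check. The
bool return type and the acceptance of uppercase digits are assumptions
(the real validate_hex32 may return a Result and may be stricter);
validate_hex32_manual is a hypothetical name for the old form.

```rust
/// Canonical idiom: exactly 32 ASCII hex digits.
fn validate_hex32(s: &str) -> bool {
    s.len() == 32 && s.bytes().all(|b| b.is_ascii_hexdigit())
}

/// The hand-rolled equivalent that clippy::manual_is_ascii_check flags.
fn validate_hex32_manual(s: &str) -> bool {
    s.len() == 32
        && s.bytes()
            .all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f' | b'A'..=b'F'))
}
```

Both forms accept the same inputs; the change is about idiom and lint
cleanliness, not behaviour.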
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>