Establishes the kb-embed trait crate so concrete embedding adapters
(p3-2 fastembed, future ollama-embed/candle) target a stable surface.
Pure re-export of kb_core::{Embedder, EmbeddingInput, EmbeddingKind,
EmbeddingModelId, EmbeddingVersion} plus a feature-gated deterministic
mock for downstream tests.
MockEmbedder (cfg(feature = "mock"), default OFF):
- Per-component hash recipe: blake3(seed_le8 || kind_byte ||
text_len_le8 || text || i_le8). The length prefix makes the encoding
self-delimiting. Because text is the only variable-length field and
i_le8 is fixed-width, two distinct (text, i) pairs already produce
different byte strings, so the prefix is defence in depth.
- Document = 0u8, Query = 1u8 — same text different kind yields
different vectors (mirrors e5 prefix behaviour).
- Per component: blake3 first 8 bytes → little-endian i64 → divide by
i64::MAX in f64 → cast to f32. i64::MAX rounds to 2^63 when cast to
f64, so i64::MIN maps to exactly -1.0; range [-1, 1] holds.
- L2 unit-normalised. The squared norm is accumulated in f64 to limit
rounding error; the reciprocal norm is cast to f32 before scaling.
A zero-norm guard skips the divide.
- with_seed(...) constructor lets two embedders share identity but
produce different vectors — useful for downstream parametric tests.
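
The byte layout fed to the hasher can be sketched without the hash
itself. This is a minimal std-only illustration; hasher_input and
parse_hasher_input are hypothetical names, not part of kb-embed. It
shows that the framing can be parsed back unambiguously.

```rust
/// Hypothetical helper: builds the byte string that the recipe feeds to
/// the hash function. The hashing step itself is omitted.
fn hasher_input(seed: u64, kind_byte: u8, text: &str, i: u64) -> Vec<u8> {
    let mut buf = Vec::with_capacity(8 + 1 + 8 + text.len() + 8);
    buf.extend_from_slice(&seed.to_le_bytes()); // seed_le8
    buf.push(kind_byte); // 0 = Document, 1 = Query
    buf.extend_from_slice(&(text.len() as u64).to_le_bytes()); // text_len_le8
    buf.extend_from_slice(text.as_bytes()); // text
    buf.extend_from_slice(&i.to_le_bytes()); // i_le8
    buf
}

/// Reads a little-endian u64 starting at byte offset `at`, if in bounds.
fn read_u64_le(buf: &[u8], at: usize) -> Option<u64> {
    let end = at.checked_add(8)?;
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(buf.get(at..end)?);
    Some(u64::from_le_bytes(bytes))
}

/// Hypothetical inverse: recovers (seed, kind_byte, text, i). It succeeds
/// only when the buffer is exactly one well-formed frame.
fn parse_hasher_input(buf: &[u8]) -> Option<(u64, u8, String, u64)> {
    let seed = read_u64_le(buf, 0)?;
    let kind_byte = *buf.get(8)?;
    let text_len = read_u64_le(buf, 9)? as usize;
    let text_end = 17usize.checked_add(text_len)?;
    let text = String::from_utf8(buf.get(17..text_end)?.to_vec()).ok()?;
    let i = read_u64_le(buf, text_end)?;
    if buf.len() != text_end + 8 {
        return None; // trailing bytes: not a single frame
    }
    Some((seed, kind_byte, text, i))
}
```

Since every frame decodes to exactly one (seed, kind, text, i) tuple,
distinct tuples can only collide through the hash function itself.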
Helpers:
- assert_vector_shape(vecs, dims) — len + finite check.
- assert_unit_norm(vecs, tolerance) — caller-supplied tolerance;
5e-4 documented as safe for dims=384 under f32 epsilon × √dims.
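
A rough sketch of what the two helpers check, under assumed
signatures (the real ones live in kb-embed and may differ).
rough_norm_error_bound is a hypothetical name used only to show why
5e-4 leaves ample headroom at dims=384.

```rust
/// Assumed signature: every vector has exactly `dims` finite components.
pub fn assert_vector_shape(vecs: &[Vec<f32>], dims: usize) {
    for (n, v) in vecs.iter().enumerate() {
        assert_eq!(v.len(), dims, "vector {} has the wrong length", n);
        assert!(
            v.iter().all(|x| x.is_finite()),
            "vector {} has a non-finite component",
            n
        );
    }
}

/// Assumed signature: every vector has L2 norm within `tolerance` of 1.0.
/// The norm is accumulated in f64, mirroring the embedder.
pub fn assert_unit_norm(vecs: &[Vec<f32>], tolerance: f32) {
    for (n, v) in vecs.iter().enumerate() {
        let norm = v
            .iter()
            .map(|&x| f64::from(x) * f64::from(x))
            .sum::<f64>()
            .sqrt();
        assert!(
            (norm - 1.0).abs() <= f64::from(tolerance),
            "vector {} has norm {}, outside 1.0 +/- {}",
            n,
            norm,
            tolerance
        );
    }
}

/// Hypothetical helper: the f32 epsilon x sqrt(dims) estimate quoted above.
pub fn rough_norm_error_bound(dims: usize) -> f32 {
    f32::EPSILON * (dims as f32).sqrt()
}
```

For dims=384 the estimate is on the order of 1e-6, well below 5e-4.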
Tests:
- cargo test -p kb-embed (no features): 2 reexport/dyn-dispatch tests.
- cargo test -p kb-embed --features mock: 7 tests including 100-case
proptest asserting len == dims, all finite, ‖v‖ ≈ 1.0 within
tolerance, Doc(text) byte-equal Doc(text), Doc(text) ≠ Query(text),
Doc(text1) ≠ Doc(text2).
- All 220 workspace tests pass; clippy clean for both default and
mock-on feature configurations.
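
The properties the proptest asserts can be illustrated with a
std-only stand-in. Assumptions: std's DefaultHasher replaces blake3
(so the values differ from the real mock), and all function names
here are hypothetical. The byte layout, the i64 → f64 → f32 mapping
and the f64 norm accumulation follow the recipe in this message.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Stand-in component: same framing and mapping as the recipe, but hashed
/// with DefaultHasher instead of blake3 (assumption for illustration).
fn component(seed: u64, kind_byte: u8, text: &str, i: usize) -> f32 {
    let mut h = DefaultHasher::new();
    h.write(&seed.to_le_bytes());
    h.write(&[kind_byte]);
    h.write(&(text.len() as u64).to_le_bytes());
    h.write(text.as_bytes());
    h.write(&(i as u64).to_le_bytes());
    let raw = h.finish() as i64; // reinterpret the 64 hash bits as signed
    ((raw as f64) / (i64::MAX as f64)) as f32
}

/// Builds one vector and L2-normalises it, accumulating the norm in f64.
fn embed_one(seed: u64, kind_byte: u8, text: &str, dims: usize) -> Vec<f32> {
    let mut v: Vec<f32> = (0..dims)
        .map(|i| component(seed, kind_byte, text, i))
        .collect();
    let norm_sq: f64 = v.iter().map(|&x| f64::from(x) * f64::from(x)).sum();
    if norm_sq > 0.0 {
        let inv = (1.0 / norm_sq.sqrt()) as f32;
        for x in v.iter_mut() {
            *x *= inv;
        }
    }
    v
}

/// Bit-level equality, matching the "byte-equal" wording.
fn bits_equal(a: &[f32], b: &[f32]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}

/// Returns true when every listed property holds for the pair of texts.
fn properties_hold(text: &str, other: &str, dims: usize, tolerance: f64) -> bool {
    const DOC: u8 = 0;
    const QUERY: u8 = 1;
    let doc = embed_one(0, DOC, text, dims);
    let norm = doc
        .iter()
        .map(|&x| f64::from(x) * f64::from(x))
        .sum::<f64>()
        .sqrt();
    doc.len() == dims
        && doc.iter().all(|x| x.is_finite())
        && (norm - 1.0).abs() <= tolerance
        && bits_equal(&doc, &embed_one(0, DOC, text, dims))
        && !bits_equal(&doc, &embed_one(0, QUERY, text, dims))
        && !bits_equal(&doc, &embed_one(0, DOC, other, dims))
}
```

A call such as properties_hold("hello", "world", 384, 5e-4) checks one
case; a property-testing harness would draw the two texts at random.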
Symbol gating: nm on the release rlib confirms zero MockEmbedder
symbols under default features; three trait impl symbols under
--features mock. Spec invariant "release builds MUST NOT include
MockEmbedder" verified at the symbol level.
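
The gating pattern behind this invariant, as a minimal sketch. The
module name mock, the placeholder struct body and mock_compiled_in
are assumptions for illustration, not the crate's actual layout.

```rust
// Compiled only when the crate is built with `--features mock`; in a
// default build this module, and every symbol in it, does not exist.
#[cfg(feature = "mock")]
pub mod mock {
    /// Placeholder standing in for the real test double.
    pub struct MockEmbedder;
}

/// Reports whether the `mock` feature was enabled at compile time.
pub fn mock_compiled_in() -> bool {
    cfg!(feature = "mock")
}
```

Because the item is removed before code generation, a symbol scan of a
default build has nothing to find.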
Allowed deps respected: kb-core, kb-config, serde, thiserror, tracing,
plus anyhow (forced by trait return type) and blake3 (justified by
the determinism contract; already in workspace lockfile via kb-core).
No fastembed/ort/tokenizers anywhere.
Out of scope: real adapter (p3-2), reranker traits (P+).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
147 lines
5.4 KiB
Rust
//! Deterministic mock embedder for downstream tests.
//!
//! Compiled only when the `mock` feature is enabled. Default builds
//! (`cargo build --release -p kb-embed`) MUST NOT contain the `MockEmbedder`
//! symbol, which is verifiable by symbol scan (`nm`, `cargo bloat`).
//!
//! ## Determinism contract
//!
//! For every call to [`MockEmbedder::embed`], component `i` of the output
//! vector for input `(text, kind)` is computed as:
//!
//! ```text
//! h       = blake3(seed_le8 || kind_byte || text_len_le8 || text_utf8 || i_le8)
//! raw_i64 = i64::from_le_bytes(h[0..8])
//! comp    = (raw_i64 as f64 / i64::MAX as f64) as f32   // ∈ [-1.0, 1.0]
//! ```
//!
//! `kind_byte` is `0u8` for [`EmbeddingKind::Document`] and `1u8` for
//! [`EmbeddingKind::Query`], mirroring e5-style prefix behavior (the same
//! text in different roles produces different vectors). `text_len_le8` is the
//! length of `text_utf8` (in bytes) as a little-endian `u64`; it makes the
//! encoding self-delimiting. Because `text` is the only variable-length field
//! and `i_le8` is fixed-width, the layout is unambiguous even without the
//! prefix (e.g. `("ABCDEFGH", 0)` and `("", u64::from_le_bytes(*b"ABCDEFGH"))`
//! already differ in total length), so the prefix is defence in depth.
//!
//! After the per-component pass each vector is **L2-normalized to unit
//! length** so downstream cosine-similarity tests can rely on unit-norm
//! input. The result satisfies ‖v‖ ≈ 1.0 to within roughly
//! `√dims · f32::EPSILON`: each component carries an f32 rounding error
//! bounded by `f32::EPSILON`, and those errors add in quadrature in the L2
//! norm. If a vector ends up all zeros (vanishingly unlikely from BLAKE3), it
//! is left untouched rather than divided by zero.
//!
//! Invariants the contract guarantees:
//!
//! * Identical `(seed, kind, text, dimensions)` → byte-identical output.
//! * Different `kind` for the same text → different output (kind_byte differs).
//! * Different `text` → different output with overwhelming probability.
//! * All output components are finite (`is_finite()`).

use kb_core::{Embedder, EmbeddingInput, EmbeddingKind, EmbeddingModelId, EmbeddingVersion};

/// Deterministic test double. See module docs for the hashing recipe.
pub struct MockEmbedder {
    model_id: EmbeddingModelId,
    version: EmbeddingVersion,
    dimensions: usize,
    seed: u64,
}

impl MockEmbedder {
    /// Construct with `seed = 0`. Use [`Self::with_seed`] to pick a different
    /// seed (e.g., to verify two embedders with the same identity but
    /// different seeds yield different vectors).
    pub fn new(
        model_id: EmbeddingModelId,
        version: EmbeddingVersion,
        dimensions: usize,
    ) -> Self {
        Self {
            model_id,
            version,
            dimensions,
            seed: 0,
        }
    }

    /// Construct with an explicit seed. Useful for differential tests.
    pub fn with_seed(
        model_id: EmbeddingModelId,
        version: EmbeddingVersion,
        dimensions: usize,
        seed: u64,
    ) -> Self {
        Self {
            model_id,
            version,
            dimensions,
            seed,
        }
    }

    fn kind_byte(kind: EmbeddingKind) -> u8 {
        match kind {
            EmbeddingKind::Document => 0,
            EmbeddingKind::Query => 1,
        }
    }

    fn component(&self, kind: EmbeddingKind, text: &str, i: usize) -> f32 {
        let mut hasher = blake3::Hasher::new();
        hasher.update(&self.seed.to_le_bytes());
        hasher.update(&[Self::kind_byte(kind)]);
        // Length-prefix `text` (LE u64) so the encoding is self-delimiting.
        // `text` is the only variable-length field and `i` is fixed-width,
        // so the layout is unambiguous even without the prefix; it is kept
        // as defence in depth.
        hasher.update(&(text.len() as u64).to_le_bytes());
        hasher.update(text.as_bytes());
        hasher.update(&(i as u64).to_le_bytes());
        let digest = hasher.finalize();
        let bytes = digest.as_bytes();
        let mut head = [0u8; 8];
        head.copy_from_slice(&bytes[..8]);
        let raw = i64::from_le_bytes(head);
        // Map to [-1.0, 1.0]. `i64::MAX as f64` rounds up to 2^63, so the
        // divisor is finite and nonzero and `i64::MIN` maps to exactly -1.0.
        // The ratio is always finite, and casting a value in this range to
        // f32 cannot produce NaN or Inf.
        ((raw as f64) / (i64::MAX as f64)) as f32
    }
}

impl Embedder for MockEmbedder {
    fn model_id(&self) -> EmbeddingModelId {
        self.model_id.clone()
    }

    fn model_version(&self) -> EmbeddingVersion {
        self.version.clone()
    }

    fn dimensions(&self) -> usize {
        self.dimensions
    }

    fn embed(&self, inputs: &[EmbeddingInput<'_>]) -> anyhow::Result<Vec<Vec<f32>>> {
        let mut out = Vec::with_capacity(inputs.len());
        for input in inputs {
            let mut v: Vec<f32> = (0..self.dimensions)
                .map(|i| self.component(input.kind, input.text, i))
                .collect();

            // L2-normalize. Skip the rare all-zero case to avoid 0/0 = NaN.
            let norm_sq: f64 = v.iter().map(|&x| (x as f64) * (x as f64)).sum();
            if norm_sq > 0.0 {
                let inv = (1.0 / norm_sq.sqrt()) as f32;
                for x in v.iter_mut() {
                    *x *= inv;
                }
            }
            out.push(v);
        }
        Ok(out)
    }
}