Files
kebab/crates/kb-embed-local/tests
altair823 bcbe2b8531 feat(p3-2): kb-embed-local crate — fastembed adapter for multilingual-e5-small
First real Embedder implementation. Wraps fastembed-rs (ONNX runtime)
with the e5 prefix convention, batching, and {data_dir}/${XDG_DATA_HOME}
template expansion so model files land under config.storage.model_dir/
fastembed/ without polluting kb-config's public API.

Public surface:
- pub struct FastembedEmbedder
- pub fn new(config: &Config) -> Result<Self>
- impl kb_core::Embedder (via kb-embed re-export)

Behavior:
- Default model multilingual-e5-small (384 dims). model_id and
  model_version come from config.models.embedding.{model,version}.
- Pre-load dim check via TextEmbedding::get_model_info: dim mismatch
  bails before paying the ~470MB ONNX init cost.
- e5 prefix applied BEFORE tokenization: "passage: " for
  EmbeddingKind::Document, "query: " for EmbeddingKind::Query. Pinned
  by prefix_input unit tests.
- Batches inputs into chunks of config.models.embedding.batch_size,
  concatenates results in input order.
- L2 normalization is performed by fastembed 4.9's default transformer
  pipeline (verified at fastembed/src/text_embedding/output.rs:43);
  we skip re-normalization. Integration test pins ‖v‖ ≈ 1.0 ± 1e-3 so
  a future fastembed bump that drops this invariant fails loudly.
- Synchronous (no async runtime). Mutex serializes calls into the
  underlying ONNX session — conservative; ORT Session is Send+Sync but
  callers (kb-app indexer) batch sequentially anyway. Revisit if
  profiling shows contention.
- First-run model download surfaces via tracing::info before/after
  TextEmbedding::try_new — users no longer stare at a silent 30-60s
  pause during the 470MB pull.

Tests:
- 11 default-lane tests covering: check_dim match/mismatch (no model
  load), prefix_input Document/Query/empty, resolve_model
  known/unknown, expand_path substitution + no-op + XDG_DATA_HOME set
  + XDG_DATA_HOME unset (falls back to ~/.local/share with recursive
  ~ expansion). XDG tests serialize on a Mutex + RAII guard since
  edition 2024 makes set_var/remove_var unsafe.
- 7 #[ignore] integration tests covering: full construction with
  default config, dim-mismatch belt-and-braces, Document vs Query
  cosine differential, L2 unit norm, byte-equal determinism, batch-64
  performance under 5s, snapshot-hash stability over a 5-sentence
  multilingual fixture.
- Snapshot test fails LOUDLY when SNAPSHOT_HASH_BASELINE is 0 — prints
  the captured hash and panics with paste-back instructions, so first
  --ignored run forces the maintainer to pin the baseline rather than
  silently passing.
- Workspace: 222 tests pass (default lane); clippy clean.

Allowed deps respected: kb-config, kb-embed (re-exports kb-core
trait surface), fastembed = "4.9", tracing, anyhow. tokenizers and
ort enter transitively through fastembed; reqwest/hyper/hf-hub also
transitive (model download is fastembed's responsibility per spec
carve-out). No direct kb-core dep needed — re-exports cover it.

Pinned to fastembed 4.x rather than the recent 5.x to limit blast
radius; consider bump when p3-3 (lancedb-store) consumes the embedder
output shape.

Out of scope: reranker (P+), Ollama embedding endpoint, candle
adapter, image embeddings (P6).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 08:39:38 +00:00
..