feat(embed): candle embedding provider (NUMA-safe, opt-in) + v0.22.0

On dual-socket NUMA servers, fastembed (onnxruntime) hardcodes 48 intra-op threads, which corrupts the heap under NUMA and kills ingest with a double-free. To work around this, add an opt-in embedding provider that runs the same multilingual-e5-large model in pure Rust (candle).

- New crate kebab-embed-candle: CandleEmbedder (kebab_core::Embedder). hf-hub safetensors → XLMRobertaModel forward → mask mean-pool → L2 → e5 prefix. The candle dependency tree is isolated in this crate (no kebab-* dependencies other than core/config).
- Thread cap: [models.embedding].num_threads + env KEBAB_EMBED_THREADS → caps the global rayon pool once (the NUMA-safe lever).
- kebab-app::embedder() branches on provider (fastembed/onnx/"" → existing path unchanged, candle → CandleEmbedder, unknown value → error).
- Remove the Phase 0 spike crate (absorbed into production).
- Version 0.21.1 → 0.22.0 (new config surface, pre-1.0 minor bump).

Parity: cosine_min=1.000000, max abs diff=2.01e-7 (< 1e-5) → embedding_version kept, no reindexing. fastembed default behavior and vectors unchanged. No wire schema change.

Verification (file + exit code): clippy -D warnings EXIT=0 (0 warnings), test EXIT=0 (candle unit 5 + thread_cap rayon=4 + config 68), parity #[ignore] EXIT=0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
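
For readers of this commit, the thread-cap precedence and the pooling step described above can be sketched in dependency-free Rust. This is an illustration under stated assumptions, not the code in kebab-embed-candle: the function names `resolve_thread_cap` and `masked_mean_pool_l2` are hypothetical, and it is assumed that a parsable env value always wins over the config value (including `0`, which means auto) and that an unparsable env value falls back to the config.

```rust
/// Hypothetical helper (not from the commit): decide the worker-thread cap.
/// Assumption: a parsable env value overrides the config value; an
/// unparsable or missing env value falls back to the config value.
/// `0` means "auto", returned here as `None` (no cap applied).
fn resolve_thread_cap(env_value: Option<&str>, cfg_threads: u32) -> Option<usize> {
    let n = env_value
        .and_then(|s| s.trim().parse::<u32>().ok())
        .unwrap_or(cfg_threads);
    if n == 0 {
        None
    } else {
        Some(n as usize)
    }
}

/// Hypothetical helper (not from the commit): masked mean-pool followed by
/// L2 normalization. `tokens[i]` is the hidden state of token i and
/// `mask[i]` is 1 for a real token, 0 for padding.
fn masked_mean_pool_l2(tokens: &[Vec<f32>], mask: &[u8]) -> Vec<f32> {
    let dim = tokens.first().map_or(0, |t| t.len());
    let mut pooled = vec![0.0f32; dim];
    let mut count = 0.0f32;

    // Sum only the vectors of unmasked (real) tokens.
    for (tok, &m) in tokens.iter().zip(mask) {
        if m != 0 {
            for (p, x) in pooled.iter_mut().zip(tok) {
                *p += *x;
            }
            count += 1.0;
        }
    }

    // Mean over the real tokens; skipped when every token is padding.
    if count > 0.0 {
        for p in pooled.iter_mut() {
            *p /= count;
        }
    }

    // L2-normalize so that cosine similarity reduces to a dot product.
    let norm = pooled.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for p in pooled.iter_mut() {
            *p /= norm;
        }
    }
    pooled
}
```

A caller would presumably pass `std::env::var("KEBAB_EMBED_THREADS").ok().as_deref()` as `env_value` and, when the result is `Some(n)`, apply it once with `rayon::ThreadPoolBuilder::new().num_threads(n).build_global()`. That call returns an error if the global pool has already been initialized, which fits the "capped once" wording of the commit message.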
2026-06-01 14:52:25 +00:00
parent 76841af7d3
commit 8f7b6ee538
18 changed files with 825 additions and 330 deletions
--- a/crates/kebab-config/src/lib.rs
+++ b/crates/kebab-config/src/lib.rs
@@ -155,11 +155,21 @@ impl NliCfg {

 #[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
 pub struct EmbeddingModelCfg {
+    /// `fastembed` (default, onnxruntime) or `candle` (pure-Rust,
+    /// NUMA-safe). `none` disables embeddings (lexical-only). Unknown
+    /// values error at embedder construction.
    pub provider: String,
    pub model: String,
    pub version: String,
    pub dimensions: usize,
    pub batch_size: usize,
+    /// Cap on the CPU worker threads the `candle` provider spins up
+    /// (sizes the global rayon pool; env `KEBAB_EMBED_THREADS` overrides).
+    /// `0` = auto (rayon default = #cores). Lever to sidestep the
+    /// onnxruntime 48-thread NUMA double-free; ignored by the `fastembed`
+    /// provider. Defaulted on load so pre-0.22 config files still parse.
+    #[serde(default)]
+    pub num_threads: u32,
 }

 #[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
@@ -707,6 +717,7 @@ impl Config {
                    version: "v1".to_string(),
                    dimensions: 1024,
                    batch_size: 64,
+                    num_threads: 0,
                },
                llm: LlmCfg {
                    provider: "ollama".to_string(),
@@ -964,6 +975,11 @@ impl Config {
                        self.models.embedding.batch_size = n;
                    }
                }
+                "KEBAB_MODELS_EMBEDDING_NUM_THREADS" => {
+                    if let Ok(n) = v.parse::<u32>() {
+                        self.models.embedding.num_threads = n;
+                    }
+                }

                // models.llm
                "KEBAB_MODELS_LLM_PROVIDER" => self.models.llm.provider = v.clone(),