fix(rag): normalize RRF fusion_score to [0,1] + log post-merge hotfixes

## Bug

config.rag.score_gate default 0.05 was incompatible with hybrid RRF
fusion_score: raw RRF tops out at num_retrievers / (k_rrf + 1) ≈ 0.0328
at the default k_rrf=60, so every hybrid `kb ask` tripped ScoreGate
refusal even when the top hit was perfectly aligned across both
retrievers.

Symptomatic on the post-P4-3 manual smoke at /tmp/kb-smoke/ pointed at
192.168.0.47 Ollama:

    $ kb ask "Rust ownership 모델의 핵심 규칙은 뭐야?" --mode hybrid
    근거 부족. KB에 해당 내용 없음.   # top fusion_score = 0.0164

(The query asks "What are the core rules of the Rust ownership model?";
the reply is the refusal message "Insufficient evidence. No matching
content in the KB.")

Per-mode score_gate (lexical_score_gate / vector_score_gate /
hybrid_score_gate) was rejected because it forces every consumer (CLI,
eval, TUI) to know which mode picks which threshold. Score
normalization solves it at the source.

## Fix

crates/kb-search/src/hybrid.rs divides every fused score by
2 / (k_rrf + 1), the theoretical RRF maximum with two retrievers each
contributing rank 1. After normalization:

- both retrievers agree on rank 1 → fusion_score = 1.0
- only one retriever finds the chunk → caps at 0.5
- any other rank combination → falls strictly between 0 and 1
  (ranks 1 and 2 give ≈ 0.9919)

RRF's rank-ordering invariants are preserved (every score divides by
the same positive constant), so sort + tiebreak behaviour is unchanged.
Wire schema label `fusion_score` keeps its slot in RetrievalDetail;
only the magnitude shifts, and only for hybrid mode (lexical / vector
were already in [0, 1]).

Verification: re-ran the four-scenario smoke at /tmp/kb-smoke/ with
default score_gate = 0.05. All four (Korean→Korean, English→English,
cross-language Korean↔English, out-of-corpus) succeed with the expected
grounded / refusal classification; top fusion_score is now ≈ 0.5.

## Tests

One unit test (rrf_formula_matches_known_value) updated to expect the
normalized value `(1/61 + 1/62) / (2/61) ≈ 0.9919` instead of the raw
`1/61 + 1/62 ≈ 0.0325`.

The integration snapshot fixture
crates/kb-search/tests/fixtures/search/hybrid/run-1.json already used
presence checks (fusion_score_positive: true) rather than absolute
values, so it doesn't need regeneration.

Workspace 319 tests pass; clippy clean across both feature configs.

## Docs

This commit also adds tasks/HOTFIXES.md as a dated post-merge log
covering this fix and the two earlier --config-flag regressions (P3-5
hotfix #20 across ingest/search/list/inspect/doctor; P4-3 follow-up #24
for kb ask). Original task specs in tasks/p<N>/*.md stay frozen as the
historical contract; HOTFIXES.md is the live source of truth for
post-merge deltas.

Each affected task spec gets a "Risks/notes" addendum pointing back to
HOTFIXES.md so a reader landing on the spec sees the active behaviour:

- tasks/INDEX.md gains a "Post-merge 핫픽스" (hotfixes) section.
- tasks/phase-3-vector-hybrid.md updates the RRF formula text to show
  the normalized form.
- tasks/p3/p3-4-hybrid-fusion.md "Behavior contract" RRF bullet notes
  the normalization and reason.
- tasks/p3/p3-5-app-wiring.md "Risks/notes" notes the --config fix.
- tasks/p4/p4-3-rag-pipeline.md "Risks/notes" notes the kb-ask --config
  fix and the score_gate-RRF incompatibility (closed by the
  normalization in p3-4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:16:01 +00:00
parent d7f3d38d48
commit 9dde01eb9f
7 changed files with 116 additions and 5 deletions
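
The arithmetic in the commit message can be checked with a small standalone sketch. This is not the code from `hybrid.rs`: the function names `raw_rrf`, `normalized_rrf`, and `passes_gate` are hypothetical, and the `>=` comparison in the gate is an assumption, since the commit does not show how ScoreGate compares the score against the threshold.

```rust
/// Raw Reciprocal Rank Fusion score for one chunk.
/// Ranks are 1-based; `None` means that retriever did not return the chunk.
fn raw_rrf(lex_rank: Option<u32>, vec_rank: Option<u32>, k_rrf: u32) -> f64 {
    let k = f64::from(k_rrf);
    [lex_rank, vec_rank]
        .iter()
        .flatten() // skip retrievers that did not return the chunk
        .map(|&r| 1.0 / (k + f64::from(r)))
        .sum()
}

/// Raw score divided by the two-retriever maximum `2 / (k_rrf + 1)`,
/// which is the raw score of a chunk ranked first by both retrievers.
fn normalized_rrf(lex_rank: Option<u32>, vec_rank: Option<u32>, k_rrf: u32) -> f64 {
    let normalizer = 2.0 / (f64::from(k_rrf) + 1.0);
    raw_rrf(lex_rank, vec_rank, k_rrf) / normalizer
}

/// Hypothetical gate check (assumption: a score passes when it is at
/// least the configured threshold).
fn passes_gate(score: f64, score_gate: f64) -> bool {
    score >= score_gate
}
```

With `k_rrf = 60`, a chunk ranked first by only one retriever has a raw score of 1/61 ≈ 0.0164, which matches the smoke output above and fails a 0.05 gate. After normalization the same chunk scores 0.5 and passes. Because every score is divided by the same positive constant, the relative order of any two chunks is unchanged.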
--- a/crates/kb-search/src/hybrid.rs
+++ b/crates/kb-search/src/hybrid.rs
@@ -191,8 +191,32 @@ impl HybridRetriever {
        }

        // Compute fused score per chunk.
+        //
+        // Raw RRF: `Σ 1/(k_rrf + rank_m(c))` over the retrievers a chunk
+        // appears in. With two retrievers the raw upper bound is
+        // `2/(k_rrf + 1)` — at k_rrf=60 that's only ≈0.0328, which makes
+        // a single `config.rag.score_gate` default of 0.05 silently
+        // refuse every hybrid query (and is incomparable with lexical /
+        // vector `fusion_score` already in [0, 1]).
+        //
+        // Normalize by the theoretical max so `fusion_score` lives in
+        // [0, 1] across all three SearchModes. The normalization factor
+        // is `num_retrievers / (k_rrf + 1)`. With both retrievers
+        // contributing rank=1 the normalized score is exactly 1.0;
+        // chunks present in only one retriever cap at exactly 0.5;
+        // every other rank combination lands in (0, 1). RRF's rank
+        // ordering is preserved (we divide every score by
+        // the same positive constant), so the sort + tiebreak path is
+        // unchanged. Wire schema label `fusion_score` keeps its slot in
+        // `RetrievalDetail`; only the magnitude shifts.
        let FusionPolicy::Rrf { k_rrf } = self.fusion;
        let k_rrf_f = f64::from(k_rrf);
+        // Both retrievers can contribute, so the per-chunk RRF max is
+        // 2 / (k_rrf + 1). Even when a chunk appears in only one
+        // retriever's results, we still divide by this same constant;
+        // the score then caps at 0.5, which is exactly the
+        // "half-aligned" semantic we want users to compare against `score_gate`.
+        let rrf_normalizer = 2.0_f64 / (k_rrf_f + 1.0);

        struct Scored {
            chunk_id: String,
@@ -212,6 +236,7 @@ impl HybridRetriever {
                if let Some(r) = vec_rank {
                    rrf += 1.0 / (k_rrf_f + f64::from(r));
                }
+                rrf /= rrf_normalizer;
                Scored {
                    chunk_id: cid,
                    rrf,
@@ -462,8 +487,12 @@ mod tests {
    #[test]
    fn rrf_formula_matches_known_value() {
        // chunk A appears at lexical rank 1, vector rank 2; k_rrf=60.
-        // Expected: 1/(60+1) + 1/(60+2) = 1/61 + 1/62.
-        let expected = 1.0_f64 / 61.0 + 1.0_f64 / 62.0;
+        // Raw RRF: 1/(60+1) + 1/(60+2) = 1/61 + 1/62.
+        // After normalization by `2 / (60 + 1)` (theoretical max with
+        // both retrievers contributing rank=1), the score lives in
+        // [0, 1]: `(1/61 + 1/62) / (2/61) = 0.5 + 61/124 ≈ 0.9919`.
+        let raw = 1.0_f64 / 61.0 + 1.0_f64 / 62.0;
+        let expected = raw / (2.0_f64 / 61.0);
        let lex = Arc::new(CannedRetriever::new(
            vec![mk_hit("aaaa", 1, SearchMode::Lexical, 0.5)],
            "lex-v1",
--- a/tasks/HOTFIXES.md
+++ b/tasks/HOTFIXES.md
@@ -0,0 +1,71 @@
+---
+title: "Post-merge hotfixes log"
+date: 2026-05-01
+---
+
+# Post-merge hotfixes log
+
+Bugs discovered AFTER a phase task was merged, and the small follow-up
+PRs that close them. Each entry: what broke, how it surfaced, what the
+fix touched, and which task spec it amends.
+
+The original task specs in `tasks/p<N>/p<N>-<M>-*.md` stay frozen as the
+historical contract that was implemented; this file accumulates the
+deltas so phase 5+ readers can find the live behavior without diffing
+git history.
+
+## 2026-05-01 — `--config` flag silently ignored across all kb-cli subcommands
+
+**Discovered**: post-P3-5 manual smoke at `/tmp/kb-smoke/`.
+
+**Symptom**: `kb --config /path/to/config.toml ingest|search|list|inspect|doctor` ignored the flag and fell back to `~/.config/kb/config.toml` (XDG default). Users had to use `KB_*` env vars to point at a non-default config.
+
+**Root cause**: `kb-cli` read `cli.config` only inside `Cmd::Ingest` to build `SourceScope`, then called bare `kb_app::ingest(scope, summary_only)` which internally re-loaded `Config::load(None)` (XDG path). Same pattern in `Cmd::Search` / `List` / `Inspect` / `Doctor`. P3-5 introduced `*_with_config` test seams via `#[doc(hidden)] pub fn` but kb-cli never used them.
+
+**Fix** (PR #20, fix/cli-config-flag-and-search-output):
+- `kb-cli` now builds the Config once via `Config::load(cli.config.as_deref())` at the top of every subcommand and threads it into `kb_app::*_with_config(cfg, ...)` instead of `kb_app::*(...)`.
+- `kb_app::doctor()` rewritten as `doctor_with_config_path(Option<&Path>)` that reports the actual path probed and hard-fails when `--config <path>` doesn't exist (defaults would otherwise mask user intent).
+- `kb-app` module doc-comment updated: `#[doc(hidden)] pub fn *_with_config` is no longer "test-only seam" — it's the official "config-explicit" API consumed by CLI `--config`, integration tests, and TUI sessions.
+- Same PR also improved `kb search` printer: `{:.4}` score formatting (RRF range collapses on `{:.2}`) and `> heading_path` suffix so chunks from the same document are visually distinct.
+
+**Amends**: tasks/p3/p3-5-app-wiring.md (the test seam was always meant to be the config-explicit API; only the doc-comment lied).
+
+### 2026-05-01 — `--config` regression in `kb ask` (P4-3 follow-up)
+
+**Discovered**: post-P4-3 manual smoke against 192.168.0.47 Ollama with `gemma4:26b`.
+
+**Symptom**: `kb --config <path> ask` returned `model.id = qwen2.5:14b-instruct` (XDG default model) and `score_gate = 0.30` (XDG default), instead of `gemma4:26b` / `0.05` from the explicit config. P4-3 added the ask body but kb-cli's `Cmd::Ask` arm still called bare `kb_app::ask(query, opts)` — same regression class as the P3-5 fix above, just missed when ask was wired.
+
+**Fix** (PR #24, fix/cli-ask-honor-config-flag):
+- `kb-cli` builds `Config::load(cli.config.as_deref())` once at the top of `Cmd::Ask` and calls `kb_app::ask_with_config(cfg, query, opts)`.
+
+**Amends**: tasks/p4/p4-3-rag-pipeline.md.
+
+## 2026-05-01 — RRF `fusion_score` incompatible with `config.rag.score_gate` default
+
+**Discovered**: post-P4-3 manual smoke. Top hybrid result returned `fusion_score = 0.0164` against `score_gate = 0.05` → ScoreGate refusal on every hybrid query.
+
+**Root cause**: RRF formula `score(c) = Σ 1/(k_rrf + rank_m(c))` produces values bounded by `num_retrievers / (k_rrf + 1)`. With `num_retrievers = 2` and the default `k_rrf = 60`, the upper bound is `2/61 ≈ 0.0328`. The default `config.rag.score_gate = 0.05` was calibrated for vector / lexical scores already in `[0, 1]` and silently refused every hybrid query. `fusion_score` was also incomparable across modes — Lexical / Vector lived in `[0, 1]`, Hybrid lived in `(0, 0.033]`.
+
+**Fix** (PR #25, fix/rrf-fusion-score-normalize-and-docs):
+- `crates/kb-search/src/hybrid.rs` divides every raw RRF score by `2 / (k_rrf + 1)` so `fusion_score` always lives in `[0, 1]` regardless of mode. Both retrievers contributing rank 1 normalises to `1.0`; chunks present in only one retriever cap around `0.5`. RRF's rank-ordering invariants are preserved (same constant divides every score), so sort + tiebreak behaviour is identical.
+- One unit test (`rrf_formula_matches_known_value`) updated to expect the normalised value `(1/61 + 1/62) / (2/61) ≈ 0.9919`.
+- The integration snapshot `crates/kb-search/tests/fixtures/search/hybrid/run-1.json` already used presence checks (`fusion_score_positive: true`) rather than absolute values, so it didn't need regeneration.
+
+**Why not a per-mode `score_gate` config**: separate `lexical_score_gate / vector_score_gate / hybrid_score_gate` would force every downstream consumer (CLI, eval, TUI) to know which mode picks which threshold. Normalising the score itself is a one-line change at the source and makes `Answer.retrieval.score_gate` semantically meaningful without per-mode bookkeeping.
+
+**Amends**: tasks/p3/p3-4-hybrid-fusion.md (RRF formula now divides by `2/(k_rrf+1)` after summation), tasks/phase-3-vector-hybrid.md (RRF section).
+
+**Verification**: post-fix smoke at `/tmp/kb-smoke/` with default `score_gate = 0.05` succeeded across four scenarios — Korean→Korean, English→English, cross-language, and out-of-corpus refusal.
+
+## How to add an entry
+
+Each fix gets a dated subsection with five fields:
+
+- **Discovered**: when / how the bug surfaced (smoke, integration test, user report).
+- **Symptom**: what the user saw / what was wrong.
+- **Root cause**: the actual code or design issue.
+- **Fix**: PR number / branch + a one-paragraph summary of the change.
+- **Amends**: which `tasks/p<N>/...` spec docs the fix retroactively contradicts. Spec text stays frozen; this log is the live source of truth for post-merge deltas.
+
+If a fix is large enough that the original spec is no longer a useful reference, promote the entry into a new task spec (e.g., `p<N>-<M+1>-<topic>.md`) and link from here.
--- a/tasks/INDEX.md
+++ b/tasks/INDEX.md
@@ -82,6 +82,10 @@ P0~P5 are serial. P6~P9 can run in parallel after P5.
  - [p9-4 tui-inspect](p9/p9-4-tui-inspect.md)
  - [p9-5 desktop-tauri](p9/p9-5-desktop-tauri.md)

+## Post-merge 핫픽스 (hotfixes)
+
+Bugs found after merge and their follow-up PRs are recorded as a dated log in [HOTFIXES.md](HOTFIXES.md). The original task specs stay frozen, and HOTFIXES.md is treated as the source of truth for post-merge behavior changes.
+
 ## Conventions common to all tasks

 - Do not violate dependency boundaries (`Allowed` / `Forbidden`). See report §19.
--- a/tasks/p3/p3-4-hybrid-fusion.md
+++ b/tasks/p3/p3-4-hybrid-fusion.md
@@ -93,9 +93,10 @@ impl kb_core::Retriever for VectorRetriever { /* per §7.2 */ }
 - `SearchMode::Vector` dispatches solely to `vector`. `RetrievalDetail.method = Vector`, `lexical_*` fields are `None`.
 - `SearchMode::Hybrid`:
  - run `lexical.search(query)` and `vector.search(query)` in sequence (fan-out is fine; not required).
-  - fuse with RRF: `score(c) = Σ_{m ∈ {lex, vec}} 1 / (k_rrf + rank_m(c))` where `k_rrf` from config (default 60). `rank_m` is 1-based; chunks not appearing in retriever `m` contribute 0.
+  - fuse with RRF: `raw(c) = Σ_{m ∈ {lex, vec}} 1 / (k_rrf + rank_m(c))` where `k_rrf` from config (default 60). `rank_m` is 1-based; chunks not appearing in retriever `m` contribute 0.
+  - **normalize fusion_score to [0, 1]** (post-merge fix, 2026-05): divide by `num_retrievers / (k_rrf + 1)` so the top-1-everywhere case maps to `1.0` and single-retriever chunks cap around `0.5`. Without this, raw RRF tops out at `≈ 0.033` and is incomparable with the `[0, 1]` lexical / vector `fusion_score` (and incompatible with the `config.rag.score_gate` default `0.05` — every hybrid query refused). RRF's rank ordering is preserved (we divide every score by the same positive constant). See [HOTFIXES.md](../HOTFIXES.md).
  - sort by fused score DESC, take top `query.k`.
-  - populate every `SearchHit.retrieval`: `method = Hybrid`, `lexical_score` / `lexical_rank` / `vector_score` / `vector_rank` from each retriever's hit (or `None` if absent), `fusion_score` = computed fused score.
+  - populate every `SearchHit.retrieval`: `method = Hybrid`, `lexical_score` / `lexical_rank` / `vector_score` / `vector_rank` from each retriever's hit (or `None` if absent), `fusion_score` = normalized fused score.
  - if a chunk appears in only one retriever, its `RetrievalDetail` still gets populated with `Some(...)` from that side and `None` for the other.
  - tie-break by `lexical_rank` ascending, then `chunk_id` ascending (deterministic).
 - `VectorRetriever`:
--- a/tasks/p3/p3-5-app-wiring.md
+++ b/tasks/p3/p3-5-app-wiring.md
@@ -165,6 +165,7 @@ All tests under `cargo test -p kb-app`. CLI smoke optional via `assert_cmd` if i
 ## Risks / notes

 - Cold-start cost: first `kb index` run downloads the fastembed model (~470MB) and warms the ONNX session. Surface via `tracing::info!` (already wired in P3-2).
+- **Post-merge fix (2026-05)**: original P3-5 implementation left every `kb-cli` subcommand calling the bare `kb_app::*` free functions, which silently re-loaded `Config::load(None)` (XDG default) and ignored the user's `--config <path>` flag. CLI now goes through `kb_app::*_with_config(cfg, ...)` everywhere. The `#[doc(hidden)] pub fn *_with_config` companions originally framed as "test seam" are now the official config-explicit API for CLI / TUI / integration tests. See [HOTFIXES.md](../HOTFIXES.md) for details.
 - The `App` lifecycle struct is internal but its construction is the natural seam for adding caching / connection pooling later. Keep it `pub(crate)` so future refactors don't break the CLI.
 - Mismatched `index_version` across stored records and the live retriever should fail loud at `App::open` (not at first search). Reuse the `tracing::warn!` from `HybridRetriever::new` (P3-4).
 - The fastembed adapter holds a `tokio::runtime` (P3-3); `App` must be constructed from a synchronous context. Document on `App::open`.
--- a/tasks/p4/p4-3-rag-pipeline.md
+++ b/tasks/p4/p4-3-rag-pipeline.md
@@ -179,6 +179,8 @@ All tests under `cargo test -p kb-rag` with no real Ollama (mock LM only).
 ## Risks / notes

 - Citation regex is STRICT `\[#(\d{1,3})\]` only. Models that emit `[1]`/`[ #1 ]`/`[foo]` are treated as no-marker → refusal. This is intentional: a noisy citation grammar lets prose `[1]` or `vec![1]` slip through as false positives, which corrupts both `grounded` and `kb eval` `citation_coverage`. The prompt template (`rag-v1`) explicitly instructs `[#번호]`.
+- **Post-merge fix (2026-05)**: kb-cli's `Cmd::Ask` arm originally called bare `kb_app::ask(query, opts)`, ignoring `--config <path>` and silently using XDG defaults (manifested as wrong model id / score_gate / data_dir surfacing in `Answer.retrieval`). Fixed by routing through `kb_app::ask_with_config(cfg, query, opts)`. See [HOTFIXES.md](../HOTFIXES.md).
+- **Post-merge fix (2026-05)**: `config.rag.score_gate` default `0.05` was incompatible with hybrid RRF `fusion_score` (raw range `(0, 2/(k_rrf+1)]`, i.e. at most `≈ 0.033` at the default k_rrf=60), so every hybrid `kb ask` was refused. Closed by normalizing RRF in p3-4 to `[0, 1]`. See p3-4 spec + HOTFIXES.md.
 - `stream_sink` channel: pipeline `send`s tokens; if the receiver is dropped (caller cancelled), `SendError` is silently swallowed and generation continues to completion (so the `Answer` row still gets persisted). Pipeline does NOT panic on a dead sink.
 - `temperature=0` does not fully eliminate stochasticity in some quantized Ollama models; document this and rely on `must_contain` rule-based metrics in P5 instead of exact match.
 - Prompt-injection defense lives entirely in the system prompt; do NOT mutate `[근거]` text. If chunk text contains `<|system|>` or similar tokens, do not strip them — they are inert when wrapped.
--- a/tasks/phase-3-vector-hybrid.md
+++ b/tasks/phase-3-vector-hybrid.md
@@ -80,12 +80,15 @@ pub enum SearchMode { Lexical, Vector, Hybrid }
 Hybrid score fusion (first pass): RRF (Reciprocal Rank Fusion).

 ```text
-score(chunk) = sum_over_methods( 1 / (k_rrf + rank_method(chunk)) )
+raw(chunk)   = sum_over_methods( 1 / (k_rrf + rank_method(chunk)) )
+fusion_score = raw / (num_retrievers / (k_rrf + 1))   # ∈ [0, 1]
 k_rrf defaults to 60.
 ```

 Reason: bm25 scores and cosine similarity have different absolute scales. RRF is rank-based, so it is stable.

+**Normalization (2026-05 hotfix)**: the raw RRF top score is bounded by `num_retrievers / (k_rrf+1)` (≈ 0.0328 at k_rrf=60), so it was incomparable with the `[0, 1]` lexical / vector scores and incompatible with the `config.rag.score_gate` default of 0.05 (every hybrid query hit a ScoreGate refusal). Dividing by `2/(k_rrf+1)` puts fusion_score in `[0, 1]` for every mode. See [HOTFIXES.md](HOTFIXES.md) for the full history.
+
 No reranker is introduced within P3 scope (noted for the P+ stage).

 ## kb-search structure