fix(rag): normalize RRF fusion_score to [0,1] + log post-merge hotfixes
## Bug
config.rag.score_gate default 0.05 was incompatible with hybrid RRF
fusion_score: raw RRF tops out at num_retrievers / (k_rrf + 1) ≈
0.0328 at the default k_rrf=60, so every hybrid `kb ask` tripped
ScoreGate refusal even when the top hit was perfectly aligned across
both retrievers. Symptomatic on the post-P4-3 manual smoke at
/tmp/kb-smoke/ pointed at 192.168.0.47 Ollama:
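$ kb ask "Rust ownership 모델의 핵심 규칙은 뭐야?" --mode hybrid
근거 부족. KB에 해당 내용 없음. # top fusion_score = 0.0164
(query: "What are the core rules of the Rust ownership model?";
reply: "Insufficient evidence. The KB has no matching content.")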
Per-mode score_gate (lexical_score_gate / vector_score_gate /
hybrid_score_gate) was rejected because it forces every consumer
(CLI, eval, TUI) to know which mode picks which threshold. Score
normalization solves it at the source.
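To make the arithmetic concrete, here is a minimal sketch of the raw bound
against the gate. The function names and the `>=` comparison are assumptions
for illustration; the real ScoreGate check may differ.

```rust
/// Largest raw RRF score a chunk can reach: rank 1 in every retriever.
fn raw_rrf_upper_bound(num_retrievers: u32, k_rrf: u32) -> f64 {
    f64::from(num_retrievers) / (f64::from(k_rrf) + 1.0)
}

/// Hypothetical gate check: a hit passes only if its score reaches the
/// threshold. (Assumed semantics, not copied from the kb crates.)
fn passes_score_gate(score: f64, score_gate: f64) -> bool {
    score >= score_gate
}
```

With `k_rrf = 60` the bound is `2/61`, so even a perfect hybrid hit stays
below a gate of 0.05, and the observed 0.0164 (rank 1 in one retriever
only, `1/61`) is further below still.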
## Fix
crates/kb-search/src/hybrid.rs divides every fused score by
2 / (k_rrf + 1), the theoretical RRF maximum with two retrievers
each contributing rank 1. After normalization:
- both retrievers agree on rank 1 → fusion_score = 1.0
- only one retriever finds the chunk → caps at 0.5 (reached at rank 1)
- found by both at lower ranks → falls between 0 and 1.0
  (e.g. ranks 1 and 2 → ≈ 0.9919)
RRF's rank-ordering invariants are preserved (every score divides
by the same positive constant), so sort + tiebreak behaviour is
unchanged. Wire schema label `fusion_score` keeps its slot in
RetrievalDetail; only the magnitude shifts, and only for hybrid
mode (lexical / vector were already in [0, 1]).
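A standalone sketch of the normalized score, for readers who want to check
the cases by hand. It mirrors the formula described here but is not the
code in hybrid.rs; the function name and the `Option<u32>` rank
representation are assumptions.

```rust
/// Normalized two-retriever RRF. Ranks are 1-based; `None` means the
/// retriever did not return the chunk and contributes 0.
fn normalized_rrf(lex_rank: Option<u32>, vec_rank: Option<u32>, k_rrf: u32) -> f64 {
    let k = f64::from(k_rrf);
    // Raw RRF: sum of 1 / (k_rrf + rank) over the retrievers that hit.
    let raw: f64 = [lex_rank, vec_rank]
        .iter()
        .flatten()
        .map(|&r| 1.0 / (k + f64::from(r)))
        .sum();
    // Divide by the theoretical maximum 2 / (k_rrf + 1).
    raw / (2.0 / (k + 1.0))
}
```

At `k_rrf = 60`, `normalized_rrf(Some(1), Some(1), 60)` gives 1.0,
`normalized_rrf(Some(1), None, 60)` gives 0.5, and
`normalized_rrf(Some(1), Some(2), 60)` gives `0.5 + 61/124`.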
Verification: re-ran the four-scenario smoke at /tmp/kb-smoke/ with
default score_gate = 0.05 — all four (Korean→Korean, English→
English, cross-language Korean↔English, out-of-corpus) succeed
with the expected grounded / refusal classification, top
fusion_score now ≈ 0.5.
## Tests
One unit test (rrf_formula_matches_known_value) updated to expect
the normalized value `(1/61 + 1/62) / (2/61) ≈ 0.9919` instead of
the raw `1/61 + 1/62 ≈ 0.0325`. The integration snapshot fixture
crates/kb-search/tests/fixtures/search/hybrid/run-1.json already
used presence checks (fusion_score_positive: true) rather than
absolute values, so it doesn't need regeneration. Workspace 319
tests pass; clippy clean across both feature configs.
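The rank-ordering claim could also be checked directly. The following is
a sketch of such a check, not a test that exists in the workspace; all
names are hypothetical.

```rust
/// Raw two-retriever RRF; `None` means the retriever missed the chunk.
fn raw_rrf(lex_rank: Option<u32>, vec_rank: Option<u32>, k_rrf: u32) -> f64 {
    let k = f64::from(k_rrf);
    [lex_rank, vec_rank]
        .iter()
        .flatten()
        .map(|&r| 1.0 / (k + f64::from(r)))
        .sum()
}

/// Indices sorted by score descending, ties broken by original index.
fn ranking(scores: &[f64]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    idx.sort_by(|&a, &b| {
        scores[b]
            .partial_cmp(&scores[a])
            .expect("scores are finite")
            .then(a.cmp(&b))
    });
    idx
}

/// True when dividing every score by the RRF maximum leaves the ranking
/// of the given (lexical_rank, vector_rank) pairs unchanged.
fn normalization_preserves_order(ranks: &[(Option<u32>, Option<u32>)], k_rrf: u32) -> bool {
    let normalizer = 2.0 / (f64::from(k_rrf) + 1.0);
    let raw: Vec<f64> = ranks.iter().map(|&(l, v)| raw_rrf(l, v, k_rrf)).collect();
    let normalized: Vec<f64> = raw.iter().map(|s| s / normalizer).collect();
    ranking(&raw) == ranking(&normalized)
}
```

The index tiebreak in `ranking` is a stand-in for the real tiebreak; the
point is only that both score vectors sort identically.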
## Docs
This commit also adds tasks/HOTFIXES.md as a dated post-merge log
covering this fix and the two earlier --config-flag regressions
(P3-5 hotfix #20 across ingest/search/list/inspect/doctor; P4-3
follow-up #24 for kb ask). Original task specs in tasks/p<N>/
*.md stay frozen as the historical contract; HOTFIXES.md is the
live source of truth for post-merge deltas. Each affected task
spec gets a "Risks/notes" addendum pointing back to HOTFIXES.md
so a reader landing on the spec sees the active behaviour:
- tasks/INDEX.md gains a "Post-merge 핫픽스" ("Post-merge hotfixes") section.
- tasks/phase-3-vector-hybrid.md updates the RRF formula text to
show the normalized form.
- tasks/p3/p3-4-hybrid-fusion.md "Behavior contract" RRF bullet
notes the normalization and reason.
- tasks/p3/p3-5-app-wiring.md "Risks/notes" notes the --config
fix.
- tasks/p4/p4-3-rag-pipeline.md "Risks/notes" notes the kb-ask
--config fix and the score_gate-RRF incompatibility (closed by
the normalization in p3-4).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -191,8 +191,32 @@ impl HybridRetriever {
         }

         // Compute fused score per chunk.
+        //
+        // Raw RRF: `Σ 1/(k_rrf + rank_m(c))` over the retrievers a chunk
+        // appears in. With two retrievers the raw upper bound is
+        // `2/(k_rrf + 1)` — at k_rrf=60 that's only ≈0.0328, which makes
+        // a single `config.rag.score_gate` default of 0.05 silently
+        // refuse every hybrid query (and is incomparable with lexical /
+        // vector `fusion_score` already in [0, 1]).
+        //
+        // Normalize by the theoretical max so `fusion_score` lives in
+        // [0, 1] across all three SearchModes. The normalization factor
+        // is `num_retrievers / (k_rrf + 1)`. With both retrievers
+        // contributing rank=1 the normalized score is exactly 1.0;
+        // chunks present in only one retriever cap at ≈0.5 (≈ 1 / 2);
+        // all other rank combinations fall in between. RRF's rank-
+        // ordering invariants are preserved (we divide every score by
+        // the same positive constant), so the sort + tiebreak path is
+        // unchanged. Wire schema label `fusion_score` keeps its slot in
+        // `RetrievalDetail`; only the magnitude shifts.
         let FusionPolicy::Rrf { k_rrf } = self.fusion;
         let k_rrf_f = f64::from(k_rrf);
+        // Both retrievers can contribute, so the per-mode RRF max is
+        // 2 / (k_rrf + 1). Even when a chunk lands in only one mode, we
+        // still divide by this same constant — the score then caps
+        // around 0.5 which is exactly the "half-aligned" semantic we
+        // want users to compare against `score_gate`.
+        let rrf_normalizer = 2.0_f64 / (k_rrf_f + 1.0);

         struct Scored {
             chunk_id: String,
@@ -212,6 +236,7 @@ impl HybridRetriever {
                 if let Some(r) = vec_rank {
                     rrf += 1.0 / (k_rrf_f + f64::from(r));
                 }
+                rrf /= rrf_normalizer;
                 Scored {
                     chunk_id: cid,
                     rrf,
@@ -462,8 +487,12 @@ mod tests {
     #[test]
     fn rrf_formula_matches_known_value() {
         // chunk A appears at lexical rank 1, vector rank 2; k_rrf=60.
-        // Expected: 1/(60+1) + 1/(60+2) = 1/61 + 1/62.
-        let expected = 1.0_f64 / 61.0 + 1.0_f64 / 62.0;
+        // Raw RRF: 1/(60+1) + 1/(60+2) = 1/61 + 1/62.
+        // After normalization by `2 / (60 + 1)` (theoretical max with
+        // both retrievers contributing rank=1), the score lives in
+        // [0, 1]: `(1/61 + 1/62) / (2/61) = 0.5 + 61/124 ≈ 0.9919`.
+        let raw = 1.0_f64 / 61.0 + 1.0_f64 / 62.0;
+        let expected = raw / (2.0_f64 / 61.0);
         let lex = Arc::new(CannedRetriever::new(
             vec![mk_hit("aaaa", 1, SearchMode::Lexical, 0.5)],
             "lex-v1",
71
tasks/HOTFIXES.md
Normal file
@@ -0,0 +1,71 @@
+---
+title: "Post-merge hotfixes log"
+date: 2026-05-01
+---
+
+# Post-merge hotfixes log
+
+Bugs discovered AFTER a phase task was merged, and the small follow-up
+PRs that close them. Each entry: what broke, how it surfaced, what the
+fix touched, and which task spec it amends.
+
+The original task specs in `tasks/p<N>/p<N>-<M>-*.md` stay frozen as the
+historical contract that was implemented; this file accumulates the
+deltas so phase 5+ readers can find the live behavior without diffing
+git history.
+
+## 2026-05-01 — `--config` flag silently ignored across all kb-cli subcommands
+
+**Discovered**: post-P3-5 manual smoke at `/tmp/kb-smoke/`.
+
+**Symptom**: `kb --config /path/to/config.toml ingest|search|list|inspect|doctor` ignored the flag and fell back to `~/.config/kb/config.toml` (XDG default). Users had to use `KB_*` env vars to point at a non-default config.
+
+**Root cause**: `kb-cli` read `cli.config` only inside `Cmd::Ingest` to build `SourceScope`, then called bare `kb_app::ingest(scope, summary_only)` which internally re-loaded `Config::load(None)` (XDG path). Same pattern in `Cmd::Search` / `List` / `Inspect` / `Doctor`. P3-5 introduced `*_with_config` test seams via `#[doc(hidden)] pub fn` but kb-cli never used them.
+
+**Fix** (PR #20, fix/cli-config-flag-and-search-output):
+- `kb-cli` now builds the Config once via `Config::load(cli.config.as_deref())` at the top of every subcommand and threads it into `kb_app::*_with_config(cfg, ...)` instead of `kb_app::*(...)`.
+- `kb_app::doctor()` rewritten as `doctor_with_config_path(Option<&Path>)` that reports the actual path probed and hard-fails when `--config <path>` doesn't exist (defaults would otherwise mask user intent).
+- `kb-app` module doc-comment updated: `#[doc(hidden)] pub fn *_with_config` is no longer "test-only seam" — it's the official "config-explicit" API consumed by CLI `--config`, integration tests, and TUI sessions.
+- Same PR also improved `kb search` printer: `{:.4}` score formatting (RRF range collapses on `{:.2}`) and `> heading_path` suffix so chunks from the same document are visually distinct.
+
+**Amends**: tasks/p3/p3-5-app-wiring.md (the test seam was always meant to be the config-explicit API; only the doc-comment lied).
+
+### 2026-05-01 — `--config` regression in `kb ask` (P4-3 follow-up)
+
+**Discovered**: post-P4-3 manual smoke against 192.168.0.47 Ollama with `gemma4:26b`.
+
+**Symptom**: `kb --config <path> ask` returned `model.id = qwen2.5:14b-instruct` (XDG default model) and `score_gate = 0.30` (XDG default), instead of `gemma4:26b` / `0.05` from the explicit config. P4-3 added the ask body but kb-cli's `Cmd::Ask` arm still called bare `kb_app::ask(query, opts)` — same regression class as the P3-5 fix above, just missed when ask was wired.
+
+**Fix** (PR #24, fix/cli-ask-honor-config-flag):
+- `kb-cli` builds `Config::load(cli.config.as_deref())` once at the top of `Cmd::Ask` and calls `kb_app::ask_with_config(cfg, query, opts)`.
+
+**Amends**: tasks/p4/p4-3-rag-pipeline.md.
+
+## 2026-05-01 — RRF `fusion_score` incompatible with `config.rag.score_gate` default
+
+**Discovered**: post-P4-3 manual smoke. Top hybrid result returned `fusion_score = 0.0164` against `score_gate = 0.05` → ScoreGate refusal on every hybrid query.
+
+**Root cause**: RRF formula `score(c) = Σ 1/(k_rrf + rank_m(c))` produces values bounded by `num_retrievers / (k_rrf + 1)`. With `num_retrievers = 2` and the default `k_rrf = 60`, the upper bound is `2/61 ≈ 0.0328`. The default `config.rag.score_gate = 0.05` was calibrated for vector / lexical scores already in `[0, 1]` and silently refused every hybrid query. `fusion_score` was also incomparable across modes — Lexical / Vector lived in `[0, 1]`, Hybrid lived in `(0, 0.033]`.
+
+**Fix** (PR #25, fix/rrf-fusion-score-normalize-and-docs):
+- `crates/kb-search/src/hybrid.rs` divides every raw RRF score by `2 / (k_rrf + 1)` so `fusion_score` always lives in `[0, 1]` regardless of mode. Both retrievers contributing rank 1 normalises to `1.0`; chunks present in only one retriever cap around `0.5`. RRF's rank-ordering invariants are preserved (same constant divides every score), so sort + tiebreak behaviour is identical.
+- One unit test (`rrf_formula_matches_known_value`) updated to expect the normalised value `(1/61 + 1/62) / (2/61) ≈ 0.9919`.
+- The integration snapshot `crates/kb-search/tests/fixtures/search/hybrid/run-1.json` already used presence checks (`fusion_score_positive: true`) rather than absolute values, so it didn't need regeneration.
+
+**Why not a per-mode `score_gate` config**: separate `lexical_score_gate / vector_score_gate / hybrid_score_gate` would force every downstream consumer (CLI, eval, TUI) to know which mode picks which threshold. Normalising the score itself is a one-line change at the source and makes `Answer.retrieval.score_gate` semantically meaningful without per-mode bookkeeping.
+
+**Amends**: tasks/p3/p3-4-hybrid-fusion.md (RRF formula now divides by `2/(k_rrf+1)` after summation), tasks/phase-3-vector-hybrid.md (RRF section).
+
+**Verification**: post-fix smoke at `/tmp/kb-smoke/` with default `score_gate = 0.05` succeeded across four scenarios — Korean→Korean, English→English, cross-language, and out-of-corpus refusal.
+
+## How to add an entry
+
+Each fix gets a dated subsection with five fields:
+
+- **Discovered**: when / how the bug surfaced (smoke, integration test, user report).
+- **Symptom**: what the user saw / what was wrong.
+- **Root cause**: the actual code or design issue.
+- **Fix**: PR number / branch + a one-paragraph summary of the change.
+- **Amends**: which `tasks/p<N>/...` spec docs the fix retroactively contradicts. Spec text stays frozen; this log is the live source of truth for post-merge deltas.
+
+If a fix is large enough that the original spec is no longer a useful reference, promote the entry into a new task spec (e.g., `p<N>-<M+1>-<topic>.md`) and link from here.
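The two `--config` entries in HOTFIXES.md share one pattern: load the config once from the flag and pass it down explicitly. A minimal sketch of that shape follows. Only the `Config::load(cli.config.as_deref())` call and the `ask` / `ask_with_config` names come from the log; the `Config` and `Cli` structs, the stub loader, `run_ask`, and all signatures are stand-ins invented for illustration.

```rust
use std::path::{Path, PathBuf};

/// Stand-in config; the real struct presumably carries model id, score_gate, etc.
#[derive(Debug, Clone, PartialEq)]
struct Config {
    source: String,
}

impl Config {
    /// Stub loader (assumption): records where the config would come from
    /// instead of parsing a file.
    fn load(path: Option<&Path>) -> Config {
        match path {
            Some(p) => Config { source: p.display().to_string() },
            None => Config { source: "<XDG default>".to_string() },
        }
    }
}

/// Hypothetical parsed CLI with an optional `--config <path>`.
struct Cli {
    config: Option<PathBuf>,
}

/// Config-explicit entry point: uses exactly the config it is given.
fn ask_with_config(cfg: &Config, query: &str) -> String {
    format!("{} [config: {}]", query, cfg.source)
}

/// Bare entry point, the shape behind the regression: it reloads the
/// default config and never sees the flag.
fn ask(query: &str) -> String {
    ask_with_config(&Config::load(None), query)
}

/// Fixed CLI arm: build the config once from the flag, then thread it through.
fn run_ask(cli: &Cli, query: &str) -> String {
    let cfg = Config::load(cli.config.as_deref());
    ask_with_config(&cfg, query)
}
```

With `--config` set, `run_ask` reports the explicit path while the bare `ask` still reports the default, which is the symptom both entries describe.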
@@ -82,6 +82,10 @@ P0~P5 are serial. P6~P9 can run in parallel after P5.
 - [p9-4 tui-inspect](p9/p9-4-tui-inspect.md)
 - [p9-5 desktop-tauri](p9/p9-5-desktop-tauri.md)

+## Post-merge hotfixes
+
+Bugs found after merge and their follow-up PRs are recorded as a dated log in [HOTFIXES.md](HOTFIXES.md). The original task specs stay frozen, and HOTFIXES.md is treated as the source of truth for post-merge behavior changes.
+
 ## Conventions common to all tasks

 - Do not violate dependency boundaries (`Allowed` / `Forbidden`). See report §19.
@@ -93,9 +93,10 @@ impl kb_core::Retriever for VectorRetriever { /* per §7.2 */ }
 - `SearchMode::Vector` dispatches solely to `vector`. `RetrievalDetail.method = Vector`, `lexical_*` fields are `None`.
 - `SearchMode::Hybrid`:
   - run `lexical.search(query)` and `vector.search(query)` in sequence (fan-out is fine; not required).
-  - fuse with RRF: `score(c) = Σ_{m ∈ {lex, vec}} 1 / (k_rrf + rank_m(c))` where `k_rrf` from config (default 60). `rank_m` is 1-based; chunks not appearing in retriever `m` contribute 0.
+  - fuse with RRF: `raw(c) = Σ_{m ∈ {lex, vec}} 1 / (k_rrf + rank_m(c))` where `k_rrf` from config (default 60). `rank_m` is 1-based; chunks not appearing in retriever `m` contribute 0.
+  - **normalize fusion_score to [0, 1]** (post-merge fix, 2026-05): divide by `num_retrievers / (k_rrf + 1)` so the top-1-everywhere case maps to `1.0` and single-retriever chunks cap around `0.5`. Without this, raw RRF tops out at `≈ 0.033` and is incomparable with the `[0, 1]` lexical / vector `fusion_score` (and incompatible with the `config.rag.score_gate` default `0.05` — every hybrid query refused). RRF's rank ordering is preserved (we divide every score by the same positive constant). See [HOTFIXES.md](../HOTFIXES.md).
   - sort by fused score DESC, take top `query.k`.
-  - populate every `SearchHit.retrieval`: `method = Hybrid`, `lexical_score` / `lexical_rank` / `vector_score` / `vector_rank` from each retriever's hit (or `None` if absent), `fusion_score` = computed fused score.
+  - populate every `SearchHit.retrieval`: `method = Hybrid`, `lexical_score` / `lexical_rank` / `vector_score` / `vector_rank` from each retriever's hit (or `None` if absent), `fusion_score` = normalized fused score.
   - if a chunk appears in only one retriever, its `RetrievalDetail` still gets populated with `Some(...)` from that side and `None` for the other.
   - tie-break by `lexical_rank` ascending, then `chunk_id` ascending (deterministic).
 - `VectorRetriever`:
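The sort and tie-break bullets in the hybrid contract can be sketched as a single comparator. This is an illustration under assumptions, not the code in hybrid.rs: the `Fused` struct and helper names are hypothetical, and the spec does not say how a missing `lexical_rank` orders, so the sketch places it after every present rank.

```rust
use std::cmp::Ordering;

/// Hypothetical fused hit carrying just the fields the ordering needs.
#[derive(Debug, Clone, PartialEq)]
struct Fused {
    chunk_id: String,
    fusion_score: f64,
    lexical_rank: Option<u32>,
}

fn hit(chunk_id: &str, fusion_score: f64, lexical_rank: Option<u32>) -> Fused {
    Fused { chunk_id: chunk_id.to_string(), fusion_score, lexical_rank }
}

/// Assumption: a chunk missing from the lexical retriever sorts after any
/// chunk that has a lexical rank.
fn lexical_key(h: &Fused) -> u32 {
    h.lexical_rank.unwrap_or(u32::MAX)
}

/// Sort by fused score descending, then lexical_rank ascending, then
/// chunk_id ascending, and keep the top `k`.
fn sort_and_truncate(mut hits: Vec<Fused>, k: usize) -> Vec<Fused> {
    hits.sort_by(|a, b| {
        b.fusion_score
            .partial_cmp(&a.fusion_score)
            .unwrap_or(Ordering::Equal)
            .then_with(|| lexical_key(a).cmp(&lexical_key(b)))
            .then_with(|| a.chunk_id.cmp(&b.chunk_id))
    });
    hits.truncate(k);
    hits
}
```

Because every key in the chain is total over finite scores, two runs over the same hits produce the same order, which is the determinism the contract asks for.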
@@ -165,6 +165,7 @@ All tests under `cargo test -p kb-app`. CLI smoke optional via `assert_cmd` if i
 ## Risks / notes

 - Cold-start cost: first `kb index` run downloads the fastembed model (~470MB) and warms the ONNX session. Surface via `tracing::info!` (already wired in P3-2).
+- **Post-merge fix (2026-05)**: original P3-5 implementation left every `kb-cli` subcommand calling the bare `kb_app::*` free functions, which silently re-loaded `Config::load(None)` (XDG default) and ignored the user's `--config <path>` flag. CLI now goes through `kb_app::*_with_config(cfg, ...)` everywhere. The `#[doc(hidden)] pub fn *_with_config` companions originally framed as "test seam" are now the official config-explicit API for CLI / TUI / integration tests. See [HOTFIXES.md](../HOTFIXES.md) for details.
 - The `App` lifecycle struct is internal but its construction is the natural seam for adding caching / connection pooling later. Keep it `pub(crate)` so future refactors don't break the CLI.
 - Mismatched `index_version` across stored records and the live retriever should fail loud at `App::open` (not at first search). Reuse the `tracing::warn!` from `HybridRetriever::new` (P3-4).
 - The fastembed adapter holds a `tokio::runtime` (P3-3); `App` must be constructed from a synchronous context. Document on `App::open`.
@@ -179,6 +179,8 @@ All tests under `cargo test -p kb-rag` with no real Ollama (mock LM only).
 ## Risks / notes

 - Citation regex is STRICT `\[#(\d{1,3})\]` only. Models that emit `[1]`/`[ #1 ]`/`[foo]` are treated as no-marker → refusal. This is intentional: a noisy citation grammar lets prose `[1]` or `vec![1]` slip through as false positives, which corrupts both `grounded` and `kb eval` `citation_coverage`. The prompt template (`rag-v1`) explicitly instructs `[#번호]`.
+- **Post-merge fix (2026-05)**: kb-cli's `Cmd::Ask` arm originally called bare `kb_app::ask(query, opts)`, ignoring `--config <path>` and silently using XDG defaults (manifested as wrong model id / score_gate / data_dir surfacing in `Answer.retrieval`). Fixed by routing through `kb_app::ask_with_config(cfg, query, opts)`. See [HOTFIXES.md](../HOTFIXES.md).
+- **Post-merge fix (2026-05)**: `config.rag.score_gate` default `0.05` was incompatible with hybrid RRF `fusion_score` (raw range `(0, 2/(k_rrf+1)]`, i.e. `≤ 0.033` at default k_rrf=60), refusing every hybrid `kb ask`. Closed by normalizing RRF in p3-4 to `[0, 1]`. See p3-4 spec + HOTFIXES.md.
 - `stream_sink` channel: pipeline `send`s tokens; if the receiver is dropped (caller cancelled), `SendError` is silently swallowed and generation continues to completion (so the `Answer` row still gets persisted). Pipeline does NOT panic on a dead sink.
 - `temperature=0` does not fully eliminate stochasticity in some quantized Ollama models; document this and rely on `must_contain` rule-based metrics in P5 instead of exact match.
 - Prompt-injection defense lives entirely in the system prompt; do NOT mutate `[근거]` text. If chunk text contains `<|system|>` or similar tokens, do not strip them — they are inert when wrapped.
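The `stream_sink` note describes a send that must never abort generation. A small sketch of that behaviour, assuming a `std::sync::mpsc` channel (the spec does not say which channel type the pipeline uses) and a hypothetical `generate` function:

```rust
use std::sync::mpsc;

/// Accumulates the full answer while optionally streaming each token.
/// A failed send (receiver dropped) is ignored so generation still
/// runs to completion.
fn generate(tokens: &[&str], sink: Option<&mpsc::Sender<String>>) -> String {
    let mut answer = String::new();
    for tok in tokens {
        answer.push_str(tok);
        if let Some(tx) = sink {
            // `send` returns Err(SendError) once the receiver is gone;
            // discard it rather than unwrap, so a dead sink cannot panic.
            let _ = tx.send((*tok).to_string());
        }
    }
    answer
}
```

Dropping the receiver before calling `generate` still yields the complete answer, which is what lets the `Answer` row be persisted after a cancelled stream.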
@@ -80,12 +80,15 @@ pub enum SearchMode { Lexical, Vector, Hybrid }
 Hybrid score fusion (first pass): RRF (Reciprocal Rank Fusion).

 ```text
-score(chunk) = sum_over_methods( 1 / (k_rrf + rank_method(chunk)) )
+raw(chunk) = sum_over_methods( 1 / (k_rrf + rank_method(chunk)) )
+fusion_score = raw / (num_retrievers / (k_rrf + 1)) # ∈ [0, 1]
 k_rrf defaults to 60.
 ```

 Rationale: bm25 scores and cosine similarity have different absolute scales. RRF is rank-based, so it is stable.

+**Normalization (2026-05 hotfix)**: the raw RRF top score is bounded by `num_retrievers / (k_rrf+1)` (≈ 0.0328 at k_rrf=60), so it was incomparable with the `[0, 1]` scores of lexical / vector and also incompatible with the `config.rag.score_gate` default of 0.05 (every hybrid query hit a ScoreGate refusal). Dividing by `2/(k_rrf+1)` aligns fusion_score to `[0, 1]` in every mode. See [HOTFIXES.md](HOTFIXES.md) for the full history.
+
 No reranker is introduced within the P3 scope (noted for the P+ stage).

 ## kb-search structure