fix(rag): normalize RRF fusion_score to [0,1] + add post-merge hotfix log #25
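## Change summary

Closes the RRF score_gate bug found during a manual smoke test after the P4-3 merge, and collects the post-merge hotfixes found so far into a dated log at `tasks/HOTFIXES.md`.

## Bug

The `config.rag.score_gate` default of `0.05` is incompatible with the range of the hybrid RRF `fusion_score`. The raw RRF top score is bounded by `num_retrievers / (k_rrf + 1)`, which is `≈ 0.0328` at the default `k_rrf=60`, so even a perfectly aligned chunk that takes rank 1 in both retrievers falls into a ScoreGate refusal.

Smoke reproduction: `kb ask "Rust ownership 모델의 핵심 규칙은 뭐야?" --mode hybrid` returned the refusal `근거 부족. KB에 해당 내용 없음.` with a top `fusion_score` of 0.0164.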
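To make the mismatch concrete, here is a minimal Rust sketch of the bound described above. The function names `raw_rrf_ceiling` and `passes_gate` are hypothetical, and the `>=` comparison is an assumption about how ScoreGate compares scores; neither is taken from the kb crates.

```rust
/// Upper bound of a raw RRF score: every retriever ranks the chunk first.
/// `num_retrievers` and `k_rrf` follow the names used in the description;
/// the function itself is hypothetical.
fn raw_rrf_ceiling(num_retrievers: u32, k_rrf: u32) -> f64 {
    f64::from(num_retrievers) / (f64::from(k_rrf) + 1.0)
}

/// Hypothetical stand-in for the ScoreGate check.
/// Assumption: a result passes when its top score reaches the gate.
fn passes_gate(top_score: f64, score_gate: f64) -> bool {
    top_score >= score_gate
}
```

With `num_retrievers = 2` and `k_rrf = 60`, `raw_rrf_ceiling` gives about 0.0328, so `passes_gate(raw_rrf_ceiling(2, 60), 0.05)` is false no matter how well the two retrievers agree.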
It was also a semantic problem that `fusion_score` lived in a different range per mode: Lexical / Vector in `[0, 1]`, Hybrid in `(0, 0.033]`. Any wire schema reader had to carry per-mode branching logic.

## Fix decision
Options:

- Per-mode `score_gate` (`lexical_score_gate` / `vector_score_gate` / `hybrid_score_gate`): every consumer (CLI, eval, TUI) would have to know a different threshold for each mode. Rejected.

## Implementation
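`crates/kb-search/src/hybrid.rs` normalizes the raw RRF score by dividing it by `2 / (k_rrf + 1)`, the theoretical maximum reached when both retrievers place the chunk at rank 1.

Meaning after normalization:

- both retrievers agree on rank 1 → `fusion_score = 1.0`
- only one retriever finds the chunk → caps at 0.5

RRF's rank-ordering invariant is preserved, because every score is divided by the same positive constant. Sort + tiebreak behavior is byte-identical. The wire schema label `fusion_score` keeps its slot; only the magnitude shifts.

## Verification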
Smoke re-run (with the default `score_gate = 0.05` restored):

- `[#2]` rust/ownership.md
- `[#1]` arch/rag-architecture.md
- `[#1]` arch/embeddings.md

Workspace: 319 tests pass. clippy clean.
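The normalization itself can be sketched in a few lines of Rust. This is an illustration under stated assumptions, not the code in `hybrid.rs`: the function names and the `Option<u32>` rank representation are hypothetical, and ranks are taken to be 1-based.

```rust
/// Raw RRF: sum of 1 / (k_rrf + rank) over the retrievers that returned
/// the chunk. `None` means the retriever did not return it (contributes 0).
fn raw_rrf(ranks: &[Option<u32>], k_rrf: u32) -> f64 {
    ranks
        .iter()
        .flatten()
        .map(|&rank| 1.0 / (f64::from(k_rrf) + f64::from(rank)))
        .sum()
}

/// Normalized score for the two-retriever hybrid case: divide the raw
/// score by 2 / (k_rrf + 1), the raw value when both ranks are 1.
fn normalized_rrf(lexical_rank: Option<u32>, vector_rank: Option<u32>, k_rrf: u32) -> f64 {
    let max_raw = 2.0 / (f64::from(k_rrf) + 1.0);
    raw_rrf(&[lexical_rank, vector_rank], k_rrf) / max_raw
}
```

Under these assumptions, `normalized_rrf(Some(1), Some(1), 60)` is 1.0, `normalized_rrf(Some(1), None, 60)` is 0.5, and `normalized_rrf(Some(1), Some(2), 60)` is about 0.9919, which is the value the updated unit test expects.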
## Test changes

- `rrf_formula_matches_known_value` (unit test) now expects the normalized `(1/61 + 1/62) / (2/61) ≈ 0.9919` instead of `1/61 + 1/62 ≈ 0.0325`.
- The integration snapshot fixture (`crates/kb-search/tests/fixtures/search/hybrid/run-1.json`) relies on a presence check (`fusion_score_positive: true`) rather than absolute values, so no regeneration is needed.

## Docs update
`tasks/HOTFIXES.md` is new: a dated log of post-merge hotfixes. It registers the following three fixes:

- `--config` flag silently ignored across all kb-cli subcommands (P3-5 follow-up, PR #20).
- `--config` regression in `kb ask` (P4-3 follow-up, PR #24).
- `fusion_score` incompatible with `config.rag.score_gate` default (this PR).

Each entry has five fields: `Discovered / Symptom / Root cause / Fix / Amends`. The original task specs stay frozen, and HOTFIXES.md operates as the live source of truth.

A "Risks/notes" addendum (linking to HOTFIXES.md) was added to each affected task spec:

- `tasks/INDEX.md`: new "Post-merge 핫픽스" section.
- `tasks/phase-3-vector-hybrid.md`: RRF formula updated to the normalized form.
- `tasks/p3/p3-4-hybrid-fusion.md`: Behavior contract RRF item updated.
- `tasks/p3/p3-5-app-wiring.md`: `--config` hotfix note.
- `tasks/p4/p4-3-rag-pipeline.md`: `--config` hotfix and score_gate bug notes.

## Changed files
- `crates/kb-search/src/hybrid.rs` (normalization + test update)
- `tasks/HOTFIXES.md` (new)
- `tasks/INDEX.md`, `tasks/phase-3-vector-hybrid.md`, `tasks/p3/p3-4-hybrid-fusion.md`, `tasks/p3/p3-5-app-wiring.md`, `tasks/p4/p4-3-rag-pipeline.md` (addendum)

## Follow-up
P5-1 can now start. When the eval task measures grounded/refusal classification and citation_coverage, it can use the normalized `[0, 1]` `fusion_score` as its baseline.

## Bug

config.rag.score_gate default 0.05 was incompatible with hybrid RRF fusion_score: raw RRF tops out at num_retrievers / (k_rrf + 1) ≈ 0.0328 at the default k_rrf=60, so every hybrid `kb ask` tripped ScoreGate refusal even when the top hit was perfectly aligned across both retrievers.

Symptomatic on the post-P4-3 manual smoke at /tmp/kb-smoke/ pointed at 192.168.0.47 Ollama:

    $ kb ask "Rust ownership 모델의 핵심 규칙은 뭐야?" --mode hybrid
    근거 부족. KB에 해당 내용 없음.
    # top fusion_score = 0.0164

Per-mode score_gate (lexical_score_gate / vector_score_gate / hybrid_score_gate) was rejected because it forces every consumer (CLI, eval, TUI) to know which mode picks which threshold. Score normalization solves it at the source.

## Fix

crates/kb-search/src/hybrid.rs divides every fused score by 2 / (k_rrf + 1), the theoretical RRF maximum with two retrievers each contributing rank 1. After normalization:

- both retrievers agree on rank 1 → fusion_score = 1.0
- only one retriever finds the chunk → caps at 0.5
- mixed ranks across both retrievers → fall between 0 and 1 (ranks 1 and 2 give ≈ 0.9919)

RRF's rank-ordering invariants are preserved (every score divides by the same positive constant), so sort + tiebreak behaviour is unchanged. Wire schema label `fusion_score` keeps its slot in RetrievalDetail; only the magnitude shifts, and only for hybrid mode (lexical / vector were already in [0, 1]).

Verification: re-ran the four-scenario smoke at /tmp/kb-smoke/ with default score_gate = 0.05. All four (Korean→Korean, English→English, cross-language Korean↔English, out-of-corpus) succeed with the expected grounded / refusal classification, and the top fusion_score is now ≈ 0.5.

## Tests

One unit test (rrf_formula_matches_known_value) updated to expect the normalized value `(1/61 + 1/62) / (2/61) ≈ 0.9919` instead of the raw `1/61 + 1/62 ≈ 0.0325`. The integration snapshot fixture crates/kb-search/tests/fixtures/search/hybrid/run-1.json already used presence checks (fusion_score_positive: true) rather than absolute values, so it doesn't need regeneration.

Workspace 319 tests pass; clippy clean across both feature configs.

## Docs

This commit also adds tasks/HOTFIXES.md as a dated post-merge log covering this fix and the two earlier --config-flag regressions (P3-5 hotfix #20 across ingest/search/list/inspect/doctor; P4-3 follow-up #24 for kb ask). Original task specs in tasks/p<N>/*.md stay frozen as the historical contract; HOTFIXES.md is the live source of truth for post-merge deltas.

Each affected task spec gets a "Risks/notes" addendum pointing back to HOTFIXES.md so a reader landing on the spec sees the active behaviour:

- tasks/INDEX.md gains a "Post-merge 핫픽스" section.
- tasks/phase-3-vector-hybrid.md updates the RRF formula text to show the normalized form.
- tasks/p3/p3-4-hybrid-fusion.md "Behavior contract" RRF bullet notes the normalization and reason.
- tasks/p3/p3-5-app-wiring.md "Risks/notes" notes the --config fix.
- tasks/p4/p4-3-rag-pipeline.md "Risks/notes" notes the kb-ask --config fix and the score_gate-RRF incompatibility (closed by the normalization in p3-4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

RRF normalization + post-merge hotfix log code review. COMMENT only, because of the self-merge gate.
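This closes the ScoreGate-RRF compatibility bug, found after P4-3 in a hands-on workspace (192.168.0.47 Ollama + gemma4:26b), with source-level normalization. Key decisions:

- Every score is divided by `2/(k_rrf+1)`, so sort + tiebreak is byte-identical. The unit tests and the integration snapshot both pass (the snapshot is presence-check based, so it never depended on absolute values).
- `Answer.retrieval.score_gate` is a meaningful value again: scores are comparable over the `[0,1]` range, so it works properly as a user-facing config.

Verification: all four scenarios pass at `/tmp/kb-smoke/` with the default `score_gate = 0.05`: Korean, English, and cross-language queries are grounded, and the out-of-corpus query is refused.

Docs: `tasks/HOTFIXES.md` is new, a dated log of post-merge bugs. All three found so far (P3-5 hotfix `--config`, P4-3 hotfix `--config` in ask, RRF normalize) are registered.

Workspace: 319 passed / 0 failed. clippy clean.

The inline comments are all notes on decisions. Merging can proceed. Next up is P5-1 golden-fixture-runner.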
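The ordering claim can be checked with a small Rust sketch. The `(chunk_id, score)` tuples, the function names, and the ascending-id tiebreak are assumptions made for illustration; the review only states that a tiebreak exists, not which key it uses.

```rust
/// Sort hits by descending score, breaking ties by ascending chunk id
/// (the tiebreak key is an assumption), and return the ids in order.
fn ranked_ids(mut hits: Vec<(u32, f64)>) -> Vec<u32> {
    hits.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    hits.into_iter().map(|(id, _)| id).collect()
}

/// Divide every score by the same positive constant 2 / (k_rrf + 1).
fn normalize(hits: &[(u32, f64)], k_rrf: u32) -> Vec<(u32, f64)> {
    let max_raw = 2.0 / (f64::from(k_rrf) + 1.0);
    hits.iter().map(|&(id, score)| (id, score / max_raw)).collect()
}
```

Sorting the raw hits and sorting the normalized hits should yield the same id sequence, including for chunks whose raw scores tie, because the division does not change how any two scores compare.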
@@ -194,0 +195,4 @@
+// Raw RRF: `Σ 1/(k_rrf + rank_m(c))` over the retrievers a chunk
+// appears in. With two retrievers the raw upper bound is
+// `2/(k_rrf + 1)` — at k_rrf=60 that's only ≈0.0328, which makes
+// a single `config.rag.score_gate` default of 0.05 silently

The meaning of the normalization factor `2 / (k_rrf + 1)` is pinned down in a comment block. Everything is in-source: that `2/(k_rrf+1)` is the theoretical max when both retrievers place the chunk at rank 1, that a chunk found by only one retriever caps at ≈ 0.5, and that RRF rank ordering is preserved. When someone later asks "why 2/(k_rrf+1)?", the answer is immediately visible.

@@ -194,2 +212,4 @@
 let FusionPolicy::Rrf { k_rrf } = self.fusion;
 let k_rrf_f = f64::from(k_rrf);
+// Both retrievers can contribute, so the per-mode RRF max is
+// 2 / (k_rrf + 1). Even when a chunk lands in only one mode, we

The one-line change of dividing the RRF score by `2 / (k_rrf + 1)` solves two problems at once: (1) `fusion_score` was incomparable across modes (Lexical/Vector `[0,1]` vs Hybrid `(0, 0.033]`), and (2) the `score_gate` default of 0.05 always tripped in hybrid mode. Every score is divided by the same positive constant, so the rank-ordering invariant is preserved and sort + tiebreak behavior is byte-identical.

@@ -0,0 +1,71 @@
+---

New dated log of post-merge hotfixes. This fix is the third entry (after P3-5 hotfix #20 and P4-3 hotfix #24). Each entry uses a five-field format (Discovered / Symptom / Root cause / Fix / Amends). The original task specs are left frozen while HOTFIXES.md acts as the live source of truth. From P5 onward, a reader can learn the actual behavior without diffing git history.
@@ -83,2 +83,4 @@
 - [p9-5 desktop-tauri](p9/p9-5-desktop-tauri.md)
+## Post-merge 핫픽스

New "Post-merge 핫픽스" section at the top level of INDEX, so that a reader seeing the workspace for the first time can find HOTFIXES.md next to the task spec list. It codifies, at the INDEX level, the split between frozen specs and HOTFIXES.md as the live source of truth.
@@ -94,3 +94,3 @@
 - `SearchMode::Hybrid`:
   - run `lexical.search(query)` and `vector.search(query)` in sequence (fan-out is fine; not required).
-  - fuse with RRF: `score(c) = Σ_{m ∈ {lex, vec}} 1 / (k_rrf + rank_m(c))` where `k_rrf` from config (default 60). `rank_m` is 1-based; chunks not appearing in retriever `m` contribute 0.
+  - fuse with RRF: `raw(c) = Σ_{m ∈ {lex, vec}} 1 / (k_rrf + rank_m(c))` where `k_rrf` from config (default 60). `rank_m` is 1-based; chunks not appearing in retriever `m` contribute 0.

Explicitly adds the normalization step to the original spec's RRF formula: "divide by `num_retrievers / (k_rrf + 1)`", plus the reason (score consistency with lexical/vector, and compatibility with the score_gate default). It is cross-linked to HOTFIXES.md so the amendment history can be followed.

@@ -179,6 +179,8 @@ All tests under `cargo test -p kb-rag` with no real Ollama (mock LM only).
 ## Risks / notes
 - Citation regex is STRICT `\[#(\d{1,3})\]` only. Models that emit `[1]`/`[ #1 ]`/`[foo]` are treated as no-marker → refusal. This is intentional: a noisy citation grammar lets prose `[1]` or `vec![1]` slip through as false positives, which corrupts both `grounded` and `kb eval` `citation_coverage`. The prompt template (`rag-v1`) explicitly instructs `[#번호]`.
+- **Post-merge fix (2026-05)**: kb-cli's `Cmd::Ask` arm originally called bare `kb_app::ask(query, opts)`, ignoring `--config <path>` and silently using XDG defaults (manifested as wrong model id / score_gate / data_dir surfacing in `Answer.retrieval`). Fixed by routing through `kb_app::ask_with_config(cfg, query, opts)`. See [HOTFIXES.md](../HOTFIXES.md).

Both hotfixes are added as notes to the RAG pipeline spec: (a) the regression where the `--config` flag was ignored, and (b) the `score_gate` × RRF compatibility issue. The second was closed by the normalization in p3-4, but because RAG is the caller that uses score_gate, a cross-reference is left in the RAG spec as well. A future reader debugging a ScoreGate refusal can check both paths right away.
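The strict citation grammar quoted in the p4-3 spec hunk can be illustrated with a dependency-free Rust sketch. It hand-rolls the pattern `\[#(\d{1,3})\]` for ASCII digits only rather than calling a regex library, and the function name `citation_numbers` is hypothetical; it is not the kb-rag implementation.

```rust
/// Collect the numbers from strict `[#N]` markers, where N is 1 to 3
/// ASCII digits. Anything else (`[1]`, `[ #1 ]`, `[foo]`, `[#1234]`)
/// is ignored. Hypothetical helper for illustration only.
fn citation_numbers(text: &str) -> Vec<u32> {
    let bytes = text.as_bytes();
    let mut out: Vec<u32> = Vec::new();
    let mut i = 0;
    while i + 1 < bytes.len() {
        if bytes[i] == b'[' && bytes[i + 1] == b'#' {
            let start = i + 2;
            let mut end = start;
            // Consume at most three ASCII digits.
            while end < bytes.len() && end - start < 3 && bytes[end].is_ascii_digit() {
                end += 1;
            }
            // Require at least one digit and an immediate closing bracket.
            if end > start && end < bytes.len() && bytes[end] == b']' {
                // The slice holds only ASCII digits, so it is a valid
                // str range and parses as u32 without overflow.
                out.push(text[start..end].parse::<u32>().unwrap());
                i = end + 1;
                continue;
            }
        }
        i += 1;
    }
    out
}
```

A caller could treat an empty result as "no marker", which the spec maps to a refusal; that mapping is the spec's rule, while the way this sketch reports it is an assumption.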