From 505b3889fb1f75534e01d2ba294d8cf37f4fea13 Mon Sep 17 00:00:00 2001
From: altair823
Date: Tue, 26 May 2026 01:44:57 +0000
Subject: [PATCH] =?UTF-8?q?feat(rag):=20fb-41=20PR-9d=20=E2=80=94=20dogfoo?= =?UTF-8?q?d=20retest=20+=20HOTFIXES=20PR-9=20closure=20+=20docs/dogfood/v?= =?UTF-8?q?0.18.0/=20=EB=B3=B4=EC=A1=B4?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Verify that PR-9 really works: with PR-1 through PR-9c-2 merged, rerun the S7/S1/S3/S10 multi-hop cases against the `/build/cache/dogfood-v018/` corpus.

Key result: **the S7 hallucination root cause is confirmed fixed**.
- PR-8 baseline: `grounded=true, refusal_reason=null`, **answer = Adam gradient formula** (a silent hallucination unrelated to the caffeine question).
- PR-9 retest: `refusal_reason=nli_verification_failed, nli_score=0.0035` (graceful refusal; NLI measures entailment at 0.35%).

Full comparison (4 cases):
- S7 ✅ hallucination FIXED.
- S1 ✅ rejected in both runs; NLI is more deterministic (0.058).
- S3 ⚠ consistent fail (`nli_model_unavailable`, 313s), a *v0.18.1 follow-up* (kebab-nli fails on a specific input and no debug log is emitted, so diagnosis is hard).
- S10 ✅ rejected in both runs; NLI is more deterministic (0.0028).

- docs/dogfood/v0.18.0/SUMMARY.md (sanitized report) + s{1,3,7,10}-multihop-post-pr9.json (sample wire output, preserved in the repo).
- tasks/HOTFIXES.md fb-41 PR-9 entry: "planned" → "done (2026-05-26)" + inline comparison table + S3 follow-up subsection (v0.18.1 candidate).

RAM: 5-6 GB → 7-8 GB (ONNX session ~600 MB); safe on 16 GB.
Disk: NLI model cache 1.1 GB (XDG default or storage.model_dir).
Wire impact: none (only the PR-9c-1 schema change, plus preserved sample measurements).

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/dogfood/v0.18.0/SUMMARY.md               | 66 +++++++++++++++++++
 .../dogfood/v0.18.0/s1-multihop-post-pr9.json |  1 +
 .../v0.18.0/s10-multihop-post-pr9.json        |  1 +
 .../dogfood/v0.18.0/s3-multihop-post-pr9.json |  1 +
 .../dogfood/v0.18.0/s7-multihop-post-pr9.json |  1 +
 tasks/HOTFIXES.md                             | 42 +++++++++++-
 6 files changed, 110 insertions(+), 2 deletions(-)
 create mode 100644 docs/dogfood/v0.18.0/SUMMARY.md
 create mode 100644 docs/dogfood/v0.18.0/s1-multihop-post-pr9.json
 create mode 100644 docs/dogfood/v0.18.0/s10-multihop-post-pr9.json
 create mode 100644 docs/dogfood/v0.18.0/s3-multihop-post-pr9.json
 create mode 100644 docs/dogfood/v0.18.0/s7-multihop-post-pr9.json

diff --git a/docs/dogfood/v0.18.0/SUMMARY.md b/docs/dogfood/v0.18.0/SUMMARY.md
new file mode 100644
index 0000000..b2e4c01
--- /dev/null
+++ b/docs/dogfood/v0.18.0/SUMMARY.md
@@ -0,0 +1,66 @@
+# v0.18.0 dogfood retest (PR-9d closure)
+
+Results of the dogfood retest after PR-9c-2. It measures the effect of NLI-based post-synthesis verification on the S7 caffeine hallucination found after PR-1 through PR-8 were merged, where multi-hop synthesis emitted an Adam optimizer gradient formula, unrelated to the retrieved chunks, as its answer.
+
+- Environment: v0.18.0 candidate binary, Ollama gemma3:4b, fastembed multilingual-e5-small, mDeBERTa-v3-base-xnli-multilingual ONNX
+- Config: `[rag] nli_threshold = 0.5`, `score_gate = 0.3`
+- Corpus: `/build/cache/dogfood-v018/` (same as for PR-1 through PR-8)
+- Date: 2026-05-26
+
+## Result comparison
+
+| case | query | PR-8 baseline | PR-9 retest | verdict |
+|---|---|---|---|---|
+| **S7** | "What is the chemical formula of caffeine?" | `grounded=true, refusal_reason=null`, **answer = Adam gradient formula (hallucination)** | `refusal_reason=nli_verification_failed`, `nli_score=0.0035`, `nli_threshold=0.5` | ✅ **HALLUCINATION FIXED** |
+| S1 | "컴파일러 파이프라인 ... 출력 데이터 의존성" | `refusal_reason=llm_self_judge` | `refusal_reason=nli_verification_failed`, `nli_score=0.058` | ✅ rejected in both runs, NLI is more deterministic |
+| S3 | "Why does kebab combine multilingual-e5, LanceDB, and RRF together?" | `refusal_reason=llm_self_judge` | `refusal_reason=nli_model_unavailable`, latency 313s | ⚠ **consistent fail, follow-up needed** |
+| S10 | "Why did the dinosaurs go extinct?" (outside the KB) | `refusal_reason=llm_self_judge` | `refusal_reason=nli_verification_failed`, `nli_score=0.0028` | ✅ rejected in both runs, NLI is more deterministic |
+
+## S7 hallucination root cause confirmed fixed
+
+Through PR-8, multi-hop synthesis *silently emitted* answers that the chunks do not entail. This is the LLM-self-judge ceiling: the "self-check rule" in the synthesize prompt fails to catch cases like caffeine, where a single fact is simply absent. The step 8.5 NLI hook from PR-9c-2 measures entailment at 0.0035 (0.35%), which leads to a graceful refusal.
+
+PR-9's *deterministic external verifier* (mDeBERTa-v3 XNLI) overcomes the *probabilistic ceiling* of LLM-self-judge.
+
+## S3 nli_model_unavailable (follow-up)
+
+Only S3 fails with `nli_model_unavailable` (entailment measurement for S1/S7/S10 works normally). Possible causes:
+- mDeBERTa session inference panics, or is turned into an error, *for a specific input* (a `tokenizers::encode` failure, a `Session::run` shape-validation failure, and so on)
+- or an *eager session reload* races per invocation rather than per process
+- retrying with `KEBAB_LOG=info,kebab_rag=debug,kebab_nli=debug` emits no debug log (the env var name is ignored, or the tracing subscriber is not initialized), which makes diagnosis hard
+
+Out of scope for this closure. A follow-up entry is registered in `tasks/HOTFIXES.md` (separate from HOTFIX candidate #15; this is an *intermittent / input-dependent* kebab-nli issue).
+
+## Comparison measurements
+
+| metric | PR-8 baseline | PR-9 retest |
+|---|---|---|
+| S7 latency | ~158s | ~241s (added NLI inference + first-run model download) |
+| S1 latency | ~150s | ~224s |
+| S10 latency | ~80s (fast refusal) | ~79s |
+| RAM peak | ~5-6 GB (gemma3:4b) | ~7-8 GB (gemma3:4b + ONNX session ~600 MB) |
+| Disk (NLI model) | 0 | 1.1 GB (model 280 MB + tokenizer 16 MB + blobs/locks/snapshots overhead) |
+
+NLI inference itself takes ~10-50 ms per call (consistent with spec §2.1). Model load on the first call (~30-60s) and multi-hop synthesize latency dominate.
+
+## Sample wire outputs
+
+The four samples `s{1,3,7,10}-multihop-post-pr9.json` in this directory.
+
+Schema conformance:
+- Confirmed the new `verification: { nli_score, nli_threshold, nli_passed }` field in `answer.v1`.
+- Two new `refusal_reason` values: `"nli_verification_failed"` and `"nli_model_unavailable"`.
+- Backward-compatible for pre-v0.18 readers: the `verification` field is omitted via `skip_serializing_if` when it is `None` (the additive minor wire change from PR-9c-1).
+
+## NLI threshold tuning iteration trigger?
+
+Based on the current results, *none*:
+- All PASS cases (S7/S1/S10) score entailment < 0.1 on *clearly ungrounded* answers, so the 0.5 threshold is not *overly strict*.
+- Every case that the PR-8 baseline refused via `llm_self_judge` is still refused in the retest; the only changed outcome is S7, where the new refusal is correct (no false positives).
+- v0.18.1 candidate: diagnose the S3 issue, and tune the threshold if a happy-path case (a genuinely grounded answer) is measured being rejected as a false positive.
+
+## Limitations
+
+- No direct measurement of the happy path (a genuinely grounded answer that passes NLI): every retest took the refuse path. The dogfood corpus focuses on *negative / absent facts*, so happy-path samples are lacking. v0.18.1 candidate: extend the corpus.
+- gemma3:4b's synthesize quality is the baseline; a larger model (gemma4:e4b 8B Q4) may raise the happy-path probability. This is the point of the RAM recommendation in the release notes.
+- The S3 follow-up.
diff --git a/docs/dogfood/v0.18.0/s1-multihop-post-pr9.json b/docs/dogfood/v0.18.0/s1-multihop-post-pr9.json
new file mode 100644
index 0000000..a5ef4e9
--- /dev/null
+++ b/docs/dogfood/v0.18.0/s1-multihop-post-pr9.json
@@ -0,0 +1 @@
+{"answer":"근거 부족. 생성된 답변이 검색된 문서 내용에 충분히 entail 되지 않음.","citations":[],"created_at":"2026-05-26T01:20:48.044449683Z","embedding":{"dimensions":384,"id":"multilingual-e5-small","provider":"fastembed"},"grounded":false,"hops":[{"context_chunks_added":0,"forced_stop":false,"iter":0,"kind":"decompose","llm_call_ms":70908,"sub_queries":["컴파일러의 lexer가 parser에 전달하는 데이터는 무엇인가?","컴파일러의 parser가 type checker에 전달하는 데이터는 무엇인가?","lexer-parser 파이프라인의 출력 데이터는 type checker에 어떻게 사용되는가?","컴파일러 파이프라인에서 각 단계의 출력 데이터 형식은 무엇인가?","lexer-parser-typechecker 파이프라인의 데이터 의존성 흐름은 무엇인가?"]},{"context_chunks_added":15,"forced_stop":true,"iter":1,"kind":"decide","llm_call_ms":0,"sub_queries":[]}],"model":{"dimensions":null,"id":"gemma3:4b","provider":"ollama"},"prompt_template_version":"rag-multi-hop-v1","refusal_reason":"nli_verification_failed","retrieval":{"chunks_returned":0,"chunks_used":0,"k":10,"mode":"hybrid","score_gate":0.30000001192092896,"top_score":0.0,"trace_id":"ret_e5f08b72"},"schema_version":"answer.v1","usage":{"completion_tokens":0,"latency_ms":224087,"prompt_tokens":0},"verification":{"nli_passed":false,"nli_score":0.058334656059741974,"nli_threshold":0.5}}
diff --git a/docs/dogfood/v0.18.0/s10-multihop-post-pr9.json b/docs/dogfood/v0.18.0/s10-multihop-post-pr9.json
new file mode 100644
index 0000000..4ced183
--- /dev/null
+++ b/docs/dogfood/v0.18.0/s10-multihop-post-pr9.json
@@ -0,0 +1 @@
+{"answer":"근거 부족. 생성된 답변이 검색된 문서 내용에 충분히 entail 되지 않음.","citations":[],"created_at":"2026-05-26T01:35:26.098601876Z","embedding":{"dimensions":384,"id":"multilingual-e5-small","provider":"fastembed"},"grounded":false,"hops":[{"context_chunks_added":0,"forced_stop":false,"iter":0,"kind":"decompose","llm_call_ms":37810,"sub_queries":["What were the main causes of the dinosaurs' extinction?","When did the dinosaurs go extinct?","What types of dinosaurs went extinct?","What evidence supports the theory of the dinosaurs' extinction?","What role did the asteroid play in the dinosaurs' extinction?"]},{"context_chunks_added":15,"forced_stop":true,"iter":1,"kind":"decide","llm_call_ms":0,"sub_queries":[]}],"model":{"dimensions":null,"id":"gemma3:4b","provider":"ollama"},"prompt_template_version":"rag-multi-hop-v1","refusal_reason":"nli_verification_failed","retrieval":{"chunks_returned":0,"chunks_used":0,"k":10,"mode":"hybrid","score_gate":0.30000001192092896,"top_score":0.0,"trace_id":"ret_3152e943"},"schema_version":"answer.v1","usage":{"completion_tokens":0,"latency_ms":79182,"prompt_tokens":0},"verification":{"nli_passed":false,"nli_score":0.0027875436935573816,"nli_threshold":0.5}}
diff --git a/docs/dogfood/v0.18.0/s3-multihop-post-pr9.json b/docs/dogfood/v0.18.0/s3-multihop-post-pr9.json
new file mode 100644
index 0000000..a99da1c
--- /dev/null
+++ b/docs/dogfood/v0.18.0/s3-multihop-post-pr9.json
@@ -0,0 +1 @@
+{"answer":"근거 부족. NLI 검증 모델을 사용할 수 없음 — `[rag] nli_threshold = 0` 으로 비활성화 후 재시도 가능.","citations":[],"created_at":"2026-05-26T01:26:45.887132823Z","embedding":{"dimensions":384,"id":"multilingual-e5-small","provider":"fastembed"},"grounded":false,"hops":[{"context_chunks_added":0,"forced_stop":false,"iter":0,"kind":"decompose","llm_call_ms":37987,"sub_queries":["Kebab의 목적은 무엇인가?","Kebab은 multilingual-e5와 어떻게 통합되는가?","Kebab은 LanceDB와 어떻게 통합되는가?","Kebab은 RRF와 어떻게 통합되는가?","Kebab의 통합의 이유는 무엇인가?"]},{"context_chunks_added":15,"forced_stop":true,"iter":1,"kind":"decide","llm_call_ms":0,"sub_queries":[]}],"model":{"dimensions":null,"id":"gemma3:4b","provider":"ollama"},"prompt_template_version":"rag-multi-hop-v1","refusal_reason":"nli_model_unavailable","retrieval":{"chunks_returned":0,"chunks_used":0,"k":10,"mode":"hybrid","score_gate":0.30000001192092896,"top_score":0.0,"trace_id":"ret_d10dd875"},"schema_version":"answer.v1","usage":{"completion_tokens":0,"latency_ms":311799,"prompt_tokens":0}}
diff --git a/docs/dogfood/v0.18.0/s7-multihop-post-pr9.json b/docs/dogfood/v0.18.0/s7-multihop-post-pr9.json
new file mode 100644
index 0000000..6c0748a
--- /dev/null
+++ b/docs/dogfood/v0.18.0/s7-multihop-post-pr9.json
@@ -0,0 +1 @@
+{"answer":"근거 부족. 생성된 답변이 검색된 문서 내용에 충분히 entail 되지 않음.","citations":[],"created_at":"2026-05-26T01:16:38.053607085Z","embedding":{"dimensions":384,"id":"multilingual-e5-small","provider":"fastembed"},"grounded":false,"hops":[{"context_chunks_added":0,"forced_stop":false,"iter":0,"kind":"decompose","llm_call_ms":55425,"sub_queries":["What is the chemical formula of caffeine?","Caffeine chemical formula"]},{"context_chunks_added":14,"forced_stop":false,"iter":1,"kind":"decide","llm_call_ms":41352,"sub_queries":["What is the chemical formula of caffeine?","What is the molecular structure of caffeine?","What are the functional groups present in caffeine?","What is the IUPAC name of caffeine?","What are the isomers of caffeine?"]},{"context_chunks_added":1,"forced_stop":true,"iter":2,"kind":"decide","llm_call_ms":0,"sub_queries":[]}],"model":{"dimensions":null,"id":"gemma3:4b","provider":"ollama"},"prompt_template_version":"rag-multi-hop-v1","refusal_reason":"nli_verification_failed","retrieval":{"chunks_returned":0,"chunks_used":0,"k":10,"mode":"hybrid","score_gate":0.30000001192092896,"top_score":0.0,"trace_id":"ret_a1b48d22"},"schema_version":"answer.v1","usage":{"completion_tokens":0,"latency_ms":240649,"prompt_tokens":0},"verification":{"nli_passed":false,"nli_score":0.0035389824770390987,"nli_threshold":0.5}}
diff --git a/tasks/HOTFIXES.md b/tasks/HOTFIXES.md
index 7254de6..127bc4e 100644
--- a/tasks/HOTFIXES.md
+++ b/tasks/HOTFIXES.md
@@ -77,9 +77,47 @@ After PR-7 was merged, the same dogfood S7 (`What is the chemical formula of caffeine?`)
 
 **PR-8 limitation**: gemma3:4b ignores the prompt rule. Even a strong rule plus a small pool cannot block the hallucination. The **ceiling of LLM-self-judge-based safety** is clear.
 
-### PR-9: NLI-based post-synthesis verification (planned)
+### PR-9: NLI-based post-synthesis verification (done, 2026-05-26)
 
-The conclusion from academic and industry practice (Self-RAG, CRAG, Auto-GDA, MedTrust-RAG): deterministic post-synthesis verification is the right answer. The **mDeBERTa-v3-base-xnli-multilingual ONNX model (280 MB)** checks entailment for `(premise = packed_chunks, hypothesis = answer)` and refuses when score < 0.5. A layered defense on top of PR-8. Design note: `/build/cache/dogfood-v018/results/PR-9-DESIGN.md`. Staged PRs (9a / 9b / 9c), estimated ~10 hours. v0.18.0 cut blocker.
+The conclusion from academic and industry practice (Self-RAG, CRAG, Auto-GDA, MedTrust-RAG): deterministic post-synthesis verification is the right answer. The **mDeBERTa-v3-base-xnli-multilingual ONNX model (280 MB)** checks entailment for `(premise = packed_chunks, hypothesis = answer)` and refuses when score < threshold. A layered defense on top of PR-8. Design note: `/build/cache/dogfood-v018/results/PR-9-DESIGN.md`. Staged PRs (9a / 9b / 9c-1 / 9c-2 / 9d), all merged.
+
+**Sub-PR sequence (all merged)**:
+- PR #176 (PR-9a): `kebab-nli` crate skeleton: trait surface + workspace deps.
+- PR #177 (PR-9b): `OnnxNliVerifier` ONNX inference + model download (lazy `OnceLock`, OnlyFirst truncation).
+- PR #178 (PR-9c-1): wire surface: `RefusalReason::Nli{Verification,Model}Failed/Unavailable`, `Answer.verification`, `RagPipeline.verifier` field + builder, `[rag] nli_threshold` + `[models.nli]` config.
+- PR #179 (PR-9c-2): enable the step 8.5 NLI hook in `ask_multi_hop` + NliVerifier construction in `App::open_with_config` + 5 mock multi-hop tests.
+- PR-9d: dogfood retest + this closure (a separate PR: fb-41 multi-hop NLI verification).
+
+**Dogfood retest results** (2026-05-26, `/build/cache/dogfood-v018/results/post-pr9/`, preserved in the repo at `docs/dogfood/v0.18.0/`):
+
+| case | PR-8 baseline | PR-9 retest | verdict |
+|---|---|---|---|
+| **S7** (caffeine) | `grounded=true, refusal_reason=null`, **answer = Adam gradient formula (hallucination)** | `refusal_reason=nli_verification_failed`, `nli_score=0.0035` | ✅ **HALLUCINATION FIXED** |
+| S1 (compiler) | `refusal_reason=llm_self_judge` | `refusal_reason=nli_verification_failed`, `nli_score=0.058` | ✅ rejected in both runs, NLI more deterministic |
+| S3 (kebab EN) | `refusal_reason=llm_self_judge` | `refusal_reason=nli_model_unavailable` (consistent) | ⚠ follow-up entry (next subsection) |
+| S10 (dinosaur) | `refusal_reason=llm_self_judge` | `refusal_reason=nli_verification_failed`, `nli_score=0.0028` | ✅ rejected in both runs, NLI more deterministic |
+
+PR-9's core goal (fixing the root cause of the S7 silent hallucination) ✅ **achieved**. The NLI *deterministic external verifier* overcomes the *probabilistic ceiling* of LLM-self-judge.
+
+**RAM peak**: PR-8 ~5-6 GB → PR-9 ~7-8 GB (gemma3:4b + ONNX session ~600 MB). Safe on a 16 GB machine.
+
+**Disk**: NLI model cache 1.1 GB (model 280 MB + tokenizer 16 MB + hf-hub blobs/locks/snapshots overhead). Stored under the user XDG path (`~/.local/share/kebab/models/nli/`) or the configured `storage.model_dir`.
+
+### PR-9d S3 follow-up (kebab-nli `nli_model_unavailable` consistent fail)
+
+**Symptom**: the S3 query ("Why does kebab combine multilingual-e5, LanceDB, and RRF together?") fails *consistently* (on both retries) with `nli_model_unavailable`. Entailment measurement for the other cases (S1/S7/S10) is normal, so the NLI infrastructure itself works. Only S3 fails, and the failure depends on its specific input.
+
+**Diagnosis attempt**: retried with `KEBAB_LOG=info,kebab_rag=debug,kebab_nli=debug`, but *no debug log was emitted* (the env var name is ignored, or the tracing subscriber is not initialized). stderr was empty, so only the graceful refuse path could be confirmed.
+
+**Hypothesis** (unconfirmed):
+- mDeBERTa session inference errors on *the specific packed_text shape of S3* (at the encode step, or in ort Session::run shape validation).
+- Or an *eager session reload* races per process invocation.
+
+**Workaround**: users can disable the check with `[rag] nli_threshold = 0`. Documented under known limitations in the release notes.
+
+**Next step**: v0.18.1 candidate. Verify the env var name used by tracing (`RUST_LOG`, `KEBAB_TRACING_LEVEL`, or similar) and analyze the S3 packed_text shape (chunk count, char count, language mix). A separate PR after the HOTFIX diagnosis.
+
+**Amends**: none (a known limitation of PR-9, v0.18.1 candidate).
 
 ### User impact