diff --git a/docs/dogfood/v0.18.0/SUMMARY.md b/docs/dogfood/v0.18.0/SUMMARY.md
new file mode 100644
index 0000000..84d25ab
--- /dev/null
+++ b/docs/dogfood/v0.18.0/SUMMARY.md
@@ -0,0 +1,68 @@
+# v0.18.0 dogfood retest (PR-9d closure)
+
+Results of the dogfood retest after PR-9c-2. This measures how well NLI-based post-synthesis verification handles the S7 caffeine hallucination found after PR-1 through PR-8 were merged (multi-hop synthesize emitted an Adam optimizer gradient formula, unrelated to the chunks, as the answer).
+
+- Environment: v0.18.0 candidate binary, Ollama gemma3:4b, fastembed multilingual-e5-small, mDeBERTa-v3-base-xnli-multilingual ONNX
+- Config: `[rag] nli_threshold = 0.5`, `score_gate = 0.3`
+- Corpus: `/build/cache/dogfood-v018/` (same as for PR-1 through PR-8)
+- Date: 2026-05-26
+
+## Results comparison
+
+| case | query | PR-8 baseline | PR-9 retest | verdict |
+|---|---|---|---|---|
+| **S7** | "What is the chemical formula of caffeine?" | `grounded=true, refusal_reason=null`, **answer = Adam gradient formula (hallucination)** | `refusal_reason=nli_verification_failed`, `nli_score=0.0035`, `nli_threshold=0.5` | ✅ **HALLUCINATION FIXED** |
+| S1 | "컴파일러 파이프라인 ... 출력 데이터 의존성" | `refusal_reason=llm_self_judge` | `refusal_reason=nli_verification_failed`, `nli_score=0.058` | ✅ both reject; NLI is more deterministic |
+| S3 | "Why does kebab combine multilingual-e5, LanceDB, and RRF together?" | `refusal_reason=llm_self_judge` | `refusal_reason=nli_model_unavailable`, latency 312s | ⚠ **consistent fail, follow-up needed** |
+| S10 | "Why did the dinosaurs go extinct?" (outside the KB) | `refusal_reason=llm_self_judge` | `refusal_reason=nli_verification_failed`, `nli_score=0.0028` | ✅ both reject; NLI is more deterministic |
+
+## S7 hallucination root cause: fix confirmed
+
+Through PR-8, multi-hop synthesize *silently emitted* answers that the chunks do not entail. This is the LLM-self-judge ceiling: the "self-check rule" in the synthesize prompt fails to catch cases such as caffeine, where a single fact is absent from the corpus. The step 8.5 NLI hook from PR-9c-2 detects this with an entailment score of 0.0035 (0.35%) and returns a graceful refusal.
+
+PR-9's *deterministic external verifier* (mDeBERTa-v3 XNLI) overcomes the *probabilistic ceiling* of LLM-self-judge.
+
+## S3 nli_model_unavailable (follow-up)
+
+Only S3 fails with `nli_model_unavailable` (entailment measurement for S1/S7/S10 works normally). Possible causes:
+- mDeBERTa session inference panics or returns an error *for specific inputs* (`tokenizers::encode` failure, `Session::run` shape validation failure, etc.)
+- or the *eager session reload* races at invocation granularity rather than process granularity
+- retrying with `KEBAB_LOG=info,kebab_rag=debug,kebab_nli=debug` emits no debug log (either the env name is ignored or the tracing subscriber is not initialized), which makes diagnosis difficult
+
+Out of scope for this closure. A follow-up entry is registered in `tasks/HOTFIXES.md` (separate from HOTFIX candidate #15: this is an *intermittent / input-dependent* issue in kebab-nli).
+
+## Measurements compared
+
+| metric | PR-8 baseline | PR-9 retest |
+|---|---|---|
+| S7 latency | 158s | 241s (adds NLI inference + first-run model download, first call only) |
+| S1 latency | (no comparable baseline from the post-PR-8 run: `results/s1-multihop.json` is from an earlier point, so a like-for-like comparison is not possible) | 224s |
+| S10 latency | (same as above) | 79s |
+| RAM peak | ~5-6 GB (gemma3:4b) | ~7-8 GB (gemma3:4b + ONNX session ~600 MB) |
+| Disk (NLI model) | 0 | 1.1 GB (model 280 MB + tokenizer 16 MB + blobs/locks/snapshots overhead) |
+
+For S1/S10, the only *same-period baseline* is the single copy in `results/`, so the timeline comparison is imprecise. Only S7 has a retest preserved in `results/post-pr8/`, so only its latency comparison is meaningful (158s baseline → 241s with NLI first-run; the second call is estimated at 240s - 30s download = ~210s).
+
+NLI inference itself takes ~10-50 ms per call (consistent with spec §2.1). On the first call, model load (~30-60s) plus multi-hop synthesize latency dominates.
+
+## Sample wire outputs
+
+Four samples in this directory: `s{1,3,7,10}-multihop-post-pr9.json`.
+
+Schema conformance:
+- Confirmed the new `verification: { nli_score, nli_threshold, nli_passed }` field in `answer.v1`.
+- Two new `refusal_reason` values: `"nli_verification_failed"` / `"nli_model_unavailable"`.
+- Backward-compatible for pre-v0.18 readers, because the `verification` field is omitted when it is `None` (serde `skip_serializing_if`); this is the additive minor wire change from PR-9c-1.
+
+## NLI threshold tuning iteration trigger?
+
+Based on the current results, *none*:
+- All PASS cases (S7/S1/S10) show entailment < 0.1 on *clearly ungrounded* answers, so the 0.5 threshold is not *overly strict*.
+- Every case that the PR-8 baseline refused with `llm_self_judge` is still refused in the retest (no false positives).
+- v0.18.1 candidate: diagnose the S3 issue, and tune the threshold if a happy-path case (a genuinely grounded answer) is measured being rejected as a false positive.
+
+## Limitations
+
+- No direct measurement of the happy path (a genuinely grounded answer that passes NLI): every retest took the refuse path. The dogfood corpus is weighted toward *negative / absent facts*, so happy-path samples are scarce. v0.18.1 candidate: extend the corpus.
+- Synthesize quality here is the gemma3:4b baseline; a larger model (gemma4:e4b 8B Q4) may raise the happy-path rate. This is the rationale for the RAM recommendation in the release notes.
+- The S3 follow-up.
diff --git a/docs/dogfood/v0.18.0/s1-multihop-post-pr9.json b/docs/dogfood/v0.18.0/s1-multihop-post-pr9.json
new file mode 100644
index 0000000..a5ef4e9
--- /dev/null
+++ b/docs/dogfood/v0.18.0/s1-multihop-post-pr9.json
@@ -0,0 +1 @@
+{"answer":"근거 부족. 생성된 답변이 검색된 문서 내용에 충분히 entail 되지 않음.","citations":[],"created_at":"2026-05-26T01:20:48.044449683Z","embedding":{"dimensions":384,"id":"multilingual-e5-small","provider":"fastembed"},"grounded":false,"hops":[{"context_chunks_added":0,"forced_stop":false,"iter":0,"kind":"decompose","llm_call_ms":70908,"sub_queries":["컴파일러의 lexer가 parser에 전달하는 데이터는 무엇인가?","컴파일러의 parser가 type checker에 전달하는 데이터는 무엇인가?","lexer-parser 파이프라인의 출력 데이터는 type checker에 어떻게 사용되는가?","컴파일러 파이프라인에서 각 단계의 출력 데이터 형식은 무엇인가?","lexer-parser-typechecker 파이프라인의 데이터 의존성 흐름은 무엇인가?"]},{"context_chunks_added":15,"forced_stop":true,"iter":1,"kind":"decide","llm_call_ms":0,"sub_queries":[]}],"model":{"dimensions":null,"id":"gemma3:4b","provider":"ollama"},"prompt_template_version":"rag-multi-hop-v1","refusal_reason":"nli_verification_failed","retrieval":{"chunks_returned":0,"chunks_used":0,"k":10,"mode":"hybrid","score_gate":0.30000001192092896,"top_score":0.0,"trace_id":"ret_e5f08b72"},"schema_version":"answer.v1","usage":{"completion_tokens":0,"latency_ms":224087,"prompt_tokens":0},"verification":{"nli_passed":false,"nli_score":0.058334656059741974,"nli_threshold":0.5}}
diff --git a/docs/dogfood/v0.18.0/s10-multihop-post-pr9.json b/docs/dogfood/v0.18.0/s10-multihop-post-pr9.json
new file mode 100644
index 0000000..4ced183
--- /dev/null
+++ b/docs/dogfood/v0.18.0/s10-multihop-post-pr9.json
@@ -0,0 +1 @@
+{"answer":"근거 부족. 생성된 답변이 검색된 문서 내용에 충분히 entail 되지 않음.","citations":[],"created_at":"2026-05-26T01:35:26.098601876Z","embedding":{"dimensions":384,"id":"multilingual-e5-small","provider":"fastembed"},"grounded":false,"hops":[{"context_chunks_added":0,"forced_stop":false,"iter":0,"kind":"decompose","llm_call_ms":37810,"sub_queries":["What were the main causes of the dinosaurs' extinction?","When did the dinosaurs go extinct?","What types of dinosaurs went extinct?","What evidence supports the theory of the dinosaurs' extinction?","What role did the asteroid play in the dinosaurs' extinction?"]},{"context_chunks_added":15,"forced_stop":true,"iter":1,"kind":"decide","llm_call_ms":0,"sub_queries":[]}],"model":{"dimensions":null,"id":"gemma3:4b","provider":"ollama"},"prompt_template_version":"rag-multi-hop-v1","refusal_reason":"nli_verification_failed","retrieval":{"chunks_returned":0,"chunks_used":0,"k":10,"mode":"hybrid","score_gate":0.30000001192092896,"top_score":0.0,"trace_id":"ret_3152e943"},"schema_version":"answer.v1","usage":{"completion_tokens":0,"latency_ms":79182,"prompt_tokens":0},"verification":{"nli_passed":false,"nli_score":0.0027875436935573816,"nli_threshold":0.5}}
diff --git a/docs/dogfood/v0.18.0/s3-multihop-post-pr9.json b/docs/dogfood/v0.18.0/s3-multihop-post-pr9.json
new file mode 100644
index 0000000..a99da1c
--- /dev/null
+++ b/docs/dogfood/v0.18.0/s3-multihop-post-pr9.json
@@ -0,0 +1 @@
+{"answer":"근거 부족. NLI 검증 모델을 사용할 수 없음 — `[rag] nli_threshold = 0` 으로 비활성화 후 재시도 가능.","citations":[],"created_at":"2026-05-26T01:26:45.887132823Z","embedding":{"dimensions":384,"id":"multilingual-e5-small","provider":"fastembed"},"grounded":false,"hops":[{"context_chunks_added":0,"forced_stop":false,"iter":0,"kind":"decompose","llm_call_ms":37987,"sub_queries":["Kebab의 목적은 무엇인가?","Kebab은 multilingual-e5와 어떻게 통합되는가?","Kebab은 LanceDB와 어떻게 통합되는가?","Kebab은 RRF와 어떻게 통합되는가?","Kebab의 통합의 이유는 무엇인가?"]},{"context_chunks_added":15,"forced_stop":true,"iter":1,"kind":"decide","llm_call_ms":0,"sub_queries":[]}],"model":{"dimensions":null,"id":"gemma3:4b","provider":"ollama"},"prompt_template_version":"rag-multi-hop-v1","refusal_reason":"nli_model_unavailable","retrieval":{"chunks_returned":0,"chunks_used":0,"k":10,"mode":"hybrid","score_gate":0.30000001192092896,"top_score":0.0,"trace_id":"ret_d10dd875"},"schema_version":"answer.v1","usage":{"completion_tokens":0,"latency_ms":311799,"prompt_tokens":0}}
diff --git a/docs/dogfood/v0.18.0/s7-multihop-post-pr9.json b/docs/dogfood/v0.18.0/s7-multihop-post-pr9.json
new file mode 100644
index 0000000..6c0748a
--- /dev/null
+++ b/docs/dogfood/v0.18.0/s7-multihop-post-pr9.json
@@ -0,0 +1 @@
+{"answer":"근거 부족. 생성된 답변이 검색된 문서 내용에 충분히 entail 되지 않음.","citations":[],"created_at":"2026-05-26T01:16:38.053607085Z","embedding":{"dimensions":384,"id":"multilingual-e5-small","provider":"fastembed"},"grounded":false,"hops":[{"context_chunks_added":0,"forced_stop":false,"iter":0,"kind":"decompose","llm_call_ms":55425,"sub_queries":["What is the chemical formula of caffeine?","Caffeine chemical formula"]},{"context_chunks_added":14,"forced_stop":false,"iter":1,"kind":"decide","llm_call_ms":41352,"sub_queries":["What is the chemical formula of caffeine?","What is the molecular structure of caffeine?","What are the functional groups present in caffeine?","What is the IUPAC name of caffeine?","What are the isomers of caffeine?"]},{"context_chunks_added":1,"forced_stop":true,"iter":2,"kind":"decide","llm_call_ms":0,"sub_queries":[]}],"model":{"dimensions":null,"id":"gemma3:4b","provider":"ollama"},"prompt_template_version":"rag-multi-hop-v1","refusal_reason":"nli_verification_failed","retrieval":{"chunks_returned":0,"chunks_used":0,"k":10,"mode":"hybrid","score_gate":0.30000001192092896,"top_score":0.0,"trace_id":"ret_a1b48d22"},"schema_version":"answer.v1","usage":{"completion_tokens":0,"latency_ms":240649,"prompt_tokens":0},"verification":{"nli_passed":false,"nli_score":0.0035389824770390987,"nli_threshold":0.5}}
diff --git a/tasks/HOTFIXES.md b/tasks/HOTFIXES.md
index 7254de6..127bc4e 100644
--- a/tasks/HOTFIXES.md
+++ b/tasks/HOTFIXES.md
@@ -77,9 +77,47 @@ After the PR-7 merge, the same dogfood S7 (`What is the chemical formula of caffeine?`)
 **PR-8 limitation**: gemma3:4b ignores the prompt rule. Even a strong rule + small pool does not block the hallucination. This makes the **ceiling of LLM-self-judge-based safety** clear.
-### PR-9: NLI-based post-synthesis verification (planned)
+### PR-9: NLI-based post-synthesis verification (done, 2026-05-26)

-The consensus in academia / industry standards (Self-RAG, CRAG, Auto-GDA, MedTrust-RAG): deterministic post-synthesis verification is the right answer. The **mDeBERTa-v3-base-xnli-multilingual ONNX model (280 MB)** checks entailment for `(premise = packed_chunks, hypothesis = answer)` and refuses when score < 0.5. Layered defense on top of PR-8. Design note: `/build/cache/dogfood-v018/results/PR-9-DESIGN.md`. Staged PRs (9a / 9b / 9c), estimated ~10 hours. v0.18.0 cut blocker.
+The consensus in academia / industry standards (Self-RAG, CRAG, Auto-GDA, MedTrust-RAG): deterministic post-synthesis verification is the right answer. The **mDeBERTa-v3-base-xnli-multilingual ONNX model (280 MB)** checks entailment for `(premise = packed_chunks, hypothesis = answer)` and refuses when score < threshold. Layered defense on top of PR-8. Design note: `/build/cache/dogfood-v018/results/PR-9-DESIGN.md`. Staged PRs (9a / 9b / 9c-1 / 9c-2 / 9d) all merged.
+
+**Sub-PR sequence (all merged)**:
+- PR #176 (PR-9a): `kebab-nli` crate skeleton (trait surface + workspace deps).
+- PR #177 (PR-9b): `OnnxNliVerifier` ONNX inference + model download (lazy `OnceLock`, OnlyFirst truncation).
+- PR #178 (PR-9c-1): wire surface (`RefusalReason::NliVerificationFailed` / `RefusalReason::NliModelUnavailable`, `Answer.verification`, `RagPipeline.verifier` field + builder, `[rag] nli_threshold` + `[models.nli]` config).
+- PR #179 (PR-9c-2): enables the step 8.5 NLI hook in `ask_multi_hop` + NliVerifier construction in `App::open_with_config` + 5 mock multi-hop tests.
+- PR-9d: dogfood retest + this closure (separate PR: fb-41 multi-hop NLI verification).
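The gate that PR-9c-2 enables at step 8.5 can be sketched roughly as follows. This is a minimal illustration in std-only Rust, not the actual `kebab-rag` code: the names `nli_gate`, `GateOutcome`, and `Verification` are hypothetical, the `>=` pass rule is inferred from "score < threshold refuses", and treating `nli_threshold = 0` as "skip the verifier entirely" is an assumption based on the disable hint in the S3 refusal message.

```rust
/// Hypothetical mirror of the two new refusal reasons on the wire.
#[derive(Debug, PartialEq)]
enum RefusalReason {
    NliVerificationFailed, // wire value: "nli_verification_failed"
    NliModelUnavailable,   // wire value: "nli_model_unavailable"
}

/// Mirrors the `verification` object seen in the sample wire outputs.
#[derive(Debug, PartialEq)]
struct Verification {
    nli_score: f32,
    nli_threshold: f32,
    nli_passed: bool,
}

/// Hypothetical result type for the post-synthesis gate.
#[derive(Debug, PartialEq)]
enum GateOutcome {
    /// Threshold is 0: verification disabled (assumption).
    Skipped,
    /// Entailment score reached the threshold; the answer may be emitted.
    Passed(Verification),
    /// Refuse; `verification` is absent when the model could not run.
    Refused(RefusalReason, Option<Verification>),
}

/// `entail` stands in for the verifier call on
/// (premise = packed chunks, hypothesis = answer).
fn nli_gate<F>(threshold: f32, entail: F) -> GateOutcome
where
    F: FnOnce() -> Result<f32, String>,
{
    if threshold <= 0.0 {
        // Assumed semantics: a zero threshold never invokes the verifier.
        return GateOutcome::Skipped;
    }
    match entail() {
        // Any verifier error becomes a graceful refusal, not a panic.
        Err(_) => GateOutcome::Refused(RefusalReason::NliModelUnavailable, None),
        Ok(score) => {
            let v = Verification {
                nli_score: score,
                nli_threshold: threshold,
                nli_passed: score >= threshold,
            };
            if v.nli_passed {
                GateOutcome::Passed(v)
            } else {
                GateOutcome::Refused(RefusalReason::NliVerificationFailed, Some(v))
            }
        }
    }
}
```

With the retest values, `nli_gate(0.5, || Ok(0.0035))` takes the refuse path with `NliVerificationFailed` (the S7 case), while a verifier error maps to `NliModelUnavailable` with no `verification` payload, which matches the shape of the S3 sample output.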
+
+**Dogfood retest results** (2026-05-26, `/build/cache/dogfood-v018/results/post-pr9/`, preserved in the repo at `docs/dogfood/v0.18.0/`):
+
+| case | PR-8 baseline | PR-9 retest | verdict |
+|---|---|---|---|
+| **S7** (caffeine) | `grounded=true, refusal_reason=null`, **answer = Adam gradient formula (hallucination)** | `refusal_reason=nli_verification_failed`, `nli_score=0.0035` | ✅ **HALLUCINATION FIXED** |
+| S1 (compiler) | `refusal_reason=llm_self_judge` | `refusal_reason=nli_verification_failed`, `nli_score=0.058` | ✅ both reject; NLI is more deterministic |
+| S3 (kebab EN) | `refusal_reason=llm_self_judge` | `refusal_reason=nli_model_unavailable` (consistent) | ⚠ follow-up entry (next sub-section) |
+| S10 (dinosaur) | `refusal_reason=llm_self_judge` | `refusal_reason=nli_verification_failed`, `nli_score=0.0028` | ✅ both reject; NLI is more deterministic |
+
+PR-9's core goal (fixing the root cause of the S7 silent hallucination) ✅ **achieved**. NLI's *deterministic external verifier* overcomes the *probabilistic ceiling* of LLM-self-judge.
+
+**RAM peak**: PR-8 ~5-6 GB → PR-9 ~7-8 GB (gemma3:4b + ONNX session ~600 MB). Safe on a 16 GB machine.
+
+**Disk**: NLI model cache 1.1 GB (model 280 MB + tokenizer 16 MB + hf-hub blobs/locks/snapshots overhead). Stored in the user XDG directory (`~/.local/share/kebab/models/nli/`) or in the config's `storage.model_dir`.
+
+### PR-9d S3 follow-up (kebab-nli `nli_model_unavailable` consistent fail)
+
+**Symptom**: The S3 query ("Why does kebab combine multilingual-e5, LanceDB, and RRF together?") fails *consistently* (both of 2 retries) with `nli_model_unavailable`. Entailment measurement for the other cases (S1/S7/S10) works normally, so the NLI infrastructure itself is functional. Only S3 fails, and the failure depends on its specific input.
+
+**Diagnosis attempt**: Retried with `KEBAB_LOG=info,kebab_rag=debug,kebab_nli=debug`, but *no debug log was emitted* (either the env name is ignored or the tracing subscriber is not initialized). stderr was empty, so only the graceful refuse path could be confirmed.
+
+**Hypothesis** (unconfirmed):
+- mDeBERTa session inference errors on *the specific packed_text shape of S3* (at the encode step or in ort `Session::run` shape validation).
+- Or the *eager session reload* races at process-invocation granularity.
+
+**Workaround**: Users can disable verification with `[rag] nli_threshold = 0`. Documented under known limitations in the release notes.
+
+**Next step**: v0.18.1 candidate. Verify the tracing env name (`RUST_LOG`, `KEBAB_TRACING_LEVEL`, etc.) and analyze the S3 packed_text shape (chunk count, char count, language mix). Separate PR after the HOTFIX diagnosis.
+
+**Amends**: none (known limitation of PR-9, v0.18.1 candidate).

 ### User impact