
altair823 7c27633df2 chore(rag): post-PR9 refactor — H1/H2/H3/D/E + test coverage + post-refactor dogfood retest

Architectural cleanup by the OMC team `post-pr9-refactor`. After the architect's priorities analysis, the executor and test-engineer made the file edits, and the system-architect's component-level review concluded *pre-cut nothing: defer everything to v0.18.1+*.

## Executor work (H1/H2/H3/D/E)

- **H1** (kebab-nli/src/onnx.rs): Activated the `[models.nli]` config wiring. Removed the `DEFAULT_MODEL_ID` const (NliCfg::defaults in kebab-config is now the single source). OnnxNliVerifier::new reads config.models.nli.model and calls anyhow::bail unless config.models.nli.provider is "onnx". Removed 3 stale "PR-9c-1 will wire this" comments. Added 2 unit tests (`new_uses_config_model_id`, `new_rejects_unsupported_provider`).
- **H2** (kebab-rag/src/pipeline.rs): `truncate_for_nli(premise: &str, _hypothesis: &str)` → `truncate_for_nli(premise: &str)`. Removed the v0.18.1 placeholder doc. Updated 4 call sites (tests/multi_hop.rs) and renamed the test `multi_hop_truncate_for_nli_preserves_hypothesis` → `multi_hop_truncate_for_nli_char_budget` (to match the contract).
- **H3** (kebab-rag/src/pipeline.rs:1041): `was_truncated` is now surfaced via tracing::debug! (adds observability; the signature is preserved as the caller logging contract).
- **D** (kebab-mcp/tests/tools_call_ask_multi_hop.rs): request_timeout_secs 2 → 5 (stability on slow CI); removed the `mh_code` discriminator. The dispatch contract is `mh.is_error.unwrap_or(false)` (the existing assertion is sufficient).
- **E** (tasks/HOTFIXES.md + pipeline.rs:1633-1638): Added the subsection "### PR-9 NLI refusal: terminal Synthesize hop omitted from hops trace" as a sibling of the fb-41 PR-9 closure entry. The pipeline comment "cleanup deferred to a follow-up" now reads "// See tasks/HOTFIXES.md ... for follow-up" as a cross-link.
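
The H1 and H2/H3 changes can be pictured roughly as follows. This is a minimal sketch, not the actual kebab code: the struct fields, the `String` error type (the real constructor is described as using anyhow::bail), the `(String, bool)` return shape, the helper `truncate_to_chars`, and the budget value are all assumptions made for illustration.

```rust
/// Hypothetical stand-in for the `[models.nli]` section of the config.
#[derive(Debug, Clone)]
pub struct NliCfg {
    pub provider: String,
    pub model: String,
}

/// Hypothetical stand-in for the ONNX-backed verifier.
#[derive(Debug)]
pub struct OnnxNliVerifier {
    pub model_id: String,
}

impl OnnxNliVerifier {
    /// Takes the model id from config (no crate-local default constant)
    /// and rejects any provider other than "onnx".
    pub fn new(cfg: &NliCfg) -> Result<Self, String> {
        if cfg.provider != "onnx" {
            return Err(format!("unsupported NLI provider: {}", cfg.provider));
        }
        Ok(Self { model_id: cfg.model.clone() })
    }
}

/// Assumed budget; the real value is not stated in this commit message.
pub const NLI_PREMISE_CHAR_BUDGET: usize = 2000;

/// Cuts `text` to at most `max_chars` characters (not bytes), so multi-byte
/// text is never split inside a code point. Returns the kept text and
/// whether anything was dropped.
pub fn truncate_to_chars(text: &str, max_chars: usize) -> (String, bool) {
    let was_truncated = text.chars().count() > max_chars;
    let kept: String = text.chars().take(max_chars).collect();
    (kept, was_truncated)
}

/// Premise-only truncation: the hypothesis is no longer a parameter.
/// A caller would log `was_truncated` at debug level.
pub fn truncate_for_nli(premise: &str) -> (String, bool) {
    truncate_to_chars(premise, NLI_PREMISE_CHAR_BUDGET)
}
```

Counting characters rather than bytes matters for a multilingual corpus, where slicing by byte offset could panic on a non-boundary index.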

## Test-engineer work (T1/T2/T3/T4, 9 new tests)

- **T1** (kebab-nli/src/onnx.rs::tests): 3 unit tests for sanitize_model_id (replaces_slash / idempotent / leaves_other_chars).
- **T2** (kebab-rag/tests/multi_hop_nli_panic.rs, new): 2 panic-path tests: a #[should_panic] test for the facade invariant (`expect("verifier must be Some when nli_threshold > 0.0")`) plus a companion for threshold=0.
- **T3** (kebab-rag/tests/multi_hop_nli_stream.rs, new): 2 StreamEvent::Final tests pinning the wire shape of the stream_sink Final branch for refuse_nli_verification and refuse_nli_model_unavailable.
- **T4** (kebab-app/tests/open_with_config_nli.rs, new): 2 NLI failure-path tests: App::open_with_config returns Err from Result<App> (with "OnnxNliVerifier" in the chain) when model_dir is unwritable, and skips gracefully when threshold=0.
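
To illustrate the three properties T1 pins down, here is a hedged sketch of a `sanitize_model_id`-style helper. The body is an assumption: the commit message only names the properties (slashes are replaced, the function is idempotent, other characters are left alone), and the `_` replacement character is chosen here purely for illustration.

```rust
/// Hypothetical sketch: turn a model id such as "org/name" into a string
/// that is safe to use as a single directory name by replacing the path
/// separator. The real replacement character may differ.
pub fn sanitize_model_id(model_id: &str) -> String {
    model_id.replace('/', "_")
}
```

Because the replacement contains no `/`, applying the function twice yields the same result as applying it once, which is the idempotence property the test name refers to.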

## System-architect conclusion

Analysis through 3 lenses (absorption / duplication / under-engineered interface) concluded *pre-cut nothing*. All top-3 items are deferred to v0.18.1+:
- Lens 1: kebab-normalize + kebab-parse-types can be absorbed (used only by parse-md; the 5 parsers bypass them) → v0.18.1+.
- Lens 3: dead polymorphism in the Extractor + Chunker traits (every call site is hardcoded) → v0.18.1+.
- Lens 1 bundled: kebab-source-fs drags in the 9 tree-sitter grammars of kebab-parse-code → low-risk dep-graph win, bundled into v0.18.1+.
- Defer-with-intent: LanguageModel async refactor (when a cloud LLM is added), NliVerifier::score_batch + typed NliError (when a 2nd impl appears), compute_stale → kebab-core::stale.

Reports: /build/cache/tmp/post-pr9-refactor-priorities.md, /build/cache/tmp/system-architecture-priorities.md (both outside the repo, kept to preserve the analysis).

## Verification

- cargo test -p kebab-nli -j 1 → 11/11 pass.
- cargo test -p kebab-rag -j 1 → 75/75 pass (including 5 NLI multi-hop + 4 new from T2/T3).
- cargo test -p kebab-app -j 1 → 23 pass + 2 ignored (including the 2 from T4).
- cargo test -p kebab-mcp --test tools_call_ask_multi_hop -j 1 → 1 pass + 1 pre-existing flaky (HOTFIX #15, no_chunks short-circuit, unrelated to the executor's D fix; the base assertion at line 86 fails because the fixture is missing).
- cargo clippy --workspace --all-targets -j 1 -- -D warnings clean.
- cargo test --workspace --no-fail-fast -j 1 → 1304 passed (+11 new) + the same 1 pre-existing flaky.
- **Post-refactor dogfood retest byte-identical** (all 3 runs: PR-9d / post-cleanup / post-refactor): S7 0.0035389824770390987, S1 0.058334656059741974, S10 0.0027875436935573816, S3 nli_model_unavailable.

Added a "Post-architectural-refactor retest" section to docs/dogfood/v0.18.0/SUMMARY.md.

Wire impact: none.
Behavior impact: none (H1's config wiring resolves to the same model as the default → byte-identical).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-26 04:42:37 +00:00


v0.18.0 dogfood retest (PR-9d closure)

Results of the post-PR-9c-2 dogfood retest. It measures the effect of NLI-based post-synthesis verification on the S7 caffeine hallucination found after PR-1~PR-8 were merged (multi-hop synthesize emitted an Adam optimizer gradient formula, unrelated to the chunks, as the answer).

Environment: v0.18.0 candidate binary, Ollama gemma3:4b, fastembed multilingual-e5-small, mDeBERTa-v3-base-xnli-multilingual ONNX
Config: [rag] nli_threshold = 0.5, score_gate = 0.3
Corpus: /build/cache/dogfood-v018/ (same as PR-1~PR-8)
Date: 2026-05-26

Result comparison

| case | query | PR-8 baseline | PR-9 retest | verdict |
|---|---|---|---|---|
| S7 | "What is the chemical formula of caffeine?" | `grounded=true, refusal_reason=null`, answer = Adam gradient formula (hallucination) | `refusal_reason=nli_verification_failed`, `nli_score=0.0035`, `nli_threshold=0.5` | ✅ HALLUCINATION FIXED |
| S1 | "컴파일러 파이프라인 ... 출력 데이터 의존성" (Korean: "compiler pipeline ... output data dependency") | `refusal_reason=llm_self_judge` | `refusal_reason=nli_verification_failed`, `nli_score=0.058` | ✅ both reject; NLI is more deterministic |
| S3 | "Why does kebab combine multilingual-e5, LanceDB, and RRF together?" | `refusal_reason=llm_self_judge` | `refusal_reason=nli_model_unavailable`, latency 313s | ⚠ consistent fail, follow-up needed |
| S10 | "Why did the dinosaurs go extinct?" (outside the KB) | `refusal_reason=llm_self_judge` | `refusal_reason=nli_verification_failed`, `nli_score=0.0028` | ✅ both reject; NLI is more deterministic |

S7 hallucination root cause confirmed resolved

Up to PR-8, multi-hop synthesize silently emitted answers that were not entailed by the chunks. This is the LLM-self-judge ceiling: the "self-check rule" in the synthesize prompt fails to catch cases like caffeine, where a single fact is simply absent from the corpus. The step 8.5 NLI hook from PR-9c-2 detects the problem with an entailment score of 0.0035 (0.35%) and turns it into a graceful refusal.

PR-9's deterministic external verifier (mDeBERTa-v3 XNLI) overcomes the probabilistic ceiling of LLM-self-judge.
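
The decision made by the NLI hook can be sketched as a small pure function. This is an illustration under stated assumptions, not the pipeline's actual code: the `Verdict` enum and `nli_gate` name are hypothetical, the strict `<` comparison against the threshold is a guess, and the "threshold of 0 disables verification" branch is inferred from the threshold=0 graceful-skip test mentioned in the commit message.

```rust
/// Hypothetical outcome type for the post-synthesis check.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Grounded,
    Refused(&'static str),
}

/// `score` is `None` when the verifier could not produce an entailment score
/// (for example, the model failed to load or inference returned an error).
pub fn nli_gate(score: Option<f64>, threshold: f64) -> Verdict {
    // Assumption: a non-positive threshold means verification is switched off.
    if threshold <= 0.0 {
        return Verdict::Grounded;
    }
    match score {
        None => Verdict::Refused("nli_model_unavailable"),
        Some(s) if s < threshold => Verdict::Refused("nli_verification_failed"),
        Some(_) => Verdict::Grounded,
    }
}
```

With the S7 score of 0.0035 and the configured threshold of 0.5, this sketch yields `Refused("nli_verification_failed")`, matching the refusal reason recorded in the table above.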

S3's nli_model_unavailable (follow-up)

Only S3 fails with nli_model_unavailable (entailment measurement for S1/S7/S10 works normally). Potential causes:

- mDeBERTa session inference panics or is converted to an error for a specific input (tokenizers::encode failure, Session::run shape validation failure, etc.)
- or the eager session reload races at the per-invocation level rather than the per-process level
- Retrying with KEBAB_LOG=info,kebab_rag=debug,kebab_nli=debug emits no debug log (env name ignored, or tracing subscriber not initialized), which makes diagnosis hard

Outside the scope of this closure. A follow-up entry is registered in tasks/HOTFIXES.md (separate from HOTFIX candidate #15: this is an intermittent / input-dependent issue in kebab-nli).

Comparison measurements

| metric | PR-8 baseline | PR-9 retest |
|---|---|---|
| S7 latency | 158s | 241s (adds NLI inference + first-run model download, first call only) |
| S1 latency | (no comparable baseline at the post-pr8 point; `results/s1-multihop.json` is from an earlier point, so a like-for-like comparison is not possible) | 224s |
| S10 latency | (same as above) | 79s |
| RAM peak | ~5-6 GB (gemma3:4b) | ~7-8 GB (gemma3:4b + ONNX session ~600 MB) |
| Disk (NLI model) | 0 | 1.1 GB (model 280 MB + tokenizer 16 MB + blobs/locks/snapshots overhead) |

The S1/S10 baselines from the same point in time exist only in results/, so the timeline comparison is imprecise. Only S7 has a retest preserved in results/post-pr8/, so only its latency comparison is meaningful (158s baseline → 241s with NLI first-run; a second call is estimated at 241s - ~30s download ≈ 210s).

NLI inference latency itself is ~10-50 ms per call (matches spec §2.1). On the first call, model load (~30-60s) plus multi-hop synthesize latency dominate.

Sample wire outputs

The 4 samples s{1,3,7,10}-multihop-post-pr9.json in this directory.

Schema conformance:

- answer.v1 carries the new verification: { nli_score, nli_threshold, nli_passed } field (confirmed).
- refusal_reason has two new values: "nli_verification_failed" / "nli_model_unavailable".
- Pre-v0.18 readers stay backward-compatible because the verification field is omitted via skip_serializing_if when it is None (the additive minor wire change from PR-9c-1).
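
The additive-field behavior described above can be sketched without any external crate. This is a hand-rolled illustration only: the real code presumably derives its serialization with serde and a skip_serializing_if attribute, the struct layout beyond the three named fields is an assumption, and `to_json` here does no string escaping.

```rust
/// The three verification fields named in the schema notes.
pub struct Verification {
    pub nli_score: f64,
    pub nli_threshold: f64,
    pub nli_passed: bool,
}

/// Hypothetical, heavily reduced view of an answer.v1 payload.
pub struct Answer {
    pub refusal_reason: Option<String>,
    pub verification: Option<Verification>,
}

impl Answer {
    /// Renders JSON by hand. The `verification` key is left out entirely
    /// when the value is `None`, so older readers never see an unknown key.
    pub fn to_json(&self) -> String {
        let mut fields = Vec::new();
        let reason = match &self.refusal_reason {
            Some(r) => format!("\"{}\"", r), // no escaping: sketch only
            None => "null".to_string(),
        };
        fields.push(format!("\"refusal_reason\":{}", reason));
        if let Some(v) = &self.verification {
            fields.push(format!(
                "\"verification\":{{\"nli_score\":{},\"nli_threshold\":{},\"nli_passed\":{}}}",
                v.nli_score, v.nli_threshold, v.nli_passed
            ));
        }
        format!("{{{}}}", fields.join(","))
    }
}
```

Omitting the key (rather than emitting `"verification": null`) is what keeps the change additive for readers that reject or mis-handle unknown null fields.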

NLI threshold tuning iteration trigger?

None, based on current results:

- Every PASS case (S7/S1/S10) has entailment < 0.1 on a clearly ungrounded answer, so the 0.5 threshold is not excessively strict.
- Wherever the PR-8 baseline refused via llm_self_judge (S1/S3/S10), the retest also refuses (no false positives).
- v0.18.1 candidate: diagnose the S3 issue, and tune the threshold if measurements show happy-path (actually grounded) answers being rejected as false positives.

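Limitations

- No direct measurement of the happy path (actually grounded answers that pass NLI): every retest took the refuse path. The dogfood corpus is dominated by negative / absent facts, so happy-path samples are scarce. v0.18.1 candidate: strengthen the corpus.
- The synthesize quality of gemma3:4b is the baseline; a larger model (gemma4:e4b 8B Q4) may raise the happy-path probability. This is the point of the RAM recommendation guide in the release notes.
- The S3 follow-up.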

Post-cleanup retest (2026-05-26)

The same 4 cases were re-run right after the workspace-wide cleanup (chore: clippy::pedantic baseline + auto-fix, 128 files / +552-472). Results are byte-identical to PR-9d:

| case | PR-9d | post-cleanup | regression |
|---|---|---|---|
| S7 | `nli_verification_failed`, score=0.0035389824770390987 | `nli_verification_failed`, score=0.0035389824770390987 | ✓ identical |
| S1 | `nli_verification_failed`, score=0.058334656059741974 | `nli_verification_failed`, score=0.058334656059741974 | ✓ identical |
| S10 | `nli_verification_failed`, score=0.0027875436935573816 | `nli_verification_failed`, score=0.0027875436935573816 | ✓ identical |
| S3 | `nli_model_unavailable` | `nli_model_unavailable` | ✓ identical (unrelated to the cleanup; root cause is a v0.18.1 follow-up) |

The cleanup is a mechanical refactor only: 0 behavior regressions, and NLI scores are deterministic. This is a baseline from which the v0.18.0 cut PR can proceed.

Post-architectural-refactor retest (2026-05-26)

After the architect of the OMC team post-pr9-refactor analyzed priorities, the executor and test-engineer carried out additional cleanup: H1 (models.nli.model config wiring, DEFAULT_MODEL_ID removed), H2 (removed the _hypothesis stub param from truncate_for_nli), H3 (was_truncated surfaced via tracing::debug!), D (MCP test flake fix), E (carried TODO → HOTFIXES cross-link). The test-engineer added T1/T2/T3/T4 (9 new tests in total). After a component-level review, the system-architect concluded "pre-cut nothing: defer all architectural items to v0.18.1+".

After this architectural refactor, the same 4 cases were re-run. All 3 runs (PR-9d / post-cleanup / post-refactor) are byte-identical:

| case | PR-9d | post-cleanup | post-refactor |
|---|---|---|---|
| S7 | 0.0035389824770390987 | 0.0035389824770390987 | 0.0035389824770390987 ✓ |
| S1 | 0.058334656059741974 | 0.058334656059741974 | 0.058334656059741974 ✓ |
| S10 | 0.0027875436935573816 | 0.0027875436935573816 | 0.0027875436935573816 ✓ |
| S3 | nli_model_unavailable | nli_model_unavailable | nli_model_unavailable ✓ |
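
A byte-identical claim about floating-point scores is stronger than "approximately equal", and one way to check it is to compare raw bit patterns. The helper below is a hypothetical sketch of such a check, not part of the kebab test suite; it uses only the standard `f64::to_bits`.

```rust
/// Returns true when every score in `runs` has exactly the same IEEE 754
/// bit pattern as its neighbor, i.e. the runs are bit-for-bit identical.
/// An empty or single-element slice is trivially identical.
pub fn bit_identical(runs: &[f64]) -> bool {
    runs.windows(2).all(|w| w[0].to_bits() == w[1].to_bits())
}
```

Comparing bits instead of using `==` also distinguishes `0.0` from `-0.0` and treats identical NaN payloads as equal, which suits a regression check where any drift at all is of interest.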

H1's config wiring (using config.models.nli.model after removing DEFAULT_MODEL_ID) changes no behavior: the model value in the default config equals the previously hardcoded one. Workspace tests: 1304 passed + 1 pre-existing flaky (same kebab-mcp HOTFIX #15). cargo clippy --workspace --all-targets -j 1 -- -D warnings clean.
