feat: v0.20.2 - Ask response language (rag-v3) + 8 dogfood findings + search-quality eval baseline #192

Merged

altair823 merged 23 commits from fix/v0-20-2-dogfood-findings into main

2026-05-29 05:27:55 +00:00

Author	SHA1	Message	Date
altair823	ca8c83b1ba	chore(hotfixes): apply PR #192 round-1 review - correct the refusal marker description: `<REFUSE>` marker → based on the presence or absence of a citation marker (`[#number]`) (pipeline.rs:463-486). Consistent with the release-notes correction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 05:22:57 +00:00
altair823	6c611990d8	chore: bump version 0.20.1 -> 0.20.2 v0.20.2: Ask automatic response-language matching (rag-v3) + 8 dogfood findings + eval --config facade patch + search-quality golden baseline (hybrid hit@3=1.0 / MRR=0.833, lexical hit@3=1.0). evidence: tasks/HOTFIXES.md 2026-05-29, docs/release-notes/v0.20.2-draft.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 05:10:42 +00:00
altair823	166b1404e4	docs(release-notes): correct refusal-detection mechanism + O-2 phrasing leader review of writer draft: refusal detection is based on the presence or absence of a citation marker (`[#number]`), not on a special `<REFUSE>` marker. The O-2 phrasing example is also corrected to match the actual rag-v3 rule. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 04:58:08 +00:00
altair823	2d0168b7ab	docs(sync): reflect v0.20.2 surface in README + HANDOFF Sync user-visible surface changes per the CLAUDE.md docs-split rule. README: - [rag] prompt_template_version default rag-v2 → rag-v3 (v0.20.2) - explanation of the v3 rule (answer language = question language) - O-2 known limitation (refusal-language mismatch on small models) HANDOFF: - add a one-line v0.20.2 summary under "bugs/decisions found after merge" - mention the search-quality baseline (hybrid MRR=0.833) + O-2 known limitation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 04:54:26 +00:00
altair823	4afcaf96d2	docs(release-notes): add v0.20.2-draft (rag-v3 response language + search-quality eval infrastructure) Draft the v0.20.2 release notes. Each finding is described in a four-paragraph user-impact structure. - Finding #1/O-2: rag-v3 automatic response-language matching + language-neutral refusal - Finding #2: bulk search input schema finalized (15 fields) - Finding #3: list docs human-readable path improved - Finding #7: distinguish the two index_version fields (vector vs FTS5) - eval --config facade + search-quality baseline (hybrid hit@3=1.0 / MRR=0.833) - Finding #4/#5/#6/#8: docs/schema cleanup - version cascade caveat (rag-v3 → eval compare) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 04:54:19 +00:00
altair823	16c4579399	docs(hotfixes): add 2026-05-29 v0.20.2 dogfood findings + search-quality baseline Record the 8-finding dogfooding round and the search-quality baseline results in HOTFIXES. - summary table of the 8 findings (rag-v3, bulk schema, list docs, index_version, etc.) - Finding O-2 known limitation (refusal-language mismatch on small models) - search-quality baseline table (hybrid MRR=0.833, lexical MRR=0.7) - golden curation lesson (correcting the expected answer for dispatch.py → hit@3 0.9→1.0) - eval logs cross-link Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 04:54:11 +00:00
altair823	40d7faee71	docs(dogfood): add §10.2 search quality baseline scenario (v0.20.2 golden suite) Now that the eval --config facade patch makes it possible to evaluate the dogfood KB directly, add a §10.2 search-quality baseline section to §10 Eval. - golden suite run commands (hybrid + lexical eval run → aggregate) - v0.20.2 metric baseline table (hybrid hit@3=1.0 / MRR=0.833) - qualitative checklist (Korean 2-character hit@3, empty=0, MRR threshold) - golden curation procedure + lesson from the dispatch.py error - existing basic eval run restructured as §10.1 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 04:54:03 +00:00
altair823	a3bb2580bf	test(rag): add rag-v3 dispatch integration test + refresh stale rag-v2 docs (code-review) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 04:46:27 +00:00
altair823	2429189447	docs(spec): reflect search-quality critic round-1 (eval --config, lang-filter non-goal, curation) Incorporates all critic (opus) round-1 findings into the dogfood search-quality eval design spec: BLOCKER-1: §4.4 execution commands now use --config /build/dogfood/config.toml (Task A facade-rule patch makes this the canonical path). §5.1 re-titled from "(follow-up patch)" to "Applied via Task A: recommended operating path"; XDG workarounds demoted to "pre-patch fallback". Intro paragraph updated accordingly. MAJOR-1: §3 Non-Goals gains an explicit bullet: lang/media/code_lang SearchFilters validation is out of scope for this harness (runner uses SearchFilters::default(), runner.rs:151). §4.1 "code search" row no longer claims code_lang filter coverage. MINOR-1: §4.3 step 3 now names kebab inspect doc <id> as the primary chunk-selection path (breaks chunk-level curation loop); search hits demoted to "for auxiliary confirmation". MINOR-2: §4.1 golden category table gains two new rows: Korean N-gram fallback query (compound-word/neologism coverage) and English whole-token exact query (separates substring artefacts). MINOR-3: §4.1 YAML header note added: record corpus_revision in golden file so stale-bail root cause is immediately traceable. NIT: §9 References line numbers corrected (runner.rs:31, metrics.rs:116/144); runner.rs:151 SearchFilters::default() reference added. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 03:43:00 +00:00
altair823	d93b757cf1	fix(cli): thread --config through kebab eval run/aggregate/compare (facade-rule) Cmd::Eval now loads Config via cli.config (same pattern as all other subcommands) before dispatching to the inner match. Each arm now calls the *_with_config variant: run_eval(&opts) → run_eval_with_config(&cfg, &opts) compute_aggregate(run_id) → compute_aggregate_with_config(&cfg, run_id) store_aggregate(run_id, ..) → store_aggregate_with_config(&cfg, run_id, ..) Compare already called compare_runs_with_config but sourced cfg from Config::load(None) — that redundant load is removed; cfg comes from the shared binding above. Fixes the same facade-rule regression pattern as P3-5 / P4-3: previously `kebab --config /build/dogfood/config.toml eval run` silently evaluated the XDG-default (empty) KB instead of the dogfood KB. Also fixes runner.rs test that hardcoded rag-v2 after commit `5719969` bumped the default prompt_template_version to rag-v3. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 03:42:40 +00:00
altair823	571996938c	docs(contract): bump default prompt_template_version to rag-v3 (Todo #1) line 899: only V1 legacy → both V1 and V2 legacy; declares rag-v3 the default from v0.20.2. line 1349 (★): config example default rag-v2 → rag-v3. line 1533 (★): §9 cascade table code constant rag-v2 → rag-v3. after line 287: add a historical-snapshot comment to the answer.v1 example block (n1: model+ptv stale, values unchanged). task spec grep decision: the two rag-v2 mentions in tasks/p9/p9-fb-15 are historical descriptions from when rag-v2 was introduced → kept frozen. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 02:45:13 +00:00
altair823	be79bdb83d	docs(#7): distinguish vector-store vs FTS5 index_version (Todo #7) schema.schema.json models.index_version: state explicitly that this is the vector store (LanceDB) version. search_hit.schema.json index_version: state explicitly that this is the lexical (FTS5) version. search_hit.schema.json retrieval: add the list of inner fields + a description of the hybrid-only fusion (shared hunk). README kebab schema row: add a caution note that index_version means different things in the two places. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 02:45:04 +00:00
altair823	4e76f103c1	docs(#5,#6): clarify retrieval.* nesting + single-mode score relation (Todo #5/#6) Add an explanation of the score ↔ retrieval.* structure to the README "Score interpretation" section: - fusion_score/lexical_score/vector_score/lexical_rank/vector_rank live inside retrieval (not top-level). - in single-mode it is normal for score==fusion_score==lexical/vector_score to hold the same value (Finding X). Add to the search_hit.schema.json score field a description of its relation to score_kind + the reason the values are identical in single-mode. The search_hit.schema.json retrieval/index_version descriptions are included in the Task 12 commit (same hunk). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 02:44:56 +00:00
altair823	4fd672193f	docs(#4): clarify lang vs code_lang semantic and und=code (Todo #4) Add to the lang_breakdown description the fact that natural-language detection is not performed on code documents (lang="und" is normal). Add a new README section explaining lang vs code_lang. task spec grep: the rag-v2 mentions in tasks/p9/p9-fb-15 are historical descriptions → kept frozen. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 02:44:34 +00:00
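altair823	1454321b12	fix(rag): rag-v3 refusal/hedge phrasing follows answer language (Finding O-2) SYSTEM_PROMPT_RAG_V3: replace the Korean-literal refusal/hedge phrasing with language-neutral wording. - If evidence is insufficient, answer "근거가 부족하다" ("evidence is insufficient"). → State in the answer language that evidence is insufficient and answer without [#number] citations. - If evidence is ambiguous, state "확실하지 않다" ("not certain"). → If evidence is ambiguous, state the uncertainty in the answer language. MULTI_HOP_SYNTHESIZE_SYSTEM_PROMPT: same pattern replaced in two places. - If evidence is insufficient, answer "근거가 부족하다". → State in the answer language that evidence is insufficient and answer without [#number] citations. - in the self-check: immediately answer only "근거가 부족하다". → immediately answer only, in the answer language, that evidence is insufficient. The refusal-detection logic (citation-marker based) is unchanged; only the phrasing is made language-neutral. test rag_v3_contains_v2_rules_plus_language_rule: the "확실하지 않다" assert is updated to a "불확실함" ("uncertainty") assert. task spec grep: the rag-v2 mentions in tasks/p9/p9-fb-15 are historical descriptions from when rag-v2 was introduced → kept frozen. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 02:43:28 +00:00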
altair823	649ec35108	feat(cli): add Ollama remote-endpoint hint to kebab init (Todo #8) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 02:19:34 +00:00
altair823	dece5e89fc	feat(bulk): document bulk search input schema + error shape hint (Todo #2) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 02:16:21 +00:00
altair823	3cb49f1f9b	feat(cli): show title + doc_path in list docs human output (Todo #3) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 02:03:10 +00:00
altair823	f5ff823984	docs(dogfood): add RAG response-language scenario (Todo #1 verification)	2026-05-28 21:28:22 +00:00
altair823	b82eaec21a	feat(config): default prompt_template_version rag-v2 -> rag-v3 (Todo #1)	2026-05-28 21:20:59 +00:00
altair823	6daa43375b	feat(rag): add rag-v3 system prompt with response-language matching (Todo #1)	2026-05-28 21:20:18 +00:00
altair823	85efeeca3e	docs(plan): v0.20.2 dogfood findings implementation plan (15 tasks) Written by planner (opus) → critic review attempted → coordinates verified by leader. 8 todos → 15 tasks: 4 code (rag-v3 / list docs / bulk / init) + 4 full-dogfooding verification tasks, one after each finding + 3 docs-only + contract + HOTFIXES/release-notes + version bump. Plan critic round-1 misdiagnosed coordinate blockers (B-1/B-2/M-1/M-2) because of broken environment tooling → the leader verified pipeline.rs/config/cli/bulk/Cargo.toml directly with grep and confirmed the plan coordinates are accurate, then added a "re-check anchor grep" note + a binary-path caution header for the executor. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-28 21:09:39 +00:00
altair823	2b4ba8e104	docs(spec): v0.20.2 dogfood findings design + round-1 critic feedback Designs the 8 todos found in the full v0.20.1 dogfooding (Ask response language rag-v3 / doc.lang docs / bulk input / list title / fusion_score·score_kind / schema index_version / Ollama hint) as a single patch release. Writer worker draft → opus critic round-1 review incorporated: - B1: top-level score placeholder → finalized (score_kind declares the meaning, search.rs:95-99) - M1: state explicitly that forcing the image caption language is out of scope - M2: state explicitly that the config default test (lib.rs:1316) needs updating - M3: full set of bulk input fields (query/mode/k/trust_min/ingested_after/media/tag/lang) - M4: cascade impact of rag-v3 on eval_runs.config_snapshot_json Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-28 20:37:42 +00:00
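
Several commit messages above quote hit@3 and MRR for the search-quality baseline. As a rough illustration of what those two numbers mean, here is a minimal Rust sketch using the standard textbook definitions. The function names `hit_at_k` and `mean_reciprocal_rank` and the sample ranks are hypothetical; this is not the project's metrics.rs, and the sample data is not the dogfood golden suite.

```rust
/// Fraction of queries whose first relevant hit appears within the top `k` results.
/// Each entry is the 1-based rank of the first relevant hit for one query,
/// or `None` when no relevant hit was returned (assumed input shape).
fn hit_at_k(first_relevant_ranks: &[Option<usize>], k: usize) -> f64 {
    if first_relevant_ranks.is_empty() {
        return 0.0;
    }
    let hits = first_relevant_ranks
        .iter()
        .flatten() // skip queries with no relevant hit
        .filter(|&&rank| rank >= 1 && rank <= k)
        .count();
    hits as f64 / first_relevant_ranks.len() as f64
}

/// Mean reciprocal rank: the average of 1/rank over all queries,
/// counting a query with no relevant hit as 0.
fn mean_reciprocal_rank(first_relevant_ranks: &[Option<usize>]) -> f64 {
    if first_relevant_ranks.is_empty() {
        return 0.0;
    }
    let sum: f64 = first_relevant_ranks
        .iter()
        .map(|rank| match rank {
            Some(r) if *r >= 1 => 1.0 / *r as f64,
            _ => 0.0,
        })
        .sum();
    sum / first_relevant_ranks.len() as f64
}

fn main() {
    // Illustrative ranks only: four queries, one of which found nothing relevant.
    let ranks = [Some(1), Some(2), Some(3), None];
    println!("hit@3 = {:.3}", hit_at_k(&ranks, 3));
    println!("MRR   = {:.3}", mean_reciprocal_rank(&ranks));
}
```

Under these definitions, hit@3=1.0 means every query had a relevant result in its top three, while an MRR below 1.0 means that for at least one query the first relevant result was not ranked first.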

