feat(p4-3): rag-pipeline - kb-rag crate + kb-app::ask wiring #23
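Summary of changes
This is the final P4 task and the biggest payoff from the user's point of view: once this PR is merged, `kb ask` actually works (when an Ollama backend is connected). The new crate `kb-rag` holds the 9-stage pipeline, summarized as retrieve → score gate → pack → render → generate → cite-validate → persist, and the stub in `kb-app::ask` is replaced with a real body.

What was done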
New crate `kb-rag`
- `RagPipeline`: `Arc<dyn Retriever + Send + Sync>` + `Arc<dyn LanguageModel + Send + Sync>` + `Arc<SqliteStore>` + `Config`. The pipeline is itself `Send + Sync` (pinned by a compile-time check).
- `AskOpts { k, explain, mode, temperature, seed, stream_sink: Option<mpsc::Sender<String>> }`. `k` uses `config.search.default_k` as its floor, so a caller cannot starve retrieval by accidentally passing 0.

9-stage pipeline
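1. Retrieve: search with a `SearchQuery` through `dyn Retriever`.
2. Score gate: empty hits → `NoChunks` refusal (no LLM call). top-1 < `config.rag.score_gate` → `ScoreGate` refusal (no LLM call; the top-3 candidates are listed in the answer text).
3. Pack: budget = `config.rag.max_context_tokens.saturating_sub(prompt overhead)`. Each hit is formatted as `[#n] doc=… heading=… span=…\n<text>`. If every hit's chunk is unfetchable from the store (deleted between search and pack), the pipeline falls back to a `NoChunks` refusal instead of handing an empty [근거] (evidence) block to the LLM (review MUST-FIX applied).
4. Render: rag-v1 prompt with the spec's verbatim Korean system string + the `[질문]/[근거]` (question/evidence) user template.
5. Generate: via `dyn LanguageModel`. A `SendError` from the stream sink is silently dropped, so generation continues without a panic even when the caller cancels. `max_completion` is `llm.context_tokens().saturating_sub(used_for_input).max(64)` and is not capped by `rag.max_context_tokens` (review MUST-FIX applied: that value is the packing budget for [근거], not the LM output budget).
6. Citation extract: strict regex `\[#(\d{1,3})\]` (compiled once via `OnceLock`). The loose forms `[1]`, `[ #1 ]`, `[#foo]`, `[#1234]`, and `vec![1]` are all rejected, which blocks false positives in prose, Markdown footnotes, and code blocks.
7. Citation validate: an unknown marker such as `[#7]` (when 3 chunks are packed) → LlmSelfJudge. A match on `근거 (가|이) 부족` → LlmSelfJudge (the canonical Korean refusal phrase, "insufficient evidence"). Phrase matches are logged via `tracing::debug`.
8. Build Answer:
   - `citations`: the packed list filtered down to the markers the model actually cited. The wire format is `marker: Some("[1]")` (square-bracketed bare index, design §2.3), kept separate from the prompt-side `[#n]` grammar (review MUST-FIX applied).
   - embedding `ModelRef`: Vector/Hybrid modes read it from `config.models.embedding`; Lexical uses `None`. On ScoreGate/NoChunks refusals the vector retriever was still "consulted", so `Some(...)` is kept (spelled out in a code comment).
   - `TraceId`: `ret_<8-hex>` from `blake3(query, top_score, model_id, ns)`.
   - `usage`: taken from the final `Done` chunk, with a wall-clock fallback when the LLM reports 0.
9. Persist: `SqliteStore::put_answer(&Answer, query, packed_chunks_json)`. Every path (refusals included) INSERTs an `answers` row. `packed_chunks_json` is set only when `opts.explain=true`.

`kb-app::ask` wiring
- `bail!("not yet wired (P4-3)")` is replaced with a real body. Depending on `opts.mode` it builds a `LexicalRetriever` / `VectorRetriever` / `HybridRetriever`, creates an `OllamaLanguageModel` instance, calls `RagPipeline::new`, then `ask()`.
- `kb-app::AskOpts` moves to `pub use kb_rag::AskOpts` (kb-cli's `kb_app::AskOpts` import is unchanged).
- `kb-app/Cargo.toml` adds `kb-rag`, `kb-llm`, `kb-llm-local`. These were forbidden as of P3-5 and are lifted by the P4-3 spec: kb-app is the orchestrator, and `ask` needs both the trait crate and the Ollama adapter.
- The `AskOpts` literal in `kb-cli/src/main.rs` gains `stream_sink: None` (the P9 TUI will connect a real sink).

Tests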
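- `kb-rag` 18: 3 unit (marker regex strictness, `Send + Sync` compile check, per-mode behavior of `embedding_ref_for`) + 15 integration (every spec test row + the new guard test `unfetchable_chunks_fall_back_to_no_chunks`).
- `kb-app` 1 (`#[ignore]`): an end-to-end ask smoke test that requires a real Ollama on `localhost:11434`. Opt in via `cargo test -p kb-app -- --ignored`.

Workspace default: 319 passed / 26 ignored / 0 failed.
`cargo clippy --workspace --all-targets -- -D warnings` clean.

Dependencies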
Allowed deps respected:
`kb-core`, `kb-config`, `kb-search`, `kb-llm`, `kb-store-sqlite`, `serde`, `serde_json`, `regex`, `time`, `tracing`, `thiserror`, plus the forced `anyhow` (trait return types) and `blake3` (TraceId hashing).
Forbidden deps (`kb-source-fs`, `kb-parse-md`, `kb-normalize`, `kb-chunk`, `kb-store-vector`, `kb-embed*`, `kb-llm-local`, `kb-tui`, `kb-desktop`): zero direct dependencies on any of them. Concrete adapters are reached only through trait objects.

Changed files
- `crates/kb-rag/Cargo.toml` (new)
- `crates/kb-rag/src/lib.rs` (new)
- `crates/kb-rag/src/pipeline.rs` (new: the 9-stage pipeline)
- `crates/kb-rag/tests/common/mod.rs` (new: RagEnv + MockRetriever)
- `crates/kb-rag/tests/pipeline.rs` (new)
- `crates/kb-store-sqlite/src/answers.rs` (new: `put_answer`)
- `crates/kb-store-sqlite/src/lib.rs` (mod registration)
- `crates/kb-app/Cargo.toml` (adds kb-rag/kb-llm/kb-llm-local)
- `crates/kb-app/src/lib.rs` (`ask` body wiring)
- `crates/kb-app/tests/ask_smoke.rs` (new, `#[ignore]`)
- `crates/kb-cli/src/main.rs` (adds `stream_sink: None`)
- `Cargo.toml` (workspace member + `regex` workspace dep)
- `Cargo.lock`

Out of scope
- LLM-as-judge eval: P5 uses rule-based `must_contain`.
- `--json` streaming: buffered, following the design §0 Q5 hybrid decision.

Usage example after merge

For reference, see design §0 Q4 (refusal), §0 Q7 (footer), §1.1-1.4 (ask scenes), §2.3 (Answer wire), §3.8 (internal Answer), §6.4 [rag], §10 errors. This PR closes the P4 phase; next is P5-1 golden-fixture-runner.
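
The `k` floor and the score gate described in this PR can be sketched with the standard library alone. This is a minimal illustration, not the code in `crates/kb-rag/src/pipeline.rs`: the `Hit` and `Refusal` shapes, the function names, and the assumption that hits arrive sorted by descending score are all hypothetical.

```rust
/// Hypothetical search hit; the real type lives in the project crates.
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub doc: String,
    pub score: f32,
}

/// The two refusals that happen before any LLM call.
#[derive(Debug, PartialEq)]
pub enum Refusal {
    NoChunks,
    ScoreGate { top_candidates: Vec<String> },
}

/// `default_k` acts as a floor, so a caller passing 0 cannot starve retrieval.
pub fn effective_k(requested: usize, default_k: usize) -> usize {
    requested.max(default_k)
}

/// Assumes `hits` is sorted by descending score, so `hits[0]` is top-1.
pub fn score_gate(hits: &[Hit], gate: f32) -> Result<(), Refusal> {
    let Some(top) = hits.first() else {
        // Empty retrieval: refuse without calling the LLM.
        return Err(Refusal::NoChunks);
    };
    if top.score < gate {
        // Below the gate: refuse and surface up to three candidates.
        let top_candidates = hits.iter().take(3).map(|h| h.doc.clone()).collect();
        return Err(Refusal::ScoreGate { top_candidates });
    }
    Ok(())
}
```

Returning `Result<(), Refusal>` keeps both refusal paths on the error side, so the caller can persist a refusal row and return early before reaching the generate stage.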
P4 terminal task. Implements the user-facing payoff: retrieve → score gate → pack → render → generate → cite-validate → persist. After this commit, `kb ask` actually works against an Ollama backend; the pipeline grounds the answer in retrieved chunks and refuses cleanly when the gate trips or the model self-judges.

New crate kb-rag:
- pub struct RagPipeline { retriever, llm, docs, config }: all Arc<dyn Trait + Send + Sync>, so the pipeline itself is Send + Sync.
- pub fn ask(query, opts) -> Result<Answer> drives the nine-stage flow per spec §1.
- pub struct AskOpts { k, explain, mode, temperature, seed, stream_sink: Option<mpsc::Sender<String>> }. k acts as a floor over config.search.default_k so a low-k caller can't starve retrieval (documented in field doc).

Pipeline stages:
1. Retrieve via the injected dyn Retriever.
2. Score gate: empty hits → NoChunks refusal (no LLM call); top-1 < config.rag.score_gate → ScoreGate refusal (no LLM call) with top-3 candidates listed in the synthesized answer text.
3. Pack: budget = config.rag.max_context_tokens.saturating_sub(prompt overhead). Per-hit `[#n] doc=… heading=… span=…\n<text>` with deterministic enumeration. If every hit's chunk is unfetchable from the store (deleted between search and pack), fall back to NoChunks refusal with a tracing::warn rather than feeding an empty [근거] to the LLM.
4. Render rag-v1 prompt with the spec's verbatim Korean system string + `[질문]/[근거]` user template.
5. Generate via dyn LanguageModel. Single-thread token loop owns the iterator; tokens optionally forward to opts.stream_sink (a `mpsc::Sender<String>`). SendError is silently dropped, so caller cancellation never panics the pipeline. After Done the loop reads (acc, finish_reason, usage) in lockstep with no race. max_completion = llm.context_tokens().saturating_sub(used_for_input).max(64), explicitly NOT capped by config.rag.max_context_tokens (that's the packing budget for [근거], not the LM completion ceiling).
6. Citation extract via STRICT regex `\[#(\d{1,3})\]` (compiled once via OnceLock). Loose forms `[1]`, `[ #1 ]`, `[#foo]`, `[#1234]`, `vec![1]` are all rejected to prevent prose false-positives.
7. Citation validate covers four refusal cases plus the grounded case:
   - unknown marker (e.g. `[#7]` when only 3 packed) → LlmSelfJudge refusal.
   - empty answer with hits → LlmSelfJudge.
   - non-empty + no marker + matches `근거 (가|이) 부족` regex → LlmSelfJudge (model self-refused with the canonical phrase; phrase match logged via tracing::debug for observability).
   - non-empty + no marker + no refusal phrase → LlmSelfJudge (silent ungrounded answers are still refusals).
   - non-empty + ≥1 valid marker → grounded = true.
8. Build Answer per kb_core::Answer shape:
   - citations: filter packed list to exactly the markers cited. Wire format `marker: Some("[1]")` (square-bracketed bare index) per design §2.3, distinct from the prompt-side `[#n]` grammar.
   - embedding ModelRef: read from config.models.embedding for Vector/Hybrid; None for Lexical. Documented deviation since the Retriever trait doesn't expose the embedder. For ScoreGate/NoChunks refusals on Vector/Hybrid the embedding model is still recorded, because the vector retriever WAS consulted even when the gate tripped.
   - TraceId minted as `ret_<8-hex>` from blake3(query, top_score, model_id, ns).
   - retrieval AnswerRetrievalSummary populated.
   - usage from the final Done chunk; latency_ms wall-clock fallback when the LLM reports zero.
   - created_at OffsetDateTime::now_utc().
9. Persist via SqliteStore::put_answer (new inherent method on SqliteStore, not on the DocumentStore trait, since answers aren't documents and adding to kb-core was forbidden). Always inserts, refusals included. packed_chunks_json is null unless opts.explain == true.

kb-store-sqlite extension:
- pub fn put_answer(&Answer, query, packed_chunks_json) -> Result<AnswerId>. Maps all 22 fields of the answers table per V001 schema in a single INSERT under a transaction.
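
Stages 6 and 7 can be illustrated without the `regex` crate. The sketch below swaps the commit's compiled regex for a hand-rolled byte scanner that accepts the same strict `[#n]` grammar for ASCII digits, then applies the validation rules listed above. The names `extract_markers`, `validate`, and `Verdict` are hypothetical, and the 1-based marker numbering is an assumption drawn from the `[#7]`-of-3 example.

```rust
/// Std-only stand-in for the strict pattern `\[#(\d{1,3})\]`:
/// `[`, `#`, one to three ASCII digits, then `]`, with nothing in between.
pub fn extract_markers(text: &str) -> Vec<u32> {
    let bytes = text.as_bytes();
    let mut out = Vec::new();
    let mut i = 0;
    while i + 1 < bytes.len() {
        if bytes[i] == b'[' && bytes[i + 1] == b'#' {
            let start = i + 2;
            let mut end = start;
            // Look at up to four digits so a four-digit run can be rejected.
            while end < bytes.len() && end - start < 4 && bytes[end].is_ascii_digit() {
                end += 1;
            }
            let digits = end - start;
            if (1..=3).contains(&digits) && end < bytes.len() && bytes[end] == b']' {
                // The slice is pure ASCII digits, so parsing cannot fail.
                out.push(text[start..end].parse().unwrap());
                i = end + 1;
                continue;
            }
        }
        i += 1;
    }
    out
}

#[derive(Debug, PartialEq)]
pub enum Verdict {
    Grounded(Vec<u32>),
    LlmSelfJudge,
}

/// `packed` is the number of chunks packed into the prompt (markers 1..=packed).
pub fn validate(answer: &str, packed: usize) -> Verdict {
    let markers = extract_markers(answer);
    let all_known = markers.iter().all(|&n| n >= 1 && (n as usize) <= packed);
    // Empty answer, no marker at all, or any unknown marker is a refusal.
    if answer.trim().is_empty() || markers.is_empty() || !all_known {
        return Verdict::LlmSelfJudge;
    }
    Verdict::Grounded(markers)
}
```

In this sketch the refusal-phrase match does not change the verdict, which mirrors the commit text: an unmarked answer is an LlmSelfJudge refusal with or without the phrase, and the match is only logged.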
kb-app::ask wired:
- bail!("not yet wired (P4-3)") replaced with a real body that builds the retriever per opts.mode (Lexical | Vector | Hybrid), instantiates OllamaLanguageModel from config, constructs RagPipeline, calls pipeline.ask. AskOpts moves to kb-rag and is re-exported via `pub use kb_rag::AskOpts` so kb-cli's `use kb_app::AskOpts` keeps working.
- kb-app/Cargo.toml gains kb-rag, kb-llm, kb-llm-local. P3-5's forbids on these are lifted by P4-3 spec: kb-app is the orchestrator and ask requires both the trait crate and the Ollama adapter.
- kb-cli/main.rs's AskOpts literal updated with stream_sink: None for the CLI path (TUI in P9 will plumb a real sink).

Tests (kb-rag: 18; kb-app: 1 ignored):
- 3 unit in src/pipeline.rs: marker regex strictness (rejects all loose forms with byte-equal expectations), Send+Sync compile check, embedding_ref_for behavior across modes.
- 15 integration in tests/pipeline.rs covering every spec test row + the new "all chunks unfetchable falls back to NoChunks" guard: empty-hits, score-gate, grounded happy path, unknown-marker, prose-`[1]` rejection, `vec![1]` rejection, refusal-phrase, packing-budget overflow, streaming-forwards-to-mpsc, dropped-receiver-no-panic, usage-from-final-Done, answers-row-inserted-for-each-refusal-kind, determinism temp=0 seed=0, Answer JSON shape, unfetchable-chunks-fall-back-to-no-chunks (the new M3 test).
- kb-app/tests/ask_smoke.rs: 1 #[ignore]'d real-Ollama smoke that drives the wired ask end-to-end against `localhost:11434`.

Workspace: 319 passed / 26 ignored / 0 failed. cargo clippy --workspace --all-targets -- -D warnings clean.

Allowed deps respected (kb-core, kb-config, kb-search, kb-llm, kb-store-sqlite, serde, serde_json, regex, time, tracing, thiserror) plus forced waivers anyhow (Retriever / LanguageModel trait return types) and blake3 (TraceId minting).
Forbidden (kb-source-fs, kb-parse-md, kb-normalize, kb-chunk, kb-store-vector direct, kb-embed* direct, kb-llm-local direct, kb-tui, kb-desktop) all absent from `cargo tree -p kb-rag`; concrete adapters reach the pipeline only through trait objects.

Out of scope: reranker between retrieve and pack (P+), multi-turn chat memory (P+), LLM-as-judge eval (P5 uses rule-based must_contain), --json streaming (buffers per §0 Q5 hybrid).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

P4-3 code review: COMMENT only, because of the self-merge gate.
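Because this is the last P4 task, both reviews turned up a number of meaningful findings:
- citation marker wire format mismatch (`"#1"` vs spec `"[1]"`) + several NITs.

All 5 MUST-FIX items and 3 key NITs have been applied to the PR:
- `"#1"` → `"[1]"` (complies with spec §2.3, kept separate from the prompt-side `[#n]`).
- Removed `.min(rag.max_context_tokens)` from `max_completion`: the packing budget and the LM output budget are separate things.
- Wired `_matched_refusal_phrase` into a `tracing::debug` field, a stepping stone for future metric collection.
- Documented in a doc-comment that `AskOpts.k` is a floor over `default_k`.
- Items 6-8: `embedding_ref` DRY cleanup, removal of the empty fixtures directory, and a comment on what `embedding=Some(...)` means on a refusal.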
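Key points:
- Pipeline tail: strict citation regex (`OnceLock` cached) → 4-case validate → build Answer → persist.
- `SqliteStore::put_answer` is added as an inherent method, which keeps the `DocumentStore` trait unpolluted.
- `kb ask` actually works (when an Ollama backend is connected).
- The `Send + Sync` compile-time check prepares for future async UI integration.

Workspace default: 319 passed / 26 ignored / 0 failed. clippy clean.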
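
The `Send + Sync` compile-time check mentioned in the key points is a common Rust idiom. A minimal sketch follows, with stand-in traits and a reduced `RagPipeline` that are assumptions for illustration, not the real definitions in `kb-rag`:

```rust
use std::sync::Arc;

// Stand-in traits; the real ones live in the project's trait crates.
pub trait Retriever {
    fn search(&self, query: &str) -> Vec<String>;
}
pub trait LanguageModel {
    fn context_tokens(&self) -> u32;
}

// Reduced pipeline: only the two trait-object fields are modeled here.
pub struct RagPipeline {
    pub retriever: Arc<dyn Retriever + Send + Sync>,
    pub llm: Arc<dyn LanguageModel + Send + Sync>,
}

/// Compiles only when `T` is both `Send` and `Sync`.
pub fn assert_send_sync<T: Send + Sync>() {}

// Never called at runtime; it only has to type-check. If a later change adds
// a non-thread-safe field (for example an `Rc`), the build fails here.
const _: fn() = || assert_send_sync::<RagPipeline>();
```

Because the check is a compile error rather than a test failure, a regression is caught before any test runs.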
All inline comments are notes on key decisions. The merge can go ahead. After merging, a real-Ollama `kb ask` smoke test can be run in the `/tmp/kb-smoke/` workspace created above (prerequisites: `ollama serve` + `ollama pull qwen2.5:7b-instruct`).

P4 phase complete; next is P5-1 golden-fixture-runner.
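
The completion-budget fix listed among the MUST-FIX items can be shown as plain arithmetic. This is a sketch under assumptions: the `u32` token type and both function names are hypothetical, and the pre-fix call order (`.min(...)` before `.max(64)`) is inferred from the reported symptom, not copied from the diff.

```rust
/// Post-fix rule: whatever the LM context window has left after the input,
/// with a floor of 64 tokens.
pub fn max_completion(context_tokens: u32, used_for_input: u32) -> u32 {
    context_tokens.saturating_sub(used_for_input).max(64)
}

/// Assumed pre-fix rule: the packing budget was also applied as a cap.
pub fn max_completion_before_fix(
    context_tokens: u32,
    used_for_input: u32,
    packing_budget: u32,
) -> u32 {
    context_tokens
        .saturating_sub(used_for_input)
        .min(packing_budget)
        .max(64)
}
```

With an 8192-token window, 1000 input tokens, and a test packing budget of 50, the assumed pre-fix rule yields 64 while the fixed rule yields 7192, which is the gap the review describes.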
@@ -818,0 +867,4 @@
    let emb_dyn: Arc<dyn Embedder> = emb;
    let vec_retr = Arc::new(VectorRetriever::with_settings(
        vec_dyn,
        emb_dyn,
`ask_with_config` builds the retriever according to `opts.mode`, instantiates `OllamaLanguageModel`, and assembles everything with `RagPipeline::new`. This is exactly the point where the `kb-rag` / `kb-llm` / `kb-llm-local` dependencies, forbidden in P3-5 and lifted by the P4-3 spec, become necessary. With `pub use kb_rag::AskOpts`, kb-cli's `use kb_app::AskOpts` import stays unchanged, so backward compatibility is preserved.

@@ -0,0 +162,4 @@
    if packed_entries.is_empty() {
        tracing::warn!(
            target: "kb-rag",
            chunks_returned = hits.len(),
Guard for the unfetchable chunks → NoChunks fallback. Even when search returns hits, if every `get_chunk` returns None (chunks deleted between search and pack), the pipeline issues a NoChunks refusal instead of passing an empty [근거] block to the LLM. Diagnostic accuracy matters here: this path used to be misclassified as LlmSelfJudge. `tracing::warn` surfaces why the fallback happened, and a new test pins the behavior.

@@ -0,0 +177,4 @@
    // Completion budget is bounded only by what the LM context window
    // has left after the input. NOTE: `rag.max_context_tokens` is the
    // *packing budget* for the [근거] block (used by `pack_context`)
    // — it is intentionally NOT used here as a completion cap.
The decision to remove `.min(rag.max_context_tokens)` from `max_completion`. The code comment separates the two budgets, which mean different things: `rag.max_context_tokens` caps [근거] packing, while `llm.context_tokens()` caps LM input and output combined. This closes the bug where a small packing cap (for example 50 in tests) squeezed the LM output down to 64.

@@ -0,0 +192,4 @@
    system: system.clone(),
    user: user.clone(),
    stop: vec!["\n\n[질문]".to_string()],
    max_tokens: max_completion,
The single-threaded token loop owns the iterator and forwards tokens when `opts.stream_sink` is present; a `SendError` is silently dropped. Even if the caller drops the receiver, answer collection and persist run to completion without a panic, as verified by the `dropped_receiver_does_not_abort_generation` test. If a user cancels an ask from the UI / TUI, the `answers` row is still preserved.

@@ -0,0 +254,4 @@
    });
    let trimmed_answer = acc.trim();
    let matched_refusal_phrase = refusal_phrase.is_match(&acc);
    let grounded = !trimmed_answer.is_empty()
The STRICT marker regex `\[#(\d{1,3})\]` is compiled only once via OnceLock. The loose forms (`[1]`, `vec![1]`, `[ #1 ]`, `[#foo]`, `[#1234]`) are all rejected. Blocking false positives in prose, Markdown footnotes, and code blocks directly determines the accuracy of the grounded judgment: if the model emits only `[1]`, it is treated as "no citation" and recorded as an LlmSelfJudge refusal.

@@ -0,0 +275,4 @@
    // `[1]`. The `[#1]` form is the *prompt-side* citation
    // grammar (what the LLM emits in its text); the wire-side
    // `AnswerCitation.marker` strips the `#`.
    marker: Some(format!("[{n}]")),
The decision to correct the citation marker wire format from `"#1"` to `"[1]"`. The two grammars are now clearly separated: prompt-side is `[#n]` (what the LLM emits in its text), wire-side is `[n]` (design §2.3 line 228). A code comment states the difference. This closes the item the first review flagged as a spec violation.

@@ -0,0 +540,4 @@
    /// Build the `ModelRef` recorded in `Answer.embedding` for a given
    /// retrieval mode. `Lexical` paths leave it `None`; vector / hybrid
    /// paths attach the configured embedding model so `kb explain` can
    /// later identify which embedder shaped the retrieval (even on
The private helper `embedding_ref_for(mode, cfg)` removes the duplication between the main path and the score-gate refusal path. On a ScoreGate refusal the vector retriever was still "consulted", so `Some(ModelRef)` is the semantically correct value. Stating this in a comment heads off a future regression in which someone asks "why isn't embedding None on a refusal?" and "fixes" it.

@@ -0,0 +1,113 @@
    //! `answers` row writer (P4-3 — design §5.7).
`SqliteStore::put_answer` is added as an inherent method; keeping it off the `DocumentStore` trait was the right call. An answer is not a document, and extending `kb-core::DocumentStore` would (a) require adding a new type to kb-core, which violates the spec, and (b) force every DocumentStore implementation to implement a method it may never use. With the functionality kept inherent in `kb-store-sqlite`, only the caller (kb-rag) needs to know about it explicitly.
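
The stream-sink behavior discussed in the review (forward tokens when a sink exists, ignore a `SendError`) can be sketched with `std::sync::mpsc`. Whether the real pipeline uses the standard-library channel is an assumption, and `collect_tokens` is a hypothetical helper, not a function from `kb-rag`.

```rust
use std::sync::mpsc::Sender;

/// Accumulates every token into the answer text and forwards each one to the
/// optional sink. A send failure is deliberately ignored.
pub fn collect_tokens<I>(tokens: I, sink: Option<&Sender<String>>) -> String
where
    I: IntoIterator<Item = String>,
{
    let mut acc = String::new();
    for tok in tokens {
        if let Some(tx) = sink {
            // `send` fails only when the receiver has been dropped, which
            // means the caller stopped listening; generation still continues.
            let _ = tx.send(tok.clone());
        }
        acc.push_str(&tok);
    }
    acc
}
```

Since the accumulated string does not depend on the sink, the caller can still persist the full answer after the receiver side has gone away.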