feat(llm): [models.llm] request_timeout_secs config + recommended model guide

Bundles two findings from the extended v0.17.0 dogfooding session
(2026-05-25) into a single PR.

(1) Move the hard-coded 300s timeout in llm.generate_stream out into a
    config knob. On CPU-only hosts, 8B+ models (gemma4:e4b etc.) could
    not finish the first RAG answer within 5 minutes and failed with
    `error: kb-rag: llm.generate_stream`.

    - Add an additive request_timeout_secs: u64 field to
      kebab-config::LlmCfg
      (#[serde(default = "default_llm_request_timeout_secs")],
      default 300). Old configs that omit the key still parse and
      behave the same.
    - Env override: KEBAB_MODELS_LLM_REQUEST_TIMEOUT_SECS.
    - Remove the REQUEST_TIMEOUT constant from
      kebab-llm-local::ollama.rs; OllamaLanguageModel::new now builds
      the reqwest client with
      Duration::from_secs(llm.request_timeout_secs). Doc comments
      updated to match.
    - 3 new unit tests: default pinned at 300 / env override / legacy
      config (field missing) backward-compat.
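
The resolution order described in the bullets above (default 300, value from `[models.llm]`, env override) can be sketched with the standard library alone. This is a minimal illustration, not the code from this commit: `resolve_request_timeout` and `DEFAULT_REQUEST_TIMEOUT_SECS` are hypothetical names, and ignoring an unparsable env value is an assumption about the desired behavior.

```rust
use std::time::Duration;

/// Mirrors the legacy hard-coded ceiling (300 s). Hypothetical constant name.
const DEFAULT_REQUEST_TIMEOUT_SECS: u64 = 300;

/// Hypothetical helper: resolve the effective request timeout.
/// `configured` is the value parsed from `[models.llm]` (`None` if the key is absent).
/// `env_value` is the raw content of KEBAB_MODELS_LLM_REQUEST_TIMEOUT_SECS, if set.
fn resolve_request_timeout(configured: Option<u64>, env_value: Option<&str>) -> Duration {
    // Legacy configs omit the key entirely, so fall back to the default.
    let from_config = configured.unwrap_or(DEFAULT_REQUEST_TIMEOUT_SECS);
    // Assumption: an env value that is not a valid u64 is ignored, not an error.
    let secs = env_value
        .and_then(|raw| raw.trim().parse::<u64>().ok())
        .unwrap_or(from_config);
    Duration::from_secs(secs)
}

fn main() {
    let env = std::env::var("KEBAB_MODELS_LLM_REQUEST_TIMEOUT_SECS").ok();
    // `None` here stands in for a legacy config without the key.
    let timeout = resolve_request_timeout(None, env.as_deref());
    println!("effective timeout: {}s", timeout.as_secs());
}
```

The three cases correspond to the three unit tests listed above: no key and no env variable yields 300 s, an env value wins over the configured one, and a legacy config keeps the old behavior.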

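(2) docs: one paragraph added to the README prerequisites section and
    to the ollama notes in docs/SMOKE.md: on CPU-only hosts with
    RAM ≤ 16 GB, ≤ 4B Q4 models are recommended
    (gemma3:4b / qwen2.5:3b / phi3:mini). It also describes the
    timeout pattern to expect when trying 8B+ models and how to use
    the request_timeout_secs knob.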

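    HOTFIXES 2026-05-25 entry: records the two changes above plus
    open items (the identical hard-coded 300s in kebab-parse-image
    OCR is out of scope and listed as a follow-up, along with a
    later pass to emphasize the ask --stream recommendation).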

workspace cargo test -j 1 + clippy pass. The code change is
backwards-compatible (additive serde field), so existing users are
unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 03:01:03 +00:00
parent 578a60e3bb
commit 3f5e0e6e90
5 changed files with 160 additions and 9 deletions

@@ -48,10 +48,12 @@ use serde::{Deserialize, Serialize};
 use crate::error::LlmError;
-/// Hard ceiling on a single HTTP exchange. Cold-loading a 14B model on
-/// first call can take ~30s; 5 minutes is generous without being
-/// open-ended.
-const REQUEST_TIMEOUT: Duration = Duration::from_secs(300);
+// v0.17.0 post-dogfood: the per-request ceiling now lives in
+// `kebab_config::LlmCfg::request_timeout_secs` (default 300s) so users
+// running larger models on CPU-only hosts can extend it without a
+// rebuild. Cold-loading an 8B+ model on first call routinely takes
+// 60-90 s plus multi-minute inference; 300s was the legacy hard
+// ceiling and remains the default for back-compat.
 /// `reqwest::blocking` adapter implementing [`LanguageModel`] over Ollama's
 /// local HTTP API. Construction is cheap and offline; the first network
@@ -79,7 +81,7 @@ impl OllamaLanguageModel {
     pub fn new(config: &kebab_config::Config) -> anyhow::Result<Self> {
         let llm = &config.models.llm;
         let client = reqwest::blocking::Client::builder()
-            .timeout(REQUEST_TIMEOUT)
+            .timeout(Duration::from_secs(llm.request_timeout_secs))
             .build()?;
         Ok(Self {
             client,
@@ -262,9 +264,11 @@ struct OllamaLine {
 ///
 /// Timeout invariant: the iterator has no inherent stop condition for an
 /// indefinitely-stalled server — only the underlying
-/// `reqwest::blocking::Client`'s read timeout (`REQUEST_TIMEOUT`, 300s)
-/// breaks the hang. Callers needing tighter cancellation should adjust
-/// the client timeout in [`OllamaLanguageModel::new`].
+/// `reqwest::blocking::Client`'s read timeout (configured via
+/// `kebab_config::LlmCfg::request_timeout_secs`, default 300 s) breaks
+/// the hang. Callers needing tighter / looser bounds should set
+/// `[models.llm] request_timeout_secs = N` (or
+/// `KEBAB_MODELS_LLM_REQUEST_TIMEOUT_SECS=N`) before building.
 struct OllamaStream {
     reader: BufReader<reqwest::blocking::Response>,
     line_buf: Vec<u8>,