kebab

Author	SHA1	Message	Date
altair823	d11a810119	feat(kebab-parse-image): P6-1 image extractor + EXIF whitelist - Add new crate kebab-parse-image (the 19th in the workspace). Implements ImageExtractor, which converts MediaType::Image(_) assets into a single-block CanonicalDocument. - parser_version "image-meta-v1" (§9 versioning). - The body contains exactly one Block::ImageRef; the OCR / caption fields are left as None and are filled in by P6-2 / P6-3. - EXIF whitelist (§9.1, minimizes the PII surface): Make / Model / Software / DateTimeOriginal / Orientation / GPSLatitude(+Ref) / GPSLongitude(+Ref). MakerNote / Thumbnail / all other tags are discarded. DateTime is converted from EXIF "YYYY:MM:DD HH:MM:SS" to ISO-8601. GPS DMS triple + N/S/E/W ref → signed decimal degrees. - Dimensions: only the header is read via image::ImageReader to obtain (w, h, format). Exceeding the 16k×16k cap or a decode failure → metadata.user.dimensions = null + a Provenance Warning event (not an Err). Failure to recognize the format itself → anyhow::Error (caller skips). - SourceSpan::Region { 0, 0, w, h } marks the full image area. Determinism: same bytes + same parser_version → same doc_id + block_id (uses the §4.2 ID recipe as-is). - metadata.source_type = Reference, trust_level = Primary, lang = "und". title = file name without extension, alt = file name. - Dependency boundary (§8): kebab-core only + image 0.25 (default features off; png/jpeg/webp/gif/tiff only), kamadak-exif 0.6, anyhow / serde / serde_json / time / tracing / thiserror. No references to kebab-source-fs · parse-md · store-* · embed* · llm* · rag · UI crates. - 14 tests (4 unit + 10 integration): • PNG dimension extraction, JPEG EXIF GPS extraction (DMS → decimal conversion accurate to 1e-6), PNG without EXIF → empty map, corrupted PNG → warning + null dims (no panic), unrecognizable bytes → Err, determinism, snapshot, supports() matching, media_type mismatch rejection. • Fixtures are generated in memory (PNG via the image crate; EXIF JPEG by building an EXIF blob with the kamadak Writer and splicing it in as APP1 right after SOI), so no binary fixtures are committed. - HEIC / RAW are out of scope for v1 per the spec (unsupported by the image crate; an Apple Vision sidecar will fill this in later in P+). - tasks/p6/p6-1-image-extractor-exif.md status: planned → completed. contract: docs/superpowers/specs/2026-04-27-kebab-final-form-design.md sections: §3.4 Block::ImageRef + ImageRefBlock, §3.7a OcrText / ModelCaption stubs, §9.1 image extraction policy, §9 versioning.	2026-05-02 05:05:47 +00:00
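The two EXIF conversions named in the commit above (DMS triple + hemisphere ref → signed decimal degrees, and EXIF timestamp → ISO-8601) can be sketched with the standard library alone. This is a minimal illustration, not the crate's code: the function names `dms_to_decimal` and `exif_datetime_to_iso8601` are hypothetical, the DMS values are assumed to be already converted from EXIF rationals to `f64`, and the timestamp check validates shape only, not calendar validity.

```rust
/// Convert a GPS degrees/minutes/seconds triple plus a hemisphere
/// reference ('N', 'S', 'E', 'W') into signed decimal degrees.
/// Returns None for an unrecognized reference letter.
/// (Hypothetical helper; the real extractor's signature is not shown in the log.)
fn dms_to_decimal(deg: f64, min: f64, sec: f64, hemisphere: char) -> Option<f64> {
    let magnitude = deg + min / 60.0 + sec / 3600.0;
    match hemisphere.to_ascii_uppercase() {
        'N' | 'E' => Some(magnitude),
        'S' | 'W' => Some(-magnitude),
        _ => None,
    }
}

/// Convert an EXIF timestamp "YYYY:MM:DD HH:MM:SS" into the ISO-8601
/// form "YYYY-MM-DDTHH:MM:SS" (no offset, since EXIF carries none here).
/// Only the shape is checked: digits, ':' and ' ' in the right places.
fn exif_datetime_to_iso8601(raw: &str) -> Option<String> {
    let bytes = raw.as_bytes();
    if bytes.len() != 19 {
        return None;
    }
    for (i, &c) in bytes.iter().enumerate() {
        let ok = match i {
            4 | 7 | 13 | 16 => c == b':',
            10 => c == b' ',
            _ => c.is_ascii_digit(),
        };
        if !ok {
            return None;
        }
    }
    // Every byte was verified to be ASCII, so these slices are on char boundaries.
    Some(format!(
        "{}-{}-{}T{}",
        &raw[0..4],
        &raw[5..7],
        &raw[8..10],
        &raw[11..19]
    ))
}
```

Returning `Option` rather than panicking mirrors the commit's stated policy that malformed metadata should degrade gracefully instead of aborting extraction.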
altair823	f1a448d6dc	refactor(rename): kb → kebab: binary, env vars, XDG paths, file renames Second commit. Bulk rename of the user-facing surface (CLI binary, env vars, XDG paths) plus the short `kb` tokens in code (`KB_`, `kb.sqlite`, `/kb/`, tracing target). Also three file renames: - design doc `2026-04-27-kb-final-form-design.md` → `2026-04-27-kebab-final-form-design.md` - initial report `kb_local_rust_report.md` → `kebab_local_rust_report.md` - workspace ignore `.kbignore` → `.kebabignore` ## Changes - `crates/kebab-cli/Cargo.toml`: `[[bin]] name = "kb"` → `"kebab"`. - `crates/kebab-cli/src/main.rs`: `#[command(name = "kb", …)]` → `name = "kebab"`. - Every `KB_` env var (code + docs + tests) → `KEBAB_`. Covers the apply_env prefix matching and all 30+ setting keys. - XDG paths: `~/.config/kb` / `~/.local/share/kb` / `~/.cache/kb` / `~/.local/state/kb` → `~/.config/kebab` and so on. Covers config defaults, the expand_path tests, and the hardcoded values in paths.rs. - SQLite filename: `kb.sqlite` → `kebab.sqlite` (both the `SQLITE_FILE` const and the hardcoded values in tests). - tracing target: `target: "kb-"` → `"kebab-"` (10+ sites). - snapshot fixture: `.kbignore` → `.kebabignore` (`fixtures/source-fs/tree-1.snapshot.json` updated). ## Verification - `cargo test --workspace -j 1` clean (run serially to avoid linker OOM). - `cargo clippy --workspace --all-targets -- -D warnings` clean. The docs sweep follows in the next commit. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 04:01:35 +00:00
altair823	911fb49550	refactor(rename): kb crates → kebab: Cargo packages, folders, Rust modules First step of renaming the project from `kb` to `kebab`. - workspace `Cargo.toml`: members `crates/kb-` → `crates/kebab-`, repository URL `altair823/kb` → `altair823/kebab`. - Rename 18 crate folders via `git mv` (preserves history). - Each crate's `Cargo.toml`: `name = "kb-"` → `"kebab-"`, path deps `../kb-` → `../kebab-`. - All `.rs` files: the 18 `kb_<id>` snake-case module paths (`kb_core`, `kb_config`, `kb_app`, `kb_cli`, `kb_eval`, `kb_search`, `kb_chunk`, `kb_normalize`, `kb_source_fs`, `kb_parse_md`, `kb_parse_types`, `kb_store_sqlite`, `kb_store_vector`, `kb_embed`, `kb_embed_local`, `kb_llm`, `kb_llm_local`, `kb_rag`) → `kebab_<id>` via a bulk sed (using the word boundary \\b so that the abbreviation "kb" inside English sentences is left untouched). The CLI binary name (`[[bin]] name = "kb"`), the `KB_*` environment variables, XDG paths, tracing targets, and the docs sweep follow in the next commit. ## Verification - `cargo check --workspace` clean; committed after every crate built successfully. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 03:28:08 +00:00
altair823	ee1f2339dd	fix(p5-2): apply push-time review items: citation/refusal correctness + nits Applies the two reviewers' 4 should-fix items + 5 nits before push. ## should-fix - `citation_coverage`: stop an empty citations[] from leaking a 1.0 through the vacuous truth of `Iterator::all`; changed to `!is_empty() && all(non-empty path)`. Also removed the dead `_store: &SqliteStore` argument from the signature (call sites + test helpers cleaned up). - `refusal_correctness`: in a lexical-only run, a query with `answer == None` no longer increments the denominator (NaN/null output); the previous automatic fail distorted the metric's meaning. Added new unit test `refusal_correctness_nan_for_non_rag_run`. - `groundedness`: goldens with `must_contain.is_empty() && forbidden.is_empty()` are excluded from the denominator, so unconfigured entries do not receive a free 1.0. Added new unit test `groundedness_skips_unconfigured_goldens`. - Corrected a factual error in the `kb-cli/Cargo.toml` rationale comment: kb-eval depends on kb-app, not the other way around. ## nits - `KB_EVAL_GOLDEN` / `DEFAULT_GOLDEN_PATH` duplication: unified as `pub(crate)` in `metrics::`, imported by `runner`. - `render_report_md`: `{:?}` on `ComparisonKind` → an explicit lowercase mapping function (`win`/`loss`/`draw`/`regression`), consistent with the JSON serialization convention. - Defensive comment on the risk of `extract_chunker_version` silently matching `None == None`. - The `let mut` suppress hack in the `delta_null_when_either_nan` test → cleaned up with struct update syntax. - Removed dead code: the `empty_store` test helper + the per-call `mem::forget(tmp)`. ## Additional spec doc Added 4 items to the deviations section of `tasks/p5/p5-2-metrics-compare.md`: - crate-level `kb-app` dep of `kb-eval`: inherited from P5-1; the new module surface does not import it. - weakened `citation_coverage` resolver: waiting on `document_exists_by_path`. - `refusal_correctness` is NaN for non-RAG runs. - `groundedness` skips no-check goldens. ## Verification - `cargo test -p kb-eval` 35/35 (18 unit + 2 loader + 8 integration + 7 runner; 3 new unit tests). - `cargo clippy --workspace --all-targets -- -D warnings` clean. - `compare_report_snapshot_matches_fixture` passes unchanged; the new behaviour does not affect the snapshot inputs (lexical-only, no must_contain, no should-refuse). 
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 03:17:32 +00:00
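The `citation_coverage` fix in the commit above hinges on the fact that `Iterator::all` returns `true` for an empty iterator. A minimal sketch of the guard follows; the `Citation` struct and both function names are hypothetical stand-ins, and the aggregate definition (fraction of answers whose citations are all non-empty) is an assumption about how the metric is computed.

```rust
/// Hypothetical stand-in for a citation record; only the path matters here.
struct Citation {
    path: String,
}

/// An answer counts as covered only if it has at least one citation and
/// every citation carries a non-empty path. Without the `!is_empty()`
/// guard, `all` on an empty slice is vacuously true.
fn answer_is_covered(citations: &[Citation]) -> bool {
    !citations.is_empty() && citations.iter().all(|c| !c.path.is_empty())
}

/// Fraction of answers that are covered. Returns NaN when there are no
/// answers at all (zero denominator), matching the NaN/null convention
/// the log describes for its metrics.
fn citation_coverage(answers: &[Vec<Citation>]) -> f64 {
    if answers.is_empty() {
        return f64::NAN;
    }
    let covered = answers.iter().filter(|c| answer_is_covered(c)).count();
    covered as f64 / answers.len() as f64
}
```

With one cited answer and one answer carrying an empty citation list, this sketch yields 0.5 rather than the 1.0 the unguarded `all` would have produced.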
altair823	d9a5b88d27	feat(p5-2): kb-eval metrics + compare: AggregateMetrics, CompareReport, kb eval CLI P5-2 implementation. On top of the stored eval_runs / eval_query_results: - `kb_eval::metrics`: computes hit@k / MRR / recall@k_doc / citation_coverage / groundedness / empty_result_rate / refusal_correctness. NaN metrics (zero denominator) become JSON null. 4-decimal rounding + added Deserialize so that aggregate_json round-trips. - `kb_eval::compare`: compares two runs → CompareReport (per-metric Δ + per-query Win/Loss/Draw/Regression). On chunker_version drift, falls back gracefully to doc-id matching (chunker_version_match: "fallback_doc"); refuses when the `strict` option is set. - `render_report_md`: human-readable Markdown (aggregates + Wins/Losses/Regressions tables). - Added `SqliteStore::{load_eval_run, load_eval_query_results, update_eval_run_aggregate}` + owned `EvalRunRecord` / `EvalQueryResultRecord`; the write-side borrow shape is unchanged. - `kb eval` CLI: `run` (delegates to P5-1), `aggregate <id>`, `compare <a> <b> [--strict-chunker-version] [--write-report]`. `--json` emits the raw CompareReport; the default output is Markdown. ## Spec deviations (intentional, documented) - Graceful matching is doc-id-only (chunker_version_match: "fallback_doc"); 50% span overlap is deferred to P6+ because keeping both sides' chunks at the same time after a chunker re-index is not realistic. - Added `*_with_config` helpers: integration tests drive them with a TempDir Config. The no-arg forms delegate to Config::load(None). - The CLI wires kb-cli → kb-eval directly (avoids a kb-app cycle). The intent behind the DoD's "via kb-app" was a single facade, but it would create a cycle. - Added `AggregateMetrics: Deserialize` for the aggregate_json round-trip. ## Verification - `cargo test -p kb-eval` 30/30 (15 unit + 2 loader + 8 metrics+compare integration + 7 runner). 1 of the 8 integration tests is a snapshot (`compare-1.json`). - `cargo test -p kb-store-sqlite` 33/33. - `cargo clippy --workspace --all-targets -- -D warnings` clean. - No forbidden imports (`kb-source-fs\|kb-parse\|kb-normalize\|kb-chunk\| kb-store-vector\|kb-embed\|kb-search\|kb-llm\|kb-rag\|kb-tui\|kb-desktop\| kb-app`; kb-app is absent from the metrics/compare modules and only runner uses it). 
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 03:05:13 +00:00
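The "NaN becomes JSON null, values rounded to 4 decimals" rule from the commit above can be illustrated with two of the listed metrics. This is a sketch under assumptions: the function names are hypothetical, `Option<f64>` stands in for a nullable JSON number, ranks are assumed 1-based, and the exact definitions of hit@k and MRR used by kb-eval are not given in the log, so the conventional ones are used.

```rust
/// Round to four decimal places, as the commit describes for aggregates.
fn round4(x: f64) -> f64 {
    (x * 10_000.0).round() / 10_000.0
}

/// Mean of the values, or None when there is nothing to average.
/// None plays the role of JSON null for a zero-denominator metric.
fn mean_or_null(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    Some(round4(values.iter().sum::<f64>() / values.len() as f64))
}

/// Mean reciprocal rank. Each entry is the 1-based rank of the first
/// relevant hit for one query, or None when the query missed entirely.
fn mrr(first_hit_ranks: &[Option<usize>]) -> Option<f64> {
    let reciprocal: Vec<f64> = first_hit_ranks
        .iter()
        .map(|r| r.map_or(0.0, |rank| 1.0 / rank as f64))
        .collect();
    mean_or_null(&reciprocal)
}

/// hit@k: fraction of queries whose first relevant hit is within the top k.
fn hit_at_k(first_hit_ranks: &[Option<usize>], k: usize) -> Option<f64> {
    let hits: Vec<f64> = first_hit_ranks
        .iter()
        .map(|r| if matches!(r, Some(rank) if *rank <= k) { 1.0 } else { 0.0 })
        .collect();
    mean_or_null(&hits)
}
```

For three queries with first-hit ranks 1, 3, and a miss, this gives an MRR of 0.4444 and hit@1 of 0.3333; an empty run gives `None` for both.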
altair823	e6ff9c412c	fix(p5-1): apply deferred review items — App reuse + expand_path hoist + nits - kb-app: promote App to pub, add open_with_config / search / ask methods so kb-eval (and future TUI) can amortize embedder + vector store + LLM cold-start across many queries on one App instance. Memoization is per-instance via OnceLock; *_with_config free functions delegate. - kb-config: add canonical expand_path helper + 8 unit tests; drop the 4 duplicate copies in kb-store-sqlite, kb-store-vector, kb-embed-local, kb-eval (net: -6 duplicate tests, +8 canonical tests). - kb-eval: extract elapsed_ms_u32 helper, drop redundant tracing debug log (with_context already names path on error), replace dead-port :1 test with bind-then-release ephemeral port. Verified: cargo clippy --workspace --all-targets -D warnings clean, all crate tests green (kb-app 12+3 ignored, kb-eval 11, kb-config 17, kb-store-sqlite 33, kb-store-vector 7+8 AVX-gated, kb-embed-local 7+7). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 18:55:23 +00:00
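The per-instance memoization mentioned in the commit above (one `App` amortizing cold-start cost across many queries via `OnceLock`) can be sketched as follows. The `Embedder` struct, its `model_id` field, and the `init_count` counter are hypothetical and exist only to make the caching observable; the sketch also uses the infallible `get_or_init`, whereas a real initializer that can fail would need a different shape.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};

/// Hypothetical stand-in for an expensive-to-build component.
struct Embedder {
    model_id: String,
}

/// Holds lazily initialized components for the lifetime of one instance.
struct App {
    embedder: OnceLock<Arc<Embedder>>,
    /// Counts how many times the initializer actually ran (illustration only).
    init_count: AtomicUsize,
}

impl App {
    fn new() -> Self {
        App {
            embedder: OnceLock::new(),
            init_count: AtomicUsize::new(0),
        }
    }

    /// The first call pays the construction cost; later calls clone the cached Arc.
    fn embedder(&self) -> Arc<Embedder> {
        let cached = self.embedder.get_or_init(|| {
            self.init_count.fetch_add(1, Ordering::SeqCst);
            Arc::new(Embedder {
                model_id: "hypothetical-model".to_string(),
            })
        });
        Arc::clone(cached)
    }
}
```

Because the cache lives on the instance rather than in a global, two `App` values built from different configs never share state, which is what makes reuse across queries safe for an eval runner.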
altair823	58a11cc2b8	feat(p5-1): kb-eval crate — golden-fixture runner + eval persistence - new kb-eval crate: load_golden_set (YAML) + run_eval (per-query search/ask + persistence) - new kb-store-sqlite::eval module: record_eval_run_with_results (transactional), document_exists / chunk_exists probes - fixtures/golden_queries.yaml: 5-entry KO+EN template - tests: 13 pass (loader: parse, dup-id, missing chunk_id; runner: elapsed, snapshot, error capture, JSONL, determinism, persistence, config_snapshot) - per_query.jsonl mirror written to runs_dir/<run_id>/ - temperature=0 + fixed seed → byte-identical per_query.jsonl (lexical) deviations from spec (documented in code): - run_id uses uuid::Uuid::now_v7().simple() (timestamp-ordered hex) instead of ULID — uuid already in workspace deps - load_golden_set_validated kept #[cfg(test)] pub(crate) — production inlines validate_against_db - snapshot fixture uses normalized projection (id/query/mode/first_hit) — full byte-determinism covered by separate test - index_version in config_snapshot left null (composed per call by kb-app, not config-level) deferred to follow-up: - App reuse across queries (currently rebuilds App per query) - expand_path hoist to kb-config (3 crate clones now) - --max-queries flag (deferred to P5-2 per updated spec) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 18:01:09 +00:00
altair823	9dde01eb9f	fix(rag): normalize RRF fusion_score to [0,1] + log post-merge hotfixes ## Bug config.rag.score_gate default 0.05 was incompatible with hybrid RRF fusion_score: raw RRF tops out at num_retrievers / (k_rrf + 1) ≈ 0.0328 at the default k_rrf=60, so every hybrid `kb ask` tripped ScoreGate refusal even when the top hit was perfectly aligned across both retrievers. Symptomatic on the post-P4-3 manual smoke at /tmp/kb-smoke/ pointed at 192.168.0.47 Ollama: $ kb ask "Rust ownership 모델의 핵심 규칙은 뭐야?" --mode hybrid 근거 부족. KB에 해당 내용 없음. # top fusion_score = 0.0164 Per-mode score_gate (lexical_score_gate / vector_score_gate / hybrid_score_gate) was rejected because it forces every consumer (CLI, eval, TUI) to know which mode picks which threshold. Score normalization solves it at the source. ## Fix crates/kb-search/src/hybrid.rs divides every fused score by 2 / (k_rrf + 1), the theoretical RRF maximum with two retrievers each contributing rank 1. After normalization: - both retrievers agree on rank 1 → fusion_score = 1.0 - only one retriever finds the chunk → caps near 0.5 - typical mixed ranks → falls between 0 and 0.5 RRF's rank-ordering invariants are preserved (every score divides by the same positive constant), so sort + tiebreak behaviour is unchanged. Wire schema label `fusion_score` keeps its slot in RetrievalDetail; only the magnitude shifts, and only for hybrid mode (lexical / vector were already in [0, 1]). Verification: re-ran the four-scenario smoke at /tmp/kb-smoke/ with default score_gate = 0.05 — all four (Korean→Korean, English→ English, cross-language Korean↔English, out-of-corpus) succeed with the expected grounded / refusal classification, top fusion_score now ≈ 0.5. ## Tests One unit test (rrf_formula_matches_known_value) updated to expect the normalized value `(1/61 + 1/62) / (2/61) ≈ 0.9919` instead of the raw `1/61 + 1/62 ≈ 0.0325`. 
The integration snapshot fixture crates/kb-search/tests/fixtures/search/hybrid/run-1.json already used presence checks (fusion_score_positive: true) rather than absolute values, so it doesn't need regeneration. Workspace 319 tests pass; clippy clean across both feature configs. ## Docs This commit also adds tasks/HOTFIXES.md as a dated post-merge log covering this fix and the two earlier --config-flag regressions (P3-5 hotfix #20 across ingest/search/list/inspect/doctor; P4-3 follow-up #24 for kb ask). Original task specs in tasks/p<N>/ *.md stay frozen as the historical contract; HOTFIXES.md is the live source of truth for post-merge deltas. Each affected task spec gets a "Risks/notes" addendum pointing back to HOTFIXES.md so a reader landing on the spec sees the active behaviour: - tasks/INDEX.md gains a "Post-merge 핫픽스" section. - tasks/phase-3-vector-hybrid.md updates the RRF formula text to show the normalized form. - tasks/p3/p3-4-hybrid-fusion.md "Behavior contract" RRF bullet notes the normalization and reason. - tasks/p3/p3-5-app-wiring.md "Risks/notes" notes the --config fix. - tasks/p4/p4-3-rag-pipeline.md "Risks/notes" notes the kb-ask --config fix and the score_gate-RRF incompatibility (closed by the normalization in p3-4). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:16:01 +00:00
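The arithmetic behind the normalization in the commit above can be checked with a short sketch. The function names and the `Option<u32>` rank representation are assumptions made for illustration; the formula itself (raw RRF divided by `num_retrievers / (k_rrf + 1)`) is the one the commit message states.

```rust
/// Raw reciprocal rank fusion score for one chunk.
/// Each entry is the chunk's 1-based rank in one retriever,
/// or None if that retriever did not return the chunk.
fn raw_rrf(ranks: &[Option<u32>], k_rrf: f64) -> f64 {
    ranks
        .iter()
        .flatten()
        .map(|&rank| 1.0 / (k_rrf + rank as f64))
        .sum()
}

/// Raw score divided by the theoretical maximum, which is reached when
/// every retriever places the chunk at rank 1.
/// With an empty `ranks` slice the divisor is zero and the result is NaN.
fn normalized_rrf(ranks: &[Option<u32>], k_rrf: f64) -> f64 {
    let theoretical_max = ranks.len() as f64 / (k_rrf + 1.0);
    raw_rrf(ranks, k_rrf) / theoretical_max
}
```

At `k_rrf = 60` with two retrievers, a chunk found at rank 1 by only one retriever has a raw score of 1/61 ≈ 0.0164, below the 0.05 gate, and a normalized score of 0.5, above it. Ranks 1 and 2 give (1/61 + 1/62) / (2/61) ≈ 0.9919, the value the updated unit test expects.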
altair823	ed8bf87c65	fix(cli): honor --config flag in `kb ask` (P4-3 follow-up) The earlier P3-5 hotfix wired --config through ingest / search / list / inspect / doctor by switching kb-cli to call the *_with_config companions. P4-3 added the ask body but kb-cli's Cmd::Ask arm still called bare kb_app::ask(query, opts) — same bug as before, ask silently fell back to ~/.config/kb/config.toml regardless of what the user passed. Caught during the post-P4-3 smoke against /tmp/kb-smoke/ pointed at 192.168.0.47 Ollama with gemma4:26b: the answer's wire JSON reported `model.id = "qwen2.5:14b-instruct"` (the user's XDG default) instead of `gemma4:26b` from the explicit --config, plus the score_gate / data_dir / model fields all reflected XDG defaults. After this fix the same invocation correctly returns model.id=gemma4:26b, embedding=multilingual-e5-small (from the smoke config), grounded=true with `[#2]` citation pointing at rust/ownership.md. Same minimal pattern as the P3-5 hotfix: - Build the Config once via Config::load(cli.config.as_deref()). - Call kb_app::ask_with_config(cfg, query, opts) instead of kb_app::ask(query, opts). Workspace 319 tests pass; cargo clippy --workspace --all-targets -- -D warnings clean. Smoke verified across four scenarios: - Korean→Korean-body lookup: grounded with rust/ownership.md citation. - English→Korean-body cross-language: grounded with arch/rag- architecture.md citation. - Korean→English-body cross-language: grounded with arch/embeddings.md citation. - Out-of-corpus query: LlmSelfJudge refusal with "근거가 부족하다." Out of scope (filed for follow-up): - config.rag.score_gate default 0.05 is incompatible with hybrid RRF scores. RRF top score is bounded by 2/k_rrf (≈0.033 at k_rrf =60), so the spec default trips ScoreGate on every hybrid query. Workaround: lower the gate to 0.005 in the user's config; long- term fix needs either per-mode gate config or RRF score normalization to [0,1]. Tracked separately. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 15:56:57 +00:00
altair823	e35b06d0d0	feat(p4-3): kb-rag crate — full RAG pipeline + kb-app::ask wired P4 terminal task. Implements the user-facing payoff: retrieve → score gate → pack → render → generate → cite-validate → persist. After this commit, `kb ask` actually works against an Ollama backend; the pipeline grounds the answer in retrieved chunks and refuses cleanly when the gate trips or the model self-judges. New crate kb-rag: - pub struct RagPipeline { retriever, llm, docs, config } — all Arc<dyn Trait + Send + Sync> so the pipeline shares + Sync. - pub fn ask(query, opts) -> Result<Answer> drives the nine-stage flow per spec §1. - pub struct AskOpts { k, explain, mode, temperature, seed, stream_sink: Option<mpsc::Sender<String>> }. k acts as a floor over config.search.default_k so a low-k caller can't starve retrieval (documented in field doc). Pipeline stages: 1. Retrieve via the injected dyn Retriever. 2. Score gate: empty hits → NoChunks refusal (no LLM call); top-1 < config.rag.score_gate → ScoreGate refusal (no LLM call) with top-3 candidates listed in the synthesized answer text. 3. Pack: budget = config.rag.max_context_tokens.saturating_sub (prompt overhead). Per-hit `[#n] doc=… heading=… span=…\n<text>` with deterministic enumeration. If every hit's chunk is unfetchable from the store (deleted between search and pack), fall back to NoChunks refusal with a tracing::warn rather than feeding an empty [근거] to the LLM. 4. Render rag-v1 prompt with the spec's verbatim Korean system string + `[질문]/[근거]` user template. 5. Generate via dyn LanguageModel. Single-thread token loop owns the iterator; tokens optionally forward to opts.stream_sink (a `mpsc::Sender<String>`). SendError silently dropped — caller cancellation never panics the pipeline. After Done the loop reads (acc, finish_reason, usage) in lockstep with no race. 
max_completion = llm.context_tokens().saturating_sub(used_for_input).max(64), explicitly NOT capped by config.rag.max_context_tokens (that's the packing budget for [근거], not the LM completion ceiling). 6. Citation extract via STRICT regex `\[#(\d{1,3})\]` (compiled once via OnceLock). Loose forms `[1]`, `[ #1 ]`, `[#foo]`, `[#1234]`, `vec![1]` are all rejected to prevent prose false-positives. 7. Citation validate covers five cases: - unknown marker (e.g. `[#7]` when only 3 packed) → LlmSelfJudge refusal. - empty answer with hits → LlmSelfJudge. - non-empty + no marker + matches `근거 (가\|이) 부족` regex → LlmSelfJudge (model self-refused with the canonical phrase; phrase match logged via tracing::debug for observability). - non-empty + no marker + no refusal phrase → LlmSelfJudge (silent ungrounded answers are still refusals). - non-empty + ≥1 valid marker → grounded = true. 8. Build Answer per kb_core::Answer shape: - citations: filter packed list to exactly the markers cited. Wire format `marker: Some("[1]")` (square-bracketed bare index) per design §2.3, distinct from the prompt-side `[#n]` grammar. - embedding ModelRef: read from config.models.embedding for Vector/Hybrid; None for Lexical. Documented deviation since the Retriever trait doesn't expose the embedder. For ScoreGate/NoChunks refusals on Vector/Hybrid the embedding model is still recorded, because the vector retriever WAS consulted even when the gate tripped. - TraceId minted as `ret_<8-hex>` from blake3(query, top_score, model_id, ns). - retrieval AnswerRetrievalSummary populated. - usage from the final Done chunk; latency_ms wall-clock fallback when the LLM reports zero. - created_at OffsetDateTime::now_utc(). 9. Persist via SqliteStore::put_answer (new inherent method on SqliteStore, not on the DocumentStore trait; answers aren't documents and adding to kb-core was forbidden). Always inserts, refusals included. packed_chunks_json is null unless opts.explain == true. 
kb-store-sqlite extension: - pub fn put_answer(&Answer, query, packed_chunks_json) -> Result<AnswerId>. Maps all 22 fields of the answers table per V001 schema in a single INSERT under a transaction. kb-app::ask wired: - bail!("not yet wired (P4-3)") replaced with a real body that builds the retriever per opts.mode (Lexical \| Vector \| Hybrid), instantiates OllamaLanguageModel from config, constructs RagPipeline, calls pipeline.ask. AskOpts moves to kb-rag and is re-exported via `pub use kb_rag::AskOpts` so kb-cli's `use kb_app::AskOpts` keeps working. - kb-app/Cargo.toml gains kb-rag, kb-llm, kb-llm-local. P3-5's forbids on these are lifted by P4-3 spec — kb-app is the orchestrator and ask requires both the trait crate and the Ollama adapter. - kb-cli/main.rs's AskOpts literal updated with stream_sink: None for the CLI path (TUI in P9 will plumb a real sink). Tests (kb-rag: 18; kb-app: 1 ignored): - 3 unit in src/pipeline.rs: marker regex strictness (rejects all loose forms with byte-equal expectations), Send+Sync compile check, embedding_ref_for behavior across modes. - 15 integration in tests/pipeline.rs covering every spec test row + the new "all chunks unfetchable falls back to NoChunks" guard: empty-hits, score-gate, grounded happy path, unknown-marker, prose-`[1]` rejection, `vec![1]` rejection, refusal-phrase, packing-budget overflow, streaming-forwards-to-mpsc, dropped- receiver-no-panic, usage-from-final-Done, answers-row-inserted- for-each-refusal-kind, determinism temp=0 seed=0, Answer JSON shape, unfetchable-chunks-fall-back-to-no-chunks (the new M3 test). - kb-app/tests/ask_smoke.rs: 1 #[ignore]'d real-Ollama smoke that drives the wired ask end-to-end against `localhost:11434`. Workspace: 319 passed / 26 ignored / 0 failed. cargo clippy --workspace --all-targets -- -D warnings clean. 
Allowed deps respected (kb-core, kb-config, kb-search, kb-llm, kb-store-sqlite, serde, serde_json, regex, time, tracing, thiserror) plus forced waivers anyhow (Retriever / LanguageModel trait return types) and blake3 (TraceId minting). Forbidden (kb-source-fs, kb-parse-md, kb-normalize, kb-chunk, kb-store- vector direct, kb-embed* direct, kb-llm-local direct, kb-tui, kb-desktop) all absent from `cargo tree -p kb-rag` — concrete adapters reach the pipeline only through trait objects. Out of scope: reranker between retrieve and pack (P+), multi-turn chat memory (P+), LLM-as-judge eval (P5 uses rule-based must_contain), --json streaming (buffers per §0 Q5 hybrid). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 15:06:10 +00:00
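The strict marker grammar from stage 6 of the commit above (`[#` followed by one to three digits and `]`) can be illustrated without the regex crate. This is a hand-rolled scanner written for illustration, not the pipeline's implementation, which the commit says uses a compiled regex. The function name `extract_markers` is hypothetical, and the scanner accepts ASCII digits only, which may be narrower than what `\d` matches in a Unicode-aware regex.

```rust
/// Collect the indices of every citation marker of the exact form
/// "[#" + 1..=3 ASCII digits + "]". Anything looser is ignored.
fn extract_markers(text: &str) -> Vec<u16> {
    let bytes = text.as_bytes();
    let mut found = Vec::new();
    let mut i = 0;
    while i + 1 < bytes.len() {
        if bytes[i] == b'[' && bytes[i + 1] == b'#' {
            let start = i + 2;
            let mut end = start;
            // Consume at most three digits.
            while end < bytes.len() && end - start < 3 && bytes[end].is_ascii_digit() {
                end += 1;
            }
            // Require at least one digit and an immediate closing bracket.
            if end > start && end < bytes.len() && bytes[end] == b']' {
                // The slice holds 1..=3 ASCII digits, so parsing into u16 cannot fail.
                found.push(text[start..end].parse().unwrap());
                i = end + 1;
                continue;
            }
        }
        i += 1;
    }
    found
}
```

Scanning bytes is safe on multibyte text here because `[`, `#`, `]`, and the digits are all single-byte ASCII and never appear inside a UTF-8 continuation sequence. The loose forms listed in the commit (`[1]`, `[ #1 ]`, `[#foo]`, `[#1234]`, `vec![1]`) all produce an empty result.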
altair823	3e38a9bcb4	feat(p4-2): kb-llm-local crate — Ollama HTTP adapter (reqwest::blocking) First real LanguageModel implementation. Wraps Ollama's local HTTP API at POST {endpoint}/api/generate with stream:true, parses the NDJSON streaming response into TokenChunk events, and maps Ollama error states to a thiserror-derived LlmError with actionable hints. Synchronous trait surface; reqwest::blocking handles the HTTP I/O. Public surface: - pub struct OllamaLanguageModel - pub fn new(config: &Config) -> Result<Self> — lazy connect; never hits the network. Spec line 96. - pub enum LlmError { Unreachable, ModelNotPulled, Timeout, Stream, Malformed }. Lives in this crate per spec — kb-core / kb-llm stay free of error taxonomy. - impl kb_core::LanguageModel via re-export from kb-llm. Streaming: - POST body shape per spec §11.2: model, prompt = system + "\n\n" + user, stream: true, options { temperature, seed, num_ctx, stop }. - OllamaStream owns BufReader<reqwest::blocking::Response>, reads NDJSON lines via read_until(b'\n'), parses each as {response, done, done_reason?, prompt_eval_count?, eval_count?, total_duration?}. Token frame → TokenChunk::Token; done frame → TokenChunk::Done { finish_reason, usage }. - done_reason mapping: "length" → Length, "abort" → Aborted, "stop" / missing / unknown → Stop (forward-compat with future Ollama tags). - Missing prompt_eval_count / eval_count default to 0 + tracing::warn (do NOT fail). Spec line 135. - EOF without a done line synthesizes Done { Aborted, zeros } so downstream pipelines never deadlock waiting for a terminal frame. - UTF-8: line-delimited framing means each JSON line is a complete UTF-8 sequence — no cross-HTTP-chunk codepoint splits to worry about. read_until accumulates whole lines regardless of how the underlying reqwest body chunks. Error mapping (LlmError): - reqwest::Error::is_connect() → Unreachable { endpoint, source } with hint "ensure `ollama serve` is running and reachable at <endpoint>". 
- reqwest::Error::is_timeout() → Timeout. - 200 with non-NDJSON first line (e.g., transparent-proxy HTML error page) → Stream(truncated body) — distinguished from Malformed by the iterator's has_emitted flag. - 404 with body containing model_id (case-insensitive) OR English "model" + "not found" → ModelNotPulled(model_id) with hint "ollama pull <model_id>". Tightened beyond spec to survive Ollama localizing the error message (Korean / Japanese / etc.) while keeping the original English-substring fallback. - Other 4xx/5xx → Stream(truncated body). - Mid-stream JSON parse failure (after at least one valid line) → Malformed(line). Truncate all error bodies to 512 chars (chars-based, multibyte safe) so an nginx 500 page can't blow up the diagnostic. - Trailing slash in endpoint stripped before formatting the URL — endpoint = "http://x:1234/" produces .../api/generate, not .../api//generate. Pinned by trailing-slash test. Tokio note: reqwest 0.12's blocking feature internally wraps a private current-thread tokio runtime, so cargo tree --edges normal shows tokio. The auditable invariant is "no top-level tokio dep + no async surface exposed to callers" — verified: src/ has zero async/await/tokio::*. default-features = false drops default-tls (rustls only) but does NOT drop tokio. Documented honestly in Cargo.toml + lib.rs. Switching to ureq would remove tokio entirely; deferred since reqwest is the spec's allowed dep. Tests (24 total: 23 default + 1 ignored): - 7 unit in src/ollama.rs: prompt-build, options-build, finish- reason mapping, truncate_body bounds (under_cap / over_cap_marker / multibyte_chars_not_bytes), 404+model-id heuristic. - 3 in tests/construction.rs: ModelRef shape, context_tokens passthrough, lazy-connect proven via port-1 pointing. 
- 13 in tests/streaming.rs: streamed tokens then Done, multibyte chars within a line round-trip (renamed from "split across chunks" to honestly reflect what's tested), Unreachable-with- hint, 4xx→Stream, 404→ModelNotPulled, concat-equals-canned, done_reason length / abort, missing eval counts default to zero, missing done_reason defaults to Stop, determinism-by-mock, trailing-slash endpoint, non-NDJSON 200 body → Stream not Malformed. - 1 #[ignore] in tests/integration.rs: real Ollama on localhost:11434 with the configured model. Opt-in via cargo test -p kb-llm-local -- --ignored after `ollama serve` + `ollama pull`. Workspace: 288 passed / 25 ignored / 0 failed. cargo clippy --workspace --all-targets -- -D warnings clean. No native-tls, no openssl in the dep graph. Allowed deps respected: kb-core, kb-config, kb-llm, reqwest 0.12 (default-features=false; blocking, json, rustls-tls), serde, serde_json, tracing, thiserror plus anyhow (forced by trait return type). wiremock + tokio in [dev-dependencies] only. Out of scope: llama.cpp / candle adapters (P+), Ollama embed endpoint (separate adapter inside kb-embed-local if requested), cancellation / abort tokens (P+), connection-pool tuning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 14:28:34 +00:00
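Three of the small rules described in the commit above lend themselves to a compact sketch: the `done_reason` mapping, the char-based truncation of error bodies, and the trailing-slash handling of the endpoint. The local `FinishReason` enum is a stand-in for the kb-core type, the function names are hypothetical, and the `[truncated]` suffix is an invented marker, since the log does not say what the real marker looks like.

```rust
/// Local stand-in for the finish reason carried by the Done chunk.
#[derive(Debug, PartialEq)]
enum FinishReason {
    Stop,
    Length,
    Aborted,
}

/// "length" → Length, "abort" → Aborted; "stop", a missing value, and
/// any unknown tag all map to Stop for forward compatibility.
fn map_done_reason(reason: Option<&str>) -> FinishReason {
    match reason {
        Some("length") => FinishReason::Length,
        Some("abort") => FinishReason::Aborted,
        _ => FinishReason::Stop,
    }
}

/// Keep at most `max_chars` characters (not bytes), so a multibyte body
/// is never cut in the middle of a code point.
fn truncate_body(body: &str, max_chars: usize) -> String {
    match body.char_indices().nth(max_chars) {
        // `byte_idx` is the start of the first character past the cap.
        Some((byte_idx, _)) => format!("{}[truncated]", &body[..byte_idx]),
        None => body.to_string(),
    }
}

/// Strip trailing slashes before appending the path, so that
/// "http://x:1234/" does not produce ".../api//generate".
fn generate_url(endpoint: &str) -> String {
    format!("{}/api/generate", endpoint.trim_end_matches('/'))
}
```

Using `char_indices().nth(max_chars)` finds the byte offset of the cut point in one pass and leaves bodies at or under the cap untouched.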
altair823	27c669fbf9	feat(p4-1): kb-llm crate — LanguageModel trait re-export + MockLanguageModel Establishes the kb-llm trait crate so concrete LLM adapters (p4-2 Ollama, future llama.cpp / candle) target a stable surface. Pure re- export of kb_core::{LanguageModel, GenerateRequest, TokenChunk, FinishReason, TokenUsage, ModelRef} plus a feature-gated deterministic mock for downstream RAG tests (p4-3) that need an LLM trait object without an Ollama dependency. MockLanguageModel (cfg(feature = "mock"), default OFF): - Holds canned_response + canned_finish + canned_usage + (model_id, provider, context_tokens). Pure in-memory; no I/O. - generate_stream() honors GenerateRequest.stop: scans every non-empty stop string against the canned response, takes the earliest byte position (Iterator::min returns the first equal element on ties so declaration order in req.stop wins), truncates with a direct byte- slice (str::find returns a UTF-8 char boundary by contract). - When a stop matches, finish_reason is overridden to Stop (matches OpenAI / Ollama real-world behaviour); otherwise the caller's canned_finish passes through verbatim. - Emits one TokenChunk::Token per Unicode scalar value (char), NOT per grapheme cluster — Hangul jamo, emoji ZWJ sequences, combining marks split. Acceptable for trait-shape testing; real adapters MAY combine. Documented in module docs. - Always terminates with TokenChunk::Done { finish_reason, usage } even if the canned response is empty. The returned iterator is a boxed Vec<TokenChunk>::into_iter().map(Ok), trivially Send. - Real adapters MAY return Err from generate_stream itself (e.g. connection refused) before any chunk is yielded; the mock never does. Documented for the trait re-exporter consumer audience. Helpers: - assert_finish_chunk(chunks) — asserts the last chunk is a Done. Useful for proptests asserting trait contract over random inputs. Tests: - cargo test -p kb-llm (no features): 2 reexport / dyn-dispatch tests. 
- cargo test -p kb-llm --features mock: 9 tests including 100-case proptest over random Unicode strings asserting Done terminator, char-count == streamed Token chunks, concat == canned (truncated by stop), plus explicit cases for stop-string truncation, first-stop- match precedence, model_ref dimensions=None invariant, finish reason pass-through. - All 271 workspace tests pass; clippy clean for both default and mock-on feature configurations. Symbol gating verified: - cargo build --release -p kb-llm (default): nm shows zero MockLanguageModel symbols. - cargo build --release -p kb-llm --features mock: three trait-impl symbols present. Spec invariant "release builds MUST NOT include MockLanguageModel" enforced at the symbol level. Allowed deps respected: only kb-core (path) and anyhow (workspace, forced by trait return type). Dropped kb-config / serde / thiserror / tracing from the spec's allowed list — they are listed as Allowed but nothing in this skeleton crate references them, and dropping them keeps the dependency graph slim for downstream consumers. p4-2/p4-3 will add what they need at their own dep sites. Forbidden deps (reqwest, ureq, tokio, whisper-rs, kb-source-fs, kb-parse-md, kb-normalize, kb-chunk, kb-store-, kb-embed, kb-search, kb-rag, kb-tui, kb-desktop) all absent from cargo tree -p kb-llm. Out of scope: real adapter (p4-2 Ollama), token counting against the real tokenizer, server-side cancellation / abort signals (P+). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 13:37:46 +00:00
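The stop-string handling described for the mock in the commit above (earliest byte position wins, empty stop strings are skipped, one token per `char`) can be sketched with the standard library. The function names and the `(text, stopped)` return shape are hypothetical; the real mock works on `GenerateRequest` and `TokenChunk` values that are not reproduced here.

```rust
/// Cut `response` at the earliest occurrence of any non-empty stop string.
/// Returns the kept prefix and whether a stop string matched, which is
/// the condition under which the finish reason would be overridden to Stop.
fn truncate_at_stop<'a>(response: &'a str, stops: &[String]) -> (&'a str, bool) {
    let earliest = stops
        .iter()
        .filter(|s| !s.is_empty())
        .filter_map(|s| response.find(s.as_str()))
        .min();
    match earliest {
        // `str::find` returns a char-boundary byte offset, so slicing is safe.
        Some(pos) => (&response[..pos], true),
        None => (response, false),
    }
}

/// One token per Unicode scalar value, not per grapheme cluster.
fn stream_chars(text: &str) -> Vec<String> {
    text.chars().map(|c| c.to_string()).collect()
}
```

In this sketch the order of `stops` does not change the kept prefix, because only the minimum position matters; declaration order would only come into play if the caller also needed to know which stop string matched.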
altair823	a08f61a242	fix(cli): honor --config flag + improve search output legibility Two issues surfaced during the post-P3-5 manual smoke test against a six-document workspace: 1. --config flag was silently ignored. kb-cli read cli.config only while building SourceScope inside the Ingest arm, then called kb_app::ingest(scope, summary_only) which internally re-loads Config::load(None) — falling back to ~/.config/kb/config.toml regardless of what the user passed. Same pattern in search, list, inspect, doctor. Users had to rely on KB_* env vars to point at a non-default config. 2. Search output collapsed RRF hybrid scores to "0.02" because `{:.2}` truncated the (0, 0.033]-bounded fused score, and chunks from the same document showed up as identical lines ("3. 0.02 arch/rag-architecture.md") since heading_path was never printed. Fix: - kb-app: doctor/ingest/search/list/inspect already had _with_config(Config, ...) seams introduced for integration tests (#[doc(hidden)] pub). Repurpose them as the official "config-explicit" API — kb-cli now builds the Config once via Config::load(cli.config.as_deref()) at the top of every subcommand and threads it into the _with_config variant. Module doc-comment updated to reflect three callers (CLI --config, integration tests, TUI session) instead of "test-only seam". - kb-app: doctor() rewritten as doctor_with_config_path(Option<&Path>) that respects an explicit path. config_loaded probe now reports the actual path checked, returning a clear hard error if --config points at a non-existent or malformed file (defaults would silently mask user intent). data_dir_writable resolves storage.data_dir from the loaded config (with env overrides applied via Config::apply_env) so --config users see their custom paths reflected. Original doctor() signature kept as a None-passing wrapper. - kb-cli: ingest/search/list/inspect/doctor each call the _with_config companion. 
Search printer switches to {:.4} score formatting (RRF hybrid range bounded by ~2/k_rrf ≈ 0.033 at k_rrf=60 default) and appends `> head1 / head2` when heading_path is non-empty so chunks from the same document are visually distinguishable. Verified manually: - `kb --config /tmp/kb-smoke/config.toml doctor` reports the custom config path + custom data_dir, not the XDG defaults. - `kb --config /tmp/kb-smoke/config.toml search "..." --mode hybrid` returns hits with distinct 4-digit scores and heading paths ("rust/ownership.md > Rust 소유권 모델 / Borrow checker"). Workspace 269 passed / 24 ignored / 0 failed; cargo clippy --workspace --all-targets -- -D warnings clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 12:46:37 +00:00
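A minimal sketch of the printer change described in the commit above. The function name `format_hit` and its signature are hypothetical (they are not taken from kb-cli); only the `{:.4}` precision and the `> head1 / head2` suffix come from the commit message.

```rust
/// Hypothetical helper: render one search hit as a single line.
/// Four decimals keep RRF scores (at most about 0.033 at k_rrf = 60) distinguishable.
fn format_hit(rank: usize, score: f32, path: &str, heading_path: &[&str]) -> String {
    let mut line = format!("{rank}. {score:.4} {path}");
    // Append the heading path only when there is one, so chunks from the
    // same document no longer print as identical lines.
    if !heading_path.is_empty() {
        line.push_str(" > ");
        line.push_str(&heading_path.join(" / "));
    }
    line
}

/// The old two-decimal rendering, kept here only to show the collapse.
fn format_score_two_decimals(score: f32) -> String {
    format!("{score:.2}")
}
```

With k_rrf = 60, single-side hits at rank 1 and rank 2 score 1/61 and 1/62. Both render as `0.02` with two decimals, but as `0.0164` and `0.0161` with four.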
altair823	17d52461b2	feat(p3-5): wire kb-app facade — ingest / search / list / inspect Replaces the P0 `bail!("not yet wired")` stubs in kb-app with real bodies that compose the libraries shipped through P3-4. After this commit, `cargo run -p kb-cli -- index` actually walks the workspace and persists chunks (SQLite + optionally LanceDB), and `cargo run -p kb-cli -- search --mode {lexical,vector,hybrid}` returns real SearchHits with citations. `kb-app::ask` stays stubbed; P4-3 owns it. App lifecycle (crates/kb-app/src/app.rs): - Internal pub(crate) struct App holds the Config plus Arc<SqliteStore> eagerly, with embedder + LanceVectorStore behind OnceLock<Arc<...>> for memoization. First call pays the ~470MB fastembed init / Lance open; subsequent calls return the cached Arc::clone. OnceLock::set race losers fall back to get().cloned() so the lazy-init is concurrent-safe. - One-shot CLI invocations pay the cost once at most. The P9 TUI (which holds an App for the session) gets memoization for free. ingest pipeline (lib.rs): - FsSourceConnector::scan(&scope) → per asset: parse_frontmatter → parse_blocks → build_canonical_document → MdHeadingV1Chunker.chunk → put_asset_with_bytes → put_document → put_blocks → put_chunks. One transaction per document per design §5.8 (kb-store-sqlite's put_* methods own the transactions). - When provider != "none" and dimensions > 0: build embedder once, embed each doc's chunks as Document kind, ensure_table once at the top of the run, then upsert the VectorRecord batch. Lexical-only config (provider == "none") skips both — verified by ingest_provider_none_skips_lance test. - Per-asset parse failures recorded as IngestItemKind::Error with the warning attached; the run continues. Only structural failures (DB unreachable etc.) abort. 
- Aggregate counts (assets_scanned / new / updated / skipped / errors / chunks_indexed / embeddings_indexed / duration_ms) flow into both the JobRepo progress_json AND a dedicated ingest_runs row written via SqliteStore::record_ingest_run (new pub(crate) helper added to kb-store-sqlite — see below). summary_only=true writes items_json=NULL but still populates the count columns. search dispatch: - SearchMode::Lexical → LexicalRetriever directly. - SearchMode::Vector → VectorRetriever with embedder + LanceVectorStore. - SearchMode::Hybrid → HybridRetriever composing the two. - Vector / Hybrid with provider=none returns a clear error naming the config key to flip ("models.embedding.provider"). list_docs / inspect_doc / inspect_chunk delegate straight to DocumentStore trait methods. Returns Err with actionable message on not-found. Test seam: each public free function has a matching #[doc(hidden)] pub fn _with_config(cfg, ...) companion that integration tests invoke directly (the public form internally calls load_config()). pub(crate) would not reach across the integration- tests crate boundary; #[doc(hidden)] keeps it out of rustdoc and the function comment flags it as test-only. kb-store-sqlite additions: - pub struct IngestRunRow + pub fn record_ingest_run on SqliteStore for the kb-app aggregate-counts persistence path. Helper writes the ingest_runs row directly with all aggregate columns; jobs table still gets a JobRepo create/update_progress/finish trio in parallel. Tests (11 default, 2 #[ignore] AVX-gated): - ingest_lexical: round-trip, idempotent, summary_only_drops_items, provider_none_skips_lance (asserts no .lance dir on disk), records_ingest_runs_row_with_aggregate_counts, tags_any filter, inspect_doc_not_found, inspect_chunk_not_found. - search_lexical: lexical hits with embedding_model=None, empty_query_returns_empty, vector_mode_with_provider_none returns clear error. 
- search_vector: hybrid mode end-to-end (#[ignore], AVX), Vector mode embedding_model assertion (#[ignore], AVX). Both run on the AVX VM in ~21s combined (first run pays the model download). - TestEnv pins workspace.root + storage.{data_dir,model_dir} to a TempDir so tests don't touch the user's $HOME/.local/share. - Fixture workspace at crates/kb-app/tests/fixtures/workspace/ has three small markdown files with varied frontmatter (rust+cargo+ python tags) so the tags_any filter test exercises a non-trivial predicate. Workspace 269 passed / 24 ignored / 0 failed (was 261/22). cargo clippy --workspace --all-targets -- -D warnings clean. CLI smoke verified manually: `cargo run -p kb-cli -- index` returns a real IngestReport JSON; `cargo run -p kb-cli -- search "..."` returns hits with citations; `cargo run -p kb-cli -- list docs` lists the indexed documents. Allowed deps respected: kb-source-fs, kb-parse-md, kb-parse-types, kb-normalize, kb-chunk, kb-store-sqlite, kb-search, kb-store-vector, kb-embed, kb-embed-local plus existing tracing / anyhow / serde / toml / dirs and now blake3 (run_id) + time. Forbidden (kb-llm, kb-rag, kb-tui, kb-desktop, kb-parse-{pdf,image,audio}) absent from cargo tree -p kb-app. Out of scope per spec: ask body (P4-3), --rebuild-fts wiring, --resume checkpointing (P+), --watch (P+), TUI / desktop integration (P9 consumes this facade). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 12:11:21 +00:00
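The lazy-init pattern described for `App` in the commit above (OnceLock memoization where a `set` race loser falls back to `get().cloned()`) can be sketched with the standard library alone. `LazyResource` and `get_or_try_init` are hypothetical names for illustration, not the kb-app API.

```rust
use std::sync::{Arc, OnceLock};

/// Hypothetical wrapper around an expensive, fallibly-built resource.
struct LazyResource<T> {
    cell: OnceLock<Arc<T>>,
}

impl<T> LazyResource<T> {
    fn new() -> Self {
        Self { cell: OnceLock::new() }
    }

    /// Return the cached Arc, or build it on first use.
    /// If two threads race, both may run `init`, but only one value is kept.
    fn get_or_try_init<E>(&self, init: impl FnOnce() -> Result<T, E>) -> Result<Arc<T>, E> {
        if let Some(existing) = self.cell.get() {
            return Ok(Arc::clone(existing));
        }
        let fresh = Arc::new(init()?);
        match self.cell.set(Arc::clone(&fresh)) {
            Ok(()) => Ok(fresh),
            // Lost the race: discard our value and hand out the winner's.
            Err(_rejected) => Ok(self
                .cell
                .get()
                .cloned()
                .expect("OnceLock::set failed, so a value must be present")),
        }
    }
}
```

The trade-off in this sketch is that a race loser pays the init cost once and throws the result away. That keeps the code free of extra locking, and it never caches an error.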
altair823	ccd49ef546	feat(p3-4): hybrid-fusion — VectorRetriever + HybridRetriever (RRF) Composes the existing LexicalRetriever (P2-2) with a new VectorRetriever wrapper around LanceVectorStore (P3-3) into a single Retriever that dispatches by SearchMode. For SearchMode::Hybrid, fuses lexical and vector candidates via Reciprocal Rank Fusion and populates the full RetrievalDetail per SearchHit so kb search --explain can attribute scores back to each side. Public surface (kb-search crate): - pub struct VectorRetriever — Arc<dyn VectorStore + Send + Sync>, Arc<dyn Embedder>, Arc<SqliteStore>, IndexVersion at construction. - pub struct HybridRetriever { lexical, vector, fusion, k }. - pub enum FusionPolicy { Rrf { k_rrf: u32 } }. VectorRetriever: - Embeds query.text as EmbeddingKind::Query before delegating to VectorStore::search(query_vec, query.k * 2, &query.filters). Over- fetches by ×2 for filter losses; LanceVectorStore applies the filters internally so they propagate naturally. - Hydrates each VectorHit into a full SearchHit by joining on chunk_id in a single IN-clause batch (no N+1): doc_path, section_label, chunker_version, source_spans for citation, plus embedding_model from embedder.model_id(). - Snippet trimmed to config.search.snippet_chars (vector mode lacks FTS5 highlighting; chunk text prefix is the next-best signal). - Citation built from the chunk's first source span via the shared citation_helper module — extracted from lexical.rs so both retrievers compute citations identically (Byte/empty fallback to Line{1,1} preserved with tracing::warn). - RetrievalDetail.method = Vector for standalone calls; both fusion_score and vector_score set to the LanceVectorStore-shifted cosine score; lexical_* None. HybridRetriever: - Lexical / Vector modes delegate 1:1 — no rebuild of RetrievalDetail. 
- Hybrid mode runs both retrievers with k * 2 fanout, fuses with RRF (score(c) = Σ 1/(k_rrf + rank_m(c))), sorts fused-score DESC with deterministic tiebreaker (lex_rank ASC then chunk_id ASC), takes top query.k. Fusion math runs in f64 throughout; cast to f32 only at the SearchHit boundary where bounded magnitude (≤ ~0.033 at k_rrf=60) makes f32 precision sufficient for ranking. - Per-hit lexical preferred for snippet/citation/heading_path/ chunker_version/embedding_model when the chunk appears in both retrievers — FTS5 highlighting is more user-relevant than vector's truncated text. Vector-only chunks fall through to vector hit data. - index_version returns format!("hybrid:{}+{}", lex_iv, vec_iv) at construction; mismatched lex/vec versions trigger a tracing::warn so users notice stale indexes (spec line 143). kb-search additions: - citation_helper.rs — pub(crate) citation_from_first_span shared between lexical and vector retrievers. Extracted from lexical.rs; no behavior drift. Tests (38 default + 3 ignored): - 12 unit tests in hybrid.rs covering RRF math (1/61 + 1/62 within f32 epsilon × 10 tolerance), lexical/vector mode delegation, hybrid preserves single-side hits with the missing side's RetrievalDetail None, deterministic tiebreaker on identical fused scores, composite index_version, mismatched-version warn at construction. - 2 unit tests in vector.rs covering the snippet-prefix and citation fallback paths. - 11 unit tests in lexical.rs (unchanged from P2-2). - 13 lexical integration tests (unchanged). - 3 #[ignore] AVX-gated hybrid integration tests: disjoint-corpus recall (lex returns A,B; vec returns C,D; hybrid returns all 4), determinism over two queries, snapshot stability against tests/fixtures/search/hybrid/run-1.json. Snapshot fixture was regenerated against this branch on an AVX-enabled VM and contains 4 real chunks (c1/c2 lex+vec, c3/c4 vec-only). 
- KB_UPDATE_SNAPSHOTS=1 path now panics after writing instead of silently passing — matches the P3-2/P3-3 fail-loud-instead-of-silent-pass philosophy. Allowed deps respected (kb-core, kb-config, kb-store-sqlite, kb-store-vector, kb-embed, tracing, thiserror) plus pre-existing kb-search deps from P2-2 (rusqlite, globset, serde_json, anyhow). kb-embed-local does NOT appear — VectorRetriever takes Arc<dyn Embedder> trait object; the concrete adapter is runtime-injected by kb-app. Out of scope: reranker (P+), score calibration across modes (RRF is rank-comparable so absolute calibration is P+), multimodal retrieval (P6+). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:22:21 +00:00
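The RRF step described in the commit above, score(c) = Σ 1/(k_rrf + rank_m(c)) with a deterministic tiebreaker, can be sketched as follows. This is an illustration under stated assumptions, not the kb-search code: `FusedHit`, `Acc`, and `rrf_fuse` are hypothetical names, ranks are 1-based, each input list holds unique chunk ids, and a chunk missing from the lexical list is assumed to sort after any chunk that has a lexical rank.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct FusedHit {
    chunk_id: String,
    score: f32,
    lex_rank: Option<usize>,
    vec_rank: Option<usize>,
}

#[derive(Default)]
struct Acc {
    score: f64, // fusion math stays in f64 until the output boundary
    lex_rank: Option<usize>,
    vec_rank: Option<usize>,
}

/// Fuse two ranked id lists (best first) and keep the top `k`.
fn rrf_fuse(lexical: &[&str], vector: &[&str], k_rrf: u32, k: usize) -> Vec<FusedHit> {
    let mut acc: HashMap<String, Acc> = HashMap::new();
    for (i, id) in lexical.iter().enumerate() {
        let entry = acc.entry((*id).to_string()).or_default();
        entry.score += 1.0 / (f64::from(k_rrf) + (i + 1) as f64);
        entry.lex_rank = Some(i + 1);
    }
    for (i, id) in vector.iter().enumerate() {
        let entry = acc.entry((*id).to_string()).or_default();
        entry.score += 1.0 / (f64::from(k_rrf) + (i + 1) as f64);
        entry.vec_rank = Some(i + 1);
    }
    let mut rows: Vec<(String, Acc)> = acc.into_iter().collect();
    // Fused score DESC, then lexical rank ASC, then chunk_id ASC.
    rows.sort_by(|(id_a, a), (id_b, b)| {
        b.score
            .total_cmp(&a.score)
            .then_with(|| {
                a.lex_rank
                    .unwrap_or(usize::MAX)
                    .cmp(&b.lex_rank.unwrap_or(usize::MAX))
            })
            .then_with(|| id_a.cmp(id_b))
    });
    rows.into_iter()
        .take(k)
        .map(|(chunk_id, a)| FusedHit {
            chunk_id,
            score: a.score as f32, // cast only at the boundary
            lex_rank: a.lex_rank,
            vec_rank: a.vec_rank,
        })
        .collect()
}
```

For a disjoint corpus (lexical returns A, B; vector returns C, D), all four chunks survive fusion, and each hit keeps `None` for the side that did not return it.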
altair823	dd42740cc0	test(p3-3): pin LanceDB snapshot fixture from AVX-capable host Replaces the placeholder run-1.json with the captured Vec<VectorHit> from `cargo test -p kb-store-vector --test snapshot -- --ignored` on an AVX2-capable VM (host-passthrough CPU model). Verified by re-running the same ignored lane and asserting against the pinned fixture. Full ignored lane on AVX hardware: - upsert_search.rs: 8 / 8 pass (ensure_table idempotent, search-empty, upsert+search, dim-mismatch, tags filter, model isolation, determinism, crash-recovery promotes pending → committed). - snapshot.rs: 1 / 1 pass against the pinned fixture. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 10:58:10 +00:00
altair823	3cd5117a7e	feat(p3-3): kb-store-vector — LanceDB VectorStore + V003 embedding status First VectorStore implementation. Per-model Lance tables under config.storage.vector_dir, two-phase upsert (SQLite-pending → Lance MergeInsert → SQLite-committed) with crash-safe retry, search via cosine distance with the spec's score-shift (preserves negative similarity ranking signal that clamping would crush). V003 migration: - Adds status (CHECK constraint pending|committed|tombstone, default pending) and vector_committed columns to embedding_records. - BEFORE DELETE trigger on chunks flips dependent rows to tombstone. Currently overshadowed by V001's ON DELETE CASCADE FK; trigger UPDATE runs first then row vanishes via CASCADE. Spec-faithful tombstone preservation requires recreating embedding_records to drop the CASCADE — deferred to a P+ migration since no production rows exist yet (P3-3 is the first writer). V003 SQL comment explains. LanceVectorStore: - ensure_table is idempotent: opens existing or creates with the Arrow schema (chunk_id, doc_id, embedding FixedSizeList<Float32, dim>, model_id, embedding_version, text, heading_path, created_at). - IndexId computed via id_for_index with collection="chunk_embeddings", index_kind="flat", params_hash = blake3(descriptor JSON). Schema bumps automatically rotate the IndexId. - upsert: phase-1 INSERT OR REPLACE INTO embedding_records (status='pending') in a single SQLite tx; phase-2 Lance MergeInsert keyed on chunk_id (idempotent re-run); phase-3 UPDATE status='committed', vector_committed=1. If phase-2 fails the rows stay 'pending' and the next upsert call retries idempotently. - search joins embedding_records WHERE status='committed' so partial-write rows never surface. Cosine distance from Lance ∈ [0, 2] → similarity = 1 - distance ∈ [-1, 1] → score = (similarity + 1)/2 ∈ [0, 1]. NaN coerced to 0 with tracing::warn. Filter by SearchFilters via SqliteStore::filter_chunks (added in this commit). 
- Sync trait + async LanceDB bridged by an embedded current-thread tokio runtime. Doc-comment on the struct flags the "do NOT call from inside another tokio runtime" panic (block_on cannot nest). kb-app's job scheduler is sync today. kb-store-sqlite additions: - pub fn put_embedding_records_pending(&[EmbeddingRecordRow]) — phase-1 INSERT OR REPLACE (status='pending', vector_committed=0). - pub fn mark_embedding_records_committed(&[EmbeddingId]) — phase-3 single UPDATE … WHERE embedding_id IN (?, ?, …) via params_from_iter, guarded by WHERE status='pending' so tombstones don't get clobbered. - pub fn filter_chunks(&[ChunkId], &SearchFilters) → Vec<ChunkId> consolidates the JOIN against documents/document_tags/ embedding_records + path_glob via globset. Lets kb-store-vector honor SearchFilters without depending on rusqlite or globset directly. (kb-search's filter logic is structurally different — interleaved with the FTS5 SELECT — so it stays as-is for now; consolidation is a P+ refactor.) - 4 new unit tests cover the phase-1 round-trip, empty batch, replay reset of pending rows, and the WHERE-status-pending guard. Tests: - 9 lib unit tests in kb-store-vector covering paths/sanitization, arrow_batch dim validation + descriptor hash, bm25-style cosine score shift math. - 4 new kb-store-sqlite unit tests on filter_chunks (committed-only, tags/lang/trust/path_glob, order preservation, empty input). - 4 new kb-store-sqlite unit tests on the embedding_records helpers. - 8 integration tests in upsert_search.rs and 1 snapshot test marked #[ignore = "requires AVX-capable hardware (LanceDB)"]. They invoke require_avx_or_panic() at the top of each body so a missing-AVX --ignored run fails loudly instead of silently passing. This dev host (qemu64 model) lacks AVX so these were NOT exercised end-to- end here — first CI lane on AVX hardware will validate them. - Snapshot fixture tests/fixtures/vector/run-1.json is a placeholder with an _comment marker. 
Snapshot test panics until the placeholder is replaced via KB_UPDATE_SNAPSHOTS=1 on AVX hardware. - Workspace 241 passed, 19 ignored, 0 failed; cargo clippy --workspace --all-targets -- -D warnings clean. Allowed deps respected (kb-core, kb-config, kb-store-sqlite, lancedb, arrow + arrow-array + arrow-schema, serde, serde_json, tracing, thiserror) plus forced waivers — anyhow (trait return type), tokio + futures (LanceDB async-only API), blake3 (params_hash). rusqlite and globset are NOT direct deps of kb-store-vector — confirmed via cargo metadata --no-deps. rusqlite stays in [dev-dependencies] for the test fixture seeder only. Out of scope: IVF/PQ index tuning (P+), image vectors (P6), kb-app embed_index orchestration (P3-4 facade). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 10:01:31 +00:00
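The score shift described in the commit above (cosine distance in [0, 2] → similarity in [-1, 1] → score in [0, 1], NaN coerced to 0) is small enough to sketch directly. The function names are hypothetical, and the real code also emits a `tracing::warn` on NaN, which is omitted here to keep the sketch dependency-free.

```rust
/// Map a cosine distance in [0, 2] to a score in [0, 1].
/// NaN is coerced to 0.0 (the commit also logs a warning at this point).
fn score_from_cosine_distance(distance: f32) -> f32 {
    if distance.is_nan() {
        return 0.0;
    }
    let similarity = 1.0 - distance; // in [-1, 1]
    (similarity + 1.0) / 2.0 // in [0, 1]
}

/// The alternative the commit rejects: clamp negative similarity to zero.
fn clamped_similarity(distance: f32) -> f32 {
    (1.0 - distance).max(0.0)
}
```

Distances of 1.5 and 2.0 both clamp to 0.0, so their relative order is lost. The shifted scores are 0.25 and 0.0, which keeps them ranked.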
altair823	bcbe2b8531	feat(p3-2): kb-embed-local crate — fastembed adapter for multilingual-e5-small First real Embedder implementation. Wraps fastembed-rs (ONNX runtime) with the e5 prefix convention, batching, and {data_dir}/${XDG_DATA_HOME} template expansion so model files land under config.storage.model_dir/ fastembed/ without polluting kb-config's public API. Public surface: - pub struct FastembedEmbedder - pub fn new(config: &Config) -> Result<Self> - impl kb_core::Embedder (via kb-embed re-export) Behavior: - Default model multilingual-e5-small (384 dims). model_id and model_version come from config.models.embedding.{model,version}. - Pre-load dim check via TextEmbedding::get_model_info: dim mismatch bails before paying the ~470MB ONNX init cost. - e5 prefix applied BEFORE tokenization: "passage: " for EmbeddingKind::Document, "query: " for EmbeddingKind::Query. Pinned by prefix_input unit tests. - Batches inputs into chunks of config.models.embedding.batch_size, concatenates results in input order. - L2 normalization is performed by fastembed 4.9's default transformer pipeline (verified at fastembed/src/text_embedding/output.rs:43); we skip re-normalization. Integration test pins ‖v‖ ≈ 1.0 ± 1e-3 so a future fastembed bump that drops this invariant fails loudly. - Synchronous (no async runtime). Mutex serializes calls into the underlying ONNX session — conservative; ORT Session is Send+Sync but callers (kb-app indexer) batch sequentially anyway. Revisit if profiling shows contention. - First-run model download surfaces via tracing::info before/after TextEmbedding::try_new — users no longer stare at a silent 30-60s pause during the 470MB pull. Tests: - 11 default-lane tests covering: check_dim match/mismatch (no model load), prefix_input Document/Query/empty, resolve_model known/unknown, expand_path substitution + no-op + XDG_DATA_HOME set + XDG_DATA_HOME unset (falls back to ~/.local/share with recursive ~ expansion). 
XDG tests serialize on a Mutex + RAII guard since edition 2024 makes set_var/remove_var unsafe. - 7 #[ignore] integration tests covering: full construction with default config, dim-mismatch belt-and-braces, Document vs Query cosine differential, L2 unit norm, byte-equal determinism, batch-64 performance under 5s, snapshot-hash stability over a 5-sentence multilingual fixture. - Snapshot test fails LOUDLY when SNAPSHOT_HASH_BASELINE is 0 — prints the captured hash and panics with paste-back instructions, so first --ignored run forces the maintainer to pin the baseline rather than silently passing. - Workspace: 222 tests pass (default lane); clippy clean. Allowed deps respected: kb-config, kb-embed (re-exports kb-core trait surface), fastembed = "4.9", tracing, anyhow. tokenizers and ort enter transitively through fastembed; reqwest/hyper/hf-hub also transitive (model download is fastembed's responsibility per spec carve-out). No direct kb-core dep needed — re-exports cover it. Pinned to fastembed 4.x rather than the recent 5.x to limit blast radius; consider bump when p3-3 (lancedb-store) consumes the embedder output shape. Out of scope: reranker (P+), Ollama embedding endpoint, candle adapter, image embeddings (P6). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 08:39:38 +00:00
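A minimal sketch of the e5 prefix convention and order-preserving batching described in the commit above. The `EmbeddingKind` enum here is a local stand-in for the kb-core type, and `embed_in_batches` with its closure parameter is a hypothetical shape. The real adapter calls fastembed, which this sketch does not attempt to reproduce.

```rust
/// Local stand-in for kb_core::EmbeddingKind.
#[derive(Clone, Copy, Debug, PartialEq)]
enum EmbeddingKind {
    Document,
    Query,
}

/// Apply the e5 prefix before tokenization.
fn prefix_input(kind: EmbeddingKind, text: &str) -> String {
    match kind {
        EmbeddingKind::Document => format!("passage: {text}"),
        EmbeddingKind::Query => format!("query: {text}"),
    }
}

/// Split inputs into batches of `batch_size` and concatenate the results
/// in input order. `embed_batch` stands in for the model call.
fn embed_in_batches<F>(inputs: &[String], batch_size: usize, mut embed_batch: F) -> Vec<Vec<f32>>
where
    F: FnMut(&[String]) -> Vec<Vec<f32>>,
{
    let mut out = Vec::with_capacity(inputs.len());
    // Guard against a zero batch size, which `chunks` would reject.
    for batch in inputs.chunks(batch_size.max(1)) {
        out.extend(embed_batch(batch));
    }
    out
}
```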
altair823	2e3eb8f437	feat(p3-1): kb-embed crate — Embedder trait re-export + MockEmbedder Establishes the kb-embed trait crate so concrete embedding adapters (p3-2 fastembed, future ollama-embed/candle) target a stable surface. Pure re-export of kb_core::{Embedder, EmbeddingInput, EmbeddingKind, EmbeddingModelId, EmbeddingVersion} plus a feature-gated deterministic mock for downstream tests. MockEmbedder (cfg(feature = "mock"), default OFF): - Per-component hash recipe: blake3(seed_le8 || kind_byte || text_len_le8 || text || i_le8). Length-prefixed text avoids the domain-separation ambiguity where two (text, i) pairs could shift bytes between text tail and the i field. - Document = 0u8, Query = 1u8 — same text different kind yields different vectors (mirrors e5 prefix behaviour). - Per component: blake3 first 8 bytes → u64 → reinterpret as i64 → f64/i64::MAX → f32. i64::MIN gives -1.0000000000000002 which f32 rounds to -1.0; range [-1, 1] holds. - L2 unit-normalised. Norm sums in f64 (avoid catastrophic precision loss) before f32 cast. Zero-norm guard skips the divide. - with_seed(...) constructor lets two embedders share identity but produce different vectors — useful for downstream parametric tests. Helpers: - assert_vector_shape(vecs, dims) — len + finite check. - assert_unit_norm(vecs, tolerance) — caller-supplied tolerance; 5e-4 documented as safe for dims=384 under f32 epsilon × √dims. Tests: - cargo test -p kb-embed (no features): 2 reexport/dyn-dispatch tests. - cargo test -p kb-embed --features mock: 7 tests including 100-case proptest asserting len == dims, all finite, ‖v‖ ≈ 1.0 within tolerance, Doc(text) byte-equal Doc(text), Doc(text) ≠ Query(text), Doc(text1) ≠ Doc(text2). - All 220 workspace tests pass; clippy clean for both default and mock-on feature configurations. Symbol gating: nm on the release rlib confirms zero MockEmbedder symbols under default features; three trait impl symbols under --features mock. 
Spec invariant "release builds MUST NOT include MockEmbedder" verified at the symbol level. Allowed deps respected: kb-core, kb-config, serde, thiserror, tracing, plus anyhow (forced by trait return type) and blake3 (justified by the determinism contract; already in workspace lockfile via kb-core). No fastembed/ort/tokenizers anywhere. Out of scope: real adapter (p3-2), reranker traits (P+). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 08:15:44 +00:00
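Two pieces of the MockEmbedder recipe in the commit above can be illustrated without the blake3 crate: the length-prefixed hash preimage and the f64-accumulated L2 normalisation. This is a sketch under assumptions. `hash_preimage` and `l2_normalize` are hypothetical names, "le8" is read as an 8-byte little-endian encoding, and the hashing step itself is left out.

```rust
/// Build seed_le8 || kind_byte || text_len_le8 || text || i_le8.
/// The length prefix means no bytes can migrate between the text tail
/// and the component index `i`.
fn hash_preimage(seed: u64, kind_byte: u8, text: &str, i: u64) -> Vec<u8> {
    let mut buf = Vec::with_capacity(8 + 1 + 8 + text.len() + 8);
    buf.extend_from_slice(&seed.to_le_bytes());
    buf.push(kind_byte); // 0 = Document, 1 = Query per the commit
    buf.extend_from_slice(&(text.len() as u64).to_le_bytes());
    buf.extend_from_slice(text.as_bytes());
    buf.extend_from_slice(&i.to_le_bytes());
    buf
}

/// L2-normalise in place, summing squares in f64 before casting back.
/// A zero-norm vector is left untouched instead of dividing by zero.
fn l2_normalize(v: &mut [f32]) {
    let norm = v
        .iter()
        .map(|&x| f64::from(x) * f64::from(x))
        .sum::<f64>()
        .sqrt();
    if norm == 0.0 {
        return;
    }
    for x in v.iter_mut() {
        *x = (f64::from(*x) / norm) as f32;
    }
}
```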
altair823	b335151d18	feat(p2-2): kb-search crate + LexicalRetriever (FTS5 + bm25) Adds the first concrete kb_core::Retriever, exercising chunks_fts (P2-1) to answer SearchMode::Lexical queries. Returns Vec<SearchHit> with bm25-derived ranking, snippet() previews, and W3C-fragment-style Citation built from the chunk's first source_spans entry. New crate kb-search: - LexicalRetriever::new(Arc<SqliteStore>, IndexVersion). - search() builds an FTS5 MATCH expression by escaping every whitespace token into a quoted literal (inner " doubled); single-quote-wrapped text passes through verbatim as raw FTS5 syntax. Empty query short-circuits to Ok(vec![]). - bm25 normalization: score = -bm25 / (1 + |bm25|), bounded (0, 1] for any FTS5-returned negative bm25. - Snippet via snippet(chunks_fts, 3, '', '', '…', word_budget) where word_budget = snippet_chars / 4 clamped to [1, 64]; trim_snippet enforces the char cap on the way out (chars per design §6.4 — accepts the combining-mark trade-off). - Citation from chunks.source_spans_json first span: Line / Page / Region / Time forwarded; Byte / empty array fall back to Line{1,1} with a tracing::warn so forward-compat regressions surface. - Filters: tags_any (subquery on document_tags), lang (= column), trust_min (CASE-rank in SQL) all applied at SQL level. path_glob uses globset with literal_separator(true) — guarantees '*' does not cross '/' per spec Risks/notes — applied as Rust post-filter with +128 row over-fetch when set, then rank reassigned 1..k contiguously. - Determinism: ORDER BY score, f.chunk_id (lexicographic blake3 hex tiebreaker on identical bm25). Tested explicitly with two chunks of identical text content. - RetrievalDetail: method=Lexical, both lexical_score and fusion_score set, vector_* None. kb-store-sqlite: - Adds pub fn read_conn(&self) -> MutexGuard<'_, Connection>. Read-only contract is doc-only — kb-search needs MutexGuard for prepare_cached + iter, which a closure-scoped wrapper would awkwardly constrain. 
Closure variant left as a P3 follow-up. Tests (26 new): empty corpus, empty query, single hit + citation round-trip, snippet length cap, tags_any exclusion, lang+trust composition, path_glob with '*' not crossing '/', citation line round-trip, bm25 top-1 ∈ (0, 1], determinism (varied scores AND identical-score tiebreaker), index_version passthrough, snapshot (crates/kb-search/tests/fixtures/search/lexical/run-1.json — stable under bundled SQLite; KB_UPDATE_SNAPSHOTS=1 to regenerate). Workspace: 211 tests pass, cargo clippy --workspace --all-targets -D warnings clean. Allowed deps respected: kb-core, kb-config, kb-store-sqlite, rusqlite, tracing, thiserror, anyhow (forced by trait return type), serde_json (parses _json TEXT columns), globset (path_glob '*' boundary). Out of scope (deferred): vector retriever (p3-3), hybrid fusion (p3-4), reranker (P+), Korean morphological tokenizer (P+). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 05:20:35 +00:00
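The query escaping and bm25 normalization described in the commit above can be sketched with the standard library. `fts5_match_expr` and `normalize_bm25` are hypothetical names, and joining the quoted tokens with a single space (FTS5's implicit AND) is an assumption; the commit does not say how tokens are combined.

```rust
/// Build an FTS5 MATCH expression from user input.
/// Returns None for an empty query so the caller can short-circuit.
fn fts5_match_expr(query: &str) -> Option<String> {
    let trimmed = query.trim();
    if trimmed.is_empty() {
        return None;
    }
    // Single-quote-wrapped input passes through as raw FTS5 syntax.
    if trimmed.len() >= 2 && trimmed.starts_with('\'') && trimmed.ends_with('\'') {
        return Some(trimmed[1..trimmed.len() - 1].to_string());
    }
    // Otherwise quote every whitespace token, doubling inner double quotes.
    let quoted: Vec<String> = trimmed
        .split_whitespace()
        .map(|token| format!("\"{}\"", token.replace('"', "\"\"")))
        .collect();
    Some(quoted.join(" "))
}

/// score = -bm25 / (1 + |bm25|). FTS5 returns negative bm25 for matches
/// (more negative is better), so the result is positive and below 1.
fn normalize_bm25(bm25: f64) -> f64 {
    -bm25 / (1.0 + bm25.abs())
}
```

Quoting every token means operators such as `AND` or `NEAR` in ordinary input are treated as literals. The single-quote escape hatch is the only route to raw FTS5 syntax.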
altair823	94bfc50efd	feat(p2-1): chunks_fts virtual table + sync triggers (V002 migration) Adds FTS5 lexical index for chunks per design §5.5: chunks_fts virtual table (unicode61 remove_diacritics 2 tokenizer, contentless w/ UNINDEXED chunk_id+doc_id) plus chunks_ai/chunks_ad/chunks_au triggers that mirror every chunks mutation into chunks_fts inside the host transaction. V002 ships the verbatim §5.5 SQL block plus a one-shot backfill INSERT so existing P1 databases gain searchability without re-ingest. Refinery bookkeeping makes double-apply naturally idempotent. Adds rebuild_chunks_fts(&Connection) escape hatch for kb index --rebuild-fts (CLI wiring deferred to a later task). Uses SAVEPOINT instead of Transaction so callers can invoke from inside an outer transaction; WAL serializes writers so no DELETE/INSERT race vs. concurrent chunks mutators is possible. Tests (10): V001-only → V002 cold-upgrade backfill (literal path), chunks_ai/ad/au trigger sync, MATCH-token verification, rebuild idempotency, drift recovery, double-run no-op, V002 ↔ design §5.5 verbatim diff guard (anchored extraction from both files), WAL/SHM release on store drop. All 185 workspace tests pass. Allowed deps respected (kb-core, kb-config, rusqlite, refinery — no new deps). FTS query implementation deferred to p2-2 (lexical-retriever). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 04:42:15 +00:00
altair823	b7367dedfe	p1-6: doc-only TODO markers (section_label, doc_version invariant) M9: add a `TODO(P2/P3)` comment near the NULL persistence at documents.rs (put_chunks). The `section_label` column exists in the §5.5 DDL but neither the in-memory Chunk struct nor the §2.6 wire schema carries the field, so NULL is the correct canonical value today — flag the future-bump intent in-line rather than leaving it implicit. M10: add a one-line invariant comment near the i64 -> u32 narrowing for `doc_version` in `get_document`. The invariant is documented at the write site (UPSERT bumps by 1 per re-ingest) — restate it at the read site so the cast is not silently load-bearing. No behaviour change. No tests touched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 17:34:17 +00:00
altair823	15b4d80efc	p1-6: rename StoreError::Sqlx -> Sqlite, drop dead assets_root helper M1: `Sqlx` is a misleading leftover — this crate uses `rusqlite`, not sqlx. Rename the variant (and the doc reference to it) to `Sqlite`. No external pattern matches; the variant is reached only via `#[from]`. M11: `assets_root` was an `#[allow(dead_code)]` helper introduced for a test that never landed. Delete it so the dead-code allow goes with it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 17:33:50 +00:00
altair823	e41279de96	p1-6: harden store boundary (atomic asset write, poison-tolerant mutex, AssetId validation) Three Important review fixes on the kb-store-sqlite write path: I1. Atomic asset write. put_asset_with_bytes now stages bytes to `<final>.tmp.<pid>.<n>`, fsyncs, UPSERTs the row, then `rename`s into place (atomic on POSIX same-fs). On any failure between staging and rename we best-effort `remove_file` the temp so the previous orphan risk on UPSERT failure is gone. Reference mode is unchanged (no I/O, no orphan risk). I2. Poison-tolerant mutex. New `lock_conn` helper does `.lock().unwrap_or_else(|p| p.into_inner())`, so a single panic mid-transaction no longer poisons every subsequent store call. The rusqlite Transaction Drop already rolls back on panic, leaving the Connection state safe to reuse. All 13 prior `.expect("sqlite mutex poisoned")` sites in store.rs / documents.rs / jobs.rs now route through `lock_conn`. I3. AssetId shape validation. `kb_core::AssetId(pub String)` lets a hand-construction bypass the `FromStr` 32-hex invariant. Added `validate_asset_id` (32 ASCII hex chars) at every store entry that turns an AssetId into a path: `put_asset_with_bytes` and `DocumentStore::put_asset`. This shuts a potential path-traversal via `assets_path_for`'s `&id[..2]` shard slice. Tests: - `put_asset_with_bytes_orphan_cleanup_on_upsert_failure` — pre-seeds a row that takes the same `workspace_path` (UNIQUE), so the UPSERT trips a constraint not covered by `ON CONFLICT(asset_id)`. Asserts no final file and no leaked `.tmp.`. - `put_asset_with_bytes_rejects_invalid_asset_id` — passes `AssetId("../etc/passwd_padded_to_xx_xxxxx")` (32 chars, contains `/`). Asserts error and zero filesystem artifacts under `data_dir/assets/`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 17:33:19 +00:00
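Fixes I2 and I3 from the commit above can be sketched with the standard library. The helper bodies are assumptions: `lock_tolerant` is generic where the commit's `lock_conn` wraps a rusqlite connection, and `validate_asset_id` here takes a `&str` and returns a `String` error where the real function works on `AssetId` and the crate's own error type.

```rust
use std::sync::{Mutex, MutexGuard};

/// Take the lock even if a previous holder panicked.
/// A poisoned mutex still guards valid data when the protected resource
/// can recover by itself (the commit relies on Transaction rollback on drop).
fn lock_tolerant<T>(mutex: &Mutex<T>) -> MutexGuard<'_, T> {
    mutex.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}

/// Accept only 32 ASCII hex characters, so the id can be sliced and
/// joined into a filesystem path without traversal risk.
fn validate_asset_id(id: &str) -> Result<(), String> {
    if id.len() == 32 && id.bytes().all(|b| b.is_ascii_hexdigit()) {
        Ok(())
    } else {
        Err(format!("invalid asset id: {id:?}"))
    }
}
```

The 32-character traversal string from the commit's test passes a length check but fails the hex check, so the validation has to cover both.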
altair823	efdb71b1c3	p1-6: list_documents filter coverage test Round-trip three docs (en/ko, varied tags, varied trust) and exercise each DocFilter axis: default (all), lang, path_glob (workspace_path GLOB), tags_any (intersection via document_tags subquery + per-row tag hydration), and trust_min (Primary > Secondary > Generated rank gate).	2026-04-30 17:14:17 +00:00
altair823	111f40ddf0	p1-6: kb-store-sqlite test suite (8 categories) All 8 test categories from the task plan, plus a JobRepo subset: migration — tests/migration.rs: fresh DB after run_migrations exposes every required §5 table + index. unit (copy) — tests/asset_writer.rs: copy mode writes file with mode 0o644 + correct bytes. unit (ref) — tests/asset_writer.rs: reference mode does not write file; row records source path. unit (cs) — tests/asset_writer.rs: tampered checksum returns a Conflict-flavoured anyhow error. unit (idem) — tests/idempotency.rs: same put_document twice → 1 row, doc_version 1→2; tags re-derived. unit (rb) — tests/idempotency.rs: put_blocks with FK violation rolls back; pre-existing rows unchanged. contract — tests/contract_roundtrip.rs: drives kb-parse-md + kb-normalize + kb-chunk on fixtures/markdown/code-and-table.md, persists, then reloads via DocumentStore::get_document / get_chunk and asserts byte-equal round-trip. snapshot — tests/ingest_report_snapshot.rs + snapshots/ingest_report.snapshot.json: pin the wire JSON form of kb_core::IngestReport for an inline fixture run. jobs — tests/jobs.rs: create → progress → finish flow; error message round-trip; list filters on status/kind. Drops the unused `serde` direct dep from Cargo.toml; serde_json brings its own. Dev-deps confirmed via `cargo tree -p kb-store-sqlite --depth 1` to live only in the dev tree.	2026-04-30 17:13:03 +00:00
altair823	a3390d5171	p1-6: scaffold kb-store-sqlite crate + V001 full §5 DDL New workspace member crate `kb-store-sqlite` (allowed deps only: kb-core, kb-config, rusqlite[bundled], refinery, serde, serde_json, time, blake3, tracing, anyhow, thiserror; dev-deps add kb-parse-md / kb-normalize / kb-chunk for the contract round-trip test). Migration V001 replaces the P0-1 stub with the full §5 DDL (assets, documents, document_tags, blocks, chunks with policy_hash, embedding_records, jobs, ingest_runs, answers, eval_runs, eval_query_results) plus the §5 indexes. FTS5 virtual table + triggers remain deferred to V002 (P2-1). Public surface per task spec: SqliteStore::open / run_migrations / put_asset_with_bytes impl DocumentStore for SqliteStore (7 trait methods) impl JobRepo for SqliteStore (4 trait methods) StoreError { Sqlx, Migration, Conflict } Behavior: - Pragmas at open: foreign_keys=ON, journal_mode=WAL, synchronous=NORMAL, temp_store=MEMORY. - Asset writer: byte_len ≤ copy_threshold_mb * 1MiB → copy to data_dir/assets/<aa>/<asset_id> (mode 0o644 on Unix), else reference. blake3(bytes) verified against asset.checksum; mismatch → Conflict. - Idempotency: put_document UPSERTs and bumps doc_version + 1 on conflict; put_blocks / put_chunks DELETE-then-INSERT; document_tags re-derived per put_document. - get_document rehydrates blocks via payload_json ordered by stream ordinal. - list_documents builds dynamic WHERE from DocFilter (lang / trust_min / path_glob via GLOB / tags_any via document_tags subquery). - JobRepo: jobs.kind/status are stored as lowercase enum tags; create mints a 32-hex JobId via blake3(kind || payload || nanos). Tests follow in subsequent commits.	2026-04-30 17:08:36 +00:00
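The asset-writer rule in the commit above (copy when byte_len ≤ copy_threshold_mb × 1 MiB, stored under data_dir/assets/<aa>/<asset_id>) can be sketched as two small helpers. The signatures are assumptions for illustration. `assets_path_for` is named in a later commit, but its real signature is not shown in this log, and the `Option` return here is a choice of this sketch.

```rust
use std::path::{Path, PathBuf};

/// Decide between copy mode and reference mode.
fn should_copy(byte_len: u64, copy_threshold_mb: u64) -> bool {
    byte_len <= copy_threshold_mb.saturating_mul(1024 * 1024)
}

/// data_dir/assets/<first two chars of id>/<id>.
/// Returns None when the id is too short to shard, instead of panicking
/// on the slice.
fn assets_path_for(data_dir: &Path, asset_id: &str) -> Option<PathBuf> {
    let shard = asset_id.get(..2)?;
    Some(data_dir.join("assets").join(shard).join(asset_id))
}
```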
altair823	094c4641ba	kb-chunk: populate Chunk.policy_hash field Set the new policy_hash field on every emitted Chunk to the same hex already computed for the chunk_id recipe (§4.2). No recipe / chunk_id change — only the field on the struct is now populated. Pairs with the kb-core hotfix (preceding commit) and unblocks P1-6's DocumentStore::put_chunks to read chunk.policy_hash directly per §5.5.	2026-04-30 17:02:17 +00:00
altair823	16b2a5c150	kb-core: add policy_hash field to Chunk struct (P1-6 schema reconcile) Add policy_hash: String to kb_core::Chunk to align with the §5.5 SQLite schema (chunks.policy_hash NOT NULL), so kb-store-sqlite persistence is a straight field copy rather than a recompute. This is a §9 schema migration: - §5.5 (the persistence schema) is authoritative. - §3.5 (the domain model) must accommodate. The chunker already computed policy_hash for the chunk_id recipe (§4.2); P1-5 stored it implicitly. We now hold it explicitly on the Chunk so any DocumentStore::put_chunks impl can read it directly. Follow-up commits update kb-chunk to populate the field and refresh the P1-5 snapshot baseline accordingly.	2026-04-30 17:02:11 +00:00
altair823	ceeac9f974	p1-5: doc rationale for respect_markdown_headings, policy_hash panic, overlap accounting Doc-only follow-ups for review minors I1, M3, M4. No behavior change. * I1: rustdoc on `MdHeadingV1Chunker` now records that `policy.respect_markdown_headings` flows into `policy_hash` only; the `md-heading-v1` variant unconditionally treats headings as boundaries by name. To disable heading awareness, ship a different `chunker_version` (none in P1-5). * M3: `# Panics` rustdoc on `policy_hash` documents the unreachable-in-practice failure mode of `serde_json_canonicalizer::to_vec` and explains why the `expect` is retained as future-proofing. * M4: Inline comment at the `would_exceed` decision noting that `acc.text_tokens` already includes the prior chunk's overlap seed, and that the I3 clamp guarantees a flush here never produces a chunk smaller than the seed budget. * Heading-path bullet in the behavior contract updated to reflect the I2 fix wording. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:52:18 +00:00
altair823	a81460f9d0	p1-5: clamp overlap seed budget to target_tokens / 2 A pathological `ChunkPolicy { overlap_tokens >= target_tokens }` caused `md-heading-v1` to degenerate into 1-block-per-chunk: the seeded `acc.text_tokens` already exceeded `target_tokens` before any fresh content landed, so the next paragraph immediately tripped the `would_exceed` flush. Clamp the seed budget in `collect_overlap_seed` to `min(overlap_tokens, target_tokens / 2)`. This guarantees that after seeding, the chunk has at least `target/2` worth of room for new content before flushing, restoring the intended paragraph-overlap behavior on every reasonable and unreasonable policy. Adds a regression test pinning a 50/200 (overlap = 4× target) policy and asserting every emitted chunk holds ≥2 blocks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:51:00 +00:00
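The clamp from a81460f9d0, `min(overlap_tokens, target_tokens / 2)`, can be sketched in isolation. The function name `clamped_seed_budget` and the `usize` token type are assumptions; in the crate the logic lives inside `collect_overlap_seed`.

```rust
/// Budget (in tokens) available for seeding a new chunk with overlap blocks.
/// Assumption: token counts are `usize`; the real field types may differ.
pub fn clamped_seed_budget(overlap_tokens: usize, target_tokens: usize) -> usize {
    // Never let the seed consume more than half the target, so at least
    // target/2 tokens of room remain for fresh content before a flush.
    overlap_tokens.min(target_tokens / 2)
}
```

For the pathological policy pinned by the regression test (target 50, overlap 200) the seed budget becomes 25; a reasonable policy such as target 200, overlap 40 is left unchanged at 40.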
altair823	f780c71ce0	p1-5: heading-only chunk carries self in heading_path When a Heading block opens a chunk and is followed by another Heading or an atomic block (Code, Table, ImageRef, AudioRef) with no intervening prose, the prior fallback used `common.heading_path` from the heading itself — which per kb-normalize convention does NOT include the heading's own text. Result: heading-only and heading-led chunks for `# Alpha\n## Beta\n...` patterns landed with `heading_path = []`, losing citation context. Synthesize the leading heading into the chunk's heading_path when blocks[0] is a Heading: parent path + heading.text. The first non-Heading branch (existing logic for normal mid-section chunks) is unchanged. `chunk_id` recipe is `(doc_id, chunker_version, block_ids, policy_hash)` — `heading_path` is not in the recipe, so this fix does NOT shift chunk_ids. Snapshot baseline `long-section.chunks.snapshot.json` also unchanged because every heading in that fixture is followed by a paragraph (the bug only triggers on direct heading→heading or heading→atomic adjacency). Adds `heading_with_parents` test helper and a regression test pinning the `# Alpha\n## Beta\n[code]` pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:50:12 +00:00
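A simplified sketch of the f780c71ce0 rule (parent path plus the heading's own text when a Heading opens the chunk). The `Block` enum here is a hypothetical stand-in with only the fields needed for illustration, and the fallback branch is reduced to "take the first block's path", which is coarser than the crate's actual first-non-Heading logic.

```rust
/// Hypothetical, trimmed-down block type; the real `kb_core::Block` has more variants.
pub enum Block {
    /// `parent_path` excludes the heading's own text, per the kb-normalize convention.
    Heading { parent_path: Vec<String>, text: String },
    /// Any non-heading block, carrying the full path of its enclosing headings.
    Other { heading_path: Vec<String> },
}

/// Compute the heading_path for a chunk made of `blocks`.
pub fn chunk_heading_path(blocks: &[Block]) -> Vec<String> {
    match blocks.first() {
        // A leading heading contributes itself: parent path + heading.text.
        Some(Block::Heading { parent_path, text }) => {
            let mut path = parent_path.clone();
            path.push(text.clone());
            path
        }
        // Simplification: use the first block's path directly.
        Some(Block::Other { heading_path }) => heading_path.clone(),
        None => Vec::new(),
    }
}
```

For the `# Alpha\n## Beta\n[code]` pattern, a chunk opened by the `Beta` heading would carry `["Alpha", "Beta"]` instead of an empty path.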
altair823	58f7b8573d	p1-5: add long-section fixture + Vec<Chunk> snapshot test Bakes the chunker output for fixtures/markdown/long-section.md (3 H1s, nested H2 under Alpha, a 50-line code block, a 3-col x 4-row table, and a multi-paragraph Gamma section) into the JSON snapshot baseline. Confirms the priority rules end-to-end: - Heading boundaries hold across H1 → H2 → H1 transitions - The code block emits one chunk at 427 tokens > target=200 - The table stays single-chunk - Gamma's paragraph stream splits with one block of overlap seed A second test runs the full parse → normalize → chunk pipeline 5 times and asserts identical chunk_ids each pass. Drops the unused `kb-config` and `serde` from regular dependencies — they were declared but no source path imports them; `serde` flows in transitively via `kb-core` as a public API requirement, and `ChunkingCfg` lives in `kb-config` but the chunker takes `ChunkPolicy` directly. Production deps are now exactly the allowed set actually used: anyhow, blake3, kb-core, serde_json_canonicalizer, tracing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:33:29 +00:00
altair823	0237022d0e	p1-5: implement md-heading-v1 chunking rules Fills in MdHeadingV1Chunker::chunk with the priority-ordered ruleset from the design (§0 / §14): 1. Heading is a hard boundary; the heading itself starts and is included in its chunk so heading text is retrievable. 2. Code blocks never split, regardless of size. 3. Tables stay single-chunk (row-split deferred per task spec). 4. Long sections split at target_tokens with paragraph-level overlap_tokens worth of seeded tail blocks. 5. ImageRef / AudioRef each become their own chunk with token_estimate = 0. 6. heading_path lifts from the first contributing non-Heading block; source_spans concatenate in document order. 7. chunk_id derives from id_for_chunk(doc_id, chunker_version, block_ids, policy_hash) per §4.2. Covers the unit + determinism rows of the P1-5 test plan: heading boundary respected, 800-token code block stays single, small table stays single, long paragraph chain splits with overlap, ImageRef chunk has token_estimate=0, and 1000-iter chunk_id determinism. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:30:26 +00:00
altair823	8142449eb7	p1-5: scaffold kb-chunk crate with MdHeadingV1Chunker skeleton Adds the new workspace member with the bare Chunker impl shape: chunker_version() returns "md-heading-v1"; policy_hash() blake3-hashes canonical JSON of ChunkPolicy and truncates to 16 hex chars; chunk() is an empty stub the next commits fill in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:27:42 +00:00
altair823	557275c04e	p1-4: doc-only follow-ups for deferred review minors (M8, M9, M11, M12) M8: kb-parse-md frontmatter doc-comment claimed filename fallback was P1-4's job; P1-4 spec did not include it. Reconcile: defer to a later phase (P1-7 / kb-app integration) where the workspace_path filename is known to the caller. Updated comment in build_metadata(). M9: kb-parse-md tests use the #[ignore] regenerator pattern, while kb-normalize's integration test uses an UPDATE_SNAPSHOTS=1 env-var. Migrating kb-parse-md is out of scope; one-line note added to blocks_snapshots.rs mod doc-comment to flag the intentional split. M11, M12: doc-only comments in lift_block (already added in the previous commit) — list-item shared block_id rationale and the intentional camel-case Debug-format for WarningKind in Provenance notes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:42:02 +00:00
altair823	e0df42984e	p1-4: address review I1-I3 + minors (extract attribution, audio-ref skip, NFC heading_path) I1: warning_agent maps ExtractFailed → "kb-parse-md" (the panic-recovery emitter in kb-parse-md/src/blocks.rs). Lift-stage warnings from build_canonical_document are tracked separately and attributed to "kb-normalize", so the I1 mapping change does not lie about kb-normalize-originated drops. I2: ParsedPayload::AudioRef no longer synthesizes Block::AudioRef with an invalid empty AssetId (would violate AssetId::from_str's 32-hex invariant). Block is dropped, Warning surfaces in Provenance with src mention, attributed to kb-normalize (lift-stage decision). TODO(P8) comment marks this as a placeholder until the audio extractor lands. I3: NFC-normalize each heading_path string in lift_block before feeding into id_for_block AND into CommonBlock.heading_path. pulldown-cmark does not NFC heading text and serde_json_canonicalizer v0.3 does not either, so canonically-equivalent NFD/NFC inputs would produce different block_ids without this normalization. Mirrors the existing doc_id NFC handling via to_posix. Minors: - M4: trim Cargo.toml — drop kb-config, serde_json_canonicalizer, blake3 (unused); keep tracing (now wired) + unicode-normalization (now used by I3). - M5: determinism_1000_iterations_under_1s now uses the same 5-block fixture as block_ordinals_scoped_per_heading_and_kind (extracted into fixture_blocks_five helper) so the determinism property is exercised on a real lift_block path, not just an empty Vec. Still < 1s. - M6: snapshot integration test now passes BodyHints { first_h1: Some("Code And Table"), .. } and asserts doc.title == "Code And Table" end-to-end. Baseline JSON updated. - M7: title/lang edge-case unit tests pin policy: empty string lifts to empty string; non-stringy values silently drop. Rustdoc updated. - M10: provenance_contains_stage_events_in_order asserts events[1].at == events[2].at to pin the shared-now_utc invariant. 
New tests (unit, kb-normalize): - provenance_with_extract_failed_warning_attributes_to_kb_parse_md (I1) - audio_ref_block_skipped_with_warning (I2) - nfc_nfd_korean_heading_path_same_block_id (I3) - title_empty_string_in_user_map_falls_back_to_default (M7) - title_non_string_in_user_map_silently_drops (M7) - lang_invalid_shape_silently_drops (M7) kb-normalize unit tests: 9 → 14. Integration snapshot: 1 (unchanged). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:41:50 +00:00
altair823	5352bede5c	p1-4: snapshot + determinism tests Add the integration snapshot test pinning the full `CanonicalDocument` JSON for `fixtures/markdown/code-and-table.md` (run through the real `kb-parse-md::parse_frontmatter` + `parse_blocks`, dev-dep only). Non-deterministic `provenance.events[*].at` for the Parsed and Normalized events is stripped before comparison; the Discovered event's `at` is pinned by constructing the test `RawAsset` with a fixed `discovered_at`. Run with `UPDATE_SNAPSHOTS=1` to regenerate. Add the 1000-iteration determinism property: same inputs ⇒ byte- identical JSON (modulo the same stripped timestamps), in under one second of wall-clock time. A regression in canonical JSON, BLAKE3 hashing, ordinal counting, or any other deterministic field would surface here immediately. The integration test depends on `kb-parse-md` only as a dev-dep, so `cargo tree -p kb-normalize --depth 1 --edges normal` confirms no parser implementation appears in the production dep tree per design §8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:20:18 +00:00
altair823	1cc0ba9f37	p1-4: provenance + title/lang lift Build a `Provenance` with one event per pipeline stage (`Discovered` sourced from `RawAsset.discovered_at`, then `Parsed` and `Normalized` stamped with one shared `now_utc()` reading), plus one `Warning` event per upstream warning. Sharing `now` between Parsed and Normalized bounds intra-call timestamp jitter — event ordering is preserved by `Vec` position regardless. Warning agents are routed back to the upstream component (`kb-parse-md` for parse warnings, `kb-normalize` for `ExtractFailed`). Lift `metadata.user["title"]` and `metadata.user["lang"]` (where P1-2 stashes them since the `Metadata` struct itself does not carry those fields) into `CanonicalDocument.title` / `CanonicalDocument.lang`. Both keys are removed from the user map after lifting so the wire form does not duplicate the data; missing keys default to empty string / empty `Lang`. Other user-map keys survive. Tests pin the event ordering, the warning routing, and the lift behavior (including non-duplication in `metadata.user`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:19:33 +00:00
altair823	fc05f3a2be	p1-4: build_canonical_document core + ID assignment Implement the §4.3 ordinal rule and §3.4 block lift. Each `ParsedBlock` maps to a `kb_core::Block` variant carrying a `CommonBlock` whose `block_id = id_for_block(doc_id, payload_kind, heading_path, ordinal, source_span)`. Ordinals are scoped to `(heading_path, payload_kind)`, 0-based, in document order — three paragraphs under one H1 get 0/1/2, a code block under the same H1 starts fresh at 0, a paragraph under a different H1 also starts at 0. `payload_kind` is the lowercase-no-spaces convention from §4.2: "heading", "paragraph", "list", "code", "table", "quote", "imageref", "audioref". `ListBlock.items` re-uses the parent list's `CommonBlock` per §3.4 (no per-item BlockId is allocated). `AudioRefBlock` placeholder fields (`asset_id`, `duration_ms`) are filled in by P8 — for now we synthesize the minimal record so the document is well-typed. Tests pin the four §4.4 ID properties (1000-iteration determinism, NFC ≡ NFD Korean path, `./a/b.md` ≡ `a/b.md`, ordinal grouping). Provenance and title/lang lift land in the next commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:18:19 +00:00
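The ordinal scoping described in fc05f3a2be (0-based, per `(heading_path, payload_kind)`, in document order) can be illustrated with a counter map. `assign_ordinals` and its tuple input are hypothetical; the crate derives these values while lifting `ParsedBlock`s.

```rust
use std::collections::HashMap;

/// Assign a 0-based ordinal to each block, scoped to (heading_path, payload_kind).
/// Input is a hypothetical `(heading_path, payload_kind)` pair per block, in document order.
pub fn assign_ordinals(blocks: &[(Vec<String>, &str)]) -> Vec<u32> {
    let mut counters: HashMap<(Vec<String>, String), u32> = HashMap::new();
    blocks
        .iter()
        .map(|(path, kind)| {
            // Each distinct (path, kind) scope keeps its own running counter.
            let next = counters
                .entry((path.clone(), kind.to_string()))
                .or_insert(0);
            let ordinal = *next;
            *next += 1;
            ordinal
        })
        .collect()
}
```

Three paragraphs under one H1 receive 0, 1, 2; a code block under the same H1 starts again at 0, as does a paragraph under a different H1.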
altair823	c0096ce44b	p1-4: scaffold kb-normalize crate Add the workspace member, `Cargo.toml` with the §8-allowed dep set (kb-core, kb-parse-types, kb-config, serde, serde_json_canonicalizer, blake3, unicode-normalization, time, anyhow, tracing) and a stubbed `build_canonical_document` that pins the public signature plus `doc_id` derivation. `kb-parse-md` is permitted only as a dev-dep so the integration snapshot test (added later in this series) can drive a fixture through the real parser without violating the production boundary — `cargo tree -p kb-normalize --depth 1 --edges normal` confirms no parser implementation appears in the regular dep tree. `id_for_doc` and `id_for_block` are re-exported from kb-core (which holds the canonical recipe per §4.2); kb-normalize is the canonical entry point per design §8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:16:53 +00:00
altair823	cfccb3687d	p1-3: update kb-parse-md callers + drop BlockView projection in snapshots Mechanical sweep over `Inline::Text(_)` / `Code(_)` / `Strong(_)` / `Emph(_)` construction and match sites under the new struct-variant shape introduced in the previous commit. `Inline::Link { text, href }` is unchanged. The snapshot test in `tests/blocks_snapshots.rs` previously projected `ParsedBlock` into a `BlockView`/`PayloadView` shim because the old `Inline` could not serialize. With the schema fix in place we now serialize `ParsedBlock` directly through serde — the shim and its `flatten_inline` helper are removed. Inlines surface as structured objects (`{"kind":"text","text":"…"}` etc.). Regenerated `nested-headings.blocks.snapshot.json` to reflect the new shape via the existing `--ignored` emitter; `code-and-table.blocks.snapshot.json` has no inlines and is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:10:54 +00:00
altair823	606ce1cf66	kb-core: hotfix Inline serde schema (struct variants) `#[serde(tag = "kind")]` rejects newtype variants whose payload is not a struct, so 4 of 5 `Inline` variants (`Text(String)`, `Code(String)`, `Strong(Vec<…>)`, `Emph(Vec<…>)`) failed to serialize at runtime — only `Link { text, href }` worked. Convert every variant to struct form so the internally-tagged shape is well-formed and round-trips through JSON. Add `inline_serde_round_trip` covering all five variants. Per design §9, this is a wire-schema migration; no `docs/wire-schema/v1/*.json` change required since `Inline` is not directly referenced there. Callers in kb-parse-md follow in the next commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:10:40 +00:00
altair823	80123e9e27	p1-3: pin reviewer probe inputs as regression tests The quality reviewer named three specific input probes for the C1/C2/ C3 fixes. Encode each as a verbatim test so future regressions on those exact inputs surface immediately: - probe_overflow: parse_blocks(b"# h\nbody\n", u32::MAX) → empty + Warning::ExtractFailed. - probe_list_escape: list with embedded code block → single List block, two items. - probe_empty_heading: `# \n# Real\nbody\n` → body's heading_path is `["Real"]`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:42:21 +00:00
altair823	2b6d9abc0f	p1-3: doc-comment + test pin Quote drops non-text children `ParsedPayload::Quote { text, inlines }` cannot represent block-level children (lists, code, tables, images), so the BlockQuote end handler silently drops them when assembling the Quote payload. This matches §3.4 for now but is non-obvious and easy to regress without an explicit pin. Add a TODO(P1-future) comment near the Quote emission code and a regression test (`quote_with_list_inside_drops_list`) that fixes the current shape: a `> - item` blockquote produces a Quote with empty text and empty inlines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:41:47 +00:00
altair823	23ff4d68af	p1-3: preserve whitespace in link text across SoftBreak/HardBreak `[multi\nline](http://x)` produced `Inline::Link.text = "multiline"` because the SoftBreak/HardBreak handler called `push_text(" ")` — which updates `paragraph.text` and the inline buffer, but NOT the open link frame's flattened text accumulator. Text events flowed through `push_link_text`; line breaks didn't. Add `push_link_text(" ")` alongside the existing `push_text(" ")` in the break handler so a line break inside `[ ... ](href)` collapses to a visible space rather than disappearing. New tests: - link_with_soft_break_preserves_space_in_text - link_with_hard_break_preserves_space_in_text Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:40:42 +00:00
altair823	73040cab30	p1-3: capture image refs from pulldown-cmark Tag::Image events The previous block-level image detector scanned paragraph source bytes for the literal `![alt](src)` shape. That was fragile in three ways: - `![alt](src "title")` leaked the title into `src` (`src "title"`) - `![alt](<https://x.com/a b>)` kept the angle brackets verbatim - `![]()` had undefined behavior Replace the byte-scan with state on `Frame::Paragraph` that observes the actual `Tag::Image` events from pulldown-cmark: - `image_count` increments on each `Start(Tag::Image)` and `image_src` captures `dest_url` (which already strips angle brackets and excludes the title). - Text events seen while `image_depth > 0` are routed into `image_alt` and suppressed from the inline buffer. - Strong/Emph/Link starts and any non-image text outside the image flag `non_image_text_seen`. At `End(Paragraph)`, the paragraph is lifted to `ImageRef` iff `image_count == 1 && !non_image_text_seen`. The byte-scanner `match_block_image` is removed. New tests: - image_with_title_attribute (title dropped, no leak into src) - image_with_angle_bracketed_url (brackets stripped) - empty_image_alt_and_src (`![]()` pins to empty/empty) Existing image tests (`image_ref_block_captures_src_and_alt`, `inline_image_inside_paragraph_is_dropped`) continue to pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:40:04 +00:00
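The lift condition from 73040cab30 (`image_count == 1 && !non_image_text_seen`) is small enough to sketch directly. `ParagraphState` is a hypothetical stand-in for the state kept on `Frame::Paragraph`; field names follow the commit message, while the types are assumed.

```rust
/// Hypothetical per-paragraph state; the real fields live on `Frame::Paragraph`.
#[derive(Debug, Default)]
pub struct ParagraphState {
    /// Number of `Start(Tag::Image)` events seen in this paragraph.
    pub image_count: u32,
    /// Set when Strong/Emph/Link starts or text outside an image is observed.
    pub non_image_text_seen: bool,
    /// Destination URL captured from the image tag.
    pub image_src: String,
    /// Alt text accumulated while inside the image.
    pub image_alt: String,
}

/// At the end of a paragraph, decide whether it becomes an ImageRef block.
pub fn lifts_to_image_ref(p: &ParagraphState) -> bool {
    p.image_count == 1 && !p.non_image_text_seen
}
```

Under this rule `![]()` alone still lifts (one image, empty alt and src), while a paragraph mixing prose with an inline image does not.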
altair823	d49dbc1926	p1-3: skip empty heading text when building heading_path `# ` (a heading with no following text) used to seed the heading stack with `Some("")`, which then propagated into every child block's `heading_path` as a `""` segment — visibly polluting the path that downstream consumers index by. Filter empty entries from both `heading_path()` and the in-line ancestor collection at heading-end. We deliberately keep `Some("")` in the stack rather than skipping the assignment so the slot remains occupied and a subsequent deeper heading is still positioned correctly relative to its level — only the visible path drops the empty. New tests: - empty_heading_does_not_pollute_path - empty_h1_then_h2_does_not_break_stack Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:37:40 +00:00
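The approach in d49dbc1926 keeps `Some("")` in the heading stack so the slot stays occupied, and filters empties only when producing the visible path. A minimal sketch, assuming the stack is a slice of `Option<String>` indexed by heading level minus one (the function name and the representation are illustrative):

```rust
/// Build the visible heading path from a level-indexed heading stack.
/// `None` means no heading at that level; `Some("")` means an empty heading
/// that still occupies its slot but must not appear in the path.
pub fn visible_heading_path(stack: &[Option<String>]) -> Vec<String> {
    stack
        .iter()
        .filter_map(|slot| slot.as_deref()) // skip unset levels
        .filter(|text| !text.is_empty())    // skip empty headings
        .map(str::to_owned)
        .collect()
}
```

An empty H1 followed by `## Beta` would leave the stack as `[Some(""), Some("Beta")]` and yield the path `["Beta"]`.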
altair823	0050cf32ea	p1-3: route block-level content inside list items into parent inline buffer `emit_block` previously walked the frame stack looking only for a Quote container, falling back to top-level on miss. That caused any block emitted inside a list item — code blocks, images, tables, headings — to escape the list and appear at the top of `blocks`, after the entire list and out of source order. `ParsedPayload::List { items: Vec<Vec<Inline>> }` cannot represent a child block structurally, so the choice is between dropping content and flattening. Extend the reverse-walk to also recognize `Frame::ListItem` and route the block into a textual rendering appended to the item's inline buffer (`flatten_block_into_item`): - Code → fenced text approximation, preserving lang hint + body - Image → `![alt](src)` text - Audio → `[audio](src)` text - Heading → leading hashes + text - Quote → `> text` - Nested List → same rendering as `nested_in_item` flatten - Table → pipe-table approximation Document order is preserved because flattening happens inside the item's frame, before the item closes. New tests: - code_block_inside_list_item_flattens_into_parent - image_inside_list_item_flattens_into_parent - block_content_in_list_preserves_document_order Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:36:27 +00:00
altair823	de9164802b	p1-3: fix span arithmetic overflow + body_offset_lines fuzz `span_for` previously used `u32 + u32` directly, so callers passing a large `body_offset_lines` could panic (debug, then masked by `catch_unwind` and the entire body discarded) or wrap to an inverted span with `start > end` (release). Switch to `checked_add`; on overflow flag the walk state and at the end of `parse_blocks_inner` discard accumulated blocks and surface a single `Warning::ExtractFailed` carrying the offending body line. This degrades cleanly without panicking and without emitting a silently-broken span. Also extend `random_bytes_do_not_panic` to mix u32::MAX-style offsets across the fuzz iterations so the overflow path is exercised by the randomized corpus. New tests: - body_offset_lines_max_returns_extract_failed - body_offset_lines_zero_at_max_minus_one_no_overflow Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:34:58 +00:00
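The `checked_add` switch in de9164802b can be sketched as below. `span_for` is named in the commit, but this signature and the `LineSpan` struct are assumptions; the real function flags walk state rather than returning `Option`.

```rust
/// Hypothetical line span; the real span type in kb-parse-md may differ.
#[derive(Debug, PartialEq, Eq)]
pub struct LineSpan {
    pub start: u32,
    pub end: u32,
}

/// Shift a body-relative line range by `body_offset_lines`.
/// Returns `None` on overflow instead of panicking (debug builds)
/// or wrapping to an inverted span (release builds).
pub fn span_for(body_offset_lines: u32, rel_start: u32, rel_end: u32) -> Option<LineSpan> {
    let start = body_offset_lines.checked_add(rel_start)?;
    let end = body_offset_lines.checked_add(rel_end)?;
    Some(LineSpan { start, end })
}
```

A caller would treat `None` as the signal to discard accumulated blocks and surface a single `Warning::ExtractFailed`, as the commit describes.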

74 Commits