kebab

Author	SHA1	Message	Date
altair823	1409eaae51	chore: bump workspace version 0.2.0 → 0.2.1 (p9-fb-25 release). Release bump because user dogfooding / real-world use needs a binary (trigger 1 of the CLAUDE.md release convention). p9-fb-25 changes: - Remove WorkspaceCfg.include (an `include = [...]` entry in an old config is silently ignored, plus a one-time deprecation warning). - IngestItem.warnings now carries the reason when an item is Skipped (e.g. `unsupported media type: .docx`). - New IngestReport.skipped_by_extension: BTreeMap<String, u32> (additive wire, v1 compatible). - Breakdown in the CLI / TUI summary (`5 skipped: 3 docx, 1 txt`). - Supported formats (md / png / jpg / pdf) stated in the kebab init header + README. No caller-breaking changes (every surface change is backward-compat). 730 workspace tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 12:49:10 +00:00
altair823	ce68885d92	chore: bump workspace version 0.1.0 → 0.2.0 for next dev cycle. v0.1.0 release tagged on 2026-05-05: the first official release, covering one complete dogfooding cycle of P0~P4 + P5 + P6~P7 + P9 (UI). The dogfooding follow-ups that came after (p9-fb-21 ~ p9-fb-24) are also included in v0.1.0: - p9-fb-22: TUI input cursor mid-string editing + Ask follow-tail. - p9-fb-23: incremental ingest (skip unchanged docs). - p9-fb-24: TUI status/key bar + Library column headers + PgUp/PgDn. 0.2.0 applies from the next dev cycle. Minor bump rationale (semver pre-1.0): - Wire schema additive (`IngestReport.unchanged`, the `Unchanged` variant of `IngestEvent`): backward-compat is kept, but it is new surface. - New CLI flag (`--force-reingest`). - New SQLite migration V006 (incremental ingest columns). - TUI surface changes (status bar always shown, Library header row, PgUp/PgDn, cursor editing). - IngestOpts struct introduced (AskOpts pattern); the existing wrapper is preserved, so nothing breaks for callers. The existing ~723 workspace tests pass unmodified (CARGO_PKG_VERSION is read dynamically via env!, so the TUI status bar test tracks the version automatically). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 02:28:20 +00:00
altair823	0e408fb1b5	feat(kebab-app + kebab-store-sqlite): p9-fb-19 search LRU cache + corpus_revision. Dogfooding item 15: in the TUI, or anywhere within a single process, repeating the same query re-ran SQLite FTS + Lance + RRF every time; this removes that cost. An in-process LRU cache plus a monotonic corpus_revision counter makes every entry stale automatically whenever an ingest commit occurs. ## Key changes - SQLite V004 migration: `kv (key TEXT PRIMARY KEY, value TEXT) STRICT` + `corpus_revision = '0'` seed. A generic shape, so future scalars can live in the same table. - `SqliteStore::corpus_revision()` / `bump_corpus_revision()`: atomic `UPDATE ... CAST AS INTEGER + 1`. An INSERT-OR-IGNORE runs alongside it (paranoia for the case where the V004 seed is somehow missing). - `kebab-app::ingest_with_config_cancellable`: bumps when `new + updated > 0`; a no-op (skipped-only) reingest keeps the cache. - `App.search_cache: Option<Mutex<LruCache<SearchCacheKey, Vec<SearchHit>>>>`: sized by `config.search.cache_capacity` (default 256, 0 disables). Adds the `lru = "0.12"` workspace dep. - `SearchCacheKey` = `query_norm` (NFKC + trim + lowercase) + `mode` + `k` + `snippet_chars` + `embedding_version` (vector/hybrid only; empty string for lexical) + `chunker_version` + `corpus_revision` snapshot. - `App::search` rewrite: with the cache enabled, lookup → on a miss, call the existing `search_uncached` and then put. If the cache is disabled or the lock fails, straight-line. - `App::search_uncached` (rename of the pre-fb-19 `search` body) + `search_uncached_with_config` facade, entered via CLI `kebab search --no-cache`. - `Config.search.cache_capacity: usize` field; `#[serde(default)]` keeps existing configs compatible. - CLI `--no-cache` flag: for debugging (each CLI invocation is a new process, so it is effectively a no-op, but the spec states it and it stays compatible with future long-lived processes). - Add a `corpus_revision` row to the frozen design §9 versioning table (a different dimension from the existing `index_version` label: the label describes the retrieval shape, corpus_revision acknowledges ingest commits). 
## Tests - `kebab-store-sqlite`: 3 new unit tests (fresh=0, monotonic bump, persist across reopen) - `kebab-app`: 4 new integration tests (cached repeat returns the same hits, NFKC normalization collapses case/whitespace, --no-cache parity, first ingest bumps corpus_revision) - Whole workspace `cargo test --workspace --no-fail-fast -j 1` exit 0 - `cargo clippy --workspace --all-targets -- -D warnings` clean ## Docs - README `kebab search` row: cache behaviour + `--no-cache` note + corpus_revision invalidation mechanism - docs/SMOKE.md: add a `cache_capacity` line to the `[search]` section - HANDOFF: 2026-05-03 entry - spec status planned → in_progress ## Out of scope - patch-and-merge incremental (hard, because RRF normalization is based on the whole hit set) - SQLite-persisted cache (P+) - cache sharing across processes (in-process only; corpus_revision makes cross-process invalidation O(1)) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 05:01:31 +00:00
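The revision-based invalidation described in the commit above can be sketched as follows. This is a minimal illustration, not the kebab implementation: it uses a plain `HashMap` instead of the `lru` crate, normalizes with trim + lowercase only (no NFKC), and the names `SearchCache`, `CacheKey`, and `uncached_calls` are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical cache key: the normalized query plus a snapshot of the
/// corpus revision taken at lookup time.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct CacheKey {
    query_norm: String,
    k: usize,
    corpus_revision: u64,
}

/// Hypothetical in-process cache. A real LRU would also evict old entries.
struct SearchCache {
    entries: HashMap<CacheKey, Vec<String>>,
    corpus_revision: u64,
    uncached_calls: u32,
}

impl SearchCache {
    fn new() -> Self {
        SearchCache { entries: HashMap::new(), corpus_revision: 0, uncached_calls: 0 }
    }

    /// Called after an ingest that committed new or updated documents.
    fn bump_corpus_revision(&mut self) {
        self.corpus_revision += 1;
    }

    fn search(&mut self, query: &str, k: usize) -> Vec<String> {
        let key = CacheKey {
            query_norm: query.trim().to_lowercase(),
            k,
            corpus_revision: self.corpus_revision,
        };
        if let Some(hits) = self.entries.get(&key) {
            return hits.clone();
        }
        // Stand-in for the expensive uncached search path.
        self.uncached_calls += 1;
        let hits = vec![format!("hit for {}", key.query_norm)];
        self.entries.insert(key, hits.clone());
        hits
    }
}
```

Because the revision is part of the key, a bump never has to walk the cache: entries stored under the old revision simply stop matching.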
altair823	702c7c89f7	review(p9-fb-05): address round 1 nits - Narrow `Config.source_dir` to `pub(crate)`, so the invariant ("only from_file / load are legitimate setters") cannot be broken by external mutation. Instead expose `pub fn source_dir(&self) -> Option<&Path>` (read-only) + `pub fn with_source_dir(self, dir) -> Self` (builder); tests / programmatic use go through the builder. - Add `tracing::warn!` to the `current_dir()` failure fallback in `resolve_workspace_root`. When cwd cannot be resolved because of chroot / deleted-cwd / permission problems, it no longer falls back to `./root` silently; a log entry is left. Add `tracing` to kebab-config's deps (workspace dep). 27 tests pass + workspace clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 04:23:16 +00:00
altair823	c462dbf6a4	feat(kebab-tui): p9-fb-11 ask answer markdown rendering. Dogfooding item 9: the TUI Ask answer body showed raw `**bold**` / `# Title` / fenced code as-is, which hurt readability. Fixed by parsing with pulldown-cmark and converting to ratatui Span/Line. ## Key changes - New `kebab-tui::markdown::render(text, &Theme) -> Vec<Line<'static>>`. Built on pulldown-cmark = "0.13" (the version kebab-parse-md already uses). inline: - `**bold**` / `__bold__` → `Modifier::BOLD` - `*italic*` / `_italic_` → `Modifier::ITALIC` - `~~strike~~` → `Modifier::CROSSED_OUT` - `` `code` `` → `Role::Hint` (DIM style; safer than a bg color for terminal compatibility) - `[text](url)` → `Role::CitationMarker` + `Modifier::UNDERLINED` block: - heading H1/H2 → `Role::Heading` (Cyan + BOLD), H3-H6 → `Role::Title` (White + BOLD) - bullet list `-`/`*` → `- ` + per-depth indent - ordered list `1.` → actual number prefix + indent - fenced code block → indented lines + `Role::Hint` - blockquote `>` → left `▎` bar (repeated when nested) + `Role::Hint` - table `| col |` → lines of the form `| col1 | col2 |`, with the `|` separator highlighted in colour - horizontal rule `---` → `─` × 40 - streaming safety: re-parsing on every frame is the spec; the pulldown tokenizer runs at µs/KB, so the cost is negligible. An unterminated `**` (an inline still being typed, not yet closed) is treated by pulldown as Text → the literal `**` is shown as-is (no dropped characters). - `ask::push_turn_lines` integration: markdown rendering is used only for grounded answers. Refusal turns (`Role::Warning` override) and streaming turns (`Role::Hint`) stay raw so the role colour signal is not buried under markdown styling. Body lines are indented to set the answer body apart visually in the transcript. - CLI `kebab ask` output stays raw markdown, for terminal compatibility + stability when piped (plain text without ANSI escapes). 
## Tests (markdown.rs, 14 unit) - empty input → 1 empty line (safe for caller scroll/measure) - plain text → single line + paragraph blank - bold / italic / strikethrough / inline code → the corresponding modifier verified - link → UNDERLINED verified - heading H1 → BOLD text span - bullet list `-` / numbered list `1./2.` → prefix verified - code fence body → per-line indent preserved - blockquote → `▎` prefix - 2x2 table → `|`-separated lines verified - unterminated `**` → no dropped characters (guards against streaming-safety regressions) - composite (heading + para + list + code) → document order preserved. Existing 75 TUI tests + 14 new markdown = 89 pass. clippy clean. ## Docs - README `kebab tui` row: markdown rendering note + CLI stated as raw - HANDOFF: 2026-05-03 entry - spec status planned → in_progress Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 03:33:21 +00:00
altair823	fa02a7c68d	feat: ingest cooperative cancellation (p9-fb-04). Ctrl-C / Esc stops an ingest right away. The current in-flight asset is finished, later assets are not run, IngestEvent::Aborted { partial_counts } is emitted, and Ok(IngestReport) is returned normally (not Err). Partial commits are kept, and the next ingest resumes idempotently. New facade: kebab-app::ingest_with_config_cancellable(.., progress, cancel: Option<Arc<AtomicBool>>). The existing _progress variant is a forwarding wrapper with cancel=None. An atomic load happens at the start boundary of each asset-loop iteration; if true, break + emit Aborted + exit normally. No locks. CLI: new ctrlc crate dep. The SIGINT handler does cancel.store(true) + a stderr hint on the first signal, and std::process::exit(130) (the canonical SIGINT exit code) on the second. The install_sigint_cancel() helper returns an Arc<AtomicBool>, which Cmd::Ingest passes to the facade. TUI: add a cancel: Arc<AtomicBool> field to IngestState (exactly the reshape from the round 1 review). start_ingest creates both and moves clones into the worker. cancel_running_ingest(&app) helper: Esc / Ctrl-C cancel first only while an ingest is running, and quit otherwise. Test: - 3 facade integration (cancel-before / cancel-mid / no-cancel default). - 3 tui lib unit (cancel_running_ingest no-state / in-flight / terminated). Plan update: p9-fb-04 status planned → in_progress. Flipped to completed in a one-line commit after the merge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 21:36:17 +00:00
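The cancellation contract in the commit above (check an atomic flag at each asset boundary, finish the in-flight asset, return `Ok` with partial counts) can be sketched like this. The names `ingest_cancellable` and `Report` are hypothetical stand-ins for illustration; the real facade also emits events and persists results.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical report: how many assets ran, and whether the run was aborted.
#[derive(Debug, PartialEq)]
struct Report {
    processed: usize,
    aborted: bool,
}

/// Processes assets in order, checking the cancel flag once per asset boundary.
/// Cancellation is not an error: the partial report is returned as Ok.
fn ingest_cancellable(
    assets: &[&str],
    cancel: Option<Arc<AtomicBool>>,
    mut on_asset: impl FnMut(&str),
) -> Result<Report, String> {
    let mut processed = 0;
    for asset in assets {
        // Lock-free check before starting the next asset.
        if cancel.as_ref().map_or(false, |c| c.load(Ordering::SeqCst)) {
            return Ok(Report { processed, aborted: true });
        }
        on_asset(asset);
        processed += 1;
    }
    Ok(Report { processed, aborted: false })
}
```

A caller such as a signal handler or key handler only needs a clone of the `Arc<AtomicBool>` and a single `store(true, ...)`; passing `None` gives the non-cancellable behaviour.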
altair823	e613236d60	feat(cli): kebab ingest progress display (p9-fb-02) + p9-fb-01 status flip. Adds two surfaces through which `kebab ingest` shows progress to the user: - Human mode (TTY): indicatif `ProgressBar` on stderr; a spinner during scan, switching to a bar after ScanCompleted, with the message updated for each asset. - Human mode (non-TTY, CI/pipe): the indicatif draw target is hidden and one line at a time is written to stderr (`ingest: scanning`, `ingest: 1/N path`, `ingest: complete (...)`). - `--json` mode: stderr is left empty and line-delimited `ingest_progress.v1` JSON is emitted on stdout. The last line is the existing `ingest_report.v1`, unchanged (backward-compat for external wrappers). Implementation: - New `crates/kebab-cli/src/progress.rs`: `ProgressMode::{Json, Human { tty }}`, `ProgressDisplay` (a background thread drains the channel and renders per mode), `now_rfc3339` helper. Whatever the mode, ts is stamped at wire emit time. - Add `wire_ingest_progress` to `crates/kebab-cli/src/wire.rs`. On top of the serde tag (`kind`), the two fields `schema_version` + `ts` complete the spec §2.4a wire shape. - `Cmd::Ingest` handler: creates an mpsc channel, and while the background thread runs the display, main calls `ingest_with_config_progress`. When ingest returns, the Sender is dropped → the display thread exits normally. After join, the final ingest_report is printed. - New dep: `indicatif` 0.17 (progress bar for TTY only; hidden draw target under non-TTY/--json). Test: - 3 lib unit (mode resolution + RFC 3339 round-trip). - 3 integration (--json line-delimited / non-TTY stderr text / ts+kind verified). 16 PASS overall, 0 regressions. Plan update: - p9-fb-01: status `in_progress` → `completed` (follow-up to the PR #52 merge). - p9-fb-02: status `planned` → `in_progress`. Flipped to `completed` in a separate one-line commit after the merge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 19:57:02 +00:00
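The channel handoff described in the commit above (a display thread drains events until the sender is dropped, then main joins it) can be sketched with the standard library alone. The `IngestEvent` variants and `render_line` below are hypothetical simplifications; the text lines follow the non-TTY format quoted in the commit, and the `(N processed)` suffix is an assumption.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical, simplified progress events.
enum IngestEvent {
    Scanning,
    Asset { index: usize, total: usize, path: String },
    Complete { processed: usize },
}

/// Renders one event as a plain-text progress line (non-TTY style).
fn render_line(ev: &IngestEvent) -> String {
    match ev {
        IngestEvent::Scanning => "ingest: scanning".to_string(),
        IngestEvent::Asset { index, total, path } => {
            format!("ingest: {}/{} {}", index, total, path)
        }
        IngestEvent::Complete { processed } => {
            format!("ingest: complete ({} processed)", processed)
        }
    }
}

/// Runs a fake ingest on the calling thread while a background thread renders.
fn ingest_with_display(paths: &[&str]) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<IngestEvent>();
    // The display thread ends when every Sender has been dropped.
    let display = thread::spawn(move || {
        rx.iter().map(|ev| render_line(&ev)).collect::<Vec<String>>()
    });

    tx.send(IngestEvent::Scanning).unwrap();
    for (i, p) in paths.iter().enumerate() {
        let ev = IngestEvent::Asset { index: i + 1, total: paths.len(), path: (*p).to_string() };
        tx.send(ev).unwrap();
    }
    tx.send(IngestEvent::Complete { processed: paths.len() }).unwrap();

    drop(tx); // closes the channel so the display loop can finish
    display.join().unwrap()
}
```

Dropping the sender is what ends the display thread cleanly; no separate shutdown message is needed.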
altair823	8879bd4de2	chore(p9-fb-06): sync Cargo.lock for tempfile dev-dep + flip spec to completed Two follow-ups after PR #49 (kebab reset) merged: - Cargo.lock: kebab-cli's new dev-dep `tempfile` was committed in the feature PR but the lockfile entry was not regenerated, leaving main with a stale lock. `cargo metadata` regenerates the one-line addition to kebab-cli's dependency list. - tasks/p9/p9-fb-06-data-reset-command.md: status `in_progress` → `completed`, per the plan's task 6 commitment to flip in a separate one-line commit only after the implementation PR merges. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 18:47:23 +00:00
altair823	b6e0ab352f	feat(kebab-tui): P9-4 Inspect pane: doc/chunk detail with collapsible sections. Library Enter / Search 'i' enters Inspect. A single Doc or Chunk view shows 6 collapsible sections: metadata / provenance / blocks (doc) or spans / text / embeddings (chunk). Esc/q returns to the originating pane. Key points: - InspectTarget enum (`Doc(DocumentId) | Chunk(ChunkId)`). - InspectState body (`app.rs`): target / doc / chunk / collapsed HashSet / scroll / return_to / needs_fetch / loading. - `src/inspect.rs`: - `render_inspect`: branches to render_doc / render_chunk by target kind; section headers show a collapse marker (▾/▸). metadata.user JSON is pretty-printed. - `handle_key_inspect`: j/k / Down/Up scroll. PageDown/PageUp 10 rows. c = toggle all sections (v1 simplification). Esc/q = SwitchPane(return_to). - `enter_inspect(state, target, return_to)` helper: the common entry point for Library and Search. - run-loop hook `refresh_inspect`: when needs_fetch is set, lazily calls inspect_doc_with_config / inspect_chunk_with_config. - run.rs: the Pane::Inspect arm runs handle_key_inspect + render_inspect. refresh_inspect runs on every idle tick. SwitchPane(Inspect) lazy init. - Library: Enter calls enter_inspect(Doc(selected)), then SwitchPane. - Search: 'i' (plain modifier) calls enter_inspect(Chunk(selected_hit)), then SwitchPane. Guarded against a collision with typing 'i' ("instance"). 12 tests (`tests/inspect.rs`, TestBackend): Esc uses return_to / q works too / j/k scroll bounds / PgUp PgDn ±10 / c toggles all / no target hint / loading / doc view header+metadata+provenance+blocks / collapse hides body / chunk view text+block_ids / no slot → SwitchPane(Library) / enter_inspect helper sets fields. Spec deviation (HOTFIXES `2026-05-02 P9-4`): - Remove the `render_inspect<B: Backend>` generic (same as P9-1/2/3). - Add the Search `i` key (not in the P9-2 spec; added retroactively in P9-4). - `c` collapses everything at once; the spec's "focus-based selective collapse" is P+. Docs (sync rule): - README: TUI row "4 panes" + Quick start comment. - HANDOFF: one-line summary + Phase status (P9 3/5 → 4/5) + one-line deviation. - HOTFIXES: P9-4 entry. 
- tasks/p9/p9-4 status: completed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 15:41:11 +00:00
altair823	43ff4048e8	feat(kebab-tui): P9-1 Ratatui shell + Library pane. The new crate `kebab-tui` imports only `kebab-app`, following the §8 facade rule. A shell based on Ratatui 0.28 + crossterm 0.28 provides: - `App` struct: config + focus + library + 3 Option sub-state slots (search/ask/inspect; a parallel-safety contract under which p9-2/3/4 fill them from their own modules). Nothing outside p9-1 touches the App definition. - `Pane` enum (Library/Search/Ask/Inspect/Jobs). - `KeyOutcome` (Continue/Quit/SwitchPane/Refresh). - `LibraryState` + internal inner: docs / list_state / filter / filter_edit / needs_refresh / loading / pending_g. - `render_library` (Frame, area, &App): heading/body, toggleable filter overlay; Korean/wide-char widths are computed with unicode-width. - `handle_key_library`: j/k/Down/Up move, gg/G jump to the ends, f filter overlay, / => Search, ? => Ask, Enter => Inspect, q/Esc quit. While the error overlay is shown, any key dismisses it. - Filter overlay: two fields, tags_any (CSV) + lang; Tab cycles, Enter apply → Refresh, Esc cancel. - `ErrorOverlay`: captures the anyhow chain and renders a popup (Clear + red border). - Terminal lifecycle: `TuiTerminal` enters raw mode + alt screen, and Drop restores on exit (including panic), so the user's shell is not left broken. - No async: facade calls are synchronous on the main thread. v1 accepts the brief hang. CLI: add the `kebab tui` subcommand, which takes --config and runs App::new + run. 10 tests (`tests/library.rs`, using TestBackend): - empty library / 3-doc render / q, Esc quit / `/` switches to Search / `?` switches to Ask - Enter on an empty list does nothing / Enter switches to Inspect / j movement (3-step clamp) / f filter overlay → input → Enter Refresh. Test seam: `App::populate_library_for_testing` (#[doc(hidden)]) bypasses the `pub(crate)` inner. The spec's parallel-safety contract is kept as is. Spec deviation (HOTFIXES `2026-05-02 P9-1`): - Remove the `<B: Backend>` generic from `render_library`; Frame in ratatui 0.28 is backend-agnostic. - Add `populate_library_for_testing` (test seam, not official API). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 13:26:24 +00:00
altair823	5f3a37cafa	feat(kebab-app): P7-3 PDF ingest wiring: kebab ingest now handles PDF assets too. Wires the P7-1 (`PdfTextExtractor`) + P7-2 (`PdfPageV1Chunker`) libraries into `kebab-app::ingest_with_config`. Assets that `kebab-source-fs` already classified from `*.pdf` as `MediaType::Pdf` are now indexed as searchable docs. Parallel to the P6-4 image wiring pattern: add a `MediaType::Pdf` arm to `ingest_one_asset`, branching to a new private fn `ingest_one_pdf_asset`. Key behaviour: - per-medium chunker selection: PDF assets hard-code `PdfPageV1Chunker` (compile-time match). `config.chunking.chunker_version` represents markdown only; PDF is always `pdf-page-v1`. The deviation is recorded in HOTFIXES entry `2026-05-02 P7-3`. - encrypted PDF / corrupt PDF → `errors+=1` + P7-1's `qpdf --decrypt` hint preserved verbatim in `IngestItem.error`. - empty/scanned-candidate pages → 0 chunks; P7-1's `Provenance::Warning` passes through unchanged. Not searchable in v1; waits for the P+ scanned-PDF OCR fallback. - determinism stress: no additional `now()` call between extract → chunk (inherits the P6-4 invariant). PDF doc_id and chunk_id are both deterministic. Integration tests (`tests/pdf_pipeline.rs`, 8 passed + 1 ignored): - 3-page text PDF → 1 doc + 3 chunks + Page span verified - identical re-ingest → Updated, same doc_id - encrypted PDF → Error + `qpdf` hint preserved - corrupt-header PDF → Error + not stored - mixed pages (page 2 empty) → 2 chunks + 1 Warning - IngestReport arithmetic invariant - long 50-page PDF → ≥50 chunks - inspect doc → SourceSpan::Page round-trip - (ignored) re-ingest of edited bytes → exposes a storage UNIQUE bug, awaiting a P+ fix. Additional finding (HOTFIXES `2026-05-02 P7-3`): there is a gap between the UNIQUE constraint on `assets.workspace_path` and `upsert_asset_row`, which handles only `ON CONFLICT(asset_id)`. When the bytes change, the new asset_id collides on the same workspace_path. md / image / pdf are all affected. The P7-3 integration tests were the first to expose it. This PR does not fix it; it is a P+ storage task. Adds a PDF section + verification checklist + 4 known behaviours to `docs/SMOKE.md`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 09:28:06 +00:00
altair823	8de08cf38c	review(p7-1): address round 1 findings - Cargo.toml: remove unused deps (`kebab-config`, `thiserror`, `pdf-extract`, dev `tempfile` / `serde_json` / `serde`). In particular, the ~150 transitive crates that `pdf-extract` pulled in (pom, postscript, type1-encoding-parser, adobe-cmap-parser, euclid, chrono, md5, linked-hash-map …) are all gone. Only lopdf remains. - info.rs: fix a decoding bug for PDFDocEncoded Titles without a BOM. `from_utf8_lossy` replaces 0x80–0xFF with U+FFFD, which mangles legacy titles such as "Café". Replaced with a direct byte → `char` cast (Latin-1 decoder). Add regression test `info_dict_title_pdfdocencoding_latin1_high_bytes_decoded`. - info.rs: correct the inaccurate "Latin-1 superset" wording in the module doc; PDFDocEncoding differs from Latin-1 in the 0x18–0x1F / 0x80–0x9F ranges. - lib.rs: add `debug_assert!` where `saturating_sub(1)` silently absorbed the page=0 case. Release builds keep the saturating fallback (a garbled order is operationally preferable to a panic). - tests: close the UTF-16 surrogate pair coverage gap; a title containing 🥙 (U+1F959) verifies the pair-combining path of `String::from_utf16_lossy`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 08:40:40 +00:00
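The Title decoding fix in the review commit above can be illustrated with a small sketch. The function name `decode_pdf_text_string` is hypothetical, and the non-BOM branch treats bytes as plain Latin-1, which (as the commit itself notes) is not exact for PDFDocEncoding in the 0x18–0x1F / 0x80–0x9F ranges.

```rust
/// Decodes a PDF text string, assuming only two cases:
/// a UTF-16BE string with a leading FE FF BOM, or single-byte Latin-1-like text.
fn decode_pdf_text_string(bytes: &[u8]) -> String {
    if bytes.len() >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF {
        // UTF-16BE: combine big-endian byte pairs into code units.
        // A trailing odd byte is ignored by chunks_exact.
        let units: Vec<u16> = bytes[2..]
            .chunks_exact(2)
            .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
            .collect();
        String::from_utf16_lossy(&units)
    } else {
        // Latin-1: every byte maps to the code point with the same value.
        bytes.iter().map(|&b| b as char).collect()
    }
}
```

For the bytes `b"Caf\xE9"`, `String::from_utf8_lossy` yields `Caf` followed by U+FFFD, because 0xE9 alone is not valid UTF-8, whereas the byte-to-`char` cast recovers `Café`.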
altair823	5a158d7343	feat(kebab-parse-pdf): P7-1 text PDF extractor: per-page CanonicalDocument. `PdfTextExtractor` (MediaType::Pdf): lopdf-based per-page text extraction. Emits `Block::Paragraph` + `SourceSpan::Page { page, char_start, char_end }` for each page. Pages whose body is empty or whose extraction panics are marked with an empty paragraph + `Provenance::Warning` ("scanned candidate"), which becomes the input to a later OCR fallback (separate task). Key behaviour: - `lopdf::Document::load_mem` + `is_encrypted()` → encrypted PDFs produce an explicit error (pointing to `qpdf --decrypt`). - Per-page `extract_text(&[page])` is wrapped in `catch_unwind`, turning a malformed-page panic into a recoverable warning. - Best-effort extraction of Title/Producer/Creator from the `/Info` dict. Strings prefixed with a UTF-16BE BOM are decoded too (non-ASCII Titles such as Korean are handled correctly). - 9 integration tests: 3-page emit, scanned-mixed warning, encrypted refuse, corrupt header error, page_count metadata, UTF-16BE Title, filename fallback, determinism, snapshot. `parser_version = "pdf-text-v1"`. Allowed deps: `lopdf 0.32` + `pdf-extract 0.7` (as in the original spec). A multilingual OCR fallback for body text is a §9.2 follow-up task (out of scope). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 08:34:55 +00:00
altair823	ca0567c72b	feat(kebab-app): P6-4 image ingest wiring: kebab ingest now handles PNG/JPEG assets too. Completes the unfinished stretch where the P6-1/P6-2/P6-3 libraries (`ImageExtractor`, `OllamaVisionOcr`, `apply_caption`) were not yet visible from the CLI. `kebab ingest` now indexes image assets end-to-end in addition to markdown, and `kebab search` / `kebab ask` match/cite images by OCR text + caption. ## kebab-app - Add `kebab-parse-image` to `[dependencies]`. - On entry to `ingest_with_config`, build `OllamaVisionOcr` / `OllamaLanguageModel` once per ingest session according to the `image.ocr.enabled` / `image.caption.enabled` flags. They are shared as trait objects across the asset loop. Thanks to the internal Arc of reqwest::blocking::Client, allocation cost is independent of the number of assets. - Bundle the two adapters + ImageExtractor into an `ImagePipeline` struct to stop the parameter list of `ingest_one_asset` from exploding (addresses clippy::too_many_arguments). - Replace the markdown-only guard in `ingest_one_asset` with `match media_type`: Markdown takes the existing path, Image(_) branches to the new `ingest_one_image_asset`, and PDF/Audio/Other are skipped as before. - New `ingest_one_image_asset`: - read bytes → `ImageExtractor::extract` (on failure the caller does errors+=1) - `apply_ocr` (Lenient: on failure, a ProvenanceKind::Warning event + "ocr_failed: ..." in `IngestItem.warnings`; `block.ocr` stays None) - `apply_caption` (same Lenient policy) - call the existing `MdHeadingV1Chunker`; the chunker already emits `Block::ImageRef` as a single chunk - the existing persist + embed sequence unchanged (byte-identical to markdown) - `lang_hint_from_doc`: maps `Lang("und")` or an empty string to None (done ahead of time on the caller side so the image-pipeline adapter's build_prompt does not silently drop "und"). ## kebab-chunk - Replace the `Block::ImageRef` branch of `render_block_text` with the P6-4 (β) plain concat policy: join `[alt, ocr.joined, caption.text]` with `\n\n`, dropping empty parts. If alt is empty, fall back to the basename of `src` (defensive guard for the P6-1 contract). - New unit test `image_ref_p6_4_plain_concat_drops_empty_parts`: verifies all five cases, alt-only / alt+ocr / alt+caption / alt+ocr+caption / empty alt → src fallback. - The existing `image_ref_emits_own_chunk_zero_tokens` passes unchanged; the chunker's per-block dispatch is untouched and only text rendering is updated. ## Integration tests (kebab-app/tests/image_pipeline.rs) Ollama is stubbed with wiremock. 5 cases: 1. 
OCR-only happy path: 1 PNG + ocr.enabled → 1 doc + 1 chunk emitted; `block.ocr.joined` is the mock's "Hello World 2026". 2. OCR + caption both enabled: both fields are filled and the chunk text contains all three parts, alt + ocr + caption. 3. Lenient failure check: on OCR 503 the asset is still indexed (kind=New), `errors=0`, ProvenanceKind::Warning attributed to "kb-app", and an "ocr_failed:" note in `IngestItem.warnings`. 4. Both disabled: even with `image.ocr.enabled=false && image.caption.enabled=false` the asset is indexed as 1 chunk (chunk text=filename), with EXIF + dimensions filled as usual. 5. Determinism (re-ingest): ingesting the same PNG twice gives `Updated` + the same `doc_id` the second time. ## SMOKE.md Add a `kebab search --mode lexical "Hello World"` step to the command sequence. Add `[image.ocr]` / `[image.caption]` config section examples + an ingest time estimate (~5-10 seconds per asset). Pin the "books go through the P7 PDF line" guidance into both the verification checklist and "known behaviours". ## Verification against a real Ollama. Based on 192.168.0.47 + gemma4:e4b: ``` $ kebab --config /tmp/kebab-smoke/config.toml ingest scanned 2 new 2 updated 0 skipped 0 errors 0 (18395 ms) $ kebab inspect doc <image_doc_id> parser_version: image-meta-v1 blocks: [{ alt: "hello.png", ocr: "Hello World 2026", caption: "The image displays the text \"Hello World 2026\" in a large, black, sans-serif font." }] $ kebab --json ask "Hello World 텍스트가 어디에 있나?" --mode hybrid grounded: true citations: [{marker: "[1]", doc_path: "hello.png"}] ``` ## Verification - `cargo test --workspace --no-fail-fast -j 1`: all pass - `cargo clippy --workspace --all-targets -- -D warnings`: pass - `cargo test -p kebab-chunk image_ref`: 2 pass (P1-5 regression + new P6-4 unit) - `cargo test -p kebab-app --test image_pipeline`: 5 pass ## Dependency boundary - `kebab-app` adds `kebab-parse-image`, as in the spec's Allowed deps. - No new forbidden violations (`kebab-tui` / `kebab-desktop` / `kebab-eval` remain unreferenced). - This task adds 0 lines of image-specific business logic; everything is delegated to `kebab-parse-image`. `tasks/p6/p6-4-image-ingest-wiring.md` status: planned → completed. 
contract: docs/superpowers/specs/2026-04-27-kebab-final-form-design.md sections: §3.4 ImageRefBlock, §6.1 ingest pipeline, §7.2 Extractor/Chunker traits, §9.1 image extraction policy.	2026-05-02 07:37:56 +00:00
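The plain concat policy described for `Block::ImageRef` in the commit above can be sketched as a standalone function. The name `render_image_ref_text` and its flat signature are hypothetical; the real `render_block_text` operates on block types not shown in this log.

```rust
use std::path::Path;

/// Joins alt text, OCR text, and caption with blank lines, dropping empty parts.
/// If alt is empty, the basename of `src` is used instead.
fn render_image_ref_text(
    alt: &str,
    src: &str,
    ocr: Option<&str>,
    caption: Option<&str>,
) -> String {
    let alt_or_name = if alt.trim().is_empty() {
        Path::new(src)
            .file_name()
            .and_then(|name| name.to_str())
            .unwrap_or("")
    } else {
        alt
    };

    vec![Some(alt_or_name), ocr, caption]
        .into_iter()
        .flatten() // drop missing parts
        .filter(|part| !part.trim().is_empty()) // drop empty parts
        .collect::<Vec<&str>>()
        .join("\n\n")
}
```

With both adapters disabled, the chunk text is just the alt text or filename, which matches the "both disabled" integration case listed in the commit.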
altair823	cd2213e48d	feat(kebab-parse-image): P6-3 caption adapter: vision LM via trait - Add new module `crates/kebab-parse-image/src/caption.rs`: • `caption_image(llm, bytes, lang_hint, cfg)`: works on top of `&dyn LanguageModel`. A vision LM (e.g. gemma4:e4b) outputs a one-sentence objective description. Deterministic with temperature=0 / seed=0. • `apply_caption(llm, bytes, block, lang_hint, cfg, events)`: fills `block.caption = Some(...)` and appends one ProvenanceKind::CaptionApplied event. With `image.caption.enabled = false` it is a clean no-op (Ok(())). On LM failure, block.caption stays None and no event is recorded. • Downscale: long edge clamped to `[128, 1536]`. The PNG passthrough hot path is preserved; everything else is a single decode + PNG re-encode. • Korean / English prompt branch (lang_hint="ko"/"kor" → Korean). • `ModelCaption.model_version = "<provider>/<prompt_template_version>"` (e.g. "ollama/caption-v1"), so prompt or model regressions can be audited. ## kebab-core / kebab-llm-local changes - Add an `images: Vec<String>` field to `kebab_core::GenerateRequest`. `#[serde(default)]` keeps existing wire payloads / snapshots compatible. - `kebab-llm-local::OllamaLanguageModel` routes req.images to Ollama's `images: [base64, ...]` wire field. With `#[serde(skip_serializing_if = is_empty)]`, the wire shape is byte-identical to pre-P6-3 when the list is empty. ## kebab-config - New `ImageCfg.caption: CaptionCfg`: - `enabled: bool` (default false) - `max_pixels: u32` (default 768, clamped to [128, 1536]) - `prompt_template_version: String` (default "caption-v1") - Add 3 environment variables, `KEBAB_IMAGE_CAPTION_{ENABLED,MAX_PIXELS,PROMPT_TEMPLATE_VERSION}`. ## Spec deviations Entries added to `tasks/HOTFIXES.md` under 2026-05-02: - Symptom 1: the p6-3 spec signature is `&dyn LanguageModel`, but the frozen trait + GenerateRequest did not support vision. → Extend the trait. - Symptom 2: the spec's cargo feature `caption` (default OFF at compile time) → merged into a single runtime gate. No deps are added beyond base64/image/kebab-llm, so the binary-size saving of a cargo feature is negligible. Amendments to the p4-1 / p4-2 / p6-3 specs are noted. 
## Tests `cargo test -p kebab-parse-image --test caption`: 9 cases + 1 ignored: - feature gate (disabled → no-op / Err on direct call) - happy path (block.caption Some + Provenance CaptionApplied) - empty token stream → empty text + caption.is_some() - req.images routing verified with CapturingMock (1 base64 entry, decodable) - Korean / English prompt branch (captured via CapturingMock's system) - LM Err → block.caption stays None + no event recorded - determinism (same mock input → same caption) - max_pixels clamp (99999 → 1536, 4000×3000 PNG downscale verified) - opt-in integration (verified against the real 192.168.0.47 Ollama / gemma4:e4b → "The image is a solid red color.", 4.3 seconds) `cargo test --workspace --no-fail-fast -j 1` passes in full. `cargo clippy --workspace --all-targets -- -D warnings` pass. ## Dependency boundary - Added deps: `kebab-llm` (trait only), `base64` (already added in P6-2). - dev-deps: `kebab-llm/mock` for `MockLanguageModel`, `kebab-llm-local` (integration tests only; not in the runtime deps). - No forbidden violations: `kebab-source-fs / parse-md / normalize / chunk / store-* / embed* / search / rag / UI` remain unreferenced. contract: docs/superpowers/specs/2026-04-27-kebab-final-form-design.md sections: §3.4 ImageRefBlock.caption, §3.7a ModelCaption, §9.1 caption (model-generated, low trust).	2026-05-02 06:05:39 +00:00
altair823	4ed5536c92	feat(kebab-parse-image): P6-2 OCR adapter: Ollama-vision default - Add new module `crates/kebab-parse-image/src/ocr.rs`: the spec's `OcrEngine` trait as is + the `OllamaVisionOcr` default implementation + the `apply_ocr` helper. - `OllamaVisionOcr`: non-streaming call to `<endpoint>/api/generate`, with the image passed in the `images: [base64]` field; the prompt includes the language hint + the whitelisted language list. The response prose is wrapped as `OcrText.joined`, with a single region (confidence 1.0) covering the whole prepared image. Default model `gemma4:e4b`. If endpoint is empty, falls back to `models.llm.endpoint`. - Image preprocessing: when the long edge exceeds `config.image.ocr.max_pixels` (default 1600, clamped to 256~4096), re-encode as PNG (image::imageops::resize, Triangle filter). PNG input within the max is a zero-copy passthrough. - `apply_ocr`: on OCR success, fills block.ocr with Some and appends a ProvenanceKind::OcrApplied event. On failure, block.ocr stays None and no provenance is recorded (no partial-state leakage). - `kebab-config`: new `ImageCfg.ocr: OcrCfg` block (enabled/engine/model/endpoint/languages/max_pixels). `#[serde(default)]` keeps pre-P6 TOML compatible. Add 5 `KEBAB_IMAGE_OCR_*` environment variables. ## Spec deviation The original P6-2 spec designated Tesseract as the default OCR engine, but to avoid installing the `libtesseract-dev` system package on dev / CI hosts, the default was switched to Ollama-vision. The `OcrEngine` trait abstraction is preserved as in the spec; Tesseract / Apple Vision / PaddleOCR adapters can be added later behind feature gates using the same trait. See the 2026-05-02 entry in `tasks/HOTFIXES.md` for details. On trust: a vision LM can hallucinate. The `OcrText.engine = "ollama-vision"` field lets consumers branch on per-engine trust. 
## Tests - New (`tests/ocr.rs`, 8 + 1 ignored): - 200 happy → OcrText decoding (joined / engine / engine_version / region count / bbox / confidence) - empty response → empty regions - 5xx → Err including status + body - 200 error envelope → Err - apply_ocr → block.ocr Some + 1 Provenance OcrApplied - apply_ocr error → block.ocr stays None + no event recorded - 4000×3000 PNG → downscaled to max_pixels=1024, aspect ratio preserved - from_parts max_pixels clamp - opt-in `KEBAB_OCR_INTEGRATION=1` integration (transcription of "Hello World 2026" verified against the real 192.168.0.47 Ollama `gemma4:e4b`) - New (`src/ocr.rs` unit): truncate, build_prompt language/hint handling - `kebab-config` tests +3: defaults, env override, pre-P6 TOML compatibility. Overall: `cargo test -p kebab-parse-image` 28 pass + 1 ignored, `cargo test -p kebab-config` 20 pass, `cargo clippy --workspace --all-targets -- -D warnings` pass. contract: docs/superpowers/specs/2026-04-27-kebab-final-form-design.md sections: §3.4 ImageRefBlock.ocr, §3.7a OcrText / OcrRegion, §9.1 OCR vs caption provenance.	2026-05-02 05:38:24 +00:00
altair823	194dd34668	review(p6-1): address round 1 findings - Cargo.toml: remove unused deps (`serde`, `thiserror`) + remove the duplicate `serde_json` declaration in dev-deps. - src/lib.rs: rename variable `decode_warning` → `dim_warning` (more accurate, since it also covers the branch for exceeding the 16k cap). - src/exif_extract.rs: remove the dead-flexibility `In` argument from `ascii_field` / `u32_field` (every call passed `In::PRIMARY`). Tidy the two-level `if let` into a Rust 2024 let-chain. Unify the EXIF whitelist output keys to snake_case, matching the workspace wire-schema convention (`Make` → `make`, `DateTimeOriginal` → `date_time_original`, etc.). - tests/common/mod.rs: remove the never-called `fake_path` helper + the `Path` import. - tests/extractor.rs: update assertions to the snake_case keys. cargo test -p kebab-parse-image: all 14 pass. cargo clippy -p kebab-parse-image --all-targets -- -D warnings: pass.	2026-05-02 05:11:40 +00:00
altair823	d11a810119	feat(kebab-parse-image): P6-1 image extractor + EXIF whitelist - Add new crate kebab-parse-image (the 19th in the workspace). Implements ImageExtractor, which converts MediaType::Image(_) assets into a single-block CanonicalDocument. - parser_version "image-meta-v1" (§9 versioning). - The body contains exactly one Block::ImageRef; the OCR / caption fields are left as None and filled in by P6-2 / P6-3. - EXIF whitelist (§9.1, minimizing the PII surface): Make / Model / Software / DateTimeOriginal / Orientation / GPSLatitude(+Ref) / GPSLongitude(+Ref). MakerNote / Thumbnail / other tags are discarded. DateTime is converted from EXIF "YYYY:MM:DD HH:MM:SS" → ISO-8601. GPS DMS triple + N/S/E/W ref → signed decimal degrees. - Dimensions: read only the header via image::ImageReader to obtain (w, h, format). Exceeding the 16k×16k cap or a decode failure → metadata.user.dimensions = null + a Provenance Warning event (not Err). Failure to recognize the format itself → anyhow::Error (caller skips). - SourceSpan::Region { 0, 0, w, h } marks the whole image area. Determinism: same bytes + same parser_version → same doc_id + block_id (uses the §4.2 ID recipe as is). - metadata.source_type = Reference, trust_level = Primary, lang = "und". title = filename without extension, alt = filename. - Dependency boundary (§8): kebab-core only + image 0.25 (default features off; png/jpeg/webp/gif/tiff only), kamadak-exif 0.6, anyhow / serde / serde_json / time / tracing / thiserror. kebab-source-fs · parse-md · store-* · embed* · llm* · rag · UI crates are not referenced. - 14 tests (4 unit + 10 integration): • PNG dimension extraction, JPEG EXIF GPS extraction (DMS → decimal conversion accurate to 1e-6), PNG without EXIF → empty map, corrupt PNG → warning + null dims (no panic), unrecognizable bytes → Err, determinism, snapshot, supports() matching, media_type mismatch rejected. • Fixtures are generated in memory (PNG via the image crate; EXIF JPEG by building the EXIF blob with the kamadak Writer and splicing APP1 right after SOI), so no binary fixtures are committed. - HEIC / RAW are out of scope for v1 per the spec (unsupported by the image crate; an Apple Vision sidecar fills this in later, in P+). - tasks/p6/p6-1-image-extractor-exif.md status: planned → completed. contract: docs/superpowers/specs/2026-04-27-kebab-final-form-design.md sections: §3.4 Block::ImageRef + ImageRefBlock, §3.7a OcrText / ModelCaption stubs, §9.1 image extraction policy, §9 versioning.	
2026-05-02 05:05:47 +00:00
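The GPS conversion named in the commit above (DMS triple + N/S/E/W ref → signed decimal degrees) is simple arithmetic; a minimal sketch follows. The function name `dms_to_decimal` and the `char` reference parameter are assumptions for illustration, not the crate's API.

```rust
/// Converts degrees / minutes / seconds plus a hemisphere reference
/// into signed decimal degrees. South and West are negative.
fn dms_to_decimal(degrees: f64, minutes: f64, seconds: f64, reference: char) -> f64 {
    let magnitude = degrees + minutes / 60.0 + seconds / 3600.0;
    match reference {
        'S' | 'W' => -magnitude,
        _ => magnitude,
    }
}
```

For example, 37° 30' 0" N is 37.5, and 127° 0' 36" W is 127 + 36/3600 = 127.01 with a negative sign.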
altair823	911fb49550	refactor(rename): kb crates → kebab: Cargo packages, folders, Rust modules. First step of the project rename `kb` → `kebab`. - workspace `Cargo.toml`: members `crates/kb-` → `crates/kebab-`, repository URL `altair823/kb` → `altair823/kebab`. - 18 crate folders renamed via `git mv` (history preserved). - Each crate's `Cargo.toml`: `name = "kb-"` → `"kebab-"`, path deps `../kb-` → `../kebab-`. - All `.rs` files: the 18 `kb_<id>` snake-case module paths (`kb_core`, `kb_config`, `kb_app`, `kb_cli`, `kb_eval`, `kb_search`, `kb_chunk`, `kb_normalize`, `kb_source_fs`, `kb_parse_md`, `kb_parse_types`, `kb_store_sqlite`, `kb_store_vector`, `kb_embed`, `kb_embed_local`, `kb_llm`, `kb_llm_local`, `kb_rag`) → `kebab_<id>` in one sed pass (using the word boundary \b so the "kb" abbreviation inside English sentences is not touched). The CLI binary name (`[[bin]] name = "kb"`), `KB_*` environment variables, XDG paths, tracing target, and the docs sweep follow in the next commit. ## Verification - `cargo check --workspace` clean; committed after every crate built successfully. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 03:28:08 +00:00
altair823	d9a5b88d27	feat(p5-2): kb-eval metrics + compare: AggregateMetrics, CompareReport, kb eval CLI. P5-2 implementation. On top of the stored eval_runs / eval_query_results: - `kb_eval::metrics`: computes hit@k / MRR / recall@k_doc / citation_coverage / groundedness / empty_result_rate / refusal_correctness. NaN metrics (zero denominator) become JSON null. 4-decimal rounding + an added Deserialize so aggregate_json round-trips. - `kb_eval::compare`: compares two runs → CompareReport (per-metric Δ + per-query Win/Loss/Draw/Regression). On chunker_version drift, a graceful doc-id fallback (chunker_version_match: "fallback_doc"); with the `strict` option it refuses. - `render_report_md`: human-readable Markdown (aggregates + Wins/Losses/Regressions tables). - Add `SqliteStore::{load_eval_run, load_eval_query_results, update_eval_run_aggregate}` + owned `EvalRunRecord` / `EvalQueryResultRecord`; the write-side borrow shape is unchanged. - `kb eval` CLI: `run` (delegates to P5-1), `aggregate <id>`, `compare <a> <b> [--strict-chunker-version] [--write-report]`. `--json` gives the raw CompareReport; the default is Markdown output. ## Spec deviations (intentional, documented) - Graceful matching is doc-id-only (chunker_version_match: "fallback_doc"); 50% span overlap is deferred to P6+ because keeping both sets of chunks after a chunker re-index is not realistic. - Add `*_with_config` helpers: integration tests drive them with a TempDir Config. The no-arg forms delegate to Config::load(None). - The CLI wires kb-cli → kb-eval directly (avoiding a kb-app cycle). The intent of the DoD's "via kb-app" was a single facade, but that creates a cycle. - Add `AggregateMetrics: Deserialize` for the aggregate_json round-trip. ## Verification - `cargo test -p kb-eval` 30/30 (15 unit + 2 loader + 8 metrics+compare integration + 7 runner). 1 of the 8 integration tests is a snapshot (`compare-1.json`). - `cargo test -p kb-store-sqlite` 33/33. - `cargo clippy --workspace --all-targets -- -D warnings` clean. - No forbidden imports (`kb-source-fs|kb-parse|kb-normalize|kb-chunk|kb-store-vector|kb-embed|kb-search|kb-llm|kb-rag|kb-tui|kb-desktop|kb-app`; kb-app is absent from the metrics/compare modules and used only by the runner). 
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 03:05:13 +00:00
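A minimal sketch of two of the aggregate metrics named in the p5-2 commit above, assuming the common definitions (hit@k = fraction of queries whose first relevant hit is ranked within k; MRR = mean of 1/rank). The function names and the `Option<usize>` input shape are hypothetical; only the zero-denominator-to-null and 4-decimal rounding behaviour come from the commit message.

```rust
/// Round to 4 decimals, as the commit describes for stored aggregates.
fn round4(x: f64) -> f64 {
    (x * 10_000.0).round() / 10_000.0
}

/// `ranks[i]` is the 1-based rank of the first relevant hit for query i,
/// or None when nothing relevant was returned (hypothetical input shape).
/// Returns None for an empty query set, mirroring "NaN -> JSON null".
fn hit_at_k(ranks: &[Option<usize>], k: usize) -> Option<f64> {
    if ranks.is_empty() {
        return None;
    }
    let hits = ranks
        .iter()
        .filter(|r| r.map_or(false, |rank| rank >= 1 && rank <= k))
        .count();
    Some(round4(hits as f64 / ranks.len() as f64))
}

/// Mean reciprocal rank; a query without a relevant hit contributes 0.
fn mrr(ranks: &[Option<usize>]) -> Option<f64> {
    if ranks.is_empty() {
        return None;
    }
    let sum: f64 = ranks
        .iter()
        .map(|r| match r {
            Some(rank) if *rank >= 1 => 1.0 / *rank as f64,
            _ => 0.0,
        })
        .sum();
    Some(round4(sum / ranks.len() as f64))
}
```

Returning `Option<f64>` rather than `f64::NAN` lets a serializer emit `null` without special-casing NaN.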
altair823	58a11cc2b8	feat(p5-1): kb-eval crate — golden-fixture runner + eval persistence - new kb-eval crate: load_golden_set (YAML) + run_eval (per-query search/ask + persistence) - new kb-store-sqlite::eval module: record_eval_run_with_results (transactional), document_exists / chunk_exists probes - fixtures/golden_queries.yaml: 5-entry KO+EN template - tests: 13 pass (loader: parse, dup-id, missing chunk_id; runner: elapsed, snapshot, error capture, JSONL, determinism, persistence, config_snapshot) - per_query.jsonl mirror written to runs_dir/<run_id>/ - temperature=0 + fixed seed → byte-identical per_query.jsonl (lexical) deviations from spec (documented in code): - run_id uses uuid::Uuid::now_v7().simple() (timestamp-ordered hex) instead of ULID — uuid already in workspace deps - load_golden_set_validated kept #[cfg(test)] pub(crate) — production inlines validate_against_db - snapshot fixture uses normalized projection (id/query/mode/first_hit) — full byte-determinism covered by separate test - index_version in config_snapshot left null (composed per call by kb-app, not config-level) deferred to follow-up: - App reuse across queries (currently rebuilds App per query) - expand_path hoist to kb-config (3 crate clones now) - --max-queries flag (deferred to P5-2 per updated spec) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 18:01:09 +00:00
altair823	e35b06d0d0	feat(p4-3): kb-rag crate — full RAG pipeline + kb-app::ask wired P4 terminal task. Implements the user-facing payoff: retrieve → score gate → pack → render → generate → cite-validate → persist. After this commit, `kb ask` actually works against an Ollama backend; the pipeline grounds the answer in retrieved chunks and refuses cleanly when the gate trips or the model self-judges. New crate kb-rag: - pub struct RagPipeline { retriever, llm, docs, config } — all Arc<dyn Trait + Send + Sync> so the pipeline shares + Sync. - pub fn ask(query, opts) -> Result<Answer> drives the nine-stage flow per spec §1. - pub struct AskOpts { k, explain, mode, temperature, seed, stream_sink: Option<mpsc::Sender<String>> }. k acts as a floor over config.search.default_k so a low-k caller can't starve retrieval (documented in field doc). Pipeline stages: 1. Retrieve via the injected dyn Retriever. 2. Score gate: empty hits → NoChunks refusal (no LLM call); top-1 < config.rag.score_gate → ScoreGate refusal (no LLM call) with top-3 candidates listed in the synthesized answer text. 3. Pack: budget = config.rag.max_context_tokens.saturating_sub (prompt overhead). Per-hit `[#n] doc=… heading=… span=…\n<text>` with deterministic enumeration. If every hit's chunk is unfetchable from the store (deleted between search and pack), fall back to NoChunks refusal with a tracing::warn rather than feeding an empty [근거] to the LLM. 4. Render rag-v1 prompt with the spec's verbatim Korean system string + `[질문]/[근거]` user template. 5. Generate via dyn LanguageModel. Single-thread token loop owns the iterator; tokens optionally forward to opts.stream_sink (a `mpsc::Sender<String>`). SendError silently dropped — caller cancellation never panics the pipeline. After Done the loop reads (acc, finish_reason, usage) in lockstep with no race. 
max_completion = llm.context_tokens().saturating_sub (used_for_input).max(64) — explicitly NOT capped by config.rag.max_context_tokens (that's the packing budget for [근거], not the LM completion ceiling). 6. Citation extract via STRICT regex `\[#(\d{1,3})\]` (compiled once via OnceLock). Loose forms `[1]`, `[ #1 ]`, `[#foo]`, `[#1234]`, `vec![1]` are all rejected to prevent prose false-positives. 7. Citation validate covers four cases: - unknown marker (e.g. `[#7]` when only 3 packed) → LlmSelfJudge refusal. - empty answer with hits → LlmSelfJudge. - non-empty + no marker + matches `근거 (가\|이) 부족` regex → LlmSelfJudge (model self-refused with the canonical phrase; phrase match logged via tracing::debug for observability). - non-empty + no marker + no refusal phrase → LlmSelfJudge (silent ungrounded answers are still refusals). - non-empty + ≥1 valid marker → grounded = true. 8. Build Answer per kb_core::Answer shape: - citations: filter packed list to exactly the markers cited. Wire format `marker: Some("[1]")` (square-bracketed bare index) per design §2.3, distinct from the prompt-side `[#n]` grammar. - embedding ModelRef: read from config.models.embedding for Vector/Hybrid; None for Lexical. Documented deviation since the Retriever trait doesn't expose the embedder. For ScoreGate/NoChunks refusals on Vector/Hybrid the embedding model is still recorded — the vector retriever WAS consulted even when the gate tripped. - TraceId minted as `ret_<8-hex>` from blake3(query, top_score, model_id, ns). - retrieval AnswerRetrievalSummary populated. - usage from the final Done chunk; latency_ms wall-clock fallback when the LLM reports zero. - created_at OffsetDateTime::now_utc(). 9. Persist via SqliteStore::put_answer (new inherent method on SqliteStore, not on the DocumentStore trait — answers aren't documents and adding to kb-core was forbidden). Always inserts, refusals included. packed_chunks_json is null unless opts.explain == true. 
kb-store-sqlite extension: - pub fn put_answer(&Answer, query, packed_chunks_json) -> Result<AnswerId>. Maps all 22 fields of the answers table per V001 schema in a single INSERT under a transaction. kb-app::ask wired: - bail!("not yet wired (P4-3)") replaced with a real body that builds the retriever per opts.mode (Lexical \| Vector \| Hybrid), instantiates OllamaLanguageModel from config, constructs RagPipeline, calls pipeline.ask. AskOpts moves to kb-rag and is re-exported via `pub use kb_rag::AskOpts` so kb-cli's `use kb_app::AskOpts` keeps working. - kb-app/Cargo.toml gains kb-rag, kb-llm, kb-llm-local. P3-5's forbids on these are lifted by P4-3 spec — kb-app is the orchestrator and ask requires both the trait crate and the Ollama adapter. - kb-cli/main.rs's AskOpts literal updated with stream_sink: None for the CLI path (TUI in P9 will plumb a real sink). Tests (kb-rag: 18; kb-app: 1 ignored): - 3 unit in src/pipeline.rs: marker regex strictness (rejects all loose forms with byte-equal expectations), Send+Sync compile check, embedding_ref_for behavior across modes. - 15 integration in tests/pipeline.rs covering every spec test row + the new "all chunks unfetchable falls back to NoChunks" guard: empty-hits, score-gate, grounded happy path, unknown-marker, prose-`[1]` rejection, `vec![1]` rejection, refusal-phrase, packing-budget overflow, streaming-forwards-to-mpsc, dropped- receiver-no-panic, usage-from-final-Done, answers-row-inserted- for-each-refusal-kind, determinism temp=0 seed=0, Answer JSON shape, unfetchable-chunks-fall-back-to-no-chunks (the new M3 test). - kb-app/tests/ask_smoke.rs: 1 #[ignore]'d real-Ollama smoke that drives the wired ask end-to-end against `localhost:11434`. Workspace: 319 passed / 26 ignored / 0 failed. cargo clippy --workspace --all-targets -- -D warnings clean. 
Allowed deps respected (kb-core, kb-config, kb-search, kb-llm, kb-store-sqlite, serde, serde_json, regex, time, tracing, thiserror) plus forced waivers anyhow (Retriever / LanguageModel trait return types) and blake3 (TraceId minting). Forbidden (kb-source-fs, kb-parse-md, kb-normalize, kb-chunk, kb-store- vector direct, kb-embed* direct, kb-llm-local direct, kb-tui, kb-desktop) all absent from `cargo tree -p kb-rag` — concrete adapters reach the pipeline only through trait objects. Out of scope: reranker between retrieve and pack (P+), multi-turn chat memory (P+), LLM-as-judge eval (P5 uses rule-based must_contain), --json streaming (buffers per §0 Q5 hybrid). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 15:06:10 +00:00
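The strict marker grammar from stages 6 and 7 of the p4-3 commit above can be sketched without the `regex` crate. This is a std-only stand-in for `\[#(\d{1,3})\]`, not the crate's actual code; `extract_markers` and `is_grounded` are hypothetical names, and treating markers as 1-based indices into the packed list is an assumption.

```rust
/// Collect indices from markers of the exact form `[#n]` with 1 to 3 ASCII digits.
/// Loose forms such as `[1]`, `[ #1 ]`, `[#foo]`, `[#1234]`, `vec![1]` yield nothing.
fn extract_markers(text: &str) -> Vec<u32> {
    let bytes = text.as_bytes();
    let mut out: Vec<u32> = Vec::new();
    let mut i = 0;
    while i + 1 < bytes.len() {
        if bytes[i] == b'[' && bytes[i + 1] == b'#' {
            let start = i + 2;
            let mut end = start;
            // Consume at most three digits.
            while end < bytes.len() && end - start < 3 && bytes[end].is_ascii_digit() {
                end += 1;
            }
            // Require at least one digit followed immediately by `]`.
            if end > start && end < bytes.len() && bytes[end] == b']' {
                // `start..end` covers ASCII digits only, so the slice is valid UTF-8.
                if let Ok(n) = text[start..end].parse::<u32>() {
                    out.push(n);
                }
                i = end + 1;
                continue;
            }
        }
        i += 1;
    }
    out
}

/// Grounded only when at least one marker is present and every marker
/// points at a packed chunk (assumed 1-based: 1..=packed).
fn is_grounded(answer: &str, packed: u32) -> bool {
    let markers = extract_markers(answer);
    !markers.is_empty() && markers.iter().all(|m| (1..=packed).contains(m))
}
```

Any answer for which `is_grounded` returns false corresponds to one of the LlmSelfJudge refusal cases listed in the commit.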
altair823	3e38a9bcb4	feat(p4-2): kb-llm-local crate — Ollama HTTP adapter (reqwest::blocking) First real LanguageModel implementation. Wraps Ollama's local HTTP API at POST {endpoint}/api/generate with stream:true, parses the NDJSON streaming response into TokenChunk events, and maps Ollama error states to a thiserror-derived LlmError with actionable hints. Synchronous trait surface; reqwest::blocking handles the HTTP I/O. Public surface: - pub struct OllamaLanguageModel - pub fn new(config: &Config) -> Result<Self> — lazy connect; never hits the network. Spec line 96. - pub enum LlmError { Unreachable, ModelNotPulled, Timeout, Stream, Malformed }. Lives in this crate per spec — kb-core / kb-llm stay free of error taxonomy. - impl kb_core::LanguageModel via re-export from kb-llm. Streaming: - POST body shape per spec §11.2: model, prompt = system + "\n\n" + user, stream: true, options { temperature, seed, num_ctx, stop }. - OllamaStream owns BufReader<reqwest::blocking::Response>, reads NDJSON lines via read_until(b'\n'), parses each as {response, done, done_reason?, prompt_eval_count?, eval_count?, total_duration?}. Token frame → TokenChunk::Token; done frame → TokenChunk::Done { finish_reason, usage }. - done_reason mapping: "length" → Length, "abort" → Aborted, "stop" / missing / unknown → Stop (forward-compat with future Ollama tags). - Missing prompt_eval_count / eval_count default to 0 + tracing::warn (do NOT fail). Spec line 135. - EOF without a done line synthesizes Done { Aborted, zeros } so downstream pipelines never deadlock waiting for a terminal frame. - UTF-8: line-delimited framing means each JSON line is a complete UTF-8 sequence — no cross-HTTP-chunk codepoint splits to worry about. read_until accumulates whole lines regardless of how the underlying reqwest body chunks. Error mapping (LlmError): - reqwest::Error::is_connect() → Unreachable { endpoint, source } with hint "ensure `ollama serve` is running and reachable at <endpoint>". 
- reqwest::Error::is_timeout() → Timeout. - 200 with non-NDJSON first line (e.g., transparent-proxy HTML error page) → Stream(truncated body) — distinguished from Malformed by the iterator's has_emitted flag. - 404 with body containing model_id (case-insensitive) OR English "model" + "not found" → ModelNotPulled(model_id) with hint "ollama pull <model_id>". Tightened beyond spec to survive Ollama localizing the error message (Korean / Japanese / etc.) while keeping the original English-substring fallback. - Other 4xx/5xx → Stream(truncated body). - Mid-stream JSON parse failure (after at least one valid line) → Malformed(line). Truncate all error bodies to 512 chars (chars-based, multibyte safe) so an nginx 500 page can't blow up the diagnostic. - Trailing slash in endpoint stripped before formatting the URL — endpoint = "http://x:1234/" produces .../api/generate, not .../api//generate. Pinned by trailing-slash test. Tokio note: reqwest 0.12's blocking feature internally wraps a private current-thread tokio runtime, so cargo tree --edges normal shows tokio. The auditable invariant is "no top-level tokio dep + no async surface exposed to callers" — verified: src/ has zero async/await/tokio::*. default-features = false drops default-tls (rustls only) but does NOT drop tokio. Documented honestly in Cargo.toml + lib.rs. Switching to ureq would remove tokio entirely; deferred since reqwest is the spec's allowed dep. Tests (24 total: 23 default + 1 ignored): - 7 unit in src/ollama.rs: prompt-build, options-build, finish- reason mapping, truncate_body bounds (under_cap / over_cap_marker / multibyte_chars_not_bytes), 404+model-id heuristic. - 3 in tests/construction.rs: ModelRef shape, context_tokens passthrough, lazy-connect proven via port-1 pointing. 
- 13 in tests/streaming.rs: streamed tokens then Done, multibyte chars within a line round-trip (renamed from "split across chunks" to honestly reflect what's tested), Unreachable-with- hint, 4xx→Stream, 404→ModelNotPulled, concat-equals-canned, done_reason length / abort, missing eval counts default to zero, missing done_reason defaults to Stop, determinism-by-mock, trailing-slash endpoint, non-NDJSON 200 body → Stream not Malformed. - 1 #[ignore] in tests/integration.rs: real Ollama on localhost:11434 with the configured model. Opt-in via cargo test -p kb-llm-local -- --ignored after `ollama serve` + `ollama pull`. Workspace: 288 passed / 25 ignored / 0 failed. cargo clippy --workspace --all-targets -- -D warnings clean. No native-tls, no openssl in the dep graph. Allowed deps respected: kb-core, kb-config, kb-llm, reqwest 0.12 (default-features=false; blocking, json, rustls-tls), serde, serde_json, tracing, thiserror plus anyhow (forced by trait return type). wiremock + tokio in [dev-dependencies] only. Out of scope: llama.cpp / candle adapters (P+), Ollama embed endpoint (separate adapter inside kb-embed-local if requested), cancellation / abort tokens (P+), connection-pool tuning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 14:28:34 +00:00
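A few of the small pure helpers described in the p4-2 commit above can be sketched as follows. The local `FinishReason` enum stands in for the `kb_core` type, and the function names, the `TRUNCATION_MARKER` text, and the exact signatures are assumptions; the mapping rules, the 512-char cap, and the trailing-slash handling are taken from the commit message.

```rust
/// Local stand-in for the kb_core finish-reason type (assumption).
#[derive(Debug, PartialEq)]
enum FinishReason {
    Stop,
    Length,
    Aborted,
}

/// "length" -> Length, "abort" -> Aborted, anything else (including a
/// missing field or an unknown future tag) -> Stop.
fn map_done_reason(done_reason: Option<&str>) -> FinishReason {
    match done_reason {
        Some("length") => FinishReason::Length,
        Some("abort") => FinishReason::Aborted,
        _ => FinishReason::Stop,
    }
}

const MAX_BODY_CHARS: usize = 512;
/// Hypothetical marker text; the commit only says an over-cap marker exists.
const TRUNCATION_MARKER: &str = " [truncated]";

/// Cap an error body at 512 chars (not bytes), so multibyte text is never
/// split in the middle of a code point.
fn truncate_body(body: &str) -> String {
    match body.char_indices().nth(MAX_BODY_CHARS) {
        Some((byte_idx, _)) => format!("{}{}", &body[..byte_idx], TRUNCATION_MARKER),
        None => body.to_string(),
    }
}

/// Strip trailing slashes before appending the path, so "http://x:1234/"
/// does not produce ".../api//generate".
fn generate_url(endpoint: &str) -> String {
    format!("{}/api/generate", endpoint.trim_end_matches('/'))
}
```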
altair823	27c669fbf9	feat(p4-1): kb-llm crate — LanguageModel trait re-export + MockLanguageModel Establishes the kb-llm trait crate so concrete LLM adapters (p4-2 Ollama, future llama.cpp / candle) target a stable surface. Pure re- export of kb_core::{LanguageModel, GenerateRequest, TokenChunk, FinishReason, TokenUsage, ModelRef} plus a feature-gated deterministic mock for downstream RAG tests (p4-3) that need an LLM trait object without an Ollama dependency. MockLanguageModel (cfg(feature = "mock"), default OFF): - Holds canned_response + canned_finish + canned_usage + (model_id, provider, context_tokens). Pure in-memory; no I/O. - generate_stream() honors GenerateRequest.stop: scans every non-empty stop string against the canned response, takes the earliest byte position (Iterator::min returns the first equal element on ties so declaration order in req.stop wins), truncates with a direct byte- slice (str::find returns a UTF-8 char boundary by contract). - When a stop matches, finish_reason is overridden to Stop (matches OpenAI / Ollama real-world behaviour); otherwise the caller's canned_finish passes through verbatim. - Emits one TokenChunk::Token per Unicode scalar value (char), NOT per grapheme cluster — Hangul jamo, emoji ZWJ sequences, combining marks split. Acceptable for trait-shape testing; real adapters MAY combine. Documented in module docs. - Always terminates with TokenChunk::Done { finish_reason, usage } even if the canned response is empty. The returned iterator is a boxed Vec<TokenChunk>::into_iter().map(Ok), trivially Send. - Real adapters MAY return Err from generate_stream itself (e.g. connection refused) before any chunk is yielded; the mock never does. Documented for the trait re-exporter consumer audience. Helpers: - assert_finish_chunk(chunks) — asserts the last chunk is a Done. Useful for proptests asserting trait contract over random inputs. Tests: - cargo test -p kb-llm (no features): 2 reexport / dyn-dispatch tests. 
- cargo test -p kb-llm --features mock: 9 tests including 100-case proptest over random Unicode strings asserting Done terminator, char-count == streamed Token chunks, concat == canned (truncated by stop), plus explicit cases for stop-string truncation, first-stop- match precedence, model_ref dimensions=None invariant, finish reason pass-through. - All 271 workspace tests pass; clippy clean for both default and mock-on feature configurations. Symbol gating verified: - cargo build --release -p kb-llm (default): nm shows zero MockLanguageModel symbols. - cargo build --release -p kb-llm --features mock: three trait-impl symbols present. Spec invariant "release builds MUST NOT include MockLanguageModel" enforced at the symbol level. Allowed deps respected: only kb-core (path) and anyhow (workspace, forced by trait return type). Dropped kb-config / serde / thiserror / tracing from the spec's allowed list — they are listed as Allowed but nothing in this skeleton crate references them, and dropping them keeps the dependency graph slim for downstream consumers. p4-2/p4-3 will add what they need at their own dep sites. Forbidden deps (reqwest, ureq, tokio, whisper-rs, kb-source-fs, kb-parse-md, kb-normalize, kb-chunk, kb-store-, kb-embed, kb-search, kb-rag, kb-tui, kb-desktop) all absent from cargo tree -p kb-llm. Out of scope: real adapter (p4-2 Ollama), token counting against the real tokenizer, server-side cancellation / abort signals (P+). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 13:37:46 +00:00
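The stop-string handling described for MockLanguageModel in the p4-1 commit above can be illustrated with a short sketch. The function names and return shapes are hypothetical; the behaviour (skip empty stops, cut at the earliest byte position, one token per `char`) follows the commit message.

```rust
/// Cut `response` at the earliest occurrence of any non-empty stop string.
/// Returns the kept prefix and whether a stop matched (in which case the
/// finish reason would be overridden to Stop).
fn truncate_at_first_stop<'a>(response: &'a str, stops: &[&str]) -> (&'a str, bool) {
    let earliest = stops
        .iter()
        .filter(|s| !s.is_empty())
        .filter_map(|s| response.find(*s))
        .min();
    match earliest {
        // `str::find` returns a char-boundary byte offset, so slicing is safe.
        Some(pos) => (&response[..pos], true),
        None => (response, false),
    }
}

/// One token per Unicode scalar value, not per grapheme cluster.
fn mock_tokens(response: &str, stops: &[&str]) -> (Vec<String>, bool) {
    let (kept, stopped) = truncate_at_first_stop(response, stops);
    (kept.chars().map(|c| c.to_string()).collect(), stopped)
}
```

Concatenating the returned tokens always reproduces the truncated response, which is the invariant the commit's proptest is described as checking.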
altair823	17d52461b2	feat(p3-5): wire kb-app facade — ingest / search / list / inspect Replaces the P0 `bail!("not yet wired")` stubs in kb-app with real bodies that compose the libraries shipped through P3-4. After this commit, `cargo run -p kb-cli -- index` actually walks the workspace and persists chunks (SQLite + optionally LanceDB), and `cargo run -p kb-cli -- search --mode {lexical,vector,hybrid}` returns real SearchHits with citations. `kb-app::ask` stays stubbed; P4-3 owns it. App lifecycle (crates/kb-app/src/app.rs): - Internal pub(crate) struct App holds the Config plus Arc<SqliteStore> eagerly, with embedder + LanceVectorStore behind OnceLock<Arc<...>> for memoization. First call pays the ~470MB fastembed init / Lance open; subsequent calls return the cached Arc::clone. OnceLock::set race losers fall back to get().cloned() so the lazy-init is concurrent-safe. - One-shot CLI invocations pay the cost once at most. The P9 TUI (which holds an App for the session) gets memoization for free. ingest pipeline (lib.rs): - FsSourceConnector::scan(&scope) → per asset: parse_frontmatter → parse_blocks → build_canonical_document → MdHeadingV1Chunker.chunk → put_asset_with_bytes → put_document → put_blocks → put_chunks. One transaction per document per design §5.8 (kb-store-sqlite's put_* methods own the transactions). - When provider != "none" and dimensions > 0: build embedder once, embed each doc's chunks as Document kind, ensure_table once at the top of the run, then upsert the VectorRecord batch. Lexical-only config (provider == "none") skips both — verified by ingest_provider_none_skips_lance test. - Per-asset parse failures recorded as IngestItemKind::Error with the warning attached; the run continues. Only structural failures (DB unreachable etc.) abort. 
- Aggregate counts (assets_scanned / new / updated / skipped / errors / chunks_indexed / embeddings_indexed / duration_ms) flow into both the JobRepo progress_json AND a dedicated ingest_runs row written via SqliteStore::record_ingest_run (new pub(crate) helper added to kb-store-sqlite — see below). summary_only=true writes items_json=NULL but still populates the count columns. search dispatch: - SearchMode::Lexical → LexicalRetriever directly. - SearchMode::Vector → VectorRetriever with embedder + LanceVectorStore. - SearchMode::Hybrid → HybridRetriever composing the two. - Vector / Hybrid with provider=none returns a clear error naming the config key to flip ("models.embedding.provider"). list_docs / inspect_doc / inspect_chunk delegate straight to DocumentStore trait methods. Returns Err with actionable message on not-found. Test seam: each public free function has a matching #[doc(hidden)] pub fn _with_config(cfg, ...) companion that integration tests invoke directly (the public form internally calls load_config()). pub(crate) would not reach across the integration- tests crate boundary; #[doc(hidden)] keeps it out of rustdoc and the function comment flags it as test-only. kb-store-sqlite additions: - pub struct IngestRunRow + pub fn record_ingest_run on SqliteStore for the kb-app aggregate-counts persistence path. Helper writes the ingest_runs row directly with all aggregate columns; jobs table still gets a JobRepo create/update_progress/finish trio in parallel. Tests (11 default, 2 #[ignore] AVX-gated): - ingest_lexical: round-trip, idempotent, summary_only_drops_items, provider_none_skips_lance (asserts no .lance dir on disk), records_ingest_runs_row_with_aggregate_counts, tags_any filter, inspect_doc_not_found, inspect_chunk_not_found. - search_lexical: lexical hits with embedding_model=None, empty_query_returns_empty, vector_mode_with_provider_none returns clear error. 
- search_vector: hybrid mode end-to-end (#[ignore], AVX), Vector mode embedding_model assertion (#[ignore], AVX). Both run on the AVX VM in ~21s combined (first run pays the model download). - TestEnv pins workspace.root + storage.{data_dir,model_dir} to a TempDir so tests don't touch the user's $HOME/.local/share. - Fixture workspace at crates/kb-app/tests/fixtures/workspace/ has three small markdown files with varied frontmatter (rust+cargo+ python tags) so the tags_any filter test exercises a non-trivial predicate. Workspace 269 passed / 24 ignored / 0 failed (was 261/22). cargo clippy --workspace --all-targets -- -D warnings clean. CLI smoke verified manually: `cargo run -p kb-cli -- index` returns a real IngestReport JSON; `cargo run -p kb-cli -- search "..."` returns hits with citations; `cargo run -p kb-cli -- list docs` lists the indexed documents. Allowed deps respected: kb-source-fs, kb-parse-md, kb-parse-types, kb-normalize, kb-chunk, kb-store-sqlite, kb-search, kb-store-vector, kb-embed, kb-embed-local plus existing tracing / anyhow / serde / toml / dirs and now blake3 (run_id) + time. Forbidden (kb-llm, kb-rag, kb-tui, kb-desktop, kb-parse-{pdf,image,audio}) absent from cargo tree -p kb-app. Out of scope per spec: ask body (P4-3), --rebuild-fts wiring, --resume checkpointing (P+), --watch (P+), TUI / desktop integration (P9 consumes this facade). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 12:11:21 +00:00
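The lazy-init pattern the p3-5 commit above describes for `App` (memoize behind `OnceLock<Arc<...>>`, with race losers falling back to `get().cloned()`) might look roughly like this. `Embedder` here is a hypothetical stand-in struct and the infallible constructor is a simplification; a real init would presumably return a `Result`, which may be why `set` + `get` is used instead of `get_or_init`.

```rust
use std::sync::{Arc, OnceLock};

/// Hypothetical stand-in for an expensive-to-build component.
struct Embedder {
    dims: usize,
}

struct App {
    embedder: OnceLock<Arc<Embedder>>,
}

impl App {
    fn new() -> Self {
        App { embedder: OnceLock::new() }
    }

    /// First call pays the init cost; later calls clone the cached Arc.
    fn embedder(&self) -> Arc<Embedder> {
        if let Some(cached) = self.embedder.get() {
            return Arc::clone(cached);
        }
        // The expensive model load would happen here in real code.
        let fresh = Arc::new(Embedder { dims: 384 });
        match self.embedder.set(Arc::clone(&fresh)) {
            Ok(()) => fresh,
            // Another thread won the race: discard ours, reuse theirs.
            Err(_) => Arc::clone(self.embedder.get().expect("set failed, so a value exists")),
        }
    }
}
```

With this shape a race can run the initializer twice, but every caller ends up holding the same `Arc`.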
altair823	ccd49ef546	feat(p3-4): hybrid-fusion — VectorRetriever + HybridRetriever (RRF) Composes the existing LexicalRetriever (P2-2) with a new VectorRetriever wrapper around LanceVectorStore (P3-3) into a single Retriever that dispatches by SearchMode. For SearchMode::Hybrid, fuses lexical and vector candidates via Reciprocal Rank Fusion and populates the full RetrievalDetail per SearchHit so kb search --explain can attribute scores back to each side. Public surface (kb-search crate): - pub struct VectorRetriever — Arc<dyn VectorStore + Send + Sync>, Arc<dyn Embedder>, Arc<SqliteStore>, IndexVersion at construction. - pub struct HybridRetriever { lexical, vector, fusion, k }. - pub enum FusionPolicy { Rrf { k_rrf: u32 } }. VectorRetriever: - Embeds query.text as EmbeddingKind::Query before delegating to VectorStore::search(query_vec, query.k * 2, &query.filters). Over- fetches by ×2 for filter losses; LanceVectorStore applies the filters internally so they propagate naturally. - Hydrates each VectorHit into a full SearchHit by joining on chunk_id in a single IN-clause batch (no N+1): doc_path, section_label, chunker_version, source_spans for citation, plus embedding_model from embedder.model_id(). - Snippet trimmed to config.search.snippet_chars (vector mode lacks FTS5 highlighting; chunk text prefix is the next-best signal). - Citation built from the chunk's first source span via the shared citation_helper module — extracted from lexical.rs so both retrievers compute citations identically (Byte/empty fallback to Line{1,1} preserved with tracing::warn). - RetrievalDetail.method = Vector for standalone calls; both fusion_score and vector_score set to the LanceVectorStore-shifted cosine score; lexical_* None. HybridRetriever: - Lexical / Vector modes delegate 1:1 — no rebuild of RetrievalDetail. 
- Hybrid mode runs both retrievers with k * 2 fanout, fuses with RRF (score(c) = Σ 1/(k_rrf + rank_m(c))), sorts fused-score DESC with deterministic tiebreaker (lex_rank ASC then chunk_id ASC), takes top query.k. Fusion math runs in f64 throughout; cast to f32 only at the SearchHit boundary where bounded magnitude (≤ ~0.033 at k_rrf=60) makes f32 precision sufficient for ranking. - Per-hit lexical preferred for snippet/citation/heading_path/ chunker_version/embedding_model when the chunk appears in both retrievers — FTS5 highlighting is more user-relevant than vector's truncated text. Vector-only chunks fall through to vector hit data. - index_version returns format!("hybrid:{}+{}", lex_iv, vec_iv) at construction; mismatched lex/vec versions trigger a tracing::warn so users notice stale indexes (spec line 143). kb-search additions: - citation_helper.rs — pub(crate) citation_from_first_span shared between lexical and vector retrievers. Extracted from lexical.rs; no behavior drift. Tests (38 default + 3 ignored): - 12 unit tests in hybrid.rs covering RRF math (1/61 + 1/62 within f32 epsilon × 10 tolerance), lexical/vector mode delegation, hybrid preserves single-side hits with the missing side's RetrievalDetail None, deterministic tiebreaker on identical fused scores, composite index_version, mismatched-version warn at construction. - 2 unit tests in vector.rs covering the snippet-prefix and citation fallback paths. - 11 unit tests in lexical.rs (unchanged from P2-2). - 13 lexical integration tests (unchanged). - 3 #[ignore] AVX-gated hybrid integration tests: disjoint-corpus recall (lex returns A,B; vec returns C,D; hybrid returns all 4), determinism over two queries, snapshot stability against tests/fixtures/search/hybrid/run-1.json. Snapshot fixture was regenerated against this branch on an AVX-enabled VM and contains 4 real chunks (c1/c2 lex+vec, c3/c4 vec-only). 
- KB_UPDATE_SNAPSHOTS=1 path now panics after writing instead of silently passing — matches the P3-2/P3-3 fail-loud-instead-of- silent-pass philosophy. Allowed deps respected (kb-core, kb-config, kb-store-sqlite, kb-store-vector, kb-embed, tracing, thiserror) plus pre-existing kb-search deps from P2-2 (rusqlite, globset, serde_json, anyhow). kb-embed-local does NOT appear — VectorRetriever takes Arc<dyn Embedder> trait object; the concrete adapter is runtime-injected by kb-app. Out of scope: reranker (P+), score calibration across modes (RRF is rank-comparable so absolute calibration is P+), multimodal retrieval (P6+). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:22:21 +00:00
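The RRF step in the p3-4 commit above, score(c) = Σ 1/(k_rrf + rank_m(c)) with a lex_rank-then-chunk_id tiebreaker, can be sketched over plain id lists. `rrf_fuse` and its signature are hypothetical; inputs are assumed to be chunk ids already sorted best-first, with ranks starting at 1.

```rust
use std::collections::HashMap;

/// Fuse two ranked id lists with Reciprocal Rank Fusion and return the top `k`.
fn rrf_fuse(lexical: &[&str], vector: &[&str], k_rrf: u32, k: usize) -> Vec<(String, f64)> {
    // chunk_id -> (fused score, lexical rank; usize::MAX when lexical missed it)
    let mut fused: HashMap<&str, (f64, usize)> = HashMap::new();
    for (idx, id) in lexical.iter().enumerate() {
        let rank = idx + 1;
        let entry = fused.entry(*id).or_insert((0.0, usize::MAX));
        entry.0 += 1.0 / (k_rrf as f64 + rank as f64);
        entry.1 = entry.1.min(rank);
    }
    for (idx, id) in vector.iter().enumerate() {
        let rank = idx + 1;
        let entry = fused.entry(*id).or_insert((0.0, usize::MAX));
        entry.0 += 1.0 / (k_rrf as f64 + rank as f64);
    }
    let mut rows: Vec<(&str, f64, usize)> =
        fused.into_iter().map(|(id, (score, lex))| (id, score, lex)).collect();
    // Fused score DESC, then lexical rank ASC, then chunk id ASC.
    rows.sort_by(|a, b| {
        b.1.total_cmp(&a.1)
            .then(a.2.cmp(&b.2))
            .then(a.0.cmp(b.0))
    });
    rows.into_iter()
        .take(k)
        .map(|(id, score, _)| (id.to_string(), score))
        .collect()
}
```

At k_rrf = 60 the largest possible fused score is 2/61 ≈ 0.0328, which matches the ≤ ~0.033 bound quoted in the commit.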
altair823	3cd5117a7e	feat(p3-3): kb-store-vector — LanceDB VectorStore + V003 embedding status First VectorStore implementation. Per-model Lance tables under config.storage.vector_dir, two-phase upsert (SQLite-pending → Lance MergeInsert → SQLite-committed) with crash-safe retry, search via cosine distance with the spec's score-shift (preserves negative similarity ranking signal that clamping would crush). V003 migration: - Adds status (CHECK constraint pending\|committed\|tombstone, default pending) and vector_committed columns to embedding_records. - BEFORE DELETE trigger on chunks flips dependent rows to tombstone. Currently overshadowed by V001's ON DELETE CASCADE FK; trigger UPDATE runs first then row vanishes via CASCADE. Spec-faithful tombstone preservation requires recreating embedding_records to drop the CASCADE — deferred to a P+ migration since no production rows exist yet (P3-3 is the first writer). V003 SQL comment explains. LanceVectorStore: - ensure_table is idempotent: opens existing or creates with the Arrow schema (chunk_id, doc_id, embedding FixedSizeList<Float32, dim>, model_id, embedding_version, text, heading_path, created_at). - IndexId computed via id_for_index with collection="chunk_embeddings", index_kind="flat", params_hash = blake3(descriptor JSON). Schema bumps automatically rotate the IndexId. - upsert: phase-1 INSERT OR REPLACE INTO embedding_records (status= 'pending') in a single SQLite tx; phase-2 Lance MergeInsert keyed on chunk_id (idempotent re-run); phase-3 UPDATE status='committed', vector_committed=1. If phase-2 fails the rows stay 'pending' and the next upsert call retries idempotently. - search joins embedding_records WHERE status='committed' so partial- write rows never surface. Cosine distance from Lance ∈ [0, 2] → similarity = 1 - distance ∈ [-1, 1] → score = (similarity + 1)/2 ∈ [0, 1]. NaN coerced to 0 with tracing::warn. Filter by SearchFilters via SqliteStore::filter_chunks (added in this commit). 
- Sync trait + async LanceDB bridged by an embedded current-thread tokio runtime. Doc-comment on the struct flags the "do NOT call from inside another tokio runtime" panic (block_on cannot nest). kb-app's job scheduler is sync today. kb-store-sqlite additions: - pub fn put_embedding_records_pending(&[EmbeddingRecordRow]) — phase-1 INSERT OR REPLACE (status='pending', vector_committed=0). - pub fn mark_embedding_records_committed(&[EmbeddingId]) — phase-3 single UPDATE … WHERE embedding_id IN (?, ?, …) via params_from_iter, guarded by WHERE status='pending' so tombstones don't get clobbered. - pub fn filter_chunks(&[ChunkId], &SearchFilters) → Vec<ChunkId> consolidates the JOIN against documents/document_tags/ embedding_records + path_glob via globset. Lets kb-store-vector honor SearchFilters without depending on rusqlite or globset directly. (kb-search's filter logic is structurally different — interleaved with the FTS5 SELECT — so it stays as-is for now; consolidation is a P+ refactor.) - 4 new unit tests cover the phase-1 round-trip, empty batch, replay reset of pending rows, and the WHERE-status-pending guard. Tests: - 9 lib unit tests in kb-store-vector covering paths/sanitization, arrow_batch dim validation + descriptor hash, bm25-style cosine score shift math. - 4 new kb-store-sqlite unit tests on filter_chunks (committed-only, tags/lang/trust/path_glob, order preservation, empty input). - 4 new kb-store-sqlite unit tests on the embedding_records helpers. - 8 integration tests in upsert_search.rs and 1 snapshot test marked #[ignore = "requires AVX-capable hardware (LanceDB)"]. They invoke require_avx_or_panic() at the top of each body so a missing-AVX --ignored run fails loudly instead of silently passing. This dev host (qemu64 model) lacks AVX so these were NOT exercised end-to- end here — first CI lane on AVX hardware will validate them. - Snapshot fixture tests/fixtures/vector/run-1.json is a placeholder with an _comment marker. 
Snapshot test panics until the placeholder is replaced via KB_UPDATE_SNAPSHOTS=1 on AVX hardware. - Workspace 241 passed, 19 ignored, 0 failed; cargo clippy --workspace --all-targets -- -D warnings clean. Allowed deps respected (kb-core, kb-config, kb-store-sqlite, lancedb, arrow + arrow-array + arrow-schema, serde, serde_json, tracing, thiserror) plus forced waivers — anyhow (trait return type), tokio + futures (LanceDB async-only API), blake3 (params_hash). rusqlite and globset are NOT direct deps of kb-store-vector — confirmed via cargo metadata --no-deps. rusqlite stays in [dev-dependencies] for the test fixture seeder only. Out of scope: IVF/PQ index tuning (P+), image vectors (P6), kb-app embed_index orchestration (P3-4 facade). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 10:01:31 +00:00
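The score shift from the p3-3 commit above (cosine distance in [0, 2] → similarity in [-1, 1] → score in [0, 1], NaN coerced to 0) is small enough to show directly. The function name is hypothetical, and the warning log is reduced to a comment.

```rust
/// Map a cosine distance to a [0, 1] score without clamping, so negative
/// similarities still rank against each other.
fn score_from_cosine_distance(distance: f32) -> f32 {
    if distance.is_nan() {
        // The real code is described as logging a warning here.
        return 0.0;
    }
    let similarity = 1.0 - distance; // in [-1, 1]
    (similarity + 1.0) / 2.0 // in [0, 1]
}
```

For comparison, clamping similarity at 0 would map distances 1.5 and 1.8 both to 0, while the shift keeps 1.5 ahead of 1.8.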
altair823	bcbe2b8531	feat(p3-2): kb-embed-local crate — fastembed adapter for multilingual-e5-small First real Embedder implementation. Wraps fastembed-rs (ONNX runtime) with the e5 prefix convention, batching, and {data_dir}/${XDG_DATA_HOME} template expansion so model files land under config.storage.model_dir/ fastembed/ without polluting kb-config's public API. Public surface: - pub struct FastembedEmbedder - pub fn new(config: &Config) -> Result<Self> - impl kb_core::Embedder (via kb-embed re-export) Behavior: - Default model multilingual-e5-small (384 dims). model_id and model_version come from config.models.embedding.{model,version}. - Pre-load dim check via TextEmbedding::get_model_info: dim mismatch bails before paying the ~470MB ONNX init cost. - e5 prefix applied BEFORE tokenization: "passage: " for EmbeddingKind::Document, "query: " for EmbeddingKind::Query. Pinned by prefix_input unit tests. - Batches inputs into chunks of config.models.embedding.batch_size, concatenates results in input order. - L2 normalization is performed by fastembed 4.9's default transformer pipeline (verified at fastembed/src/text_embedding/output.rs:43); we skip re-normalization. Integration test pins ‖v‖ ≈ 1.0 ± 1e-3 so a future fastembed bump that drops this invariant fails loudly. - Synchronous (no async runtime). Mutex serializes calls into the underlying ONNX session — conservative; ORT Session is Send+Sync but callers (kb-app indexer) batch sequentially anyway. Revisit if profiling shows contention. - First-run model download surfaces via tracing::info before/after TextEmbedding::try_new — users no longer stare at a silent 30-60s pause during the 470MB pull. Tests: - 11 default-lane tests covering: check_dim match/mismatch (no model load), prefix_input Document/Query/empty, resolve_model known/unknown, expand_path substitution + no-op + XDG_DATA_HOME set + XDG_DATA_HOME unset (falls back to ~/.local/share with recursive ~ expansion). 
XDG tests serialize on a Mutex + RAII guard since edition 2024 makes set_var/remove_var unsafe. - 7 #[ignore] integration tests covering: full construction with default config, dim-mismatch belt-and-braces, Document vs Query cosine differential, L2 unit norm, byte-equal determinism, batch-64 performance under 5s, snapshot-hash stability over a 5-sentence multilingual fixture. - Snapshot test fails LOUDLY when SNAPSHOT_HASH_BASELINE is 0 — prints the captured hash and panics with paste-back instructions, so first --ignored run forces the maintainer to pin the baseline rather than silently passing. - Workspace: 222 tests pass (default lane); clippy clean. Allowed deps respected: kb-config, kb-embed (re-exports kb-core trait surface), fastembed = "4.9", tracing, anyhow. tokenizers and ort enter transitively through fastembed; reqwest/hyper/hf-hub also transitive (model download is fastembed's responsibility per spec carve-out). No direct kb-core dep needed — re-exports cover it. Pinned to fastembed 4.x rather than the recent 5.x to limit blast radius; consider bump when p3-3 (lancedb-store) consumes the embedder output shape. Out of scope: reranker (P+), Ollama embedding endpoint, candle adapter, image embeddings (P6). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 08:39:38 +00:00
altair823	2e3eb8f437	feat(p3-1): kb-embed crate — Embedder trait re-export + MockEmbedder Establishes the kb-embed trait crate so concrete embedding adapters (p3-2 fastembed, future ollama-embed/candle) target a stable surface. Pure re-export of kb_core::{Embedder, EmbeddingInput, EmbeddingKind, EmbeddingModelId, EmbeddingVersion} plus a feature-gated deterministic mock for downstream tests. MockEmbedder (cfg(feature = "mock"), default OFF): - Per-component hash recipe: blake3(seed_le8 || kind_byte || text_len_le8 || text || i_le8). Length-prefixed text avoids the domain-separation ambiguity where two (text, i) pairs could shift bytes between text tail and the i field. - Document = 0u8, Query = 1u8 — same text different kind yields different vectors (mirrors e5 prefix behaviour). - Per component: blake3 first 8 bytes → u64 → reinterpret as i64 → f64/i64::MAX → f32. i64::MIN gives -1.0000000000000002 which f32 rounds to -1.0; range [-1, 1] holds. - L2 unit-normalised. Norm sums in f64 (avoid catastrophic precision loss) before f32 cast. Zero-norm guard skips the divide. - with_seed(...) constructor lets two embedders share identity but produce different vectors — useful for downstream parametric tests. Helpers: - assert_vector_shape(vecs, dims) — len + finite check. - assert_unit_norm(vecs, tolerance) — caller-supplied tolerance; 5e-4 documented as safe for dims=384 under f32 epsilon × √dims. Tests: - cargo test -p kb-embed (no features): 2 reexport/dyn-dispatch tests. - cargo test -p kb-embed --features mock: 7 tests including 100-case proptest asserting len == dims, all finite, ‖v‖ ≈ 1.0 within tolerance, Doc(text) byte-equal Doc(text), Doc(text) ≠ Query(text), Doc(text1) ≠ Doc(text2). - All 220 workspace tests pass; clippy clean for both default and mock-on feature configurations. Symbol gating: nm on the release rlib confirms zero MockEmbedder symbols under default features; three trait impl symbols under --features mock. 
Spec invariant "release builds MUST NOT include MockEmbedder" verified at the symbol level. Allowed deps respected: kb-core, kb-config, serde, thiserror, tracing, plus anyhow (forced by trait return type) and blake3 (justified by the determinism contract; already in workspace lockfile via kb-core). No fastembed/ort/tokenizers anywhere. Out of scope: real adapter (p3-2), reranker traits (P+). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 08:15:44 +00:00
altair823	b335151d18	feat(p2-2): kb-search crate + LexicalRetriever (FTS5 + bm25) Adds the first concrete kb_core::Retriever, exercising chunks_fts (P2-1) to answer SearchMode::Lexical queries. Returns Vec<SearchHit> with bm25-derived ranking, snippet() previews, and W3C-fragment-style Citation built from the chunk's first source_spans entry. New crate kb-search: - LexicalRetriever::new(Arc<SqliteStore>, IndexVersion). - search() builds an FTS5 MATCH expression by escaping every whitespace token into a quoted literal (inner " doubled); single-quote-wrapped text passes through verbatim as raw FTS5 syntax. Empty query short-circuits to Ok(vec![]). - bm25 normalization: score = -bm25 / (1 + |bm25|), bounded (0, 1] for any FTS5-returned negative bm25. - Snippet via snippet(chunks_fts, 3, '', '', '…', word_budget) where word_budget = snippet_chars / 4 clamped to [1, 64]; trim_snippet enforces the char cap on the way out (chars per design §6.4 — accepts the combining-mark trade-off). - Citation from chunks.source_spans_json first span: Line / Page / Region / Time forwarded; Byte / empty array fall back to Line{1,1} with a tracing::warn so forward-compat regressions surface. - Filters: tags_any (subquery on document_tags), lang (= column), trust_min (CASE-rank in SQL) all applied at SQL level. path_glob uses globset with literal_separator(true) — guarantees '*' does not cross '/' per spec Risks/notes — applied as Rust post-filter with +128 row over-fetch when set, then rank reassigned 1..k contiguously. - Determinism: ORDER BY score, f.chunk_id (lexicographic blake3 hex tiebreaker on identical bm25). Tested explicitly with two chunks of identical text content. - RetrievalDetail: method=Lexical, both lexical_score and fusion_score set, vector_ None. kb-store-sqlite: - Adds pub fn read_conn(&self) -> MutexGuard<'_, Connection>. Read-only contract is doc-only — kb-search needs MutexGuard for prepare_cached + iter, which a closure-scoped wrapper would awkwardly constrain. 
Closure variant left as a P3 follow-up. Tests (26 new): empty corpus, empty query, single hit + citation round-trip, snippet length cap, tags_any exclusion, lang+trust composition, path_glob with '*' not crossing '/', citation line round-trip, bm25 top-1 ∈ (0, 1], determinism (varied scores AND identical-score tiebreaker), index_version passthrough, snapshot (crates/kb-search/tests/fixtures/search/lexical/run-1.json — stable under bundled SQLite; KB_UPDATE_SNAPSHOTS=1 to regenerate). Workspace: 211 tests pass, cargo clippy --workspace --all-targets -D warnings clean. Allowed deps respected: kb-core, kb-config, kb-store-sqlite, rusqlite, tracing, thiserror, anyhow (forced by trait return type), serde_json (parses _json TEXT columns), globset (path_glob '*' boundary). Out of scope (deferred): vector retriever (p3-3), hybrid fusion (p3-4), reranker (P+), Korean morphological tokenizer (P+). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 05:20:35 +00:00
altair823	111f40ddf0	p1-6: kb-store-sqlite test suite (8 categories) All 8 test categories from the task plan, plus a JobRepo subset: migration — tests/migration.rs: fresh DB after run_migrations exposes every required §5 table + index. unit (copy) — tests/asset_writer.rs: copy mode writes file with mode 0o644 + correct bytes. unit (ref) — tests/asset_writer.rs: reference mode does not write file; row records source path. unit (cs) — tests/asset_writer.rs: tampered checksum returns a Conflict-flavoured anyhow error. unit (idem) — tests/idempotency.rs: same put_document twice → 1 row, doc_version 1→2; tags re-derived. unit (rb) — tests/idempotency.rs: put_blocks with FK violation rolls back; pre-existing rows unchanged. contract — tests/contract_roundtrip.rs: drives kb-parse-md + kb-normalize + kb-chunk on fixtures/markdown/code-and-table.md, persists, then reloads via DocumentStore::get_document / get_chunk and asserts byte-equal round-trip. snapshot — tests/ingest_report_snapshot.rs + snapshots/ingest_report.snapshot.json: pin the wire JSON form of kb_core::IngestReport for an inline fixture run. jobs — tests/jobs.rs: create → progress → finish flow; error message round-trip; list filters on status/kind. Drops the unused `serde` direct dep from Cargo.toml; serde_json brings its own. Dev-deps confirmed via `cargo tree -p kb-store-sqlite --depth 1` to live only in the dev tree.	2026-04-30 17:13:03 +00:00
altair823	a3390d5171	p1-6: scaffold kb-store-sqlite crate + V001 full §5 DDL New workspace member crate `kb-store-sqlite` (allowed deps only: kb-core, kb-config, rusqlite[bundled], refinery, serde, serde_json, time, blake3, tracing, anyhow, thiserror; dev-deps add kb-parse-md / kb-normalize / kb-chunk for the contract round-trip test). Migration V001 replaces the P0-1 stub with the full §5 DDL (assets, documents, document_tags, blocks, chunks with policy_hash, embedding_records, jobs, ingest_runs, answers, eval_runs, eval_query_results) plus the §5 indexes. FTS5 virtual table + triggers remain deferred to V002 (P2-1). Public surface per task spec: SqliteStore::open / run_migrations / put_asset_with_bytes impl DocumentStore for SqliteStore (7 trait methods) impl JobRepo for SqliteStore (4 trait methods) StoreError { Sqlx, Migration, Conflict } Behavior: - Pragmas at open: foreign_keys=ON, journal_mode=WAL, synchronous=NORMAL, temp_store=MEMORY. - Asset writer: byte_len ≤ copy_threshold_mb * 1MiB → copy to data_dir/assets/<aa>/<asset_id> (mode 0o644 on Unix), else reference. blake3(bytes) verified against asset.checksum; mismatch → Conflict. - Idempotency: put_document UPSERTs and bumps doc_version + 1 on conflict; put_blocks / put_chunks DELETE-then-INSERT; document_tags re-derived per put_document. - get_document rehydrates blocks via payload_json ordered by stream ordinal. - list_documents builds dynamic WHERE from DocFilter (lang / trust_min / path_glob via GLOB / tags_any via document_tags subquery). - JobRepo: jobs.kind/status are stored as lowercase enum tags; create mints a 32-hex JobId via blake3(kind || payload || nanos). Tests follow in subsequent commits.	2026-04-30 17:08:36 +00:00
altair823	58f7b8573d	p1-5: add long-section fixture + Vec<Chunk> snapshot test Bakes the chunker output for fixtures/markdown/long-section.md (3 H1s, nested H2 under Alpha, a 50-line code block, a 3-col x 4-row table, and a multi-paragraph Gamma section) into the JSON snapshot baseline. Confirms the priority rules end-to-end: - Heading boundaries hold across H1 → H2 → H1 transitions - The code block emits one chunk at 427 tokens > target=200 - The table stays single-chunk - Gamma's paragraph stream splits with one block of overlap seed A second test runs the full parse → normalize → chunk pipeline 5 times and asserts identical chunk_ids each pass. Drops the unused `kb-config` and `serde` from regular dependencies — they were declared but no source path imports them; `serde` flows in transitively via `kb-core` as a public API requirement, and `ChunkingCfg` lives in `kb-config` but the chunker takes `ChunkPolicy` directly. Production deps are now exactly the allowed set actually used: anyhow, blake3, kb-core, serde_json_canonicalizer, tracing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:33:29 +00:00
altair823	8142449eb7	p1-5: scaffold kb-chunk crate with MdHeadingV1Chunker skeleton Adds the new workspace member with the bare Chunker impl shape: chunker_version() returns "md-heading-v1"; policy_hash() blake3-hashes canonical JSON of ChunkPolicy and truncates to 16 hex chars; chunk() is an empty stub the next commits fill in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:27:42 +00:00
altair823	e0df42984e	p1-4: address review I1-I3 + minors (extract attribution, audio-ref skip, NFC heading_path) I1: warning_agent maps ExtractFailed → "kb-parse-md" (the panic-recovery emitter in kb-parse-md/src/blocks.rs). Lift-stage warnings from build_canonical_document are tracked separately and attributed to "kb-normalize", so the I1 mapping change does not lie about kb-normalize-originated drops. I2: ParsedPayload::AudioRef no longer synthesizes Block::AudioRef with an invalid empty AssetId (would violate AssetId::from_str's 32-hex invariant). Block is dropped, Warning surfaces in Provenance with src mention, attributed to kb-normalize (lift-stage decision). TODO(P8) comment marks this as a placeholder until the audio extractor lands. I3: NFC-normalize each heading_path string in lift_block before feeding into id_for_block AND into CommonBlock.heading_path. pulldown-cmark does not NFC heading text and serde_json_canonicalizer v0.3 does not either, so canonically-equivalent NFD/NFC inputs would produce different block_ids without this normalization. Mirrors the existing doc_id NFC handling via to_posix. Minors: - M4: trim Cargo.toml — drop kb-config, serde_json_canonicalizer, blake3 (unused); keep tracing (now wired) + unicode-normalization (now used by I3). - M5: determinism_1000_iterations_under_1s now uses the same 5-block fixture as block_ordinals_scoped_per_heading_and_kind (extracted into fixture_blocks_five helper) so the determinism property is exercised on a real lift_block path, not just an empty Vec. Still < 1s. - M6: snapshot integration test now passes BodyHints { first_h1: Some("Code And Table"), .. } and asserts doc.title == "Code And Table" end-to-end. Baseline JSON updated. - M7: title/lang edge-case unit tests pin policy: empty string lifts to empty string; non-stringy values silently drop. Rustdoc updated. - M10: provenance_contains_stage_events_in_order asserts events[1].at == events[2].at to pin the shared-now_utc invariant. 
New tests (unit, kb-normalize): - provenance_with_extract_failed_warning_attributes_to_kb_parse_md (I1) - audio_ref_block_skipped_with_warning (I2) - nfc_nfd_korean_heading_path_same_block_id (I3) - title_empty_string_in_user_map_falls_back_to_default (M7) - title_non_string_in_user_map_silently_drops (M7) - lang_invalid_shape_silently_drops (M7) kb-normalize unit tests: 9 → 14. Integration snapshot: 1 (unchanged). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:41:50 +00:00
altair823	c0096ce44b	p1-4: scaffold kb-normalize crate Add the workspace member, `Cargo.toml` with the §8-allowed dep set (kb-core, kb-parse-types, kb-config, serde, serde_json_canonicalizer, blake3, unicode-normalization, time, anyhow, tracing) and a stubbed `build_canonical_document` that pins the public signature plus `doc_id` derivation. `kb-parse-md` is permitted only as a dev-dep so the integration snapshot test (added later in this series) can drive a fixture through the real parser without violating the production boundary — `cargo tree -p kb-normalize --depth 1 --edges normal` confirms no parser implementation appears in the regular dep tree. `id_for_doc` and `id_for_block` are re-exported from kb-core (which holds the canonical recipe per §4.2); kb-normalize is the canonical entry point per design §8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:16:53 +00:00
altair823	4e7e9cad87	p1-3: add parse_blocks (pulldown-cmark walker) submodule Implements `kb_parse_md::parse_blocks(body, body_offset_lines)` returning a flat `Vec<ParsedBlock>` plus warnings. Walks pulldown-cmark events through a small frame-based state machine that tracks heading paths, accumulates inline buffers (Text/Code/Link/Strong/Emph only — design §3.4), and reports SourceSpan::Line spans in 1-indexed file-line coordinates. Covers headings, paragraphs, code blocks (lang from info string), GFM tables (with malformed fallback to paragraph + MalformedTable warning), lists (nested sub-lists flattened into parent item), and block-level image references. Inline images are dropped silently per the inline filter. Adversarial inputs are caught with `catch_unwind` and degrade to an empty output + ExtractFailed warning. 15 unit tests cover heading-path correctness, code lang, table parsing, malformed-table fallback (driven via synthetic events since pulldown-cmark auto-normalizes table widths), LF/CRLF line-range parity, image refs, nested-list flattening, inline filter, and 100-iteration random-bytes plus hand-crafted adversarial-input no-panic guards.	2026-04-30 14:14:34 +00:00
altair823	a86b463fc4	p1-2: scaffold kb-parse-md crate Add the workspace member with the dep allow-list pinned by design §0 Q9 and the task spec. P1-2 will land the frontmatter submodule in the next commit; P1-3 will add the block parser as a sibling. Notable choice: serde_yaml (dtolnay) was archived as unmaintained in 2024 so we use serde_yaml_ng, the maintained fork. lingua's per-language features are explicitly enabled (default-features=false) to keep build time + binary size sane — only the languages we need at parse time.	2026-04-30 12:55:20 +00:00
altair823	7c75e10b2c	p1-1: scaffold kb-source-fs crate (FsSourceConnector) Walk config.workspace.root, apply gitignore-style filters (config.workspace.exclude ∪ .kbignore ∪ baked-in defaults for .DS_Store / ._*), stream BLAKE3 over each file, and emit a deterministic Vec<RawAsset> sorted by workspace_path. Modules: - hash: streaming blake3::Hasher + 64 KiB read buffer (no whole-file loads); pinned digests for empty input and "hello world". - media: extension → MediaType (markdown/pdf/image/audio/other). - walker: ignore::OverrideBuilder for filter union; walkdir with manual visited-set cycle protection on top of follow_links. - connector: public FsSourceConnector::new(&Config) + SourceConnector::scan(&SourceScope) impl. Uses kb_core::to_posix for WorkspacePath construction (carries P0-1 # rejection through unchanged) and kb_core::id_for_asset for AssetId derivation. Storage variant signals intent only; actual byte copy is P1-6's responsibility. Per design §3.3, §6.2, §6.6, §7.1, §7.2, §8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 12:27:34 +00:00
altair823	d2c8728095	p0-1: address review (drop unused thiserror dep, document kb-core reserve) - Cargo.toml: remove `thiserror` from kb-config, kb-parse-types, kb-app (unused — none of those crates' src trees reference thiserror; CoreError in kb-core is the only consumer). - kb-config keeps the `kb-core` dep with a one-line comment marking CoreError reserved for P1-* config-error wiring per the review thread. - ids.rs: switch `validate_hex32` from a hand-rolled `matches!` byte range to `is_ascii_hexdigit()` so the hex check is the canonical idiom (and satisfies `clippy::manual_is_ascii_check` under `-D warnings`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 08:55:39 +00:00
altair823	f86df99fe9	p0-1: workspace + kb-core domain types, traits, and ID recipe Stand up the Cargo workspace (Rust 2024 / resolver=3) with the kb-core crate per the frozen design (§3, §4, §7, §10). kb-core has zero deps on other kb-* crates and exposes: - Newtype IDs (AssetId / DocumentId / BlockId / ChunkId / EmbeddingId / IndexId) with Display + FromStr that reject anything but 32 lower-hex. - id_from + id_for_{asset,doc,block,chunk,embedding,index} per §4.2; pinned hex test values computed via an independent JCS+blake3 tool. - CanonicalDocument, Block (8 variants), SourceSpan, Inline (§3.4). - Citation (5 variants) with W3C Media Fragments to_uri / parse; round-trip property holds for every variant. - Metadata + Provenance (§3.6); SearchQuery / SearchHit / RetrievalDetail (§3.7); DocFilter / DocSummary mirrors of wire §2.5. - Answer / AnswerCitation / RefusalReason / ModelRef (§3.8). - IngestReport, JobRepo support types, VectorRecord / VectorHit. - Component traits (SourceConnector / Extractor / Chunker / Embedder / Retriever / LanguageModel / DocumentStore / VectorStore / JobRepo) plus their input helpers (SourceScope / ExtractContext / ChunkPolicy / EmbeddingInput / GenerateRequest / TokenChunk / FinishReason). - CoreError (§10). - nfc + to_posix helpers (§4.1, §6.6). 20 unit tests cover ID determinism (1000-run regression), key-order invariance, two pinned hex values, newtype rejection of bad input, Citation round-trip for all 5 variants, and to_posix collapsing + Korean NFC. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 05:16:37 +00:00
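
The p0-1 commits above describe newtype IDs whose FromStr rejects anything but 32 lower-hex characters, with `validate_hex32` built on `is_ascii_hexdigit()`. A minimal sketch of that shape follows. The `String` error type, the `bool` return of `validate_hex32`, and the explicit uppercase rejection are assumptions for illustration (`is_ascii_hexdigit()` alone also accepts `A-F`), not the actual kb-core code.

```rust
use std::fmt;
use std::str::FromStr;

/// Hypothetical stand-in for one of the kb-core newtype IDs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AssetId(String);

/// True only for exactly 32 lowercase hex characters.
/// Assumption: uppercase is rejected with an extra check, because
/// `is_ascii_hexdigit()` on its own also accepts `A-F`.
fn validate_hex32(s: &str) -> bool {
    s.len() == 32
        && s.bytes()
            .all(|b| b.is_ascii_hexdigit() && !b.is_ascii_uppercase())
}

impl FromStr for AssetId {
    // Assumption: a plain String error; the real crate has its own error type.
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        if validate_hex32(s) {
            Ok(AssetId(s.to_string()))
        } else {
            Err(format!("expected 32 lower-hex characters, got {s:?}"))
        }
    }
}

impl fmt::Display for AssetId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Display prints the raw hex so that parse(to_string()) round-trips.
        f.write_str(&self.0)
    }
}
```

Under these assumptions, `"0123456789abcdef0123456789abcdef".parse::<AssetId>()` succeeds, while the same string in uppercase, a 31-character string, or an empty string is rejected.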

41 Commits
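
The p2-2 commit (b335151d18) describes two small rules in LexicalRetriever: every whitespace token of the query becomes a quoted FTS5 literal with inner `"` doubled (single-quote-wrapped input passes through as raw FTS5 syntax, and an empty query short-circuits), and the score is `-bm25 / (1 + |bm25|)`. A rough sketch of both rules, assuming hypothetical function names `build_match_expr` and `normalize_bm25` and an `Option` return for the empty-query case; the real crate may structure this differently.

```rust
/// Build an FTS5 MATCH expression from free-form user input.
/// Returns None for an empty query (the commit short-circuits to an
/// empty result in that case).
fn build_match_expr(query: &str) -> Option<String> {
    let trimmed = query.trim();
    if trimmed.is_empty() {
        return None;
    }
    // Assumption: "single-quote-wrapped" means the whole query starts and
    // ends with an ASCII single quote; the inner text is passed verbatim.
    if trimmed.len() >= 2 && trimmed.starts_with('\'') && trimmed.ends_with('\'') {
        return Some(trimmed[1..trimmed.len() - 1].to_string());
    }
    // Otherwise quote each whitespace-separated token, doubling inner quotes,
    // so FTS5 operators typed by the user are treated as plain text.
    let quoted: Vec<String> = trimmed
        .split_whitespace()
        .map(|tok| format!("\"{}\"", tok.replace('"', "\"\"")))
        .collect();
    Some(quoted.join(" "))
}

/// Map an FTS5 bm25 value (negative, more negative is a better match)
/// to a positive score using the formula from the commit message.
fn normalize_bm25(bm25: f64) -> f64 {
    -bm25 / (1.0 + bm25.abs())
}
```

With these definitions, `build_match_expr("foo bar")` yields `"foo" "bar"`, and `normalize_bm25(-1.0)` yields 0.5. Stronger matches (more negative bm25) move the score toward 1.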