kebab

Author	SHA1	Message	Date
altair823	641b92af7d	fix(dogfood): document-centric try_skip_unchanged for twin-file idempotency Identical-content files at different workspace paths share one assets row (assets.asset_id = blake3 content hash, PRIMARY KEY). The UPSERT `ON CONFLICT(asset_id) DO UPDATE SET workspace_path = excluded` made twin files overwrite each other's workspace_path on every ingest, so `get_asset_by_workspace_path(path1)` returned the OTHER twin's row (or None), breaking idempotent unchanged-detection for both files. Fix: switch try_skip_unchanged to a document-centric lookup. `documents.workspace_path` is already UNIQUE (V001) and `id_for_doc(path, ...)` includes the path, so each twin has its own stable document row. Compare `doc.source_asset_id` with the new asset's checksum instead of going through the assets table. Dogfood (multi-root: kebab-docs + httpx + zod + lodash) showed 27 of 726 docs marked Updated on every idempotent re-ingest; all 27 are twin-file victims (empty `__init__.py` ×3, AGENTS.md ↔ CLAUDE.md same content, duplicate logo PDFs/JPGs). After: re-ingest reports 0 new / 0 updated / 726 unchanged. No schema migration needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 05:27:21 +00:00
altair823	4e8b84c4e0	fix(dogfood): populate schema.v1.repo_breakdown (Task 9 follow-up) Dogfooding (PR #142 1B + multi-root corpus: kebab-docs + httpx + zod + lodash) revealed schema.v1.repo_breakdown is always {} despite the 1A-2 Task 9 having added the code_lang_breakdown sibling. The schema.rs:171 placeholder `BTreeMap::new()` was left in place. Mirror Task 9's code_lang_breakdown query for the repo field — same metadata_json JSON-path pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 05:09:19 +00:00
altair823	918ee6c0be	fix(p10-1a-1): apply code_lang + repo filters in lexical SQL and filter_chunks (dogfood-discovered) p10-1A-1 (PR #139) added SearchFilters.code_lang + .repo fields and the CLI --code-lang / --repo flags propagate them correctly into SearchFilters, but neither the lexical retriever's FTS SQL nor the shared filter_chunks helper (used by the vector retriever) ever applied them — so a code-lang-filtered search returned all-doc hits (markdown / pdf / code mixed). Discovered while dogfooding p10-1B with httpx + zod + lodash clones: `kebab search 'AsyncClient' --code-lang python --json` returned markdown hits from httpx/docs/ first. Fix: add IN-list filters on json_extract(d.metadata_json, '$.code_lang') and '$.repo' to both lexical.rs and filters.rs, mirroring the existing media filter pattern. Two regression tests added in each crate covering the new filter behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 03:27:01 +00:00
altair823	da51e59081	feat(p10-1a-2): populate schema.v1 code_lang_breakdown Add `SqliteStore::code_lang_breakdown()` that queries `json_extract(metadata_json, '$.code_lang')`, groups by it, and skips NULL rows — returning `BTreeMap<String, u32>`. Wire it into `collect_stats` in `kebab-app::schema`, replacing the `BTreeMap::new()` placeholder inserted by 1A-1. Test: `store::tests::code_lang_breakdown_counts_by_code_lang` asserts rust=1 and that a null-code_lang doc does NOT appear in the map. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 21:41:52 +00:00
th-kim0823	bf4ebf8d2a	feat(p10-1a-1): add Metadata.repo / git_branch / git_commit / code_lang Four optional, serde-skipped-when-None fields added to `Metadata` for code ingest context. All 11 downstream construction sites patched with `repo: None, git_branch: None, git_commit: None, code_lang: None`. Full workspace check (`--tests`) and per-crate test suite pass clean. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-15 15:44:18 +09:00
th-kim0823	351c7a0826	feat(p10-1a-1): add IngestReport skip counters + SkipExamples Adds five new u32 counters (skipped_gitignore, skipped_kebabignore, skipped_builtin_blacklist, skipped_generated, skipped_size_exceeded) and a SkipExamples struct (≤5 sample paths per category) to IngestReport. All new fields are #[serde(default)] for backward-compat deserialization. Downstream literal construction sites patched with zeros/empty; snapshot re-baked. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-15 15:28:19 +09:00
th-kim0823	126559ce7a	fix(fb-40): update test fixtures for rag-v2 default	2026-05-10 19:15:15 +09:00
th-kim0823	231d80e82d	feat(stats): media/lang/bytes/stale fields on schema.v1.stats (fb-37) Extends CountSummary with media_breakdown, lang_breakdown, stale_doc_count fields populated via stats_ext::breakdowns(). Adds count_summary_with_threshold for callers that need real stale counts. Mirrors all new fields onto the wire-bound Stats struct in kebab-app::schema with #[serde(default)] for backwards-compat. Also fixes search_budget_integration.rs for the trace field added to SearchOpts in Task 1. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-10 12:34:57 +09:00
th-kim0823	69c6e23432	feat(store): breakdowns + index_bytes helpers (fb-37) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-10 12:24:43 +09:00
th-kim0823	84287d0ef6	fix(fb-36): address PR #127 round 1 review - ingested_after: convert OffsetDateTime to UTC before formatting so non-Z offsets compare correctly against UTC TEXT storage (lexical.rs + filters.rs) - README: --tag is repeatable-only, not csv (only --media is csv) - test(cli): add multi-value --tag OR-within IN-list coverage - test(store): add UTC-offset regression test for ingested_after - mcp: use ERROR_V1_ID const instead of hardcoded "error.v1" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 04:47:55 +09:00
th-kim0823	c6cc1e2bfe	feat(search/vector): media / ingested_after / doc_id filters (fb-36) filter_chunks helper in kebab-store-sqlite extended with the same 3 WHERE clauses as lexical. Vector still over-fetches k*2 then post-filters via SqliteStore::filter_chunks; small k can return < k hits when filters drop a lot — agent is expected to widen k or paginate. AND combinator with existing filters. - kebab-store-sqlite/src/filters.rs: media IN-list subquery, ingested_after lexicographic >= compare, doc_id equality; mirrors lexical SQL arms - 3 direct unit tests (filter_chunks_media_type/ingested_after/doc_id) that run without AVX/Lance - common/mod.rs: insert_doc / insert_doc_with_media / run_vector_search helpers on HybridEnv for integration-test use - hybrid.rs: 2 new #[ignore = "requires AVX..."] integration tests (vector_filter_by_media, vector_filter_by_doc_id) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 03:50:56 +09:00
th-kim0823	b86b763dfb	fix(fb-35): address PR #126 round 2 review - wire schema: relax effective_end.minimum 1 → 0 + expand description to cover line-clamp + out-of-range sentinel (panic-fix R1 emits Some(0) when line_start=1 and range is beyond doc end — schema must accept it) - tests: tighten first-chunk-target boundary test to assert ≤ 2 total neighbors (3-chunk doc, N=2). Strict "first chunk → context_before empty" not assertable until chunks.ordinal column lands (R1 #9 architectural caveat) - store: trim contradiction in list_chunk_ids_for_doc warning comment — drop "good enough for sequentially chunked markdown" phrase that conflicts with "hash sort dominates" paragraph above Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 00:55:29 +09:00
th-kim0823	7dddc1d706	fix(fb-35): address PR #126 round 1 review - fetch_span: panic-fix on line_start > total / empty doc (return empty text + effective_end = line_start - 1 instead of out-of-bounds slice) - truncated: reserved for budget-driven truncation only; line range clamp signaled via effective_end < line_end - spec / SKILL.md / README: align rejection wording to "PDF / audio" (matches code; Image OCR allowed for span) - store: warning comment on list_chunk_ids_for_doc — chunk_id hash sort does NOT preserve document position; real fix is a chunks.ordinal column, tracked as follow-up - surrounding_chunks: saturating_add to defend against u32::MAX context arg on 32-bit targets - tests: line_start > total returns empty + chunk context at doc boundary clamps lower bound Deferred nits (follow-up): table-separator strict CommonMark form; MCP per-mode strict validation; CLI chunk_id truncation in plain output. None block correctness. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 00:45:29 +09:00
th-kim0823	610d29f053	feat(app): App::fetch chunk mode + markdown serializer (fb-35) Chunk mode + +-N context. doc / span modes return placeholder errors (filled by subsequent tasks). fmt_canonical_to_markdown helper introduced now since doc mode (Task 4) consumes it. Errors are typed StructuredError so classify preserves chunk_not_found / doc_not_found through the wire layer. Adds SqliteStore::list_chunk_ids_for_doc so the facade can derive +-N neighbors without leaking direct rusqlite usage into kebab-app. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 23:44:51 +09:00
th-kim0823	61aae1c1d5	🏗️ refactor(kebab-app): consolidate PARSER_VERSION + clarify intent (fb-27) Replace kebab-app's private `KEBAB_PARSE_MD_VERSION` literal with a direct reference to `kebab_parse_md::PARSER_VERSION` so the parser version cascade has a single source of truth (design §9 invariant). Add maintenance comment on schema.rs WIRE_SCHEMAS const pointing to docs/wire-schema/v1/ + kebab-cli wire helpers as the authoritative sources to keep in sync. Tighten open_existing doc comment to match the actual SQLITE_OPEN_READ_WRITE flag (needed for WAL pragma application) — callers should still avoid issuing mutations through this connection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 11:58:06 +09:00
th-kim0823	39b4433549	feat(kebab-app): schema_with_config facade (fb-27) New `SchemaV1` struct + `schema_with_config(&Config)` builder. Surfaces wire schemas list, capabilities (current + future placeholders), model versions (parser/chunker/embedding/prompt_template/index/corpus_revision), and stats (doc/chunk/asset counts + last ingest). kebab-store-sqlite gains `count_summary()` to back the stats block. Deviations from plan: - `cfg.models.embedding.id` → `cfg.models.embedding.model` (actual field name) - No `Config::expand_path` method → free fn `kebab_config::expand_path(&cfg.storage.data_dir, "")` - `PARSER_VERSION` added to `kebab-parse-md/src/lib.rs` (was absent; synced with `KEBAB_PARSE_MD_VERSION` literal in kebab-app) - `INDEX_VERSION_STR` added to `kebab-store-vector/src/store.rs` + re-exported from `lib.rs` (was a private `const`) - `corpus_revision()` returns `u64` directly (not `Result<u64>`) — no `?` in collect_models - `SchemaV1` carries `schema_version: "schema.v1"` field (wire schema convention) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 11:46:37 +09:00
th-kim0823	1c4d554bf4	🏗️ refactor(kebab-store-sqlite): harden open_existing against silent create (fb-27) Replace `path.exists()` + `Connection::open` (which silently CREATEs on race) with `Connection::open_with_flags` using READ_WRITE|URI but NOT CREATE. SQLite surfaces `SQLITE_CANTOPEN` for missing files; we wrap as NotIndexed { found: None } as before. Adds open_existing_does_not_create_missing_db regression test pinning the no-side-effect invariant. Also documents read-only intent on open_existing, the format contract on NotIndexed.found, and removes scaffolding comments from kebab-app error_signal that are no longer load-bearing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 11:40:42 +09:00
th-kim0823	d7bfd01ef5	feat(kebab-store-sqlite): add NotIndexed typed error (fb-27) New `SqliteStore::open_existing` API + `NotIndexed` signal for the missing-DB case. kebab-app re-exports the type via its `error_signal` module so kebab-cli's `error_classify` can map it to `error.v1 { code: "not_indexed" }`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 11:32:04 +09:00
altair823	693f5582f0	feat(kebab-core, kebab-app): p9-fb-25 task 4 — IngestReport.skipped_by_extension + wire schema additive Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 12:06:34 +00:00
altair823	8d0744c22b	review(p9-fb-23): address round 1 nits: named columns + safe byte_len + trait check + count Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:33:28 +00:00
altair823	366e89e5e2	feat(kebab-store-sqlite): p9-fb-23 task 4 — get_asset_by_workspace_path Add `DocumentStore::get_asset_by_workspace_path` trait method to `kebab-core` and implement it on `SqliteStore` via a private `asset_from_row` helper. Used by the incremental-ingest skip path to compare a freshly-computed blake3 checksum against the persisted row without a full round-trip through `put_asset_with_bytes`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 17:58:23 +00:00
altair823	4261c8953c	feat(kebab-store-sqlite): p9-fb-23 task 3 — V006 migration + put/get_document round-trip version stamps Add V006__incremental_ingest.sql to persist last_chunker_version and last_embedding_version on the documents table. Wire both columns into upsert_document (INSERT + ON CONFLICT UPDATE) and get_document (SELECT + row mapper), replacing the previous hardcoded None. Add two round-trip tests in tests/incremental_ingest.rs covering the set and None cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 17:53:30 +00:00
altair823	f867b36afb	feat(kebab-core): p9-fb-23 task 2 — CanonicalDocument gains last_chunker_version + last_embedding_version Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 17:50:25 +00:00
altair823	0684b3ad66	review(p9-fb-23-task1): fix missed IngestReport construction sites + snapshot reviewer-flagged: `aa2a6ea` claimed build clean but missed: - crates/kebab-store-sqlite/tests/ingest_report_snapshot.rs (test fixture) - crates/kebab-cli/src/wire.rs (test fixture) - crates/kebab-store-sqlite/snapshots/ingest_report.snapshot.json (snapshot) All three add `unchanged: 0` (or `"unchanged": 0`) to match the new IngestReport.unchanged field. cargo clippy --workspace --all-targets -- -D warnings now clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 17:47:13 +00:00
altair823	952ed1615f	review(p9-fb-17): address round 1 nit - The doc comment on `append_turn` promises to "wrap in one transaction", but the code actually issued two auto-commit `conn.execute` calls, so a failure in the second left the first row committed and the session inconsistent. Replaced with a real transaction: `conn.transaction()` → `tx.execute(insert)` → `tx.execute(update)` → `tx.commit()`. Wrapping both statements in a SQLite BEGIN makes them atomic. Because `lock_conn()` returns `MutexGuard<Connection>`, the pattern `let mut conn = self.lock_conn(); let tx = conn.transaction()` works (via MutexGuard's DerefMut). All 9 chat_sessions tests and clippy pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 05:40:40 +00:00
altair823	c97e8e00ef	feat(kebab-core + kebab-store-sqlite): p9-fb-17 chat session storage (V005) Dogfooding items 13/14 (multi-turn persistence): backing store for the TUI Ask "continue previous conversation" feature and the upcoming CLI `--session foo` (p9-fb-18). Two tables, a session header and per-turn rows; ON DELETE CASCADE lets reset --data-only wipe both at once. ## Key changes - SQLite V005 migration `chat_sessions` (session_id PK + created_at + updated_at + title + config_snapshot_json) + `chat_turns` (turn_id PK + session_id FK ON DELETE CASCADE + turn_index + question + answer + citations_json + created_at + UNIQUE(session_id, turn_index)) + `idx_chat_turns_session(session_id, turn_index)`. All `STRICT`. - `kebab_core::ChatSessionRepo` trait (6 methods): create_session / get_session / list_sessions(limit, ORDER BY updated_at DESC) / delete_session / append_turn / list_turns(ORDER BY turn_index ASC) - `kebab_core::{ChatSessionRow, ChatTurnRow}` structs: both Serialize and Deserialize (for CLI / wire output compatibility) - `kebab-store-sqlite::SqliteStore` impl in a new module `chat_sessions.rs`. `append_turn` performs the insert and the parent updated_at bump on the same connection. - New §5.7a chat_sessions / chat_turns section added to frozen design §5 (full schema + all 6 trait methods spelled out). ## HOTFIXES (V004 → V005) The `V004__chat_sessions.sql` named in spec p9-fb-17 collides on refinery migration number with p9-fb-19's `V004__kv.sql` (already merged). Non-disruptive correction: shifted to `V005__chat_sessions.sql`. Schema and behavior are identical; only the filename moved. HOTFIXES entry added. ## Tests - 9 new integration tests (create/get roundtrip, missing→None, PK collision error, append+list ordered, dup turn_index error, append bumps updated_at, delete CASCADE turns, list_sessions ORDER BY updated_at DESC, list_sessions LIMIT) - full-workspace `cargo test --workspace --no-fail-fast -j 1` exits 0 - `cargo clippy --workspace --all-targets -- -D warnings` clean ## Docs - frozen design §5.7a added - HANDOFF: 2026-05-03 entry - HOTFIXES: V004 → V005 rename rationale - spec status planned → in_progress ## Out of scope - session search / filter UI (admin commands such as p9-fb-18's `kebab ask --session list` follow later) - other store backends (postgres, etc.): only the trait is defined; the impl is SQLite only. Unblocks p9-fb-18 (CLI session/repl). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 05:37:53 +00:00
altair823	0e408fb1b5	feat(kebab-app + kebab-store-sqlite): p9-fb-19 search LRU cache + corpus_revision Dogfooding item 15: removes the cost of recomputing SQLite FTS + Lance + RRF every time the same query is repeated in the TUI or within one process. An in-process LRU cache plus a monotonic corpus_revision counter makes every entry stale automatically whenever an ingest commit occurs. ## Key changes - SQLite V004 migration: `kv (key TEXT PRIMARY KEY, value TEXT) STRICT` + `corpus_revision = '0'` seed. A generic shape so that future scalars can live in the same table. - `SqliteStore::corpus_revision()` / `bump_corpus_revision()`: atomic via `UPDATE ... CAST AS INTEGER + 1`. An INSERT-OR-IGNORE also runs alongside (paranoia for the case where the V004 seed is somehow missing). - `kebab-app::ingest_with_config_cancellable`: bumps when `new + updated > 0`; a no-op (skipped-only) reingest preserves the cache. - `App.search_cache: Option<Mutex<LruCache<SearchCacheKey, Vec<SearchHit>>>>`: sized by `config.search.cache_capacity` (default 256, 0 disables). Adds the `lru = "0.12"` workspace dep. - `SearchCacheKey` = `query_norm` (NFKC + trim + lowercase) + `mode` + `k` + `snippet_chars` + `embedding_version` (vector/hybrid only; empty string for lexical) + `chunker_version` + `corpus_revision` snapshot. - `App::search` rewrite: with the cache enabled, look up, and on a miss call the existing `search_uncached` and then put. If the cache is disabled or the lock fails, take the straight-line path. - `App::search_uncached` (rename of pre-fb-19 `search` body) + `search_uncached_with_config` facade: entered via CLI `kebab search --no-cache`. - `Config.search.cache_capacity: usize` field; `#[serde(default)]` keeps existing configs compatible. - CLI `--no-cache` flag: for debugging (effectively a no-op since every CLI call is a new process, but the spec requires it and it stays compatible with future long-lived processes). - Adds a `corpus_revision` row to the frozen design §9 versioning table (a different dimension from the existing `index_version` label: the label describes the retrieval shape, while corpus_revision acknowledges ingest commits). ## Tests - `kebab-store-sqlite`: 3 new unit tests (fresh=0, monotonic bump, persist across reopen) - `kebab-app`: 4 new integration tests (cached repeat returns the same hits, NFKC normalization collapses case/whitespace, --no-cache parity, first ingest bumps corpus_revision) - full-workspace `cargo test --workspace --no-fail-fast -j 1` exits 0 - `cargo clippy --workspace --all-targets -- -D warnings` clean ## Docs - README `kebab search` row: cache behavior + `--no-cache` note + corpus_revision invalidation mechanism - docs/SMOKE.md: `cache_capacity` line added to the `[search]` section - HANDOFF: 2026-05-03 entry - spec status planned → in_progress ## Out of scope - patch-and-merge incremental updates (hard because RRF normalization is over the whole hit set) - SQLite-persisted cache (P+) - cache sharing across processes (in-process only; cross-process invalidation via corpus_revision is O(1)) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 05:01:31 +00:00
altair823	7a49c8a29b	feat(kebab-normalize): p9-fb-07 markdown title fallback chain `kebab-normalize::derive_title(frontmatter_title, blocks, file_stem)` uses the first non-empty result from these steps: 1. frontmatter `title` (after trim) 2. first H1 text 3. first H2 text 4. first 80 characters of the first Paragraph (excluding Quote / List / Code / Table / ImageRef) 5. file stem (without extension) 6. (sentinel) `"untitled"`, for the pathological case where all five steps above are blank. The chosen string is NFC-normalized. An empty string is never returned. `build_canonical_document` calls the helper right after the metadata lift. The previous simple lift logic (metadata.user["title"] → CanonicalDocument.title) moves to become the step 1 input of the fallback chain. Bumps the `KEBAB_PARSE_MD_VERSION` constant from `pulldown-cmark-0.x` to `md-frontmatter-v2`. The parser_version change alters the §4.2 doc_id input, so existing markdown docs get new `doc_id`s and are reprocessed automatically by idempotent upsert on the next ingest (design §9 cascade). The `kebab-store-sqlite` snapshot fixture is updated to the same literal. The old M7 policy ("metadata.user["title"] = '' lifts to an empty title") is retired. Empty-string input now falls through the fallback chain down to the file stem. spec p9-fb-07 line 37: "never return an empty string". Tests (kebab-normalize): - 8 unit tests (each fallback step + NFC + sentinel) - 2 `build_canonical_document` integration tests (H1 / file stem) - 2 existing M7 tests updated to the new policy Docs: - README: the `kebab ingest` row notes "title auto-fill" and that existing docs are also refreshed on the next ingest - HANDOFF: 2026-05-03 post-merge findings entry - spec status: `planned` → `in_progress` Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 01:22:34 +00:00
altair823	2c058ab175	feat(rag): multi-turn ask: Turn struct + ask_with_history + token budget (p9-fb-15) Implements the §3.8 multi-turn behaviour from Spec PR #59. The RAG facade takes prior turns and prepends them to the prompt, applies retrieval query expansion, and fills conversation_id / turn_index on Answer. New (kebab-core): - Adds conversation_id (Option<String>) / turn_index (Option<u32>) fields to Answer. With serde skip_serializing_if, single-shot wire output is unchanged (no impact on existing external wrappers). - Turn struct (question + answer + citations + created_at). - RefusalReason::LlmStreamAborted variant. New (kebab-rag): - 3 new AskOpts fields: history (Vec<Turn>) / conversation_id / turn_index. - AskOpts::single_shot(mode) helper. - RagPipeline::ask_with_history(query, history, conversation_id, turn_index, opts): calls ask with the combined opts. - expand_query_with_history: extends SearchQuery.text by concatenating the first 200 characters of history.last()'s answer (the "cheap concat" of spec §3.8; LLM-based standalone-question rewriting is P+). - serialize_history + remaining_history_budget_chars: the spec's priority enforcement. System + question are mandatory; within the char budget left after the retrieved chunks, the newest turns come first and the oldest are dropped. - ask body: when history is non-empty, a [이전 대화] ("previous conversation") block is prepended above the user prompt. All 3 Answer construction sites (normal / NoChunks / ScoreGate refuse) fill conversation_id / turn_index. New (kebab-store-sqlite): - refusal_reason_label maps LlmStreamAborted → 'llm_stream_aborted'. Existing caller changes (single-shot behaviour unchanged): - kebab-cli main.rs Cmd::Ask: sets history=Vec::new(), conversation_id=None, turn_index=None explicitly on AskOpts (CLI multi-turn will be filled in by p9-fb-18's --session/--repl). - kebab-tui src/ask.rs spawn site: same (multi-turn UI is p9-fb-16). - kebab-eval runner.rs golden eval: same (single-shot per query). - kebab-app tests/ask_smoke.rs / kebab-tui tests/ask.rs / kebab-rag tests/pipeline.rs / kebab-eval metrics.rs: Answer literals updated. Test: - 9 new lib unit tests (expand_query 4 / serialize_history 3 / remaining_budget 2). - The 12 existing tests PASS with 0 regressions. Plan update: - p9-fb-15 status planned → in_progress. Flips to completed in a one-line commit after merge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 23:09:46 +00:00
altair823	565caebec6	review(round1): address round 1 critical + nit - (critical) embeddings.rs: moved truncate_embedding_records. It had been wedged in above the mark_embedding_records_committed function, so mark_committed's 14-line doc comment (design rationale for `WHERE status='pending'`, etc.) was absorbed as truncate's doc and mark_committed itself was left undocumented. Moved to the end of the impl block (after mark_committed's closing }), which also matches the plan's original intent. - (nit) tests/reset_cli.rs: strengthened the removed_paths idempotency check. Strictly asserts that the data dir is reported and the cache dir is omitted (it was never created). The state dir is auto-created as a side effect of logging init, so either outcome is allowed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 18:43:25 +00:00
altair823	cf65afaef0	feat(store-sqlite): add truncate_embedding_records helper Wipes every row from embedding_records and returns the deleted row count. Used by the upcoming `kebab reset --vector-only` to keep SQLite consistent after the on-disk Lance store is removed. Plan deviation from the original spec (task 1): - Original test plan opened SqliteStore with a raw path; the actual signature is `SqliteStore::open(&Config)`, so the integration test builds a Config with `storage.data_dir` pointed at a tempdir. - Original return type was Result<()>; bumped to Result<u64> so the caller (kebab-app::reset) can surface the truncated count in the reset_report.v1 wire payload without a separate COUNT query. p9-fb-06 task 1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 18:08:22 +00:00
altair823	0c8821f857	fix(kebab-store-vector): close P7-3 vector orphan caveat: delete_by_chunk_ids The P7-3 storage UNIQUE bug fix swept only the SQLite side (documents → blocks / chunks / embedding_records). LanceDB vectors live in a separate store, so rows carrying old chunk_ids remained on disk. Search was unaffected, but disk usage grew without bound. This closes the "P+ task" promised in the HOTFIXES `2026-05-02 P7-3` caveat within the same follow-up PR. Changes: - Adds the `VectorStore::delete_by_chunk_ids(&[ChunkId])` trait method (with a default no-op, so test fakes and existing impls compile unchanged). - `LanceVectorStore::delete_by_chunk_ids` iterates over every `chunk_embeddings_` table on the connection and runs `Table::delete("chunk_id IN (...)")` in batches of 200. Safe even in multi-model workspaces (e.g. mid-migration). - `SqliteStore::stale_chunk_ids_at(workspace_path, new_asset_id)` returns the old chunk_ids via a read-only SELECT. The caller invokes it before the CASCADE runs. - `kebab-app::purge_vector_orphans_for_workspace_path` orchestrates the two steps above. It is called in one line just before `put_asset_with_bytes` in each of the three ingest paths (markdown / image / pdf). Smoke verification (release binary, fastembed enabled): - first ingest of whitepaper.pdf → chunk_ids = {f616…, 4e0f…}; rows for both IDs exist in the vector store. - re-ingest after changing bytes → new doc_id (3741…) + new chunk_ids (ed0c…, e13c…). vector search "REWRITTEN chapter two" → only the new chunk_ids hit. Even with the old query "Edited page two body", the old chunk_ids are no longer in the vector store (the semantically closest new chunks hit instead). The "vector store cleanup" item in HOTFIXES `2026-05-02 P7-3` is updated from "deferred" to "closed by follow-up PR". The known behavior in SMOKE.md ("old vectors remain") is likewise updated to "the two stores are consistent". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 12:32:29 +00:00
altair823	3a57cab1eb	fix(kebab-store-sqlite): purge stale assets row on workspace_path orphan + smoke Fixes a storage-layer bug exposed by the P7-3 integration tests. There was a gap between the UNIQUE constraint on `assets.workspace_path` and `upsert_asset_row`, which handled only `ON CONFLICT(asset_id)`: when re-ingesting an asset whose bytes changed, the new asset_id hit a secondary UNIQUE conflict on the same workspace_path. md / image / pdf were all affected. Fix: - When the new helper `purge_orphan_at_workspace_path` finds a different `asset_id` at the same `workspace_path`, it sweeps documents, then assets. This sidesteps ON DELETE RESTRICT on documents and cleans up blocks / chunks / embedding_records via CASCADE. In copied mode it also deletes the byte file at storage_path on a best-effort basis. - Called from both branches of `put_asset_with_bytes` (copy / reference) and from `DocumentStore::put_asset`. - Regression test `put_asset_with_bytes_sweeps_workspace_path_orphan` (replaces the earlier "clean up orphan on UPSERT failure" test, which is no longer doable). - Removed `#[ignore]` from the `re_ingest_edited_pdf_produces_new_doc_id` integration test → all 9 integration tests pass by default. Vector store orphans are a separate P+ task: LanceDB operates independently of the SQLite cascade, so vectors with stale chunk_ids remain on disk. Search is unaffected (search surfaces results through a SQLite join). Smoke verification (release binary, markdown 2 + image 1 + PDF 2): - doctor pass - first ingest: 5 new - list docs: 5 docs, all media types - search lexical "pdf-page-v1 chunker" → whitepaper.pdf hit - search hybrid → cross-media results - inspect doc PDF: parser_version=pdf-text-v1, blocks are SourceSpan::Page - re-ingest with identical bytes: 5 updated, 0 errors (P1 idempotency) - re-ingest after editing bytes: 1 new (the edited PDF) + 4 updated, 0 errors (storage fix) - add a corrupt PDF: errors+=1 + accurate IngestItem.error message, no effect on other assets - ingest again after cleanup: errors=0 - RAG ask: cites the PDF, and `citations[].citation` correctly exposes `kind: "page"` + `page: <N>` + `path: <pdf_path>` Operational fixture helpers: - `crates/kebab-parse-pdf/examples/gen_smoke_pdf.rs`: generates a PDF in-tree without reportlab/qpdf via `cargo run --release --example gen_smoke_pdf -p kebab-parse-pdf -- <out.pdf> <text-pages>`. - `crates/kebab-parse-image/examples/gen_smoke_png.rs`: generates a PNG fixture the same way. - SMOKE.md reflects usage of both examples and the updated HOTFIXES behavior (on byte edits, errors+=1 → new+=1). The HOTFIXES `2026-05-02 P7-3` entry is updated from "deferred" to "fixed in same PR"; only the vector store orphan caveat remains. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:41:23 +00:00
altair823	f1a448d6dc	refactor(rename): kb → kebab: binary, env vars, XDG paths, file renames Second commit. Bulk-renames the user-facing surface (CLI binary, env vars, XDG paths) plus in-code `kb` tokens (`KB_`, `kb.sqlite`, `/kb/`, tracing target). Also 3 file renames: - design doc `2026-04-27-kb-final-form-design.md` → `2026-04-27-kebab-final-form-design.md` - initial report `kb_local_rust_report.md` → `kebab_local_rust_report.md` - workspace ignore `.kbignore` → `.kebabignore` ## Changes - `crates/kebab-cli/Cargo.toml`: `[[bin]] name = "kb"` → `"kebab"`. - `crates/kebab-cli/src/main.rs`: `#[command(name = "kb", …)]` → `name = "kebab"`. - Every `KB_` env var (code + docs + tests) → `KEBAB_`. Covers the apply_env prefix matching and all 30+ setting keys. - XDG paths: `~/.config/kb` / `~/.local/share/kb` / `~/.cache/kb` / `~/.local/state/kb` → `~/.config/kebab` and so on. Covers config defaults, expand_path tests, and the hardcoded values in paths.rs. - SQLite filename: `kb.sqlite` → `kebab.sqlite` (both the `SQLITE_FILE` const and hardcoded test values). - tracing target: `target: "kb-"` → `"kebab-"` (10+ places). - snapshot fixture: `.kbignore` → `.kebabignore` (`fixtures/source-fs/tree-1.snapshot.json` updated). ## Verification - `cargo test --workspace -j 1` clean (serialized to avoid linker OOM). - `cargo clippy --workspace --all-targets -- -D warnings` clean. Docs sweep follows in the next commit. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 04:01:35 +00:00
altair823	911fb49550	refactor(rename): kb crates → kebab: Cargo packages, folders, Rust modules First step of renaming the project from `kb` to `kebab`. - workspace `Cargo.toml`: members `crates/kb-` → `crates/kebab-`, repository URL `altair823/kb` → `altair823/kebab`. - 18 crate folders renamed via `git mv` (history preserved). - Each crate's `Cargo.toml`: `name = "kb-"` → `"kebab-"`, path deps `../kb-` → `../kebab-`. - All `.rs` files: the 18 `kb_<id>` snake-case module paths (`kb_core`, `kb_config`, `kb_app`, `kb_cli`, `kb_eval`, `kb_search`, `kb_chunk`, `kb_normalize`, `kb_source_fs`, `kb_parse_md`, `kb_parse_types`, `kb_store_sqlite`, `kb_store_vector`, `kb_embed`, `kb_embed_local`, `kb_llm`, `kb_llm_local`, `kb_rag`) → `kebab_<id>` via a bulk sed (using the word boundary \b so the "kb" abbreviation inside English sentences is left untouched). The CLI binary name (`[[bin]] name = "kb"`), `KB_*` environment variables, XDG paths, tracing target, and the docs sweep follow in the next commit. ## Verification - `cargo check --workspace` clean; committed after every crate built successfully. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 03:28:08 +00:00
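
The twin-file problem described in 641b92af7d can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the kebab implementation: `Store`, `content_id`, `unchanged_via_assets`, and `unchanged_via_documents` are hypothetical names, two `HashMap`s stand in for the `assets` and `documents` tables, and a std hasher stands in for blake3.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the blake3 content hash: any stable
/// function of the bytes is enough for this sketch.
fn content_id(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

/// Toy model of the two tables named in the commit message.
#[derive(Default)]
struct Store {
    /// assets: asset_id (content hash) -> workspace_path, one row per distinct content.
    assets: HashMap<u64, String>,
    /// documents: workspace_path -> source_asset_id, one row per path.
    documents: HashMap<String, u64>,
}

impl Store {
    /// Mimics the UPSERT: inserting the same asset_id again overwrites
    /// workspace_path, so the last twin ingested owns the assets row.
    fn ingest(&mut self, path: &str, bytes: &[u8]) {
        let id = content_id(bytes);
        self.assets.insert(id, path.to_string());
        self.documents.insert(path.to_string(), id);
    }

    /// Asset-centric check (pre-fix behaviour in this model): look for an
    /// assets row at `path` whose id matches the fresh checksum.
    fn unchanged_via_assets(&self, path: &str, bytes: &[u8]) -> bool {
        let id = content_id(bytes);
        self.assets
            .iter()
            .any(|(asset_id, p)| p.as_str() == path && *asset_id == id)
    }

    /// Document-centric check (the approach the fix describes): compare the
    /// document row's source_asset_id with the fresh checksum.
    fn unchanged_via_documents(&self, path: &str, bytes: &[u8]) -> bool {
        self.documents.get(path) == Some(&content_id(bytes))
    }
}
```

In this model, ingesting `AGENTS.md` and then `CLAUDE.md` with identical bytes leaves the single assets row pointing at `CLAUDE.md`, so the asset-centric check no longer recognises `AGENTS.md` as unchanged, while the document-centric check still recognises both paths.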

35 Commits