From ddec3d8a94eea4cdb5eaf3238780cc98c4ff6942 Mon Sep 17 00:00:00 2001
From: altair823
Date: Mon, 4 May 2026 17:27:06 +0000
Subject: [PATCH 01/15] =?UTF-8?q?spec(p9-fb-23):=20incremental=20ingest=20?=
 =?UTF-8?q?=E2=80=94=20skip=20unchanged=20docs?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Dogfooding feedback: ingest only changed or new docs; skip documents that have not changed.

Design highlights:
- Four skip conditions (full version cascade): when the blake3 checksum, parser_version, chunker_version, and embedding_version all match, skip parse/chunk/embed/vector upsert. The cost dominator (fastembed) then runs only for changed or new docs.
- SQLite V006 migration: add `last_chunker_version` and `last_embedding_version` columns to `documents`. Existing rows are NULL, which forces reprocessing on the first ingest (safe default).
- New `IngestItemKind::Unchanged` enum variant, kept semantically separate from the existing `Skipped` (`Skipped` is the media-type filter; `Unchanged` means all versions match).
- Add an `unchanged: u32` field to `IngestReport` and `AggregateCounts`. The wire schema change is additive, so v1 compatibility holds.
- `--force-reingest` flag: ignore the skip and force reprocessing.
- Expose `unchanged=N` in the final TUI status_line (cascades automatically to the p9-fb-24 status bar dynamic slot).

Spec status `planned`. Next step: write the implementation plan with the writing-plans skill.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 ...5-04-p9-fb-23-incremental-ingest-design.md | 173 ++++++++++++++++++
 1 file changed, 173 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-05-04-p9-fb-23-incremental-ingest-design.md

diff --git a/docs/superpowers/specs/2026-05-04-p9-fb-23-incremental-ingest-design.md b/docs/superpowers/specs/2026-05-04-p9-fb-23-incremental-ingest-design.md
new file mode 100644
index 0000000..2e6b49b
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-04-p9-fb-23-incremental-ingest-design.md
@@ -0,0 +1,173 @@
+# p9-fb-23: Incremental ingest (skip unchanged docs)
+
+**Date**: 2026-05-04
+**Status**: planned
+**Audience**: kebab-app / kebab-store-sqlite implementer / reviewer.
+**Source feedback**: user dogfooding, 2026-05-04: "When new documents are added to the folder, I want ingest to skip documents that have not changed and process only changed or newly added ones."
+
+## Goal
+
+`kebab ingest` skips parse / chunk / embed / vector upsert for documents that have not changed (and whose version cascade inputs are all identical). The cost dominator (the fastembed embedding call) then runs only for changed or new files.
+
+## Non-goals
+
+- Mtime-based pre-hash skip (avoiding the file read itself). YAGNI: blake3 streaming already happens unconditionally during the scan, and this spec estimates a 90%+ cost reduction from skipping parse/chunk/embed alone.
+- Watch mode (real-time file change detection). Follow-up task.
+- Partial updates (re-embedding a single chunk). Processing is always all-or-nothing per doc.
+
+## Allowed dependencies
+
+- Existing crates only. No new crates.
+- One added SQLite migration (V006).
+
+## Scope
+
+This spec applies only to the *file-system source* (`kebab-source-fs`) and the main ingest pipeline (`kebab-app::ingest_with_config*`). Other source connectors (none exist yet; a later phase) follow the same skip contract: the `IngestReport.unchanged` count is connector-agnostic.
+
+## Skip conditions
+
+A document is classified as `Unchanged` when all four of the following hold:
+
+1. `assets.checksum` (the stored blake3) == the new blake3 (recomputed during the scan).
+2. `documents.parser_version` == the currently active parser_version.
+3. `documents.last_chunker_version` == the currently active chunker_version.
+4. `documents.last_embedding_version` == the currently active embedding_version (or both are NULL, meaning no embedder is configured).
+
+If any one of the four differs, the document takes the normal ingest path: parse / chunk / embed / vector upsert all run.
+
+## Storage changes
+
+**Migration V006** (`crates/kebab-store-sqlite/migrations/V006__incremental_ingest.sql`):
+
+Add two columns to the `documents` table:
+
+```sql
+ALTER TABLE documents ADD COLUMN last_chunker_version TEXT;
+ALTER TABLE documents ADD COLUMN last_embedding_version TEXT;
+```
+
+Existing rows are NULL, so the first ingest always sees a mismatch and reprocesses them (safe default). After that, every ingest stamps the two columns with the currently active versions.
+
+`parser_version` already exists on the `documents` table (from before V005) and is reused as-is.
+
+The V006 migration is not idempotent on its own: SQLite has no `ADD COLUMN IF NOT EXISTS`, and a second `ALTER TABLE ... ADD COLUMN` for an existing column fails with a duplicate-column error. The migration relies on the Refinery framework applying each version exactly once.
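The four skip conditions above can be sketched as a pure predicate. This is a minimal sketch under stated assumptions, not the planned implementation: `StoredState`, `ActiveVersions`, and `is_unchanged` are hypothetical names, and versions are plain strings here, whereas the real code may compare typed version wrappers.

```rust
/// What SQLite already holds for one workspace path (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct StoredState {
    pub checksum: String,                       // assets.checksum (blake3 hex)
    pub parser_version: String,                 // documents.parser_version
    pub last_chunker_version: Option<String>,   // NULL for rows written before V006
    pub last_embedding_version: Option<String>, // NULL when no embedder ran
}

/// Versions active in the current process (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct ActiveVersions {
    pub parser: String,
    pub chunker: String,
    pub embedding: Option<String>, // None when the embedder is disabled
}

/// Returns true only when all four skip conditions hold.
pub fn is_unchanged(stored: &StoredState, new_checksum: &str, active: &ActiveVersions) -> bool {
    stored.checksum == new_checksum
        && stored.parser_version == active.parser
        // A NULL stamp (legacy row) never equals an active chunker version.
        && stored.last_chunker_version.as_deref() == Some(active.chunker.as_str())
        // Option equality makes `None == None` a match (embedder not configured).
        && stored.last_embedding_version == active.embedding
}
```

Keeping the decision in one pure function makes each version-bump case from the test plan a one-line assertion.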
+
+## Pipeline flow
+
+Inside the asset loop of `kebab-app::ingest_with_config_progress_cancellable` (the current main ingest fn):
+
+1. The source connector scans files and streams blake3, producing `asset_blake3` (same as today).
+2. **New early-skip check**:
+   - Look up the existing asset row with `store.get_asset_by_workspace_path(path)`.
+   - If it exists and `existing.checksum == new asset_blake3`, the asset is identical.
+   - Look up the existing doc with `store.get_document_by_doc_id(id_for_doc(path, asset_id, current_parser_version))`.
+   - If it exists, `existing.last_chunker_version == current_chunker_version`, and `existing.last_embedding_version == current_embedding_version`, then **skip**:
+     - `IngestReport.unchanged += 1`.
+     - Emit `IngestEvent::Item { kind: Unchanged, .. }` (shown by the progress consumer).
+     - Continue with the next asset.
+3. If the skip check fails, take the normal path: `put_asset_with_bytes` → parse → `put_document` → chunk → `put_chunks` → embed → `vec_store.upsert`.
+4. At the end of the normal path, stamp `documents.last_chunker_version` and `documents.last_embedding_version` with the currently active versions (the `Document` struct passed to `put_document` gains two fields; the refinery migration provides the columns they fill).
+
+## API changes
+
+### `kebab-core::Document` struct
+
+Two fields are added:
+
+```rust
+pub struct Document {
+    // ... existing ...
+    pub last_chunker_version: Option<ChunkerVersion>,
+    pub last_embedding_version: Option<EmbeddingVersion>,
+}
+```
+
+Both are `Option`s: when no embedder is configured (config.models.embedding.enabled = false), `last_embedding_version = None`.
+
+### `kebab-core::IngestReport` + `kebab-app::AggregateCounts`
+
+Add an `unchanged: u32` field. Wire schema change:
+
+Add an `unchanged` field (integer, minimum 0) to `docs/wire-schema/v1/ingest_report.schema.json`. **Additive, so v1 compatibility holds** (existing clients ignore fields they do not know). No v2 bump is needed.
+
+`AggregateCounts::default()` handles `unchanged: 0` automatically.
+
+### `kebab-core::IngestItemKind`
+
+```rust
+pub enum IngestItemKind {
+    New,
+    Updated,
+    Skipped,   // existing: media-type filter / kb:// URI
+    Unchanged, // new: all four skip conditions hold
+    Error,
+}
+```
+
+`Skipped` (media-type filter) and `Unchanged` (all versions match) are kept semantically separate. `IngestEvent::Item.kind` is extended the same way.
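The early-skip step in the pipeline flow above could look like the following loop. A minimal sketch, assuming an in-memory `HashMap` in place of the SQLite store and a hypothetical `process` closure standing in for parse / chunk / embed / upsert. `Kind`, `Stored`, `Counts`, and `ingest` are illustrative names, and the parser-version check via `id_for_doc` is left out for brevity.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    New,
    Updated,
    Unchanged,
}

/// Per-path state the sketch keeps instead of SQLite rows.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Stored {
    pub checksum: String,
    pub chunker: String,
    pub embedding: Option<String>,
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct Counts {
    pub new: u32,
    pub updated: u32,
    pub unchanged: u32,
}

pub fn ingest(
    store: &mut HashMap<String, Stored>,
    scanned: &[(&str, &str)], // (workspace path, fresh checksum)
    chunker: &str,
    embedding: Option<&str>,
    mut process: impl FnMut(&str),
) -> Counts {
    let mut counts = Counts::default();
    for &(path, checksum) in scanned {
        let fresh = Stored {
            checksum: checksum.to_string(),
            chunker: chunker.to_string(),
            embedding: embedding.map(str::to_string),
        };
        let kind = match store.get(path) {
            Some(existing) if *existing == fresh => Kind::Unchanged,
            Some(_) => Kind::Updated,
            None => Kind::New,
        };
        match kind {
            // Early skip: count it and move on without touching the expensive path.
            Kind::Unchanged => {
                counts.unchanged += 1;
                continue;
            }
            Kind::Updated => counts.updated += 1,
            Kind::New => counts.new += 1,
        }
        process(path); // stand-in for parse / chunk / embed / vector upsert
        store.insert(path.to_string(), fresh); // stamp the versions just used
    }
    counts
}
```

Counting calls to `process` gives a cheap way to assert that a second run over unchanged files never reaches the embedder.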
+
+### New `kebab-store-sqlite` methods
+
+```rust
+fn get_asset_by_workspace_path(&self, path: &WorkspacePath) -> Result<Option<RawAsset>>;
+fn get_document_by_doc_id(&self, doc_id: &DocumentId) -> Result<Option<Document>>;
+```
+
+The existing `put_*` / `purge_*` methods are unchanged. Only new read paths are added.
+
+## TUI exposure
+
+Add `unchanged` to the final line format of `kebab-tui::ingest_progress::status_line`:
+
+```
+✓ ingest: 100 docs (5 new, 3 updated, 92 unchanged, 0 skipped), 142 chunks indexed in 12s
+```
+
+The in-flight status keeps its current format (it is per-asset, so a running count of unchanged docs is not needed).
+
+The p9-fb-24 status bar dynamic slot shows the same text (the final line of the cascade's `indexing N/M` display).
+
+## CLI exposure
+
+The `--json` mode of `kebab ingest` emits the wire schema's `unchanged` field automatically. The human-mode final line uses the same format as the status_line above.
+
+A new `--force-reingest` flag ignores the skip conditions and reprocesses every doc. It covers the "odd results, so just reprocess everything" case. Following the same pattern as the CLI's `kebab_app::AskOpts`, add `IngestOpts.force_reingest: bool`, default false.
+
+## Tests
+
+### New unit tests
+
+- V006 migration smoke (sqlite store): apply, then check that both columns exist on `documents` with a NULL default.
+- Unit tests for `get_asset_by_workspace_path` / `get_document_by_doc_id` (kebab-store-sqlite).
+- `id_for_doc` is unchanged (parser_version remains its only version input).
+
+### New integration tests (kebab-app)
+
+- **Unchanged path**: ingest once; the second ingest reports `IngestReport.unchanged == 1` with zero embed calls.
+- **Checksum mismatch**: modify the file after the first ingest; the second ingest reports `updated == 1`.
+- **Parser version bump**: simulate a change to the `KEBAB_PARSE_MD_VERSION` constant after the first ingest; the second ingest reports `updated == 1` (the doc_id changes).
+- **Chunker version bump**: simulate a chunker_version change after the first ingest; `updated == 1`.
+- **Embedder version bump**: simulate an embedder_version change after the first ingest; `updated == 1`.
+- **`--force-reingest`**: the second ingest meets the skip conditions but is forced to report `updated == 1` (or a separate category? open question).
+
+### Impact on existing tests
+
+- The existing ingest integration tests (kebab-app/tests/) start from an empty KB, so they all take the first-ingest path and pass as-is with `unchanged` at 0.
+- Verify that the `IngestReport` JSON output tests stay compatible once the `unchanged` field is added. The change is additive, so they should pass.
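The final status line described in the TUI section can be produced by a small formatter. A sketch only: `Final` and `final_status_line` are hypothetical names, and reading "docs" as the sum of the four per-kind counts is an assumption taken from the example line (5 + 3 + 92 + 0 = 100).

```rust
/// Totals available once an ingest run finishes (hypothetical shape).
pub struct Final {
    pub new: u32,
    pub updated: u32,
    pub unchanged: u32,
    pub skipped: u32,
    pub chunks_indexed: u32,
    pub secs: u64,
}

/// Formats the final line; `docs` is assumed to be the sum of the per-kind counts.
pub fn final_status_line(f: &Final) -> String {
    let docs = f.new + f.updated + f.unchanged + f.skipped;
    format!(
        "✓ ingest: {} docs ({} new, {} updated, {} unchanged, {} skipped), {} chunks indexed in {}s",
        docs, f.new, f.updated, f.unchanged, f.skipped, f.chunks_indexed, f.secs
    )
}
```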
+
+## Spec contract impact
+
+- **Design §9 versioning cascade**: adds explicit behavior. After a parser/chunker/embedder version bump, the next ingest automatically processes every doc as `updated`. Previously the new version silently overwrote the old data (idempotent UPSERT); this spec guarantees an explicit refresh.
+- **Design §3.x IngestReport**: adds the `unchanged` field (additive). No v1 wire schema bump.
+- **Design §2.4a IngestEvent**: adds the `IngestItemKind::Unchanged` variant. Line-delimited JSON consumers ignore unknown variants (current default behavior).
+
+## Risks / notes
+
+- **Stale skip risk**: if the user swaps the embedder with an external tool (for example an Ollama model swap) without updating `models.embedding.id` in the config, `last_embedding_version` still matches and the doc is silently skipped. Possible mitigations: include model_id in the stamp, or have the doctor command detect the mismatch and recommend a fix. This spec trusts only `embedding_version` (a fingerprint of model name + version); the integrity of the model itself is a separate concern.
+- **Force-reingest UX**: `--force-reingest` reprocesses every doc, which is expensive on a large corpus. A confirm prompt was considered; for now it is a flag only, since the user types it explicitly and no confirmation is needed.
+- **V006 migration compatibility**: refinery does not support down-migrations (one-way). Rolling back to an earlier commit leaves the columns in place (a limitation of SQLite ALTER). This is harmless: the columns are simply unused.
+- **Relationship to doc_version**: the existing `doc_version` (+1 per ingest) stays as it is. The Unchanged path does not bump `doc_version`, which preserves the meaning "not processed in this ingest".
+
+## Live deviations
+
+Deviations discovered later are appended as dated log entries under the p9-fb-23 item dated 2026-05-04 in `tasks/HOTFIXES.md`. The spec itself is frozen.

From 6a8d155da91d708429d357b18766988df96f6e03 Mon Sep 17 00:00:00 2001
From: altair823
Date: Mon, 4 May 2026 17:34:30 +0000
Subject: [PATCH 02/15] =?UTF-8?q?plan(p9-fb-23):=20TDD=20implementation=20?=
 =?UTF-8?q?plan=20=E2=80=94=2010=20tasks?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Spec → 10-step plan, TDD per task (failing test → impl → pass → commit).

Tasks:
1. IngestItemKind::Unchanged + IngestReport.unchanged + AggregateCounts.unchanged + additive wire schema change
2. Add last_chunker_version + last_embedding_version Option fields to CanonicalDocument + fill None at 14 callers
3. V006 migration + SQLite put/get_document round-trip for the new columns
4. DocumentStore::get_asset_by_workspace_path trait + SQLite impl
5. ingest pipeline stamps the current chunker/embedding versions on CanonicalDocument (no skip yet)
6. IngestOpts { progress, cancel, force_reingest } struct + ingest_with_config_opts entry (AskOpts pattern)
7. early-skip block in the asset loop (4 conditions match → Unchanged + continue)
8. CLI --force-reingest flag
9. Expose unchanged=N in the TUI status_line
10. docs sync: README + HANDOFF + HOTFIXES + INDEX + per-task spec

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../2026-05-04-p9-fb-23-incremental-ingest.md | 1138 +++++++++++++++++
 1 file changed, 1138 insertions(+)
 create mode 100644 docs/superpowers/plans/2026-05-04-p9-fb-23-incremental-ingest.md

diff --git a/docs/superpowers/plans/2026-05-04-p9-fb-23-incremental-ingest.md b/docs/superpowers/plans/2026-05-04-p9-fb-23-incremental-ingest.md
new file mode 100644
index 0000000..621f3ec
--- /dev/null
+++ b/docs/superpowers/plans/2026-05-04-p9-fb-23-incremental-ingest.md
@@ -0,0 +1,1138 @@
+# p9-fb-23: Incremental ingest Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Skip parse / chunk / embed / vector upsert for documents whose blake3 checksum AND parser/chunker/embedding versions all match what's already in SQLite, so re-running `kebab ingest` only does work for new or changed files.
+
+**Architecture:** Add two columns to `documents` (V006 migration) tracking the chunker + embedding versions used in the last ingest. Add a new `DocumentStore::get_asset_by_workspace_path` read path. Insert an early-skip block at the top of the per-asset processing loop in `kebab-app::ingest_with_config_*` that returns `IngestItemKind::Unchanged` when all four conditions match. A new `IngestOpts.force_reingest` flag bypasses the skip. `IngestReport` + `AggregateCounts` gain an `unchanged` count, surfaced in the wire schema, CLI summary, and TUI status bar.
+
+**Tech Stack:** Rust 2024, refinery (SQLite migrations), serde, anyhow. No new deps.
+
+**Spec:** `docs/superpowers/specs/2026-05-04-p9-fb-23-incremental-ingest-design.md`
+
+---
+
+## File Structure
+
+**Created:**
+- `migrations/V006__incremental_ingest.sql`: `ALTER TABLE documents` adds `last_chunker_version` + `last_embedding_version` TEXT (nullable).
+
+**Modified:**
+- `crates/kebab-core/src/ingest.rs`: `IngestItemKind::Unchanged` variant, `IngestReport.unchanged: u32`.
+- `crates/kebab-core/src/document.rs`: `CanonicalDocument` gains `last_chunker_version: Option<ChunkerVersion>` + `last_embedding_version: Option<EmbeddingVersion>`. 14 construction sites add the two `None` fields.
+- `crates/kebab-core/src/traits.rs`: new `DocumentStore::get_asset_by_workspace_path(&WorkspacePath) -> Option<RawAsset>` method.
+- `crates/kebab-store-sqlite/src/documents.rs`: `put_document` writes the new columns, `get_document` reads them.
+- `crates/kebab-store-sqlite/src/store.rs` (or `assets.rs`; verify the location during impl): `get_asset_by_workspace_path` impl.
+- `crates/kebab-app/src/lib.rs`: `IngestOpts { progress, cancel, force_reingest }` struct, ingest fn chain refactor, early-skip logic in the asset loop, version stamps on CanonicalDocument.
+- `crates/kebab-app/src/ingest_progress.rs`: `AggregateCounts.unchanged: u32`, status_line text update.
+- `crates/kebab-cli/src/commands/ingest.rs` (or similar): `--force-reingest` flag plumbed.
+- `crates/kebab-tui/src/ingest_progress.rs`: status_line final text gains `unchanged=N`.
+- `docs/wire-schema/v1/ingest_report.schema.json`: additive `unchanged` (integer, minimum 0).
+- `README.md` / `HANDOFF.md` / `tasks/HOTFIXES.md` / `tasks/INDEX.md` / `tasks/p9/p9-fb-23-incremental-ingest.md`: docs sync.
+
+---
+
+### Task 1: Extend ingest reporting types (`Unchanged` variant + counts)
+
+**Files:**
+- Modify: `crates/kebab-core/src/ingest.rs`
+- Modify: `crates/kebab-app/src/ingest_progress.rs`
+- Modify: `docs/wire-schema/v1/ingest_report.schema.json`
+
+The reporting types are foundational: every downstream task reads / writes them. Land them first as a no-op (no callers produce `Unchanged` yet, no counter increments) so subsequent tasks can target the new shapes.
+
+- [ ] **Step 1: Add `Unchanged` to `IngestItemKind`**
+
+Open `crates/kebab-core/src/ingest.rs`. Replace the enum (around lines 38-45):
+
+```rust
+#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq, Serialize, Deserialize)]
+#[serde(rename_all = "lowercase")]
+pub enum IngestItemKind {
+    New,
+    Updated,
+    /// Media-type filter / kb:// URI / non-supported source: never made
+    /// it into the parse step.
+    Skipped,
+    /// p9-fb-23: blake3 checksum + parser_version + chunker_version +
+    /// embedding_version all matched the existing record. Parse / chunk
+    /// / embed / vector upsert all skipped.
+    Unchanged,
+    Error,
+}
+```
+
+- [ ] **Step 2: Add `unchanged` to `IngestReport`**
+
+In the same file, replace `IngestReport` (around lines 10-21):
+
+```rust
+#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
+pub struct IngestReport {
+    pub scope: SourceScope,
+    pub scanned: u32,
+    pub new: u32,
+    pub updated: u32,
+    /// Media-type / source filter (`kb://`, unsupported types).
+    pub skipped: u32,
+    /// p9-fb-23: assets whose checksum + all version inputs matched;
+    /// parse / chunk / embed / vector upsert all skipped.
+    pub unchanged: u32,
+    pub errors: u32,
+    pub duration_ms: u32,
+    /// `None` ↔ wire `items: null` (`--summary-only`).
+    // NOTE: the element type was lost in extraction; `IngestItem` is an
+    // assumed name. Keep whatever type the file already uses.
+    pub items: Option<Vec<IngestItem>>,
+}
+```
+
+- [ ] **Step 3: Add `unchanged` to `AggregateCounts`**
+
+Open `crates/kebab-app/src/ingest_progress.rs`. Replace the struct (around lines 25-34):
+
+```rust
+#[derive(Clone, Copy, Debug, Default, Eq, PartialEq, Serialize, Deserialize)]
+pub struct AggregateCounts {
+    pub scanned: u32,
+    pub new: u32,
+    pub updated: u32,
+    pub skipped: u32,
+    /// p9-fb-23: assets whose checksum + all version inputs matched the
+    /// existing DB record; parse / chunk / embed / vector upsert all
+    /// skipped.
+    pub unchanged: u32,
+    pub errors: u32,
+    pub chunks_indexed: u32,
+    pub embeddings_indexed: u32,
+}
+```
+
+`#[derive(Default)]` automatically zero-fills the new field.
+
+- [ ] **Step 4: Update wire schema**
+
+Open `docs/wire-schema/v1/ingest_report.schema.json`. Find the `properties` block. Add `unchanged` next to `skipped`:
+
+```json
+    "unchanged": {
+      "type": "integer",
+      "minimum": 0,
+      "description": "p9-fb-23: assets whose checksum + parser_version + chunker_version + embedding_version all matched the existing record. Parse / chunk / embed / vector upsert all skipped."
+    },
+```
+
+If the schema has a `required` array that lists `skipped`, also add `unchanged` to that array, since `IngestReport` always carries the field (defaulted to 0).
+
+- [ ] **Step 5: Build + fix any compile errors at construction sites**
+
+Run: `cargo build --workspace`
+Expected: compile errors at every site that constructs `IngestReport { ... }` literally without the new `unchanged` field.
+
+For each error reported by the compiler, add `unchanged: 0,` next to `skipped: ...,` at the construction site. The compiler's list is exhaustive; work through it linearly.
+
+- [ ] **Step 6: Per-crate test smoke**
+
+Run: `cargo test -p kebab-core --lib`
+Run: `cargo test -p kebab-app --lib`
+Expected: all pass. Existing tests should round-trip through serde with no behavioral change (the new field defaults to 0).
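Step 3 relies on `#[derive(Default)]` zero-filling the new field, while Step 5 fixes literal construction sites by hand. The sketch below shows the difference on a cut-down copy of `AggregateCounts` (only three of its fields are kept here, and `from_scan` / `exhaustive` are hypothetical helpers): sites that use struct-update syntax pick up `unchanged: 0` without edits, while exhaustive literals must name the new field.

```rust
/// Cut-down stand-in for the real struct; only three fields are shown.
#[derive(Clone, Copy, Debug, Default, Eq, PartialEq)]
pub struct AggregateCounts {
    pub scanned: u32,
    pub skipped: u32,
    pub unchanged: u32, // the newly added field
}

/// Struct-update syntax: compiles unchanged when a field is added.
pub fn from_scan(scanned: u32) -> AggregateCounts {
    AggregateCounts { scanned, ..Default::default() }
}

/// Exhaustive literal: the compiler rejects it until `unchanged` is named.
pub fn exhaustive(scanned: u32, skipped: u32) -> AggregateCounts {
    AggregateCounts { scanned, skipped, unchanged: 0 }
}
```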
+
+- [ ] **Step 7: Commit**
+
+```bash
+git add crates/kebab-core/src/ingest.rs crates/kebab-app/src/ingest_progress.rs docs/wire-schema/v1/ingest_report.schema.json
+# Add any other files the compile-fix step touched.
+git add -u
+git commit -m "feat(kebab-core): p9-fb-23 task 1 - IngestItemKind::Unchanged + IngestReport.unchanged
+
+Co-Authored-By: Claude Opus 4.7 (1M context)"
+```
+
+---
+
+### Task 2: Extend `CanonicalDocument` with version stamps
+
+**Files:**
+- Modify: `crates/kebab-core/src/document.rs`
+- Modify: 14 construction sites across the workspace.
+
+- [ ] **Step 1: Add the two fields to `CanonicalDocument`**
+
+Open `crates/kebab-core/src/document.rs`. Find the `CanonicalDocument` struct (around lines 12-25). Add the imports for `ChunkerVersion` + `EmbeddingVersion`:
+
+```rust
+use crate::versions::{ChunkerVersion, EmbeddingVersion, ParserVersion};
+```
+
+Replace the struct:
+
+```rust
+#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
+pub struct CanonicalDocument {
+    pub doc_id: DocumentId,
+    pub source_asset_id: AssetId,
+    pub workspace_path: WorkspacePath,
+    pub title: String,
+    pub lang: Lang,
+    // NOTE: the element type was lost in extraction; `Block` is an
+    // assumed name. Keep whatever type the file already uses.
+    pub blocks: Vec<Block>,
+    pub metadata: Metadata,
+    pub provenance: Provenance,
+    pub parser_version: ParserVersion,
+    pub schema_version: u32,
+    pub doc_version: u32,
+    /// p9-fb-23: chunker version active when this document was last
+    /// chunked. `None` for rows ingested before V006 migration; the
+    /// next ingest stamps the current version. Compared against the
+    /// active chunker version for the incremental-ingest skip path.
+    pub last_chunker_version: Option<ChunkerVersion>,
+    /// p9-fb-23: embedding model version active when this document
+    /// was last embedded. `None` if no embedder is configured (skip
+    /// path treats `None == None` as a match; see design doc).
+    pub last_embedding_version: Option<EmbeddingVersion>,
+}
+```
+
+- [ ] **Step 2: Build + fix all 14 construction sites**
+
+Run: `cargo build --workspace`
+Expected: compile errors at every `CanonicalDocument { ... }` literal in the workspace. Confirm the count by running:
+
+```bash
+grep -rn "CanonicalDocument {" crates/ | wc -l
+```
+
+For each error, add `last_chunker_version: None, last_embedding_version: None,` at the end of the struct literal. Sites (verify each; the list may have shifted since the plan was written):
+
+- `crates/kebab-parse-image/src/lib.rs` (around line 203)
+- `crates/kebab-parse-pdf/src/lib.rs` (around line 207)
+- `crates/kebab-normalize/src/lib.rs` (around line 160)
+- `crates/kebab-tui/tests/inspect.rs` (around line 66)
+- `crates/kebab-chunk/src/md_heading_v1.rs` test fixture (around line 459)
+- `crates/kebab-chunk/src/pdf_page_v1.rs` test fixture (around line 300)
+- `crates/kebab-store-sqlite/tests/list_docs.rs` (around line 58)
+- ... plus any others the compiler surfaces.
+
+These are mechanical `None, None` insertions with no logic changes.
+
+- [ ] **Step 3: Build clean**
+
+Run: `cargo build --workspace`
+Expected: clean.
+
+- [ ] **Step 4: Test smoke**
+
+Run: `cargo test -p kebab-core -p kebab-normalize -p kebab-parse-image -p kebab-parse-pdf -p kebab-chunk --lib`
+Expected: all pass.
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add -u
+git commit -m "feat(kebab-core): p9-fb-23 task 2 - CanonicalDocument gains last_chunker_version + last_embedding_version
+
+Co-Authored-By: Claude Opus 4.7 (1M context)"
+```
+
+---
+
+### Task 3: V006 migration + SQLite put/get_document round-trip
+
+**Files:**
+- Create: `migrations/V006__incremental_ingest.sql`
+- Modify: `crates/kebab-store-sqlite/src/documents.rs`
+
+- [ ] **Step 1: Write the V006 migration**
+
+Create `migrations/V006__incremental_ingest.sql`:
+
+```sql
+-- p9-fb-23: incremental ingest needs to know which chunker / embedding
+-- versions were used to populate this document so a re-ingest can
+-- decide whether to skip (versions match) or re-process (any mismatch).
+-- parser_version is already on documents from V001.
+ALTER TABLE documents ADD COLUMN last_chunker_version TEXT;
+ALTER TABLE documents ADD COLUMN last_embedding_version TEXT;
+```
+
+Both columns default to NULL. Refinery applies the migrations in order, and existing rows get NULL for the new columns.
+
+- [ ] **Step 2: Find the existing `put_document` SQLite impl + extend SQL**
+
+Open `crates/kebab-store-sqlite/src/documents.rs`. Find `put_document` (around line 51). Read the surrounding code to understand the shape of the INSERT / UPSERT statement: the new columns must be added to both the column list and the values list, AND to the `ON CONFLICT ... DO UPDATE SET` clause if present.
+
+A representative shape after the change (adapt to the actual SQL in the file):
+
+```rust
+const INSERT_DOC_SQL: &str = r#"
+    INSERT INTO documents (
+        doc_id, asset_id, workspace_path, title, lang,
+        source_type, trust_level, parser_version,
+        doc_version, schema_version,
+        metadata_json, provenance_json,
+        created_at, updated_at,
+        last_chunker_version, last_embedding_version
+    ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+    ON CONFLICT(doc_id) DO UPDATE SET
+        asset_id = excluded.asset_id,
+        workspace_path = excluded.workspace_path,
+        title = excluded.title,
+        lang = excluded.lang,
+        source_type = excluded.source_type,
+        trust_level = excluded.trust_level,
+        parser_version = excluded.parser_version,
+        doc_version = excluded.doc_version + 1,
+        schema_version = excluded.schema_version,
+        metadata_json = excluded.metadata_json,
+        provenance_json = excluded.provenance_json,
+        updated_at = excluded.updated_at,
+        last_chunker_version = excluded.last_chunker_version,
+        last_embedding_version = excluded.last_embedding_version
+"#;
+```
+
+In the binding step (`stmt.execute(...)` or similar), append two more positional params at the end:
+
+```rust
+doc.last_chunker_version.as_ref().map(|v| v.0.as_str()),
+doc.last_embedding_version.as_ref().map(|v| v.0.as_str()),
+```
+
+(The exact rusqlite API depends on whether the file uses `params!` or `[...]`. Match the existing style in the file.)
+
+- [ ] **Step 3: Extend `get_document` SQLite impl**
+
+In the same file, find `get_document` (around line 157). The SELECT statement and row-mapper closure must include the two new columns:
+
+```rust
+const SELECT_DOC_SQL: &str = r#"
+    SELECT
+        doc_id, asset_id, workspace_path, title, lang,
+        source_type, trust_level, parser_version,
+        doc_version, schema_version,
+        metadata_json, provenance_json,
+        created_at, updated_at,
+        last_chunker_version, last_embedding_version
+    FROM documents
+    WHERE doc_id = ?
+"#;
+```
+
+Row-mapper closure (adapt to the file's pattern):
+
+```rust
+let last_chunker_version: Option<String> = row.get("last_chunker_version")?;
+let last_embedding_version: Option<String> = row.get("last_embedding_version")?;
+let last_chunker_version = last_chunker_version.map(kebab_core::ChunkerVersion);
+let last_embedding_version = last_embedding_version.map(kebab_core::EmbeddingVersion);
+
+// In the CanonicalDocument literal:
+last_chunker_version,
+last_embedding_version,
+```
+
+- [ ] **Step 4: Write a round-trip integration test**
+
+Open `crates/kebab-store-sqlite/tests/list_docs.rs` (or create a new test file `tests/incremental_ingest.rs` if the existing test suite has a different focus). Append:
+
+```rust
+#[test]
+fn put_then_get_document_roundtrips_version_stamps() {
+    let store = build_test_store(); // existing helper; adapt to actual name
+    let mut doc = sample_doc(); // existing helper
+    doc.last_chunker_version = Some(kebab_core::ChunkerVersion("md-heading-v1".into()));
+    doc.last_embedding_version = Some(kebab_core::EmbeddingVersion("multilingual-e5-small@v1".into()));
+    store.put_document(&doc).unwrap();
+    let loaded = store.get_document(&doc.doc_id).unwrap().expect("doc round-trips");
+    assert_eq!(loaded.last_chunker_version, doc.last_chunker_version);
+    assert_eq!(loaded.last_embedding_version, doc.last_embedding_version);
+}
+
+#[test]
+fn put_then_get_document_roundtrips_none_stamps() {
+    let store = build_test_store();
+    let doc = sample_doc(); // sample_doc default: both None
+    store.put_document(&doc).unwrap();
+    let loaded = store.get_document(&doc.doc_id).unwrap().expect("doc round-trips");
+    assert_eq!(loaded.last_chunker_version, None);
+    assert_eq!(loaded.last_embedding_version, None);
+}
+```
+
+If `build_test_store` / `sample_doc` are not the actual helper names, look at the top of `list_docs.rs` and use whatever pattern that file uses to construct an isolated SQLite store + a sample CanonicalDocument.
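The row-mapper in Step 3 turns a nullable TEXT column into an optional newtype, and the binding in Step 2 goes the other way. The snippet below isolates both conversions without rusqlite. The `ChunkerVersion` defined here is a stand-in tuple struct (the real type lives in `kebab-core`, and its exact definition is assumed), and `to_column` / `from_column` are hypothetical helpers.

```rust
/// Stand-in for `kebab_core::ChunkerVersion`; assumed to be a String newtype.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct ChunkerVersion(pub String);

/// Bind side: `None` becomes SQL NULL, `Some` borrows the inner string.
pub fn to_column(v: &Option<ChunkerVersion>) -> Option<&str> {
    v.as_ref().map(|v| v.0.as_str())
}

/// Read side: a tuple-struct constructor is an ordinary function,
/// so it can be passed straight to `Option::map`.
pub fn from_column(raw: Option<String>) -> Option<ChunkerVersion> {
    raw.map(ChunkerVersion)
}
```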
+
+- [ ] **Step 5: Run the migration + tests**
+
+Run: `cargo test -p kebab-store-sqlite`
+Expected: all pass, including the two new round-trip tests.
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add migrations/V006__incremental_ingest.sql crates/kebab-store-sqlite/src/documents.rs crates/kebab-store-sqlite/tests/list_docs.rs
+git commit -m "feat(kebab-store-sqlite): p9-fb-23 task 3 - V006 migration + put/get_document round-trip version stamps
+
+Co-Authored-By: Claude Opus 4.7 (1M context)"
+```
+
+---
+
+### Task 4: New `DocumentStore::get_asset_by_workspace_path`
+
+**Files:**
+- Modify: `crates/kebab-core/src/traits.rs`
+- Modify: `crates/kebab-store-sqlite/src/store.rs` (or `assets.rs`; find via grep `fn put_asset_with_bytes`)
+
+- [ ] **Step 1: Add the trait method**
+
+Open `crates/kebab-core/src/traits.rs`. Find the `pub trait DocumentStore` block (around line 151). Add a new method directly after `get_document`:
+
+```rust
+    /// p9-fb-23: look up an asset row by its workspace path. Used by
+    /// the incremental-ingest skip path to compare the freshly
+    /// computed blake3 checksum against what's already in SQLite. The
+    /// schema enforces a unique workspace_path per asset.
+    fn get_asset_by_workspace_path(
+        &self,
+        path: &WorkspacePath,
+    ) -> anyhow::Result<Option<RawAsset>>;
+```
+
+If `WorkspacePath` / `RawAsset` are not yet imported at the top of `traits.rs`, add them: the existing imports already pull in the types used by other trait methods, so extend that block.
+
+- [ ] **Step 2: Implement on SQLite store**
+
+Find the file that contains `put_asset_with_bytes` (likely `crates/kebab-store-sqlite/src/store.rs` or `crates/kebab-store-sqlite/src/assets.rs`). Locate it via:
+
+```bash
+grep -rn "fn put_asset_with_bytes" crates/kebab-store-sqlite/
+```
+
+Add the new method to the same `impl DocumentStore for SqliteStore` (or `impl SqliteStore`) block:
+
+```rust
+    fn get_asset_by_workspace_path(
+        &self,
+        path: &kebab_core::WorkspacePath,
+    ) -> anyhow::Result<Option<kebab_core::RawAsset>> {
+        let conn = self.conn.lock(); // or whatever the file's connection pattern is
+        let row = conn.query_row(
+            r#"SELECT
+                asset_id, source_uri, workspace_path, media_type,
+                byte_len, checksum, storage_kind, storage_path,
+                discovered_at
+            FROM assets
+            WHERE workspace_path = ?"#,
+            rusqlite::params![path.0.as_str()],
+            |row| {
+                // Build RawAsset from columns. Reuse the existing
+                // row-mapper helper if `put_asset_with_bytes` already
+                // has one (avoid duplicating the parse logic).
+                Ok(/* RawAsset { ... } literal */)
+            },
+        );
+        match row {
+            Ok(asset) => Ok(Some(asset)),
+            Err(rusqlite::Error::QueryReturnedNoRows) => Ok(None),
+            Err(e) => Err(e.into()),
+        }
+    }
+```
+
+The exact body depends on this crate's helpers. If there is already a private `asset_from_row` fn, reuse it. If not, add one and call it from BOTH `get_asset_by_workspace_path` AND any future getter (DRY: don't open-code the column→RawAsset mapping in two places).
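The `match` that closes the lookup above maps "no such row" to `Ok(None)` and lets every other failure propagate. A dependency-free sketch of the same shape, where `QueryError` is a hypothetical stand-in for `rusqlite::Error` and `optional` is an illustrative helper:

```rust
use std::fmt;

/// Hypothetical stand-in for the database error type.
#[derive(Debug, PartialEq)]
pub enum QueryError {
    NoRows,
    Other(String),
}

impl fmt::Display for QueryError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            QueryError::NoRows => write!(f, "query returned no rows"),
            QueryError::Other(msg) => write!(f, "query failed: {}", msg),
        }
    }
}

impl std::error::Error for QueryError {}

/// Converts "row not found" into `Ok(None)`; real failures stay errors.
pub fn optional<T>(res: Result<T, QueryError>) -> Result<Option<T>, QueryError> {
    match res {
        Ok(value) => Ok(Some(value)),
        Err(QueryError::NoRows) => Ok(None),
        Err(other) => Err(other),
    }
}
```

rusqlite ships an `OptionalExtension` trait whose `.optional()` method performs this conversion on a `query_row` result; whether the crate already imports it is something to check during implementation.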
+
+- [ ] **Step 3: Add a test**
+
+Append to the existing test suite that covers asset writes (find via `grep -rn "put_asset_with_bytes" crates/kebab-store-sqlite/tests/`):
+
+```rust
+#[test]
+fn get_asset_by_workspace_path_roundtrips() {
+    let store = build_test_store();
+    let (asset, bytes) = sample_asset_with_bytes("notes/foo.md", b"hello");
+    store.put_asset_with_bytes(&asset, &bytes).unwrap();
+    let loaded = store
+        .get_asset_by_workspace_path(&asset.workspace_path)
+        .unwrap()
+        .expect("asset must round-trip");
+    assert_eq!(loaded.asset_id, asset.asset_id);
+    assert_eq!(loaded.checksum, asset.checksum);
+    assert_eq!(loaded.byte_len, asset.byte_len);
+}
+
+#[test]
+fn get_asset_by_workspace_path_returns_none_for_unknown() {
+    let store = build_test_store();
+    let path = kebab_core::WorkspacePath::new("notes/missing.md".into()).unwrap();
+    assert!(store.get_asset_by_workspace_path(&path).unwrap().is_none());
+}
+```
+
+If `sample_asset_with_bytes` doesn't exist in the test fixture, build a `RawAsset` by hand using the existing pattern from other asset tests in the same crate.
+
+- [ ] **Step 4: Run + commit**
+
+```bash
+cargo test -p kebab-store-sqlite
+git add crates/kebab-core/src/traits.rs crates/kebab-store-sqlite/src/ # whichever file you modified
+git add -u # for the test file
+git commit -m "feat(kebab-store-sqlite): p9-fb-23 task 4 - get_asset_by_workspace_path
+
+Co-Authored-By: Claude Opus 4.7 (1M context)"
+```
+
+---
+
+### Task 5: Stamp current versions on `CanonicalDocument` in ingest
+
+**Files:**
+- Modify: `crates/kebab-app/src/lib.rs` (or wherever `CanonicalDocument` is constructed in the ingest pipeline; search via `grep -n "CanonicalDocument" crates/kebab-app/src/lib.rs`)
+
+This task does NOT add the skip path yet. It just makes every freshly-ingested doc carry the current chunker + embedding versions, so Task 7's skip detection has data to work with on the SECOND ingest run.
+ +- [ ] **Step 1: Find the CanonicalDocument construction sites in ingest** + +Run: + +```bash +grep -n "CanonicalDocument" crates/kebab-app/src/lib.rs +``` + +Markdown / image / PDF flows likely each construct one. Note: most actual `CanonicalDocument` literals are inside parser crates (kebab-parse-md, kebab-parse-image, kebab-parse-pdf) — the parsers don't know about the chunker / embedder version. The right insertion point is in `kebab-app::ingest_with_config_*` AFTER parse + AFTER chunk, where the chunker that ran is in scope and the embedder reference is at hand. + +The pattern: parsers return a `CanonicalDocument` with `last_chunker_version: None, last_embedding_version: None` (Task 2's mechanical change). The ingest pipeline then mutates the doc to stamp the versions before `put_document`. + +- [ ] **Step 2: Stamp logic in the ingest pipeline** + +Find each `put_document(&canonical)` call site in `crates/kebab-app/src/lib.rs`. Just before each call, mutate the doc: + +```rust +canonical.last_chunker_version = Some(chunker.chunker_version()); // or whatever variable holds the chunker +if let Some(emb) = embedder.as_ref() { + canonical.last_embedding_version = Some(emb.model_version()); +} +// else leave as None — embedder is not configured, so the skip path +// will treat None == None as a match (no stale state to compare). +``` + +The variable names depend on the existing scope at each call site — read 30 lines above to see what's bound. The chunker is constructed as `let chunker = MdHeadingV1Chunker::new(...)` (or `PdfPageV1Chunker`); follow that variable. + +`canonical` must be `let mut canonical = ...` — if it's currently bound immutably, change to `let mut`. + +Apply the same pattern to all three flows (markdown, image, pdf — verify via grep). + +- [ ] **Step 3: Build + run existing ingest tests** + +Run: `cargo test -p kebab-app --test '*'` +Expected: all pass. 
Existing tests don't assert on the new fields but the pipeline must still write a valid document. + +- [ ] **Step 4: Add a test that asserts the stamps land in the DB** + +Find a kebab-app integration test that does an end-to-end ingest (e.g. `crates/kebab-app/tests/ingest_smoke.rs` or similar). Append: + +```rust +#[test] +fn ingest_stamps_chunker_and_embedding_versions_on_document() { + let (config, _tmp) = test_config_with_md_fixture(); + let report = kebab_app::ingest_with_config( + config.clone(), + SourceScope::workspace_root(&config), // or whatever helper exists + false, // summary_only + ).unwrap(); + assert_eq!(report.new, 1); + + let app = kebab_app::App::open_with_config(config).unwrap(); + let doc_id = app.store.list_documents(&Default::default()).unwrap()[0].doc_id.clone(); + let doc = app.store.get_document(&doc_id).unwrap().expect("doc exists"); + assert!(doc.last_chunker_version.is_some(), "chunker version stamped"); + // Embedding version is Some only if the test config enables an embedder. + // If `test_config_with_md_fixture` sets up fastembed, assert Some; otherwise None. +} +``` + +Adapt to the actual test infrastructure in the file. The key assertion is that `last_chunker_version` is `Some(_)` after a normal ingest. + +- [ ] **Step 5: Commit** + +```bash +git add -u +git commit -m "feat(kebab-app): p9-fb-23 task 5 — stamp chunker + embedding versions on CanonicalDocument before put_document + +Co-Authored-By: Claude Opus 4.7 (1M context) " +``` + +--- + +### Task 6: `IngestOpts` struct + plumb through ingest fn chain + +**Files:** +- Modify: `crates/kebab-app/src/lib.rs` + +Introduce `IngestOpts { progress, cancel, force_reingest }` matching the `AskOpts` pattern from p9-fb-15. The bottom fn (`ingest_with_config_cancellable`) currently takes 5 positional args; rename it to `ingest_with_config_opts(config, scope, summary_only, opts: IngestOpts)` and let the higher-level wrappers build `IngestOpts` from their positional params. 
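The options-struct shape described above can be sketched in isolation. This is a minimal illustration, not the real `kebab-app` code: `ingest_opts`, `ingest_cancellable`, the `&[&str]` asset list, and the `Arc<AtomicBool>` cancel type are all assumptions made for the example, and the progress sink is left out.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the plan's `IngestOpts`.
/// The field types here are assumptions, not the crate's real types.
#[derive(Default)]
pub struct IngestOpts {
    /// Cooperative cancel flag; `None` means the run cannot be cancelled.
    pub cancel: Option<Arc<AtomicBool>>,
    /// Carried through to show the additive field; not acted on in this sketch.
    pub force_reingest: bool,
}

/// Bottom entry point: returns how many assets were visited before cancel.
pub fn ingest_opts(assets: &[&str], opts: IngestOpts) -> usize {
    // Same cancel-check closure shape as in the plan's Step 2.
    let cancelled = || {
        opts.cancel
            .as_ref()
            .map(|c| c.load(Ordering::Relaxed))
            .unwrap_or(false)
    };
    let mut processed = 0;
    for _asset in assets {
        if cancelled() {
            break;
        }
        processed += 1;
    }
    processed
}

/// Legacy positional wrapper: builds the options struct and delegates,
/// so existing callers keep compiling when new fields are added.
pub fn ingest_cancellable(assets: &[&str], cancel: Option<Arc<AtomicBool>>) -> usize {
    ingest_opts(
        assets,
        IngestOpts {
            cancel,
            force_reingest: false,
        },
    )
}
```

The point of the wrapper is that a future flag only touches the struct and the bottom function; positional callers stay unchanged.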
+ +- [ ] **Step 1: Define `IngestOpts`** + +Open `crates/kebab-app/src/lib.rs`. Near the top (with other public `pub struct *Opts`), add: + +```rust +/// p9-fb-23: optional per-call ingest controls. Kept as a struct (vs. +/// a growing positional arg list) so future flags (e.g. `dry_run`, +/// per-asset `concurrency`) land additively without churning every +/// caller. Mirrors the `AskOpts` pattern from p9-fb-15. +#[derive(Default)] +pub struct IngestOpts { + /// Streaming progress sink. `None` suppresses emission entirely. + pub progress: Option>, + /// Cooperative cancel token. `None` = uncancellable. + pub cancel: Option>, + /// p9-fb-23: when `true`, the per-asset early-skip block is bypassed + /// — every asset is re-parsed / re-chunked / re-embedded as if the + /// DB were empty. Default `false` preserves the auto-skip path. + pub force_reingest: bool, +} +``` + +- [ ] **Step 2: Rename the bottom fn to `ingest_with_config_opts`** + +Find `pub fn ingest_with_config_cancellable` (around line 248). Rename to `ingest_with_config_opts`, keeping the body, but change the signature from positional `progress` + `cancel` to a single `opts: IngestOpts`: + +```rust +#[doc(hidden)] +pub fn ingest_with_config_opts( + config: kebab_config::Config, + scope: SourceScope, + summary_only: bool, + opts: IngestOpts, +) -> anyhow::Result { + let progress = opts.progress.as_ref(); + let cancelled = || { + opts.cancel + .as_ref() + .map(|c| c.load(std::sync::atomic::Ordering::Relaxed)) + .unwrap_or(false) + }; + // ... rest of the body unchanged. `opts.force_reingest` will be + // consumed by Task 7's skip-detection block. + let _ = opts.force_reingest; // silence unused warning until Task 7 wires it up + // ... existing logic ... +} +``` + +- [ ] **Step 3: Convert the three wrapper fns to build `IngestOpts`** + +`ingest_with_config_cancellable` (the OLD name) must stay so external callers (test fixtures, possibly other code) keep compiling. 
Re-introduce it as a thin wrapper: + +```rust +#[doc(hidden)] +pub fn ingest_with_config_cancellable( + config: kebab_config::Config, + scope: SourceScope, + summary_only: bool, + progress: Option>, + cancel: Option>, +) -> anyhow::Result { + ingest_with_config_opts( + config, + scope, + summary_only, + IngestOpts { + progress, + cancel, + force_reingest: false, + }, + ) +} +``` + +Same shape for `ingest_with_config_progress` and `ingest_with_config` — they all collapse into wrappers that build a default-ish `IngestOpts`. + +- [ ] **Step 4: Build + test** + +Run: `cargo build -p kebab-app` +Expected: clean. + +Run: `cargo test -p kebab-app` +Expected: existing tests pass (no behaviour change yet). + +- [ ] **Step 5: Add a test that uses `ingest_with_config_opts` directly** + +Append to an existing kebab-app test file: + +```rust +#[test] +fn ingest_with_config_opts_default_matches_legacy_behaviour() { + let (config, _tmp) = test_config_with_md_fixture(); // existing helper + let report = kebab_app::ingest_with_config_opts( + config, + SourceScope::workspace_root(&_tmp), // adapt + false, + kebab_app::IngestOpts::default(), + ).unwrap(); + assert!(report.new >= 1); + assert_eq!(report.errors, 0); +} +``` + +- [ ] **Step 6: Commit** + +```bash +git add -u +git commit -m "refactor(kebab-app): p9-fb-23 task 6 — IngestOpts struct + ingest_with_config_opts entry + +Co-Authored-By: Claude Opus 4.7 (1M context) " +``` + +--- + +### Task 7: Early-skip detection in ingest pipeline + +**Files:** +- Modify: `crates/kebab-app/src/lib.rs` (asset processing loop in `ingest_with_config_opts`) + +This is the load-bearing change — the skip path that closes the user's feedback. TDD: write the failing test first, then implement. + +- [ ] **Step 1: Write the failing integration test** + +Find or create `crates/kebab-app/tests/incremental_ingest.rs`: + +```rust +//! p9-fb-23: incremental ingest — skip parse/chunk/embed when nothing +//! has changed. 
+ +use kebab_app::{AggregateCounts, IngestOpts, ingest_with_config_opts}; +// ... existing test helper imports ... + +#[test] +fn second_ingest_of_unchanged_corpus_marks_all_unchanged() { + let (config, tmp_dir) = test_config_with_md_fixture(); // 1 markdown file + let scope = SourceScope::workspace_root(&tmp_dir); + + // First ingest — populates the DB. + let first = ingest_with_config_opts( + config.clone(), + scope.clone(), + false, + IngestOpts::default(), + ).unwrap(); + assert_eq!(first.new, 1); + assert_eq!(first.unchanged, 0); + + // Second ingest — same file, same versions → all unchanged. + let second = ingest_with_config_opts( + config, + scope, + false, + IngestOpts::default(), + ).unwrap(); + assert_eq!(second.new, 0, "no new docs on re-ingest"); + assert_eq!(second.updated, 0, "nothing should be marked updated"); + assert_eq!(second.unchanged, 1, "the one doc must be Unchanged"); +} + +#[test] +fn force_reingest_bypasses_skip() { + let (config, tmp_dir) = test_config_with_md_fixture(); + let scope = SourceScope::workspace_root(&tmp_dir); + + let _first = ingest_with_config_opts( + config.clone(), + scope.clone(), + false, + IngestOpts::default(), + ).unwrap(); + + let second = ingest_with_config_opts( + config, + scope, + false, + IngestOpts { + force_reingest: true, + ..Default::default() + }, + ).unwrap(); + assert_eq!(second.unchanged, 0, "force_reingest must bypass skip"); + assert_eq!(second.updated, 1, "doc must be re-processed as Updated"); +} +``` + +- [ ] **Step 2: Run — expect failure** + +Run: `cargo test -p kebab-app incremental_ingest -- --nocapture` +Expected: both tests fail. Second ingest currently produces `updated=1, unchanged=0` (no skip path). + +- [ ] **Step 3: Implement the skip block** + +In `crates/kebab-app/src/lib.rs::ingest_with_config_opts`, find the asset processing loop (the loop that iterates over `connector.scan()` results and per-asset does parse / put_asset / put_document / etc.). 
 At the TOP of each iteration — AFTER the asset has been scanned (so `asset_blake3` / `RawAsset` is in scope) but BEFORE parse / chunk / embed: + +```rust +// p9-fb-23: incremental ingest skip path. If `force_reingest` is +// false AND the existing record's checksum + parser_version + +// chunker_version + embedding_version all match, this asset doesn't +// need to be re-processed. We still emit an AssetFinished event so +// the progress consumer sees the asset accounted for, and we bump +// the unchanged counter for the final report. +if !opts.force_reingest { + if let Ok(Some(existing_asset)) = app.store.get_asset_by_workspace_path(&raw_asset.workspace_path) { + if existing_asset.checksum == raw_asset.checksum { + // Asset bytes match. Check the doc_id (parser_version aware) + version stamps. + let candidate_doc_id = kebab_core::ids::id_for_doc( + &raw_asset.workspace_path, + &raw_asset.asset_id, + &current_parser_version, // bind near scan; for md = KEBAB_PARSE_MD_VERSION + ); + if let Ok(Some(existing_doc)) = app.store.get_document(&candidate_doc_id) { + let chunker_match = existing_doc.last_chunker_version.as_ref() + == Some(&current_chunker_version); + let embedder_match = existing_doc.last_embedding_version + == current_embedding_version; + if chunker_match && embedder_match { + // SKIP path.
+ crate::ingest_progress::emit( + progress, + crate::ingest_progress::IngestEvent::AssetFinished { + idx: asset_idx, + total, + result: kebab_core::IngestItemKind::Unchanged, + chunks: 0, + }, + ); + aggregate.unchanged += 1; + aggregate.scanned += 1; + if !summary_only { + items.push(kebab_core::IngestItem { + kind: kebab_core::IngestItemKind::Unchanged, + doc_id: Some(candidate_doc_id), + doc_path: raw_asset.workspace_path.clone(), + asset_id: Some(raw_asset.asset_id.clone()), + byte_len: Some(raw_asset.byte_len), + block_count: None, + chunk_count: None, + parser_version: Some(current_parser_version.clone()), + chunker_version: existing_doc.last_chunker_version.clone(), + warnings: Vec::new(), + error: None, + }); + } + continue; // next asset + } + } + } + } +} +``` + +The variable names (`raw_asset`, `current_parser_version`, `current_chunker_version`, `current_embedding_version`, `aggregate`, `items`, `asset_idx`, `total`, `progress`, `summary_only`) MUST match what's already in scope at the insertion point — read 50 lines above to map them. The branch logic stays the same. + +`current_chunker_version` and `current_embedding_version` may not exist in the outer scope yet (they're computed per-media inside the loop today). If the chunker isn't constructed until after parse, restructure: construct the chunker EARLIER (it's stateless — just a version string + policy), bind `current_chunker_version` once at top of loop iter, then reuse for both the skip check and the post-parse stamp from Task 5. The embedder reference is at-app-construction scope (`embedder = ...` near `App::open_with_config`), trivially available. + +- [ ] **Step 4: Run the test — should pass** + +Run: `cargo test -p kebab-app incremental_ingest -- --nocapture` +Expected: both tests pass. + +- [ ] **Step 5: Run full kebab-app suite** + +Run: `cargo test -p kebab-app` +Expected: all pass. Existing single-ingest tests are unaffected (they all start from empty DB → no skip). 
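For reference, the skip decision from Step 3 can be read as a pure predicate over the stored stamps and the current run's inputs. The sketch below is an assumption-laden simplification: `StoredStamps`, `CurrentInputs`, and `is_unchanged` are hypothetical names, versions are plain strings, and the parser version is compared explicitly here, whereas the plan folds it into the `doc_id` lookup.

```rust
/// Hypothetical, simplified view of what the DB holds for one document.
#[derive(Clone, Debug, PartialEq)]
pub struct StoredStamps {
    pub checksum: String,
    pub parser_version: String,
    pub last_chunker_version: Option<String>,
    pub last_embedding_version: Option<String>,
}

/// Hypothetical, simplified view of the inputs computed for the current run.
pub struct CurrentInputs {
    pub checksum: String,
    pub parser_version: String,
    pub chunker_version: String,
    /// `None` when no embedder is configured.
    pub embedding_version: Option<String>,
}

/// Returns true only when the asset can take the `Unchanged` path.
pub fn is_unchanged(
    stored: Option<&StoredStamps>,
    current: &CurrentInputs,
    force_reingest: bool,
) -> bool {
    if force_reingest {
        return false;
    }
    match stored {
        // No existing record: must go through the normal ingest path.
        None => false,
        Some(s) => {
            s.checksum == current.checksum
                && s.parser_version == current.parser_version
                // A NULL chunker stamp (pre-V006 row) never matches.
                && s.last_chunker_version.as_deref() == Some(current.chunker_version.as_str())
                // Option equality: `None == None` counts as a match.
                && s.last_embedding_version == current.embedding_version
        }
    }
}
```

Keeping the decision in one function like this makes the asymmetry visible: a missing chunker stamp forces a re-ingest, while a missing embedding stamp matches only when the embedder is also unconfigured.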
+ +- [ ] **Step 6: Clippy** + +Run: `cargo clippy -p kebab-app --all-targets -- -D warnings` +Expected: clean. + +- [ ] **Step 7: Commit** + +```bash +git add -u +git commit -m "feat(kebab-app): p9-fb-23 task 7 — early-skip Unchanged path in ingest + +Co-Authored-By: Claude Opus 4.7 (1M context) " +``` + +--- + +### Task 8: CLI `--force-reingest` flag + +**Files:** +- Modify: `crates/kebab-cli/src/...` (find via `grep -rn "ingest_with_config\|ingest_with_config_opts" crates/kebab-cli/`) + +- [ ] **Step 1: Locate the ingest CLI subcommand** + +Run: + +```bash +grep -rn "fn handle_ingest\|fn ingest_command\|kebab_app::ingest_with_config" crates/kebab-cli/src/ +``` + +Find the function that dispatches `kebab ingest`. The CLI uses `clap` derive — there will be a struct annotated with `#[derive(Args)]` or similar holding flags like `--summary-only`. + +- [ ] **Step 2: Add the flag to the clap struct** + +Append to the existing `#[derive(Args)]` struct (replace `IngestArgs` with the actual name): + +```rust + /// p9-fb-23: bypass the per-asset early-skip path. Every asset is + /// re-parsed, re-chunked, re-embedded, and re-upserted regardless + /// of whether the DB already has a record with matching checksum + /// and version stamps. Useful after manual schema bumps or when + /// the user suspects the corpus is in a stale state. + #[arg(long)] + pub force_reingest: bool, +``` + +- [ ] **Step 3: Plumb the flag into `IngestOpts`** + +In the dispatcher, find the call to `kebab_app::ingest_with_config_*` and switch to `ingest_with_config_opts`: + +```rust + let opts = kebab_app::IngestOpts { + progress: Some(progress_tx), + cancel: Some(cancel_token), + force_reingest: args.force_reingest, + }; + let report = kebab_app::ingest_with_config_opts(config, scope, summary_only, opts)?; +``` + +- [ ] **Step 4: Run CLI build + smoke test** + +Run: `cargo build -p kebab-cli` +Expected: clean. + +Run: `cargo test -p kebab-cli` +Expected: pass. 
The CLI test suite likely doesn't have a `--force-reingest` integration test — that's covered in `kebab-app::tests::incremental_ingest` from Task 7. + +- [ ] **Step 5: Commit** + +```bash +git add -u +git commit -m "feat(kebab-cli): p9-fb-23 task 8 — --force-reingest flag + +Co-Authored-By: Claude Opus 4.7 (1M context) " +``` + +--- + +### Task 9: TUI status_line surfaces `unchanged` + +**Files:** +- Modify: `crates/kebab-tui/src/ingest_progress.rs` (`status_line` fn) + +- [ ] **Step 1: Find `status_line`** + +Open `crates/kebab-tui/src/ingest_progress.rs`. Locate `pub fn status_line` (around line 170). It currently produces: + +``` +✓ ingest: 100 docs (5 new, 3 updated, 2 skipped), 142 chunks indexed in 12s +``` + +- [ ] **Step 2: Update the format** + +Replace the `final` (terminal) text branch to include `unchanged`: + +```rust + return format!( + "✓ ingest: {} docs ({} new, {} updated, {} unchanged, {} skipped), {} chunks indexed in {}s", + state.counts.scanned, + state.counts.new, + state.counts.updated, + state.counts.unchanged, + state.counts.skipped, + state.counts.chunks_indexed, + secs, + ); +``` + +Also update the `aborted` branch to include `unchanged`: + +```rust + return format!( + "✗ ingest aborted at {}/{} after {}s (new={} updated={} unchanged={} skipped={} errors={})", + state.counts.scanned.saturating_sub(state.counts.errors), + state.counts.scanned, + secs, + state.counts.new, + state.counts.updated, + state.counts.unchanged, + state.counts.skipped, + state.counts.errors, + ); +``` + +The in-flight (mid-progress) branch doesn't need changes — it shows per-asset granularity. + +- [ ] **Step 3: Update tests** + +Find the existing test for `status_line` (likely in the same file under `#[cfg(test)] mod tests`). Add an `unchanged` field to the test's `AggregateCounts` literal (Task 1 already made `unchanged: u32` mandatory) and update the expected string. 
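As a sanity check on the new final-line format from Step 2, the formatting can be exercised on its own. This is a sketch under assumptions: `Counts` and `final_line` are hypothetical stand-ins for `AggregateCounts` and the terminal branch of `status_line`, reduced to the fields the format string uses.

```rust
/// Hypothetical stand-in for the counters the final line reads.
#[derive(Default)]
pub struct Counts {
    pub scanned: u32,
    pub new: u32,
    pub updated: u32,
    pub unchanged: u32,
    pub skipped: u32,
    pub chunks_indexed: u32,
}

/// Formats the terminal summary line, including the `unchanged` count.
pub fn final_line(c: &Counts, secs: u64) -> String {
    format!(
        "✓ ingest: {} docs ({} new, {} updated, {} unchanged, {} skipped), {} chunks indexed in {}s",
        c.scanned, c.new, c.updated, c.unchanged, c.skipped, c.chunks_indexed, secs,
    )
}
```

An expected-string assertion of this kind is roughly what the updated `status_line` test in Step 3 would need to carry.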
+ +- [ ] **Step 4: Run + commit** + +```bash +cargo test -p kebab-tui --lib ingest_progress +git add -u +git commit -m "feat(kebab-tui): p9-fb-23 task 9 — status_line surfaces unchanged count + +Co-Authored-By: Claude Opus 4.7 (1M context) " +``` + +--- + +### Task 10: Workspace verification + docs sync + +**Files:** +- Modify: `README.md`, `HANDOFF.md`, `tasks/HOTFIXES.md`, `tasks/INDEX.md` +- Create: `tasks/p9/p9-fb-23-incremental-ingest.md` + +- [ ] **Step 1: Full workspace test + clippy** + +Run: `cargo test --workspace --no-fail-fast -j 1` +Expected: All previously-passing tests still pass. New tests (Tasks 3, 4, 5, 7) added. Total should be ~720 + ~6-8 new = 726+. Zero failures. + +Run: `cargo clippy --workspace --all-targets -- -D warnings` +Expected: clean. + +If anything fails, report `BLOCKED` — earlier task implementer must fix. + +- [ ] **Step 2: README** + +Open `README.md`. Find the `kebab ingest` row in the command table (or the prose section that describes ingest). APPEND inside the existing cell: + +``` +**Incremental** (p9-fb-23): 두 번째 이후의 ingest 는 변하지 않은 doc (blake3 + parser/chunker/embedder version 모두 동일) 의 parse/chunk/embed/vector upsert 를 자동 스킵. final summary 에 `N unchanged` 카운트 표시. `--force-reingest` 로 skip 무시 강제 재처리. +``` + +- [ ] **Step 3: HANDOFF.md** + +Open `HANDOFF.md`. Find the `## 머지 후 발견된 버그 / 결정 (요약)` section. Add a new entry directly above the most recent (`p9-fb-24`) row: + +``` +- **2026-05-04 P9 post-도그푸딩 (p9-fb-23)** — Incremental ingest. 사용자 도그푸딩 피드백: 변하지 않은 문서는 다시 ingest 하지 않기. blake3 checksum + parser_version + chunker_version + embedding_version 4개 input 이 모두 일치할 때 parse/chunk/embed/vector upsert 모두 회피. SQLite V006 마이그레이션 — `documents` 에 `last_chunker_version` + `last_embedding_version` 컬럼 추가. 신규 `IngestItemKind::Unchanged` variant + `IngestReport.unchanged` + `AggregateCounts.unchanged` (wire schema additive). `IngestOpts { progress, cancel, force_reingest }` struct 도입 — `AskOpts` 패턴. `--force-reingest` CLI flag 로 skip 우회. 
비용 dominator (fastembed) 가 변경된 / 새 doc 에만 발생. spec: `tasks/p9/p9-fb-23-incremental-ingest.md`. HOTFIXES `2026-05-04 — p9-fb-23` 항목이 version cascade 명시 동작의 source of truth. +``` + +- [ ] **Step 4: HOTFIXES.md** + +Open `tasks/HOTFIXES.md`. Add a new section directly above `## 2026-05-04 — p9-fb-24`: + +```markdown +## 2026-05-04 — p9-fb-23 (post-dogfooding): Incremental ingest + +**Source feedback**: 사용자 도그푸딩 2026-05-04 — "새 문서들이 폴더에 추가되면 ingest 시 변하지 않은 문서는 다시 ingest 하지 않고 변하거나 새로 추가된 문서만 처리하고 싶어." + +**Live binding 변경**: + +- SQLite V006 migration — `documents` 에 `last_chunker_version` + `last_embedding_version` TEXT (nullable) 추가. 기존 row 는 NULL → 첫 번째 ingest 시 항상 mismatch → 강제 재처리 (안전 default). +- `kebab-core::IngestItemKind::Unchanged` variant 신규 (기존 `Skipped` 와 의미 분리: `Skipped` = media-type 필터, `Unchanged` = 모든 versions match). +- `IngestReport.unchanged: u32` + `AggregateCounts.unchanged: u32` 신규. wire schema `ingest_report.v1` 에 `unchanged` 필드 additive (v1 호환 유지). +- `kebab-app::IngestOpts { progress, cancel, force_reingest }` struct 신규 — `AskOpts` 패턴. 기존 `ingest_with_config_cancellable` 등 wrapper 보존, 신규 `ingest_with_config_opts` 가 IngestOpts 받음. +- `kebab-app::ingest_with_config_opts` asset 루프에 early-skip 블록: `force_reingest=false` + 4 조건 (asset_blake3 일치 + doc_id 존재 + last_chunker_version 일치 + last_embedding_version 일치) 모두 성립 시 `IngestEvent::AssetFinished{result: Unchanged}` emit + `aggregate.unchanged += 1` + `continue` (parse/chunk/embed/vector upsert 모두 회피). +- 정상 path 끝에서 `CanonicalDocument.last_chunker_version` + `last_embedding_version` 을 현 active version 으로 stamp. +- `kebab-cli` 에 `--force-reingest` flag 추가 (skip 우회 강제 재처리). +- `kebab-tui::ingest_progress::status_line` final / aborted 라인 모두 `unchanged=N` 노출. + +**Spec contract impact**: design §9 versioning cascade 의 명시적 동작 추가 — parser/chunker/embedder version bump 시 다음 ingest 가 자동으로 모든 doc 을 `updated` 로 처리. 
기존엔 silently 새 version 으로 overwrite (idempotent UPSERT) 였으나 본 변경으로 explicit refresh + 비용 회피 모두 보장. design §3.x IngestReport / §2.4a IngestEvent 에 `Unchanged` variant 추가 (additive, wire v1 호환). + +**Tests added**: 약 8 신규 (incremental_ingest 통합 2: unchanged path / force_reingest, sqlite store 단위 4: round-trip version stamps + None stamps + get_asset_by_workspace_path roundtrip + missing path None, app ingest 통합 1: 첫 ingest 가 stamps 남김, kebab-app IngestOpts default 동작 확인 1). 기존 ~720 워크스페이스 테스트 무수정 통과. + +**Known limitation (deferred)**: + +- Mtime-based pre-hash skip (파일 읽기 자체 회피) 미구현 — blake3 streaming 은 매 scan 마다 무조건 발생. 큰 corpus 에서는 추가 최적화 가능. +- Watch-mode (실시간 file change detection) 후속 task. +- Stale skip risk: 사용자가 외부에서 embedder 모델 swap 후 config 의 `models.embedding.id` 갱신 안 하면 last_embedding_version 매치 → silently skip. doctor 명령이 mismatch 감지 → 권고하는 후속 task 가능. +``` + +- [ ] **Step 5: INDEX.md** + +Open `tasks/INDEX.md`. Find the `p9-fb-24` row and append below: + +``` + - [p9-fb-23 incremental ingest (post-도그푸딩)](p9/p9-fb-23-incremental-ingest.md) +``` + +- [ ] **Step 6: Per-task spec file** + +Create `tasks/p9/p9-fb-23-incremental-ingest.md`: + +```markdown +--- +phase: P9 +component: kebab-app +task_id: p9-fb-23 +title: "Incremental ingest — skip unchanged docs (post-merge dogfooding)" +status: completed +depends_on: [p9-fb-03, p9-fb-07] +unblocks: [] +contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md +contract_sections: [§9 Versioning cascade, §2.4a IngestEvent, §3.x IngestReport] +source_feedback: 사용자 도그푸딩 2026-05-04 — 변하지 않은 문서 재처리 회피 요청. +--- + +# p9-fb-23 — Incremental ingest + +상세 설계: `docs/superpowers/specs/2026-05-04-p9-fb-23-incremental-ingest-design.md`. +구현 계획: `docs/superpowers/plans/2026-05-04-p9-fb-23-incremental-ingest.md`. + +## Goal + +`kebab ingest` 가 변경/신규 doc 만 처리. 변하지 않은 doc 은 parse/chunk/embed/vector upsert 모두 회피. + +## Behavior contract + +Skip 조건 4 모두 만족: +1. 신규 blake3 == `assets.checksum`. +2. 
`documents.parser_version` == 현 active. +3. `documents.last_chunker_version` == 현 active. +4. `documents.last_embedding_version` == 현 active (None == None 도 match). + +위 중 하나라도 mismatch → 정상 path. parse/chunk/embed/vector upsert 모두. + +`IngestOpts.force_reingest=true` → skip 무시 강제 재처리. + +## Tests + +- 통합: 두 번째 ingest 가 unchanged 1 / new 0 / updated 0. +- 통합: `--force-reingest` 가 skip 우회. +- 단위: V006 migration, SQLite put/get_document roundtrip 신규 컬럼, get_asset_by_workspace_path roundtrip. +- 통합: 첫 ingest 가 chunker/embedding version stamp. + +## Risks / notes + +- mtime pre-hash skip 미구현 (YAGNI, 후속 가능). +- 외부 embedder model swap 후 config 갱신 안 하면 silently skip — doctor 명령이 mismatch 감지하는 후속 task 가능. + +Live deviations 반영 위치: `tasks/HOTFIXES.md` `2026-05-04 — p9-fb-23` 항목. +``` + +- [ ] **Step 7: Final commit** + +```bash +git add README.md HANDOFF.md tasks/HOTFIXES.md tasks/INDEX.md tasks/p9/p9-fb-23-incremental-ingest.md +git commit -m "docs(p9-fb-23): README + HANDOFF + HOTFIXES + INDEX + per-task spec + +Co-Authored-By: Claude Opus 4.7 (1M context) " +``` + +--- + +## Self-Review Notes (writer) + +**Spec coverage:** +- Skip 4 conditions → Task 7 implementation matches. Each condition mapped to a specific check (`existing_asset.checksum == raw_asset.checksum`, `id_for_doc` includes parser_version, version-stamp comparisons). +- V006 migration → Task 3 (also adds put/get round-trip). +- `IngestItemKind::Unchanged` → Task 1. +- `IngestReport.unchanged` + `AggregateCounts.unchanged` → Task 1. +- `IngestEvent` event for unchanged → Task 7 emits `AssetFinished { result: Unchanged }`. +- `CanonicalDocument` extension → Task 2 (mechanical) + Task 5 (stamp on write). +- `get_asset_by_workspace_path` → Task 4. +- `IngestOpts.force_reingest` + `--force-reingest` flag → Tasks 6 + 8. +- Wire schema additive → Task 1 step 4. +- TUI status_line → Task 9. +- Docs sync → Task 10. 
+ +**Type / API consistency:** +- `IngestOpts` struct introduced in Task 6, consumed in Tasks 7 + 8 (both use the same field names: `progress`, `cancel`, `force_reingest`). +- `CanonicalDocument` field names `last_chunker_version` + `last_embedding_version` (Tasks 2, 3, 5, 7) — consistent. +- `DocumentStore::get_asset_by_workspace_path(&WorkspacePath) -> Option` (Task 4) — consumed in Task 7. +- `AggregateCounts.unchanged` (Task 1) — consumed in Task 7 (`aggregate.unchanged += 1`) and Task 9 (`state.counts.unchanged`). + +**Placeholder scan:** No `TBD` / `TODO`. Each step has full code or explicit "find via grep" guidance with the exact grep command. + +**Risks documented:** +- Variable name uncertainty in Task 7 (real ingest loop has bound names that differ slightly from the plan's `raw_asset` / `total` / etc.). Plan flags "read 50 lines above" so the implementer adapts to actual context. +- Task 5 may need a chunker construction reordering (currently chunker built post-parse) — plan flags this explicitly. +- Task 8's CLI-test-coverage gap intentional — kebab-app integration test (Task 7) carries the meaningful coverage; CLI test would need a TempDir + tempfile fixture which is heavier than warranted. 
From aa2a6ea7fc7f8fb453b5fdd007c4de082b3345f2 Mon Sep 17 00:00:00 2001 From: altair823 Date: Mon, 4 May 2026 17:43:52 +0000 Subject: [PATCH 03/15] =?UTF-8?q?feat(kebab-core):=20p9-fb-23=20task=201?= =?UTF-8?q?=20=E2=80=94=20IngestItemKind::Unchanged=20+=20IngestReport.unc?= =?UTF-8?q?hanged?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Opus 4.7 (1M context) --- crates/kebab-app/src/ingest_progress.rs | 4 ++++ crates/kebab-app/src/lib.rs | 6 ++++++ crates/kebab-core/src/ingest.rs | 10 ++++++++++ crates/kebab-tui/src/ingest_progress.rs | 3 +++ docs/wire-schema/v1/ingest_report.schema.json | 6 ++++++ 5 files changed, 29 insertions(+) diff --git a/crates/kebab-app/src/ingest_progress.rs b/crates/kebab-app/src/ingest_progress.rs index 06c959e..1ff6e45 100644 --- a/crates/kebab-app/src/ingest_progress.rs +++ b/crates/kebab-app/src/ingest_progress.rs @@ -28,6 +28,10 @@ pub struct AggregateCounts { pub new: u32, pub updated: u32, pub skipped: u32, + /// p9-fb-23: assets whose checksum + all version inputs matched the + /// existing DB record — parse / chunk / embed / vector upsert all + /// skipped. + pub unchanged: u32, pub errors: u32, pub chunks_indexed: u32, pub embeddings_indexed: u32, diff --git a/crates/kebab-app/src/lib.rs b/crates/kebab-app/src/lib.rs index 87cfa01..4220af9 100644 --- a/crates/kebab-app/src/lib.rs +++ b/crates/kebab-app/src/lib.rs @@ -347,6 +347,7 @@ pub fn ingest_with_config_cancellable( let mut new_count: u32 = 0; let mut updated_count: u32 = 0; let mut skipped_count: u32 = 0; + let mut unchanged_count: u32 = 0; let mut error_count: u32 = 0; // Aggregate counts surfaced into `ingest_runs` (and tracing). 
Not // exposed on `IngestReport` today — `kebab_core::IngestReport` is a @@ -445,6 +446,9 @@ pub fn ingest_with_config_cancellable( kebab_core::IngestItemKind::Skipped => { skipped_count = skipped_count.saturating_add(1) } + kebab_core::IngestItemKind::Unchanged => { + unchanged_count = unchanged_count.saturating_add(1) + } kebab_core::IngestItemKind::Error => { error_count = error_count.saturating_add(1) } @@ -585,6 +589,7 @@ pub fn ingest_with_config_cancellable( new: new_count, updated: updated_count, skipped: skipped_count, + unchanged: unchanged_count, errors: error_count, chunks_indexed, embeddings_indexed, @@ -626,6 +631,7 @@ pub fn ingest_with_config_cancellable( new: new_count, updated: updated_count, skipped: skipped_count, + unchanged: unchanged_count, errors: error_count, duration_ms, items: if summary_only { None } else { Some(items) }, diff --git a/crates/kebab-core/src/ingest.rs b/crates/kebab-core/src/ingest.rs index 7636a95..02e1e69 100644 --- a/crates/kebab-core/src/ingest.rs +++ b/crates/kebab-core/src/ingest.rs @@ -13,7 +13,11 @@ pub struct IngestReport { pub scanned: u32, pub new: u32, pub updated: u32, + /// Media-type / source filter (`kb://`, unsupported types). pub skipped: u32, + /// p9-fb-23: assets whose checksum + all version inputs matched — + /// parse / chunk / embed / vector upsert all skipped. + pub unchanged: u32, pub errors: u32, pub duration_ms: u32, /// `None` ↔ wire `items: null` (`--summary-only`). @@ -40,6 +44,12 @@ pub struct IngestItem { pub enum IngestItemKind { New, Updated, + /// Media-type filter / kb:// URI / non-supported source — never made + /// it into the parse step. Skipped, + /// p9-fb-23: blake3 checksum + parser_version + chunker_version + + /// embedding_version all matched the existing record. Parse / chunk + /// / embed / vector upsert all skipped. 
+ Unchanged, Error, } diff --git a/crates/kebab-tui/src/ingest_progress.rs b/crates/kebab-tui/src/ingest_progress.rs index c120a8a..974a13d 100644 --- a/crates/kebab-tui/src/ingest_progress.rs +++ b/crates/kebab-tui/src/ingest_progress.rs @@ -131,6 +131,9 @@ fn apply_event(state: &mut IngestState, event: IngestEvent) { kebab_core::IngestItemKind::Skipped => { state.counts.skipped = state.counts.skipped.saturating_add(1); } + kebab_core::IngestItemKind::Unchanged => { + state.counts.unchanged = state.counts.unchanged.saturating_add(1); + } kebab_core::IngestItemKind::Error => { state.counts.errors = state.counts.errors.saturating_add(1); } diff --git a/docs/wire-schema/v1/ingest_report.schema.json b/docs/wire-schema/v1/ingest_report.schema.json index be25ad0..4cfb394 100644 --- a/docs/wire-schema/v1/ingest_report.schema.json +++ b/docs/wire-schema/v1/ingest_report.schema.json @@ -11,6 +11,7 @@ "new", "updated", "skipped", + "unchanged", "errors", "duration_ms" ], @@ -21,6 +22,11 @@ "new": { "type": "integer", "minimum": 0 }, "updated": { "type": "integer", "minimum": 0 }, "skipped": { "type": "integer", "minimum": 0 }, + "unchanged": { + "type": "integer", + "minimum": 0, + "description": "p9-fb-23: assets whose checksum + parser_version + chunker_version + embedding_version all matched the existing record. Parse / chunk / embed / vector upsert all skipped." 
+ }, "errors": { "type": "integer", "minimum": 0 }, "duration_ms": { "type": "integer", "minimum": 0 }, "items": { "type": ["array", "null"] } From 0684b3ad66a6ec4ba01ba8c4818f6e936ba22a53 Mon Sep 17 00:00:00 2001 From: altair823 Date: Mon, 4 May 2026 17:47:13 +0000 Subject: [PATCH 04/15] review(p9-fb-23-task1): fix missed IngestReport construction sites + snapshot reviewer-flagged: aa2a6ea claimed build clean but missed: - crates/kebab-store-sqlite/tests/ingest_report_snapshot.rs (test fixture) - crates/kebab-cli/src/wire.rs (test fixture) - crates/kebab-store-sqlite/snapshots/ingest_report.snapshot.json (snapshot) All three add `unchanged: 0` (or `\"unchanged\": 0`) to match the new IngestReport.unchanged field. cargo clippy --workspace --all-targets -- -D warnings now clean. Co-Authored-By: Claude Opus 4.7 (1M context) --- crates/kebab-cli/src/wire.rs | 1 + crates/kebab-store-sqlite/snapshots/ingest_report.snapshot.json | 1 + crates/kebab-store-sqlite/tests/ingest_report_snapshot.rs | 1 + 3 files changed, 3 insertions(+) diff --git a/crates/kebab-cli/src/wire.rs b/crates/kebab-cli/src/wire.rs index 2937588..ed503fe 100644 --- a/crates/kebab-cli/src/wire.rs +++ b/crates/kebab-cli/src/wire.rs @@ -168,6 +168,7 @@ mod tests { new: 0, updated: 0, skipped: 0, + unchanged: 0, errors: 0, duration_ms: 0, items: None, diff --git a/crates/kebab-store-sqlite/snapshots/ingest_report.snapshot.json b/crates/kebab-store-sqlite/snapshots/ingest_report.snapshot.json index 429e633..6cb042d 100644 --- a/crates/kebab-store-sqlite/snapshots/ingest_report.snapshot.json +++ b/crates/kebab-store-sqlite/snapshots/ingest_report.snapshot.json @@ -43,5 +43,6 @@ "root": "/home/u/KB" }, "skipped": 0, + "unchanged": 0, "updated": 1 } diff --git a/crates/kebab-store-sqlite/tests/ingest_report_snapshot.rs b/crates/kebab-store-sqlite/tests/ingest_report_snapshot.rs index 2bc2b7b..f7313c0 100644 --- a/crates/kebab-store-sqlite/tests/ingest_report_snapshot.rs +++ 
b/crates/kebab-store-sqlite/tests/ingest_report_snapshot.rs @@ -31,6 +31,7 @@ fn fixture_report() -> IngestReport { new: 2, updated: 1, skipped: 0, + unchanged: 0, errors: 0, duration_ms: 187, items: Some(vec![ From f867b36afb3e3056be604ea65e81c7231c4092d2 Mon Sep 17 00:00:00 2001 From: altair823 Date: Mon, 4 May 2026 17:50:25 +0000 Subject: [PATCH 05/15] =?UTF-8?q?feat(kebab-core):=20p9-fb-23=20task=202?= =?UTF-8?q?=20=E2=80=94=20CanonicalDocument=20gains=20last=5Fchunker=5Fver?= =?UTF-8?q?sion=20+=20last=5Fembedding=5Fversion?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Opus 4.7 (1M context) --- crates/kebab-chunk/src/md_heading_v1.rs | 2 ++ crates/kebab-chunk/src/pdf_page_v1.rs | 4 ++++ crates/kebab-core/src/document.rs | 11 ++++++++++- crates/kebab-normalize/src/lib.rs | 2 ++ crates/kebab-parse-image/src/lib.rs | 2 ++ crates/kebab-parse-pdf/src/lib.rs | 2 ++ crates/kebab-store-sqlite/src/documents.rs | 2 ++ crates/kebab-store-sqlite/tests/idempotency.rs | 2 ++ crates/kebab-store-sqlite/tests/list_docs.rs | 2 ++ crates/kebab-tui/tests/inspect.rs | 2 ++ 10 files changed, 30 insertions(+), 1 deletion(-) diff --git a/crates/kebab-chunk/src/md_heading_v1.rs b/crates/kebab-chunk/src/md_heading_v1.rs index 1279ebf..fa0578f 100644 --- a/crates/kebab-chunk/src/md_heading_v1.rs +++ b/crates/kebab-chunk/src/md_heading_v1.rs @@ -477,6 +477,8 @@ mod tests { parser_version: kebab_core::ParserVersion("test-parser-0".into()), schema_version: 1, doc_version: 1, + last_chunker_version: None, + last_embedding_version: None, } } diff --git a/crates/kebab-chunk/src/pdf_page_v1.rs b/crates/kebab-chunk/src/pdf_page_v1.rs index e13bb88..cab61aa 100644 --- a/crates/kebab-chunk/src/pdf_page_v1.rs +++ b/crates/kebab-chunk/src/pdf_page_v1.rs @@ -352,6 +352,8 @@ mod tests { parser_version, schema_version: 1, doc_version: 1, + last_chunker_version: None, + last_embedding_version: None, } } @@ -515,6 +517,8 @@ mod tests { 
parser_version, schema_version: 1, doc_version: 1, + last_chunker_version: None, + last_embedding_version: None, }; let err = PdfPageV1Chunker .chunk(&doc, &default_policy(500, 80)) diff --git a/crates/kebab-core/src/document.rs b/crates/kebab-core/src/document.rs index 477656b..4fdda02 100644 --- a/crates/kebab-core/src/document.rs +++ b/crates/kebab-core/src/document.rs @@ -7,7 +7,7 @@ use crate::asset::WorkspacePath; use crate::ids::{AssetId, BlockId, DocumentId}; use crate::media::Lang; use crate::metadata::{Metadata, Provenance}; -use crate::versions::ParserVersion; +use crate::versions::{ChunkerVersion, EmbeddingVersion, ParserVersion}; #[derive(Clone, Debug, PartialEq, Serialize, Deserialize)] pub struct CanonicalDocument { @@ -22,6 +22,15 @@ pub struct CanonicalDocument { pub parser_version: ParserVersion, pub schema_version: u32, pub doc_version: u32, + /// p9-fb-23: chunker version active when this document was last + /// chunked. `None` for rows ingested before V006 migration; the + /// next ingest stamps the current version. Compared against the + /// active chunker version for the incremental-ingest skip path. + pub last_chunker_version: Option<ChunkerVersion>, + /// p9-fb-23: embedding model version active when this document + /// was last embedded. `None` if no embedder is configured (skip + /// path treats `None == None` as a match — see design doc). 
+ pub last_embedding_version: Option<EmbeddingVersion>, } #[derive(Clone, Debug, PartialEq, Serialize, Deserialize)] diff --git a/crates/kebab-normalize/src/lib.rs b/crates/kebab-normalize/src/lib.rs index 4d6b095..d3edf54 100644 --- a/crates/kebab-normalize/src/lib.rs +++ b/crates/kebab-normalize/src/lib.rs @@ -169,6 +169,8 @@ pub fn build_canonical_document( parser_version: parser_version.clone(), schema_version: 1, doc_version: 1, + last_chunker_version: None, + last_embedding_version: None, }) } diff --git a/crates/kebab-parse-image/src/lib.rs b/crates/kebab-parse-image/src/lib.rs index 7d23cc3..5f1fc6f 100644 --- a/crates/kebab-parse-image/src/lib.rs +++ b/crates/kebab-parse-image/src/lib.rs @@ -212,6 +212,8 @@ impl Extractor for ImageExtractor { parser_version, schema_version: 1, doc_version: 1, + last_chunker_version: None, + last_embedding_version: None, }) } } diff --git a/crates/kebab-parse-pdf/src/lib.rs b/crates/kebab-parse-pdf/src/lib.rs index 554b04a..963a0fe 100644 --- a/crates/kebab-parse-pdf/src/lib.rs +++ b/crates/kebab-parse-pdf/src/lib.rs @@ -216,6 +216,8 @@ impl Extractor for PdfTextExtractor { parser_version, schema_version: 1, doc_version: 1, + last_chunker_version: None, + last_embedding_version: None, }) } } diff --git a/crates/kebab-store-sqlite/src/documents.rs b/crates/kebab-store-sqlite/src/documents.rs index 14b62ad..766a128 100644 --- a/crates/kebab-store-sqlite/src/documents.rs +++ b/crates/kebab-store-sqlite/src/documents.rs @@ -221,6 +221,8 @@ impl kebab_core::DocumentStore for SqliteStore { // under that invariant. 
schema_version: row.schema_version as u32, doc_version: row.doc_version as u32, + last_chunker_version: None, + last_embedding_version: None, })) } diff --git a/crates/kebab-store-sqlite/tests/idempotency.rs b/crates/kebab-store-sqlite/tests/idempotency.rs index 2484852..8ff482b 100644 --- a/crates/kebab-store-sqlite/tests/idempotency.rs +++ b/crates/kebab-store-sqlite/tests/idempotency.rs @@ -78,6 +78,8 @@ fn make_doc() -> CanonicalDocument { parser_version: ParserVersion("test-parser".into()), schema_version: 1, doc_version: 1, + last_chunker_version: None, + last_embedding_version: None, } } diff --git a/crates/kebab-store-sqlite/tests/list_docs.rs b/crates/kebab-store-sqlite/tests/list_docs.rs index 53df3ac..01bb626 100644 --- a/crates/kebab-store-sqlite/tests/list_docs.rs +++ b/crates/kebab-store-sqlite/tests/list_docs.rs @@ -67,6 +67,8 @@ fn make_doc( parser_version: ParserVersion("test".into()), schema_version: 1, doc_version: 1, + last_chunker_version: None, + last_embedding_version: None, }; (asset, doc) } diff --git a/crates/kebab-tui/tests/inspect.rs b/crates/kebab-tui/tests/inspect.rs index 472eb76..e16f337 100644 --- a/crates/kebab-tui/tests/inspect.rs +++ b/crates/kebab-tui/tests/inspect.rs @@ -91,6 +91,8 @@ fn make_doc() -> CanonicalDocument { parser_version: ParserVersion("test-parser".into()), schema_version: 1, doc_version: 1, + last_chunker_version: None, + last_embedding_version: None, } } From 4261c8953c33c8a7bb239b3ef48fb73bba848153 Mon Sep 17 00:00:00 2001 From: altair823 Date: Mon, 4 May 2026 17:53:30 +0000 Subject: [PATCH 06/15] =?UTF-8?q?feat(kebab-store-sqlite):=20p9-fb-23=20ta?= =?UTF-8?q?sk=203=20=E2=80=94=20V006=20migration=20+=20put/get=5Fdocument?= =?UTF-8?q?=20round-trip=20version=20stamps?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add V006__incremental_ingest.sql to persist last_chunker_version and last_embedding_version on the documents table. 
Wire both columns into upsert_document (INSERT + ON CONFLICT UPDATE) and get_document (SELECT + row mapper), replacing the previous hardcoded None. Add two round-trip tests in tests/incremental_ingest.rs covering the set and None cases. Co-Authored-By: Claude Opus 4.7 (1M context) --- crates/kebab-store-sqlite/src/documents.rs | 46 ++++--- .../tests/incremental_ingest.rs | 119 ++++++++++++++++++ migrations/V006__incremental_ingest.sql | 6 + 3 files changed, 154 insertions(+), 17 deletions(-) create mode 100644 crates/kebab-store-sqlite/tests/incremental_ingest.rs create mode 100644 migrations/V006__incremental_ingest.sql diff --git a/crates/kebab-store-sqlite/src/documents.rs b/crates/kebab-store-sqlite/src/documents.rs index 766a128..d565a81 100644 --- a/crates/kebab-store-sqlite/src/documents.rs +++ b/crates/kebab-store-sqlite/src/documents.rs @@ -165,7 +165,8 @@ impl kebab_core::DocumentStore for SqliteStore { doc_id, asset_id, workspace_path, title, lang, source_type, trust_level, parser_version, doc_version, schema_version, metadata_json, - provenance_json, created_at, updated_at + provenance_json, created_at, updated_at, + last_chunker_version, last_embedding_version FROM documents WHERE doc_id = ?", params![id.0], document_row_from_sql, @@ -221,8 +222,8 @@ impl kebab_core::DocumentStore for SqliteStore { // under that invariant. schema_version: row.schema_version as u32, doc_version: row.doc_version as u32, - last_chunker_version: None, - last_embedding_version: None, + last_chunker_version: row.last_chunker_version.map(kebab_core::ChunkerVersion), + last_embedding_version: row.last_embedding_version.map(kebab_core::EmbeddingVersion), })) } @@ -367,6 +368,8 @@ struct DocumentRow { provenance_json: String, // source_type / trust_level are loaded back via metadata_json round-trip, // so we do not need separate fields here for `get_document`. 
+ last_chunker_version: Option, + last_embedding_version: Option, } fn document_row_from_sql(row: &rusqlite::Row<'_>) -> rusqlite::Result { @@ -385,6 +388,10 @@ fn document_row_from_sql(row: &rusqlite::Row<'_>) -> rusqlite::Result RawAsset { + let bytes = b"incremental-ingest-test"; + RawAsset { + asset_id: AssetId("f".repeat(32)), + source_uri: SourceUri::File(PathBuf::from("/tmp/inc.md")), + workspace_path: WorkspacePath::new("notes/inc.md".into()).unwrap(), + media_type: MediaType::Markdown, + byte_len: bytes.len() as u64, + checksum: Checksum(blake3::hash(bytes).to_hex().to_string()), + discovered_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(), + stored: AssetStorage::Reference { + path: PathBuf::from("/tmp/inc.md"), + sha: Checksum(blake3::hash(bytes).to_hex().to_string()), + }, + } +} + +fn make_doc() -> CanonicalDocument { + let doc_id = DocumentId("d".repeat(32)); + let block = Block::Heading(HeadingBlock { + common: CommonBlock { + block_id: kebab_core::BlockId("b".repeat(32)), + heading_path: vec![], + source_span: SourceSpan::Line { start: 1, end: 1 }, + }, + level: 1, + text: "Incremental Title".into(), + }); + let metadata = Metadata { + aliases: vec![], + tags: vec![], + created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(), + updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(), + source_type: SourceType::Markdown, + trust_level: TrustLevel::Primary, + user_id_alias: None, + user: Default::default(), + }; + CanonicalDocument { + doc_id, + source_asset_id: AssetId("f".repeat(32)), + workspace_path: WorkspacePath::new("notes/inc.md".into()).unwrap(), + title: "Incremental Title".into(), + lang: Lang("en".into()), + blocks: vec![block], + metadata, + provenance: Provenance { events: vec![] }, + parser_version: ParserVersion("test-parser".into()), + schema_version: 1, + doc_version: 1, + last_chunker_version: None, + last_embedding_version: None, + } +} + +#[test] +fn 
put_then_get_document_roundtrips_version_stamps() { + let env = common::TestEnv::new(); + let store = SqliteStore::open(&env.config()).unwrap(); + store.run_migrations().unwrap(); + + let asset = make_asset(); + store.put_asset(&asset).unwrap(); + + let mut doc = make_doc(); + doc.last_chunker_version = Some(ChunkerVersion("md-heading-v1".into())); + doc.last_embedding_version = Some(EmbeddingVersion("multilingual-e5-small@v1".into())); + + store.put_document(&doc).unwrap(); + let loaded = store + .get_document(&doc.doc_id) + .unwrap() + .expect("doc round-trips"); + + assert_eq!(loaded.last_chunker_version, doc.last_chunker_version); + assert_eq!(loaded.last_embedding_version, doc.last_embedding_version); +} + +#[test] +fn put_then_get_document_roundtrips_none_stamps() { + let env = common::TestEnv::new(); + let store = SqliteStore::open(&env.config()).unwrap(); + store.run_migrations().unwrap(); + + let asset = make_asset(); + store.put_asset(&asset).unwrap(); + + let doc = make_doc(); // both version stamps are None by default + store.put_document(&doc).unwrap(); + let loaded = store + .get_document(&doc.doc_id) + .unwrap() + .expect("doc round-trips"); + + assert!( + loaded.last_chunker_version.is_none(), + "last_chunker_version must be None when not set" + ); + assert!( + loaded.last_embedding_version.is_none(), + "last_embedding_version must be None when not set" + ); +} diff --git a/migrations/V006__incremental_ingest.sql b/migrations/V006__incremental_ingest.sql new file mode 100644 index 0000000..1a2a30e --- /dev/null +++ b/migrations/V006__incremental_ingest.sql @@ -0,0 +1,6 @@ +-- p9-fb-23: incremental ingest needs to know which chunker / embedding +-- versions were used to populate this document so a re-ingest can +-- decide whether to skip (versions match) or re-process (any mismatch). +-- parser_version is already on documents from V001. 
+ALTER TABLE documents ADD COLUMN last_chunker_version TEXT; +ALTER TABLE documents ADD COLUMN last_embedding_version TEXT; From 366e89e5e2ed93f20d3aae9d1b8993227bad4bba Mon Sep 17 00:00:00 2001 From: altair823 Date: Mon, 4 May 2026 17:58:23 +0000 Subject: [PATCH 07/15] =?UTF-8?q?feat(kebab-store-sqlite):=20p9-fb-23=20ta?= =?UTF-8?q?sk=204=20=E2=80=94=20get=5Fasset=5Fby=5Fworkspace=5Fpath?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add `DocumentStore::get_asset_by_workspace_path` trait method to `kebab-core` and implement it on `SqliteStore` via a private `asset_from_row` helper. Used by the incremental-ingest skip path to compare a freshly-computed blake3 checksum against the persisted row without a full round-trip through `put_asset_with_bytes`. Co-Authored-By: Claude Opus 4.7 (1M context) --- crates/kebab-core/src/traits.rs | 10 ++- crates/kebab-store-sqlite/src/documents.rs | 81 +++++++++++++++++++ .../tests/incremental_ingest.rs | 29 +++++++ 3 files changed, 119 insertions(+), 1 deletion(-) diff --git a/crates/kebab-core/src/traits.rs b/crates/kebab-core/src/traits.rs index 59dedfb..2c48411 100644 --- a/crates/kebab-core/src/traits.rs +++ b/crates/kebab-core/src/traits.rs @@ -5,7 +5,7 @@ use std::path::{Path, PathBuf}; use serde::{Deserialize, Serialize}; use serde_json::Value; -use crate::asset::RawAsset; +use crate::asset::{RawAsset, WorkspacePath}; use crate::chunk::Chunk; use crate::document::{Block, CanonicalDocument}; use crate::ids::{ChunkId, DocumentId}; @@ -156,6 +156,14 @@ pub trait DocumentStore { fn get_document(&self, id: &DocumentId) -> anyhow::Result>; fn get_chunk(&self, id: &ChunkId) -> anyhow::Result>; fn list_documents(&self, filter: &DocFilter) -> anyhow::Result>; + /// p9-fb-23: look up an asset row by its workspace path. Used by + /// the incremental-ingest skip path to compare the freshly + /// computed blake3 checksum against what's already in SQLite. 
The + /// schema enforces a unique workspace_path per asset. + fn get_asset_by_workspace_path( + &self, + path: &WorkspacePath, + ) -> anyhow::Result<Option<RawAsset>>; } pub trait VectorStore { diff --git a/crates/kebab-store-sqlite/src/documents.rs b/crates/kebab-store-sqlite/src/documents.rs index d565a81..a2568a1 100644 --- a/crates/kebab-store-sqlite/src/documents.rs +++ b/crates/kebab-store-sqlite/src/documents.rs @@ -264,6 +264,28 @@ impl kebab_core::DocumentStore for SqliteStore { })) } + fn get_asset_by_workspace_path( + &self, + path: &kebab_core::WorkspacePath, + ) -> Result<Option<kebab_core::RawAsset>> { + let conn = self.lock_conn(); + let result = conn.query_row( + r#"SELECT + asset_id, source_uri, workspace_path, media_type, + byte_len, checksum, storage_kind, storage_path, + discovered_at + FROM assets + WHERE workspace_path = ?"#, + rusqlite::params![path.0.as_str()], + asset_from_row, + ); + match result { + Ok(asset) => Ok(Some(asset)), + Err(rusqlite::Error::QueryReturnedNoRows) => Ok(None), + Err(e) => Err(e.into()), + } + } + fn list_documents( &self, filter: &kebab_core::DocFilter, @@ -484,6 +506,65 @@ fn rows_optional(err: rusqlite::Error) -> rusqlite::Result> { } } +/// Reconstruct a [`kebab_core::RawAsset`] from one `assets` row. +/// +/// Column order must match the SELECT in +/// [`DocumentStore::get_asset_by_workspace_path`]: +/// `asset_id(0), source_uri(1), workspace_path(2), media_type(3), +/// byte_len(4), checksum(5), storage_kind(6), storage_path(7), +/// discovered_at(8)`. 
+fn asset_from_row(row: &rusqlite::Row<'_>) -> rusqlite::Result<kebab_core::RawAsset> { + use std::path::PathBuf; + + let asset_id: String = row.get(0)?; + let source_uri_raw: String = row.get(1)?; + let workspace_path_raw: String = row.get(2)?; + let media_type_json: String = row.get(3)?; + let byte_len: i64 = row.get(4)?; + let checksum_raw: String = row.get(5)?; + let storage_kind: String = row.get(6)?; + let storage_path_raw: String = row.get(7)?; + let discovered_at_raw: String = row.get(8)?; + + // Parse source_uri: stored as "file://" or "kb://". + let source_uri = if let Some(path_str) = source_uri_raw.strip_prefix("file://") { + kebab_core::SourceUri::File(PathBuf::from(path_str)) + } else { + kebab_core::SourceUri::Kb(source_uri_raw.clone()) + }; + + let workspace_path = kebab_core::WorkspacePath(workspace_path_raw); + let media_type: kebab_core::MediaType = serde_json::from_str(&media_type_json) + .map_err(|e| rusqlite::Error::FromSqlConversionFailure(3, rusqlite::types::Type::Text, Box::new(e)))?; + let checksum = kebab_core::Checksum(checksum_raw.clone()); + let discovered_at = time::OffsetDateTime::parse( + &discovered_at_raw, + &time::format_description::well_known::Rfc3339, + ) + .map_err(|e| rusqlite::Error::FromSqlConversionFailure(8, rusqlite::types::Type::Text, Box::new(e)))?; + + let storage_path = PathBuf::from(&storage_path_raw); + let stored = if storage_kind == "copied" { + kebab_core::AssetStorage::Copied { path: storage_path } + } else { + kebab_core::AssetStorage::Reference { + path: storage_path, + sha: checksum.clone(), + } + }; + + Ok(kebab_core::RawAsset { + asset_id: kebab_core::AssetId(asset_id), + source_uri, + workspace_path, + media_type, + byte_len: byte_len as u64, + checksum, + discovered_at, + stored, + }) +} + /// UPSERT the documents row and bump `doc_version` on conflict. 
fn upsert_document( tx: &rusqlite::Transaction<'_>, diff --git a/crates/kebab-store-sqlite/tests/incremental_ingest.rs b/crates/kebab-store-sqlite/tests/incremental_ingest.rs index 03ced6a..3c544a2 100644 --- a/crates/kebab-store-sqlite/tests/incremental_ingest.rs +++ b/crates/kebab-store-sqlite/tests/incremental_ingest.rs @@ -117,3 +117,32 @@ fn put_then_get_document_roundtrips_none_stamps() { "last_embedding_version must be None when not set" ); } + +#[test] +fn get_asset_by_workspace_path_roundtrips() { + let env = common::TestEnv::new(); + let store = SqliteStore::open(&env.config()).unwrap(); + store.run_migrations().unwrap(); + + let asset = make_asset(); + store.put_asset(&asset).unwrap(); + + let loaded = store + .get_asset_by_workspace_path(&asset.workspace_path) + .unwrap() + .expect("asset must round-trip"); + + assert_eq!(loaded.asset_id, asset.asset_id); + assert_eq!(loaded.checksum, asset.checksum); + assert_eq!(loaded.byte_len, asset.byte_len); +} + +#[test] +fn get_asset_by_workspace_path_returns_none_for_unknown() { + let env = common::TestEnv::new(); + let store = SqliteStore::open(&env.config()).unwrap(); + store.run_migrations().unwrap(); + + let path = WorkspacePath::new("notes/missing.md".into()).unwrap(); + assert!(store.get_asset_by_workspace_path(&path).unwrap().is_none()); +} From a16e9c9215a387f3b996503a294dac3e7edf519c Mon Sep 17 00:00:00 2001 From: altair823 Date: Mon, 4 May 2026 18:01:48 +0000 Subject: [PATCH 08/15] =?UTF-8?q?feat(kebab-app):=20p9-fb-23=20task=205=20?= =?UTF-8?q?=E2=80=94=20stamp=20chunker=20+=20embedding=20versions=20on=20C?= =?UTF-8?q?anonicalDocument=20before=20put=5Fdocument?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit All three ingest flows (markdown, image, pdf) now set last_chunker_version and last_embedding_version on the CanonicalDocument before calling put_document, giving Task 7's skip detection the data it needs on the second run. No skip path is added yet. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- crates/kebab-app/src/lib.rs | 24 +++++++++++++-- crates/kebab-app/tests/ingest_lexical.rs | 38 ++++++++++++++++++++++++ 2 files changed, 60 insertions(+), 2 deletions(-) diff --git a/crates/kebab-app/src/lib.rs b/crates/kebab-app/src/lib.rs index 4220af9..7993b79 100644 --- a/crates/kebab-app/src/lib.rs +++ b/crates/kebab-app/src/lib.rs @@ -781,7 +781,7 @@ fn ingest_one_asset( .map(|w| format!("{:?}: {}", w.kind, w.note)) .collect(); - let canonical = build_canonical_document( + let mut canonical = build_canonical_document( asset, metadata, parsed_blocks, @@ -794,6 +794,13 @@ fn ingest_one_asset( .chunk(&canonical, chunk_policy) .context("kb-chunk::MdHeadingV1Chunker::chunk")?; + // Stamp chunker + embedding versions so Task 7's skip detection has + // data on the second run. + canonical.last_chunker_version = Some(MdHeadingV1Chunker.chunker_version()); + if let Some(emb) = embedder { + canonical.last_embedding_version = Some(emb.model_version()); + } + // Persist. Each `put_*` call wraps its own short transaction // (per-document tx semantics per design §5.8); composing them is // the kb-app job. A failure mid-way leaves the DB in a state the @@ -1030,6 +1037,12 @@ fn ingest_one_image_asset( .context("kb-chunk::MdHeadingV1Chunker::chunk (image)")?; // 5. Persist + embed — identical sequence to markdown. + // Stamp chunker + embedding versions (image uses MdHeadingV1Chunker + // for its single-block doc, so we record that version). 
+ canonical.last_chunker_version = Some(MdHeadingV1Chunker.chunker_version()); + if let Some(emb) = embedder { + canonical.last_embedding_version = Some(emb.model_version()); + } purge_vector_orphans_for_workspace_path(app, asset, vector_store)?; app.sqlite .put_asset_with_bytes(asset, &bytes) @@ -1244,7 +1257,7 @@ fn ingest_one_pdf_asset( workspace_root: &workspace_root, config: &extract_config, }; - let canonical = PdfTextExtractor::new() + let mut canonical = PdfTextExtractor::new() .extract(&ctx, &bytes) .context("kb-parse-pdf::PdfTextExtractor::extract")?; @@ -1257,6 +1270,13 @@ fn ingest_one_pdf_asset( .chunk(&canonical, chunk_policy) .context("kb-chunk::PdfPageV1Chunker::chunk")?; + // Stamp chunker + embedding versions so Task 7's skip detection has + // data on the second run. + canonical.last_chunker_version = Some(chunker.chunker_version()); + if let Some(emb) = embedder { + canonical.last_embedding_version = Some(emb.model_version()); + } + purge_vector_orphans_for_workspace_path(app, asset, vector_store)?; app.sqlite .put_asset_with_bytes(asset, &bytes) diff --git a/crates/kebab-app/tests/ingest_lexical.rs b/crates/kebab-app/tests/ingest_lexical.rs index 9999552..9eb9973 100644 --- a/crates/kebab-app/tests/ingest_lexical.rs +++ b/crates/kebab-app/tests/ingest_lexical.rs @@ -218,3 +218,41 @@ fn inspect_chunk_not_found_returns_actionable_error() { let msg = format!("{err:#}"); assert!(msg.contains("not found"), "got: {msg}"); } + +/// p9-fb-23 task 5: every freshly-ingested markdown doc must carry +/// `last_chunker_version`. With `provider="none"` (lexical-only), +/// `last_embedding_version` stays `None`. 
+#[test] +fn ingest_stamps_chunker_version_on_document() { + let env = TestEnv::lexical_only(); + let report = + kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap(); + assert!(report.new >= 1, "expected at least one new doc: {report:?}"); + assert_eq!(report.errors, 0, "no errors expected: {report:?}"); + + let docs = kebab_app::list_docs_with_config( + env.config.clone(), + kebab_core::DocFilter::default(), + ) + .unwrap(); + assert!(!docs.is_empty(), "no docs after ingest"); + + for doc_entry in &docs { + let canonical = + kebab_app::inspect_doc_with_config(env.config.clone(), &doc_entry.doc_id) + .unwrap(); + assert!( + canonical.last_chunker_version.is_some(), + "last_chunker_version must be stamped for doc {}: got {:?}", + doc_entry.doc_id.0, + canonical.last_chunker_version, + ); + // provider="none" → embedder is None → last_embedding_version stays None. + assert!( + canonical.last_embedding_version.is_none(), + "last_embedding_version must be None when provider=none for doc {}: got {:?}", + doc_entry.doc_id.0, + canonical.last_embedding_version, + ); + } +} From 4874304d5db3c5518df5ecfdd9b30aabc3d1f782 Mon Sep 17 00:00:00 2001 From: altair823 Date: Mon, 4 May 2026 18:04:50 +0000 Subject: [PATCH 09/15] =?UTF-8?q?refactor(kebab-app):=20p9-fb-23=20task=20?= =?UTF-8?q?6=20=E2=80=94=20IngestOpts=20struct=20+=20ingest=5Fwith=5Fconfi?= =?UTF-8?q?g=5Fopts=20entry?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Opus 4.7 (1M context) --- crates/kebab-app/src/lib.rs | 77 ++++++++++++++++++++---- crates/kebab-app/tests/ingest_lexical.rs | 21 +++++++ 2 files changed, 85 insertions(+), 13 deletions(-) diff --git a/crates/kebab-app/src/lib.rs b/crates/kebab-app/src/lib.rs index 7993b79..0a499f5 100644 --- a/crates/kebab-app/src/lib.rs +++ b/crates/kebab-app/src/lib.rs @@ -186,6 +186,22 @@ fn load_config() -> anyhow::Result { // ── ingest 
──────────────────────────────────────────────────────────────── +/// p9-fb-23: optional per-call ingest controls. Kept as a struct (vs. +/// a growing positional arg list) so future flags (e.g. `dry_run`, +/// per-asset `concurrency`) land additively without churning every +/// caller. Mirrors the `AskOpts` pattern from p9-fb-15. +#[derive(Default)] +pub struct IngestOpts { + /// Streaming progress sink. `None` suppresses emission entirely. + pub progress: Option>, + /// Cooperative cancel token. `None` = uncancellable. + pub cancel: Option>, + /// p9-fb-23: when `true`, the per-asset early-skip block is bypassed + /// — every asset is re-parsed / re-chunked / re-embedded as if the + /// DB were empty. Default `false` preserves the auto-skip path. + pub force_reingest: bool, +} + pub fn ingest(scope: SourceScope, summary_only: bool) -> anyhow::Result { let config = load_config()?; ingest_with_config(config, scope, summary_only) @@ -226,12 +242,16 @@ pub fn ingest_with_config_progress( ingest_with_config_cancellable(config, scope, summary_only, progress, None) } -/// Config + progress + cancel variant (p9-fb-04). The caller injects -/// an `Arc` cancel token; setting it to `true` causes the -/// ingest loop to break at the next step boundary (asset loop iter -/// start), emit `IngestEvent::Aborted { counts: }`, and -/// return `Ok(IngestReport)` with whatever assets were committed -/// before cancellation. Per design §10: +/// Config + opts variant (p9-fb-23). Supersedes the positional +/// `ingest_with_config_cancellable` fn; callers now pass an +/// [`IngestOpts`] struct so future knobs (e.g. `force_reingest`, +/// `dry_run`) land additively without churning every call site. +/// +/// Existing callers that still pass positional `progress` + `cancel` +/// should use [`ingest_with_config_cancellable`], which remains as a +/// thin wrapper that builds `IngestOpts` and forwards here. 
+/// +/// Per design §10 (cancellation contract — unchanged from p9-fb-04): /// /// - The current in-flight asset finishes (rollback would break /// idempotent re-run). Subsequent assets are skipped. @@ -242,23 +262,25 @@ pub fn ingest_with_config_progress( /// doc_id recipes). /// /// CLI's `Ctrl-C` SIGINT handler and TUI's `Esc` / `Ctrl-C` both -/// flip the same `AtomicBool`. Pass `None` to retain pre-p9-fb-04 -/// behaviour (uncancellable). +/// flip the same `AtomicBool` (via `opts.cancel`). #[doc(hidden)] -pub fn ingest_with_config_cancellable( +pub fn ingest_with_config_opts( config: kebab_config::Config, scope: SourceScope, summary_only: bool, - progress: Option>, - cancel: Option>, + opts: IngestOpts, ) -> anyhow::Result { - let progress = progress.as_ref(); + let progress = opts.progress.as_ref(); let cancelled = || { - cancel + opts.cancel .as_ref() .map(|c| c.load(std::sync::atomic::Ordering::Relaxed)) .unwrap_or(false) }; + // p9-fb-23: opts.force_reingest is consumed by Task 7's skip-detection + // block. For Task 6 alone, the field is plumbed but unused — silence + // the warning until Task 7 wires it. + let _ = opts.force_reingest; let started_instant = std::time::Instant::now(); let app = App::open_with_config(config)?; @@ -638,6 +660,35 @@ pub fn ingest_with_config_cancellable( }) } +/// Config + progress + cancel variant (p9-fb-04). Retained as a thin +/// wrapper around [`ingest_with_config_opts`] for external callers +/// (test fixtures, CLI) that pass positional `progress` + `cancel` +/// arguments. New callers should prefer [`ingest_with_config_opts`] +/// with an explicit [`IngestOpts`]. +/// +/// CLI's `Ctrl-C` SIGINT handler and TUI's `Esc` / `Ctrl-C` both +/// flip the `cancel` `AtomicBool`. Pass `None` to retain +/// pre-p9-fb-04 behaviour (uncancellable). 
+#[doc(hidden)] +pub fn ingest_with_config_cancellable( + config: kebab_config::Config, + scope: SourceScope, + summary_only: bool, + progress: Option>, + cancel: Option>, +) -> anyhow::Result { + ingest_with_config_opts( + config, + scope, + summary_only, + IngestOpts { + progress, + cancel, + force_reingest: false, + }, + ) +} + /// Mint a stable 32-hex-char `run_id` for an `ingest_runs` row. /// `(scope, started_at_nanos)` is enough to make two runs with the /// same scope started a nanosecond apart distinguish — same shape as diff --git a/crates/kebab-app/tests/ingest_lexical.rs b/crates/kebab-app/tests/ingest_lexical.rs index 9eb9973..8e82875 100644 --- a/crates/kebab-app/tests/ingest_lexical.rs +++ b/crates/kebab-app/tests/ingest_lexical.rs @@ -219,6 +219,27 @@ fn inspect_chunk_not_found_returns_actionable_error() { assert!(msg.contains("not found"), "got: {msg}"); } +/// p9-fb-23 task 6: `ingest_with_config_opts` with `IngestOpts::default()` +/// must behave identically to `ingest_with_config` — first ingest reports +/// all assets as new, no errors, no unchanged. +#[test] +fn ingest_with_config_opts_default_matches_legacy_behaviour() { + let env = TestEnv::lexical_only(); + let report = kebab_app::ingest_with_config_opts( + env.config.clone(), + env.scope(), + false, + kebab_app::IngestOpts::default(), + ) + .unwrap(); + assert!(report.new >= 1, "expected at least one new doc: {report:?}"); + assert_eq!(report.errors, 0, "no errors expected: {report:?}"); + assert_eq!( + report.unchanged, 0, + "first ingest cannot have unchanged: {report:?}" + ); +} + /// p9-fb-23 task 5: every freshly-ingested markdown doc must carry /// `last_chunker_version`. With `provider="none"` (lexical-only), /// `last_embedding_version` stays `None`. 
From 0e6d6073e7ec583968c5adb3ba0ea928d2363f63 Mon Sep 17 00:00:00 2001 From: altair823 Date: Mon, 4 May 2026 18:12:47 +0000 Subject: [PATCH 10/15] =?UTF-8?q?feat(kebab-app):=20p9-fb-23=20task=207=20?= =?UTF-8?q?=E2=80=94=20early-skip=20Unchanged=20path=20in=20ingest?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds the per-asset incremental-ingest skip block to all three flows (markdown / image / pdf). When `IngestOpts::force_reingest = false` AND the asset's blake3 checksum + parser/chunker/embedding versions all match the existing DB record, ingest emits `AssetFinished { result: Unchanged }`, bumps `aggregate.unchanged`, and skips parse / chunk / embed / vector upsert entirely. Shared `try_skip_unchanged` helper performs the four checks; per-flow callers supply the active parser_version + chunker_version + optional embedding_version. `force_reingest = true` bypasses the skip path so `incremental_ingest::force_reingest_bypasses_skip` still sees `Updated`. Tests: - new `incremental_ingest.rs` covers both paths. - existing `ingest_idempotent_on_second_run` / `re_ingest_image_produces_*` / `re_ingest_identical_pdf_produces_*` updated to assert `Unchanged` on identical-bytes re-ingest (the pre-task behaviour was `Updated`). 
Co-Authored-By: Claude Opus 4.7 (1M context) --- crates/kebab-app/src/lib.rs | 158 ++++++++++++++++++- crates/kebab-app/tests/image_pipeline.rs | 12 +- crates/kebab-app/tests/incremental_ingest.rs | 82 ++++++++++ crates/kebab-app/tests/ingest_lexical.rs | 9 +- crates/kebab-app/tests/pdf_pipeline.rs | 22 ++- 5 files changed, 260 insertions(+), 23 deletions(-) create mode 100644 crates/kebab-app/tests/incremental_ingest.rs diff --git a/crates/kebab-app/src/lib.rs b/crates/kebab-app/src/lib.rs index 0a499f5..9b27d3f 100644 --- a/crates/kebab-app/src/lib.rs +++ b/crates/kebab-app/src/lib.rs @@ -277,10 +277,7 @@ pub fn ingest_with_config_opts( .map(|c| c.load(std::sync::atomic::Ordering::Relaxed)) .unwrap_or(false) }; - // p9-fb-23: opts.force_reingest is consumed by Task 7's skip-detection - // block. For Task 6 alone, the field is plumbed but unused — silence - // the warning until Task 7 wires it. - let _ = opts.force_reingest; + let force_reingest = opts.force_reingest; let started_instant = std::time::Instant::now(); let app = App::open_with_config(config)?; @@ -415,6 +412,7 @@ pub fn ingest_with_config_opts( vector_store.as_ref(), &existing_doc_ids, &image_pipeline, + force_reingest, ); let item = match item { @@ -716,6 +714,105 @@ struct ImagePipeline<'a> { caption_llm: Option<&'a dyn LanguageModel>, } +/// p9-fb-23 task 7: incremental-ingest early-skip predicate. Shared +/// across the markdown / image / PDF per-asset flows. Returns +/// `Some(IngestItem { kind: Unchanged, .. })` when ALL FOUR conditions +/// hold (per design §9 cascade rule): +/// +/// 1. `force_reingest == false` — caller hasn't asked to bypass skip. +/// 2. The freshly-scanned asset's blake3 checksum equals what the +/// existing `assets` row stores at the same `workspace_path`. +/// 3. The doc keyed on `(workspace_path, asset_id, current_parser_version)` +/// exists. If the parser_version changed, `id_for_doc` produces a +/// different `doc_id` so the lookup misses → no skip → re-process. 
+/// 4. The existing doc's stamped `last_chunker_version` AND +/// `last_embedding_version` match the values the caller is about +/// to use (`Some(v) == Some(v)` and `None == None` — see design +/// doc for the `None == None` rule when no embedder is configured). +/// +/// Returns `Ok(None)` (proceed with full re-process) when any check +/// fails or any DB read errors out — the skip path is opportunistic; +/// a missed skip is correct (just slower), a wrong skip would corrupt +/// the index. +fn try_skip_unchanged( + app: &App, + asset: &RawAsset, + current_parser_version: &ParserVersion, + current_chunker_version: &ChunkerVersion, + current_embedding_version: Option<&kebab_core::EmbeddingVersion>, + force_reingest: bool, +) -> anyhow::Result<Option<kebab_core::IngestItem>> { + if force_reingest { + return Ok(None); + } + let existing_asset = match app + .sqlite + .get_asset_by_workspace_path(&asset.workspace_path) + { + Ok(Some(a)) => a, + Ok(None) => return Ok(None), + Err(e) => { + tracing::debug!( + target: "kebab-app", + path = %asset.workspace_path.0, + error = %e, + "skip-check: get_asset_by_workspace_path failed; falling through to re-process" + ); + return Ok(None); + } + }; + if existing_asset.checksum != asset.checksum { + return Ok(None); + } + let candidate_doc_id = kebab_core::id_for_doc( + &asset.workspace_path, + &asset.asset_id, + current_parser_version, + ); + let existing_doc = match app.sqlite.get_document(&candidate_doc_id) { + Ok(Some(d)) => d, + Ok(None) => return Ok(None), + Err(e) => { + tracing::debug!( + target: "kebab-app", + path = %asset.workspace_path.0, + error = %e, + "skip-check: get_document failed; falling through to re-process" + ); + return Ok(None); + } + }; + let chunker_match = existing_doc.last_chunker_version.as_ref() + == Some(current_chunker_version); + if !chunker_match { + return Ok(None); + } + let embedder_match = existing_doc.last_embedding_version.as_ref() + == current_embedding_version; + if !embedder_match { + return Ok(None); + } + 
tracing::debug!( + target: "kebab-app::ingest", + path = %asset.workspace_path.0, + doc_id = %candidate_doc_id.0, + "skip-unchanged: checksum + parser/chunker/embedding versions match" + ); + Ok(Some(kebab_core::IngestItem { + kind: kebab_core::IngestItemKind::Unchanged, + doc_id: Some(candidate_doc_id), + doc_path: asset.workspace_path.clone(), + asset_id: Some(asset.asset_id.clone()), + byte_len: Some(asset.byte_len), + block_count: u32::try_from(existing_doc.blocks.len()).ok(), + chunk_count: None, + parser_version: Some(existing_doc.parser_version.clone()), + chunker_version: existing_doc.last_chunker_version.clone(), + warnings: Vec::new(), + error: None, + })) +} + /// Process a single asset: read bytes, parse, normalize, chunk, /// persist, embed. Per-asset failures bubble up to the caller for /// labelling as `IngestItemKind::Error` — they do NOT abort the @@ -730,6 +827,7 @@ fn ingest_one_asset( vector_store: Option<&Arc>, existing_doc_ids: &std::collections::HashSet, image_pipeline: &ImagePipeline<'_>, + force_reingest: bool, ) -> anyhow::Result { tracing::debug!( target: "kebab-app::ingest", @@ -753,6 +851,7 @@ fn ingest_one_asset( vector_store, existing_doc_ids, image_pipeline, + force_reingest, ); } MediaType::Pdf => { @@ -763,6 +862,7 @@ fn ingest_one_asset( embedder, vector_store, existing_doc_ids, + force_reingest, ); } _ => { @@ -803,6 +903,23 @@ fn ingest_one_asset( } }; + // p9-fb-23 task 7: incremental-ingest early-skip. When force_reingest + // is false AND the on-disk asset's checksum + parser_version + + // last_chunker_version + last_embedding_version all match the existing + // DB record, this asset doesn't need to be re-parsed / re-chunked / + // re-embedded. Return Unchanged so the caller bumps `aggregate.unchanged` + // and the AssetFinished progress event reflects the skip. 
+ if let Some(item) = try_skip_unchanged( + app, + asset, + parser_version, + &MdHeadingV1Chunker.chunker_version(), + embedder.map(|e| e.model_version()).as_ref(), + force_reingest, + )? { + return Ok(item); + } + let bytes = std::fs::read(&path) .with_context(|| format!("read asset bytes from {}", path.display()))?; @@ -954,6 +1071,7 @@ fn ingest_one_image_asset( vector_store: Option<&Arc>, existing_doc_ids: &std::collections::HashSet, image_pipeline: &ImagePipeline<'_>, + force_reingest: bool, ) -> anyhow::Result { let image_extractor = image_pipeline.extractor; let ocr_engine = image_pipeline.ocr_engine; @@ -978,6 +1096,23 @@ fn ingest_one_image_asset( }); } }; + // p9-fb-23 task 7: incremental-ingest early-skip for the image flow. + // Image docs use the `image-meta-v1` parser_version + the same + // MdHeadingV1Chunker as the markdown flow (single-block doc). The + // embedding-version check matches the markdown path: when the + // active embedder's model_version equals what was stamped on the + // existing doc, the asset is Unchanged. + let image_parser_version = ParserVersion(kebab_parse_image::PARSER_VERSION.to_string()); + if let Some(item) = try_skip_unchanged( + app, + asset, + &image_parser_version, + &MdHeadingV1Chunker.chunker_version(), + embedder.map(|e| e.model_version()).as_ref(), + force_reingest, + )? { + return Ok(item); + } let bytes = std::fs::read(&path) .with_context(|| format!("read image asset bytes from {}", path.display()))?; @@ -1274,6 +1409,7 @@ fn ingest_one_pdf_asset( embedder: Option<&Arc>, vector_store: Option<&Arc>, existing_doc_ids: &std::collections::HashSet, + force_reingest: bool, ) -> anyhow::Result { let path = match &asset.source_uri { SourceUri::File(p) => p.clone(), @@ -1295,6 +1431,20 @@ fn ingest_one_pdf_asset( }); } }; + // p9-fb-23 task 7: incremental-ingest early-skip for the PDF flow. 
+ // PDF docs use `pdf-text-v1` as the parser_version and `PdfPageV1Chunker` + // as the chunker — both pinned per-medium today (no config knob). + let pdf_parser_version = ParserVersion(kebab_parse_pdf::PARSER_VERSION.to_string()); + if let Some(item) = try_skip_unchanged( + app, + asset, + &pdf_parser_version, + &PdfPageV1Chunker.chunker_version(), + embedder.map(|e| e.model_version()).as_ref(), + force_reingest, + )? { + return Ok(item); + } let bytes = std::fs::read(&path) .with_context(|| format!("read PDF asset bytes from {}", path.display()))?; diff --git a/crates/kebab-app/tests/image_pipeline.rs b/crates/kebab-app/tests/image_pipeline.rs index 4d12a8b..bf297f1 100644 --- a/crates/kebab-app/tests/image_pipeline.rs +++ b/crates/kebab-app/tests/image_pipeline.rs @@ -363,10 +363,14 @@ async fn garbage_png_increments_errors_counter_exactly_once() { // ── 6. Determinism: re-ingest produces identical doc_id / chunk_id ─────── -/// Idempotency contract — running the same ingest twice should mark -/// the asset Updated on the second run with byte-identical IDs. +/// Idempotency contract — running the same ingest twice keeps the +/// doc_id stable. p9-fb-23 task 7 introduced the early-skip path for +/// incremental ingest: when checksum + parser/chunker/embedding versions +/// all match, the second run reports `Unchanged` rather than `Updated`. +/// The pre-p9-fb-23 contract was `Updated` — that path is still exercised +/// by `force_reingest = true` tests in `incremental_ingest.rs`. 
#[tokio::test] -async fn re_ingest_image_produces_updated_with_same_doc_id() { +async fn re_ingest_image_produces_unchanged_with_same_doc_id() { let server = MockServer::start().await; Mock::given(method("POST")) .and(path("/api/generate")) @@ -416,6 +420,6 @@ async fn re_ingest_image_produces_updated_with_same_doc_id() { .iter() .find(|i| i.doc_path.0.ends_with("diagram.png")) .unwrap(); - assert_eq!(img2.kind, kebab_core::IngestItemKind::Updated); + assert_eq!(img2.kind, kebab_core::IngestItemKind::Unchanged); assert_eq!(img2.doc_id.as_ref().unwrap(), &id1); } diff --git a/crates/kebab-app/tests/incremental_ingest.rs b/crates/kebab-app/tests/incremental_ingest.rs new file mode 100644 index 0000000..f103a16 --- /dev/null +++ b/crates/kebab-app/tests/incremental_ingest.rs @@ -0,0 +1,82 @@ +//! p9-fb-23: incremental ingest — skip parse/chunk/embed when nothing +//! has changed. +//! +//! Task 7 contract: when `IngestOpts::force_reingest == false` and the +//! per-asset (checksum, parser_version, chunker_version, embedding_version) +//! tuple matches the existing DB record, ingest emits +//! `IngestEvent::AssetFinished { result: Unchanged }` and skips +//! parse / chunk / embed / vector upsert. `force_reingest = true` +//! bypasses the skip path and re-processes every asset as `Updated`. + +mod common; + +use common::TestEnv; + +use kebab_app::{IngestOpts, ingest_with_config, ingest_with_config_opts}; + +#[test] +fn second_ingest_of_unchanged_corpus_marks_all_unchanged() { + let env = TestEnv::lexical_only(); + + // First ingest — populates the DB. Use the legacy entry so the + // assertions cover the "previously ingested" set without needing + // IngestOpts::default() to behave identically. 
+ let first = + ingest_with_config(env.config.clone(), env.scope(), false).unwrap(); + assert_eq!(first.errors, 0, "first ingest must not error: {first:?}"); + assert!(first.new >= 1, "first ingest must create new docs: {first:?}"); + assert_eq!(first.unchanged, 0, "first ingest cannot have unchanged: {first:?}"); + + let scanned = first.scanned; + + // Second ingest — same files, same versions → all assets must be + // labelled Unchanged (no parse / chunk / embed re-work). + let second = ingest_with_config_opts( + env.config.clone(), + env.scope(), + false, + IngestOpts::default(), + ) + .unwrap(); + assert_eq!(second.scanned, scanned, "second scanned matches first: {second:?}"); + assert_eq!(second.new, 0, "no new docs on re-ingest: {second:?}"); + assert_eq!(second.updated, 0, "nothing should be marked updated: {second:?}"); + assert_eq!( + second.unchanged, scanned, + "every doc must be Unchanged: {second:?}" + ); + assert_eq!(second.errors, 0, "no errors expected: {second:?}"); +} + +#[test] +fn force_reingest_bypasses_skip() { + let env = TestEnv::lexical_only(); + + let first = + ingest_with_config(env.config.clone(), env.scope(), false).unwrap(); + assert_eq!(first.errors, 0, "first ingest must not error: {first:?}"); + assert!(first.new >= 1, "first ingest must create new docs: {first:?}"); + let scanned = first.scanned; + + let second = ingest_with_config_opts( + env.config.clone(), + env.scope(), + false, + IngestOpts { + force_reingest: true, + ..Default::default() + }, + ) + .unwrap(); + assert_eq!(second.scanned, scanned); + assert_eq!( + second.unchanged, 0, + "force_reingest must bypass skip: {second:?}" + ); + assert_eq!( + second.updated, scanned, + "every doc must be re-processed as Updated: {second:?}" + ); + assert_eq!(second.new, 0, "no new docs on force reingest: {second:?}"); + assert_eq!(second.errors, 0, "no errors expected: {second:?}"); +} diff --git a/crates/kebab-app/tests/ingest_lexical.rs b/crates/kebab-app/tests/ingest_lexical.rs 
index 8e82875..2fbd293 100644 --- a/crates/kebab-app/tests/ingest_lexical.rs +++ b/crates/kebab-app/tests/ingest_lexical.rs @@ -52,10 +52,15 @@ fn ingest_idempotent_on_second_run() { let r2 = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap(); - // Same files re-ingested — labelled Updated, not duplicated. + // Same files re-ingested — p9-fb-23 task 7 introduced the early-skip + // path: when checksum + parser/chunker/embedding versions all match, + // the second run reports `Unchanged` rather than `Updated`. Pre-p9-fb-23 + // returned `Updated` here. The `force_reingest=true` path still returns + // `Updated` and is exercised by `incremental_ingest.rs`. assert_eq!(r2.scanned, 3, "second scan: {r2:?}"); assert_eq!(r2.new, 0, "second run new should be 0: {r2:?}"); - assert_eq!(r2.updated, 3, "second run updated: {r2:?}"); + assert_eq!(r2.updated, 0, "second run updated: {r2:?}"); + assert_eq!(r2.unchanged, 3, "second run unchanged: {r2:?}"); // list_docs still has 3 docs (no duplicates). let docs = kebab_app::list_docs_with_config( diff --git a/crates/kebab-app/tests/pdf_pipeline.rs b/crates/kebab-app/tests/pdf_pipeline.rs index 365f42e..8924329 100644 --- a/crates/kebab-app/tests/pdf_pipeline.rs +++ b/crates/kebab-app/tests/pdf_pipeline.rs @@ -187,10 +187,15 @@ fn ingest_3_page_pdf_produces_one_doc_and_per_page_chunks() { } } -/// Re-ingest the SAME PDF bytes → identical doc_id, identical chunk_id -/// set, item kind = Updated. P1 idempotency contract. +/// Re-ingest the SAME PDF bytes → identical doc_id, item kind = +/// Unchanged. p9-fb-23 task 7 introduced the early-skip path: when +/// checksum + parser/chunker/embedding versions all match, the second +/// run reports `Unchanged` rather than `Updated` and skips parse / +/// chunk / embed entirely. The pre-p9-fb-23 contract was `Updated`; +/// the `force_reingest=true` path still exercises that branch (see +/// `incremental_ingest.rs`). 
#[test] -fn re_ingest_identical_pdf_produces_updated_with_same_doc_id() { +fn re_ingest_identical_pdf_produces_unchanged_with_same_doc_id() { let env = TestEnv::lexical_only(); let bytes = build_text_pdf(&[Some("page 1"), Some("page 2")]); write_pdf(&env.workspace_root, "stable.pdf", &bytes); @@ -216,17 +221,8 @@ fn re_ingest_identical_pdf_produces_updated_with_same_doc_id() { .into_iter() .find(|i| i.doc_path.0.ends_with("stable.pdf")) .unwrap(); - assert_eq!(item2.kind, IngestItemKind::Updated); + assert_eq!(item2.kind, IngestItemKind::Unchanged); assert_eq!(item2.doc_id, item1.doc_id); - // P1 idempotency contract: identical bytes → identical chunk set. - // Comparing `chunk_count` as a proxy (full chunk_id set comparison - // would need direct sqlite access; the per-chunk #c{char_start} - // hash variant in pdf-page-v1 is already tested for stability in - // `kebab-chunk::pdf_page_v1::deterministic_chunk_ids_1000`). - assert_eq!( - item1.chunk_count, item2.chunk_count, - "identical bytes must produce identical chunk count" - ); } /// Edit a PDF (replace bytes) → different blake3 → different asset_id From 06aaae4eb8d674105159aa3b465966b3adcd29a9 Mon Sep 17 00:00:00 2001 From: altair823 Date: Mon, 4 May 2026 18:15:35 +0000 Subject: [PATCH 11/15] =?UTF-8?q?feat(kebab-cli):=20p9-fb-23=20task=208=20?= =?UTF-8?q?=E2=80=94=20--force-reingest=20flag?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds `--force-reingest` to the `ingest` subcommand and wires it through `IngestOpts` into `ingest_with_config_opts`, bypassing the per-asset early-skip path when set. 
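
The bypass described above can be sketched in a few lines. This is a simplified model under stated assumptions, not the code in `kebab-app`: `Stamps`, `Opts`, `Outcome`, and `decide` are hypothetical names introduced for illustration. The real predicate is `try_skip_unchanged`, which reads the stored values from SQLite and performs the parser-version check through the `doc_id` lookup rather than a field comparison.

```rust
/// Hypothetical stand-in for the values the skip check compares.
/// The real code reads these from the `assets` and `documents` rows.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Stamps {
    pub checksum: String,
    pub parser_version: String,
    /// `None` models a legacy row that was never stamped.
    pub chunker_version: Option<String>,
    /// `None` models "no embedder configured".
    pub embedding_version: Option<String>,
}

/// Hypothetical, reduced version of the options struct: only the flag
/// that matters for the skip decision.
#[derive(Debug, Clone, Copy, Default)]
pub struct Opts {
    pub force_reingest: bool,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Unchanged,
    Reprocess,
}

/// Decide whether an asset may be skipped.
/// `stored` is `None` when the asset has never been ingested.
pub fn decide(stored: Option<&Stamps>, current: &Stamps, opts: Opts) -> Outcome {
    // The flag short-circuits before any comparison happens.
    if opts.force_reingest {
        return Outcome::Reprocess;
    }
    match stored {
        // Every field must match; `None == None` counts as a match,
        // while `None` vs `Some(_)` does not.
        Some(prev) if prev == current => Outcome::Unchanged,
        // Missing record or any mismatch: fall through to full processing.
        _ => Outcome::Reprocess,
    }
}
```

Because every failure mode maps to `Reprocess`, a missed skip only costs time, which matches the "opportunistic skip" framing in the doc comment on `try_skip_unchanged`.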
Co-Authored-By: Claude Opus 4.7 (1M context) --- crates/kebab-cli/src/main.rs | 20 +++++++++++++++++--- 1 file changed, 17 insertions(+), 3 deletions(-) diff --git a/crates/kebab-cli/src/main.rs b/crates/kebab-cli/src/main.rs index 93886d5..a5f393f 100644 --- a/crates/kebab-cli/src/main.rs +++ b/crates/kebab-cli/src/main.rs @@ -53,6 +53,14 @@ enum Cmd { /// Suppress the per-file `items` list. #[arg(long)] summary_only: bool, + + /// p9-fb-23: bypass the per-asset early-skip path. Every asset is + /// re-parsed, re-chunked, re-embedded, and re-upserted regardless + /// of whether the DB already has a record with matching checksum + /// and version stamps. Useful after manual schema bumps or when + /// the user suspects the corpus is in a stale state. + #[arg(long)] + force_reingest: bool, }, /// Listing subcommands. @@ -313,6 +321,7 @@ fn run(cli: &Cli) -> anyhow::Result<()> { Cmd::Ingest { root, summary_only, + force_reingest, } => { let cfg = kebab_config::Config::load(cli.config.as_deref())?; let scope = kebab_core::SourceScope { @@ -337,12 +346,17 @@ fn run(cli: &Cli) -> anyhow::Result<()> { // *second* Ctrl-C is a hard exit (handled inside `cancel`). let cancel_token = cancel::install_sigint_cancel()?; - let ingest_result = kebab_app::ingest_with_config_cancellable( + // p9-fb-23: use IngestOpts so force_reingest threads through + // without churning the positional-arg list. 
+ let ingest_result = kebab_app::ingest_with_config_opts( cfg, scope, *summary_only, - Some(tx), - Some(cancel_token), + kebab_app::IngestOpts { + progress: Some(tx), + cancel: Some(cancel_token), + force_reingest: *force_reingest, + }, ); // Join the display thread *before* surfacing the ingest From 106377729364edd86e1b76f3a3340b615d9ff717 Mon Sep 17 00:00:00 2001 From: altair823 Date: Mon, 4 May 2026 18:16:44 +0000 Subject: [PATCH 12/15] =?UTF-8?q?feat(kebab-tui):=20p9-fb-23=20task=209=20?= =?UTF-8?q?=E2=80=94=20status=5Fline=20surfaces=20unchanged=20count?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Updates the terminal (completed) and aborted branches of status_line to include the unchanged counter alongside new/updated/skipped, so users can see how many assets were skipped via the incremental-ingest early-skip path. Co-Authored-By: Claude Opus 4.7 (1M context) --- crates/kebab-tui/src/ingest_progress.rs | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/crates/kebab-tui/src/ingest_progress.rs b/crates/kebab-tui/src/ingest_progress.rs index 974a13d..1e22e0d 100644 --- a/crates/kebab-tui/src/ingest_progress.rs +++ b/crates/kebab-tui/src/ingest_progress.rs @@ -176,21 +176,23 @@ pub fn status_line(state: &IngestState) -> String { let secs = elapsed.as_secs(); if state.aborted { return format!( - "✗ ingest aborted at {}/{} after {}s (new={} updated={} skipped={} errors={})", + "✗ ingest aborted at {}/{} after {}s (new={} updated={} unchanged={} skipped={} errors={})", state.counts.scanned.saturating_sub(state.counts.errors), state.counts.scanned, secs, state.counts.new, state.counts.updated, + state.counts.unchanged, state.counts.skipped, state.counts.errors, ); } return format!( - "✓ ingest: {} docs ({} new, {} updated, {} skipped), {} chunks indexed in {}s", + "✓ ingest: {} docs ({} new, {} updated, {} unchanged, {} skipped), {} chunks indexed in {}s", state.counts.scanned, state.counts.new, 
state.counts.updated, + state.counts.unchanged, state.counts.skipped, state.counts.chunks_indexed, secs, From 2fd13db209f19203c22866670f8dc207dcf2b425 Mon Sep 17 00:00:00 2001 From: altair823 Date: Mon, 4 May 2026 18:27:05 +0000 Subject: [PATCH 13/15] docs(p9-fb-23): README + HANDOFF + HOTFIXES + INDEX + per-task spec Also fixes snapshot drift in code-and-table.canonical.snapshot.json introduced by task 2 (CanonicalDocument gains last_chunker_version + last_embedding_version fields). Co-Authored-By: Claude Opus 4.7 (1M context) --- .../code-and-table.canonical.snapshot.json | 2 + tasks/p9/p9-fb-23-incremental-ingest.md | 47 +++++++++++++++++++ 2 files changed, 49 insertions(+) create mode 100644 tasks/p9/p9-fb-23-incremental-ingest.md diff --git a/fixtures/markdown/code-and-table.canonical.snapshot.json b/fixtures/markdown/code-and-table.canonical.snapshot.json index 25628fd..db3907a 100644 --- a/fixtures/markdown/code-and-table.canonical.snapshot.json +++ b/fixtures/markdown/code-and-table.canonical.snapshot.json @@ -62,6 +62,8 @@ "doc_id": "6a9ef317c9c097ff3f6aeb317559bd83", "doc_version": 1, "lang": "en", + "last_chunker_version": null, + "last_embedding_version": null, "metadata": { "aliases": [], "created_at": "2023-11-14T22:13:20Z", diff --git a/tasks/p9/p9-fb-23-incremental-ingest.md b/tasks/p9/p9-fb-23-incremental-ingest.md new file mode 100644 index 0000000..8cce10d --- /dev/null +++ b/tasks/p9/p9-fb-23-incremental-ingest.md @@ -0,0 +1,47 @@ +--- +phase: P9 +component: kebab-app +task_id: p9-fb-23 +title: "Incremental ingest — skip unchanged docs (post-merge dogfooding)" +status: completed +depends_on: [p9-fb-03, p9-fb-07] +unblocks: [] +contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md +contract_sections: [§9 Versioning cascade, §2.4a IngestEvent, §3.x IngestReport] +source_feedback: 사용자 도그푸딩 2026-05-04 — 변하지 않은 문서 재처리 회피 요청. 
+---
+
+# p9-fb-23 — Incremental ingest
+
+Detailed design: `docs/superpowers/specs/2026-05-04-p9-fb-23-incremental-ingest-design.md`.
+Implementation plan: `docs/superpowers/plans/2026-05-04-p9-fb-23-incremental-ingest.md`.
+
+## Goal
+
+`kebab ingest` processes only changed or new docs. For unchanged docs, parse/chunk/embed/vector upsert are all skipped.
+
+## Behavior contract
+
+All 4 skip conditions must hold:
+1. New blake3 == `assets.checksum`.
+2. `documents.parser_version` == currently active.
+3. `documents.last_chunker_version` == currently active.
+4. `documents.last_embedding_version` == currently active (None == None also matches).
+
+If any one of these mismatches → normal path: parse/chunk/embed/vector upsert all run.
+
+`IngestOpts.force_reingest=true` → ignore the skip and force reprocessing.
+
+## Tests
+
+- Integration: the second ingest reports unchanged 1 / new 0 / updated 0.
+- Integration: `--force-reingest` bypasses the skip.
+- Unit: V006 migration, SQLite put/get_document roundtrip of the new columns, get_asset_by_workspace_path roundtrip.
+- Integration: the first ingest stamps the chunker/embedding versions.
+
+## Risks / notes
+
+- mtime pre-hash skip is not implemented (YAGNI, possible follow-up).
+- If the embedder model is swapped externally and the config is not updated, the doc is silently skipped. A follow-up task could have the doctor command detect the mismatch.
+
+Live deviations are recorded in: `tasks/HOTFIXES.md`, entry `2026-05-04 — p9-fb-23`.
From b4aba5de3c10d92b0025ba0bd089513d03204058 Mon Sep 17 00:00:00 2001
From: altair823
Date: Mon, 4 May 2026 18:27:27 +0000
Subject: [PATCH 14/15] docs(p9-fb-23): README + HANDOFF + HOTFIXES + INDEX sync
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Update user-facing docs to reflect incremental ingest feature: README
ingest row gains incremental skip + --force-reingest description,
HANDOFF adds summary entry, HOTFIXES adds detailed deviation entry,
INDEX links the new per-task spec.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 HANDOFF.md        |  1 +
 README.md         |  2 +-
 tasks/HOTFIXES.md | 25 +++++++++++++++++++++++++
 tasks/INDEX.md    |  1 +
 4 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/HANDOFF.md b/HANDOFF.md
index 86ee055..ca0d7c5 100644
--- a/HANDOFF.md
+++ b/HANDOFF.md
@@ -59,6 +59,7 @@ P0~P5 are sequential. P6~P9 can run in parallel after P5.
 - **2026-05-03 P9 dogfooding follow-up (p9-fb-12 partial)**: TUI vim-style mode machine (half shipped; heuristic removal is a follow-up). Adds the `kebab_tui::Mode::{Normal, Insert}` enum, `Mode::auto_for(pane)` (Library/Inspect/Jobs → Normal, Search/Ask → Insert), `Mode::label()` (`"-- NORMAL --"` / `"-- INSERT --"`), and the `App.mode: Mode` field. The run loop's `mode_intercept(app, key)` intercepts keys before dispatch: in Insert, `Esc` → Normal (everywhere); in Normal, `i` → Insert (Library/Inspect/Jobs only; Search/Ask are auto-Insert, so `i` is a typed char). The mode label is rendered in color at the right of the header (Insert = Role::Success green, Normal = Role::Heading cyan+bold). On pane switch, `app.mode = Mode::auto_for(p)` flips the mode automatically. **Deferred (HOTFIXES entry)**: `is_typing_mod` (search) and the input-empty heuristic (ask) will be replaced with mode-authoritative dispatch in a follow-up PR; for now only the user-visible signals (label + auto flip + i/Esc) ship, and key dispatch keeps the heuristics. spec status `in_progress` (not `completed`). spec: `tasks/p9/p9-fb-12-tui-mode-machine.md`.
 - **2026-05-03 P9 dogfooding follow-up (p9-fb-12 follow-up)**: heuristic removal (finalizes the part deferred by the partial PR). Deletes the `search::is_typing_mod` function (CTRL/ALT chord filter) and the input-empty heuristic in `ask::handle_key_ask`. New dispatch: the `i` (chunk inspect) / `g` (editor jump) pre-pass in `search::handle_key_search` fires only when `state.mode == Mode::Normal` (in Insert they are typed chars). In the main match, `j`/`k`/Char(c) branch on `state.mode` (Normal → move selection, Insert → input.push). `e`/`j`/`k` in `ask::handle_key_ask` follow the same pattern: toggle/scroll in Normal, input typing in Insert. The test fixtures (`tests/search.rs::fresh_app`, `tests/ask.rs::fresh_app`) mirror run-loop behavior with `app.mode = Mode::auto_for(focus)`. Existing nav tests (j_k_move, g_key_enqueues, e_toggles) add an explicit `app.mode = Mode::Normal`, and 4 new tests (j_in_insert_types / arbitrary_char_in_normal_noop / e_types_in_insert / jk_scroll-in-normal-type-in-insert) pin the mode-authoritative behavior. spec status `in_progress` → `completed`. spec: `tasks/p9/p9-fb-12-tui-mode-machine.md`.
 - **2026-05-03 P9 dogfooding follow-up (p9-fb-10 partial)**: TUI CJK rendering helpers. New `kebab-tui::input::{display_width, truncate_to_display_width}`: column-based width calculation on top of `unicode-width` (ASCII=1, Hangul/CJK/fullwidth=2, combining=0) plus char-boundary-safe truncation (a wide char is kept or omitted whole, never split; the ellipsis takes 1 col). Removes the duplicate private `truncate_to_display_width` fn in library.rs, leaving a single source. 9 unit tests (ASCII / Hangul / Japanese / mixed / truncate fits·overflow·zero-cols·wide-char-boundary / `String::pop` char-aware sanity) + 1 integration render test (Korean + Japanese fixture, TestBackend 80×20, confirms that Hangul/Japanese characters survive in the frame). Introducing the spec's `InputBuffer` struct (cursor tracking wide-char width in columns) is a follow-up: migrating String + cursor across the Ask/Search/Editor panes in one go has a large regression surface, so only the helpers are merged first. Backspace already uses `String::pop()` (char-aware) in every pane → byte-boundary safety holds even without the helper. crossterm 0.28 does not expose native IME composing, so preedit handling is out of scope. spec status `planned` → `in_progress`. spec: `tasks/p9/p9-fb-10-tui-cjk-input.md`.
+- **2026-05-04 P9 post-dogfooding (p9-fb-23)**: Incremental ingest. User dogfooding feedback: do not re-ingest documents that have not changed. When all four inputs (blake3 checksum + parser_version + chunker_version + embedding_version) match, parse/chunk/embed/vector upsert are all skipped. The SQLite V006 migration adds the `last_chunker_version` + `last_embedding_version` columns to `documents`. New `IngestItemKind::Unchanged` variant + `IngestReport.unchanged` + `AggregateCounts.unchanged` (additive to the wire schema). Introduces the `IngestOpts { progress, cancel, force_reingest }` struct, following the `AskOpts` pattern. The `--force-reingest` CLI flag bypasses the skip. The cost dominator (fastembed) now runs only for changed / new docs. spec: `tasks/p9/p9-fb-23-incremental-ingest.md`. The HOTFIXES entry `2026-05-04 — p9-fb-23` is the source of truth for the explicit version-cascade behavior.
 - **2026-05-04 P9 post-dogfooding (p9-fb-24)**: TUI status/key bar + Library column header + Ask/Inspect PgUp/PgDn. Combines three user dogfooding reports (Library columns had no stated meaning, no page-scroll keys, request for an always-visible status bar with version info) into a single PR.
The bottom area is split into two rows, a status bar (1 row: version + pane + docs + dynamic state) and a key hint bar (1 row: the existing `footer_hints` unchanged); the previously dedicated ingest progress row is absorbed into the status bar's dynamic slot (priority cascade: streaming → searching → indexing → idle). A `format_doc_header` row above the Library `List`, together with a Layout split, shows the header (TITLE / TAGS / UPDATED / CHUNKS, aligned by display width). New `kebab-tui::pager::PAGE_STEP = 10`: adds PgUp/PgDn to Ask, and Inspect's existing hardcoded +/-10 now references the same constant. Page-scroll in Ask freezes with `follow_tail = false`, same as `j`/`k`. spec: `tasks/p9/p9-fb-24-tui-affordances.md`. The HOTFIXES entry `2026-05-04 — p9-fb-24` is the source of truth for the layout conflict with the single-line footer row (p9-fb-13) and the dedicated ingest row (p9-fb-03).
 - **2026-05-04 P9 post-dogfooding (p9-fb-22)**: TUI input cursor mid-string editing + Ask follow-tail auto-scroll. Two issues: Gitea #94 (cursor cannot be moved after typing) + #95 (no auto-scroll on new responses). The `InputBuffer` cursor model is rebuilt on byte positions: when the cursor is at the end it is backwards-compatible with the existing append behavior, and mid-string it is edited with `←/→/Home/End/Delete`. `AskState` gains `follow_tail: bool` (default true). `Paragraph::line_count(width)` (with the ratatui `unstable-rendered-line-info` feature enabled) computes the wrapped row count every frame and pins the scroll to the bottom while follow-tail is on. `j`/`k` turn follow-tail off and `Shift-G` turns it back on. 12 new InputBuffer unit tests + 6 new Ask integration tests. spec: `tasks/p9/p9-fb-22-tui-cursor-and-autoscroll.md`. The HOTFIXES entry `2026-05-04` is the source of truth for the live cursor model.
 - **2026-05-03 P9 post-dogfooding (p9-fb-21)**: `i` is a universal Normal→Insert toggle (all panes). Previously mode_intercept intercepted `i` only in Library/Inspect/Jobs, while Search/Ask fell through (assuming auto INSERT). After dropping to NORMAL with Esc, the user had no key to return to Insert, a dead end reported during dogfooding. The `(Char('i'), Normal, _)` arm of mode_intercept now flips to INSERT regardless of pane. The Search chunk-inspect key is rebound `i`→`o` (vim "open") to resolve the conflict. The first fragment of the footer hint for every (pane, mode, filter) combination is `F1 도움말` ("F1 help", for cheatsheet binding discoverability). The Search/Ask Normal hints add an `i 입력모드` ("i insert mode") fragment. The cheatsheet popup's Global/Search/Ask sections are updated.
6 new unit tests + 3 existing ones updated. spec: `tasks/p9/p9-fb-21-tui-insert-key-discoverability.md` (went straight to status `completed`). The HOTFIXES entry is the source of truth for the Search `i`→`o` rebind.
diff --git a/README.md b/README.md
index cd6533e..d10f289 100644
--- a/README.md
+++ b/README.md
@@ -70,7 +70,7 @@ kebab doctor
 | Command | Behavior |
 |------|------|
 | `kebab init` | Create the data directory + config.toml under the XDG paths |
-| `kebab ingest []` | Index Markdown / images / PDFs (idempotent). On a TTY, a progress bar on stderr; on non-TTY (CI / pipe), one line at a time on stderr; with `--json`, streams `ingest_progress.v1` lines to stdout followed by a final `ingest_report.v1`. One Ctrl-C aborts after finishing the current asset (partial commits preserved, idempotent re-run); a second Ctrl-C is a hard exit. When a Markdown title is missing from the frontmatter, it is filled automatically from the first H1 → H2 → first 80 characters of the first paragraph → file name, in that order (parser_version `md-frontmatter-v2`); already-indexed docs also pick up the new title on the next ingest |
+| `kebab ingest []` | Index Markdown / images / PDFs (idempotent). On a TTY, a progress bar on stderr; on non-TTY (CI / pipe), one line at a time on stderr; with `--json`, streams `ingest_progress.v1` lines to stdout followed by a final `ingest_report.v1`. One Ctrl-C aborts after finishing the current asset (partial commits preserved, idempotent re-run); a second Ctrl-C is a hard exit. When a Markdown title is missing from the frontmatter, it is filled automatically from the first H1 → H2 → first 80 characters of the first paragraph → file name, in that order (parser_version `md-frontmatter-v2`); already-indexed docs also pick up the new title on the next ingest. **Incremental** (p9-fb-23): from the second ingest onward, parse/chunk/embed/vector upsert are skipped automatically for unchanged docs (blake3 + parser/chunker/embedder versions all identical). The final summary shows an `N unchanged` count. `--force-reingest` ignores the skip and forces reprocessing. |
 | `kebab search --mode {lexical,vector,hybrid} "" [--no-cache]` | Search. hybrid uses RRF fusion and includes citations. Repeating the same query (normalized with NFKC + trim + lowercase) within one process hits the in-process LRU cache (capacity = `[search] cache_capacity`, default 256). `--no-cache` forces a bypass, for debugging.
When an ingest commit occurs, the `kv['corpus_revision']` bump makes every entry stale automatically |
 | `kebab list docs` | List indexed documents |
 | `kebab inspect doc ` / `kebab inspect chunk ` | View the raw record |
diff --git a/tasks/HOTFIXES.md b/tasks/HOTFIXES.md
index 1b35c67..3574277 100644
--- a/tasks/HOTFIXES.md
+++ b/tasks/HOTFIXES.md
@@ -14,6 +14,31 @@ historical contract that was implemented; this file accumulates the deltas so phase 5+ readers can find the live behavior without diffing git history.
+## 2026-05-04 — p9-fb-23 (post-dogfooding): Incremental ingest
+
+**Source feedback**: user dogfooding 2026-05-04: "When new documents are added to the folder, I want ingest to skip documents that have not changed and process only changed or newly added ones."
+
+**Live binding changes**:
+
+- SQLite V006 migration: adds nullable TEXT columns `last_chunker_version` + `last_embedding_version` to `documents`. Existing rows are NULL → always a mismatch on the first ingest → forced reprocessing (safe default).
+- New `kebab-core::IngestItemKind::Unchanged` variant (semantically distinct from the existing `Skipped`: `Skipped` = media-type filter, `Unchanged` = all versions match).
+- New `IngestReport.unchanged: u32` + `AggregateCounts.unchanged: u32`. The `unchanged` field is additive to the `ingest_report.v1` wire schema (v1 compatibility preserved).
+- New `kebab-app::IngestOpts { progress, cancel, force_reingest }` struct, following the `AskOpts` pattern. Existing wrappers such as `ingest_with_config_cancellable` are preserved; the new `ingest_with_config_opts` takes IngestOpts.
+- Early-skip block in the `kebab-app::ingest_with_config_opts` asset loop: when `force_reingest=false` and all 4 conditions hold (asset_blake3 matches + doc_id exists + last_chunker_version matches + last_embedding_version matches), emit `IngestEvent::AssetFinished{result: Unchanged}` + `aggregate.unchanged += 1` + `continue` (parse/chunk/embed/vector upsert all skipped). Applied to all three flows (md / image / pdf).
+- At the end of the normal path, stamp `CanonicalDocument.last_chunker_version` + `last_embedding_version` with the currently active versions.
+- Adds the `--force-reingest` flag to `kebab-cli` (bypasses the skip and forces reprocessing).
+- Both the final and aborted lines of `kebab-tui::ingest_progress::status_line` expose `unchanged=N`.
+
+**Spec contract impact**: adds explicit behavior to the design §9 versioning cascade: after a parser/chunker/embedder version bump, the next ingest automatically processes every doc as `updated`. Previously the new version was silently overwritten (idempotent UPSERT); this change guarantees both an explicit refresh and cost avoidance. Adds the `Unchanged` variant to design §3.x IngestReport / §2.4a IngestEvent (additive, wire v1 compatible).
+
+**Tests added**: about 10 new. The existing ~720 workspace tests pass unmodified.
+
+**Known limitation (deferred)**:
+
+- Mtime-based pre-hash skip is not implemented: blake3 streaming still runs unconditionally on every scan.
+- Watch-mode (real-time file change detection) is a follow-up task.
+- Stale skip risk: if the user swaps the embedder model externally without updating `models.embedding.id` in the config, last_embedding_version still matches → silent skip. A follow-up task could have the doctor command detect the mismatch and recommend a fix.
+
 ## 2026-05-04 — p9-fb-24 (post-dogfooding): TUI status bar + Library header + page scroll

 **Source feedback**: user dogfooding 2026-05-04: (1) the Library has no header saying what its columns mean, (2) both the Ask transcript and Inspect need page-wise scroll keys, (3) it would be nice to have a status bar and key hint bar (including version info) that are always visible in every mode.
diff --git a/tasks/INDEX.md b/tasks/INDEX.md
index 5a4dfba..172de2d 100644
--- a/tasks/INDEX.md
+++ b/tasks/INDEX.md
@@ -106,6 +106,7 @@ P0~P5 are sequential. P6~P9 can run in parallel after P5.
- [p9-fb-20 citation surface](p9/p9-fb-20-citation-surface.md) - [p9-fb-21 Insert-key + F1 visibility (post-도그푸딩)](p9/p9-fb-21-tui-insert-key-discoverability.md) - [p9-fb-22 cursor mid-string editing + Ask follow-tail (post-도그푸딩)](p9/p9-fb-22-tui-cursor-and-autoscroll.md) + - [p9-fb-23 incremental ingest (post-도그푸딩)](p9/p9-fb-23-incremental-ingest.md) - [p9-fb-24 status bar + Library header + page scroll (post-도그푸딩)](p9/p9-fb-24-tui-affordances.md) ## Post-merge 핫픽스 From 8d0744c22b7dae372512dad4c277033735d3f68b Mon Sep 17 00:00:00 2001 From: altair823 Date: Mon, 4 May 2026 18:33:28 +0000 Subject: [PATCH 15/15] =?UTF-8?q?review(p9-fb-23):=20=ED=9A=8C=EC=B0=A8=20?= =?UTF-8?q?1=20nit=20=EB=B0=98=EC=98=81=20=E2=80=94=20named=20columns=20+?= =?UTF-8?q?=20safe=20byte=5Flen=20+=20trait=20check=20+=20count?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Opus 4.7 (1M context) --- crates/kebab-store-sqlite/src/documents.rs | 36 ++++++++++++---------- tasks/HOTFIXES.md | 2 +- 2 files changed, 21 insertions(+), 17 deletions(-) diff --git a/crates/kebab-store-sqlite/src/documents.rs b/crates/kebab-store-sqlite/src/documents.rs index a2568a1..ac59939 100644 --- a/crates/kebab-store-sqlite/src/documents.rs +++ b/crates/kebab-store-sqlite/src/documents.rs @@ -507,24 +507,21 @@ fn rows_optional(err: rusqlite::Error) -> rusqlite::Result> { } /// Reconstruct a [`kebab_core::RawAsset`] from one `assets` row. -/// -/// Column order must match the SELECT in -/// [`DocumentStore::get_asset_by_workspace_path`]: -/// `asset_id(0), source_uri(1), workspace_path(2), media_type(3), -/// byte_len(4), checksum(5), storage_kind(6), storage_path(7), -/// discovered_at(8)`. +/// Row mapper for `RawAsset`. Column names are self-documenting; the +/// SELECT in [`DocumentStore::get_asset_by_workspace_path`] must include +/// all nine columns by their schema names. 
fn asset_from_row(row: &rusqlite::Row<'_>) -> rusqlite::Result { use std::path::PathBuf; - let asset_id: String = row.get(0)?; - let source_uri_raw: String = row.get(1)?; - let workspace_path_raw: String = row.get(2)?; - let media_type_json: String = row.get(3)?; - let byte_len: i64 = row.get(4)?; - let checksum_raw: String = row.get(5)?; - let storage_kind: String = row.get(6)?; - let storage_path_raw: String = row.get(7)?; - let discovered_at_raw: String = row.get(8)?; + let asset_id: String = row.get("asset_id")?; + let source_uri_raw: String = row.get("source_uri")?; + let workspace_path_raw: String = row.get("workspace_path")?; + let media_type_json: String = row.get("media_type")?; + let byte_len: i64 = row.get("byte_len")?; + let checksum_raw: String = row.get("checksum")?; + let storage_kind: String = row.get("storage_kind")?; + let storage_path_raw: String = row.get("storage_path")?; + let discovered_at_raw: String = row.get("discovered_at")?; // Parse source_uri: stored as "file://" or "kb://". let source_uri = if let Some(path_str) = source_uri_raw.strip_prefix("file://") { @@ -558,7 +555,14 @@ fn asset_from_row(row: &rusqlite::Row<'_>) -> rusqlite::Result