fix(kebab-app): p9-fb-23 — Incremental ingest (skip unchanged docs) #98

Merged
altair823 merged 15 commits from fix/p9-fb-23-incremental-ingest into main 2026-05-05 02:22:38 +00:00

15 Commits

Author SHA1 Message Date
8d0744c22b review(p9-fb-23): 회차 1 nit 반영 — named columns + safe byte_len + trait check + count
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:33:28 +00:00
b4aba5de3c docs(p9-fb-23): README + HANDOFF + HOTFIXES + INDEX sync
Update user-facing docs to reflect incremental ingest feature:
README ingest row gains incremental skip + --force-reingest description,
HANDOFF adds summary entry, HOTFIXES adds detailed deviation entry,
INDEX links the new per-task spec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:27:27 +00:00
2fd13db209 docs(p9-fb-23): README + HANDOFF + HOTFIXES + INDEX + per-task spec
Also fixes snapshot drift in code-and-table.canonical.snapshot.json
introduced by task 2 (CanonicalDocument gains last_chunker_version +
last_embedding_version fields).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:27:05 +00:00
1063777293 feat(kebab-tui): p9-fb-23 task 9 — status_line surfaces unchanged count
Updates the terminal (completed) and aborted branches of status_line
to include the unchanged counter alongside new/updated/skipped, so
users can see how many assets were skipped via the incremental-ingest
early-skip path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:16:44 +00:00
06aaae4eb8 feat(kebab-cli): p9-fb-23 task 8 — --force-reingest flag
Adds `--force-reingest` to the `ingest` subcommand and wires it
through `IngestOpts` into `ingest_with_config_opts`, bypassing the
per-asset early-skip path when set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:15:35 +00:00
0e6d6073e7 feat(kebab-app): p9-fb-23 task 7 — early-skip Unchanged path in ingest
Adds the per-asset incremental-ingest skip block to all three flows
(markdown / image / pdf). When `IngestOpts::force_reingest = false`
AND the asset's blake3 checksum + parser/chunker/embedding versions
all match the existing DB record, ingest emits
`AssetFinished { result: Unchanged }`, bumps `aggregate.unchanged`,
and skips parse / chunk / embed / vector upsert entirely.

Shared `try_skip_unchanged` helper performs the four checks; per-flow
callers supply the active parser_version + chunker_version + optional
embedding_version. `force_reingest = true` bypasses the skip path so
`incremental_ingest::force_reingest_bypasses_skip` still sees `Updated`.

Tests:
- new `incremental_ingest.rs` covers both paths.
- existing `ingest_idempotent_on_second_run` /
  `re_ingest_image_produces_*` / `re_ingest_identical_pdf_produces_*`
  updated to assert `Unchanged` on identical-bytes re-ingest (the
  pre-task behaviour was `Updated`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:12:47 +00:00
4874304d5d refactor(kebab-app): p9-fb-23 task 6 — IngestOpts struct + ingest_with_config_opts entry
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:04:50 +00:00
a16e9c9215 feat(kebab-app): p9-fb-23 task 5 — stamp chunker + embedding versions on CanonicalDocument before put_document
All three ingest flows (markdown, image, pdf) now set
last_chunker_version and last_embedding_version on the CanonicalDocument
before calling put_document, giving Task 7's skip detection the data it
needs on the second run. No skip path is added yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:01:48 +00:00
366e89e5e2 feat(kebab-store-sqlite): p9-fb-23 task 4 — get_asset_by_workspace_path
Add `DocumentStore::get_asset_by_workspace_path` trait method to
`kebab-core` and implement it on `SqliteStore` via a private
`asset_from_row` helper. Used by the incremental-ingest skip path to
compare a freshly-computed blake3 checksum against the persisted row
without a full round-trip through `put_asset_with_bytes`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:58:23 +00:00
4261c8953c feat(kebab-store-sqlite): p9-fb-23 task 3 — V006 migration + put/get_document round-trip version stamps
Add V006__incremental_ingest.sql to persist last_chunker_version and
last_embedding_version on the documents table. Wire both columns into
upsert_document (INSERT + ON CONFLICT UPDATE) and get_document (SELECT +
row mapper), replacing the previous hardcoded None. Add two round-trip
tests in tests/incremental_ingest.rs covering the set and None cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:53:30 +00:00
f867b36afb feat(kebab-core): p9-fb-23 task 2 — CanonicalDocument gains last_chunker_version + last_embedding_version
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:50:25 +00:00
0684b3ad66 review(p9-fb-23-task1): fix missed IngestReport construction sites + snapshot
reviewer-flagged: aa2a6ea claimed build clean but missed:
- crates/kebab-store-sqlite/tests/ingest_report_snapshot.rs (test fixture)
- crates/kebab-cli/src/wire.rs (test fixture)
- crates/kebab-store-sqlite/snapshots/ingest_report.snapshot.json (snapshot)

All three add `unchanged: 0` (or `\"unchanged\": 0`) to match the new
IngestReport.unchanged field. cargo clippy --workspace --all-targets
-- -D warnings now clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:47:13 +00:00
aa2a6ea7fc feat(kebab-core): p9-fb-23 task 1 — IngestItemKind::Unchanged + IngestReport.unchanged
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:43:52 +00:00
6a8d155da9 plan(p9-fb-23): TDD implementation plan — 10 tasks
Spec → 10-step plan, TDD per task (failing test → impl → pass → commit).

Tasks:
1. IngestItemKind::Unchanged + IngestReport.unchanged + AggregateCounts.unchanged + wire schema additive
2. CanonicalDocument 에 last_chunker_version + last_embedding_version Option 필드 추가 + 14 callers None 채움
3. V006 migration + SQLite put/get_document round-trip 신규 컬럼
4. DocumentStore::get_asset_by_workspace_path trait + SQLite impl
5. ingest pipeline 이 CanonicalDocument 에 현 chunker/embedding version stamp (no skip yet)
6. IngestOpts { progress, cancel, force_reingest } struct + ingest_with_config_opts entry (AskOpts 패턴)
7. asset 루프 early-skip 블록 (4 조건 match → Unchanged + continue)
8. CLI --force-reingest flag
9. TUI status_line 에 unchanged=N 노출
10. docs sync — README + HANDOFF + HOTFIXES + INDEX + per-task spec

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:34:30 +00:00
ddec3d8a94 spec(p9-fb-23): incremental ingest — skip unchanged docs
도그푸딩 피드백: 변경/신규 doc 만 ingest, 변하지 않은 문서는 skip.

설계 핵심:

- Skip 조건 4 개 (full version cascade): blake3 checksum + parser_version
  + chunker_version + embedding_version 모두 일치 시 parse/chunk/embed/
  vector upsert 회피. 비용 dominator (fastembed) 가 변경된 / 새 doc 에만.
- SQLite V006 migration — `documents` 에 `last_chunker_version` +
  `last_embedding_version` column 추가. 기존 row NULL → 첫 ingest 강제
  재처리 (안전 default).
- `IngestItemKind::Unchanged` enum variant 신규 (기존 `Skipped` 와
  의미 분리 — `Skipped` 는 media-type 필터, `Unchanged` 는 모든 versions
  match).
- `IngestReport` + `AggregateCounts` 에 `unchanged: u32` 필드 추가.
  wire schema additive — v1 호환 유지.
- `--force-reingest` flag — skip 무시하고 강제 재처리.
- TUI status_line final 에 `unchanged=N` 노출 (p9-fb-24 status bar
  dynamic slot 자동 cascade).

Spec status `planned`. 다음 단계: writing-plans skill 로 implementation
plan 작성.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:27:06 +00:00