fix(kebab-store-vector): close P7-3 vector orphan caveat — delete_by_chunk_ids
P7-3 의 storage UNIQUE bug fix 가 SQLite 측 (documents → blocks /
chunks / embedding_records) 만 sweep 했음. LanceDB 의 vector 는 별도
store 라 옛 chunk_id 를 가진 row 가 디스크에 잔존. 검색에는 영향 없지만
디스크는 무한 누적. HOTFIXES `2026-05-02 P7-3` caveat 의 "P+ task" 약속을
같은 후속 PR 안에서 닫음.
변경:
- `VectorStore::delete_by_chunk_ids(&[ChunkId])` trait method 추가 (default
no-op 제공 — 테스트 fake / 기존 impl 이 그대로 컴파일).
- `LanceVectorStore::delete_by_chunk_ids` 가 connection 의 모든
`chunk_embeddings_*` 테이블을 순회 + `Table::delete("chunk_id IN (...)")`
를 batch=200 단위로 실행. 다중 모델 워크스페이스 (마이그레이션 중간 등)
에서도 안전.
- `SqliteStore::stale_chunk_ids_at(workspace_path, new_asset_id)` 가
read-only SELECT 로 옛 chunk_id 들 반환. CASCADE 가 흐르기 *전* 에
caller 가 호출.
- `kebab-app::purge_vector_orphans_for_workspace_path` 가 위 두 단계를
orchestrate. 세 ingest path (markdown / image / pdf) 의
`put_asset_with_bytes` 호출 직전에 한 줄로 호출.
Smoke 검증 (release binary, fastembed enabled):
- whitepaper.pdf 첫 ingest → chunk_ids = {f616…, 4e0f…}, vector store 에
그 두 ID 의 row 존재.
- byte 변경 후 re-ingest → 새 doc_id (3741…) + 새 chunk_ids
(ed0c…, e13c…). vector search "REWRITTEN chapter two" → 새 chunk_ids 만
hit. 옛 query "Edited page two body" 시도해도 옛 chunk_ids 는 vector
store 에 더 이상 없음 (의미적으로 가장 가까운 새 chunks 가 hit).
HOTFIXES `2026-05-02 P7-3` 의 \"vector store cleanup\" 항목이 \"deferred\" →
\"closed by follow-up PR\" 로 갱신. SMOKE.md 의 알려진 동작 (\"옛 vector
잔존\") 도 \"두 store 정합\" 으로 갱신.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -615,6 +615,7 @@ fn ingest_one_asset(
|
||||
// (per-document tx semantics per design §5.8); composing them is
|
||||
// the kb-app job. A failure mid-way leaves the DB in a state the
|
||||
// next ingest run can re-converge (UPSERT + DELETE-then-INSERT).
|
||||
purge_vector_orphans_for_workspace_path(app, asset, vector_store)?;
|
||||
app.sqlite
|
||||
.put_asset_with_bytes(asset, &bytes)
|
||||
.context("DocumentStore::put_asset_with_bytes")?;
|
||||
@@ -841,6 +842,7 @@ fn ingest_one_image_asset(
|
||||
.context("kb-chunk::MdHeadingV1Chunker::chunk (image)")?;
|
||||
|
||||
// 5. Persist + embed — identical sequence to markdown.
|
||||
purge_vector_orphans_for_workspace_path(app, asset, vector_store)?;
|
||||
app.sqlite
|
||||
.put_asset_with_bytes(asset, &bytes)
|
||||
.context("DocumentStore::put_asset_with_bytes (image)")?;
|
||||
@@ -949,6 +951,46 @@ fn record_image_analysis_failure(
|
||||
warning_notes.push(note);
|
||||
}
|
||||
|
||||
/// HOTFIXES 2026-05-02 P7-3 follow-up: when a tracked file's bytes
|
||||
/// change, `purge_orphan_at_workspace_path` (in `kebab-store-sqlite`)
|
||||
/// sweeps the SQLite chain (documents → blocks / chunks / embedding_records)
|
||||
/// but the LanceDB rows keyed on the now-deleted `chunk_id`s live in a
|
||||
/// separate store. This helper fetches the stale `chunk_id`s from
|
||||
/// SQLite **before** they get cascade-deleted, then deletes the
|
||||
/// matching vectors from every Lance table.
|
||||
///
|
||||
/// Called by every per-medium ingest helper at the same point —
|
||||
/// immediately before `put_asset_with_bytes` runs, so the SELECT
|
||||
/// still sees the old chunk_ids and the DELETE happens before the
|
||||
/// new rows land. Empty workspace_path / no embedder → no-op.
|
||||
fn purge_vector_orphans_for_workspace_path(
|
||||
app: &App,
|
||||
asset: &RawAsset,
|
||||
vector_store: Option<&Arc<kebab_store_vector::LanceVectorStore>>,
|
||||
) -> anyhow::Result<()> {
|
||||
let Some(vec_store) = vector_store else {
|
||||
return Ok(());
|
||||
};
|
||||
let stale = app
|
||||
.sqlite
|
||||
.stale_chunk_ids_at(&asset.workspace_path.0, &asset.asset_id.0)
|
||||
.context("SqliteStore::stale_chunk_ids_at")?;
|
||||
if stale.is_empty() {
|
||||
return Ok(());
|
||||
}
|
||||
use kebab_core::VectorStore as _;
|
||||
vec_store
|
||||
.delete_by_chunk_ids(&stale)
|
||||
.context("VectorStore::delete_by_chunk_ids (orphan vector cleanup)")?;
|
||||
tracing::debug!(
|
||||
target: "kebab-app",
|
||||
path = %asset.workspace_path.0,
|
||||
count = stale.len(),
|
||||
"purged orphan vectors for edited asset"
|
||||
);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// P7-3: process one `MediaType::Pdf` asset end-to-end.
|
||||
///
|
||||
/// - Reads bytes from disk.
|
||||
@@ -1024,6 +1066,7 @@ fn ingest_one_pdf_asset(
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::PdfPageV1Chunker::chunk")?;
|
||||
|
||||
purge_vector_orphans_for_workspace_path(app, asset, vector_store)?;
|
||||
app.sqlite
|
||||
.put_asset_with_bytes(asset, &bytes)
|
||||
.context("DocumentStore::put_asset_with_bytes (pdf)")?;
|
||||
|
||||
@@ -171,6 +171,18 @@ pub trait VectorStore {
|
||||
k: usize,
|
||||
filters: &SearchFilters,
|
||||
) -> anyhow::Result<Vec<VectorHit>>;
|
||||
|
||||
/// Delete every vector whose `chunk_id` appears in `chunk_ids`.
|
||||
///
|
||||
/// Used by `kebab-app` after `purge_orphan_at_workspace_path` sweeps
|
||||
/// the SQLite side on a byte-edit re-ingest, so the LanceDB rows
|
||||
/// keyed on the now-deleted `chunk_id`s do not stay on disk
|
||||
/// forever. Empty input is a no-op. The default impl is a no-op so
|
||||
/// older `VectorStore` impls (e.g. test fakes) keep compiling
|
||||
/// without behavioural change.
|
||||
fn delete_by_chunk_ids(&self, _chunk_ids: &[crate::ids::ChunkId]) -> anyhow::Result<()> {
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
pub trait JobRepo {
|
||||
|
||||
@@ -308,6 +308,50 @@ fn temp_path_for(dest: &Path) -> PathBuf {
|
||||
parent.join(format!("{file_name}.tmp.{pid}.{n}"))
|
||||
}
|
||||
|
||||
impl SqliteStore {
|
||||
/// SELECT every `chunks.chunk_id` whose owning document points at a
|
||||
/// stale `asset_id` for `workspace_path` (i.e. the file's bytes have
|
||||
/// changed since the last ingest, producing a brand-new
|
||||
/// `asset_id`).
|
||||
///
|
||||
/// Called by `kebab-app::ingest_one_*_asset` BEFORE
|
||||
/// `put_asset_with_bytes` so the caller can hand the IDs to
|
||||
/// `VectorStore::delete_by_chunk_ids`. After the SQLite cleanup
|
||||
/// runs (CASCADE on `documents` → `chunks`) the same chunk_ids
|
||||
/// would be unreadable. Returns an empty Vec when no stale row
|
||||
/// exists at `workspace_path`.
|
||||
///
|
||||
/// Read-only — does not mutate. The actual sweep happens inside
|
||||
/// `purge_orphan_at_workspace_path` further down the pipeline.
|
||||
pub fn stale_chunk_ids_at(
|
||||
&self,
|
||||
workspace_path: &str,
|
||||
new_asset_id: &str,
|
||||
) -> Result<Vec<kebab_core::ChunkId>> {
|
||||
let conn = self.lock_conn();
|
||||
let mut stmt = conn
|
||||
.prepare(
|
||||
"SELECT c.chunk_id
|
||||
FROM chunks c
|
||||
INNER JOIN documents d ON c.doc_id = d.doc_id
|
||||
INNER JOIN assets a ON d.asset_id = a.asset_id
|
||||
WHERE a.workspace_path = ?1 AND a.asset_id != ?2",
|
||||
)
|
||||
.map_err(StoreError::from)?;
|
||||
let rows = stmt
|
||||
.query_map(params![workspace_path, new_asset_id], |row| {
|
||||
row.get::<_, String>(0)
|
||||
})
|
||||
.map_err(StoreError::from)?;
|
||||
let mut out: Vec<kebab_core::ChunkId> = Vec::new();
|
||||
for row in rows {
|
||||
let id = row.map_err(StoreError::from)?;
|
||||
out.push(kebab_core::ChunkId(id));
|
||||
}
|
||||
Ok(out)
|
||||
}
|
||||
}
|
||||
|
||||
/// Sweep stale `assets` + `documents` + downstream rows when the file
|
||||
/// at `workspace_path` is being re-ingested with bytes that produce a
|
||||
/// **different** `asset_id` (i.e. the file was edited).
|
||||
@@ -330,13 +374,12 @@ fn temp_path_for(dest: &Path) -> PathBuf {
|
||||
/// delete the on-disk byte file at `storage_path` so the data dir
|
||||
/// doesn't accumulate orphans across edits.
|
||||
///
|
||||
/// **Caveat — vector store orphans.** `embedding_records.chunk_id`
|
||||
/// CASCADE clears the SQLite side, but the LanceDB rows keyed on
|
||||
/// `chunk_id` live in a separate store and are not touched here.
|
||||
/// Stale vectors do not affect retrieval (search joins through
|
||||
/// SQLite, so an orphan vector is never surfaced) but they consume
|
||||
/// disk in `data_dir/lancedb/`. A future task should reconcile by
|
||||
/// `chunk_id` set diff. Tracked alongside this entry in HOTFIXES.
|
||||
/// **Vector store cleanup**: `embedding_records.chunk_id` CASCADE
|
||||
/// clears the SQLite side, but the LanceDB rows live in a separate
|
||||
/// store. The caller (`kebab-app::ingest_one_*_asset`) is responsible
|
||||
/// for fetching `stale_chunk_ids_at` BEFORE this purge runs and
|
||||
/// calling `VectorStore::delete_by_chunk_ids` on those IDs. The
|
||||
/// follow-up PR for HOTFIXES 2026-05-02 P7-3 wires this in.
|
||||
pub(crate) fn purge_orphan_at_workspace_path(
|
||||
conn: &Connection,
|
||||
workspace_path: &str,
|
||||
|
||||
@@ -283,6 +283,76 @@ impl VectorStore for LanceVectorStore {
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Delete every Lance row whose `chunk_id` matches one of the
|
||||
/// supplied IDs. Iterates *all* `chunk_embeddings_*` tables in the
|
||||
/// connection — a single chunk_id only ever lives in one table
|
||||
/// (one-model-per-workspace today, see `INDEX_VERSION` in
|
||||
/// `paths.rs`), but the loop keeps the helper correct should the
|
||||
/// workspace ever maintain multiple tables (e.g. mid-migration
|
||||
/// between embedding models).
|
||||
///
|
||||
/// Wired in by `kebab-app::ingest_one_*_asset` after the SQLite
|
||||
/// side has been swept by `purge_orphan_at_workspace_path` —
|
||||
/// closes the "vector store orphan" caveat from HOTFIXES
|
||||
/// 2026-05-02 P7-3.
|
||||
fn delete_by_chunk_ids(&self, chunk_ids: &[kebab_core::ChunkId]) -> Result<()> {
|
||||
if chunk_ids.is_empty() {
|
||||
return Ok(());
|
||||
}
|
||||
// SQL IN() list. chunk_ids are 32-hex-char blake3 prefixes
|
||||
// (validated upstream), so SQL injection is structurally
|
||||
// impossible — we still quote to keep the predicate
|
||||
// syntactically valid. We chunk into batches of 200 to keep the
|
||||
// WHERE clause within typical SQL parser limits.
|
||||
const BATCH: usize = 200;
|
||||
self.runtime.block_on(async {
|
||||
let names = self
|
||||
.connection
|
||||
.table_names()
|
||||
.execute()
|
||||
.await
|
||||
.context("table_names")?;
|
||||
for name in names {
|
||||
if !name.starts_with("chunk_embeddings_") {
|
||||
continue;
|
||||
}
|
||||
let table = match self.connection.open_table(&name).execute().await {
|
||||
Ok(t) => t,
|
||||
Err(e) => {
|
||||
tracing::warn!(
|
||||
target: "kebab-store-vector",
|
||||
table = %name,
|
||||
error = %e,
|
||||
"delete_by_chunk_ids: skipped unopenable table"
|
||||
);
|
||||
continue;
|
||||
}
|
||||
};
|
||||
for batch in chunk_ids.chunks(BATCH) {
|
||||
let list = batch
|
||||
.iter()
|
||||
.map(|id| format!("'{}'", id.0))
|
||||
.collect::<Vec<_>>()
|
||||
.join(",");
|
||||
let predicate = format!("chunk_id IN ({list})");
|
||||
table
|
||||
.delete(&predicate)
|
||||
.await
|
||||
.with_context(|| {
|
||||
format!("Lance delete on {name} ({} ids)", batch.len())
|
||||
})?;
|
||||
}
|
||||
}
|
||||
anyhow::Ok(())
|
||||
})?;
|
||||
tracing::debug!(
|
||||
target: "kebab-store-vector",
|
||||
count = chunk_ids.len(),
|
||||
"deleted vector rows by chunk_id"
|
||||
);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn search(
|
||||
&self,
|
||||
query_vec: &[f32],
|
||||
|
||||
@@ -217,6 +217,6 @@ rm -rf /tmp/kebab-smoke # 통째로 정리
|
||||
- (P6-4) `image.ocr.enabled = true` + `image.caption.enabled = true` 인 워크스페이스에 PNG 가 N장 있으면 ingest 시간 ≈ markdown_time + N × (OCR + Caption latency). `gemma4:e4b` + 192.168.0.47 로 자산당 ~5-10초. 다수의 책 페이지를 이미지로 넣지 말 것 — 책은 P7 PDF 라인 사용 권장.
|
||||
- (P7-3) `config.chunking.chunker_version` 는 markdown 만 represent — PDF 자산은 `pdf-page-v1` 하드코딩. `config.toml` 의 `chunker_version = "md-heading-v1"` 을 봐도 PDF 는 영향 안 받음. HOTFIXES `2026-05-02 P7-3` entry 참조 (P+ chunker registry task 까지 유지).
|
||||
- (P7-3) 한 PDF 가 N 페이지면 `kebab ingest` 가 N 개 (또는 그 이상의, 페이지 길면 multi-chunk) 의 chunk 를 한 transaction 안에서 commit. 500 페이지 책 → 500+ chunk 한 번에 → embedding throughput 가 bottleneck. 임베딩 활성 워크스페이스에서 큰 PDF 를 처음 ingest 하면 분-단위 시간 + WAL 크기 증가 가능 — P+ 스케일 hardening task 까지 정상 동작이지만 비용은 측정 가능.
|
||||
- (P7-3) 동일 path 에 byte 가 다른 PDF 를 두 번째 ingest 하면 `purge_orphan_at_workspace_path` 가 옛 doc / chunks / embeddings 를 sweep 하고 새 byte 가 새 `doc_id` 로 색인됨. `IngestReport` 에 그 자산만 `new+=1` (다른 자산은 `updated`). LanceDB 는 별도 store 라 옛 vector 가 잔존하지만 검색에는 영향 없음 (SQLite join 으로 surface 안 됨) — 디스크 cleanup 은 P+.
|
||||
- (P7-3 + follow-up) 동일 path 에 byte 가 다른 PDF 를 두 번째 ingest 하면 `purge_vector_orphans_for_workspace_path` 가 옛 chunk_id 를 LanceDB 에서 먼저 삭제, 이어서 `purge_orphan_at_workspace_path` 가 옛 doc / chunks / embedding_records 를 SQLite 에서 sweep. 새 byte 가 새 `doc_id` 로 색인됨. `IngestReport` 에 그 자산만 `new+=1` (다른 자산은 `updated`). 두 store 모두 정합 — 옛 본문 검색 시 옛 chunks 가 더 이상 surface 되지 않음.
|
||||
|
||||
자세한 history 와 발견된 버그는 [tasks/HOTFIXES.md](../tasks/HOTFIXES.md) 참조.
|
||||
|
||||
@@ -32,7 +32,13 @@ git history.
|
||||
3. DELETEs the stale `assets` row, freeing the `workspace_path` slot.
|
||||
4. If the stale storage was `copied`, best-effort removes the byte file at `storage_path` so `data_dir/assets/` does not accumulate orphans across edits.
|
||||
|
||||
**Caveat (still deferred)**: `embedding_records.chunk_id` CASCADE clears the SQLite side, but the LanceDB rows keyed on `chunk_id` live in a separate store and are not touched. Stale vectors do not affect retrieval (search joins through SQLite, so an orphan vector is never surfaced) but they consume disk in `data_dir/lancedb/`. A future task should reconcile by `chunk_id` set diff.
|
||||
**Vector store cleanup (closed by follow-up PR)**: `embedding_records.chunk_id` CASCADE clears the SQLite side, but LanceDB lives in a separate store. The follow-up PR adds:
|
||||
- `VectorStore::delete_by_chunk_ids` trait method (default impl no-op for older fakes).
|
||||
- `LanceVectorStore::delete_by_chunk_ids` iterates every `chunk_embeddings_*` table in the connection and runs `Table::delete("chunk_id IN (...)")` in batches of 200.
|
||||
- `SqliteStore::stale_chunk_ids_at(workspace_path, new_asset_id)` SELECT helper (read-only) that fetches the stale chunk_ids before they get cascade-deleted.
|
||||
- `kebab-app::purge_vector_orphans_for_workspace_path` orchestrator. Each per-medium ingest helper (`ingest_one_asset` markdown branch, `ingest_one_image_asset`, `ingest_one_pdf_asset`) calls it immediately before `put_asset_with_bytes` so the stale Lance rows go away in lockstep with the SQLite cascade.
|
||||
|
||||
Verified end-to-end via the SMOKE runbook: edit a tracked PDF → re-ingest → vector search for the old body text returns the *new* chunks (semantic nearest-neighbour) and the old chunk_ids are not present in the vector store.
|
||||
|
||||
The previously-`#[ignore]`d `re_ingest_edited_pdf_produces_new_doc_id` integration test runs by default after this fix, plus a dedicated unit test `put_asset_with_bytes_sweeps_workspace_path_orphan` in `kebab-store-sqlite::tests::asset_writer` that exercises the no-documents flavour. Verified end-to-end via the SMOKE runbook: `kebab ingest` → edit a tracked PDF → `kebab ingest` reports `new=1` for that asset (rest `updated`) and the prior doc/chunks are gone from `inspect` / `list docs`.
|
||||
|
||||
|
||||
Reference in New Issue
Block a user