fix(kebab-store-vector): close P7-3 vector orphan caveat — delete_by_chunk_ids

The P7-3 storage UNIQUE bug fix swept only the SQLite side (documents → blocks / chunks / embedding_records). LanceDB vectors live in a separate store, so rows carrying the old chunk_ids remained on disk. Search was unaffected, but disk usage grew without bound. This closes the "P+ task" promised in the HOTFIXES `2026-05-02 P7-3` caveat, within the same follow-up PR.

Changes:
- Add the `VectorStore::delete_by_chunk_ids(&[ChunkId])` trait method (with a default no-op, so test fakes and existing impls compile unchanged).
- `LanceVectorStore::delete_by_chunk_ids` iterates over every `chunk_embeddings_*` table on the connection and runs `Table::delete("chunk_id IN (...)")` in batches of 200. This is safe in multi-model workspaces as well (e.g. mid-migration).
- `SqliteStore::stale_chunk_ids_at(workspace_path, new_asset_id)` returns the old chunk_ids via a read-only SELECT. The caller invokes it *before* the CASCADE runs.
- `kebab-app::purge_vector_orphans_for_workspace_path` orchestrates the two steps above. Each of the three ingest paths (markdown / image / pdf) calls it in a single line immediately before `put_asset_with_bytes`.

Smoke verification (release binary, fastembed enabled):
- First ingest of whitepaper.pdf → chunk_ids = {f616…, 4e0f…}; the vector store holds rows for both IDs.
- Re-ingest after a byte change → new doc_id (3741…) + new chunk_ids (ed0c…, e13c…). Vector search "REWRITTEN chapter two" → only the new chunk_ids hit. Even with the old query "Edited page two body", the old chunk_ids are no longer in the vector store (the semantically closest new chunks hit instead).

The "vector store cleanup" item in HOTFIXES `2026-05-02 P7-3` is updated from "deferred" to "closed by follow-up PR". The known behavior in SMOKE.md ("old vectors remain") is likewise updated to "both stores consistent".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 12:32:29 +00:00
parent 91ae624a92
commit 0c8821f857
6 changed files with 183 additions and 9 deletions
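
The commit message describes two mechanics that are easy to get subtly wrong: a trait method with a default no-op, and deletion in batches of 200 via a `chunk_id IN (...)` predicate. The following is a minimal sketch of both, not the code from this commit. `ChunkId` as a string newtype, the `Result<(), String>` return type, the synchronous signature, and the helper `delete_predicates` are all assumptions made for illustration; the real trait may well be async and use a different error type.

```rust
/// Hypothetical stand-in for the crate's `ChunkId`; the real type is not shown here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChunkId(pub String);

/// Batch size quoted in the commit message.
pub const DELETE_BATCH: usize = 200;

/// Build one `chunk_id IN (...)` predicate per batch of at most `batch` ids.
/// Single quotes are doubled so an id cannot terminate the SQL string literal.
/// (Hypothetical helper: the commit does not show how the predicate is built.)
pub fn delete_predicates(ids: &[ChunkId], batch: usize) -> Vec<String> {
    ids.chunks(batch.max(1)) // `chunks(0)` would panic, so clamp to 1
        .map(|group| {
            let quoted: Vec<String> = group
                .iter()
                .map(|id| format!("'{}'", id.0.replace('\'', "''")))
                .collect();
            format!("chunk_id IN ({})", quoted.join(", "))
        })
        .collect()
}

/// Sketch of the trait shape: the default body is a no-op, so existing
/// implementors and test fakes keep compiling without changes.
pub trait VectorStore {
    fn delete_by_chunk_ids(&self, _ids: &[ChunkId]) -> Result<(), String> {
        Ok(())
    }
}
```

Under these assumptions, a LanceDB-backed implementation would loop over each `chunk_embeddings_*` table and pass every predicate string to that table's delete call. Batching keeps each predicate bounded in size regardless of how many chunks the old document had.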
--- a/docs/SMOKE.md
+++ b/docs/SMOKE.md
@@ -217,6 +217,6 @@ rm -rf /tmp/kebab-smoke              # clean up everything
 - (P6-4) In a workspace with `image.ocr.enabled = true` + `image.caption.enabled = true` that contains N PNGs, ingest time ≈ markdown_time + N × (OCR + Caption latency). With `gemma4:e4b` + 192.168.0.47 this is roughly 5-10 seconds per asset. Do not ingest many book pages as images; for books, the P7 PDF line is recommended.
 - (P7-3) `config.chunking.chunker_version` represents markdown only; PDF assets are hardcoded to `pdf-page-v1`. Even if `config.toml` shows `chunker_version = "md-heading-v1"`, PDFs are unaffected. See the HOTFIXES `2026-05-02 P7-3` entry (this holds until the P+ chunker registry task).
 - (P7-3) For an N-page PDF, `kebab ingest` commits N chunks (or more, since long pages split into multiple chunks) in a single transaction. A 500-page book → 500+ chunks at once → embedding throughput becomes the bottleneck. The first ingest of a large PDF in an embedding-enabled workspace can take minutes and grow the WAL. Until the P+ scale hardening task this works correctly, but the cost is measurable.
- (P7-3) When a PDF with different bytes is ingested a second time at the same path, `purge_orphan_at_workspace_path` sweeps the old doc / chunks / embeddings and the new bytes are indexed under a new `doc_id`. In `IngestReport`, only that asset counts as `new+=1` (other assets are `updated`). LanceDB is a separate store, so the old vectors remain, but search is unaffected (they are not surfaced through the SQLite join); disk cleanup is P+.
+- (P7-3 + follow-up) When a PDF with different bytes is ingested a second time at the same path, `purge_vector_orphans_for_workspace_path` first deletes the old chunk_ids from LanceDB, then `purge_orphan_at_workspace_path` sweeps the old doc / chunks / embedding_records from SQLite. The new bytes are indexed under a new `doc_id`. In `IngestReport`, only that asset counts as `new+=1` (other assets are `updated`). Both stores are consistent: searching for the old body text no longer surfaces the old chunks.

 For the detailed history and the bugs found, see [tasks/HOTFIXES.md](../tasks/HOTFIXES.md).
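
The ordering in the updated SMOKE.md entry matters: the stale chunk ids have to be read before the SQLite sweep removes them, otherwise nothing is left to tell the vector store what to delete. The sketch below illustrates that ordering with in-memory fakes. `FakeSqlite`, `FakeVectors`, and `reingest_purge` are hypothetical names invented for this example, and the plain `String` ids are an assumption; only the three method names mirror those mentioned in the commit.

```rust
use std::collections::{HashMap, HashSet};

/// Toy stand-in for the SQLite side: workspace path -> (asset id, chunk ids).
#[derive(Default)]
pub struct FakeSqlite {
    pub docs: HashMap<String, (String, Vec<String>)>,
}

impl FakeSqlite {
    /// Read-only: chunk ids stored at `path` under an asset other than `new_asset_id`.
    pub fn stale_chunk_ids_at(&self, path: &str, new_asset_id: &str) -> Vec<String> {
        match self.docs.get(path) {
            Some((asset, chunks)) if asset.as_str() != new_asset_id => chunks.clone(),
            _ => Vec::new(),
        }
    }

    /// Destructive: drops the old document row, standing in for the CASCADE sweep.
    pub fn purge_orphan_at_workspace_path(&mut self, path: &str, new_asset_id: &str) {
        let is_stale = matches!(
            self.docs.get(path),
            Some((asset, _)) if asset.as_str() != new_asset_id
        );
        if is_stale {
            self.docs.remove(path);
        }
    }
}

/// Toy stand-in for the vector store: just the set of chunk ids that have a row.
#[derive(Default)]
pub struct FakeVectors {
    pub rows: HashSet<String>,
}

impl FakeVectors {
    pub fn delete_by_chunk_ids(&mut self, ids: &[String]) {
        for id in ids {
            self.rows.remove(id);
        }
    }
}

/// Purge both stores for a re-ingested path; returns how many stale chunk ids were found.
pub fn reingest_purge(
    sql: &mut FakeSqlite,
    vectors: &mut FakeVectors,
    path: &str,
    new_asset_id: &str,
) -> usize {
    // 1. Collect the stale ids while the SQLite rows still exist.
    let stale = sql.stale_chunk_ids_at(path, new_asset_id);
    // 2. Remove the matching vectors.
    vectors.delete_by_chunk_ids(&stale);
    // 3. Only now sweep the SQLite side.
    sql.purge_orphan_at_workspace_path(path, new_asset_id);
    stale.len()
}
```

Swapping steps 1 and 3 in this sketch would return an empty list from `stale_chunk_ids_at`, which is the orphaned-vector situation the original caveat described.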