fix(kebab-store-sqlite): purge stale assets row on workspace_path orphan + smoke

P7-3 통합 테스트가 노출한 storage 레이어 버그 fix.
`assets.workspace_path` 의 UNIQUE 제약과 `upsert_asset_row` 의
`ON CONFLICT(asset_id)` 만 처리하던 gap 사이 — byte 가 변경된 자산
re-ingest 시 새 asset_id 가 같은 workspace_path 에서 secondary UNIQUE
충돌. md / image / pdf 모두 영향.

Fix:
- 새 helper `purge_orphan_at_workspace_path` 가 같은 `workspace_path`
  의 *다른* `asset_id` 를 발견하면 documents → assets 순서로 sweep.
  documents 의 ON DELETE RESTRICT 회피 + CASCADE 로 blocks / chunks /
  embedding_records 정리. copied 모드면 storage_path 의 byte 파일도
  best-effort 삭제.
- `put_asset_with_bytes` 의 두 분기 (copy / reference) + `DocumentStore
  ::put_asset` 모두 호출.
- 회귀 테스트 `put_asset_with_bytes_sweeps_workspace_path_orphan` (이전
  의 "UPSERT 실패시 orphan 청소" 테스트가 더 이상 doable 하지 않으므로
  대체).
- `re_ingest_edited_pdf_produces_new_doc_id` integration `#[ignore]` 해제 →
  9 통합 테스트 모두 default 로 통과.

Vector store orphan 은 별도 P+ task — LanceDB 가 SQLite cascade 와 무관하게
운영되므로 stale chunk_id vector 가 디스크에 남음. 검색에는 영향 없음 (search 가
SQLite join 통해 surface).

Smoke 검증 (release binary, markdown 2 + image 1 + PDF 2):
- doctor pass
- 첫 ingest: 5 new
- list docs: 5 docs all media types
- search lexical "pdf-page-v1 chunker" → whitepaper.pdf hit
- search hybrid → cross-media 결과
- inspect doc PDF: parser_version=pdf-text-v1, blocks 가 SourceSpan::Page
- 동일 byte re-ingest: 5 updated, 0 errors (P1 idempotency)
- byte 수정 후 re-ingest: 1 new (해당 PDF) + 4 updated, 0 errors (storage fix)
- corrupt PDF 추가: errors+=1 + IngestItem.error 메시지 정확, 다른 자산 영향 0
- 정리 후 다시 ingest: errors=0
- RAG ask: PDF 인용 + `citations[].citation` 에 `kind: "page"` + `page: <N>` +
  `path: <pdf_path>` 정확히 노출

운영 fixture 보조:
- `crates/kebab-parse-pdf/examples/gen_smoke_pdf.rs` — `cargo run --release
  --example gen_smoke_pdf -p kebab-parse-pdf -- <out.pdf> <text-pages>` 로
  reportlab/qpdf 없이 in-tree PDF 생성.
- `crates/kebab-parse-image/examples/gen_smoke_png.rs` — 동일 방식의 PNG
  fixture 생성.
- SMOKE.md 가 두 example 사용법 + 갱신된 HOTFIXES 동작 (byte 수정 시
  errors+=1 → new+=1) 반영.

HOTFIXES `2026-05-02 P7-3` entry 가 \"deferred\" → \"fixed in same PR\" 로
업데이트, vector store orphan caveat 만 남음.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-02 11:41:23 +00:00
parent 4ad4ef271e
commit 3a57cab1eb
8 changed files with 267 additions and 56 deletions

View File

@@ -230,18 +230,11 @@ fn re_ingest_identical_pdf_produces_updated_with_same_doc_id() {
}
/// Edit a PDF (replace bytes) → different blake3 → different asset_id
/// → different doc_id → `new+=1` for the new doc_id; first-pass row
/// remains untouched.
///
/// **Currently `#[ignore]`** — exposes a storage-layer bug discovered
/// by this PR: `assets.workspace_path` carries a UNIQUE constraint and
/// `upsert_asset_row` only handles `ON CONFLICT(asset_id)`, so the
/// second insert (new `asset_id` for the edited bytes, same
/// `workspace_path`) trips constraint 2067. Affects markdown / image /
/// PDF paths equally; no test exercised it before P7-3. Logged in
/// `tasks/HOTFIXES.md` for a P+ storage-layer fix.
/// → different doc_id → `new+=1` for the new doc_id; stale doc /
/// asset / chunk / embedding rows for the prior bytes are swept by
/// `purge_orphan_at_workspace_path` (HOTFIXES 2026-05-02 P7-3 storage
/// fix shipped alongside this test).
#[test]
#[ignore = "exposes storage-layer assets.workspace_path UNIQUE bug — see HOTFIXES 2026-05-02 P7-3"]
fn re_ingest_edited_pdf_produces_new_doc_id() {
let env = TestEnv::lexical_only();
let path = env.workspace_root.join("evolving.pdf");

View File

@@ -0,0 +1,21 @@
//! `cargo run --example gen_smoke_png -p kebab-parse-image -- <out.png>`
//!
//! 100×50 solid-red PNG used by the SMOKE runbook for the image
//! filename-indexing path (OCR / caption disabled).
use image::{ImageBuffer, Rgb};
use std::io::Cursor;
fn main() {
let out = std::env::args()
.nth(1)
.expect("usage: gen_smoke_png <out.png>");
let img: ImageBuffer<Rgb<u8>, _> =
ImageBuffer::from_fn(100, 50, |_, _| Rgb([255, 0, 0]));
let mut buf = Cursor::new(Vec::new());
img.write_to(&mut buf, image::ImageFormat::Png)
.expect("encode PNG");
let bytes = buf.into_inner();
std::fs::write(&out, &bytes).expect("write");
println!("wrote {} ({} bytes)", out, bytes.len());
}

View File

@@ -0,0 +1,83 @@
//! `cargo run --example gen_smoke_pdf -p kebab-parse-pdf -- <out.pdf> <text-page-1> [<text-page-N> ...]`
//!
//! Tiny generator used by the SMOKE runbook to produce text PDFs without
//! adding `reportlab` / `qpdf` to the dev-machine prerequisites. Mirrors
//! the `tests/common::build_text_pdf` builder.
use lopdf::content::{Content, Operation};
use lopdf::{Document, Object, Stream, dictionary};
use std::fs::File;
use std::io::Write;
fn main() {
let mut args = std::env::args().skip(1);
let out = args.next().expect("usage: gen_smoke_pdf <out.pdf> <text...>");
let pages: Vec<String> = args.collect();
if pages.is_empty() {
eprintln!("at least one page text required");
std::process::exit(2);
}
let mut doc = Document::with_version("1.5");
let pages_id = doc.new_object_id();
let font_id = doc.add_object(dictionary! {
"Type" => "Font",
"Subtype" => "Type1",
"BaseFont" => "Helvetica",
});
let resources_id = doc.add_object(dictionary! {
"Font" => dictionary! { "F1" => font_id },
});
let mut page_refs: Vec<Object> = Vec::new();
for text in &pages {
let content = Content {
operations: vec![
Operation::new("BT", vec![]),
Operation::new("Tf", vec!["F1".into(), 14.into()]),
Operation::new(
"Td",
vec![Object::Integer(72), Object::Integer(720)],
),
Operation::new("Tj", vec![Object::string_literal(text.as_str())]),
Operation::new("ET", vec![]),
],
};
let stream_data = content.encode().expect("content encode");
let content_id = doc.add_object(Stream::new(dictionary! {}, stream_data));
let page_id = doc.add_object(dictionary! {
"Type" => "Page",
"Parent" => pages_id,
"Contents" => content_id,
});
page_refs.push(page_id.into());
}
let count = page_refs.len() as i64;
let pages_dict = dictionary! {
"Type" => "Pages",
"Kids" => page_refs,
"Count" => count,
"Resources" => resources_id,
"MediaBox" => vec![
Object::Integer(0),
Object::Integer(0),
Object::Integer(595),
Object::Integer(842),
],
};
doc.objects
.insert(pages_id, Object::Dictionary(pages_dict));
let catalog_id = doc.add_object(dictionary! {
"Type" => "Catalog",
"Pages" => pages_id,
});
doc.trailer.set("Root", catalog_id);
let mut buf: Vec<u8> = Vec::new();
doc.save_to(&mut buf).expect("save PDF");
File::create(&out)
.expect("create out")
.write_all(&buf)
.expect("write");
println!("wrote {} ({} bytes, {} pages)", out, buf.len(), pages.len());
}

View File

@@ -16,7 +16,9 @@ use rusqlite::params;
use time::OffsetDateTime;
use crate::error::StoreError;
use crate::store::{SqliteStore, upsert_asset_row, validate_asset_id};
use crate::store::{
SqliteStore, purge_orphan_at_workspace_path, upsert_asset_row, validate_asset_id,
};
impl kebab_core::DocumentStore for SqliteStore {
fn put_asset(&self, asset: &kebab_core::RawAsset) -> Result<()> {
@@ -38,6 +40,11 @@ impl kebab_core::DocumentStore for SqliteStore {
}
};
let conn = self.lock_conn();
purge_orphan_at_workspace_path(
&conn,
&asset.workspace_path.0,
&asset.asset_id.0,
)?;
upsert_asset_row(&conn, asset, storage_kind, &storage_path)
}

View File

@@ -11,7 +11,7 @@ use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Mutex, MutexGuard};
use anyhow::{Context, Result};
use rusqlite::Connection;
use rusqlite::{Connection, OptionalExtension, params};
use crate::error::StoreError;
use crate::schema;
@@ -208,6 +208,11 @@ impl SqliteStore {
// in place.
{
let conn = self.lock_conn();
purge_orphan_at_workspace_path(
&conn,
&asset.workspace_path.0,
&asset.asset_id.0,
)?;
upsert_asset_row(
&conn,
asset,
@@ -243,6 +248,11 @@ impl SqliteStore {
kebab_core::SourceUri::Kb(u) => u.clone(),
};
let conn = self.lock_conn();
purge_orphan_at_workspace_path(
&conn,
&asset.workspace_path.0,
&asset.asset_id.0,
)?;
upsert_asset_row(&conn, asset, "reference", &storage_path)?;
Ok(())
}
@@ -298,6 +308,81 @@ fn temp_path_for(dest: &Path) -> PathBuf {
parent.join(format!("{file_name}.tmp.{pid}.{n}"))
}
/// Sweep stale `assets` + `documents` + downstream rows when the file
/// at `workspace_path` is being re-ingested with bytes that produce a
/// **different** `asset_id` (i.e. the file was edited).
///
/// Why this exists (HOTFIXES 2026-05-02 P7-3): `idx_assets_workspace_path`
/// is a UNIQUE index. The original `upsert_asset_row` only handles
/// `ON CONFLICT(asset_id)`, so a brand-new `asset_id` colliding on
/// `workspace_path` raises `Error code 2067` and the ingest fails. This
/// helper does the cleanup `ON CONFLICT(workspace_path)` would have done
/// if SQLite let UPSERT target two indexes at once.
///
/// Order matters:
/// 1. `documents.asset_id` is `ON DELETE RESTRICT`, so we must drop the
/// old `documents` rows first. CASCADE on documents → blocks /
/// chunks / embedding_records sweeps the dependent rows in the same
/// statement.
/// 2. Then DELETE the stale `assets` row, freeing the
/// `workspace_path` slot for the new one.
/// 3. If the stale asset was stored in `copied` mode, best-effort
/// delete the on-disk byte file at `storage_path` so the data dir
/// doesn't accumulate orphans across edits.
///
/// **Caveat — vector store orphans.** `embedding_records.chunk_id`
/// CASCADE clears the SQLite side, but the LanceDB rows keyed on
/// `chunk_id` live in a separate store and are not touched here.
/// Stale vectors do not affect retrieval (search joins through
/// SQLite, so an orphan vector is never surfaced) but they consume
/// disk in `data_dir/lancedb/`. A future task should reconcile by
/// `chunk_id` set diff. Tracked alongside this entry in HOTFIXES.
pub(crate) fn purge_orphan_at_workspace_path(
conn: &Connection,
workspace_path: &str,
new_asset_id: &str,
) -> Result<()> {
let stale: Option<(String, String, String)> = conn
.query_row(
"SELECT asset_id, storage_kind, storage_path
FROM assets
WHERE workspace_path = ? AND asset_id != ?",
params![workspace_path, new_asset_id],
|row| Ok((row.get(0)?, row.get(1)?, row.get(2)?)),
)
.optional()
.map_err(StoreError::from)?;
let Some((stale_asset_id, storage_kind, storage_path)) = stale else {
return Ok(());
};
// documents → blocks / chunks / embedding_records via CASCADE.
conn.execute(
"DELETE FROM documents WHERE asset_id = ?",
params![stale_asset_id],
)
.map_err(StoreError::from)?;
conn.execute(
"DELETE FROM assets WHERE asset_id = ?",
params![stale_asset_id],
)
.map_err(StoreError::from)?;
if storage_kind == "copied" {
let _ = std::fs::remove_file(&storage_path);
}
tracing::debug!(
target: "kebab-store-sqlite",
workspace_path = %workspace_path,
stale_asset_id = %stale_asset_id,
new_asset_id = %new_asset_id,
"purged stale asset (file edited; bytes changed)"
);
Ok(())
}
/// UPSERT a row into `assets`. Used by both the `put_asset_with_bytes`
/// path (which has bytes + computed `storage_kind/path`) and the
/// `DocumentStore::put_asset` path (which only has the `RawAsset` and

View File

@@ -100,22 +100,23 @@ fn reference_mode_does_not_write_file_but_records_path() {
}
#[test]
fn put_asset_with_bytes_orphan_cleanup_on_upsert_failure() {
// Goal: prove that if the row UPSERT fails AFTER the bytes have been
// staged on disk, no `<aa>/<asset_id>` file is left behind.
fn put_asset_with_bytes_sweeps_workspace_path_orphan() {
// HOTFIXES 2026-05-02 P7-3: the original behaviour erred on
// workspace_path UNIQUE conflict (`ON CONFLICT(asset_id)` only) so
// a re-ingest of an edited file was unrecoverable. The fix is
// `purge_orphan_at_workspace_path`, which sweeps the stale
// documents → assets chain before the new INSERT lands.
//
// Lever: the `assets` table has a UNIQUE INDEX on `workspace_path`
// (V001), but the UPSERT is `ON CONFLICT(asset_id)`. So if some other
// row already owns this `workspace_path`, the INSERT half of the
// UPSERT trips a UNIQUE constraint that the ON CONFLICT clause does
// NOT handle — UPSERT errors. The new asset's bytes were already
// staged; we assert they are NOT visible at the final destination.
// This test exercises the no-documents flavour (raw asset row only)
// — the put_asset_with_bytes path. The documents-cascade flavour
// is exercised end-to-end in `kebab-app::tests::pdf_pipeline::
// re_ingest_edited_pdf_produces_new_doc_id`.
let env = common::TestEnv::with_threshold(100);
let store = SqliteStore::open(&env.config()).unwrap();
store.run_migrations().unwrap();
// Pre-populate a row that owns `notes/foo.md` (the workspace_path our
// fixture asset will also claim) under a *different* asset_id.
// Pre-populate a row that owns `notes/foo.md` under a *different*
// asset_id, simulating a prior ingest of an earlier byte version.
env.with_conn(|c| {
c.execute(
"INSERT INTO assets (
@@ -140,36 +141,36 @@ fn put_asset_with_bytes_orphan_cleanup_on_upsert_failure() {
let cs = b3_full_hex(bytes);
let asset = fixed_asset(bytes, bytes.len() as u64, &cs);
let err = store
store
.put_asset_with_bytes(&asset, bytes)
.expect_err("UPSERT must fail on workspace_path UNIQUE violation");
let msg = format!("{err:#}");
assert!(
msg.to_lowercase().contains("unique") || msg.to_lowercase().contains("constraint"),
"expected UNIQUE constraint failure, got: {msg}"
);
.expect("orphan sweep + INSERT must succeed");
// Final destination must NOT exist (no orphan).
// Stale row gone, new row owns the workspace_path.
let stale_count: i64 = env.with_conn(|c| {
c.query_row(
"SELECT COUNT(*) FROM assets WHERE asset_id = ?",
rusqlite::params!["b".repeat(32)],
|row| row.get(0),
)
});
assert_eq!(stale_count, 0, "stale asset_id must be purged");
let new_count: i64 = env.with_conn(|c| {
c.query_row(
"SELECT COUNT(*) FROM assets WHERE asset_id = ?",
rusqlite::params![asset.asset_id.0],
|row| row.get(0),
)
});
assert_eq!(new_count, 1, "new asset_id must own the workspace_path slot");
// New asset's bytes published at the final destination.
let aa = &asset.asset_id.0[..2];
let dest = env.data_dir().join("assets").join(aa).join(&asset.asset_id.0);
assert!(
!dest.exists(),
"asset bytes were left orphan at {} after UPSERT failure",
dest.exists(),
"new asset bytes must be visible at {}",
dest.display()
);
// No `*.tmp.*` either — temp file must be cleaned up too.
let shard_dir = env.data_dir().join("assets").join(aa);
if let Ok(entries) = std::fs::read_dir(&shard_dir) {
for entry in entries.flatten() {
let name = entry.file_name();
let s = name.to_string_lossy();
assert!(
!s.contains(".tmp."),
"temp file leaked at {}",
entry.path().display()
);
}
}
}
#[test]

View File

@@ -162,16 +162,29 @@ prompt_template_version = "caption-v1"
include = ["**/*.md", "**/*.pdf"]
```
PDF 한 권당 페이지 수만큼 (또는 페이지 텍스트가 길면 그 이상의) chunk 가 한 transaction 안에서 commit. 검색 결과의 `chunk.source_spans[0]``Page { page, char_start, char_end }` 형태라 인용 시 페이지 번호가 그대로 사용 가능.
PDF 한 권당 페이지 수만큼 (또는 페이지 텍스트가 길면 그 이상의) chunk 가 한 transaction 안에서 commit. 검색 결과의 `chunk.source_spans[0]``Page { page, char_start, char_end }` 형태라 인용 시 페이지 번호가 그대로 사용 가능. `kebab ask --json``citations[].citation``kind: "page"` + `page: <N>` + `path: <pdf_path>` 로 노출.
테스트 fixture 가 필요할 때는 두 example 바이너리를 사용 — `reportlab` / `qpdf` 같은 시스템 dep 없이 in-tree 로 PDF / PNG 생성:
```bash
cargo run --release --example gen_smoke_pdf -p kebab-parse-pdf -- \
/tmp/kebab-smoke/workspace/whitepaper.pdf "page one body" "page two body"
cargo run --release --example gen_smoke_png -p kebab-parse-image -- \
/tmp/kebab-smoke/workspace/diagram.png
```
```bash
kebab --config /tmp/kebab-smoke/config.toml ingest
kebab --config /tmp/kebab-smoke/config.toml search --mode hybrid "<본문 단어>"
kebab --config /tmp/kebab-smoke/config.toml inspect doc "<pdf_doc_id>"
kebab --config /tmp/kebab-smoke/config.toml ask "<PDF 본문에 관한 질문>" --json
```
암호화 PDF (예: DRM 책) → `errors+=1`, `error` 필드에 `qpdf --decrypt` 안내. 빈/스캔 페이지 (텍스트 추출 실패) → 0 chunk + `Provenance::Warning` ("scanned candidate"). v1 에서는 검색 불가, P+ scanned-PDF OCR fallback 까지 대기.
수정된 PDF 를 같은 path 에 다시 배치하면 `purge_orphan_at_workspace_path` 가 옛 doc / chunks / embeddings 를 sweep 하고 새 byte 가 새 `doc_id` 로 색인됨 — `IngestReport` 에 그 자산만 `new+=1` 로 분류 (다른 자산은 `updated`). HOTFIXES `2026-05-02 P7-3` 참조.
각 명령은 0 종료 코드면 정상. `kebab ask` 는 거절 시 종료 코드 1 (`RefusalSignal`) — 의도된 동작.
## 검증 체크리스트
@@ -204,6 +217,6 @@ rm -rf /tmp/kebab-smoke # 통째로 정리
- (P6-4) `image.ocr.enabled = true` + `image.caption.enabled = true` 인 워크스페이스에 PNG 가 N장 있으면 ingest 시간 ≈ markdown_time + N × (OCR + Caption latency). `gemma4:e4b` + 192.168.0.47 로 자산당 ~5-10초. 다수의 책 페이지를 이미지로 넣지 말 것 — 책은 P7 PDF 라인 사용 권장.
- (P7-3) `config.chunking.chunker_version` 는 markdown 만 represent — PDF 자산은 `pdf-page-v1` 하드코딩. `config.toml``chunker_version = "md-heading-v1"` 을 봐도 PDF 는 영향 안 받음. HOTFIXES `2026-05-02 P7-3` entry 참조 (P+ chunker registry task 까지 유지).
- (P7-3) 한 PDF 가 N 페이지면 `kebab ingest` 가 N 개 (또는 그 이상의, 페이지 길면 multi-chunk) 의 chunk 를 한 transaction 안에서 commit. 500 페이지 책 → 500+ chunk 한 번에 → embedding throughput 가 bottleneck. 임베딩 활성 워크스페이스에서 큰 PDF 를 처음 ingest 하면 분-단위 시간 + WAL 크기 증가 가능 — P+ 스케일 hardening task 까지 정상 동작이지만 비용은 측정 가능.
- (P7-3) 동일 path 에 byte 가 다른 PDF 를 두 번째 ingest 하면 storage UNIQUE 제약 (`assets.workspace_path`) 에 트립 → `errors+=1`. md / image / pdf 모두 동일하지만 P7-3 의 통합 테스트가 처음 노출. 우회: 파일 path 변경 후 ingest. 영구 fix 는 P+ storage 작업.
- (P7-3) 동일 path 에 byte 가 다른 PDF 를 두 번째 ingest 하면 `purge_orphan_at_workspace_path` 가 옛 doc / chunks / embeddings 를 sweep 하고 새 byte 가 새 `doc_id` 로 색인됨. `IngestReport` 에 그 자산만 `new+=1` (다른 자산은 `updated`). LanceDB 는 별도 store 라 옛 vector 가 잔존하지만 검색에는 영향 없음 (SQLite join 으로 surface 안 됨) — 디스크 cleanup 은 P+.
자세한 history 와 발견된 버그는 [tasks/HOTFIXES.md](../tasks/HOTFIXES.md) 참조.

View File

@@ -22,17 +22,25 @@ git history.
**Fix 1**: `ingest_one_pdf_asset` (in `kebab-app::lib.rs`) instantiates `PdfPageV1Chunker` directly. The `Chunk.chunker_version` field on emitted PDF chunks records `pdf-page-v1` truthfully. A future P+ task (chunker registry) either splits `Config::chunking.chunker_version` per medium or replaces the dispatch with a runtime registry. No HOTFIX entry needed once that happens — this entry is the cross-reference.
**Symptom 2 (storage-layer bug, exposed but not fixed by P7-3)**: P7-3's edited-bytes re-ingest test (`re_ingest_edited_pdf_produces_new_doc_id`) tripped on `sqlite error: UNIQUE constraint failed: assets.workspace_path: Error code 2067`. The assets table has a UNIQUE constraint on `workspace_path`, but `upsert_asset_row` (in `kebab-store-sqlite::store.rs:305`) only handles `ON CONFLICT(asset_id)`. When a file's bytes change, the new BLAKE3 produces a new `asset_id` while the `workspace_path` stays the same — INSERT picks the new asset_id branch, then trips the secondary UNIQUE on workspace_path.
**Symptom 2 (storage-layer bug, fixed in same PR)**: P7-3's edited-bytes re-ingest test (`re_ingest_edited_pdf_produces_new_doc_id`) tripped on `sqlite error: UNIQUE constraint failed: assets.workspace_path: Error code 2067`. The assets table has a UNIQUE constraint on `workspace_path`, but `upsert_asset_row` (in `kebab-store-sqlite::store.rs`) only handles `ON CONFLICT(asset_id)`. When a file's bytes change, the new BLAKE3 produces a new `asset_id` while the `workspace_path` stays the same — INSERT picks the new asset_id branch, then trips the secondary UNIQUE on `workspace_path`.
**Why it didn't surface earlier**: No existing test (markdown / image) exercises edited-bytes re-ingest. The image path's `re_ingest_image_produces_updated_with_same_doc_id` uses identical bytes (same asset_id → ON CONFLICT(asset_id) catches it). Real-world editing of a tracked file would hit the same bug across all media types.
**Why it didn't surface earlier**: No existing test (markdown / image) exercised edited-bytes re-ingest. The image path's `re_ingest_image_produces_updated_with_same_doc_id` uses identical bytes (same asset_id → `ON CONFLICT(asset_id)` catches it). Real-world editing of a tracked file would hit the same bug across all media types.
**Fix 2 (deferred)**: Storage-layer fix is out of scope for P7-3. The P7-3 implementation PR `#[ignore]`s the `re_ingest_edited_pdf_produces_new_doc_id` test with a doc-comment pointing here. A P+ storage task either:
- Adds `ON CONFLICT(workspace_path) DO UPDATE` alongside the existing `ON CONFLICT(asset_id)` clause (DELETE-the-old + INSERT-the-new in a single statement, since UPSERT can only target one conflict path).
- Or drops the UNIQUE constraint on `assets.workspace_path` and relies on application-level uniqueness (workspace_path → asset_id mapping in a separate index table).
**Fix 2** (P7-3 implementation PR): new `purge_orphan_at_workspace_path` helper in `kebab-store-sqlite::store.rs`. Runs immediately before each `upsert_asset_row` call (both `put_asset_with_bytes` paths AND `DocumentStore::put_asset`). It:
1. SELECTs the stale row at `workspace_path` whose `asset_id` differs from the incoming one (none → no-op return).
2. DELETEs from `documents WHERE asset_id = stale``documents.asset_id ON DELETE RESTRICT` requires the documents go first; CASCADE on documents → `blocks` / `chunks` / `embedding_records` sweeps the dependent rows in the same statement.
3. DELETEs the stale `assets` row, freeing the `workspace_path` slot.
4. If the stale storage was `copied`, best-effort removes the byte file at `storage_path` so `data_dir/assets/` does not accumulate orphans across edits.
**Caveat (still deferred)**: `embedding_records.chunk_id` CASCADE clears the SQLite side, but the LanceDB rows keyed on `chunk_id` live in a separate store and are not touched. Stale vectors do not affect retrieval (search joins through SQLite, so an orphan vector is never surfaced) but they consume disk in `data_dir/lancedb/`. A future task should reconcile by `chunk_id` set diff.
The previously-`#[ignore]`d `re_ingest_edited_pdf_produces_new_doc_id` integration test runs by default after this fix, plus a dedicated unit test `put_asset_with_bytes_sweeps_workspace_path_orphan` in `kebab-store-sqlite::tests::asset_writer` that exercises the no-documents flavour. Verified end-to-end via the SMOKE runbook: `kebab ingest` → edit a tracked PDF → `kebab ingest` reports `new=1` for that asset (rest `updated`) and the prior doc/chunks are gone from `inspect` / `list docs`.
**Amends**:
- tasks/p7/p7-3-pdf-ingest-wiring.md (chunker_version deviation, edited-bytes test ignored).
- (Implicitly) every previous task spec that assumed `assets.workspace_path` UNIQUE was safe — the constraint is in fact too strict for the byte-edit re-ingest case.
- tasks/p7/p7-3-pdf-ingest-wiring.md (chunker_version deviation; edited-bytes test runs).
- crates/kebab-store-sqlite (new `purge_orphan_at_workspace_path` helper called from both `put_asset_with_bytes` branches and `DocumentStore::put_asset`).
- crates/kebab-store-sqlite/tests/asset_writer.rs (`put_asset_with_bytes_sweeps_workspace_path_orphan` replaces the prior orphan-cleanup-on-failure test, since the failure path no longer exists).
- docs/SMOKE.md (note that edited-PDF re-ingest produces `new=1` rather than an error).
## 2026-05-02 — P7-2 pdf-page-v1: chunk_id collision + BYTES_PER_TOKEN