fix(kebab-store-sqlite): purge stale assets row on workspace_path orphan + smoke
P7-3 통합 테스트가 노출한 storage 레이어 버그 fix. `assets.workspace_path` 의 UNIQUE 제약과 `upsert_asset_row` 의 `ON CONFLICT(asset_id)` 만 처리하던 gap 사이 — byte 가 변경된 자산 re-ingest 시 새 asset_id 가 같은 workspace_path 에서 secondary UNIQUE 충돌. md / image / pdf 모두 영향. Fix: - 새 helper `purge_orphan_at_workspace_path` 가 같은 `workspace_path` 의 *다른* `asset_id` 를 발견하면 documents → assets 순서로 sweep. documents 의 ON DELETE RESTRICT 회피 + CASCADE 로 blocks / chunks / embedding_records 정리. copied 모드면 storage_path 의 byte 파일도 best-effort 삭제. - `put_asset_with_bytes` 의 두 분기 (copy / reference) + `DocumentStore ::put_asset` 모두 호출. - 회귀 테스트 `put_asset_with_bytes_sweeps_workspace_path_orphan` (이전 의 "UPSERT 실패시 orphan 청소" 테스트가 더 이상 doable 하지 않으므로 대체). - `re_ingest_edited_pdf_produces_new_doc_id` integration `#[ignore]` 해제 → 9 통합 테스트 모두 default 로 통과. Vector store orphan 은 별도 P+ task — LanceDB 가 SQLite cascade 와 무관하게 운영되므로 stale chunk_id vector 가 디스크에 남음. 검색에는 영향 없음 (search 가 SQLite join 통해 surface). Smoke 검증 (release binary, markdown 2 + image 1 + PDF 2): - doctor pass - 첫 ingest: 5 new - list docs: 5 docs all media types - search lexical "pdf-page-v1 chunker" → whitepaper.pdf hit - search hybrid → cross-media 결과 - inspect doc PDF: parser_version=pdf-text-v1, blocks 가 SourceSpan::Page - 동일 byte re-ingest: 5 updated, 0 errors (P1 idempotency) - byte 수정 후 re-ingest: 1 new (해당 PDF) + 4 updated, 0 errors (storage fix) - corrupt PDF 추가: errors+=1 + IngestItem.error 메시지 정확, 다른 자산 영향 0 - 정리 후 다시 ingest: errors=0 - RAG ask: PDF 인용 + `citations[].citation` 에 `kind: "page"` + `page: <N>` + `path: <pdf_path>` 정확히 노출 운영 fixture 보조: - `crates/kebab-parse-pdf/examples/gen_smoke_pdf.rs` — `cargo run --release --example gen_smoke_pdf -p kebab-parse-pdf -- <out.pdf> <text-pages>` 로 reportlab/qpdf 없이 in-tree PDF 생성. - `crates/kebab-parse-image/examples/gen_smoke_png.rs` — 동일 방식의 PNG fixture 생성. - SMOKE.md 가 두 example 사용법 + 갱신된 HOTFIXES 동작 (byte 수정 시 errors+=1 → new+=1) 반영. HOTFIXES `2026-05-02 P7-3` entry 가 \"deferred\" → \"fixed in same PR\" 로 업데이트, vector store orphan caveat 만 남음. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -230,18 +230,11 @@ fn re_ingest_identical_pdf_produces_updated_with_same_doc_id() {
|
||||
}
|
||||
|
||||
/// Edit a PDF (replace bytes) → different blake3 → different asset_id
|
||||
/// → different doc_id → `new+=1` for the new doc_id; first-pass row
|
||||
/// remains untouched.
|
||||
///
|
||||
/// **Currently `#[ignore]`** — exposes a storage-layer bug discovered
|
||||
/// by this PR: `assets.workspace_path` carries a UNIQUE constraint and
|
||||
/// `upsert_asset_row` only handles `ON CONFLICT(asset_id)`, so the
|
||||
/// second insert (new `asset_id` for the edited bytes, same
|
||||
/// `workspace_path`) trips constraint 2067. Affects markdown / image /
|
||||
/// PDF paths equally; no test exercised it before P7-3. Logged in
|
||||
/// `tasks/HOTFIXES.md` for a P+ storage-layer fix.
|
||||
/// → different doc_id → `new+=1` for the new doc_id; stale doc /
|
||||
/// asset / chunk / embedding rows for the prior bytes are swept by
|
||||
/// `purge_orphan_at_workspace_path` (HOTFIXES 2026-05-02 P7-3 storage
|
||||
/// fix shipped alongside this test).
|
||||
#[test]
|
||||
#[ignore = "exposes storage-layer assets.workspace_path UNIQUE bug — see HOTFIXES 2026-05-02 P7-3"]
|
||||
fn re_ingest_edited_pdf_produces_new_doc_id() {
|
||||
let env = TestEnv::lexical_only();
|
||||
let path = env.workspace_root.join("evolving.pdf");
|
||||
|
||||
21
crates/kebab-parse-image/examples/gen_smoke_png.rs
Normal file
21
crates/kebab-parse-image/examples/gen_smoke_png.rs
Normal file
@@ -0,0 +1,21 @@
|
||||
//! `cargo run --example gen_smoke_png -p kebab-parse-image -- <out.png>`
|
||||
//!
|
||||
//! 100×50 solid-red PNG used by the SMOKE runbook for the image
|
||||
//! filename-indexing path (OCR / caption disabled).
|
||||
|
||||
use image::{ImageBuffer, Rgb};
|
||||
use std::io::Cursor;
|
||||
|
||||
fn main() {
|
||||
let out = std::env::args()
|
||||
.nth(1)
|
||||
.expect("usage: gen_smoke_png <out.png>");
|
||||
let img: ImageBuffer<Rgb<u8>, _> =
|
||||
ImageBuffer::from_fn(100, 50, |_, _| Rgb([255, 0, 0]));
|
||||
let mut buf = Cursor::new(Vec::new());
|
||||
img.write_to(&mut buf, image::ImageFormat::Png)
|
||||
.expect("encode PNG");
|
||||
let bytes = buf.into_inner();
|
||||
std::fs::write(&out, &bytes).expect("write");
|
||||
println!("wrote {} ({} bytes)", out, bytes.len());
|
||||
}
|
||||
83
crates/kebab-parse-pdf/examples/gen_smoke_pdf.rs
Normal file
83
crates/kebab-parse-pdf/examples/gen_smoke_pdf.rs
Normal file
@@ -0,0 +1,83 @@
|
||||
//! `cargo run --example gen_smoke_pdf -p kebab-parse-pdf -- <out.pdf> <text-page-1> [<text-page-N> ...]`
|
||||
//!
|
||||
//! Tiny generator used by the SMOKE runbook to produce text PDFs without
|
||||
//! adding `reportlab` / `qpdf` to the dev-machine prerequisites. Mirrors
|
||||
//! the `tests/common::build_text_pdf` builder.
|
||||
|
||||
use lopdf::content::{Content, Operation};
|
||||
use lopdf::{Document, Object, Stream, dictionary};
|
||||
use std::fs::File;
|
||||
use std::io::Write;
|
||||
|
||||
fn main() {
|
||||
let mut args = std::env::args().skip(1);
|
||||
let out = args.next().expect("usage: gen_smoke_pdf <out.pdf> <text...>");
|
||||
let pages: Vec<String> = args.collect();
|
||||
if pages.is_empty() {
|
||||
eprintln!("at least one page text required");
|
||||
std::process::exit(2);
|
||||
}
|
||||
|
||||
let mut doc = Document::with_version("1.5");
|
||||
let pages_id = doc.new_object_id();
|
||||
let font_id = doc.add_object(dictionary! {
|
||||
"Type" => "Font",
|
||||
"Subtype" => "Type1",
|
||||
"BaseFont" => "Helvetica",
|
||||
});
|
||||
let resources_id = doc.add_object(dictionary! {
|
||||
"Font" => dictionary! { "F1" => font_id },
|
||||
});
|
||||
let mut page_refs: Vec<Object> = Vec::new();
|
||||
for text in &pages {
|
||||
let content = Content {
|
||||
operations: vec![
|
||||
Operation::new("BT", vec![]),
|
||||
Operation::new("Tf", vec!["F1".into(), 14.into()]),
|
||||
Operation::new(
|
||||
"Td",
|
||||
vec![Object::Integer(72), Object::Integer(720)],
|
||||
),
|
||||
Operation::new("Tj", vec![Object::string_literal(text.as_str())]),
|
||||
Operation::new("ET", vec![]),
|
||||
],
|
||||
};
|
||||
let stream_data = content.encode().expect("content encode");
|
||||
let content_id = doc.add_object(Stream::new(dictionary! {}, stream_data));
|
||||
let page_id = doc.add_object(dictionary! {
|
||||
"Type" => "Page",
|
||||
"Parent" => pages_id,
|
||||
"Contents" => content_id,
|
||||
});
|
||||
page_refs.push(page_id.into());
|
||||
}
|
||||
|
||||
let count = page_refs.len() as i64;
|
||||
let pages_dict = dictionary! {
|
||||
"Type" => "Pages",
|
||||
"Kids" => page_refs,
|
||||
"Count" => count,
|
||||
"Resources" => resources_id,
|
||||
"MediaBox" => vec![
|
||||
Object::Integer(0),
|
||||
Object::Integer(0),
|
||||
Object::Integer(595),
|
||||
Object::Integer(842),
|
||||
],
|
||||
};
|
||||
doc.objects
|
||||
.insert(pages_id, Object::Dictionary(pages_dict));
|
||||
let catalog_id = doc.add_object(dictionary! {
|
||||
"Type" => "Catalog",
|
||||
"Pages" => pages_id,
|
||||
});
|
||||
doc.trailer.set("Root", catalog_id);
|
||||
|
||||
let mut buf: Vec<u8> = Vec::new();
|
||||
doc.save_to(&mut buf).expect("save PDF");
|
||||
File::create(&out)
|
||||
.expect("create out")
|
||||
.write_all(&buf)
|
||||
.expect("write");
|
||||
println!("wrote {} ({} bytes, {} pages)", out, buf.len(), pages.len());
|
||||
}
|
||||
@@ -16,7 +16,9 @@ use rusqlite::params;
|
||||
use time::OffsetDateTime;
|
||||
|
||||
use crate::error::StoreError;
|
||||
use crate::store::{SqliteStore, upsert_asset_row, validate_asset_id};
|
||||
use crate::store::{
|
||||
SqliteStore, purge_orphan_at_workspace_path, upsert_asset_row, validate_asset_id,
|
||||
};
|
||||
|
||||
impl kebab_core::DocumentStore for SqliteStore {
|
||||
fn put_asset(&self, asset: &kebab_core::RawAsset) -> Result<()> {
|
||||
@@ -38,6 +40,11 @@ impl kebab_core::DocumentStore for SqliteStore {
|
||||
}
|
||||
};
|
||||
let conn = self.lock_conn();
|
||||
purge_orphan_at_workspace_path(
|
||||
&conn,
|
||||
&asset.workspace_path.0,
|
||||
&asset.asset_id.0,
|
||||
)?;
|
||||
upsert_asset_row(&conn, asset, storage_kind, &storage_path)
|
||||
}
|
||||
|
||||
|
||||
@@ -11,7 +11,7 @@ use std::sync::atomic::{AtomicU64, Ordering};
|
||||
use std::sync::{Mutex, MutexGuard};
|
||||
|
||||
use anyhow::{Context, Result};
|
||||
use rusqlite::Connection;
|
||||
use rusqlite::{Connection, OptionalExtension, params};
|
||||
|
||||
use crate::error::StoreError;
|
||||
use crate::schema;
|
||||
@@ -208,6 +208,11 @@ impl SqliteStore {
|
||||
// in place.
|
||||
{
|
||||
let conn = self.lock_conn();
|
||||
purge_orphan_at_workspace_path(
|
||||
&conn,
|
||||
&asset.workspace_path.0,
|
||||
&asset.asset_id.0,
|
||||
)?;
|
||||
upsert_asset_row(
|
||||
&conn,
|
||||
asset,
|
||||
@@ -243,6 +248,11 @@ impl SqliteStore {
|
||||
kebab_core::SourceUri::Kb(u) => u.clone(),
|
||||
};
|
||||
let conn = self.lock_conn();
|
||||
purge_orphan_at_workspace_path(
|
||||
&conn,
|
||||
&asset.workspace_path.0,
|
||||
&asset.asset_id.0,
|
||||
)?;
|
||||
upsert_asset_row(&conn, asset, "reference", &storage_path)?;
|
||||
Ok(())
|
||||
}
|
||||
@@ -298,6 +308,81 @@ fn temp_path_for(dest: &Path) -> PathBuf {
|
||||
parent.join(format!("{file_name}.tmp.{pid}.{n}"))
|
||||
}
|
||||
|
||||
/// Sweep stale `assets` + `documents` + downstream rows when the file
|
||||
/// at `workspace_path` is being re-ingested with bytes that produce a
|
||||
/// **different** `asset_id` (i.e. the file was edited).
|
||||
///
|
||||
/// Why this exists (HOTFIXES 2026-05-02 P7-3): `idx_assets_workspace_path`
|
||||
/// is a UNIQUE index. The original `upsert_asset_row` only handles
|
||||
/// `ON CONFLICT(asset_id)`, so a brand-new `asset_id` colliding on
|
||||
/// `workspace_path` raises `Error code 2067` and the ingest fails. This
|
||||
/// helper does the cleanup `ON CONFLICT(workspace_path)` would have done
|
||||
/// if SQLite let UPSERT target two indexes at once.
|
||||
///
|
||||
/// Order matters:
|
||||
/// 1. `documents.asset_id` is `ON DELETE RESTRICT`, so we must drop the
|
||||
/// old `documents` rows first. CASCADE on documents → blocks /
|
||||
/// chunks / embedding_records sweeps the dependent rows in the same
|
||||
/// statement.
|
||||
/// 2. Then DELETE the stale `assets` row, freeing the
|
||||
/// `workspace_path` slot for the new one.
|
||||
/// 3. If the stale asset was stored in `copied` mode, best-effort
|
||||
/// delete the on-disk byte file at `storage_path` so the data dir
|
||||
/// doesn't accumulate orphans across edits.
|
||||
///
|
||||
/// **Caveat — vector store orphans.** `embedding_records.chunk_id`
|
||||
/// CASCADE clears the SQLite side, but the LanceDB rows keyed on
|
||||
/// `chunk_id` live in a separate store and are not touched here.
|
||||
/// Stale vectors do not affect retrieval (search joins through
|
||||
/// SQLite, so an orphan vector is never surfaced) but they consume
|
||||
/// disk in `data_dir/lancedb/`. A future task should reconcile by
|
||||
/// `chunk_id` set diff. Tracked alongside this entry in HOTFIXES.
|
||||
pub(crate) fn purge_orphan_at_workspace_path(
|
||||
conn: &Connection,
|
||||
workspace_path: &str,
|
||||
new_asset_id: &str,
|
||||
) -> Result<()> {
|
||||
let stale: Option<(String, String, String)> = conn
|
||||
.query_row(
|
||||
"SELECT asset_id, storage_kind, storage_path
|
||||
FROM assets
|
||||
WHERE workspace_path = ? AND asset_id != ?",
|
||||
params![workspace_path, new_asset_id],
|
||||
|row| Ok((row.get(0)?, row.get(1)?, row.get(2)?)),
|
||||
)
|
||||
.optional()
|
||||
.map_err(StoreError::from)?;
|
||||
|
||||
let Some((stale_asset_id, storage_kind, storage_path)) = stale else {
|
||||
return Ok(());
|
||||
};
|
||||
|
||||
// documents → blocks / chunks / embedding_records via CASCADE.
|
||||
conn.execute(
|
||||
"DELETE FROM documents WHERE asset_id = ?",
|
||||
params![stale_asset_id],
|
||||
)
|
||||
.map_err(StoreError::from)?;
|
||||
conn.execute(
|
||||
"DELETE FROM assets WHERE asset_id = ?",
|
||||
params![stale_asset_id],
|
||||
)
|
||||
.map_err(StoreError::from)?;
|
||||
|
||||
if storage_kind == "copied" {
|
||||
let _ = std::fs::remove_file(&storage_path);
|
||||
}
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-store-sqlite",
|
||||
workspace_path = %workspace_path,
|
||||
stale_asset_id = %stale_asset_id,
|
||||
new_asset_id = %new_asset_id,
|
||||
"purged stale asset (file edited; bytes changed)"
|
||||
);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// UPSERT a row into `assets`. Used by both the `put_asset_with_bytes`
|
||||
/// path (which has bytes + computed `storage_kind/path`) and the
|
||||
/// `DocumentStore::put_asset` path (which only has the `RawAsset` and
|
||||
|
||||
@@ -100,22 +100,23 @@ fn reference_mode_does_not_write_file_but_records_path() {
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn put_asset_with_bytes_orphan_cleanup_on_upsert_failure() {
|
||||
// Goal: prove that if the row UPSERT fails AFTER the bytes have been
|
||||
// staged on disk, no `<aa>/<asset_id>` file is left behind.
|
||||
fn put_asset_with_bytes_sweeps_workspace_path_orphan() {
|
||||
// HOTFIXES 2026-05-02 P7-3: the original behaviour erred on
|
||||
// workspace_path UNIQUE conflict (`ON CONFLICT(asset_id)` only) so
|
||||
// a re-ingest of an edited file was unrecoverable. The fix is
|
||||
// `purge_orphan_at_workspace_path`, which sweeps the stale
|
||||
// documents → assets chain before the new INSERT lands.
|
||||
//
|
||||
// Lever: the `assets` table has a UNIQUE INDEX on `workspace_path`
|
||||
// (V001), but the UPSERT is `ON CONFLICT(asset_id)`. So if some other
|
||||
// row already owns this `workspace_path`, the INSERT half of the
|
||||
// UPSERT trips a UNIQUE constraint that the ON CONFLICT clause does
|
||||
// NOT handle — UPSERT errors. The new asset's bytes were already
|
||||
// staged; we assert they are NOT visible at the final destination.
|
||||
// This test exercises the no-documents flavour (raw asset row only)
|
||||
// — the put_asset_with_bytes path. The documents-cascade flavour
|
||||
// is exercised end-to-end in `kebab-app::tests::pdf_pipeline::
|
||||
// re_ingest_edited_pdf_produces_new_doc_id`.
|
||||
let env = common::TestEnv::with_threshold(100);
|
||||
let store = SqliteStore::open(&env.config()).unwrap();
|
||||
store.run_migrations().unwrap();
|
||||
|
||||
// Pre-populate a row that owns `notes/foo.md` (the workspace_path our
|
||||
// fixture asset will also claim) under a *different* asset_id.
|
||||
// Pre-populate a row that owns `notes/foo.md` under a *different*
|
||||
// asset_id, simulating a prior ingest of an earlier byte version.
|
||||
env.with_conn(|c| {
|
||||
c.execute(
|
||||
"INSERT INTO assets (
|
||||
@@ -140,36 +141,36 @@ fn put_asset_with_bytes_orphan_cleanup_on_upsert_failure() {
|
||||
let cs = b3_full_hex(bytes);
|
||||
let asset = fixed_asset(bytes, bytes.len() as u64, &cs);
|
||||
|
||||
let err = store
|
||||
store
|
||||
.put_asset_with_bytes(&asset, bytes)
|
||||
.expect_err("UPSERT must fail on workspace_path UNIQUE violation");
|
||||
let msg = format!("{err:#}");
|
||||
assert!(
|
||||
msg.to_lowercase().contains("unique") || msg.to_lowercase().contains("constraint"),
|
||||
"expected UNIQUE constraint failure, got: {msg}"
|
||||
);
|
||||
.expect("orphan sweep + INSERT must succeed");
|
||||
|
||||
// Final destination must NOT exist (no orphan).
|
||||
// Stale row gone, new row owns the workspace_path.
|
||||
let stale_count: i64 = env.with_conn(|c| {
|
||||
c.query_row(
|
||||
"SELECT COUNT(*) FROM assets WHERE asset_id = ?",
|
||||
rusqlite::params!["b".repeat(32)],
|
||||
|row| row.get(0),
|
||||
)
|
||||
});
|
||||
assert_eq!(stale_count, 0, "stale asset_id must be purged");
|
||||
let new_count: i64 = env.with_conn(|c| {
|
||||
c.query_row(
|
||||
"SELECT COUNT(*) FROM assets WHERE asset_id = ?",
|
||||
rusqlite::params![asset.asset_id.0],
|
||||
|row| row.get(0),
|
||||
)
|
||||
});
|
||||
assert_eq!(new_count, 1, "new asset_id must own the workspace_path slot");
|
||||
|
||||
// New asset's bytes published at the final destination.
|
||||
let aa = &asset.asset_id.0[..2];
|
||||
let dest = env.data_dir().join("assets").join(aa).join(&asset.asset_id.0);
|
||||
assert!(
|
||||
!dest.exists(),
|
||||
"asset bytes were left orphan at {} after UPSERT failure",
|
||||
dest.exists(),
|
||||
"new asset bytes must be visible at {}",
|
||||
dest.display()
|
||||
);
|
||||
// No `*.tmp.*` either — temp file must be cleaned up too.
|
||||
let shard_dir = env.data_dir().join("assets").join(aa);
|
||||
if let Ok(entries) = std::fs::read_dir(&shard_dir) {
|
||||
for entry in entries.flatten() {
|
||||
let name = entry.file_name();
|
||||
let s = name.to_string_lossy();
|
||||
assert!(
|
||||
!s.contains(".tmp."),
|
||||
"temp file leaked at {}",
|
||||
entry.path().display()
|
||||
);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
|
||||
@@ -162,16 +162,29 @@ prompt_template_version = "caption-v1"
|
||||
include = ["**/*.md", "**/*.pdf"]
|
||||
```
|
||||
|
||||
PDF 한 권당 페이지 수만큼 (또는 페이지 텍스트가 길면 그 이상의) chunk 가 한 transaction 안에서 commit. 검색 결과의 `chunk.source_spans[0]` 가 `Page { page, char_start, char_end }` 형태라 인용 시 페이지 번호가 그대로 사용 가능.
|
||||
PDF 한 권당 페이지 수만큼 (또는 페이지 텍스트가 길면 그 이상의) chunk 가 한 transaction 안에서 commit. 검색 결과의 `chunk.source_spans[0]` 가 `Page { page, char_start, char_end }` 형태라 인용 시 페이지 번호가 그대로 사용 가능. `kebab ask --json` 의 `citations[].citation` 도 `kind: "page"` + `page: <N>` + `path: <pdf_path>` 로 노출.
|
||||
|
||||
테스트 fixture 가 필요할 때는 두 example 바이너리를 사용 — `reportlab` / `qpdf` 같은 시스템 dep 없이 in-tree 로 PDF / PNG 생성:
|
||||
|
||||
```bash
|
||||
cargo run --release --example gen_smoke_pdf -p kebab-parse-pdf -- \
|
||||
/tmp/kebab-smoke/workspace/whitepaper.pdf "page one body" "page two body"
|
||||
|
||||
cargo run --release --example gen_smoke_png -p kebab-parse-image -- \
|
||||
/tmp/kebab-smoke/workspace/diagram.png
|
||||
```
|
||||
|
||||
```bash
|
||||
kebab --config /tmp/kebab-smoke/config.toml ingest
|
||||
kebab --config /tmp/kebab-smoke/config.toml search --mode hybrid "<본문 단어>"
|
||||
kebab --config /tmp/kebab-smoke/config.toml inspect doc "<pdf_doc_id>"
|
||||
kebab --config /tmp/kebab-smoke/config.toml ask "<PDF 본문에 관한 질문>" --json
|
||||
```
|
||||
|
||||
암호화 PDF (예: DRM 책) → `errors+=1`, `error` 필드에 `qpdf --decrypt` 안내. 빈/스캔 페이지 (텍스트 추출 실패) → 0 chunk + `Provenance::Warning` ("scanned candidate"). v1 에서는 검색 불가, P+ scanned-PDF OCR fallback 까지 대기.
|
||||
|
||||
수정된 PDF 를 같은 path 에 다시 배치하면 `purge_orphan_at_workspace_path` 가 옛 doc / chunks / embeddings 를 sweep 하고 새 byte 가 새 `doc_id` 로 색인됨 — `IngestReport` 에 그 자산만 `new+=1` 로 분류 (다른 자산은 `updated`). HOTFIXES `2026-05-02 P7-3` 참조.
|
||||
|
||||
각 명령은 0 종료 코드면 정상. `kebab ask` 는 거절 시 종료 코드 1 (`RefusalSignal`) — 의도된 동작.
|
||||
|
||||
## 검증 체크리스트
|
||||
@@ -204,6 +217,6 @@ rm -rf /tmp/kebab-smoke # 통째로 정리
|
||||
- (P6-4) `image.ocr.enabled = true` + `image.caption.enabled = true` 인 워크스페이스에 PNG 가 N장 있으면 ingest 시간 ≈ markdown_time + N × (OCR + Caption latency). `gemma4:e4b` + 192.168.0.47 로 자산당 ~5-10초. 다수의 책 페이지를 이미지로 넣지 말 것 — 책은 P7 PDF 라인 사용 권장.
|
||||
- (P7-3) `config.chunking.chunker_version` 는 markdown 만 represent — PDF 자산은 `pdf-page-v1` 하드코딩. `config.toml` 의 `chunker_version = "md-heading-v1"` 을 봐도 PDF 는 영향 안 받음. HOTFIXES `2026-05-02 P7-3` entry 참조 (P+ chunker registry task 까지 유지).
|
||||
- (P7-3) 한 PDF 가 N 페이지면 `kebab ingest` 가 N 개 (또는 그 이상의, 페이지 길면 multi-chunk) 의 chunk 를 한 transaction 안에서 commit. 500 페이지 책 → 500+ chunk 한 번에 → embedding throughput 가 bottleneck. 임베딩 활성 워크스페이스에서 큰 PDF 를 처음 ingest 하면 분-단위 시간 + WAL 크기 증가 가능 — P+ 스케일 hardening task 까지 정상 동작이지만 비용은 측정 가능.
|
||||
- (P7-3) 동일 path 에 byte 가 다른 PDF 를 두 번째 ingest 하면 storage UNIQUE 제약 (`assets.workspace_path`) 에 트립 → `errors+=1`. md / image / pdf 모두 동일하지만 P7-3 의 통합 테스트가 처음 노출. 우회: 파일 path 변경 후 ingest. 영구 fix 는 P+ storage 작업.
|
||||
- (P7-3) 동일 path 에 byte 가 다른 PDF 를 두 번째 ingest 하면 `purge_orphan_at_workspace_path` 가 옛 doc / chunks / embeddings 를 sweep 하고 새 byte 가 새 `doc_id` 로 색인됨. `IngestReport` 에 그 자산만 `new+=1` (다른 자산은 `updated`). LanceDB 는 별도 store 라 옛 vector 가 잔존하지만 검색에는 영향 없음 (SQLite join 으로 surface 안 됨) — 디스크 cleanup 은 P+.
|
||||
|
||||
자세한 history 와 발견된 버그는 [tasks/HOTFIXES.md](../tasks/HOTFIXES.md) 참조.
|
||||
|
||||
@@ -22,17 +22,25 @@ git history.
|
||||
|
||||
**Fix 1**: `ingest_one_pdf_asset` (in `kebab-app::lib.rs`) instantiates `PdfPageV1Chunker` directly. The `Chunk.chunker_version` field on emitted PDF chunks records `pdf-page-v1` truthfully. A future P+ task (chunker registry) either splits `Config::chunking.chunker_version` per medium or replaces the dispatch with a runtime registry. No HOTFIX entry needed once that happens — this entry is the cross-reference.
|
||||
|
||||
**Symptom 2 (storage-layer bug, exposed but not fixed by P7-3)**: P7-3's edited-bytes re-ingest test (`re_ingest_edited_pdf_produces_new_doc_id`) tripped on `sqlite error: UNIQUE constraint failed: assets.workspace_path: Error code 2067`. The assets table has a UNIQUE constraint on `workspace_path`, but `upsert_asset_row` (in `kebab-store-sqlite::store.rs:305`) only handles `ON CONFLICT(asset_id)`. When a file's bytes change, the new BLAKE3 produces a new `asset_id` while the `workspace_path` stays the same — INSERT picks the new asset_id branch, then trips the secondary UNIQUE on workspace_path.
|
||||
**Symptom 2 (storage-layer bug, fixed in same PR)**: P7-3's edited-bytes re-ingest test (`re_ingest_edited_pdf_produces_new_doc_id`) tripped on `sqlite error: UNIQUE constraint failed: assets.workspace_path: Error code 2067`. The assets table has a UNIQUE constraint on `workspace_path`, but `upsert_asset_row` (in `kebab-store-sqlite::store.rs`) only handles `ON CONFLICT(asset_id)`. When a file's bytes change, the new BLAKE3 produces a new `asset_id` while the `workspace_path` stays the same — INSERT picks the new asset_id branch, then trips the secondary UNIQUE on `workspace_path`.
|
||||
|
||||
**Why it didn't surface earlier**: No existing test (markdown / image) exercises edited-bytes re-ingest. The image path's `re_ingest_image_produces_updated_with_same_doc_id` uses identical bytes (same asset_id → ON CONFLICT(asset_id) catches it). Real-world editing of a tracked file would hit the same bug across all media types.
|
||||
**Why it didn't surface earlier**: No existing test (markdown / image) exercised edited-bytes re-ingest. The image path's `re_ingest_image_produces_updated_with_same_doc_id` uses identical bytes (same asset_id → `ON CONFLICT(asset_id)` catches it). Real-world editing of a tracked file would hit the same bug across all media types.
|
||||
|
||||
**Fix 2 (deferred)**: Storage-layer fix is out of scope for P7-3. The P7-3 implementation PR `#[ignore]`s the `re_ingest_edited_pdf_produces_new_doc_id` test with a doc-comment pointing here. A P+ storage task either:
|
||||
- Adds `ON CONFLICT(workspace_path) DO UPDATE` alongside the existing `ON CONFLICT(asset_id)` clause (DELETE-the-old + INSERT-the-new in a single statement, since UPSERT can only target one conflict path).
|
||||
- Or drops the UNIQUE constraint on `assets.workspace_path` and relies on application-level uniqueness (workspace_path → asset_id mapping in a separate index table).
|
||||
**Fix 2** (P7-3 implementation PR): new `purge_orphan_at_workspace_path` helper in `kebab-store-sqlite::store.rs`. Runs immediately before each `upsert_asset_row` call (both `put_asset_with_bytes` paths AND `DocumentStore::put_asset`). It:
|
||||
1. SELECTs the stale row at `workspace_path` whose `asset_id` differs from the incoming one (none → no-op return).
|
||||
2. DELETEs from `documents WHERE asset_id = stale` — `documents.asset_id ON DELETE RESTRICT` requires the documents go first; CASCADE on documents → `blocks` / `chunks` / `embedding_records` sweeps the dependent rows in the same statement.
|
||||
3. DELETEs the stale `assets` row, freeing the `workspace_path` slot.
|
||||
4. If the stale storage was `copied`, best-effort removes the byte file at `storage_path` so `data_dir/assets/` does not accumulate orphans across edits.
|
||||
|
||||
**Caveat (still deferred)**: `embedding_records.chunk_id` CASCADE clears the SQLite side, but the LanceDB rows keyed on `chunk_id` live in a separate store and are not touched. Stale vectors do not affect retrieval (search joins through SQLite, so an orphan vector is never surfaced) but they consume disk in `data_dir/lancedb/`. A future task should reconcile by `chunk_id` set diff.
|
||||
|
||||
The previously-`#[ignore]`d `re_ingest_edited_pdf_produces_new_doc_id` integration test runs by default after this fix, plus a dedicated unit test `put_asset_with_bytes_sweeps_workspace_path_orphan` in `kebab-store-sqlite::tests::asset_writer` that exercises the no-documents flavour. Verified end-to-end via the SMOKE runbook: `kebab ingest` → edit a tracked PDF → `kebab ingest` reports `new=1` for that asset (rest `updated`) and the prior doc/chunks are gone from `inspect` / `list docs`.
|
||||
|
||||
**Amends**:
|
||||
- tasks/p7/p7-3-pdf-ingest-wiring.md (chunker_version deviation, edited-bytes test ignored).
|
||||
- (Implicitly) every previous task spec that assumed `assets.workspace_path` UNIQUE was safe — the constraint is in fact too strict for the byte-edit re-ingest case.
|
||||
- tasks/p7/p7-3-pdf-ingest-wiring.md (chunker_version deviation; edited-bytes test runs).
|
||||
- crates/kebab-store-sqlite (new `purge_orphan_at_workspace_path` helper called from both `put_asset_with_bytes` branches and `DocumentStore::put_asset`).
|
||||
- crates/kebab-store-sqlite/tests/asset_writer.rs (`put_asset_with_bytes_sweeps_workspace_path_orphan` replaces the prior orphan-cleanup-on-failure test, since the failure path no longer exists).
|
||||
- docs/SMOKE.md (note that edited-PDF re-ingest produces `new=1` rather than an error).
|
||||
|
||||
## 2026-05-02 — P7-2 pdf-page-v1: chunk_id collision + BYTES_PER_TOKEN
|
||||
|
||||
|
||||
Reference in New Issue
Block a user