P7-3 통합 테스트가 노출한 storage 레이어 버그 fix. `assets.workspace_path` 의 UNIQUE 제약과 `upsert_asset_row` 의 `ON CONFLICT(asset_id)` 만 처리하던 gap 사이 — byte 가 변경된 자산 re-ingest 시 새 asset_id 가 같은 workspace_path 에서 secondary UNIQUE 충돌. md / image / pdf 모두 영향. Fix: - 새 helper `purge_orphan_at_workspace_path` 가 같은 `workspace_path` 의 *다른* `asset_id` 를 발견하면 documents → assets 순서로 sweep. documents 의 ON DELETE RESTRICT 회피 + CASCADE 로 blocks / chunks / embedding_records 정리. copied 모드면 storage_path 의 byte 파일도 best-effort 삭제. - `put_asset_with_bytes` 의 두 분기 (copy / reference) + `DocumentStore ::put_asset` 모두 호출. - 회귀 테스트 `put_asset_with_bytes_sweeps_workspace_path_orphan` (이전 의 "UPSERT 실패시 orphan 청소" 테스트가 더 이상 doable 하지 않으므로 대체). - `re_ingest_edited_pdf_produces_new_doc_id` integration `#[ignore]` 해제 → 9 통합 테스트 모두 default 로 통과. Vector store orphan 은 별도 P+ task — LanceDB 가 SQLite cascade 와 무관하게 운영되므로 stale chunk_id vector 가 디스크에 남음. 검색에는 영향 없음 (search 가 SQLite join 통해 surface). Smoke 검증 (release binary, markdown 2 + image 1 + PDF 2): - doctor pass - 첫 ingest: 5 new - list docs: 5 docs all media types - search lexical "pdf-page-v1 chunker" → whitepaper.pdf hit - search hybrid → cross-media 결과 - inspect doc PDF: parser_version=pdf-text-v1, blocks 가 SourceSpan::Page - 동일 byte re-ingest: 5 updated, 0 errors (P1 idempotency) - byte 수정 후 re-ingest: 1 new (해당 PDF) + 4 updated, 0 errors (storage fix) - corrupt PDF 추가: errors+=1 + IngestItem.error 메시지 정확, 다른 자산 영향 0 - 정리 후 다시 ingest: errors=0 - RAG ask: PDF 인용 + `citations[].citation` 에 `kind: "page"` + `page: <N>` + `path: <pdf_path>` 정확히 노출 운영 fixture 보조: - `crates/kebab-parse-pdf/examples/gen_smoke_pdf.rs` — `cargo run --release --example gen_smoke_pdf -p kebab-parse-pdf -- <out.pdf> <text-pages>` 로 reportlab/qpdf 없이 in-tree PDF 생성. - `crates/kebab-parse-image/examples/gen_smoke_png.rs` — 동일 방식의 PNG fixture 생성. - SMOKE.md 가 두 example 사용법 + 갱신된 HOTFIXES 동작 (byte 수정 시 errors+=1 → new+=1) 반영. HOTFIXES `2026-05-02 P7-3` entry 가 \"deferred\" → \"fixed in same PR\" 로 업데이트, vector store orphan caveat 만 남음. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
237 lines
8.7 KiB
Rust
237 lines
8.7 KiB
Rust
//! Asset writer tests: copy mode (file written 0o644), reference mode
|
|
//! (no copy, row records source), and checksum mismatch (Conflict).
|
|
|
|
use std::path::PathBuf;
|
|
|
|
use kebab_core::{AssetId, AssetStorage, Checksum, MediaType, RawAsset, SourceUri, WorkspacePath};
|
|
use kebab_store_sqlite::SqliteStore;
|
|
use time::OffsetDateTime;
|
|
|
|
mod common;
|
|
|
|
fn fixed_asset(_bytes: &[u8], byte_len: u64, declared_checksum: &str) -> RawAsset {
|
|
RawAsset {
|
|
// 32-hex AssetId per kb-core newtype invariant.
|
|
asset_id: AssetId("a".repeat(32)),
|
|
source_uri: SourceUri::File(PathBuf::from("/some/source.md")),
|
|
workspace_path: WorkspacePath::new("notes/foo.md".into()).unwrap(),
|
|
media_type: MediaType::Markdown,
|
|
byte_len,
|
|
checksum: Checksum(declared_checksum.into()),
|
|
discovered_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
|
stored: AssetStorage::Reference {
|
|
path: PathBuf::from("/some/source.md"),
|
|
sha: Checksum("0".repeat(64)),
|
|
},
|
|
}
|
|
}
|
|
|
|
fn b3_full_hex(bytes: &[u8]) -> String {
|
|
blake3::hash(bytes).to_hex().to_string()
|
|
}
|
|
|
|
#[test]
|
|
fn copy_mode_writes_file_with_0o644_and_correct_bytes() {
|
|
let env = common::TestEnv::with_threshold(100);
|
|
let store = SqliteStore::open(&env.config()).unwrap();
|
|
store.run_migrations().unwrap();
|
|
|
|
let bytes = b"hello, sqlite";
|
|
let cs = b3_full_hex(bytes);
|
|
let asset = fixed_asset(bytes, bytes.len() as u64, &cs);
|
|
|
|
store.put_asset_with_bytes(&asset, bytes).expect("write");
|
|
|
|
// Path: data_dir/assets/aa/aaaaaa…aa
|
|
let aa = &asset.asset_id.0[..2];
|
|
let dest = env.data_dir().join("assets").join(aa).join(&asset.asset_id.0);
|
|
assert!(dest.exists(), "asset file not written at {}", dest.display());
|
|
let on_disk = std::fs::read(&dest).unwrap();
|
|
assert_eq!(on_disk, bytes);
|
|
|
|
// Mode 0o644 on Unix.
|
|
#[cfg(unix)]
|
|
{
|
|
use std::os::unix::fs::PermissionsExt;
|
|
let mode = std::fs::metadata(&dest).unwrap().permissions().mode() & 0o777;
|
|
assert_eq!(mode, 0o644, "expected 0o644, got 0o{mode:o}");
|
|
}
|
|
|
|
// Row recorded copied.
|
|
let storage_kind: String = env.with_conn(|c| {
|
|
c.query_row(
|
|
"SELECT storage_kind FROM assets WHERE asset_id = ?",
|
|
[&asset.asset_id.0],
|
|
|r| r.get(0),
|
|
)
|
|
});
|
|
assert_eq!(storage_kind, "copied");
|
|
}
|
|
|
|
#[test]
|
|
fn reference_mode_does_not_write_file_but_records_path() {
|
|
// copy_threshold_mb=0 → every byte lands on the reference branch.
|
|
let env = common::TestEnv::with_threshold(0);
|
|
let store = SqliteStore::open(&env.config()).unwrap();
|
|
store.run_migrations().unwrap();
|
|
|
|
let bytes = b"big-pretend-bytes";
|
|
let cs = b3_full_hex(bytes);
|
|
// byte_len declared > 0 so the threshold check picks reference. With
|
|
// copy_threshold_bytes=0 even byte_len=1 trips the else branch.
|
|
let mut asset = fixed_asset(bytes, 1, &cs);
|
|
asset.source_uri = SourceUri::File(PathBuf::from("/path/to/original.md"));
|
|
|
|
store.put_asset_with_bytes(&asset, bytes).expect("ref write");
|
|
|
|
let aa = &asset.asset_id.0[..2];
|
|
let dest = env.data_dir().join("assets").join(aa).join(&asset.asset_id.0);
|
|
assert!(!dest.exists(), "reference mode must not copy bytes");
|
|
|
|
let (storage_kind, storage_path): (String, String) = env.with_conn(|c| {
|
|
c.query_row(
|
|
"SELECT storage_kind, storage_path FROM assets WHERE asset_id = ?",
|
|
[&asset.asset_id.0],
|
|
|r| Ok((r.get(0)?, r.get(1)?)),
|
|
)
|
|
});
|
|
assert_eq!(storage_kind, "reference");
|
|
assert_eq!(storage_path, "/path/to/original.md");
|
|
}
|
|
|
|
#[test]
|
|
fn put_asset_with_bytes_sweeps_workspace_path_orphan() {
|
|
// HOTFIXES 2026-05-02 P7-3: the original behaviour erred on
|
|
// workspace_path UNIQUE conflict (`ON CONFLICT(asset_id)` only) so
|
|
// a re-ingest of an edited file was unrecoverable. The fix is
|
|
// `purge_orphan_at_workspace_path`, which sweeps the stale
|
|
// documents → assets chain before the new INSERT lands.
|
|
//
|
|
// This test exercises the no-documents flavour (raw asset row only)
|
|
// — the put_asset_with_bytes path. The documents-cascade flavour
|
|
// is exercised end-to-end in `kebab-app::tests::pdf_pipeline::
|
|
// re_ingest_edited_pdf_produces_new_doc_id`.
|
|
let env = common::TestEnv::with_threshold(100);
|
|
let store = SqliteStore::open(&env.config()).unwrap();
|
|
store.run_migrations().unwrap();
|
|
|
|
// Pre-populate a row that owns `notes/foo.md` under a *different*
|
|
// asset_id, simulating a prior ingest of an earlier byte version.
|
|
env.with_conn(|c| {
|
|
c.execute(
|
|
"INSERT INTO assets (
|
|
asset_id, source_uri, workspace_path, media_type, byte_len,
|
|
checksum, storage_kind, storage_path, discovered_at
|
|
) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
|
|
rusqlite::params![
|
|
"b".repeat(32),
|
|
"file:///elsewhere/foo.md",
|
|
"notes/foo.md",
|
|
"\"markdown\"",
|
|
7_i64,
|
|
"0".repeat(64),
|
|
"reference",
|
|
"/elsewhere/foo.md",
|
|
"2024-01-01T00:00:00Z",
|
|
],
|
|
)
|
|
});
|
|
|
|
let bytes = b"hello, sqlite";
|
|
let cs = b3_full_hex(bytes);
|
|
let asset = fixed_asset(bytes, bytes.len() as u64, &cs);
|
|
|
|
store
|
|
.put_asset_with_bytes(&asset, bytes)
|
|
.expect("orphan sweep + INSERT must succeed");
|
|
|
|
// Stale row gone, new row owns the workspace_path.
|
|
let stale_count: i64 = env.with_conn(|c| {
|
|
c.query_row(
|
|
"SELECT COUNT(*) FROM assets WHERE asset_id = ?",
|
|
rusqlite::params!["b".repeat(32)],
|
|
|row| row.get(0),
|
|
)
|
|
});
|
|
assert_eq!(stale_count, 0, "stale asset_id must be purged");
|
|
let new_count: i64 = env.with_conn(|c| {
|
|
c.query_row(
|
|
"SELECT COUNT(*) FROM assets WHERE asset_id = ?",
|
|
rusqlite::params![asset.asset_id.0],
|
|
|row| row.get(0),
|
|
)
|
|
});
|
|
assert_eq!(new_count, 1, "new asset_id must own the workspace_path slot");
|
|
|
|
// New asset's bytes published at the final destination.
|
|
let aa = &asset.asset_id.0[..2];
|
|
let dest = env.data_dir().join("assets").join(aa).join(&asset.asset_id.0);
|
|
assert!(
|
|
dest.exists(),
|
|
"new asset bytes must be visible at {}",
|
|
dest.display()
|
|
);
|
|
}
|
|
|
|
#[test]
|
|
fn put_asset_with_bytes_rejects_invalid_asset_id() {
|
|
// `kebab_core::AssetId(pub String)` lets a hand-construction bypass the
|
|
// 32-hex `FromStr` invariant. The store boundary must reject any ID
|
|
// whose shape would let path construction escape `data_dir/assets/`.
|
|
let env = common::TestEnv::with_threshold(100);
|
|
let store = SqliteStore::open(&env.config()).unwrap();
|
|
store.run_migrations().unwrap();
|
|
|
|
// 32 chars but contains a `/` — would let `assets_path_for` stitch
|
|
// together a path outside the shard tree.
|
|
let evil_id = "../etc/passwd_padded_to_xx_xxxxx".to_string();
|
|
assert_eq!(evil_id.len(), 32, "test fixture must be 32 chars to exercise length-only checks");
|
|
let mut asset = fixed_asset(b"x", 1, &b3_full_hex(b"x"));
|
|
asset.asset_id = AssetId(evil_id.clone());
|
|
|
|
let err = store
|
|
.put_asset_with_bytes(&asset, b"x")
|
|
.expect_err("must reject non-hex AssetId");
|
|
let msg = format!("{err:#}");
|
|
assert!(
|
|
msg.contains("invalid AssetId shape"),
|
|
"expected AssetId-shape rejection, got: {msg}"
|
|
);
|
|
|
|
// And the bytes must NOT have been staged anywhere under the assets
|
|
// tree (no I/O should have happened before validation).
|
|
let assets_dir = env.data_dir().join("assets");
|
|
if assets_dir.exists() {
|
|
for entry in std::fs::read_dir(&assets_dir).unwrap().flatten() {
|
|
// Recurse one level into shard dirs and assert empty.
|
|
if let Some(sub) = std::fs::read_dir(entry.path()).unwrap().flatten().next() {
|
|
panic!(
|
|
"invalid AssetId still produced filesystem artifact at {}",
|
|
sub.path().display()
|
|
);
|
|
}
|
|
}
|
|
}
|
|
}
|
|
|
|
#[test]
|
|
fn checksum_mismatch_returns_conflict() {
|
|
let env = common::TestEnv::new();
|
|
let store = SqliteStore::open(&env.config()).unwrap();
|
|
store.run_migrations().unwrap();
|
|
|
|
let bytes = b"the real bytes";
|
|
// Tampered checksum: hash a different payload.
|
|
let wrong_cs = b3_full_hex(b"different bytes");
|
|
let asset = fixed_asset(bytes, bytes.len() as u64, &wrong_cs);
|
|
|
|
let err = store
|
|
.put_asset_with_bytes(&asset, bytes)
|
|
.expect_err("must reject checksum mismatch");
|
|
let msg = format!("{err:#}");
|
|
assert!(
|
|
msg.contains("checksum mismatch") || msg.contains("conflict"),
|
|
"expected Conflict-flavoured error, got: {msg}"
|
|
);
|
|
}
|