Merge pull request 'fix(dogfood): auto-purge stored docs for filesystem-deleted files' (#148) from fix/dogfood-file-deletion-auto-purge into main

docs(dogfood): sync sweep_deleted_files algorithm doc with try_exists (PR #148 nit)
Round 2 review found that the function-level doc comment still referenced the old fs::exists() call, which commit 2baa846 replaced with try_exists().unwrap_or(true). This is a one-line clarification that describes the conservative-on-Err semantics so future readers don't reintroduce the data-safety bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
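The conservative-on-Err rule described above can be illustrated with a small standalone sketch. It is not the PR's code: `is_definitely_absent` and `purge_candidates` are hypothetical names, plain `PathBuf` values stand in for `WorkspacePath`, and only the standard library is used. The sketch assumes the same two-step filter as the sweep: a stored path becomes a purge candidate only if it is outside the scanned set and `try_exists()` returns `Ok(false)`.

```rust
use std::collections::HashSet;
use std::path::{Path, PathBuf};

/// Hypothetical helper: a path counts as absent only when the
/// filesystem answers `Ok(false)`. An `Err(_)` (permission denied,
/// flaky network mount, ...) is treated as "still present", whereas
/// `Path::exists()` would collapse that error into `false`.
fn is_definitely_absent(path: &Path) -> bool {
    !path.try_exists().unwrap_or(true)
}

/// Hypothetical stand-in for the sweep's candidate selection.
/// Returns the stored relative paths that were not seen by the
/// current scan AND are confirmed absent under `workspace_root`.
fn purge_candidates(
    workspace_root: &Path,
    stored: &[PathBuf],
    scanned: &HashSet<PathBuf>,
) -> Vec<PathBuf> {
    stored
        .iter()
        // Step 1: set difference, stored minus scanned.
        .filter(|p| !scanned.contains(*p))
        // Step 2: existence check, so out-of-scope files that are
        // still on disk are left alone.
        .filter(|p| is_definitely_absent(&workspace_root.join(p)))
        .cloned()
        .collect()
}
```

With a workspace that contains `a.rs` and `b.rs`, a stored set of `a.rs`, `b.rs`, and `gone.rs`, and a scan that only covers `a.rs`, this sketch selects `gone.rs` alone: `b.rs` is out of scope but still on disk, so it stays in the store.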
2026-05-20 07:10:33 +00:00 · 2026-05-20 07:10:27 +00:00 · 2026-05-20 07:04:03 +00:00 · 2026-05-20 06:51:07 +00:00 · 2026-05-20 06:29:27 +00:00 · 2026-05-20 06:28:41 +00:00
18 changed files with 730 additions and 68 deletions
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -4127,7 +4127,7 @@ dependencies = [

 [[package]]
 name = "kebab-app"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "anyhow",
 "base64 0.22.1",
@@ -4172,7 +4172,7 @@ dependencies = [

 [[package]]
 name = "kebab-chunk"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "anyhow",
 "blake3",
@@ -4187,7 +4187,7 @@ dependencies = [

 [[package]]
 name = "kebab-cli"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "anyhow",
 "clap",
@@ -4208,7 +4208,7 @@ dependencies = [

 [[package]]
 name = "kebab-config"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "anyhow",
 "dirs 5.0.1",
@@ -4223,7 +4223,7 @@ dependencies = [

 [[package]]
 name = "kebab-core"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "anyhow",
 "blake3",
@@ -4237,7 +4237,7 @@ dependencies = [

 [[package]]
 name = "kebab-embed"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "anyhow",
 "blake3",
@@ -4251,7 +4251,7 @@ dependencies = [

 [[package]]
 name = "kebab-embed-local"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "anyhow",
 "fastembed",
@@ -4264,7 +4264,7 @@ dependencies = [

 [[package]]
 name = "kebab-eval"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "anyhow",
 "kebab-app",
@@ -4283,7 +4283,7 @@ dependencies = [

 [[package]]
 name = "kebab-llm"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "anyhow",
 "kebab-core",
@@ -4292,7 +4292,7 @@ dependencies = [

 [[package]]
 name = "kebab-llm-local"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "anyhow",
 "kebab-config",
@@ -4309,7 +4309,7 @@ dependencies = [

 [[package]]
 name = "kebab-mcp"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "anyhow",
 "kebab-app",
@@ -4327,7 +4327,7 @@ dependencies = [

 [[package]]
 name = "kebab-normalize"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "anyhow",
 "kebab-core",
@@ -4342,7 +4342,7 @@ dependencies = [

 [[package]]
 name = "kebab-parse-code"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "anyhow",
 "gix",
@@ -4360,7 +4360,7 @@ dependencies = [

 [[package]]
 name = "kebab-parse-image"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "ab_glyph",
 "anyhow",
@@ -4384,7 +4384,7 @@ dependencies = [

 [[package]]
 name = "kebab-parse-md"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "anyhow",
 "kebab-core",
@@ -4401,7 +4401,7 @@ dependencies = [

 [[package]]
 name = "kebab-parse-pdf"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "anyhow",
 "blake3",
@@ -4414,7 +4414,7 @@ dependencies = [

 [[package]]
 name = "kebab-parse-types"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "kebab-core",
 "serde",
@@ -4422,7 +4422,7 @@ dependencies = [

 [[package]]
 name = "kebab-rag"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "anyhow",
 "blake3",
@@ -4443,7 +4443,7 @@ dependencies = [

 [[package]]
 name = "kebab-search"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "anyhow",
 "globset",
@@ -4462,7 +4462,7 @@ dependencies = [

 [[package]]
 name = "kebab-source-fs"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "anyhow",
 "blake3",
@@ -4481,7 +4481,7 @@ dependencies = [

 [[package]]
 name = "kebab-store-sqlite"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "anyhow",
 "blake3",
@@ -4502,7 +4502,7 @@ dependencies = [

 [[package]]
 name = "kebab-store-vector"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "anyhow",
 "arrow",
@@ -4526,7 +4526,7 @@ dependencies = [

 [[package]]
 name = "kebab-tui"
-version = "0.8.2"
+version = "0.9.0"
 dependencies = [
 "anyhow",
 "crossterm",
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -31,7 +31,7 @@ edition       = "2024"
 rust-version  = "1.85"
 license       = "MIT OR Apache-2.0"
 repository    = "https://github.com/altair823/kebab"
-version       = "0.8.2"
+version       = "0.9.0"

 [workspace.dependencies]
 anyhow       = "1"
--- a/crates/kebab-app/src/lib.rs
+++ b/crates/kebab-app/src/lib.rs
@@ -375,6 +375,28 @@ pub fn ingest_with_config_opts(
        .map(|d| d.doc_id.0)
        .collect();

+    // Dogfood: post-walker sweep to remove stored docs whose source
+    // file has been deleted from the filesystem. Must run BEFORE the
+    // per-asset loop so the loop's New/Updated labelling is based on
+    // the post-purge store state (the purged doc_ids won't be in
+    // `existing_doc_ids` above — they were already removed, OR the
+    // sweep here removes them before we start counting).
+    //
+    // Critical design invariant: only purge when the file is TRULY
+    // absent from disk. A file that is still on disk but outside the
+    // current walker scope (config narrowing / include-glob change) is
+    // NOT purged — we leave it in place to protect against accidental
+    // data loss via config edits.
+    let scanned_paths: std::collections::HashSet<kebab_core::WorkspacePath> = assets
+        .iter()
+        .map(|a| a.workspace_path.clone())
+        .collect();
+    let purged_deleted_files = sweep_deleted_files(
+        &app,
+        &scanned_paths,
+        vector_store.as_ref().map(|v| v.as_ref()),
+    )?;
+
    let started_at = time::OffsetDateTime::now_utc();

    let mut items: Vec<kebab_core::IngestItem> = Vec::new();
@@ -647,11 +669,11 @@ pub fn ingest_with_config_opts(
    crate::ingest_progress::emit(progress, terminal_event);

    // p9-fb-19: bump the persistent corpus_revision counter when a
-    // commit landed (any new / updated). This invalidates every
+    // commit landed (any new / updated / purged). This invalidates every
    // entry in any in-process LRU search cache (in this process or
    // a sibling) on the next lookup. No-op when nothing changed
    // (skipped-only run) — the cache stays valid.
-    if new_count > 0 || updated_count > 0 {
+    if new_count > 0 || updated_count > 0 || purged_deleted_files > 0 {
        match app.sqlite.bump_corpus_revision() {
            Ok(rev) => tracing::debug!(
                target: "kebab-app",
@@ -682,6 +704,7 @@ pub fn ingest_with_config_opts(
        skipped_generated: fs_skips.skipped_generated,
        skipped_size_exceeded: fs_skips.skipped_size_exceeded,
        skip_examples: fs_skips.skip_examples,
+        purged_deleted_files,
        items: if summary_only { None } else { Some(items) },
    })
 }
@@ -748,15 +771,18 @@ struct ImagePipeline<'a> {
 /// hold (per design §9 cascade rule):
 ///
 /// 1. `force_reingest == false` — caller hasn't asked to bypass skip.
-/// 2. The freshly-scanned asset's blake3 checksum equals what the
-///    existing `assets` row stores at the same `workspace_path`.
-/// 3. The doc keyed on `(workspace_path, asset_id, current_parser_version)`
-///    exists. If the parser_version changed, `id_for_doc` produces a
-///    different `doc_id` so the lookup misses → no skip → re-process.
-/// 4. The existing doc's stamped `last_chunker_version` AND
-///    `last_embedding_version` match the values the caller is about
-///    to use (`Some(v) == Some(v)` and `None == None` — see design
-///    doc for the `None == None` rule when no embedder is configured).
+/// 2. A document already exists at this `workspace_path`
+///    (`get_document_by_workspace_path`). The lookup is document-side, not
+///    asset-side, so twin files (identical content at different paths) each
+///    hit their own stable doc row — `documents.workspace_path` is UNIQUE
+///    while `assets` may dedupe content into a single row with a flip-flop
+///    `workspace_path` column (dogfood bug #4, see `tasks/HOTFIXES.md`).
+/// 3. The existing doc's `source_asset_id` equals the freshly-scanned
+///    asset's blake3 checksum (content unchanged).
+/// 4. The existing doc's `parser_version` matches the current extractor's
+///    `parser_version` (extractor not upgraded). Combined with `chunker_version`
+///    and `last_embedding_version` checks immediately below — full cascade
+///    per design §9.
 ///
 /// Returns `Ok(None)` (proceed with full re-process) when any check
 /// fails or any DB read errors out — the skip path is opportunistic;
@@ -773,31 +799,19 @@ fn try_skip_unchanged(
    if force_reingest {
        return Ok(None);
    }
-    let existing_asset = match app
+    // Document-centric skip: look up the existing document row by
+    // workspace_path directly. This avoids the twin-file flip-flop
+    // that the old asset-side lookup suffers from — multiple files
+    // with identical content share one `assets` row whose
+    // `workspace_path` is overwritten on every UPSERT, so
+    // `get_asset_by_workspace_path(path1)` could return the OTHER
+    // twin's path (or None) after any ingest of the twin. The
+    // `documents` table has a UNIQUE index on `workspace_path` (V001),
+    // so each twin has its own stable row regardless of asset de-dup.
+    let existing_doc = match app
        .sqlite
-        .get_asset_by_workspace_path(&asset.workspace_path)
+        .get_document_by_workspace_path(&asset.workspace_path)
    {
-        Ok(Some(a)) => a,
-        Ok(None) => return Ok(None),
-        Err(e) => {
-            tracing::debug!(
-                target: "kebab-app",
-                path = %asset.workspace_path.0,
-                error = %e,
-                "skip-check: get_asset_by_workspace_path failed; falling through to re-process"
-            );
-            return Ok(None);
-        }
-    };
-    if existing_asset.checksum != asset.checksum {
-        return Ok(None);
-    }
-    let candidate_doc_id = kebab_core::id_for_doc(
-        &asset.workspace_path,
-        &asset.asset_id,
-        current_parser_version,
-    );
-    let existing_doc = match app.sqlite.get_document(&candidate_doc_id) {
        Ok(Some(d)) => d,
        Ok(None) => return Ok(None),
        Err(e) => {
@@ -805,21 +819,37 @@ fn try_skip_unchanged(
                target: "kebab-app",
                path = %asset.workspace_path.0,
                error = %e,
-                "skip-check: get_document failed; falling through to re-process"
+                "skip-check: get_document_by_workspace_path failed; falling through to re-process"
            );
            return Ok(None);
        }
    };
+    // 1. Content unchanged: the freshly-computed asset_id (blake3
+    //    content hash) must match what this document was ingested from.
+    if existing_doc.source_asset_id != asset.asset_id {
+        return Ok(None);
+    }
+    // 2. Parser unchanged: parser_version is baked into id_for_doc, so
+    //    the old doc_id-keyed lookup missed after a version bump. The
+    //    workspace_path lookup above still finds the pre-bump row, so
+    //    this explicit check is what forces re-processing when the
+    //    extractor's parser_version changes.
+    if existing_doc.parser_version != *current_parser_version {
+        return Ok(None);
+    }
+    // 3. Chunker unchanged.
    let chunker_match = existing_doc.last_chunker_version.as_ref()
        == Some(current_chunker_version);
    if !chunker_match {
        return Ok(None);
    }
+    // 4. Embedder unchanged.
    let embedder_match = existing_doc.last_embedding_version.as_ref()
        == current_embedding_version;
    if !embedder_match {
        return Ok(None);
    }
+    let candidate_doc_id = existing_doc.doc_id.clone();
    tracing::debug!(
        target: "kebab-app::ingest",
        path = %asset.workspace_path.0,
@@ -1446,6 +1476,120 @@ fn purge_vector_orphans_for_workspace_path(
    Ok(())
 }

+/// Dogfood: post-walker sweep that purges stored documents whose source
+/// file has been physically deleted from the filesystem.
+///
+/// Algorithm:
+/// 1. Query `documents` for every `workspace_path` currently stored.
+/// 2. Compute `orphan_candidates = stored_paths - scanned_paths`.
+/// 3. For each candidate: resolve to an absolute path and call
+///    `Path::try_exists().unwrap_or(true)` — transient FS errors
+///    (EACCES, NFS hiccup, ownership change) conservatively count as
+///    "still present" so we never purge on uncertain signal. If the
+///    file still exists on disk it was merely out-of-scope this run
+///    (config narrowing / include-glob change) — leave it alone. Only
+///    files that are truly absent trigger a purge.
+/// 4. For absent files: call `purge_deleted_workspace_path` (SQLite
+///    cascade delete + optional copied-asset file removal) and, if a
+///    vector store is present, delete the associated vectors.
+///
+/// Returns the number of documents purged.
+///
+/// Non-fatal design: individual purge failures are logged and counted
+/// as errors on the per-file level but do NOT abort the sweep — a
+/// partial failure is preferable to blocking the rest of ingest. The
+/// return value only counts successful purges.
+fn sweep_deleted_files(
+    app: &App,
+    scanned_paths: &std::collections::HashSet<kebab_core::WorkspacePath>,
+    vector_store: Option<&kebab_store_vector::LanceVectorStore>,
+) -> anyhow::Result<u32> {
+    use kebab_core::DocumentStore as _;
+
+    let stored_paths = app
+        .sqlite
+        .all_workspace_paths()
+        .context("sweep_deleted_files: all_workspace_paths")?;
+
+    if stored_paths.is_empty() {
+        return Ok(0);
+    }
+
+    let workspace_root = app.config.resolve_workspace_root();
+    let mut purged: u32 = 0;
+
+    for stored_path in stored_paths {
+        if scanned_paths.contains(&stored_path) {
+            continue; // still in scope — skip
+        }
+
+        // Resolve to an absolute path and check existence on disk.
+        // Use `try_exists` + `unwrap_or(true)` so transient FS errors
+        // (EACCES on a path we lack read on, NFS hiccups, ownership
+        // change) are CONSERVATIVELY treated as "file still present" —
+        // never purge on uncertain signal (data-safety: PR #148 review).
+        // `exists()` would return false on Err and trigger a wrongful
+        // purge. Files whose path cannot be joined (theoretically
+        // impossible for non-empty workspace_path strings, but
+        // defense-in-depth) are likewise treated as still present.
+        let abs = workspace_root.join(&stored_path.0);
+        if abs.try_exists().unwrap_or(true) {
+            // File is on disk but not in this scan's scope (config
+            // narrowing). DO NOT purge — critical design constraint.
+            tracing::debug!(
+                target: "kebab-app",
+                path = %stored_path.0,
+                "sweep_deleted_files: file on disk but out of scope — leaving in store"
+            );
+            continue;
+        }
+
+        // File is truly absent → purge.
+        let chunk_ids = match kebab_store_sqlite::purge_deleted_workspace_path(
+            &app.sqlite,
+            &stored_path,
+        ) {
+            Ok(ids) => ids,
+            Err(e) => {
+                tracing::warn!(
+                    target: "kebab-app",
+                    path = %stored_path.0,
+                    error = %e,
+                    "sweep_deleted_files: purge failed; skipping this path"
+                );
+                continue;
+            }
+        };
+
+        // Purge associated vectors (best-effort; partial failure
+        // acceptable — orphan vectors get cleaned by `kebab reset
+        // --vector-only` if they accumulate).
+        if let Some(vec) = vector_store {
+            if !chunk_ids.is_empty() {
+                use kebab_core::VectorStore as _;
+                if let Err(e) = vec.delete_by_chunk_ids(&chunk_ids) {
+                    tracing::warn!(
+                        target: "kebab-app",
+                        path = %stored_path.0,
+                        count = chunk_ids.len(),
+                        error = %e,
+                        "sweep_deleted_files: vector delete failed; SQLite side already cleaned"
+                    );
+                }
+            }
+        }
+
+        tracing::info!(
+            target: "kebab-app",
+            path = %stored_path.0,
+            "sweep_deleted_files: purged document for deleted file"
+        );
+        purged = purged.saturating_add(1);
+    }
+
+    Ok(purged)
+}
+
 /// P7-3: process one `MediaType::Pdf` asset end-to-end.
 ///
 /// - Reads bytes from disk.
--- a/crates/kebab-app/tests/file_deletion_auto_purge.rs
+++ b/crates/kebab-app/tests/file_deletion_auto_purge.rs
@@ -0,0 +1,178 @@
+//! Dogfood: auto-purge stored docs for filesystem-deleted files.
+//!
+//! Two tests:
+//!
+//! 1. `file_deletion_auto_purge` — ingest 2 files, delete one, re-ingest.
+//!    The re-ingest must report `purged_deleted_files = 1`, the deleted
+//!    file must no longer appear in `list_docs`, and lexical search for
+//!    its unique content must return no hits.
+//!
+//! 2. `include_scope_narrowing_does_not_purge` — ingest 2 files under a
+//!    wide glob, narrow the walker scope to only one file, re-ingest.
+//!    The narrowed ingest must NOT purge the out-of-scope file because
+//!    the file is still on disk (just excluded from this run). Protects
+//!    users against accidental data loss via config edits.
+
+mod common;
+
+use common::TestEnv;
+use kebab_app::ingest_with_config_opts;
+use kebab_app::IngestOpts;
+use kebab_core::{DocFilter, DocumentStore, SearchMode, SearchQuery, SourceScope};
+
+/// Helper: open the store via `TestEnv` and run `list_documents`.
+fn list_doc_paths(env: &TestEnv) -> Vec<String> {
+    use kebab_store_sqlite::SqliteStore;
+    let store = SqliteStore::open(&env.config).unwrap();
+    store.run_migrations().unwrap();
+    store
+        .list_documents(&DocFilter::default())
+        .unwrap()
+        .into_iter()
+        .map(|d| d.doc_path.0)
+        .collect()
+}
+
+#[test]
+fn file_deletion_auto_purge() {
+    let env = TestEnv::lexical_only();
+
+    // Write two .rs files into the workspace.
+    let a_path = env.workspace_root.join("a.rs");
+    let b_path = env.workspace_root.join("b.rs");
+    std::fs::write(&a_path, "// file a\nfn alpha() {}\n").unwrap();
+    std::fs::write(&b_path, "// file b\nfn bravo() {}\n").unwrap();
+
+    // First ingest — both must be New.
+    let first = ingest_with_config_opts(
+        env.config.clone(),
+        env.scope(),
+        false,
+        IngestOpts::default(),
+    )
+    .expect("first ingest must succeed");
+    // Only count the .rs files we added (there may be fixture files too).
+    let first_new = first.new;
+    assert!(first_new >= 2, "expected at least 2 new docs: {first:?}");
+    assert_eq!(
+        first.purged_deleted_files, 0,
+        "no purges on first ingest: {first:?}"
+    );
+    assert_eq!(first.errors, 0, "no errors on first ingest: {first:?}");
+
+    // Delete one file from the filesystem.
+    std::fs::remove_file(&b_path).expect("remove b.rs");
+
+    // Second ingest — scanned count drops by 1; b.rs should be purged.
+    let second = ingest_with_config_opts(
+        env.config.clone(),
+        env.scope(),
+        false,
+        IngestOpts::default(),
+    )
+    .expect("second ingest must succeed");
+
+    assert_eq!(
+        second.purged_deleted_files, 1,
+        "exactly 1 file should be purged: {second:?}"
+    );
+    assert_eq!(second.new, 0, "no new docs after deletion: {second:?}");
+    assert_eq!(second.updated, 0, "no updated docs: {second:?}");
+    assert_eq!(second.errors, 0, "no errors: {second:?}");
+
+    // b.rs must no longer appear in list_docs.
+    let doc_paths = list_doc_paths(&env);
+    let b_ws_path = "b.rs";
+    assert!(
+        !doc_paths.iter().any(|p| p == b_ws_path),
+        "b.rs must be gone from list_docs; got: {doc_paths:?}"
+    );
+    // a.rs must still be present.
+    let a_ws_path = "a.rs";
+    assert!(
+        doc_paths.iter().any(|p| p == a_ws_path),
+        "a.rs must still be in list_docs; got: {doc_paths:?}"
+    );
+
+    // Lexical search for b.rs's unique content returns no hits.
+    let app = env.app();
+    let query = SearchQuery {
+        text: "bravo".to_string(),
+        mode: SearchMode::Lexical,
+        k: 10,
+        filters: kebab_core::SearchFilters::default(),
+    };
+    let hits = app.search(query).expect("search must not error");
+    assert!(
+        hits.is_empty(),
+        "search for deleted file's content must return no hits; got: {hits:?}"
+    );
+}
+
+#[test]
+fn include_scope_narrowing_does_not_purge() {
+    let env = TestEnv::lexical_only();
+
+    // Write two .rs files.
+    let a_path = env.workspace_root.join("a_narrow.rs");
+    let b_path = env.workspace_root.join("b_narrow.rs");
+    std::fs::write(&a_path, "// narrow a\nfn alpha_narrow() {}\n").unwrap();
+    std::fs::write(&b_path, "// narrow b\nfn bravo_narrow() {}\n").unwrap();
+
+    // Wide scope: first ingest — both must be New.
+    let wide_scope = SourceScope {
+        root: env.workspace_root.clone(),
+        include: vec!["**/*.rs".to_string()],
+        exclude: env.config.workspace.exclude.clone(),
+    };
+    let first = ingest_with_config_opts(
+        env.config.clone(),
+        wide_scope,
+        false,
+        IngestOpts::default(),
+    )
+    .expect("first ingest (wide) must succeed");
+    assert!(
+        first.new >= 2,
+        "expected at least 2 new docs: {first:?}"
+    );
+    assert_eq!(
+        first.purged_deleted_files, 0,
+        "no purges on first ingest: {first:?}"
+    );
+
+    // Narrow scope: only a_narrow.rs in include — b_narrow.rs is still
+    // on disk but excluded from the walker scope.
+    let narrow_scope = SourceScope {
+        root: env.workspace_root.clone(),
+        include: vec!["a_narrow.rs".to_string()],
+        exclude: env.config.workspace.exclude.clone(),
+    };
+    let second = ingest_with_config_opts(
+        env.config.clone(),
+        narrow_scope,
+        false,
+        IngestOpts::default(),
+    )
+    .expect("second ingest (narrow) must succeed");
+
+    // CRITICAL: b_narrow.rs is still on disk — must NOT be purged.
+    assert_eq!(
+        second.purged_deleted_files, 0,
+        "scope-narrowing must NOT purge on-disk files; got: {second:?}"
+    );
+    assert_eq!(second.errors, 0, "no errors: {second:?}");
+
+    // b_narrow.rs must still exist in the store.
+    let doc_paths = list_doc_paths(&env);
+    let b_ws_path = "b_narrow.rs";
+    assert!(
+        doc_paths.iter().any(|p| p == b_ws_path),
+        "b_narrow.rs must still be in list_docs after scope narrowing; got: {doc_paths:?}"
+    );
+    // And the file must still be on disk.
+    assert!(
+        b_path.exists(),
+        "b_narrow.rs must still be on disk (we didn't delete it)"
+    );
+}
--- a/crates/kebab-app/tests/twin_files_idempotent.rs
+++ b/crates/kebab-app/tests/twin_files_idempotent.rs
@@ -0,0 +1,90 @@
+//! Regression test for the twin-file idempotency bug.
+//!
+//! Identical-content files at different workspace paths share one
+//! `assets` row (`asset_id` = blake3 content hash, PRIMARY KEY). The
+//! old UPSERT `ON CONFLICT(asset_id) DO UPDATE SET workspace_path =
+//! excluded.workspace_path` made each twin overwrite the other's path
+//! on every ingest, so `get_asset_by_workspace_path(path1)` returned
+//! None (or the wrong twin) → re-process every time.
+//!
+//! Fix: `try_skip_unchanged` now uses `get_document_by_workspace_path`
+//! instead.  `documents.workspace_path` is UNIQUE (V001) so each twin
+//! has its own stable document row.
+//!
+//! Assertion contract:
+//!   1st ingest → 2 New (one per twin)
+//!   2nd ingest → 0 New, 0 Updated, 2 Unchanged
+
+mod common;
+
+use common::TestEnv;
+use kebab_app::ingest_with_config;
+use kebab_core::IngestItemKind;
+
+#[test]
+fn twin_files_second_ingest_is_unchanged() {
+    let env = TestEnv::lexical_only();
+
+    // Write two files with identical content at different paths.
+    let pkg_a = env.workspace_root.join("pkg_a");
+    let pkg_b = env.workspace_root.join("pkg_b");
+    std::fs::create_dir_all(&pkg_a).unwrap();
+    std::fs::create_dir_all(&pkg_b).unwrap();
+
+    let content = b"# shared\nThis content is identical in both files.\n";
+    std::fs::write(pkg_a.join("__init__.py"), content).unwrap();
+    std::fs::write(pkg_b.join("__init__.py"), content).unwrap();
+
+    // First ingest — both files must be New.
+    let first = ingest_with_config(env.config.clone(), env.scope(), false)
+        .expect("first ingest must succeed");
+    assert_eq!(first.errors, 0, "first ingest: no errors; report={first:?}");
+
+    let items = first.items.as_ref().expect("items must be present");
+    let twin_items: Vec<_> = items
+        .iter()
+        .filter(|i| {
+            i.doc_path.0.ends_with("__init__.py")
+        })
+        .collect();
+    assert_eq!(
+        twin_items.len(),
+        2,
+        "first ingest: expected exactly 2 __init__.py items; items={items:?}"
+    );
+    for item in &twin_items {
+        assert_eq!(
+            item.kind,
+            IngestItemKind::New,
+            "first ingest: each twin must be New; item={item:?}"
+        );
+    }
+
+    // Second ingest — same files, same content → both must be Unchanged.
+    let second = ingest_with_config(env.config.clone(), env.scope(), false)
+        .expect("second ingest must succeed");
+    assert_eq!(second.errors, 0, "second ingest: no errors; report={second:?}");
+    assert_eq!(second.new, 0, "second ingest: no new docs; report={second:?}");
+    assert_eq!(
+        second.updated, 0,
+        "second ingest: no updated docs (twin-file bug would set this to 2); report={second:?}"
+    );
+
+    let second_items = second.items.as_ref().expect("items must be present");
+    let twin_items2: Vec<_> = second_items
+        .iter()
+        .filter(|i| i.doc_path.0.ends_with("__init__.py"))
+        .collect();
+    assert_eq!(
+        twin_items2.len(),
+        2,
+        "second ingest: expected exactly 2 __init__.py items; items={second_items:?}"
+    );
+    for item in &twin_items2 {
+        assert_eq!(
+            item.kind,
+            IngestItemKind::Unchanged,
+            "second ingest: each twin must be Unchanged; item={item:?}"
+        );
+    }
+}
--- a/crates/kebab-cli/src/main.rs
+++ b/crates/kebab-cli/src/main.rs
@@ -595,14 +595,20 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
                println!("{}", serde_json::to_string(&wire::wire_ingest(&report))?);
            } else {
                let skipped_breakdown = kebab_app::render_skipped_breakdown(&report.skipped_by_extension);
+                let purged_suffix = if report.purged_deleted_files > 0 {
+                    format!("  purged {}", report.purged_deleted_files)
+                } else {
+                    String::new()
+                };
                println!(
-                    "scanned {}  new {}  updated {}  skipped {}{}  errors {}  ({} ms)",
+                    "scanned {}  new {}  updated {}  skipped {}{}  errors {}{}  ({} ms)",
                    report.scanned,
                    report.new,
                    report.updated,
                    report.skipped,
                    skipped_breakdown,
                    report.errors,
+                    purged_suffix,
                    report.duration_ms
                );
            }
--- a/crates/kebab-cli/src/wire.rs
+++ b/crates/kebab-cli/src/wire.rs
@@ -260,6 +260,7 @@ mod tests {
            skipped_generated: 0,
            skipped_size_exceeded: 0,
            skip_examples: SkipExamples::default(),
+            purged_deleted_files: 0,
            items: None,
        };
        let v = wire_ingest(&r);
--- a/crates/kebab-core/src/ingest.rs
+++ b/crates/kebab-core/src/ingest.rs
@@ -47,6 +47,12 @@ pub struct IngestReport {
    /// p10-1A-1: sample file paths per skip category (≤ 5 each).
    #[serde(default)]
    pub skip_examples: SkipExamples,
+    /// Dogfood: docs whose on-disk file was deleted since the last ingest
+    /// and were therefore removed from the store. Additive field — older
+    /// wire consumers that pre-date this field read it as 0 via
+    /// `#[serde(default)]`.
+    #[serde(default)]
+    pub purged_deleted_files: u32,
    /// `None` ↔ wire `items: null` (`--summary-only`).
    pub items: Option<Vec<IngestItem>>,
 }
@@ -136,6 +142,7 @@ mod tests {
                builtin_blacklist: vec!["node_modules/x.js".into()],
                gitignore: vec![],
            },
+            purged_deleted_files: 0,
            items: None,
        };
        let v = serde_json::to_value(&r).unwrap();
--- a/crates/kebab-core/src/traits.rs
+++ b/crates/kebab-core/src/traits.rs
@@ -169,6 +169,30 @@ pub trait DocumentStore {
        &self,
        path: &WorkspacePath,
    ) -> anyhow::Result<Option<RawAsset>>;
+
+    /// Look up a document row by its workspace path. Used by the
+    /// document-centric skip path in `try_skip_unchanged` to avoid the
+    /// twin-file flip-flop that the asset-side lookup suffers from
+    /// (multiple files with identical content share one `assets` row
+    /// whose `workspace_path` is overwritten on every UPSERT, so
+    /// `get_asset_by_workspace_path` returns the wrong twin's path).
+    ///
+    /// `documents.workspace_path` is UNIQUE (V001), so each twin has
+    /// its own stable document row regardless of the asset de-dup.
+    fn get_document_by_workspace_path(
+        &self,
+        path: &WorkspacePath,
+    ) -> anyhow::Result<Option<CanonicalDocument>>;
+
+    /// Return every `workspace_path` stored in the `documents` table.
+    ///
+    /// Used by the post-walker sweep in `kebab-app::ingest` to detect
+    /// documents whose source file has been deleted from the filesystem.
+    /// The set difference `(stored - scanned)` yields orphan candidates;
+    /// each candidate is then existence-checked on disk so that
+    /// out-of-scope files (config narrowing) are NOT purged — only truly
+    /// absent files trigger the purge.
+    fn all_workspace_paths(&self) -> anyhow::Result<Vec<WorkspacePath>>;
 }

 pub trait VectorStore {
--- a/crates/kebab-parse-code/src/lang.rs
+++ b/crates/kebab-parse-code/src/lang.rs
@@ -24,7 +24,7 @@ pub fn code_lang_for_path(path: &Path) -> Option<&'static str> {
    match ext.as_str() {
        "rs" => Some("rust"),
        "py" | "pyi" => Some("python"),
-        "ts" | "tsx" => Some("typescript"),
+        "ts" | "tsx" | "mts" | "cts" => Some("typescript"),
        "js" | "mjs" | "cjs" | "jsx" => Some("javascript"),
        "go" => Some("go"),
        "java" => Some("java"),
@@ -82,7 +82,7 @@ pub fn module_path_for_python(workspace_path: &str) -> String {
 /// (no slash replacement, no source-root strip). See plan §Task C.
 pub fn module_path_for_tsjs(workspace_path: &str) -> String {
    let p = workspace_path;
-    for ext in [".tsx", ".ts", ".jsx", ".mjs", ".cjs", ".js"] {
+    for ext in [".tsx", ".mts", ".cts", ".ts", ".jsx", ".mjs", ".cjs", ".js"] {
        if let Some(stripped) = p.strip_suffix(ext) {
            return stripped.to_string();
        }
@@ -110,7 +110,7 @@ mod tests {

    #[test]
    fn module_path_for_tsjs_keeps_slashes_and_strips_ext() {
-        for ext in ["ts", "tsx", "js", "jsx", "mjs", "cjs"] {
+        for ext in ["ts", "tsx", "mts", "cts", "js", "jsx", "mjs", "cjs"] {
            let p = format!("src/search/retriever/Retriever.{ext}");
            assert_eq!(module_path_for_tsjs(&p), "src/search/retriever/Retriever");
        }
--- a/crates/kebab-parse-code/src/typescript.rs
+++ b/crates/kebab-parse-code/src/typescript.rs
@@ -173,8 +173,9 @@ impl Extractor for TypescriptAstExtractor {
 }

 /// Select the tree-sitter grammar based on the workspace path's
-/// extension. `.tsx` → TSX grammar; everything else (`.ts`, `.d.ts`,
-/// missing extension) → TypeScript grammar.
+/// extension. `.tsx` → TSX grammar; everything else (`.ts`, `.mts`,
+/// `.cts`, `.d.ts`, missing extension) → TypeScript grammar (the JSX-
+/// agnostic variants all share one grammar in tree-sitter-typescript 0.23).
 fn select_grammar(workspace_path: &str) -> tree_sitter::Language {
    if workspace_path.ends_with(".tsx") {
        tree_sitter_typescript::LANGUAGE_TSX.into()
--- a/crates/kebab-parse-code/tests/lang.rs
+++ b/crates/kebab-parse-code/tests/lang.rs
@@ -9,6 +9,8 @@ fn known_extensions_map_to_canonical_identifiers() {
        ("foo.pyi", Some("python")),
        ("foo.ts", Some("typescript")),
        ("foo.tsx", Some("typescript")),
+        ("foo.mts", Some("typescript")),  // ESM TS — same grammar
+        ("foo.cts", Some("typescript")),  // CommonJS TS — same grammar
        ("foo.js", Some("javascript")),
        ("foo.mjs", Some("javascript")),
        ("foo.cjs", Some("javascript")),
--- a/crates/kebab-source-fs/src/media.rs
+++ b/crates/kebab-source-fs/src/media.rs
@@ -19,7 +19,9 @@ pub(crate) fn media_type_for(path: &Path) -> MediaType {
        .unwrap_or_default();

    match ext.as_str() {
-        "md" => MediaType::Markdown,
+        // Markdown + MDX (markdown + JSX, treated as plain markdown — the
+        // JSX islands are folded into raw passthrough by the md parser).
+        "md" | "mdx" => MediaType::Markdown,
        "pdf" => MediaType::Pdf,

        "png" => MediaType::Image(ImageType::Png),
@@ -40,7 +42,8 @@ pub(crate) fn media_type_for(path: &Path) -> MediaType {

        // p10-1B: Python / TS / JS AST chunkers active.
        "py" | "pyi"               => MediaType::Code("python".into()),
-        "ts" | "tsx"               => MediaType::Code("typescript".into()),
+        // .mts / .cts are TypeScript ESM / CommonJS variants — same grammar.
+        "ts" | "tsx" | "mts" | "cts" => MediaType::Code("typescript".into()),
        "js" | "mjs" | "cjs" | "jsx" => MediaType::Code("javascript".into()),

        // Empty string (no extension) and any other extension: bucket as
@@ -102,6 +105,20 @@ mod tests {
        assert_eq!(media_type_for(Path::new("a/b.rs")),    MediaType::Code("rust".into()));
    }

+    #[test]
+    fn ts_variants_mts_cts() {
+        // .mts / .cts are TypeScript ESM / CommonJS — same grammar as .ts.
+        assert_eq!(media_type_for(Path::new("a/b.mts")), MediaType::Code("typescript".into()));
+        assert_eq!(media_type_for(Path::new("a/b.cts")), MediaType::Code("typescript".into()));
+    }
+
+    #[test]
+    fn mdx_routes_to_markdown() {
+        // MDX is markdown with JSX islands; the md parser folds the JSX
+        // through as raw passthrough.
+        assert_eq!(media_type_for(Path::new("docs/page.mdx")), MediaType::Markdown);
+    }
+
    #[test]
    fn unknown_and_missing_extension() {
        assert_eq!(
--- a/crates/kebab-store-sqlite/snapshots/ingest_report.snapshot.json
+++ b/crates/kebab-store-sqlite/snapshots/ingest_report.snapshot.json
@@ -56,5 +56,6 @@
  "skipped_kebabignore": 0,
  "skipped_size_exceeded": 0,
  "unchanged": 0,
+  "purged_deleted_files": 0,
  "updated": 1
 }
--- a/crates/kebab-store-sqlite/src/documents.rs
+++ b/crates/kebab-store-sqlite/src/documents.rs
@@ -286,6 +286,88 @@ impl kebab_core::DocumentStore for SqliteStore {
        }
    }

+    fn get_document_by_workspace_path(
+        &self,
+        path: &kebab_core::WorkspacePath,
+    ) -> Result<Option<kebab_core::CanonicalDocument>> {
+        let conn = self.lock_conn();
+        let row: Option<DocumentRow> = conn
+            .query_row(
+                "SELECT
+                    doc_id, asset_id, workspace_path, title, lang,
+                    source_type, trust_level, parser_version,
+                    doc_version, schema_version, metadata_json,
+                    provenance_json, created_at, updated_at,
+                    last_chunker_version, last_embedding_version
+                FROM documents WHERE workspace_path = ?",
+                params![path.0],
+                document_row_from_sql,
+            )
+            .map(Some)
+            .or_else(rows_optional)
+            .map_err(StoreError::from)?;
+        let Some(row) = row else { return Ok(None) };
+
+        let doc_id = kebab_core::DocumentId(row.doc_id.clone());
+        let mut blocks_stmt = conn
+            .prepare(
+                "SELECT payload_json FROM blocks
+                 WHERE doc_id = ? ORDER BY ordinal ASC",
+            )
+            .map_err(StoreError::from)?;
+        let block_rows = blocks_stmt
+            .query_map(params![row.doc_id], |r| {
+                let payload_json: String = r.get(0)?;
+                Ok(payload_json)
+            })
+            .map_err(StoreError::from)?;
+        let mut blocks: Vec<kebab_core::Block> = Vec::new();
+        for block_row in block_rows {
+            let payload_json = block_row.map_err(StoreError::from)?;
+            let block: kebab_core::Block = serde_json::from_str(&payload_json)
+                .context("deserialize block payload_json")?;
+            blocks.push(block);
+        }
+
+        let metadata: kebab_core::Metadata = serde_json::from_str(&row.metadata_json)
+            .context("deserialize metadata_json")?;
+        let provenance: kebab_core::Provenance =
+            serde_json::from_str(&row.provenance_json)
+                .context("deserialize provenance_json")?;
+
+        Ok(Some(kebab_core::CanonicalDocument {
+            doc_id,
+            source_asset_id: kebab_core::AssetId(row.asset_id),
+            workspace_path: kebab_core::WorkspacePath(row.workspace_path),
+            title: row.title.unwrap_or_default(),
+            lang: kebab_core::Lang(row.lang.unwrap_or_default()),
+            blocks,
+            metadata,
+            provenance,
+            parser_version: kebab_core::ParserVersion(row.parser_version),
+            schema_version: row.schema_version as u32,
+            doc_version: row.doc_version as u32,
+            last_chunker_version: row.last_chunker_version.map(kebab_core::ChunkerVersion),
+            last_embedding_version: row.last_embedding_version.map(kebab_core::EmbeddingVersion),
+        }))
+    }
+
+    fn all_workspace_paths(&self) -> Result<Vec<kebab_core::WorkspacePath>> {
+        let conn = self.lock_conn();
+        let mut stmt = conn
+            .prepare("SELECT workspace_path FROM documents")
+            .map_err(StoreError::from)?;
+        let rows = stmt
+            .query_map([], |r| r.get::<_, String>(0))
+            .map_err(StoreError::from)?;
+        let mut out = Vec::new();
+        for row in rows {
+            let path = row.map_err(StoreError::from)?;
+            out.push(kebab_core::WorkspacePath(path));
+        }
+        Ok(out)
+    }
+
    fn list_documents(
        &self,
        filter: &kebab_core::DocFilter,
--- a/crates/kebab-store-sqlite/src/lib.rs
+++ b/crates/kebab-store-sqlite/src/lib.rs
@@ -35,4 +35,4 @@ pub use error::StoreError;
 pub use eval::{EvalQueryResultRecord, EvalRunRecord, EvalRunRow};
 pub use fts::rebuild_chunks_fts;
 pub use jobs::IngestRunRow;
-pub use store::{CountSummary, NotIndexed, SqliteStore};
+pub use store::{CountSummary, NotIndexed, SqliteStore, purge_deleted_workspace_path};
--- a/crates/kebab-store-sqlite/src/store.rs
+++ b/crates/kebab-store-sqlite/src/store.rs
@@ -540,6 +540,114 @@ pub(crate) fn purge_orphan_at_workspace_path(
    Ok(())
 }

+/// Purge all stored data for a document whose on-disk file has been
+/// deleted (as opposed to content-changed, which is handled by
+/// `purge_orphan_at_workspace_path`).
+///
+/// Returns the `chunk_id`s that were associated with the document so
+/// the caller can issue a matching `VectorStore::delete_by_chunk_ids`
+/// on the LanceDB side.
+///
+/// Deletion order (numbering matches the inline comments below):
+/// 1. Collect chunk_ids (before cascade removes them).
+/// 2. DELETE the `documents` row → CASCADE clears `blocks`, `chunks`,
+///    `embedding_records`.
+/// 3. Count the `documents` rows that still reference the asset and
+///    continue only when the count is zero (twin-file protection:
+///    `assets` can be shared across identical-content files via the
+///    blake3 PK).
+/// 4. Read the asset's `storage_kind` / `storage_path`, then DELETE
+///    the `assets` row.
+/// 5. If the asset was `storage_kind = 'copied'`, best-effort delete
+///    the on-disk byte file at `storage_path`.
+///
+/// Returns `Ok(vec![])` when no document exists at `workspace_path`
+/// (idempotent — caller doesn't need to pre-check).
+pub fn purge_deleted_workspace_path(
+    store: &SqliteStore,
+    workspace_path: &kebab_core::WorkspacePath,
+) -> anyhow::Result<Vec<kebab_core::ChunkId>> {
+    let conn = store.lock_conn();
+
+    // Look up the document + its asset_id.
+    let doc_row: Option<(String, String)> = conn
+        .query_row(
+            "SELECT doc_id, asset_id FROM documents WHERE workspace_path = ?",
+            rusqlite::params![workspace_path.0],
+            |r| Ok((r.get(0)?, r.get(1)?)),
+        )
+        .optional()
+        .map_err(StoreError::from)?;
+
+    let Some((doc_id, asset_id)) = doc_row else {
+        return Ok(Vec::new());
+    };
+
+    // 1. Collect chunk_ids before CASCADE removes them.
+    let mut stmt = conn
+        .prepare("SELECT chunk_id FROM chunks WHERE doc_id = ?")
+        .map_err(StoreError::from)?;
+    let rows = stmt
+        .query_map(rusqlite::params![doc_id], |r| r.get::<_, String>(0))
+        .map_err(StoreError::from)?;
+    let chunk_ids: Vec<kebab_core::ChunkId> = rows
+        .map(|r| r.map(kebab_core::ChunkId))
+        .collect::<rusqlite::Result<Vec<_>>>()
+        .map_err(StoreError::from)?;
+    drop(stmt);
+
+    // 2. DELETE the document row (CASCADE clears blocks / chunks /
+    //    embedding_records via the FK constraints in V001).
+    conn.execute(
+        "DELETE FROM documents WHERE doc_id = ?",
+        rusqlite::params![doc_id],
+    )
+    .map_err(StoreError::from)?;
+
+    // 3. Delete the asset row only when no other document still
+    //    references it (twin-file safety: two files with identical
+    //    bytes share a single asset row via the blake3 PK).
+    let remaining_refs: i64 = conn
+        .query_row(
+            "SELECT COUNT(*) FROM documents WHERE asset_id = ?",
+            rusqlite::params![asset_id],
+            |r| r.get(0),
+        )
+        .map_err(StoreError::from)?;
+
+    if remaining_refs == 0 {
+        // 4. Capture storage details before deleting the row.
+        let asset_storage: Option<(String, String)> = conn
+            .query_row(
+                "SELECT storage_kind, storage_path FROM assets WHERE asset_id = ?",
+                rusqlite::params![asset_id],
+                |r| Ok((r.get(0)?, r.get(1)?)),
+            )
+            .optional()
+            .map_err(StoreError::from)?;
+
+        conn.execute(
+            "DELETE FROM assets WHERE asset_id = ?",
+            rusqlite::params![asset_id],
+        )
+        .map_err(StoreError::from)?;
+
+        // 5. Best-effort: remove the on-disk copied asset file.
+        if let Some((storage_kind, storage_path)) = asset_storage {
+            if storage_kind == "copied" {
+                let _ = std::fs::remove_file(&storage_path);
+            }
+        }
+    }
+
+    tracing::debug!(
+        target: "kebab-store-sqlite",
+        workspace_path = %workspace_path.0,
+        doc_id = %doc_id,
+        chunk_count = chunk_ids.len(),
+        "purged deleted-file document from store"
+    );
+
+    Ok(chunk_ids)
+}
+
 /// UPSERT a row into `assets`. Used by both the `put_asset_with_bytes`
 /// path (which has bytes + computed `storage_kind/path`) and the
 /// `DocumentStore::put_asset` path (which only has the `RawAsset` and
--- a/crates/kebab-store-sqlite/tests/ingest_report_snapshot.rs
+++ b/crates/kebab-store-sqlite/tests/ingest_report_snapshot.rs
@@ -41,6 +41,7 @@ fn fixture_report() -> IngestReport {
        skipped_generated: 0,
        skipped_size_exceeded: 0,
        skip_examples: kebab_core::SkipExamples::default(),
+        purged_deleted_files: 0,
        items: Some(vec![
            IngestItem {
                kind: IngestItemKind::New,
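The ordering used by `purge_deleted_workspace_path` in the store.rs hunk above (collect chunk ids, drop the document, drop the asset only when no twin still references it) can be modelled without SQLite. The following is a minimal in-memory sketch, not the crate's code: `MemStore`, its fields, and `purge_deleted` are hypothetical names invented for the illustration, and plain `HashMap`s stand in for the `documents`, `chunks`, and `assets` tables.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the three tables involved.
#[derive(Default)]
pub struct MemStore {
    /// workspace_path -> (doc_id, asset_id)
    pub documents: HashMap<String, (String, String)>,
    /// doc_id -> chunk_ids
    pub chunks: HashMap<String, Vec<String>>,
    /// asset_id -> storage_path
    pub assets: HashMap<String, String>,
}

impl MemStore {
    /// Purge the document stored at `workspace_path` and return its chunk
    /// ids so a caller could delete the matching vectors elsewhere.
    pub fn purge_deleted(&mut self, workspace_path: &str) -> Vec<String> {
        // Idempotent: an unknown path yields an empty list.
        let Some((doc_id, asset_id)) = self.documents.remove(workspace_path) else {
            return Vec::new();
        };

        // Taking the chunk list plays the role of "collect, then cascade".
        let chunk_ids = self.chunks.remove(&doc_id).unwrap_or_default();

        // Twin-file protection: keep the asset while any other document
        // still points at the same content hash.
        let still_referenced = self.documents.values().any(|(_, a)| *a == asset_id);
        if !still_referenced {
            self.assets.remove(&asset_id);
        }

        chunk_ids
    }
}
```

With two documents sharing one asset, purging the first leaves the asset in place and purging the second removes it; calling it again for an already-purged path returns an empty list.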
Author	SHA1	Message	Date
altair823	d26efe167f	Merge pull request 'fix(dogfood): auto-purge stored docs for filesystem-deleted files' (#148 ) from fix/dogfood-file-deletion-auto-purge into main	2026-05-20 07:10:33 +00:00
altair823	d6d165df01	docs(dogfood): sync sweep_deleted_files algorithm doc with try_exists (PR #148 nit) Round 2 review found the function-level doc-comment still referenced the old fs::exists() (now replaced by try_exists().unwrap_or(true) in commit `2baa846`). One-line clarification — describes the conservative-on-Err semantics so future readers don't reintroduce the data-safety bug. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 07:10:27 +00:00
altair823	2baa846c6b	fix(dogfood): conservative try_exists() in sweep_deleted_files (PR #148 review) Round 1 review found a data-safety bug: fs::exists() returns false on errors like EACCES / EPERM / NFS-hiccup / ownership-change, which would trigger purge on a file that is in fact still on disk (just unreadable this moment). Switched to try_exists().unwrap_or(true) so transient FS errors are CONSERVATIVELY treated as 'file present' — never purge on uncertain signal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 07:04:03 +00:00
altair823	27baec82ea	fix(dogfood): auto-purge stored docs for filesystem-deleted files Files deleted from disk (rm a.md) were leaving stale documents + chunks + embeddings in the store, surfacing as ghost citations in search/ask. Existing purge_orphan_at_workspace_path only handled content-changed stale (WHERE workspace_path=? AND asset_id != ?) — file deletion has no new asset_id. Fix: post-walker-scan sweep. Compute (stored_paths - scanned_paths), for each candidate check filesystem existence — only purge when the file is TRULY missing. Scope-narrowing case (file on disk but outside include glob) is explicitly NOT purged to protect users from accidental data loss via config edits. Adds: - DocumentStore::all_workspace_paths trait method + SqliteStore impl - purge_deleted_workspace_path in store-sqlite (returns chunk_ids for vector delete; deletes doc CASCADE + asset row + copied storage file) - sweep_deleted_files in kebab-app::ingest path; called once per ingest before the per-asset loop - IngestReport.purged_deleted_files counter (additive, serde default) - CLI ingest summary mentions purge count when > 0 - 2 integration tests: file_deletion_auto_purge + include_scope_narrowing_does_NOT_purge dogfood discovery (PR #142 1B + multi-root: kebab-docs + httpx + zod + lodash). Per user decision: only filesystem deletion auto-purges; scope narrowing requires explicit kebab reset. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 06:51:07 +00:00
altair823	acf8cf3be2	chore: bump version 0.8.3 → 0.9.0 dogfood-discovered routing additions (PR #147) land: - .mts / .cts → MediaType::Code(typescript) - .mdx → MediaType::Markdown Reason for minor bump: the user dogfooding surface expands; 28+ files that were previously skipped are now indexed. Per design §10.4, expanding the dogfooding-ready surface is a minor-bump trigger. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 06:29:27 +00:00
altair823	ea5f7b22c8	Merge pull request 'feat(dogfood): route .mts/.cts → typescript + .mdx → markdown' (#147 ) from feat/dogfood-routing-cts-mts-mdx into main	2026-05-20 06:28:41 +00:00
altair823	5497c6e7b5	feat(dogfood): route .mts/.cts to typescript + .mdx to markdown Dogfood (PR #142 1B + multi-root: kebab-docs + httpx + zod + lodash) showed 28 files skipped by extension that are routable to existing extractors: - .mts (ESM TypeScript) / .cts (CommonJS TypeScript): same grammar as .ts in tree-sitter-typescript 0.23 (LANGUAGE_TYPESCRIPT covers JSX-agnostic variants; LANGUAGE_TSX stays for .tsx only) - .mdx (Markdown + JSX): routed as MediaType::Markdown; the md parser folds JSX islands through as raw passthrough Changes: - crates/kebab-source-fs/src/media.rs: 'mts'|'cts' → Code(typescript), 'mdx' → Markdown. +2 unit tests. - crates/kebab-parse-code/src/lang.rs: code_lang_for_path matches mts/cts; module_path_for_tsjs strips .mts/.cts as well. Test cases extended. - crates/kebab-parse-code/src/typescript.rs: doc comment on select_grammar refreshed to mention .mts/.cts. - crates/kebab-parse-code/tests/lang.rs: 2 new assertions. verify: kebab-source-fs 44 / kebab-parse-code lib 20 + lang 4 all pass; clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 06:24:21 +00:00
altair823	5a90940f1c	chore: bump version 0.8.2 → 0.8.3 dogfood-discovered fix (PR #146) lands: idempotent re-ingest now correctly returns Unchanged for twin files (identical content at different paths) via document-centric try_skip_unchanged lookup. Reason for patch bump: restores the correct behavior of the advertised idempotency. No new wire / config / surface changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 06:20:34 +00:00
altair823	4389b887f0	Merge pull request 'fix(dogfood): document-centric try_skip_unchanged for twin-file idempotency' (#146 ) from fix/dogfood-bug4-idempotent-twin-files into main	2026-05-20 06:16:28 +00:00
altair823	360f825f3a	docs(dogfood): refresh try_skip_unchanged doc-comment to match new flow (PR #146 review) Round 1 review found the function-level doc-comment still described the old asset-side algorithm (item 2 asset-row checksum, item 3 id_for_doc miss). Updated to the document-centric flow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 05:35:17 +00:00
altair823	641b92af7d	fix(dogfood): document-centric try_skip_unchanged for twin-file idempotency Identical-content files at different workspace paths share one assets row (assets.asset_id = blake3 content hash, PRIMARY KEY). The UPSERT `ON CONFLICT(asset_id) DO UPDATE SET workspace_path = excluded` made twin files overwrite each other's workspace_path on every ingest, so `get_asset_by_workspace_path(path1)` returned the OTHER twin's row (or None), breaking idempotent unchanged-detection for both files. Fix: switch try_skip_unchanged to document-centric lookup. `documents.workspace_path` is already UNIQUE (V001) and `id_for_doc(path, ...)` includes path, so each twin has its own stable document row. Compare `doc.source_asset_id` with the new asset's checksum instead of going through the assets table. Dogfood (multi-root: kebab-docs + httpx + zod + lodash) showed 27 of 726 docs marked Updated on every idempotent re-ingest; all 27 are twin-file victims (empty `__init__.py` ×3, AGENTS.md ↔ CLAUDE.md same content, duplicate logo PDFs/JPGs). After: re-ingest reports 0 new / 0 updated / 726 unchanged. No schema migration needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 05:27:21 +00:00
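The sweep described in commits 27baec82ea and 2baa846c6b (take `stored - scanned`, then purge only the paths that are confirmed missing on disk) might look roughly like the sketch below. It is an illustration under assumptions, not the actual `sweep_deleted_files` from kebab-app: the function name `deleted_candidates`, its parameters, and the use of plain strings for workspace paths are invented for the example. The only part taken directly from the commit messages is the `try_exists().unwrap_or(true)` rule.

```rust
use std::collections::HashSet;
use std::path::Path;

/// Return the stored workspace paths that are safe to purge: not seen by
/// the current scan AND confirmed absent from the filesystem.
///
/// `root` is the workspace root; `stored` and `scanned` hold
/// workspace-relative paths. All names here are hypothetical.
pub fn deleted_candidates(
    root: &Path,
    stored: &[String],
    scanned: &HashSet<String>,
) -> Vec<String> {
    stored
        .iter()
        // Set difference: stored paths the walker did not visit.
        .filter(|p| !scanned.contains(p.as_str()))
        .filter(|p| {
            // try_exists() returns Err when existence cannot be determined
            // (for example on a permission error). Mapping Err to `true`
            // treats the file as present, so an uncertain answer never
            // leads to a purge.
            let present = root.join(p.as_str()).try_exists().unwrap_or(true);
            !present
        })
        .cloned()
        .collect()
}
```

A file that is still on disk but was left out of the scan (for instance after narrowing the include globs) passes the first filter and is dropped by the second, which is the scope-narrowing protection the commit message calls out.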