chore: bump version 0.10.0 → 0.11.0

dogfood follow-up (PR #149) lands: kebab reset --orphans-only, the explicit complement to PR #148's conservative sweep. Reason for the minor bump: a new CLI flag (--orphans-only), a new ResetScope variant, and additive ResetReport fields all expand the public surface, which meets the design §10.4 trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merge pull request 'feat(dogfood): kebab reset --orphans-only — explicit complement to PR #148 sweep' (#149 ) from feat/dogfood-reset-orphans-only into main
2026-05-20 07:53:55 +00:00 · 2026-05-20 07:50:43 +00:00 · 2026-05-20 07:47:44 +00:00 · 2026-05-20 07:38:10 +00:00 · 2026-05-20 07:12:58 +00:00 · 2026-05-20 07:10:33 +00:00
30 changed files with 1727 additions and 99 deletions
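As a reading aid for the `try_skip_unchanged` hunk in `crates/kebab-app/src/lib.rs`, here is a minimal sketch of why an asset-side lookup by path can miss for twin files while a document-side lookup does not. `AssetTable`, `DocumentTable`, and `twin_lookup_demo` are hypothetical in-memory stand-ins, not kebab types. The only facts taken from the diff are that `assets` may dedupe identical content into one row whose `workspace_path` is overwritten on upsert, and that `documents.workspace_path` is unique.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the `assets` table: one row per content hash,
/// with a single path column that every upsert overwrites.
#[derive(Default)]
struct AssetTable {
    path_by_hash: HashMap<String, String>,
}

impl AssetTable {
    fn upsert(&mut self, hash: &str, path: &str) {
        self.path_by_hash.insert(hash.to_string(), path.to_string());
    }

    /// Asset-side lookup: find the row whose *current* path matches.
    fn hash_for_path(&self, path: &str) -> Option<&str> {
        self.path_by_hash
            .iter()
            .find(|(_, p)| p.as_str() == path)
            .map(|(h, _)| h.as_str())
    }
}

/// Hypothetical stand-in for the `documents` table: one row per path.
#[derive(Default)]
struct DocumentTable {
    source_hash_by_path: HashMap<String, String>,
}

impl DocumentTable {
    fn upsert(&mut self, path: &str, hash: &str) {
        self.source_hash_by_path
            .insert(path.to_string(), hash.to_string());
    }

    /// Document-side lookup: keyed directly on the path.
    fn source_hash(&self, path: &str) -> Option<&str> {
        self.source_hash_by_path.get(path).map(|h| h.as_str())
    }
}

/// Ingest two files with identical content, then look up the first one
/// through both tables.
fn twin_lookup_demo() -> (Option<String>, Option<String>) {
    let mut assets = AssetTable::default();
    let mut docs = DocumentTable::default();
    for path in ["notes/a.md", "notes/b.md"] {
        assets.upsert("hash-1", path);
        docs.upsert(path, "hash-1");
    }
    (
        assets.hash_for_path("notes/a.md").map(|h| h.to_string()),
        docs.source_hash("notes/a.md").map(|h| h.to_string()),
    )
}

fn main() {
    let (asset_side, doc_side) = twin_lookup_demo();
    println!("asset-side lookup for notes/a.md: {asset_side:?}");
    println!("document-side lookup for notes/a.md: {doc_side:?}");
}
```

In this toy model the second upsert moves the shared asset row to `notes/b.md`, so the asset-side lookup for `notes/a.md` misses and the file would be re-processed on every run; the document-side lookup stays stable.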
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -4127,7 +4127,7 @@ dependencies = [

 [[package]]
 name = "kebab-app"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "anyhow",
 "base64 0.22.1",
@@ -4172,7 +4172,7 @@ dependencies = [

 [[package]]
 name = "kebab-chunk"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "anyhow",
 "blake3",
@@ -4187,7 +4187,7 @@ dependencies = [

 [[package]]
 name = "kebab-cli"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "anyhow",
 "clap",
@@ -4208,7 +4208,7 @@ dependencies = [

 [[package]]
 name = "kebab-config"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "anyhow",
 "dirs 5.0.1",
@@ -4223,7 +4223,7 @@ dependencies = [

 [[package]]
 name = "kebab-core"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "anyhow",
 "blake3",
@@ -4237,7 +4237,7 @@ dependencies = [

 [[package]]
 name = "kebab-embed"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "anyhow",
 "blake3",
@@ -4251,7 +4251,7 @@ dependencies = [

 [[package]]
 name = "kebab-embed-local"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "anyhow",
 "fastembed",
@@ -4264,7 +4264,7 @@ dependencies = [

 [[package]]
 name = "kebab-eval"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "anyhow",
 "kebab-app",
@@ -4283,7 +4283,7 @@ dependencies = [

 [[package]]
 name = "kebab-llm"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "anyhow",
 "kebab-core",
@@ -4292,7 +4292,7 @@ dependencies = [

 [[package]]
 name = "kebab-llm-local"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "anyhow",
 "kebab-config",
@@ -4309,7 +4309,7 @@ dependencies = [

 [[package]]
 name = "kebab-mcp"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "anyhow",
 "kebab-app",
@@ -4327,7 +4327,7 @@ dependencies = [

 [[package]]
 name = "kebab-normalize"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "anyhow",
 "kebab-core",
@@ -4342,7 +4342,7 @@ dependencies = [

 [[package]]
 name = "kebab-parse-code"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "anyhow",
 "gix",
@@ -4360,7 +4360,7 @@ dependencies = [

 [[package]]
 name = "kebab-parse-image"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "ab_glyph",
 "anyhow",
@@ -4384,7 +4384,7 @@ dependencies = [

 [[package]]
 name = "kebab-parse-md"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "anyhow",
 "kebab-core",
@@ -4401,7 +4401,7 @@ dependencies = [

 [[package]]
 name = "kebab-parse-pdf"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "anyhow",
 "blake3",
@@ -4414,7 +4414,7 @@ dependencies = [

 [[package]]
 name = "kebab-parse-types"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "kebab-core",
 "serde",
@@ -4422,7 +4422,7 @@ dependencies = [

 [[package]]
 name = "kebab-rag"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "anyhow",
 "blake3",
@@ -4443,7 +4443,7 @@ dependencies = [

 [[package]]
 name = "kebab-search"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "anyhow",
 "globset",
@@ -4462,10 +4462,11 @@ dependencies = [

 [[package]]
 name = "kebab-source-fs"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "anyhow",
 "blake3",
+ "globset",
 "ignore",
 "kebab-config",
 "kebab-core",
@@ -4480,7 +4481,7 @@ dependencies = [

 [[package]]
 name = "kebab-store-sqlite"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "anyhow",
 "blake3",
@@ -4501,7 +4502,7 @@ dependencies = [

 [[package]]
 name = "kebab-store-vector"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "anyhow",
 "arrow",
@@ -4525,7 +4526,7 @@ dependencies = [

 [[package]]
 name = "kebab-tui"
-version = "0.8.0"
+version = "0.11.0"
 dependencies = [
 "anyhow",
 "crossterm",
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -31,7 +31,7 @@ edition       = "2024"
 rust-version  = "1.85"
 license       = "MIT OR Apache-2.0"
 repository    = "https://github.com/altair823/kebab"
-version       = "0.8.0"
+version       = "0.11.0"

 [workspace.dependencies]
 anyhow       = "1"
--- a/README.md
+++ b/README.md
@@ -34,7 +34,7 @@ cargo install --git https://gitea.altair823.xyz/altair823-org/kebab.git --bin ke

 To update, run `git pull && cargo install --path crates/kebab-cli --locked --force`, or, for the git-URL form, `cargo install --git ... --force`.

-To uninstall, run `cargo uninstall kebab-cli`. This removes only the binary; workspace data stays in place. To clean up the data as well, run `kebab reset --all --yes` (wipes all four XDG paths: config + data + cache + state; **irreversible**, so run `kebab init` again before the next start). For a partial wipe, use `kebab reset --data-only` (keeps config), `kebab reset --vector-only` (only Lance + `embedding_records`; the next ingest re-embeds), and so on.
+To uninstall, run `cargo uninstall kebab-cli`. This removes only the binary; workspace data stays in place. To clean up the data as well, run `kebab reset --all --yes` (wipes all four XDG paths: config + data + cache + state; **irreversible**, so run `kebab init` again before the next start). For a partial wipe, use `kebab reset --data-only` (keeps config), `kebab reset --vector-only` (only Lance + `embedding_records`; the next ingest re-embeds), **`kebab reset --orphans-only`** (cleans up only stored docs that fall outside the current walker scope; an explicit reconcile after narrowing `config.workspace.include` or moving a sub-dir; files on the filesystem are not touched), and so on.

 ## Quick start

--- a/crates/kebab-app/src/lib.rs
+++ b/crates/kebab-app/src/lib.rs
@@ -71,7 +71,7 @@ mod staleness;

 pub use app::{App, SearchResponse};
 pub use ingest_progress::{AggregateCounts, IngestEvent, render_skipped_breakdown};
-pub use reset::{ResetReport, ResetScope};
+pub use reset::{ResetReport, ResetScope, enumerate_orphans};
 pub use error_wire::{ERROR_V1_ID, ErrorV1, StructuredError, classify};
 pub use fetch::fetch_with_config;
 #[doc(hidden)]
@@ -375,6 +375,28 @@ pub fn ingest_with_config_opts(
        .map(|d| d.doc_id.0)
        .collect();

+    // Dogfood: post-walker sweep to remove stored docs whose source
+    // file has been deleted from the filesystem. Must run BEFORE the
+    // per-asset loop so the loop's New/Updated labelling is based on
+    // the post-purge store state (the purged doc_ids won't be in
+    // `existing_doc_ids` above — they were already removed, OR the
+    // sweep here removes them before we start counting).
+    //
+    // Critical design invariant: only purge when the file is TRULY
+    // absent from disk. A file that is still on disk but outside the
+    // current walker scope (config narrowing / include-glob change) is
+    // NOT purged — we leave it in place to protect against accidental
+    // data loss via config edits.
+    let scanned_paths: std::collections::HashSet<kebab_core::WorkspacePath> = assets
+        .iter()
+        .map(|a| a.workspace_path.clone())
+        .collect();
+    let purged_deleted_files = sweep_deleted_files(
+        &app,
+        &scanned_paths,
+        vector_store.as_ref().map(|v| v.as_ref()),
+    )?;
+
    let started_at = time::OffsetDateTime::now_utc();

    let mut items: Vec<kebab_core::IngestItem> = Vec::new();
@@ -647,11 +669,11 @@ pub fn ingest_with_config_opts(
    crate::ingest_progress::emit(progress, terminal_event);

    // p9-fb-19: bump the persistent corpus_revision counter when a
-    // commit landed (any new / updated). This invalidates every
+    // commit landed (any new / updated / purged). This invalidates every
    // entry in any in-process LRU search cache (in this process or
    // a sibling) on the next lookup. No-op when nothing changed
    // (skipped-only run) — the cache stays valid.
-    if new_count > 0 || updated_count > 0 {
+    if new_count > 0 || updated_count > 0 || purged_deleted_files > 0 {
        match app.sqlite.bump_corpus_revision() {
            Ok(rev) => tracing::debug!(
                target: "kebab-app",
@@ -682,6 +704,7 @@ pub fn ingest_with_config_opts(
        skipped_generated: fs_skips.skipped_generated,
        skipped_size_exceeded: fs_skips.skipped_size_exceeded,
        skip_examples: fs_skips.skip_examples,
+        purged_deleted_files,
        items: if summary_only { None } else { Some(items) },
    })
 }
@@ -748,15 +771,18 @@ struct ImagePipeline<'a> {
 /// hold (per design §9 cascade rule):
 ///
 /// 1. `force_reingest == false` — caller hasn't asked to bypass skip.
-/// 2. The freshly-scanned asset's blake3 checksum equals what the
-///    existing `assets` row stores at the same `workspace_path`.
-/// 3. The doc keyed on `(workspace_path, asset_id, current_parser_version)`
-///    exists. If the parser_version changed, `id_for_doc` produces a
-///    different `doc_id` so the lookup misses → no skip → re-process.
-/// 4. The existing doc's stamped `last_chunker_version` AND
-///    `last_embedding_version` match the values the caller is about
-///    to use (`Some(v) == Some(v)` and `None == None` — see design
-///    doc for the `None == None` rule when no embedder is configured).
+/// 2. A document already exists at this `workspace_path`
+///    (`get_document_by_workspace_path`). The lookup is document-side, not
+///    asset-side, so twin files (identical content at different paths) each
+///    hit their own stable doc row — `documents.workspace_path` is UNIQUE
+///    while `assets` may dedupe content into a single row with a flip-flop
+///    `workspace_path` column (dogfood bug #4, see `tasks/HOTFIXES.md`).
+/// 3. The existing doc's `source_asset_id` equals the freshly-scanned
+///    asset's blake3 checksum (content unchanged).
+/// 4. The existing doc's `parser_version` matches the current extractor's
+///    `parser_version` (extractor not upgraded). Combined with `chunker_version`
+///    and `last_embedding_version` checks immediately below — full cascade
+///    per design §9.
 ///
 /// Returns `Ok(None)` (proceed with full re-process) when any check
 /// fails or any DB read errors out — the skip path is opportunistic;
@@ -773,31 +799,19 @@ fn try_skip_unchanged(
    if force_reingest {
        return Ok(None);
    }
-    let existing_asset = match app
+    // Document-centric skip: look up the existing document row by
+    // workspace_path directly. This avoids the twin-file flip-flop
+    // that the old asset-side lookup suffers from — multiple files
+    // with identical content share one `assets` row whose
+    // `workspace_path` is overwritten on every UPSERT, so
+    // `get_asset_by_workspace_path(path1)` could return the OTHER
+    // twin's path (or None) after any ingest of the twin. The
+    // `documents` table has a UNIQUE index on `workspace_path` (V001),
+    // so each twin has its own stable row regardless of asset de-dup.
+    let existing_doc = match app
        .sqlite
-        .get_asset_by_workspace_path(&asset.workspace_path)
+        .get_document_by_workspace_path(&asset.workspace_path)
    {
-        Ok(Some(a)) => a,
-        Ok(None) => return Ok(None),
-        Err(e) => {
-            tracing::debug!(
-                target: "kebab-app",
-                path = %asset.workspace_path.0,
-                error = %e,
-                "skip-check: get_asset_by_workspace_path failed; falling through to re-process"
-            );
-            return Ok(None);
-        }
-    };
-    if existing_asset.checksum != asset.checksum {
-        return Ok(None);
-    }
-    let candidate_doc_id = kebab_core::id_for_doc(
-        &asset.workspace_path,
-        &asset.asset_id,
-        current_parser_version,
-    );
-    let existing_doc = match app.sqlite.get_document(&candidate_doc_id) {
        Ok(Some(d)) => d,
        Ok(None) => return Ok(None),
        Err(e) => {
@@ -805,21 +819,37 @@ fn try_skip_unchanged(
                target: "kebab-app",
                path = %asset.workspace_path.0,
                error = %e,
-                "skip-check: get_document failed; falling through to re-process"
+                "skip-check: get_document_by_workspace_path failed; falling through to re-process"
            );
            return Ok(None);
        }
    };
+    // 1. Content unchanged: the freshly-computed asset_id (blake3
+    //    content hash) must match what this document was ingested from.
+    if existing_doc.source_asset_id != asset.asset_id {
+        return Ok(None);
+    }
+    // 2. Parser unchanged: parser_version is baked into id_for_doc so
+    //    a version bump yields a different doc_id and the row above
+    //    would have been missing. Checking here explicitly keeps the
+    //    logic self-documenting and guards against future id_for_doc
+    //    changes.
+    if existing_doc.parser_version != *current_parser_version {
+        return Ok(None);
+    }
+    // 3. Chunker unchanged.
    let chunker_match = existing_doc.last_chunker_version.as_ref()
        == Some(current_chunker_version);
    if !chunker_match {
        return Ok(None);
    }
+    // 4. Embedder unchanged.
    let embedder_match = existing_doc.last_embedding_version.as_ref()
        == current_embedding_version;
    if !embedder_match {
        return Ok(None);
    }
+    let candidate_doc_id = existing_doc.doc_id.clone();
    tracing::debug!(
        target: "kebab-app::ingest",
        path = %asset.workspace_path.0,
@@ -1446,6 +1476,120 @@ fn purge_vector_orphans_for_workspace_path(
    Ok(())
 }

+/// Dogfood: post-walker sweep that purges stored documents whose source
+/// file has been physically deleted from the filesystem.
+///
+/// Algorithm:
+/// 1. Query `documents` for every `workspace_path` currently stored.
+/// 2. Compute `orphan_candidates = stored_paths - scanned_paths`.
+/// 3. For each candidate: resolve to an absolute path and call
+///    `Path::try_exists().unwrap_or(true)` — transient FS errors
+///    (EACCES, NFS hiccup, ownership change) conservatively count as
+///    "still present" so we never purge on uncertain signal. If the
+///    file still exists on disk it was merely out-of-scope this run
+///    (config narrowing / include-glob change) — leave it alone. Only
+///    files that are truly absent trigger a purge.
+/// 4. For absent files: call `purge_deleted_workspace_path` (SQLite
+///    cascade delete + optional copied-asset file removal) and, if a
+///    vector store is present, delete the associated vectors.
+///
+/// Returns the number of documents purged.
+///
+/// Non-fatal design: an individual purge failure is logged at warn
+/// level and that path is skipped; it does NOT abort the sweep, since
+/// a partial failure is preferable to blocking the rest of ingest.
+/// The return value counts only successful purges.
+fn sweep_deleted_files(
+    app: &App,
+    scanned_paths: &std::collections::HashSet<kebab_core::WorkspacePath>,
+    vector_store: Option<&kebab_store_vector::LanceVectorStore>,
+) -> anyhow::Result<u32> {
+    use kebab_core::DocumentStore as _;
+
+    let stored_paths = app
+        .sqlite
+        .all_workspace_paths()
+        .context("sweep_deleted_files: all_workspace_paths")?;
+
+    if stored_paths.is_empty() {
+        return Ok(0);
+    }
+
+    let workspace_root = app.config.resolve_workspace_root();
+    let mut purged: u32 = 0;
+
+    for stored_path in stored_paths {
+        if scanned_paths.contains(&stored_path) {
+            continue; // still in scope — skip
+        }
+
+        // Resolve to an absolute path and check existence on disk.
+        // Use `try_exists` + `unwrap_or(true)` so transient FS errors
+        // (EACCES on a path we lack read on, NFS hiccups, ownership
+        // change) are CONSERVATIVELY treated as "file still present" —
+        // never purge on uncertain signal (data-safety: PR #148 review).
+        // `exists()` would return false on Err and trigger a wrongful
+        // purge. Files whose path cannot be joined (theoretically
+        // impossible for non-empty workspace_path strings, but
+        // defense-in-depth) are likewise treated as still present.
+        let abs = workspace_root.join(&stored_path.0);
+        if abs.try_exists().unwrap_or(true) {
+            // File is on disk but not in this scan's scope (config
+            // narrowing). DO NOT purge — critical design constraint.
+            tracing::debug!(
+                target: "kebab-app",
+                path = %stored_path.0,
+                "sweep_deleted_files: file on disk but out of scope — leaving in store"
+            );
+            continue;
+        }
+
+        // File is truly absent → purge.
+        let chunk_ids = match kebab_store_sqlite::purge_deleted_workspace_path(
+            &app.sqlite,
+            &stored_path,
+        ) {
+            Ok(ids) => ids,
+            Err(e) => {
+                tracing::warn!(
+                    target: "kebab-app",
+                    path = %stored_path.0,
+                    error = %e,
+                    "sweep_deleted_files: purge failed; skipping this path"
+                );
+                continue;
+            }
+        };
+
+        // Purge associated vectors (best-effort; partial failure
+        // acceptable — orphan vectors get cleaned by `kebab reset
+        // --vector-only` if they accumulate).
+        if let Some(vec) = vector_store {
+            if !chunk_ids.is_empty() {
+                use kebab_core::VectorStore as _;
+                if let Err(e) = vec.delete_by_chunk_ids(&chunk_ids) {
+                    tracing::warn!(
+                        target: "kebab-app",
+                        path = %stored_path.0,
+                        count = chunk_ids.len(),
+                        error = %e,
+                        "sweep_deleted_files: vector delete failed; SQLite side already cleaned"
+                    );
+                }
+            }
+        }
+
+        tracing::info!(
+            target: "kebab-app",
+            path = %stored_path.0,
+            "sweep_deleted_files: purged document for deleted file"
+        );
+        purged = purged.saturating_add(1);
+    }
+
+    Ok(purged)
+}
+
 /// P7-3: process one `MediaType::Pdf` asset end-to-end.
 ///
 /// - Reads bytes from disk.
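The selection rule that `sweep_deleted_files` describes (stored paths minus scanned paths, then keep only those truly absent from disk) can be sketched with the standard library alone. This is a simplified illustration, not the kebab implementation: `purge_candidates` and `sweep_demo` are hypothetical names, plain `String`s stand in for `WorkspacePath`, and nothing is actually deleted from a store.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::Path;

/// Return the stored paths a conservative sweep would purge:
/// not seen by this scan AND confirmed absent on disk.
fn purge_candidates(root: &Path, stored: &[String], scanned: &HashSet<String>) -> Vec<String> {
    stored
        .iter()
        // Still in scope: nothing to do.
        .filter(|p| !scanned.contains(p.as_str()))
        // `try_exists` returns Err on uncertain signal (e.g. permission
        // problems); `unwrap_or(true)` treats that as "still present".
        .filter(|p| !root.join(p.as_str()).try_exists().unwrap_or(true))
        .cloned()
        .collect()
}

/// Build a throwaway directory with one in-scope file, one out-of-scope
/// file, and one stored path whose file no longer exists.
fn sweep_demo() -> io::Result<Vec<String>> {
    let root = std::env::temp_dir().join(format!("kebab-sweep-sketch-{}", std::process::id()));
    fs::create_dir_all(&root)?;
    fs::write(root.join("kept.rs"), "fn kept() {}\n")?;
    fs::write(root.join("out_of_scope.rs"), "fn other() {}\n")?;

    let stored = vec![
        "kept.rs".to_string(),
        "out_of_scope.rs".to_string(),
        "deleted.rs".to_string(),
    ];
    let scanned: HashSet<String> = ["kept.rs".to_string()].into_iter().collect();

    let result = purge_candidates(&root, &stored, &scanned);
    fs::remove_dir_all(&root)?;
    Ok(result)
}

fn main() -> io::Result<()> {
    println!("would purge: {:?}", sweep_demo()?);
    Ok(())
}
```

Only `deleted.rs` is selected here: `out_of_scope.rs` is missing from the scan but still on disk, so the conservative rule leaves it alone.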
--- a/crates/kebab-app/src/reset.rs
+++ b/crates/kebab-app/src/reset.rs
@@ -9,13 +9,19 @@
 //!
 //! `--vector-only` additionally truncates `embedding_records` in SQLite
 //! so the next `kebab ingest` re-embeds cleanly without orphan rows.
+//!
+//! `--orphans-only` purges stored docs that are outside the current walker
+//! scope (config narrowing / removed sub-directory). No filesystem paths are
+//! removed — this is purely a store-level reconciliation.

+use std::collections::HashSet;
 use std::path::PathBuf;

 use anyhow::{Context, Result};
 use serde::{Deserialize, Serialize};

 use kebab_config::{Config, expand_path};
+use kebab_core::WorkspacePath;

 /// What the user asked to remove. Mutually exclusive — picked by the CLI
 /// from a clap `ArgGroup`.
@@ -32,6 +38,13 @@ pub enum ResetScope {
    VectorOnly,
    /// Wipe only the config dir.
    ConfigOnly,
+    /// Purge stored docs that are outside the current walker scope (no
+    /// filesystem paths are removed). Filesystem existence is NOT checked —
+    /// anything the current walker would not visit is considered an orphan.
+    /// The explicit complement to the conservative `sweep_deleted_files`
+    /// that runs during ingest (which leaves on-disk-but-out-of-scope docs
+    /// alone for data safety).
+    OrphansOnly,
 }

 /// Result of a successful wipe — emitted as `reset_report.v1` by the
@@ -41,6 +54,16 @@ pub struct ResetReport {
    pub scope: ResetScope,
    pub removed_paths: Vec<PathBuf>,
    pub embedding_rows_truncated: u64,
+    /// Number of stored docs purged because they are outside the current
+    /// walker scope. Non-zero only when `scope == OrphansOnly`.
+    /// `#[serde(default)]` preserves back-compat with older callers that
+    /// do not include this field.
+    #[serde(default)]
+    pub orphans_purged: u32,
+    /// Paths of the orphaned docs that were purged. Sorted for deterministic
+    /// output. Non-empty only when `scope == OrphansOnly`.
+    #[serde(default)]
+    pub purged_paths: Vec<WorkspacePath>,
 }

 /// Compute the absolute on-disk paths a given scope will wipe, given a
@@ -67,6 +90,10 @@ pub fn enumerate_paths(scope: ResetScope, cfg: &Config) -> Vec<PathBuf> {
            vec![vector_dir]
        }
        ResetScope::ConfigOnly => vec![cfg_dir],
+        // OrphansOnly operates purely at the store level — no filesystem paths
+        // are removed. Return empty so `estimate_size_bytes` stays zero and
+        // the existing confirm UI path for directory wipes is skipped.
+        ResetScope::OrphansOnly => vec![],
    }
 }

@@ -96,16 +123,82 @@ pub fn estimate_size_bytes(paths: &[PathBuf]) -> u64 {
    paths.iter().map(|p| walk(p)).sum()
 }

+/// Compute the workspace paths stored in SQLite that are NOT visited by
+/// the current walker scope (i.e. they are "orphans" — on disk but
+/// outside the configured include/exclude rules, or from a sub-directory
+/// that has since been removed from the workspace).
+///
+/// Does NOT check filesystem existence — `OrphansOnly` is the explicit
+/// "I know what I'm doing" variant; callers that want the conservative
+/// fs-aware sweep should use `sweep_deleted_files` inside ingest.
+///
+/// Returns the list sorted for deterministic output. Called twice by the
+/// CLI path (once for the confirm UI preview, once inside `execute`);
+/// the double scan is acceptable for a rare destructive operation.
+pub fn enumerate_orphans(cfg: &Config) -> Result<Vec<WorkspacePath>> {
+    use kebab_core::DocumentStore as _;
+    use kebab_source_fs::FsSourceConnector;
+    use kebab_core::SourceScope;
+
+    let store = kebab_store_sqlite::SqliteStore::open(cfg)
+        .context("enumerate_orphans: open SqliteStore")?;
+
+    let stored = store
+        .all_workspace_paths()
+        .context("enumerate_orphans: all_workspace_paths")?;
+
+    if stored.is_empty() {
+        return Ok(Vec::new());
+    }
+
+    // Build the same SourceScope the CLI's ingest path uses: root from
+    // config, exclude list from config, no include override (full scope).
+    let root = cfg.resolve_workspace_root();
+    let scope = SourceScope {
+        root: root.clone(),
+        exclude: cfg.workspace.exclude.clone(),
+        ..Default::default()
+    };
+
+    let connector = FsSourceConnector::new(cfg)
+        .context("enumerate_orphans: build FsSourceConnector")?;
+    let (assets, _skips) = connector
+        .scan_with_skips(&scope)
+        .context("enumerate_orphans: scan workspace")?;
+
+    let scanned: HashSet<WorkspacePath> = assets
+        .into_iter()
+        .map(|a| a.workspace_path)
+        .collect();
+
+    let mut orphans: Vec<WorkspacePath> = stored
+        .into_iter()
+        .filter(|p| !scanned.contains(p))
+        .collect();
+    orphans.sort_by(|a, b| a.0.cmp(&b.0));
+    Ok(orphans)
+}
+
 /// Wipe every path from `enumerate_paths(scope, cfg)`. For
 /// `ResetScope::VectorOnly`, also truncates the SQLite
 /// `embedding_records` table so the store doesn't point at the Lance
 /// rows we just removed off-disk.
 ///
+/// For `ResetScope::OrphansOnly`, no filesystem directories are removed.
+/// Instead the store is reconciled: stored docs outside the current walker
+/// scope are purged from SQLite (+ vector store when configured). The
+/// caller is expected to have already shown the confirm UI using
+/// `enumerate_orphans`.
+///
 /// Idempotent: a missing path is treated as already-removed (success).
 /// Returns a `ResetReport` listing exactly what was removed (paths that
 /// existed before the call) so `--json` callers see the truth, not the
 /// request.
 pub fn execute(scope: ResetScope, cfg: &Config) -> Result<ResetReport> {
+    if matches!(scope, ResetScope::OrphansOnly) {
+        return execute_orphans_only(cfg);
+    }
+
    let paths = enumerate_paths(scope, cfg);
    let mut removed = Vec::new();

@@ -128,9 +221,100 @@ pub fn execute(scope: ResetScope, cfg: &Config) -> Result<ResetReport> {
        scope,
        removed_paths: removed,
        embedding_rows_truncated,
+        orphans_purged: 0,
+        purged_paths: Vec::new(),
    })
 }

+/// Execute the `OrphansOnly` variant: reconcile stored docs against the
+/// current walker scope without touching any filesystem directory.
+fn execute_orphans_only(cfg: &Config) -> Result<ResetReport> {
+    let orphans = enumerate_orphans(cfg)
+        .context("execute_orphans_only: enumerate orphans")?;
+
+    if orphans.is_empty() {
+        return Ok(ResetReport {
+            scope: ResetScope::OrphansOnly,
+            removed_paths: Vec::new(),
+            embedding_rows_truncated: 0,
+            orphans_purged: 0,
+            purged_paths: Vec::new(),
+        });
+    }
+
+    let store = std::sync::Arc::new(
+        kebab_store_sqlite::SqliteStore::open(cfg)
+            .context("execute_orphans_only: open SqliteStore")?,
+    );
+
+    // Open vector store if configured. Mirror the same guard the ingest
+    // path uses: only construct when the provider is not "none" / dims > 0.
+    let vector_store: Option<kebab_store_vector::LanceVectorStore> =
+        open_vector_store_if_configured(cfg, store.clone())?;
+
+    let mut purged_paths: Vec<WorkspacePath> = Vec::new();
+
+    for path in &orphans {
+        let chunk_ids = kebab_store_sqlite::purge_deleted_workspace_path(&store, path)
+            .with_context(|| format!("execute_orphans_only: purge {}", path.0))?;
+
+        if let Some(ref vs) = vector_store {
+            if !chunk_ids.is_empty() {
+                use kebab_core::VectorStore as _;
+                if let Err(e) = vs.delete_by_chunk_ids(&chunk_ids) {
+                    tracing::warn!(
+                        target: "kebab-app",
+                        path = %path.0,
+                        count = chunk_ids.len(),
+                        error = %e,
+                        "reset --orphans-only: vector delete failed; SQLite side already cleaned"
+                    );
+                }
+            }
+        }
+
+        tracing::info!(
+            target: "kebab-app",
+            path = %path.0,
+            "reset --orphans-only: purged orphan document"
+        );
+        purged_paths.push(path.clone());
+    }
+
+    let orphans_purged = u32::try_from(purged_paths.len()).unwrap_or(u32::MAX);
+
+    Ok(ResetReport {
+        scope: ResetScope::OrphansOnly,
+        removed_paths: Vec::new(),
+        embedding_rows_truncated: 0,
+        orphans_purged,
+        purged_paths,
+    })
+}
+
+/// Open the Lance vector store if the configured embedding provider is
+/// active (non-"none", dimensions > 0). Returns `None` for lexical-only
+/// configs. Mirrors the guard in `App::vector`.
+fn open_vector_store_if_configured(
+    cfg: &Config,
+    store: std::sync::Arc<kebab_store_sqlite::SqliteStore>,
+) -> Result<Option<kebab_store_vector::LanceVectorStore>> {
+    if cfg.models.embedding.provider == "none" || cfg.models.embedding.dimensions == 0 {
+        return Ok(None);
+    }
+    match kebab_store_vector::LanceVectorStore::new(cfg, store) {
+        Ok(vs) => Ok(Some(vs)),
+        Err(e) => {
+            tracing::warn!(
+                target: "kebab-app",
+                error = %e,
+                "reset --orphans-only: could not open vector store; skipping vector delete"
+            );
+            Ok(None)
+        }
+    }
+}
+
 /// Open the SQLite store at the configured path and run
 /// `truncate_embedding_records`. Returns the count of truncated rows
 /// (the helper itself reports `DELETE` rowcount). If the SQLite file
@@ -200,4 +384,14 @@ mod tests {
        let bytes = estimate_size_bytes(&[dir.path().to_path_buf()]);
        assert_eq!(bytes, 5 + 6);
    }
+
+    #[test]
+    fn enumerate_orphans_only_returns_empty_paths() {
+        let cfg = Config::defaults();
+        let paths = enumerate_paths(ResetScope::OrphansOnly, &cfg);
+        assert!(
+            paths.is_empty(),
+            "OrphansOnly must return empty vec from enumerate_paths"
+        );
+    }
 }
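By contrast with the ingest-time sweep, the `--orphans-only` path in `reset.rs` skips the filesystem check entirely and reports a sorted list plus a saturating count. The following is a rough sketch of just that selection and counting logic, under the assumption that plain `String`s can stand in for `WorkspacePath`; `OrphanReport`, `select_orphans`, `build_report`, and `orphans_demo` are hypothetical names that do not appear in the diff.

```rust
use std::collections::HashSet;

/// Hypothetical, trimmed-down report holding only the two additive fields.
#[derive(Debug, PartialEq)]
struct OrphanReport {
    orphans_purged: u32,
    purged_paths: Vec<String>,
}

/// Every stored path the scan did not visit is an orphan.
/// No filesystem existence check is made. Sorted for deterministic output.
fn select_orphans(stored: Vec<String>, scanned: &HashSet<String>) -> Vec<String> {
    let mut orphans: Vec<String> = stored
        .into_iter()
        .filter(|p| !scanned.contains(p))
        .collect();
    orphans.sort();
    orphans
}

/// Convert the list length to `u32`, saturating instead of panicking
/// if the length does not fit.
fn build_report(purged_paths: Vec<String>) -> OrphanReport {
    let orphans_purged = u32::try_from(purged_paths.len()).unwrap_or(u32::MAX);
    OrphanReport {
        orphans_purged,
        purged_paths,
    }
}

fn orphans_demo() -> OrphanReport {
    let stored = vec![
        "z/old.md".to_string(),
        "src/lib.rs".to_string(),
        "a/old.md".to_string(),
    ];
    let scanned: HashSet<String> = ["src/lib.rs".to_string()].into_iter().collect();
    build_report(select_orphans(stored, &scanned))
}

fn main() {
    println!("{:?}", orphans_demo());
}
```

Both `a/old.md` and `z/old.md` are reported whether or not the files still exist on disk, which is why the diff routes this behaviour through an explicit flag and a confirm step rather than running it during ingest.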
--- a/crates/kebab-app/src/schema.rs
+++ b/crates/kebab-app/src/schema.rs
@@ -168,7 +168,9 @@ fn collect_stats(
        stale_doc_count: counts.stale_doc_count,
        // p10-1A-2: populated by the store query added in this task.
        code_lang_breakdown: store.code_lang_breakdown()?,
-        repo_breakdown: std::collections::BTreeMap::new(),
+        // p10-1A-2 follow-up: dogfooding (2026-05-20) revealed this was a
+        // placeholder — mirror of code_lang_breakdown for the repo field.
+        repo_breakdown: store.repo_breakdown()?,
    })
 }

--- a/crates/kebab-app/tests/file_deletion_auto_purge.rs
+++ b/crates/kebab-app/tests/file_deletion_auto_purge.rs
@@ -0,0 +1,178 @@
+//! Dogfood: auto-purge stored docs for filesystem-deleted files.
+//!
+//! Two tests:
+//!
+//! 1. `file_deletion_auto_purge` — ingest 2 files, delete one, re-ingest.
+//!    The re-ingest must report `purged_deleted_files = 1`, the deleted
+//!    file must no longer appear in `list_docs`, and lexical search for
+//!    its unique content must return no hits.
+//!
+//! 2. `include_scope_narrowing_does_not_purge` — ingest 2 files under a
+//!    wide glob, narrow the walker scope to only one file, re-ingest.
+//!    The narrowed ingest must NOT purge the out-of-scope file because
+//!    the file is still on disk (just excluded from this run). Protects
+//!    users against accidental data loss via config edits.
+
+mod common;
+
+use common::TestEnv;
+use kebab_app::ingest_with_config_opts;
+use kebab_app::IngestOpts;
+use kebab_core::{DocFilter, DocumentStore, SearchMode, SearchQuery, SourceScope};
+
+/// Helper: open the store via `TestEnv` and run `list_documents`.
+fn list_doc_paths(env: &TestEnv) -> Vec<String> {
+    use kebab_store_sqlite::SqliteStore;
+    let store = SqliteStore::open(&env.config).unwrap();
+    store.run_migrations().unwrap();
+    store
+        .list_documents(&DocFilter::default())
+        .unwrap()
+        .into_iter()
+        .map(|d| d.doc_path.0)
+        .collect()
+}
+
+#[test]
+fn file_deletion_auto_purge() {
+    let env = TestEnv::lexical_only();
+
+    // Write two .rs files into the workspace.
+    let a_path = env.workspace_root.join("a.rs");
+    let b_path = env.workspace_root.join("b.rs");
+    std::fs::write(&a_path, "// file a\nfn alpha() {}\n").unwrap();
+    std::fs::write(&b_path, "// file b\nfn bravo() {}\n").unwrap();
+
+    // First ingest — both must be New.
+    let first = ingest_with_config_opts(
+        env.config.clone(),
+        env.scope(),
+        false,
+        IngestOpts::default(),
+    )
+    .expect("first ingest must succeed");
+    // Only count the .rs files we added (there may be fixture files too).
+    let first_new = first.new;
+    assert!(first_new >= 2, "expected at least 2 new docs: {first:?}");
+    assert_eq!(
+        first.purged_deleted_files, 0,
+        "no purges on first ingest: {first:?}"
+    );
+    assert_eq!(first.errors, 0, "no errors on first ingest: {first:?}");
+
+    // Delete one file from the filesystem.
+    std::fs::remove_file(&b_path).expect("remove b.rs");
+
+    // Second ingest — scanned count drops by 1; b.rs should be purged.
+    let second = ingest_with_config_opts(
+        env.config.clone(),
+        env.scope(),
+        false,
+        IngestOpts::default(),
+    )
+    .expect("second ingest must succeed");
+
+    assert_eq!(
+        second.purged_deleted_files, 1,
+        "exactly 1 file should be purged: {second:?}"
+    );
+    assert_eq!(second.new, 0, "no new docs after deletion: {second:?}");
+    assert_eq!(second.updated, 0, "no updated docs: {second:?}");
+    assert_eq!(second.errors, 0, "no errors: {second:?}");
+
+    // b.rs must no longer appear in list_docs.
+    let doc_paths = list_doc_paths(&env);
+    let b_ws_path = "b.rs";
+    assert!(
+        !doc_paths.iter().any(|p| p == b_ws_path),
+        "b.rs must be gone from list_docs; got: {doc_paths:?}"
+    );
+    // a.rs must still be present.
+    let a_ws_path = "a.rs";
+    assert!(
+        doc_paths.iter().any(|p| p == a_ws_path),
+        "a.rs must still be in list_docs; got: {doc_paths:?}"
+    );
+
+    // Lexical search for b.rs's unique content returns no hits.
+    let app = env.app();
+    let query = SearchQuery {
+        text: "bravo".to_string(),
+        mode: SearchMode::Lexical,
+        k: 10,
+        filters: kebab_core::SearchFilters::default(),
+    };
+    let hits = app.search(query).expect("search must not error");
+    assert!(
+        hits.is_empty(),
+        "search for deleted file's content must return no hits; got: {hits:?}"
+    );
+}
+
+#[test]
+fn include_scope_narrowing_does_not_purge() {
+    let env = TestEnv::lexical_only();
+
+    // Write two .rs files.
+    let a_path = env.workspace_root.join("a_narrow.rs");
+    let b_path = env.workspace_root.join("b_narrow.rs");
+    std::fs::write(&a_path, "// narrow a\nfn alpha_narrow() {}\n").unwrap();
+    std::fs::write(&b_path, "// narrow b\nfn bravo_narrow() {}\n").unwrap();
+
+    // Wide scope: first ingest — both must be New.
+    let wide_scope = SourceScope {
+        root: env.workspace_root.clone(),
+        include: vec!["**/*.rs".to_string()],
+        exclude: env.config.workspace.exclude.clone(),
+    };
+    let first = ingest_with_config_opts(
+        env.config.clone(),
+        wide_scope,
+        false,
+        IngestOpts::default(),
+    )
+    .expect("first ingest (wide) must succeed");
+    assert!(
+        first.new >= 2,
+        "expected at least 2 new docs: {first:?}"
+    );
+    assert_eq!(
+        first.purged_deleted_files, 0,
+        "no purges on first ingest: {first:?}"
+    );
+
+    // Narrow scope: only a_narrow.rs in include — b_narrow.rs is still
+    // on disk but excluded from the walker scope.
+    let narrow_scope = SourceScope {
+        root: env.workspace_root.clone(),
+        include: vec!["a_narrow.rs".to_string()],
+        exclude: env.config.workspace.exclude.clone(),
+    };
+    let second = ingest_with_config_opts(
+        env.config.clone(),
+        narrow_scope,
+        false,
+        IngestOpts::default(),
+    )
+    .expect("second ingest (narrow) must succeed");
+
+    // CRITICAL: b_narrow.rs is still on disk — must NOT be purged.
+    assert_eq!(
+        second.purged_deleted_files, 0,
+        "scope-narrowing must NOT purge on-disk files; got: {second:?}"
+    );
+    assert_eq!(second.errors, 0, "no errors: {second:?}");
+
+    // b_narrow.rs must still exist in the store.
+    let doc_paths = list_doc_paths(&env);
+    let b_ws_path = "b_narrow.rs";
+    assert!(
+        doc_paths.iter().any(|p| p == b_ws_path),
+        "b_narrow.rs must still be in list_docs after scope narrowing; got: {doc_paths:?}"
+    );
+    // And the file must still be on disk.
+    assert!(
+        b_path.exists(),
+        "b_narrow.rs must still be on disk (we didn't delete it)"
+    );
+}
--- a/crates/kebab-app/tests/reset_orphans.rs
+++ b/crates/kebab-app/tests/reset_orphans.rs
@@ -0,0 +1,141 @@
+//! Integration test for `kebab reset --orphans-only`.
+//!
+//! Verifies that stored docs outside the current walker scope are purged
+//! from the store without removing any files from the filesystem.
+//!
+//! Test outline:
+//! 1. Ingest 3 .rs files (a.rs, b.rs, c.rs) — all New.
+//! 2. Narrow the scope by adding b.rs and c.rs to the config `exclude`
+//!    list; both are still on disk but outside the walker scope.
+//! 3. Run `execute(ResetScope::OrphansOnly, &cfg)` — report must show
+//!    `orphans_purged == 2` and `purged_paths` contains b.rs + c.rs.
+//! 4. `list docs` must still show a.rs but no longer b.rs or c.rs.
+//! 5. b.rs and c.rs must still exist on disk (no filesystem removal).
+//! 6. Second reset → `orphans_purged == 0` (idempotent).
+
+mod common;
+
+use common::TestEnv;
+use kebab_app::IngestOpts;
+use kebab_app::reset::{ResetScope, execute};
+use kebab_core::{DocFilter, DocumentStore, SourceScope};
+
+/// Open the SqliteStore and list all `workspace_path` values.
+fn list_doc_paths(env: &TestEnv) -> Vec<String> {
+    use kebab_store_sqlite::SqliteStore;
+    let store = SqliteStore::open(&env.config).unwrap();
+    store.run_migrations().unwrap();
+    store
+        .list_documents(&DocFilter::default())
+        .unwrap()
+        .into_iter()
+        .map(|d| d.doc_path.0)
+        .collect()
+}
+
+#[test]
+fn reset_orphans_only_purges_out_of_scope_docs() {
+    let env = TestEnv::lexical_only();
+
+    // Write three .rs files into the workspace.
+    let a_path = env.workspace_root.join("a.rs");
+    let b_path = env.workspace_root.join("b.rs");
+    let c_path = env.workspace_root.join("c.rs");
+    std::fs::write(&a_path, "// file a\nfn alpha() {}\n").unwrap();
+    std::fs::write(&b_path, "// file b\nfn bravo() {}\n").unwrap();
+    std::fs::write(&c_path, "// file c\nfn charlie() {}\n").unwrap();
+
+    // Ingest all three with a wide scope.
+    let wide_scope = SourceScope {
+        root: env.workspace_root.clone(),
+        include: vec!["**/*.rs".to_string()],
+        exclude: env.config.workspace.exclude.clone(),
+    };
+    let first = kebab_app::ingest_with_config_opts(
+        env.config.clone(),
+        wide_scope,
+        false,
+        IngestOpts::default(),
+    )
+    .expect("first ingest must succeed");
+    // The fixture workspace may contain other .rs files — just assert we
+    // got at least 3 new docs (our a.rs, b.rs, c.rs).
+    assert!(first.new >= 3, "expected at least 3 new docs: {first:?}");
+    assert_eq!(first.errors, 0, "no errors on first ingest");
+
+    // Narrow the scope so that b.rs and c.rs fall outside it while both
+    // files stay on disk.
+    //
+    // NOTE: `kebab_config::WorkspaceCfg` has no `include` field (it was
+    // removed in p9-fb-25), so the narrowing cannot be expressed as an
+    // include glob. Instead, b.rs and c.rs are excluded explicitly via
+    // the walker exclude list.
+    //
+    // `workspace.root` is left untouched: `enumerate_orphans` uses it to
+    // build its scope, and it already points at the workspace root.
+    let mut narrow_cfg = env.config.clone();
+    // Plain assignment replaces the previous exclude list wholesale.
+    narrow_cfg.workspace.exclude = vec!["b.rs".to_string(), "c.rs".to_string()];
+
+    // Run orphans-only reset.
+    let report = execute(ResetScope::OrphansOnly, &narrow_cfg)
+        .expect("orphans-only reset must succeed");
+
+    assert_eq!(
+        report.orphans_purged, 2,
+        "expected 2 orphans purged (b.rs + c.rs): {report:?}"
+    );
+
+    let mut purged: Vec<String> = report
+        .purged_paths
+        .iter()
+        .map(|p| p.0.clone())
+        .collect();
+    purged.sort();
+    assert_eq!(
+        purged,
+        vec!["b.rs".to_string(), "c.rs".to_string()],
+        "purged_paths must list b.rs and c.rs in sorted order: {purged:?}"
+    );
+
+    // list docs must show only a.rs (and any pre-existing fixture files
+    // that are not excluded by the narrow config).
+    let doc_paths = list_doc_paths(&env);
+    // The narrow_cfg excludes b.rs + c.rs — they must no longer be in store.
+    assert!(
+        !doc_paths.iter().any(|p| p == "b.rs"),
+        "b.rs must be gone from store after orphans-only reset; got: {doc_paths:?}"
+    );
+    assert!(
+        !doc_paths.iter().any(|p| p == "c.rs"),
+        "c.rs must be gone from store after orphans-only reset; got: {doc_paths:?}"
+    );
+    assert!(
+        doc_paths.iter().any(|p| p == "a.rs"),
+        "a.rs must still be in store; got: {doc_paths:?}"
+    );
+
+    // Both b.rs and c.rs must still exist on the filesystem — no file
+    // removal is performed by orphans-only.
+    assert!(
+        b_path.exists(),
+        "b.rs must still be on disk after orphans-only reset"
+    );
+    assert!(
+        c_path.exists(),
+        "c.rs must still be on disk after orphans-only reset"
+    );
+
+    // Second reset must be idempotent: nothing left to purge.
+    let second = execute(ResetScope::OrphansOnly, &narrow_cfg)
+        .expect("second orphans-only reset must succeed");
+    assert_eq!(
+        second.orphans_purged, 0,
+        "second reset must be idempotent (orphans_purged == 0): {second:?}"
+    );
+    assert!(
+        second.purged_paths.is_empty(),
+        "second reset purged_paths must be empty: {:?}",
+        second.purged_paths
+    );
+}
--- a/crates/kebab-app/tests/twin_files_idempotent.rs
+++ b/crates/kebab-app/tests/twin_files_idempotent.rs
@@ -0,0 +1,90 @@
+//! Regression test for the twin-file idempotency bug.
+//!
+//! Identical-content files at different workspace paths share one
+//! `assets` row (`asset_id` = blake3 content hash, PRIMARY KEY). The
+//! old UPSERT `ON CONFLICT(asset_id) DO UPDATE SET workspace_path =
+//! excluded.workspace_path` made each twin overwrite the other's path
+//! on every ingest, so `get_asset_by_workspace_path(path1)` returned
+//! None (or the wrong twin) → re-process every time.
+//!
+//! Fix: `try_skip_unchanged` now uses `get_document_by_workspace_path`
+//! instead. `documents.workspace_path` is UNIQUE (V001) so each twin
+//! has its own stable document row.
+//!
+//! Assertion contract:
+//!   1st ingest → 2 New (one per twin)
+//!   2nd ingest → 0 New, 0 Updated, 2 Unchanged
+
+mod common;
+
+use common::TestEnv;
+use kebab_app::ingest_with_config;
+use kebab_core::IngestItemKind;
+
+#[test]
+fn twin_files_second_ingest_is_unchanged() {
+    let env = TestEnv::lexical_only();
+
+    // Write two files with identical content at different paths.
+    let pkg_a = env.workspace_root.join("pkg_a");
+    let pkg_b = env.workspace_root.join("pkg_b");
+    std::fs::create_dir_all(&pkg_a).unwrap();
+    std::fs::create_dir_all(&pkg_b).unwrap();
+
+    let content = b"# shared\nThis content is identical in both files.\n";
+    std::fs::write(pkg_a.join("__init__.py"), content).unwrap();
+    std::fs::write(pkg_b.join("__init__.py"), content).unwrap();
+
+    // First ingest — both files must be New.
+    let first = ingest_with_config(env.config.clone(), env.scope(), false)
+        .expect("first ingest must succeed");
+    assert_eq!(first.errors, 0, "first ingest: no errors; report={first:?}");
+
+    let items = first.items.as_ref().expect("items must be present");
+    let twin_items: Vec<_> = items
+        .iter()
+        .filter(|i| {
+            i.doc_path.0.ends_with("__init__.py")
+        })
+        .collect();
+    assert_eq!(
+        twin_items.len(),
+        2,
+        "first ingest: expected exactly 2 __init__.py items; items={items:?}"
+    );
+    for item in &twin_items {
+        assert_eq!(
+            item.kind,
+            IngestItemKind::New,
+            "first ingest: each twin must be New; item={item:?}"
+        );
+    }
+
+    // Second ingest — same files, same content → both must be Unchanged.
+    let second = ingest_with_config(env.config.clone(), env.scope(), false)
+        .expect("second ingest must succeed");
+    assert_eq!(second.errors, 0, "second ingest: no errors; report={second:?}");
+    assert_eq!(second.new, 0, "second ingest: no new docs; report={second:?}");
+    assert_eq!(
+        second.updated, 0,
+        "second ingest: no updated docs (twin-file bug would set this to 2); report={second:?}"
+    );
+
+    let second_items = second.items.as_ref().expect("items must be present");
+    let twin_items2: Vec<_> = second_items
+        .iter()
+        .filter(|i| i.doc_path.0.ends_with("__init__.py"))
+        .collect();
+    assert_eq!(
+        twin_items2.len(),
+        2,
+        "second ingest: expected exactly 2 __init__.py items; items={second_items:?}"
+    );
+    for item in &twin_items2 {
+        assert_eq!(
+            item.kind,
+            IngestItemKind::Unchanged,
+            "second ingest: each twin must be Unchanged; item={item:?}"
+        );
+    }
+}
--- a/crates/kebab-cli/src/main.rs
+++ b/crates/kebab-cli/src/main.rs
@@ -275,6 +275,14 @@ enum Cmd {
        #[arg(long, group = "reset_scope")]
        config_only: bool,

+        /// Purge stored docs that are outside the current walker scope
+        /// (config narrowing / removed sub-directory). No filesystem paths
+        /// are removed — this is purely a store-level reconciliation.
+        /// Filesystem existence is NOT checked; anything the current walker
+        /// would not visit is considered an orphan and removed from the store.
+        #[arg(long, group = "reset_scope")]
+        orphans_only: bool,
+
        /// Skip the interactive confirm. Required in non-interactive
        /// contexts (CI, pipes).
        #[arg(long)]
@@ -595,14 +603,20 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
                println!("{}", serde_json::to_string(&wire::wire_ingest(&report))?);
            } else {
                let skipped_breakdown = kebab_app::render_skipped_breakdown(&report.skipped_by_extension);
+                let purged_suffix = if report.purged_deleted_files > 0 {
+                    format!("  purged {}", report.purged_deleted_files)
+                } else {
+                    String::new()
+                };
                println!(
-                    "scanned {}  new {}  updated {}  skipped {}{}  errors {}  ({} ms)",
+                    "scanned {}  new {}  updated {}  skipped {}{}  errors {}{}  ({} ms)",
                    report.scanned,
                    report.new,
                    report.updated,
                    report.skipped,
                    skipped_breakdown,
                    report.errors,
+                    purged_suffix,
                    report.duration_ms
                );
            }
@@ -1088,6 +1102,7 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
            data_only: _,
            vector_only,
            config_only,
+            orphans_only,
            yes,
        } => {
            use kebab_app::ResetScope;
@@ -1101,11 +1116,50 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
                ResetScope::VectorOnly
            } else if *config_only {
                ResetScope::ConfigOnly
+            } else if *orphans_only {
+                ResetScope::OrphansOnly
            } else {
                ResetScope::DataOnly
            };

            let cfg = kebab_config::Config::load(cli.config.as_deref())?;
+
+            if matches!(scope, ResetScope::OrphansOnly) {
+                // OrphansOnly: confirm UI shows orphan count + sample paths
+                // rather than on-disk directory sizes.
+                let orphan_paths = kebab_app::enumerate_orphans(&cfg)?;
+
+                if !*yes {
+                    use std::io::IsTerminal;
+                    if !std::io::stdin().is_terminal() {
+                        anyhow::bail!(
+                            "reset --orphans-only is destructive and stdin is non-interactive — pass --yes to proceed"
+                        );
+                    }
+                    if !confirm_orphans_only(&orphan_paths)? {
+                        if !cli.quiet {
+                            eprintln!("aborted.");
+                        }
+                        return Ok(());
+                    }
+                }
+
+                let report = kebab_app::reset::execute(scope, &cfg)?;
+                if cli.json {
+                    println!("{}", serde_json::to_string(&wire::wire_reset(&report))?);
+                } else {
+                    if report.orphans_purged > 0 {
+                        println!("orphans purged: {}", report.orphans_purged);
+                        for p in &report.purged_paths {
+                            println!("  - {}", p.0);
+                        }
+                    } else {
+                        println!("no orphaned docs found — store is already in sync with walker scope");
+                    }
+                }
+                return Ok(());
+            }
+
            let paths = kebab_app::reset::enumerate_paths(scope, &cfg);
            let bytes = kebab_app::reset::estimate_size_bytes(&paths);

@@ -1444,6 +1498,46 @@ fn confirm_destructive(
    Ok(matches!(s.as_str(), "y" | "yes"))
 }

+/// Confirm prompt for `--orphans-only`: shows the orphan count + a
+/// sample of up to 5 paths so the user knows what will be purged before
+/// committing. No filesystem paths are removed — only store records.
+fn confirm_orphans_only(
+    orphan_paths: &[kebab_core::WorkspacePath],
+) -> anyhow::Result<bool> {
+    use std::io::Write;
+    let n = orphan_paths.len();
+    let mut out = std::io::stderr().lock();
+
+    if n == 0 {
+        writeln!(out, "no orphaned docs found — nothing to purge.")?;
+        out.flush()?;
+        // Nothing to do; treat as confirmed so the caller can emit the
+        // "no orphans" report without prompting.
+        return Ok(true);
+    }
+
+    let sample: Vec<&str> = orphan_paths
+        .iter()
+        .take(5)
+        .map(|p| p.0.as_str())
+        .collect();
+    let sample_str = sample.join(", ");
+    let ellipsis = if n > 5 { ", …" } else { "" };
+
+    writeln!(
+        out,
+        "Purge {n} stored doc(s) outside the current walker scope? (no filesystem paths removed)"
+    )?;
+    writeln!(out, "  sample: {sample_str}{ellipsis}")?;
+    write!(out, "[y/N] ")?;
+    out.flush()?;
+
+    let mut line = String::new();
+    std::io::stdin().read_line(&mut line)?;
+    let s = line.trim().to_ascii_lowercase();
+    Ok(matches!(s.as_str(), "y" | "yes"))
+}
+
 /// p9-fb-35: human-friendly plain output for `kebab fetch`.
 fn render_fetch_plain(r: &kebab_core::FetchResult) {
    println!("# {} ({})", r.doc_path.0, format_kind(r.kind));
--- a/crates/kebab-cli/src/wire.rs
+++ b/crates/kebab-cli/src/wire.rs
@@ -260,6 +260,7 @@ mod tests {
            skipped_generated: 0,
            skipped_size_exceeded: 0,
            skip_examples: SkipExamples::default(),
+            purged_deleted_files: 0,
            items: None,
        };
        let v = wire_ingest(&r);
@@ -364,6 +365,8 @@ mod tests {
            scope: kebab_app::ResetScope::DataOnly,
            removed_paths: vec![std::path::PathBuf::from("/tmp/x")],
            embedding_rows_truncated: 0,
+            orphans_purged: 0,
+            purged_paths: vec![],
        };
        let v = wire_reset(&r);
        assert_eq!(schema_of(&v), Some("reset_report.v1"));
--- a/crates/kebab-core/src/ingest.rs
+++ b/crates/kebab-core/src/ingest.rs
@@ -47,6 +47,12 @@ pub struct IngestReport {
    /// p10-1A-1: sample file paths per skip category (≤ 5 each).
    #[serde(default)]
    pub skip_examples: SkipExamples,
+    /// Dogfood: docs whose on-disk file was deleted since the last ingest
+    /// and were therefore removed from the store. Additive field: wire
+    /// payloads that pre-date it deserialize with the value 0 via
+    /// `#[serde(default)]`.
+    #[serde(default)]
+    pub purged_deleted_files: u32,
    /// `None` ↔ wire `items: null` (`--summary-only`).
    pub items: Option<Vec<IngestItem>>,
 }
@@ -136,6 +142,7 @@ mod tests {
                builtin_blacklist: vec!["node_modules/x.js".into()],
                gitignore: vec![],
            },
+            purged_deleted_files: 0,
            items: None,
        };
        let v = serde_json::to_value(&r).unwrap();
--- a/crates/kebab-core/src/traits.rs
+++ b/crates/kebab-core/src/traits.rs
@@ -169,6 +169,30 @@ pub trait DocumentStore {
        &self,
        path: &WorkspacePath,
    ) -> anyhow::Result<Option<RawAsset>>;
+
+    /// Look up a document row by its workspace path. Used by the
+    /// document-centric skip path in `try_skip_unchanged` to avoid the
+    /// twin-file flip-flop that the asset-side lookup suffers from
+    /// (multiple files with identical content share one `assets` row
+    /// whose `workspace_path` is overwritten on every UPSERT, so
+    /// `get_asset_by_workspace_path` returns None or the wrong twin).
+    ///
+    /// `documents.workspace_path` is UNIQUE (V001), so each twin has
+    /// its own stable document row regardless of the asset de-dup.
+    fn get_document_by_workspace_path(
+        &self,
+        path: &WorkspacePath,
+    ) -> anyhow::Result<Option<CanonicalDocument>>;
+
+    /// Return every `workspace_path` stored in the `documents` table.
+    ///
+    /// Used by the post-walker sweep in `kebab-app::ingest` to detect
+    /// documents whose source file has been deleted from the filesystem.
+    /// The set difference `(stored - scanned)` yields orphan candidates;
+    /// each candidate is then existence-checked on disk so that
+    /// out-of-scope files (config narrowing) are NOT purged — only truly
+    /// absent files trigger the purge.
+    fn all_workspace_paths(&self) -> anyhow::Result<Vec<WorkspacePath>>;
 }

 pub trait VectorStore {
--- a/crates/kebab-parse-code/src/lang.rs
+++ b/crates/kebab-parse-code/src/lang.rs
@@ -24,7 +24,7 @@ pub fn code_lang_for_path(path: &Path) -> Option<&'static str> {
    match ext.as_str() {
        "rs" => Some("rust"),
        "py" | "pyi" => Some("python"),
-        "ts" | "tsx" => Some("typescript"),
+        "ts" | "tsx" | "mts" | "cts" => Some("typescript"),
        "js" | "mjs" | "cjs" | "jsx" => Some("javascript"),
        "go" => Some("go"),
        "java" => Some("java"),
@@ -82,7 +82,7 @@ pub fn module_path_for_python(workspace_path: &str) -> String {
 /// (no slash replacement, no source-root strip). See plan §Task C.
 pub fn module_path_for_tsjs(workspace_path: &str) -> String {
    let p = workspace_path;
-    for ext in [".tsx", ".ts", ".jsx", ".mjs", ".cjs", ".js"] {
+    for ext in [".tsx", ".mts", ".cts", ".ts", ".jsx", ".mjs", ".cjs", ".js"] {
        if let Some(stripped) = p.strip_suffix(ext) {
            return stripped.to_string();
        }
@@ -110,7 +110,7 @@ mod tests {

    #[test]
    fn module_path_for_tsjs_keeps_slashes_and_strips_ext() {
-        for ext in ["ts", "tsx", "js", "jsx", "mjs", "cjs"] {
+        for ext in ["ts", "tsx", "mts", "cts", "js", "jsx", "mjs", "cjs"] {
            let p = format!("src/search/retriever/Retriever.{ext}");
            assert_eq!(module_path_for_tsjs(&p), "src/search/retriever/Retriever");
        }
--- a/crates/kebab-parse-code/src/typescript.rs
+++ b/crates/kebab-parse-code/src/typescript.rs
@@ -173,8 +173,9 @@ impl Extractor for TypescriptAstExtractor {
 }

 /// Select the tree-sitter grammar based on the workspace path's
-/// extension. `.tsx` → TSX grammar; everything else (`.ts`, `.d.ts`,
-/// missing extension) → TypeScript grammar.
+/// extension. `.tsx` → TSX grammar; everything else (`.ts`, `.mts`,
+/// `.cts`, `.d.ts`, missing extension) → TypeScript grammar (the JSX-
+/// agnostic variants all share one grammar in tree-sitter-typescript 0.23).
 fn select_grammar(workspace_path: &str) -> tree_sitter::Language {
    if workspace_path.ends_with(".tsx") {
        tree_sitter_typescript::LANGUAGE_TSX.into()
--- a/crates/kebab-parse-code/tests/lang.rs
+++ b/crates/kebab-parse-code/tests/lang.rs
@@ -9,6 +9,8 @@ fn known_extensions_map_to_canonical_identifiers() {
        ("foo.pyi", Some("python")),
        ("foo.ts", Some("typescript")),
        ("foo.tsx", Some("typescript")),
+        ("foo.mts", Some("typescript")),  // ESM TS — same grammar
+        ("foo.cts", Some("typescript")),  // CommonJS TS — same grammar
        ("foo.js", Some("javascript")),
        ("foo.mjs", Some("javascript")),
        ("foo.cjs", Some("javascript")),
--- a/crates/kebab-search/src/lexical.rs
+++ b/crates/kebab-search/src/lexical.rs
@@ -346,6 +346,34 @@ fn run_query(
        }
    }

+    // p10-1A-1 fix (dogfood-discovered 2026-05-20): code_lang filter
+    // (IN-list on metadata_json.$.code_lang). Empty Vec = no filter.
+    if !filters.code_lang.is_empty() {
+        let placeholders = std::iter::repeat_n("?", filters.code_lang.len())
+            .collect::<Vec<_>>()
+            .join(",");
+        sql.push_str(&format!(
+            " AND json_extract(d.metadata_json, '$.code_lang') IN ({placeholders})"
+        ));
+        for lang in &filters.code_lang {
+            params.push(Box::new(lang.clone()));
+        }
+    }
+
+    // p10-1A-1 fix (dogfood-discovered 2026-05-20): repo filter
+    // (IN-list on metadata_json.$.repo). Empty Vec = no filter.
+    if !filters.repo.is_empty() {
+        let placeholders = std::iter::repeat_n("?", filters.repo.len())
+            .collect::<Vec<_>>()
+            .join(",");
+        sql.push_str(&format!(
+            " AND json_extract(d.metadata_json, '$.repo') IN ({placeholders})"
+        ));
+        for repo in &filters.repo {
+            params.push(Box::new(repo.clone()));
+        }
+    }
+
    // p9-fb-36: ingested_after filter.
    // `documents.updated_at` is RFC3339 stored as TEXT (always UTC `Z` per
    // fb-32 ingest path), so lexicographic >= compare is correct — but only
--- a/crates/kebab-search/tests/lexical.rs
+++ b/crates/kebab-search/tests/lexical.rs
@@ -785,6 +785,19 @@ impl TestEnv {
        body: &str,
        media: MediaType,
        updated_at: OffsetDateTime,
+    ) -> DocumentId {
+        self.insert_doc_full_with_metadata(path, body, media, updated_at, "{}")
+    }
+
+    /// Like `insert_doc_full` but accepts an explicit `metadata_json` string
+    /// so p10-1A-1 filter tests can set `metadata.code_lang` / `metadata.repo`.
+    fn insert_doc_full_with_metadata(
+        &self,
+        path: &str,
+        body: &str,
+        media: MediaType,
+        updated_at: OffsetDateTime,
+        metadata_json: &str,
    ) -> DocumentId {
        use time::format_description::well_known::Rfc3339;
        let doc_id = self.next_id("doc");
@@ -810,10 +823,10 @@ impl TestEnv {
                source_type, trust_level, parser_version,
                doc_version, schema_version, metadata_json,
                provenance_json, created_at, updated_at
-            ) VALUES (?, ?, ?, NULL, 'en', 'markdown', 'primary', 'pv1', 1, 1,
-                      '{}', '{\"events\":[]}',
+            ) VALUES (?, ?, ?, NULL, 'en', 'code', 'primary', 'pv1', 1, 1,
+                      ?, '{\"events\":[]}',
                      '2024-01-01T00:00:00Z', ?)",
-            rusqlite::params![doc_id, asset_id, path, updated_at_str],
+            rusqlite::params![doc_id, asset_id, path, metadata_json, updated_at_str],
        )
        .expect("insert document");

@@ -834,6 +847,21 @@ impl TestEnv {
        DocumentId(doc_id)
    }

+    /// Insert a code doc with explicit `code_lang` and optional `repo` in metadata.
+    fn insert_code_doc(&self, path: &str, body: &str, code_lang: &str, repo: Option<&str>) -> DocumentId {
+        let metadata_json = match repo {
+            Some(r) => format!(r#"{{"code_lang":"{code_lang}","repo":"{r}"}}"#),
+            None => format!(r#"{{"code_lang":"{code_lang}"}}"#),
+        };
+        self.insert_doc_full_with_metadata(
+            path,
+            body,
+            MediaType::Markdown,
+            OffsetDateTime::now_utc(),
+            &metadata_json,
+        )
+    }
+
    fn run_search(&self, query: &str, filters: &SearchFilters) -> Vec<SearchHit> {
        let r = self.inner.retriever();
        let q = SearchQuery {
@@ -934,6 +962,52 @@ fn lexical_empty_filters_match_default_behavior() {
    assert!(!with_default.is_empty());
 }

+// ── p10-1A-1 filter tests ────────────────────────────────────────────────
+
+#[test]
+fn lexical_filter_by_code_lang() {
+    // Three docs: python code, rust code, markdown (no code_lang).
+    // Filter code_lang=["python"] → only the python doc should match.
+    let env = TestEnv::new();
+    env.insert_code_doc("src/main.py", "AsyncClient session", "python", None);
+    env.insert_code_doc("src/lib.rs", "AsyncClient session", "rust", None);
+    env.insert_doc("docs/guide.md", "AsyncClient session");
+
+    let filters = SearchFilters {
+        code_lang: vec!["python".to_string()],
+        ..Default::default()
+    };
+    let hits = env.run_search("AsyncClient", &filters);
+    assert_eq!(hits.len(), 1, "only python doc should match code_lang filter");
+    assert!(
+        hits[0].doc_path.0.ends_with(".py"),
+        "expected python path, got: {}",
+        hits[0].doc_path.0
+    );
+}
+
+#[test]
+fn lexical_filter_by_repo() {
+    // Three docs: one in repo "httpx", one in repo "requests", one with no repo.
+    // Filter repo=["httpx"] → only the httpx doc should match.
+    let env = TestEnv::new();
+    env.insert_code_doc("httpx/client.py", "session send request", "python", Some("httpx"));
+    env.insert_code_doc("requests/api.py", "session send request", "python", Some("requests"));
+    env.insert_code_doc("standalone.py", "session send request", "python", None);
+
+    let filters = SearchFilters {
+        repo: vec!["httpx".to_string()],
+        ..Default::default()
+    };
+    let hits = env.run_search("session", &filters);
+    assert_eq!(hits.len(), 1, "only httpx doc should match repo filter");
+    assert!(
+        hits[0].doc_path.0.starts_with("httpx/"),
+        "expected httpx path, got: {}",
+        hits[0].doc_path.0
+    );
+}
+
 #[test]
 fn lexical_snapshot_run_1() {
    // Pinned snapshot. A small, deterministic corpus; the JSON shape of
--- a/crates/kebab-source-fs/Cargo.toml
+++ b/crates/kebab-source-fs/Cargo.toml
@@ -18,6 +18,7 @@ blake3       = { workspace = true }
 tracing      = { workspace = true }
 walkdir      = "2"
 ignore       = "0.4"
+globset      = "0.4"

 [dev-dependencies]
 serde_json   = { workspace = true }
--- a/crates/kebab-source-fs/src/connector.rs
+++ b/crates/kebab-source-fs/src/connector.rs
@@ -86,7 +86,7 @@ impl FsSourceConnector {
        excludes.extend(scope.exclude.iter().cloned());
        let kbignore = read_kbignore(&root)?;

-        let overrides = build_overrides(&root, &excludes, &kbignore)?;
+        let overrides = build_overrides(&root, &excludes, &kbignore, &scope.include)?;
        Ok((root, overrides))
    }

@@ -103,8 +103,6 @@ impl FsSourceConnector {
    ) -> Result<(Vec<RawAsset>, FsScanSkips)> {
        let (root, overrides) = self.resolve_scan_params(scope)?;

-        log_scope_include_warning(scope);
-
        let (files, skipped_entries) = walk_files_with_skips(&root, &overrides)?;

        // Accumulate per-category skip counts and sample paths.
@@ -284,14 +282,6 @@ fn build_assets(
    Ok(assets)
 }

-fn log_scope_include_warning(scope: &SourceScope) {
-    if !scope.include.is_empty() {
-        tracing::debug!(
-            count = scope.include.len(),
-            "FsSourceConnector ignores scope.include — handled by extractor router"
-        );
-    }
-}

 impl SourceConnector for FsSourceConnector {
    fn scan(&self, scope: &SourceScope) -> Result<Vec<RawAsset>> {
--- a/crates/kebab-source-fs/src/media.rs
+++ b/crates/kebab-source-fs/src/media.rs
@@ -19,7 +19,9 @@ pub(crate) fn media_type_for(path: &Path) -> MediaType {
        .unwrap_or_default();

    match ext.as_str() {
-        "md" => MediaType::Markdown,
+        // Markdown + MDX (markdown + JSX, treated as plain markdown — the
+        // JSX islands are folded into raw passthrough by the md parser).
+        "md" | "mdx" => MediaType::Markdown,
        "pdf" => MediaType::Pdf,

        "png" => MediaType::Image(ImageType::Png),
@@ -40,7 +42,8 @@ pub(crate) fn media_type_for(path: &Path) -> MediaType {

        // p10-1B: Python / TS / JS AST chunkers active.
        "py" | "pyi"               => MediaType::Code("python".into()),
-        "ts" | "tsx"               => MediaType::Code("typescript".into()),
+        // .mts / .cts are TypeScript ESM / CommonJS variants — same grammar.
+        "ts" | "tsx" | "mts" | "cts" => MediaType::Code("typescript".into()),
        "js" | "mjs" | "cjs" | "jsx" => MediaType::Code("javascript".into()),

        // Empty string (no extension) and any other extension: bucket as
@@ -102,6 +105,20 @@ mod tests {
        assert_eq!(media_type_for(Path::new("a/b.rs")),    MediaType::Code("rust".into()));
    }

+    #[test]
+    fn ts_variants_mts_cts() {
+        // .mts / .cts are TypeScript ESM / CommonJS — same grammar as .ts.
+        assert_eq!(media_type_for(Path::new("a/b.mts")), MediaType::Code("typescript".into()));
+        assert_eq!(media_type_for(Path::new("a/b.cts")), MediaType::Code("typescript".into()));
+    }
+
+    #[test]
+    fn mdx_routes_to_markdown() {
+        // MDX is markdown with JSX islands; the md parser folds the JSX
+        // through as raw passthrough.
+        assert_eq!(media_type_for(Path::new("docs/page.mdx")), MediaType::Markdown);
+    }
+
    #[test]
    fn unknown_and_missing_extension() {
        assert_eq!(
--- a/crates/kebab-source-fs/src/walker.rs
+++ b/crates/kebab-source-fs/src/walker.rs
@@ -44,6 +44,7 @@ use std::collections::HashSet;
 use std::path::{Path, PathBuf};

 use anyhow::{Context, Result};
+use globset::{GlobBuilder, GlobSet, GlobSetBuilder};
 use ignore::overrides::{Override, OverrideBuilder};
 use walkdir::WalkDir;

@@ -69,6 +70,11 @@ const DEFAULT_EXCLUDES: &[&str] = &[
 ///
 /// `default_and_config` covers DEFAULT_EXCLUDES + `config.workspace.exclude`
 /// — these do NOT map to any of the three named `IngestReport` counters.
+///
+/// `include` is the compiled `scope.include` allow-list. When the set is
+/// empty (no patterns) every file passes; when non-empty a file must match
+/// at least one pattern to be accepted (directories always pass, so the
+/// walker can still descend into them).
 pub(crate) struct WalkOverrides {
    /// Merged matcher — same as today's `Override`; used for the walk decision.
    pub combined: Override,
@@ -78,6 +84,8 @@ pub(crate) struct WalkOverrides {
    pub kebabignore: Override,
    /// Matcher built from `kebab_parse_code::BUILTIN_BLACKLIST` only.
    pub builtin: Override,
+    /// Compiled allow-list from `scope.include`. Empty set = pass all.
+    pub include: GlobSet,
 }

 /// Skip attribution category. Used by the connector when counting per-source
@@ -161,10 +169,15 @@ fn build_single_matcher_owned(root: &Path, patterns: &[String]) -> Result<Overri
 /// The three per-source matchers (`gitignore`, `kebabignore`, `builtin`) are
 /// built in addition to the combined one so the connector can attribute skips
 /// to the correct `IngestReport` counter without a second walker pass.
+///
+/// `include_patterns` (from `scope.include`) are compiled into an allow-list
+/// `GlobSet`. Empty slice → pass-all (backward-compat); non-empty → file
+/// must match at least one pattern to be accepted.
 pub(crate) fn build_overrides(
    root: &Path,
    config_exclude: &[String],
    kbignore_patterns: &[String],
+    include_patterns: &[String],
 ) -> Result<WalkOverrides> {
    let gitignore_patterns = read_gitignore(root)?;

@@ -209,14 +222,41 @@ pub(crate) fn build_overrides(
        .build()
        .context("failed to compile combined override set")?;

+    // Allow-list GlobSet: an empty slice yields an empty set, which the
+    // walker treats as "pass all"; a non-empty set requires a file to match
+    // at least one glob. Matching stays case-sensitive (the `GlobBuilder`
+    // default), consistent with the OverrideBuilder exclude patterns above.
+    let include = build_include_globset(include_patterns)?;
+
    Ok(WalkOverrides {
        combined,
        gitignore,
        kebabignore,
        builtin,
+        include,
    })
 }

+/// Compile `scope.include` patterns into a `GlobSet` allow-list.
+///
+/// Each pattern uses `GlobBuilder` with `literal_separator = true` so that
+/// `**` can cross directory boundaries while `*` stops at `/`, matching the
+/// gitignore convention used throughout the rest of the walker.
+///
+/// An empty slice produces an empty `GlobSet` — callers interpret that as
+/// "pass all files" (no allow-list constraint).
+fn build_include_globset(patterns: &[String]) -> Result<GlobSet> {
+    let mut builder = GlobSetBuilder::new();
+    for pat in patterns {
+        let glob = GlobBuilder::new(pat)
+            .literal_separator(true)
+            .build()
+            .with_context(|| format!("invalid include pattern: {pat}"))?;
+        builder.add(glob);
+    }
+    builder.build().context("failed to compile include globset")
+}
+
 /// Classify why a path was excluded, using per-source matchers in spec §5.2
 /// priority order: built-in > gitignore > kebabignore > other.
 ///
@@ -391,6 +431,13 @@ pub(crate) fn walk_files_with_skips(
        }

        if entry.file_type().is_file() {
+            // Apply include allow-list: if non-empty, the file's path
+            // relative to root must match at least one pattern.
+            if !overrides.include.is_empty() && !overrides.include.is_match(rel) {
+                // Not in the allow-list — silently drop (no skip counter;
+                // the include filter is not a "skip" source in IngestReport).
+                continue;
+            }
            accepted.push(path.to_path_buf());
        }
    }
@@ -406,7 +453,7 @@ mod tests {
    #[test]
    fn empty_inputs_compile_into_an_override() {
        let dir = tempfile::tempdir().unwrap();
-        let ov = build_overrides(dir.path(), &[], &[]).unwrap();
+        let ov = build_overrides(dir.path(), &[], &[], &[]).unwrap();
        // Default-excludes only; non-special files should not match.
        let m = ov.combined.matched(Path::new("notes/alpha.md"), false);
        assert!(!m.is_ignore());
@@ -415,7 +462,7 @@ mod tests {
    #[test]
    fn default_excludes_ds_store_and_resource_forks() {
        let dir = tempfile::tempdir().unwrap();
-        let ov = build_overrides(dir.path(), &[], &[]).unwrap();
+        let ov = build_overrides(dir.path(), &[], &[], &[]).unwrap();
        assert!(ov.combined.matched(Path::new(".DS_Store"), false).is_ignore());
        assert!(
            ov.combined.matched(Path::new("notes/.DS_Store"), false).is_ignore()
@@ -433,6 +480,7 @@ mod tests {
            dir.path(),
            &["*.tmp".to_string(), "node_modules/**".to_string()],
            &[],
+            &[],
        )
        .unwrap();
        assert!(ov.combined.matched(Path::new("a.tmp"), false).is_ignore());
@@ -452,6 +500,7 @@ mod tests {
            dir.path(),
            &["*.tmp".to_string()],
            &["secret/**".to_string()],
+            &[],
        )
        .unwrap();
        assert!(ov.combined.matched(Path::new("a.tmp"), false).is_ignore());
@@ -491,7 +540,7 @@ mod tests {
        fs::write(root.join("src/main.rs"), "x").unwrap();
        fs::write(root.join("node_modules/foo/bar.js"), "x").unwrap();

-        let overrides = build_overrides(root, &[], &[]).unwrap();
+        let overrides = build_overrides(root, &[], &[], &[]).unwrap();
        // Override::matched expects paths relative to the builder's root.
        let m_in = overrides.combined.matched(Path::new("src/main.rs"), false);
        let m_out = overrides.combined.matched(Path::new("node_modules/foo/bar.js"), false);
@@ -514,7 +563,7 @@ mod tests {
        fs::create_dir_all(root.join("ok")).unwrap();
        fs::write(root.join("ok/z.txt"), "z").unwrap();

-        let overrides = build_overrides(root, &[], &[]).unwrap();
+        let overrides = build_overrides(root, &[], &[], &[]).unwrap();
        // Override::matched expects paths relative to the builder's root.
        for blacklisted in [
            "target/x/y.txt",
@@ -544,7 +593,7 @@ mod tests {
        fs::create_dir_all(root.join("dist")).unwrap();
        fs::write(root.join("dist/bundle.js"), "x").unwrap();

-        let overrides = build_overrides(root, &[], &[]).unwrap();
+        let overrides = build_overrides(root, &[], &[], &[]).unwrap();
        assert!(overrides.combined.matched(Path::new("a.log"), false).is_ignore());
        assert!(overrides.combined.matched(Path::new("dist/bundle.js"), false).is_ignore());
        assert!(!overrides.combined.matched(Path::new("src/main.rs"), false).is_ignore());
@@ -562,7 +611,7 @@ mod tests {
        fs::write(root.join("src/main.rs"), "x").unwrap();

        // No .gitignore present — patterns from .gitignore should not affect overrides.
-        let overrides = build_overrides(root, &[], &[]).unwrap();
+        let overrides = build_overrides(root, &[], &[], &[]).unwrap();
        assert!(!overrides.combined.matched(Path::new("a.log"), false).is_ignore());
        assert!(!overrides.combined.matched(Path::new("src/main.rs"), false).is_ignore());
    }
@@ -577,7 +626,7 @@ mod tests {
        // semantics, but at minimum it must not produce double-`!` corruption.
        fs::write(root.join(".gitignore"), "!keep/\n").unwrap();
        // Just verify build_overrides doesn't error.
-        let result = build_overrides(root, &[], &[]);
+        let result = build_overrides(root, &[], &[], &[]);
        assert!(result.is_ok(), "should not error on negation pattern: {:?}", result.err());
    }

@@ -594,7 +643,7 @@ mod tests {
        // .gitignore entry. Builtin must win (priority order §5.2).
        fs::write(root.join(".gitignore"), "node_modules/\n").unwrap();

-        let ov = build_overrides(root, &[], &[]).unwrap();
+        let ov = build_overrides(root, &[], &[], &[]).unwrap();
        // node_modules/ dir itself
        let cat = classify_skip(Path::new("node_modules"), true, &ov);
        assert_eq!(cat, SkipCategory::BuiltinBlacklist, "builtin must have priority");
@@ -609,7 +658,7 @@ mod tests {
        let root = tmp.path();
        fs::write(root.join(".gitignore"), "*.log\n").unwrap();

-        let ov = build_overrides(root, &[], &[]).unwrap();
+        let ov = build_overrides(root, &[], &[], &[]).unwrap();
        let cat = classify_skip(Path::new("app.log"), false, &ov);
        assert_eq!(cat, SkipCategory::Gitignore);
    }
@@ -621,7 +670,7 @@ mod tests {
        let tmp = TempDir::new().unwrap();
        let root = tmp.path();

-        let ov = build_overrides(root, &[], &["*.secret".to_string()]).unwrap();
+        let ov = build_overrides(root, &[], &["*.secret".to_string()], &[]).unwrap();
        let cat = classify_skip(Path::new("creds.secret"), false, &ov);
        assert_eq!(cat, SkipCategory::Kebabignore);
    }
@@ -637,7 +686,7 @@ mod tests {
        fs::write(root.join("ok.md"), "# ok").unwrap();
        fs::write(root.join("skipme.log"), "x").unwrap();

-        let ov = build_overrides(root, &[], &[]).unwrap();
+        let ov = build_overrides(root, &[], &[], &[]).unwrap();
        let (accepted, skipped_entries) = walk_files_with_skips(root, &ov).unwrap();

        let accepted_names: Vec<_> = accepted
@@ -677,7 +726,7 @@ mod tests {
        fs::write(root.join("node_modules/foo/bar.js"), "x").unwrap();
        fs::write(root.join("ok.md"), "# ok").unwrap();

-        let ov = build_overrides(root, &[], &[]).unwrap();
+        let ov = build_overrides(root, &[], &[], &[]).unwrap();
        let (accepted, skipped_entries) = walk_files_with_skips(root, &ov).unwrap();

        let accepted_names: Vec<_> = accepted
--- a/crates/kebab-source-fs/tests/include_allowlist.rs
+++ b/crates/kebab-source-fs/tests/include_allowlist.rs
@@ -0,0 +1,111 @@
+//! Integration test: `scope.include` enforces an allow-list.
+//!
+//! Semantics (gitignore-style glob syntax):
+//!   - `include` is an empty Vec → all files pass through (backward-compat).
+//!   - `include` is non-empty → only files matching at least one pattern
+//!     are accepted. `exclude` rules still apply: the two are ANDed.
+//!
+//! Layout (built per-test in a TempDir):
+//!   root/
+//!   ├── a.md
+//!   ├── b.py
+//!   ├── c.png
+//!   └── d.pdf
+
+use std::fs;
+
+use kebab_config::Config;
+use kebab_core::{SourceConnector, SourceScope};
+use kebab_source_fs::FsSourceConnector;
+
+fn cfg_with_root(root: &str) -> Config {
+    let mut c = Config::defaults();
+    c.workspace.root = root.to_string();
+    c.workspace.exclude.clear();
+    // Disable size / generated caps so small test files always pass.
+    c.ingest.code.max_file_bytes = u64::MAX;
+    c.ingest.code.max_file_lines = u32::MAX;
+    c.ingest.code.skip_generated_header = false;
+    c
+}
+
+fn setup_mixed_dir() -> tempfile::TempDir {
+    let dir = tempfile::tempdir().unwrap();
+    let root = dir.path();
+    fs::write(root.join("a.md"), b"md").unwrap();
+    fs::write(root.join("b.py"), b"py").unwrap();
+    fs::write(root.join("c.png"), b"\x89PNG").unwrap();
+    fs::write(root.join("d.pdf"), b"%PDF").unwrap();
+    dir
+}
+
+/// Empty include → all 4 files pass (backward-compat).
+#[test]
+fn include_empty_accepts_all_files() {
+    let dir = setup_mixed_dir();
+    let conn = FsSourceConnector::new(&cfg_with_root(dir.path().to_str().unwrap())).unwrap();
+    let scope = SourceScope {
+        include: vec![],
+        ..SourceScope::default()
+    };
+    let assets = conn.scan(&scope).unwrap();
+    let names: Vec<_> = assets.iter().map(|a| a.workspace_path.0.clone()).collect();
+    assert!(names.contains(&"a.md".to_string()), "a.md missing; got: {names:?}");
+    assert!(names.contains(&"b.py".to_string()), "b.py missing; got: {names:?}");
+    assert!(names.contains(&"c.png".to_string()), "c.png missing; got: {names:?}");
+    assert!(names.contains(&"d.pdf".to_string()), "d.pdf missing; got: {names:?}");
+    assert_eq!(names.len(), 4, "expected exactly 4 files; got: {names:?}");
+}
+
+/// Non-empty include → only md + py come back; png + pdf are excluded.
+#[test]
+fn include_nonempty_is_allowlist() {
+    let dir = setup_mixed_dir();
+    let conn = FsSourceConnector::new(&cfg_with_root(dir.path().to_str().unwrap())).unwrap();
+    let scope = SourceScope {
+        include: vec!["**/*.md".to_string(), "**/*.py".to_string()],
+        ..SourceScope::default()
+    };
+    let assets = conn.scan(&scope).unwrap();
+    let names: Vec<_> = assets.iter().map(|a| a.workspace_path.0.clone()).collect();
+    assert!(names.contains(&"a.md".to_string()), "a.md should be accepted; got: {names:?}");
+    assert!(names.contains(&"b.py".to_string()), "b.py should be accepted; got: {names:?}");
+    assert!(
+        !names.contains(&"c.png".to_string()),
+        "c.png must be rejected by include allowlist; got: {names:?}"
+    );
+    assert!(
+        !names.contains(&"d.pdf".to_string()),
+        "d.pdf must be rejected by include allowlist; got: {names:?}"
+    );
+    assert_eq!(names.len(), 2, "expected exactly 2 files; got: {names:?}");
+}
+
+/// include + exclude are ANDed: a file matching include but also matching
+/// exclude must be rejected.
+#[test]
+fn include_and_exclude_are_anded() {
+    let dir = tempfile::tempdir().unwrap();
+    let root = dir.path();
+    fs::write(root.join("keep.md"), b"keep").unwrap();
+    fs::write(root.join("drop.md"), b"drop").unwrap();
+    fs::write(root.join("other.py"), b"py").unwrap();
+
+    let conn = FsSourceConnector::new(&cfg_with_root(root.to_str().unwrap())).unwrap();
+    let scope = SourceScope {
+        include: vec!["**/*.md".to_string()],
+        exclude: vec!["drop.md".to_string()],
+        ..SourceScope::default()
+    };
+    let assets = conn.scan(&scope).unwrap();
+    let names: Vec<_> = assets.iter().map(|a| a.workspace_path.0.clone()).collect();
+    assert!(names.contains(&"keep.md".to_string()), "keep.md should be accepted; got: {names:?}");
+    assert!(
+        !names.contains(&"drop.md".to_string()),
+        "drop.md should be excluded (matched exclude); got: {names:?}"
+    );
+    assert!(
+        !names.contains(&"other.py".to_string()),
+        "other.py should be excluded (not in include); got: {names:?}"
+    );
+}
--- a/crates/kebab-store-sqlite/snapshots/ingest_report.snapshot.json
+++ b/crates/kebab-store-sqlite/snapshots/ingest_report.snapshot.json
@@ -56,5 +56,6 @@
  "skipped_kebabignore": 0,
  "skipped_size_exceeded": 0,
  "unchanged": 0,
+  "purged_deleted_files": 0,
  "updated": 1
 }
--- a/crates/kebab-store-sqlite/src/documents.rs
+++ b/crates/kebab-store-sqlite/src/documents.rs
@@ -286,6 +286,88 @@ impl kebab_core::DocumentStore for SqliteStore {
        }
    }

+    fn get_document_by_workspace_path(
+        &self,
+        path: &kebab_core::WorkspacePath,
+    ) -> Result<Option<kebab_core::CanonicalDocument>> {
+        let conn = self.lock_conn();
+        let row: Option<DocumentRow> = conn
+            .query_row(
+                "SELECT
+                    doc_id, asset_id, workspace_path, title, lang,
+                    source_type, trust_level, parser_version,
+                    doc_version, schema_version, metadata_json,
+                    provenance_json, created_at, updated_at,
+                    last_chunker_version, last_embedding_version
+                FROM documents WHERE workspace_path = ?",
+                params![path.0],
+                document_row_from_sql,
+            )
+            .map(Some)
+            .or_else(rows_optional)
+            .map_err(StoreError::from)?;
+        let Some(row) = row else { return Ok(None) };
+
+        let doc_id = kebab_core::DocumentId(row.doc_id.clone());
+        let mut blocks_stmt = conn
+            .prepare(
+                "SELECT payload_json FROM blocks
+                 WHERE doc_id = ? ORDER BY ordinal ASC",
+            )
+            .map_err(StoreError::from)?;
+        let block_rows = blocks_stmt
+            .query_map(params![row.doc_id], |r| {
+                let payload_json: String = r.get(0)?;
+                Ok(payload_json)
+            })
+            .map_err(StoreError::from)?;
+        let mut blocks: Vec<kebab_core::Block> = Vec::new();
+        for block_row in block_rows {
+            let payload_json = block_row.map_err(StoreError::from)?;
+            let block: kebab_core::Block = serde_json::from_str(&payload_json)
+                .context("deserialize block payload_json")?;
+            blocks.push(block);
+        }
+
+        let metadata: kebab_core::Metadata = serde_json::from_str(&row.metadata_json)
+            .context("deserialize metadata_json")?;
+        let provenance: kebab_core::Provenance =
+            serde_json::from_str(&row.provenance_json)
+                .context("deserialize provenance_json")?;
+
+        Ok(Some(kebab_core::CanonicalDocument {
+            doc_id,
+            source_asset_id: kebab_core::AssetId(row.asset_id),
+            workspace_path: kebab_core::WorkspacePath(row.workspace_path),
+            title: row.title.unwrap_or_default(),
+            lang: kebab_core::Lang(row.lang.unwrap_or_default()),
+            blocks,
+            metadata,
+            provenance,
+            parser_version: kebab_core::ParserVersion(row.parser_version),
+            schema_version: row.schema_version as u32,
+            doc_version: row.doc_version as u32,
+            last_chunker_version: row.last_chunker_version.map(kebab_core::ChunkerVersion),
+            last_embedding_version: row.last_embedding_version.map(kebab_core::EmbeddingVersion),
+        }))
+    }
+
+    fn all_workspace_paths(&self) -> Result<Vec<kebab_core::WorkspacePath>> {
+        let conn = self.lock_conn();
+        let mut stmt = conn
+            .prepare("SELECT workspace_path FROM documents")
+            .map_err(StoreError::from)?;
+        let rows = stmt
+            .query_map([], |r| r.get::<_, String>(0))
+            .map_err(StoreError::from)?;
+        let mut out = Vec::new();
+        for row in rows {
+            let path = row.map_err(StoreError::from)?;
+            out.push(kebab_core::WorkspacePath(path));
+        }
+        Ok(out)
+    }
+
    fn list_documents(
        &self,
        filter: &kebab_core::DocFilter,
--- a/crates/kebab-store-sqlite/src/filters.rs
+++ b/crates/kebab-store-sqlite/src/filters.rs
@@ -153,6 +153,34 @@ impl SqliteStore {
            }
        }

+        // p10-1A-1 fix (dogfood-discovered 2026-05-20): code_lang filter
+        // (IN-list on metadata_json.$.code_lang). Empty Vec = no filter.
+        if !filters.code_lang.is_empty() {
+            let placeholders = std::iter::repeat_n("?", filters.code_lang.len())
+                .collect::<Vec<_>>()
+                .join(",");
+            sql.push_str(&format!(
+                " AND json_extract(d.metadata_json, '$.code_lang') IN ({placeholders})"
+            ));
+            for lang in &filters.code_lang {
+                bind.push(Box::new(lang.clone()));
+            }
+        }
+
+        // p10-1A-1 fix (dogfood-discovered 2026-05-20): repo filter
+        // (IN-list on metadata_json.$.repo). Empty Vec = no filter.
+        if !filters.repo.is_empty() {
+            let placeholders = std::iter::repeat_n("?", filters.repo.len())
+                .collect::<Vec<_>>()
+                .join(",");
+            sql.push_str(&format!(
+                " AND json_extract(d.metadata_json, '$.repo') IN ({placeholders})"
+            ));
+            for repo in &filters.repo {
+                bind.push(Box::new(repo.clone()));
+            }
+        }
+
        // p9-fb-36: ingested_after filter.
        // `documents.updated_at` is RFC3339 TEXT (UTC `Z` per fb-32);
        // lexicographic >= compare is correct — but only when the filter
@@ -408,6 +436,78 @@ mod tests {
            .unwrap();
    }

+    /// Variant of `seed_committed_full` that additionally accepts a
+    /// `metadata_json` string so p10-1A-1 filter tests can set
+    /// `metadata.code_lang` / `metadata.repo` without going through the
+    /// full ingest pipeline.
+    #[allow(clippy::too_many_arguments)]
+    fn seed_committed_with_metadata(
+        store: &SqliteStore,
+        chunk_id: &str,
+        doc_id: &str,
+        workspace_path: &str,
+        media_type_json: &str,
+        metadata_json: &str,
+    ) {
+        let asset_id = format!("a{}", &doc_id[..31]);
+        {
+            let conn = store.lock_conn();
+            conn.execute(
+                "INSERT INTO assets (
+                    asset_id, source_uri, workspace_path, media_type, byte_len,
+                    checksum, storage_kind, storage_path, discovered_at
+                 ) VALUES (?, ?, ?, ?, 0, 'deadbeefdeadbeefdeadbeefdeadbeef',
+                           'reference', ?, '1970-01-01T00:00:00Z')",
+                params![
+                    asset_id,
+                    format!("file://{workspace_path}"),
+                    workspace_path,
+                    media_type_json,
+                    workspace_path,
+                ],
+            )
+            .unwrap();
+            conn.execute(
+                "INSERT INTO documents (
+                    doc_id, asset_id, workspace_path, title, lang, source_type,
+                    trust_level, parser_version, doc_version, schema_version,
+                    metadata_json, provenance_json, created_at, updated_at
+                 ) VALUES (?, ?, ?, NULL, 'en', 'code', 'primary', 'v1', 1, 1,
+                           ?, '{}', '1970-01-01T00:00:00Z', '1970-01-01T00:00:00Z')",
+                params![doc_id, asset_id, workspace_path, metadata_json],
+            )
+            .unwrap();
+            conn.execute(
+                "INSERT INTO chunks (
+                    chunk_id, doc_id, text, heading_path_json, section_label,
+                    source_spans_json, token_estimate, chunker_version,
+                    policy_hash, block_ids_json, created_at
+                 ) VALUES (?, ?, 'code snippet', '[]', NULL, '[]', 1, 'v1', 'h', '[]',
+                           '1970-01-01T00:00:00Z')",
+                params![chunk_id, doc_id],
+            )
+            .unwrap();
+        }
+
+        let embed_row = EmbeddingRecordRow {
+            embedding_id: format!("e{}", &chunk_id[..31]),
+            chunk_id: chunk_id.to_string(),
+            model_id: "m".to_string(),
+            model_version: "v1".to_string(),
+            dimensions: 4,
+            lance_table: "t".to_string(),
+            created_at: OffsetDateTime::UNIX_EPOCH,
+        };
+        store
+            .put_embedding_records_pending(std::slice::from_ref(&embed_row))
+            .unwrap();
+        store
+            .mark_embedding_records_committed(std::slice::from_ref(
+                &embed_row.embedding_id,
+            ))
+            .unwrap();
+    }
+
    fn cid(s: &str) -> ChunkId {
        ChunkId(s.to_string())
    }
@@ -671,6 +771,78 @@ mod tests {
        assert_eq!(out, vec![cid(c1)], "doc_id filter must scope to the target doc only");
    }

+    // ── p10-1A-1 new filter arms ─────────────────────────────────────────
+
+    #[test]
+    fn filter_chunks_code_lang_keeps_matching_lang() {
+        // c1 = python, c2 = rust, c3 = markdown (no code_lang).
+        // Filter code_lang=["python"] → only c1 survives.
+        let tmp = TempDir::new().unwrap();
+        let store = open_store(&tmp);
+        let c1 = "11111111111111111111111111111111";
+        let c2 = "22222222222222222222222222222222";
+        let c3 = "33333333333333333333333333333333";
+        seed_committed_with_metadata(
+            &store, c1, "d1d1d1d1d1d1d1d1d1d1d1d1d1d1d1d1",
+            "src/main.py", r#""code""#,
+            r#"{"code_lang":"python"}"#,
+        );
+        seed_committed_with_metadata(
+            &store, c2, "d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2",
+            "src/lib.rs", r#""code""#,
+            r#"{"code_lang":"rust"}"#,
+        );
+        seed_committed_with_metadata(
+            &store, c3, "d3d3d3d3d3d3d3d3d3d3d3d3d3d3d3d3",
+            "README.md", r#""markdown""#,
+            r#"{}"#,
+        );
+
+        let f = SearchFilters {
+            code_lang: vec!["python".to_string()],
+            ..Default::default()
+        };
+        let out = store
+            .filter_chunks(&[cid(c1), cid(c2), cid(c3)], &f)
+            .unwrap();
+        assert_eq!(out, vec![cid(c1)], "only python chunk should survive code_lang filter");
+    }
+
+    #[test]
+    fn filter_chunks_repo_keeps_matching_repo() {
+        // c1 = repo "httpx", c2 = repo "requests", c3 = no repo.
+        // Filter repo=["httpx"] → only c1 survives.
+        let tmp = TempDir::new().unwrap();
+        let store = open_store(&tmp);
+        let c1 = "11111111111111111111111111111111";
+        let c2 = "22222222222222222222222222222222";
+        let c3 = "33333333333333333333333333333333";
+        seed_committed_with_metadata(
+            &store, c1, "d1d1d1d1d1d1d1d1d1d1d1d1d1d1d1d1",
+            "httpx/client.py", r#""code""#,
+            r#"{"repo":"httpx","code_lang":"python"}"#,
+        );
+        seed_committed_with_metadata(
+            &store, c2, "d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2",
+            "requests/api.py", r#""code""#,
+            r#"{"repo":"requests","code_lang":"python"}"#,
+        );
+        seed_committed_with_metadata(
+            &store, c3, "d3d3d3d3d3d3d3d3d3d3d3d3d3d3d3d3",
+            "standalone.py", r#""code""#,
+            r#"{"code_lang":"python"}"#,
+        );
+
+        let f = SearchFilters {
+            repo: vec!["httpx".to_string()],
+            ..Default::default()
+        };
+        let out = store
+            .filter_chunks(&[cid(c1), cid(c2), cid(c3)], &f)
+            .unwrap();
+        assert_eq!(out, vec![cid(c1)], "only httpx chunk should survive repo filter");
+    }
+
    #[test]
    fn filter_chunks_ingested_after_non_utc_offset_compares_as_instant() {
        // Regression test for the non-UTC offset lex-compare bug.
--- a/crates/kebab-store-sqlite/src/lib.rs
+++ b/crates/kebab-store-sqlite/src/lib.rs
@@ -35,4 +35,4 @@ pub use error::StoreError;
 pub use eval::{EvalQueryResultRecord, EvalRunRecord, EvalRunRow};
 pub use fts::rebuild_chunks_fts;
 pub use jobs::IngestRunRow;
-pub use store::{CountSummary, NotIndexed, SqliteStore};
+pub use store::{CountSummary, NotIndexed, SqliteStore, purge_deleted_workspace_path};
--- a/crates/kebab-store-sqlite/src/store.rs
+++ b/crates/kebab-store-sqlite/src/store.rs
@@ -540,6 +540,114 @@ pub(crate) fn purge_orphan_at_workspace_path(
    Ok(())
 }

+/// Purge all stored data for a document whose on-disk file has been
+/// deleted (as opposed to content-changed, which is handled by
+/// `purge_orphan_at_workspace_path`).
+///
+/// Returns the `chunk_id`s that were associated with the document so
+/// the caller can issue a matching `VectorStore::delete_by_chunk_ids`
+/// on the LanceDB side.
+///
+/// Deletion order:
+/// 1. Collect chunk_ids (before cascade removes them).
+/// 2. DELETE the `documents` row → CASCADE clears `blocks`, `chunks`,
+///    `embedding_records`.
+/// 3. DELETE the `assets` row **only if no other document still
+///    references it** (twin-file protection — `assets` can be shared
+///    across identical-content files via the blake3 PK).
+/// 4. If the asset was `storage_kind = 'copied'`, best-effort delete
+///    the on-disk byte file at `storage_path`.
+///
+/// Returns `Ok(vec![])` when no document exists at `workspace_path`
+/// (idempotent — caller doesn't need to pre-check).
+pub fn purge_deleted_workspace_path(
+    store: &SqliteStore,
+    workspace_path: &kebab_core::WorkspacePath,
+) -> anyhow::Result<Vec<kebab_core::ChunkId>> {
+    let conn = store.lock_conn();
+
+    // Look up the document + its asset_id.
+    let doc_row: Option<(String, String)> = conn
+        .query_row(
+            "SELECT doc_id, asset_id FROM documents WHERE workspace_path = ?",
+            rusqlite::params![workspace_path.0],
+            |r| Ok((r.get(0)?, r.get(1)?)),
+        )
+        .optional()
+        .map_err(StoreError::from)?;
+
+    let Some((doc_id, asset_id)) = doc_row else {
+        return Ok(Vec::new());
+    };
+
+    // 1. Collect chunk_ids before CASCADE removes them.
+    let mut stmt = conn
+        .prepare("SELECT chunk_id FROM chunks WHERE doc_id = ?")
+        .map_err(StoreError::from)?;
+    let rows = stmt
+        .query_map(rusqlite::params![doc_id], |r| r.get::<_, String>(0))
+        .map_err(StoreError::from)?;
+    let chunk_ids: Vec<kebab_core::ChunkId> = rows
+        .map(|r| r.map(kebab_core::ChunkId))
+        .collect::<rusqlite::Result<Vec<_>>>()
+        .map_err(StoreError::from)?;
+    drop(stmt);
+
+    // 2. DELETE the document row (CASCADE clears blocks / chunks /
+    //    embedding_records via the FK constraints in V001).
+    conn.execute(
+        "DELETE FROM documents WHERE doc_id = ?",
+        rusqlite::params![doc_id],
+    )
+    .map_err(StoreError::from)?;
+
+    // 3. Delete the asset row only when no other document still
+    //    references it (twin-file safety: two files with identical
+    //    bytes share a single asset row via the blake3 PK).
+    let remaining_refs: i64 = conn
+        .query_row(
+            "SELECT COUNT(*) FROM documents WHERE asset_id = ?",
+            rusqlite::params![asset_id],
+            |r| r.get(0),
+        )
+        .map_err(StoreError::from)?;
+
+    if remaining_refs == 0 {
+        // Capture storage details before deleting the row (needed for step 4).
+        let asset_storage: Option<(String, String)> = conn
+            .query_row(
+                "SELECT storage_kind, storage_path FROM assets WHERE asset_id = ?",
+                rusqlite::params![asset_id],
+                |r| Ok((r.get(0)?, r.get(1)?)),
+            )
+            .optional()
+            .map_err(StoreError::from)?;
+
+        conn.execute(
+            "DELETE FROM assets WHERE asset_id = ?",
+            rusqlite::params![asset_id],
+        )
+        .map_err(StoreError::from)?;
+
+        // 4. Best-effort: remove the on-disk copied asset file.
+        if let Some((storage_kind, storage_path)) = asset_storage {
+            if storage_kind == "copied" {
+                let _ = std::fs::remove_file(&storage_path);
+            }
+        }
+    }
+
+    tracing::debug!(
+        target: "kebab-store-sqlite",
+        workspace_path = %workspace_path.0,
+        doc_id = %doc_id,
+        chunk_count = chunk_ids.len(),
+        "purged deleted-file document from store"
+    );
+
+    Ok(chunk_ids)
+}
+
 /// UPSERT a row into `assets`. Used by both the `put_asset_with_bytes`
 /// path (which has bytes + computed `storage_kind/path`) and the
 /// `DocumentStore::put_asset` path (which only has the `RawAsset` and
@@ -701,6 +809,39 @@ impl SqliteStore {
        }
        Ok(out)
    }
+
+    /// p10-1A-2 follow-up (dogfooding 2026-05-20): per-repo doc count for
+    /// `schema.v1`.
+    ///
+    /// Reads `metadata_json->'$.repo'`, groups by the value, and skips rows
+    /// where `repo` is NULL (documents without an explicit repo tag).
+    /// Returns `BTreeMap<String, u32>` — key is the repo name as stored in
+    /// frontmatter, value is the doc count.
+    pub fn repo_breakdown(
+        &self,
+    ) -> anyhow::Result<std::collections::BTreeMap<String, u32>> {
+        use anyhow::Context;
+        let conn = self.read_conn();
+        let mut stmt = conn
+            .prepare(
+                "SELECT json_extract(metadata_json, '$.repo') AS rp, COUNT(*) \
+                 FROM documents \
+                 WHERE rp IS NOT NULL \
+                 GROUP BY rp",
+            )
+            .context("prepare repo_breakdown")?;
+        let rows = stmt
+            .query_map([], |r| {
+                Ok((r.get::<_, String>(0)?, r.get::<_, i64>(1)? as u32))
+            })
+            .context("query repo_breakdown")?;
+        let mut out = std::collections::BTreeMap::new();
+        for row in rows {
+            let (k, v) = row.context("read repo_breakdown row")?;
+            out.insert(k, v);
+        }
+        Ok(out)
+    }
 }

 /// Apply the design §5 / task-spec pragmas. Called once per connection.
@@ -817,5 +958,79 @@ mod tests {
        // only one key total
        assert_eq!(bd.len(), 1, "expected exactly 1 entry, got: {bd:?}");
    }
+
+    /// p10-1A-2 follow-up: `repo_breakdown` counts docs by
+    /// `metadata_json.repo`.
+    ///
+    /// Inserts:
+    /// - one doc with `repo = "my-repo"` → must appear with count 1
+    /// - one doc with `repo = null`       → must NOT appear (NULL skipped)
+    ///
+    /// Uses a side rusqlite connection that bypasses the `assets` FK via
+    /// `PRAGMA foreign_keys = OFF` so the test is self-contained.
+    #[test]
+    fn repo_breakdown_counts_by_repo() {
+        let (dir, store) = open_fresh_store();
+
+        let db_path = dir.path().join("kebab.sqlite");
+        let conn = rusqlite::Connection::open(&db_path).unwrap();
+        conn.pragma_update(None, "foreign_keys", "OFF").unwrap();
+
+        // Doc 1: doc with repo = "my-repo"
+        conn.execute(
+            "INSERT INTO documents (
+                doc_id, asset_id, workspace_path,
+                source_type, trust_level, parser_version,
+                doc_version, schema_version,
+                metadata_json, provenance_json,
+                created_at, updated_at
+            ) VALUES (
+                'doc-repo-1', 'asset-r1', 'my-repo/README.md',
+                'markdown', 'primary', 'test-v1',
+                1, 1,
+                '{\"repo\":\"my-repo\"}', '{}',
+                '2024-01-01T00:00:00Z', '2024-01-01T00:00:00Z'
+            )",
+            [],
+        )
+        .unwrap();
+
+        // Doc 2: doc whose repo is JSON null (json_extract yields SQL NULL)
+        conn.execute(
+            "INSERT INTO documents (
+                doc_id, asset_id, workspace_path,
+                source_type, trust_level, parser_version,
+                doc_version, schema_version,
+                metadata_json, provenance_json,
+                created_at, updated_at
+            ) VALUES (
+                'doc-norepo-1', 'asset-r2', 'standalone/notes.md',
+                'markdown', 'primary', 'test-v1',
+                1, 1,
+                '{\"repo\":null}', '{}',
+                '2024-01-01T00:00:00Z', '2024-01-01T00:00:00Z'
+            )",
+            [],
+        )
+        .unwrap();
+
+        drop(conn); // release side connection before querying via store
+
+        let bd = store.repo_breakdown().unwrap();
+
+        // "my-repo" must appear with count 1
+        assert_eq!(
+            bd.get("my-repo"),
+            Some(&1u32),
+            "expected my-repo=1 in repo_breakdown, got: {bd:?}"
+        );
+        // null repo must NOT appear as any key
+        assert!(
+            !bd.contains_key("null"),
+            "null repo must not appear in breakdown, got: {bd:?}"
+        );
+        // only one key total
+        assert_eq!(bd.len(), 1, "expected exactly 1 entry, got: {bd:?}");
+    }
 }
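
The purge steps in the `kebab-store-sqlite` hunk above (collect chunk ids, delete the document, delete the asset only when nothing else references it) can be sketched without SQLite. This is a minimal in-memory illustration, not the crate's implementation: `Store`, its fields, and `purge` are hypothetical names, the cascade is emulated by hand, and the best-effort removal of the copied file on disk is left out.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical in-memory stand-in for the `documents`, `chunks`, and
/// `assets` tables; the real store is SQLite.
#[derive(Default)]
struct Store {
    /// workspace_path -> (doc_id, asset_id)
    documents: BTreeMap<String, (String, String)>,
    /// chunk_id -> doc_id
    chunks: BTreeMap<String, String>,
    /// asset ids (content hashes); twin files share one entry
    assets: BTreeSet<String>,
}

impl Store {
    /// Removes the document stored at `workspace_path` and returns its chunk
    /// ids so the caller can delete the matching vectors. Unknown paths
    /// return an empty Vec.
    fn purge(&mut self, workspace_path: &str) -> Vec<String> {
        let Some((doc_id, asset_id)) = self.documents.get(workspace_path).cloned() else {
            return Vec::new();
        };

        // 1. Collect chunk ids first; after the cascade they would be gone.
        let chunk_ids: Vec<String> = self
            .chunks
            .iter()
            .filter(|(_, owner)| **owner == doc_id)
            .map(|(chunk_id, _)| chunk_id.clone())
            .collect();

        // 2. Delete the document and emulate ON DELETE CASCADE for its chunks.
        self.documents.remove(workspace_path);
        self.chunks.retain(|_, owner| *owner != doc_id);

        // 3. Twin-file safety: drop the asset only when no remaining
        //    document still references it.
        let still_referenced = self.documents.values().any(|(_, a)| *a == asset_id);
        if !still_referenced {
            self.assets.remove(&asset_id);
        }

        chunk_ids
    }
}
```

With two documents sharing one asset, purging the first keeps the asset entry and purging the second removes it, which is the behavior the step 3 comment in the diff describes.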

--- a/crates/kebab-store-sqlite/tests/ingest_report_snapshot.rs
+++ b/crates/kebab-store-sqlite/tests/ingest_report_snapshot.rs
@@ -41,6 +41,7 @@ fn fixture_report() -> IngestReport {
        skipped_generated: 0,
        skipped_size_exceeded: 0,
        skip_examples: kebab_core::SkipExamples::default(),
+        purged_deleted_files: 0,
        items: Some(vec![
            IngestItem {
                kind: IngestItemKind::New,
--- a/docs/wire-schema/v1/reset_report.schema.json
+++ b/docs/wire-schema/v1/reset_report.schema.json
@@ -14,12 +14,18 @@
    "schema_version":           { "const": "reset_report.v1" },
    "scope":                    {
      "type": "string",
-      "enum": ["all", "data_only", "vector_only", "config_only"]
+      "enum": ["all", "data_only", "vector_only", "config_only", "orphans_only"]
    },
    "removed_paths":            {
      "type": "array",
      "items": { "type": "string" }
    },
-    "embedding_rows_truncated": { "type": "integer", "minimum": 0 }
+    "embedding_rows_truncated": { "type": "integer", "minimum": 0 },
+    "orphans_purged":           { "type": "integer", "minimum": 0, "default": 0 },
+    "purged_paths":             {
+      "type": "array",
+      "items": { "type": "string" },
+      "default": []
+    }
  }
 }
Author	SHA1	Message	Date
altair823	1e6de9fe9f	chore: bump version 0.10.0 → 0.11.0 dogfood follow-up (PR #149) lands: kebab reset --orphans-only explicit complement to PR #148's conservative sweep. Reason for the minor bump: new CLI flag (--orphans-only) + new ResetScope variant + additive ResetReport fields = surface expansion. Satisfies the design §10.4 trigger. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 07:53:55 +00:00
altair823	9fa2a1ebac	Merge pull request 'feat(dogfood): kebab reset --orphans-only — explicit complement to PR #148 sweep' (#149) from feat/dogfood-reset-orphans-only into main	2026-05-20 07:50:43 +00:00
altair823	749c6ae240	docs(dogfood): sync reset_report schema + README for --orphans-only (PR #149 review) Round 1 review found 2 doc gaps: - docs/wire-schema/v1/reset_report.schema.json: 'orphans_only' missing from scope enum; orphans_purged/purged_paths properties absent - README: --orphans-only not listed in the reset prose Schema additions are additive minor (default values keep back-compat). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 07:47:44 +00:00
altair823	5f2bd9e97e	feat(dogfood): kebab reset --orphans-only — purge stored docs outside walker scope PR #148 auto-purges only filesystem-missing files (conservative — leaves on-disk-but-out-of-scope docs alone for data safety). This is the explicit complement: when the user has narrowed include / widened exclude / removed a sub-directory from the workspace and WANTS the stored docs reconciled, they invoke 'kebab reset --orphans-only'. Confirm prompt with orphan count + sample paths; --yes required in non-TTY. SQLite purge via existing purge_deleted_workspace_path (PR #148) + vector store delete_by_chunk_ids when configured. No fs existence check — orphans-only is the explicit 'I know what I'm doing' variant. dogfood follow-up to PR #148 (file deletion auto-purge). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 07:38:10 +00:00
altair823	1ce06c1e2d	chore: bump version 0.9.0 → 0.10.0 dogfood-discovered file-deletion auto-purge (PR #148) lands. Reason for the minor bump: additive wire field IngestReport.purged_deleted_files + new CLI summary surface (purged N) + new user-visible behavior (automatic cleanup on ingest after rm a.md). Design §10.4 dogfooding-ready surface-expansion trigger. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 07:12:58 +00:00
altair823	d26efe167f	Merge pull request 'fix(dogfood): auto-purge stored docs for filesystem-deleted files' (#148) from fix/dogfood-file-deletion-auto-purge into main	2026-05-20 07:10:33 +00:00
altair823	d6d165df01	docs(dogfood): sync sweep_deleted_files algorithm doc with try_exists (PR #148 nit) Round 2 review found the function-level doc-comment still referenced the old fs::exists() (now replaced by try_exists().unwrap_or(true) in commit `2baa846`). One-line clarification — describes the conservative-on-Err semantics so future readers don't reintroduce the data-safety bug. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 07:10:27 +00:00
altair823	2baa846c6b	fix(dogfood): conservative try_exists() in sweep_deleted_files (PR #148 review) Round 1 review found a data-safety bug: fs::exists() returns false on errors like EACCES / EPERM / NFS-hiccup / ownership-change, which would trigger purge on a file that is in fact still on disk (just unreadable this moment). Switched to try_exists().unwrap_or(true) so transient FS errors are CONSERVATIVELY treated as 'file present' — never purge on uncertain signal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 07:04:03 +00:00
altair823	27baec82ea	fix(dogfood): auto-purge stored docs for filesystem-deleted files Files deleted from disk (rm a.md) were leaving stale documents + chunks + embeddings in the store, surfacing as ghost citations in search/ask. Existing purge_orphan_at_workspace_path only handled content-changed stale (WHERE workspace_path=? AND asset_id != ?) — file deletion has no new asset_id. Fix: post-walker-scan sweep. Compute (stored_paths - scanned_paths), for each candidate check filesystem existence — only purge when the file is TRULY missing. Scope-narrowing case (file on disk but outside include glob) is explicitly NOT purged to protect users from accidental data loss via config edits. Adds: - DocumentStore::all_workspace_paths trait method + SqliteStore impl - purge_deleted_workspace_path in store-sqlite (returns chunk_ids for vector delete; deletes doc CASCADE + asset row + copied storage file) - sweep_deleted_files in kebab-app::ingest path; called once per ingest before the per-asset loop - IngestReport.purged_deleted_files counter (additive, serde default) - CLI ingest summary mentions purge count when > 0 - 2 integration tests: file_deletion_auto_purge + include_scope_narrowing_does_NOT_purge dogfood discovery (PR #142 1B + multi-root: kebab-docs + httpx + zod + lodash). Per user decision: only filesystem deletion auto-purges; scope narrowing requires explicit kebab reset. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 06:51:07 +00:00
altair823	acf8cf3be2	chore: bump version 0.8.3 → 0.9.0 dogfood-discovered routing additions (PR #147) land: - .mts / .cts → MediaType::Code(typescript) - .mdx → MediaType::Markdown Reason for the minor bump: user dogfooding surface expansion; 28+ files that were previously skipped are now indexed. Design §10.4 dogfooding-ready surface expansion = minor trigger. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 06:29:27 +00:00
altair823	ea5f7b22c8	Merge pull request 'feat(dogfood): route .mts/.cts → typescript + .mdx → markdown' (#147) from feat/dogfood-routing-cts-mts-mdx into main	2026-05-20 06:28:41 +00:00
altair823	5497c6e7b5	feat(dogfood): route .mts/.cts to typescript + .mdx to markdown Dogfood (PR #142 1B + multi-root: kebab-docs + httpx + zod + lodash) showed 28 files skipped by extension that are routable to existing extractors: - .mts (ESM TypeScript) / .cts (CommonJS TypeScript) — same grammar as .ts in tree-sitter-typescript 0.23 (LANGUAGE_TYPESCRIPT covers JSX-agnostic variants; LANGUAGE_TSX stays for .tsx only) - .mdx (Markdown + JSX) — routed as MediaType::Markdown; the md parser folds JSX islands through as raw passthrough Changes: - crates/kebab-source-fs/src/media.rs: 'mts'|'cts' → Code(typescript), 'mdx' → Markdown. +2 unit tests. - crates/kebab-parse-code/src/lang.rs: code_lang_for_path matches mts/cts; module_path_for_tsjs strips .mts/.cts as well. Test cases extended. - crates/kebab-parse-code/src/typescript.rs: doc comment on select_grammar refreshed to mention .mts/.cts. - crates/kebab-parse-code/tests/lang.rs: 2 new assertions. verify: kebab-source-fs 44 / kebab-parse-code lib 20 + lang 4 all pass; clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 06:24:21 +00:00
altair823	5a90940f1c	chore: bump version 0.8.2 → 0.8.3 dogfood-discovered fix (PR #146) lands: idempotent re-ingest now correctly returns Unchanged for twin files (identical content at different paths) via document-centric try_skip_unchanged lookup. Reason for the patch bump: restores the correct behavior of the advertised idempotency. No new wire / config / surface changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 06:20:34 +00:00
altair823	4389b887f0	Merge pull request 'fix(dogfood): document-centric try_skip_unchanged for twin-file idempotency' (#146) from fix/dogfood-bug4-idempotent-twin-files into main	2026-05-20 06:16:28 +00:00
altair823	360f825f3a	docs(dogfood): refresh try_skip_unchanged doc-comment to match new flow (PR #146 review) Round 1 review found the function-level doc-comment still described the old asset-side algorithm (item 2 asset-row checksum, item 3 id_for_doc miss). Updated to the document-centric flow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 05:35:17 +00:00
altair823	641b92af7d	fix(dogfood): document-centric try_skip_unchanged for twin-file idempotency Identical-content files at different workspace paths share one assets row (assets.asset_id = blake3 content hash, PRIMARY KEY). The UPSERT `ON CONFLICT(asset_id) DO UPDATE SET workspace_path = excluded` made twin files overwrite each other's workspace_path on every ingest, so `get_asset_by_workspace_path(path1)` returned the OTHER twin's row (or None) — break idempotent unchanged-detection for both files. Fix: switch try_skip_unchanged to document-centric lookup. `documents. workspace_path` is already UNIQUE (V001) and `id_for_doc(path, ...)` includes path, so each twin has its own stable document row. Compare `doc.source_asset_id` with the new asset's checksum instead of going through the assets table. Dogfood (multi-root: kebab-docs + httpx + zod + lodash) showed 27 of 726 docs marked Updated on every idempotent re-ingest — all 27 are twin-file victims (empty `__init__.py` ×3, AGENTS.md ↔ CLAUDE.md same content, duplicate logo PDFs/JPGs). After: re-ingest reports 0 new / 0 updated / 726 unchanged. No schema migration needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 05:27:21 +00:00
altair823	08fb743598	chore: bump version 0.8.1 → 0.8.2 dogfood-discovered fixes (PR #145) land in production: - schema.v1.repo_breakdown is now actually populated (previously: always an empty BTreeMap) - workspace.include globs are now enforced by the walker (previously: ignored entirely) Reason for the patch bump: both restore the correct behavior of advertised surface. No new wire / config / surface changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 05:20:48 +00:00
altair823	0a2a7ae214	Merge pull request 'fix(dogfood): schema.repo_breakdown + workspace.include walker enforcement (dogfood-discovered)' (#145) from fix/dogfood-bugs-schema-walker-incremental into main	2026-05-20 05:18:59 +00:00
altair823	803d02b68b	fix(dogfood): enforce workspace.include in walker (allow-list semantics) config.workspace.include was completely ignored by the walker — connector.rs log_scope_include_warning literally said "handled by extractor router" but no extractor router exists. Dogfooding (PR #142 1B + multi-root corpus kebab-docs + httpx + zod + lodash) showed user-set include of code+md still ingested 84 .png + 8 .pdf files. Fix: walker treats scope.include as an allow-list — empty Vec preserves backward-compat (all files pass), non-empty requires file path to match at least one pattern (AND with the existing exclude rules). Removed the misleading debug log. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 05:15:04 +00:00
altair823	4e8b84c4e0	fix(dogfood): populate schema.v1.repo_breakdown (Task 9 follow-up) Dogfooding (PR #142 1B + multi-root corpus: kebab-docs + httpx + zod + lodash) revealed schema.v1.repo_breakdown is always {} despite the 1A-2 Task 9 having added the code_lang_breakdown sibling. The schema.rs:171 placeholder `BTreeMap::new()` was left in place. Mirror Task 9's code_lang_breakdown query for the repo field — same metadata_json JSON-path pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 05:09:19 +00:00
altair823	16dc02cfa2	chore: bump version 0.8.0 → 0.8.1 dogfood-discovered code_lang/repo filter bug (PR #144) fix lands in production. patch bump because: - 1A-1 advertised CLI flags --code-lang / --repo were live but inert (SearchFilters fields propagated but never applied to retriever SQL) - fix restores intended behavior; no new wire surface - user has dogfooded against httpx + zod + lodash and re-validating needs the fixed binary Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 03:35:36 +00:00
altair823	74f1b0571b	Merge pull request 'fix(p10-1a-1): apply code_lang + repo filters in lexical SQL and filter_chunks (dogfood)' (#144) from fix/p10-1a-1-code-lang-repo-filter-sql into main	2026-05-20 03:34:53 +00:00
altair823	918ee6c0be	fix(p10-1a-1): apply code_lang + repo filters in lexical SQL and filter_chunks (dogfood-discovered) p10-1A-1 (PR #139) added SearchFilters.code_lang + .repo fields and the CLI --code-lang / --repo flags propagate them correctly into SearchFilters, but neither the lexical retriever's FTS SQL nor the shared filter_chunks helper (used by the vector retriever) ever applied them — so a code-lang-filtered search returned all-doc hits (markdown / pdf / code mixed). Discovered while dogfooding p10-1B with httpx + zod + lodash clones: `kebab search 'AsyncClient' --code-lang python --json` returned markdown hits from httpx/docs/ first. Fix: add IN-list filters on json_extract(d.metadata_json, '$.code_lang') and '$.repo' to both lexical.rs and filters.rs, mirroring the existing media filter pattern. Two regression tests added in each crate covering the new filter behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 03:27:01 +00:00
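
The sweep described in commits 27baec82ea and 2baa846c6b (stored paths minus scanned paths, purging only when the file is confirmed missing) can be sketched as below. This is an illustration under assumptions, not the code in `sweep_deleted_files`: the function name `sweep_candidates`, its signature, and the use of relative path strings are hypothetical. Only the `try_exists().unwrap_or(true)` idiom is taken from the commit message.

```rust
use std::collections::BTreeSet;
use std::path::Path;

/// Returns the stored paths that qualify for purging: present in the store,
/// absent from the latest walker scan, and confirmed missing on disk.
fn sweep_candidates(
    root: &Path,
    stored: &BTreeSet<String>,
    scanned: &BTreeSet<String>,
) -> Vec<String> {
    stored
        .difference(scanned)
        .filter(|rel| {
            // `try_exists` returns Err when existence cannot be determined
            // (for example a permission error). Treat that as "present" so
            // an uncertain signal never triggers a purge.
            let present = root.join(rel.as_str()).try_exists().unwrap_or(true);
            !present
        })
        .cloned()
        .collect()
}
```

A file that is still on disk but no longer matched by the include globs lands in the set difference, passes the existence check, and is therefore left alone. That is the scope-narrowing case the commit message says must not be purged automatically.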