Files deleted from disk (rm a.md) were leaving stale documents + chunks +
embeddings in the store, surfacing as ghost citations in search/ask.
Existing purge_orphan_at_workspace_path only handled content-changed
stale (WHERE workspace_path=? AND asset_id != ?) — file deletion has no
new asset_id.
Fix: post-walker-scan sweep. Compute (stored_paths - scanned_paths),
for each candidate check filesystem existence — only purge when the
file is TRULY missing. Scope-narrowing case (file on disk but outside
include glob) is explicitly NOT purged to protect users from accidental
data loss via config edits.
Adds:
- DocumentStore::all_workspace_paths trait method + SqliteStore impl
- purge_deleted_workspace_path in store-sqlite (returns chunk_ids for
vector delete; deletes doc CASCADE + asset row + copied storage file)
- sweep_deleted_files in kebab-app::ingest path; called once per ingest
before the per-asset loop
- IngestReport.purged_deleted_files counter (additive, serde default)
- CLI ingest summary mentions purge count when > 0
- 2 integration tests: file_deletion_auto_purge + include_scope_narrowing_does_NOT_purge
dogfood discovery (PR #142 1B + multi-root: kebab-docs + httpx + zod
+ lodash). Per user decision: only filesystem deletion auto-purges;
scope narrowing requires explicit kebab reset.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds five new u32 counters (skipped_gitignore, skipped_kebabignore,
skipped_builtin_blacklist, skipped_generated, skipped_size_exceeded)
and a SkipExamples struct (≤5 sample paths per category) to
IngestReport. All new fields are #[serde(default)] for backward-compat
deserialization. Downstream literal construction sites patched with
zeros/empty; snapshot re-baked.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
reviewer-flagged: aa2a6ea claimed build clean but missed:
- crates/kebab-store-sqlite/tests/ingest_report_snapshot.rs (test fixture)
- crates/kebab-cli/src/wire.rs (test fixture)
- crates/kebab-store-sqlite/snapshots/ingest_report.snapshot.json (snapshot)
All three add `unchanged: 0` (or `\"unchanged\": 0`) to match the new
IngestReport.unchanged field. cargo clippy --workspace --all-targets
-- -D warnings now clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`kebab-normalize::derive_title(frontmatter_title, blocks, file_stem)` 가
다음 단계로 비어있지 않은 첫 결과를 사용:
1. frontmatter `title` (trim 후)
2. 첫 H1 텍스트
3. 첫 H2 텍스트
4. 첫 Paragraph (Quote / List / Code / Table / ImageRef 제외) 의 첫 80 자
5. 파일 stem (확장자 제외)
6. (sentinel) `"untitled"` — 위 다섯 단계가 모두 blank 인 병적 케이스
선택된 문자열은 NFC 정규화. 빈 문자열은 절대 반환하지 않음.
`build_canonical_document` 가 metadata lift 직후 helper 호출. 기존 단순
lift 로직 (metadata.user["title"] → CanonicalDocument.title) 은 fallback
chain 의 1 단계 입력으로 자리 이동.
`KEBAB_PARSE_MD_VERSION` 상수를 `pulldown-cmark-0.x` → `md-frontmatter-v2`
로 bump. parser_version 변경 → §4.2 doc_id 입력 변화 → 기존 markdown
doc 의 `doc_id` 갱신, 다음 ingest 시 idempotent upsert 로 자동 재처리
(design §9 cascade). `kebab-store-sqlite` 의 snapshot fixture 도 같은
literal 로 갱신.
기존 M7 정책 ("metadata.user[\"title\"] = '' 가 빈 title 로 lift") 은
폐기. 빈 문자열 입력은 fallback chain 을 타고 file stem 까지 떨어진다.
spec p9-fb-07 line 37: "빈 문자열 반환 금지".
테스트 (kebab-normalize):
- 8 개 단위 테스트 (각 fallback 단계 + NFC + sentinel)
- `build_canonical_document` 통합 테스트 2 개 (H1 / file stem)
- 기존 M7 테스트 2 개를 새 정책에 맞춰 갱신
문서:
- README: `kebab ingest` 행에 "title 자동 채움" 안내 + 기존 doc 도
다음 ingest 에서 갱신
- HANDOFF: 2026-05-03 머지 후 발견 entry
- spec status: `planned` → `in_progress`
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>