P10 dogfooding found that a k8s manifest with 2+ documents (e.g.
Deployment + Service in one file) fails to ingest:
UNIQUE constraint failed: chunks.chunk_id
Root cause: tier2_shared::push_chunks_with_oversize's non-oversize branch
hardcoded split_key = None. K8sManifestResourceV1Chunker calls it once per
resource; with split_key None every resource from the same document gets
the same id_hash (= base_policy_hash) → identical chunk_id. p10-3's
code_text_paragraph_v1 had the same bug (fixed in df3c5b8) but it calls
build_chunk_no_symbol directly — the push_chunks_with_oversize path was
never fixed.
Fix: push_chunks_with_oversize gains a base_split_key parameter for the
non-oversize single-chunk case. k8s chunker passes Some(resource.line_start)
so each resource gets a distinct chunk_id; dockerfile / manifest pass None
(1 chunk per file — no sibling collision, chunk_id stays stable).
Regression coverage: k8s_multi_doc_emits_one_chunk_per_resource now asserts
chunk_id distinctness; new integration test
tier2_k8s_multi_resource_yaml_ingests_without_collision ingests a real
2-document YAML end-to-end.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PR #155 (p10-3) merged WITHOUT the reviewer's required Option B1 fix —
the implementer reported a commit SHA (2a39513) that never made it to main.
Result: every reingest of a Tier 3-fallback file (non-k8s YAML, invalid
YAML, AST extractor failure) re-runs full extract + chunk + embed because
the parser/chunker version comparison can never match (stored is
code-text-paragraph-v1 / none-v1, but caller uses Tier 1/2 dispatch
values).
This commit:
1. Adds the 7th param `fallback_chunker_version: Option<&ChunkerVersion>`
to try_skip_unchanged + the stored_is_tier3_fallback detection branch
(skip parser/chunker equality, keep embedder check).
2. Threads `None` through non-code call sites (md / image / pdf).
3. Code call site computes tier3_fallback_cv covering all Tier 1/2 langs
that can fall back: rust / python / ts / js / go / java / kotlin /
yaml / dockerfile / toml / json / xml / groovy / go-mod / c / cpp
(p10-1D additions).
4. Adds tier3_yaml_fallback_reingest_is_unchanged + tier3_shell_reingest_is_unchanged
regression tests (the originally-promised PR #155 regression coverage
that also never made it to main).
Smoke tests: 14 + 2 = 16 PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new tests verify end-to-end Tier 3 wiring:
- tier3_shell_ingest_searchable: .sh file → --code-lang shell search →
Citation::Code { symbol: None, lang: "shell" }, chunker_version
"code-text-paragraph-v1".
- tier3_yaml_fallback_picks_up_non_k8s_yaml: docker-compose-shaped yaml
(no apiVersion/kind) triggers k8s chunker's Ok(vec![]) result, fallback
retries with Tier 3 → Citation::Code { symbol: None, lang: "yaml" } and
chunker_version "code-text-paragraph-v1".
Also fixes a bug in CodeTextParagraphV1Chunker (Task B): short paragraphs
(≤80 lines) were emitted with split_key=None, causing all paragraphs from the
same document to share the same chunk_id (UNIQUE constraint violation at
put_chunks). Fix: always use para.line_start as split_key so every paragraph
gets a distinct id regardless of size.
Brings code_ingest_smoke to 14 tests (Tier 1: 9, Tier 2: 3, Tier 3: 2).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces Go bail! arms with GoAstExtractor + CodeGoAstV1Chunker. Adds
go_file_ingests_and_searches_as_code_citation integration test — asserts
citation.lang=go, symbol=chunk.ParseDoc, code_lang=go.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
assets.workspace_path is INTENTIONALLY 'last-registered path' for twin
files (identical content at different paths share one asset row PK'd by
blake3 content hash). PR #146 made try_skip_unchanged document-centric;
PR #149 made reset --orphans-only document-centric; this PR removes the
last caller of get_asset_by_workspace_path (fetch.rs:193 in fetch_span,
which used it to reject PDF/audio media — for twins this could read the
wrong asset's media_type and pick the wrong branch).
Replaced with the natural 2-step lookup: get_document_by_workspace_path
(PR #146) → doc.source_asset_id → get_asset (NEW trait method, asset_id
is PRIMARY KEY so flip-flop-immune by construction).
Then removed get_asset_by_workspace_path trait method + SqliteStore impl
— 0 callers after the refactor.
UPSERT doc-comment refreshed in store.rs to make the 'last-registered'
semantics explicit so future readers don't try to 'fix' the flip-flop.
Dogfood follow-up (PR #142 1B + multi-root corpus).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #148 auto-purges only filesystem-missing files (conservative — leaves
on-disk-but-out-of-scope docs alone for data safety). This is the explicit
complement: when the user has narrowed include / widened exclude / removed
a sub-directory from the workspace and WANTS the stored docs reconciled,
they invoke 'kebab reset --orphans-only'.
Confirm prompt with orphan count + sample paths; --yes required in
non-TTY. SQLite purge via existing purge_deleted_workspace_path (PR #148)
+ vector store delete_by_chunk_ids when configured. No fs existence
check — orphans-only is the explicit 'I know what I'm doing' variant.
dogfood follow-up to PR #148 (file deletion auto-purge).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Files deleted from disk (rm a.md) were leaving stale documents + chunks +
embeddings in the store, surfacing as ghost citations in search/ask.
Existing purge_orphan_at_workspace_path only handled content-changed
stale (WHERE workspace_path=? AND asset_id != ?) — file deletion has no
new asset_id.
Fix: post-walker-scan sweep. Compute (stored_paths - scanned_paths),
for each candidate check filesystem existence — only purge when the
file is TRULY missing. Scope-narrowing case (file on disk but outside
include glob) is explicitly NOT purged to protect users from accidental
data loss via config edits.
Adds:
- DocumentStore::all_workspace_paths trait method + SqliteStore impl
- purge_deleted_workspace_path in store-sqlite (returns chunk_ids for
vector delete; deletes doc CASCADE + asset row + copied storage file)
- sweep_deleted_files in kebab-app::ingest path; called once per ingest
before the per-asset loop
- IngestReport.purged_deleted_files counter (additive, serde default)
- CLI ingest summary mentions purge count when > 0
- 2 integration tests: file_deletion_auto_purge + include_scope_narrowing_does_NOT_purge
dogfood discovery (PR #142 1B + multi-root: kebab-docs + httpx + zod
+ lodash). Per user decision: only filesystem deletion auto-purges;
scope narrowing requires explicit kebab reset.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Identical-content files at different workspace paths share one assets row
(assets.asset_id = blake3 content hash, PRIMARY KEY). The UPSERT
`ON CONFLICT(asset_id) DO UPDATE SET workspace_path = excluded` made
twin files overwrite each other's workspace_path on every ingest, so
`get_asset_by_workspace_path(path1)` returned the OTHER twin's row (or
None) — break idempotent unchanged-detection for both files.
Fix: switch try_skip_unchanged to document-centric lookup. `documents.
workspace_path` is already UNIQUE (V001) and `id_for_doc(path, ...)`
includes path, so each twin has its own stable document row. Compare
`doc.source_asset_id` with the new asset's checksum instead of going
through the assets table.
Dogfood (multi-root: kebab-docs + httpx + zod + lodash) showed 27 of
726 docs marked Updated on every idempotent re-ingest — all 27 are
twin-file victims (empty `__init__.py` ×3, AGENTS.md ↔ CLAUDE.md
same content, duplicate logo PDFs/JPGs).
After: re-ingest reports 0 new / 0 updated / 726 unchanged.
No schema migration needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Task H implementer chose code-typescript-v1 but plan + design §3.3 use the
short form (chunker is code-ts-ast-v1 / code-js-ast-v1). Aligning parser
versions to match: rust=code-rust-v1 / python=code-python-v1 / ts=code-ts-v1
/ js=code-js-v1 (Task K). Fixes 2 sites: const PARSER_VERSION + integration
test assertion.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `MediaType::Code("rust")` dispatch arm in `ingest_one_asset`,
`ingest_one_code_asset` fn (faithful mirror of `ingest_one_pdf_asset`),
and `backfill_code_lang` post-processing in `App::search_uncached`.
Integration test `code_ingest_smoke.rs` verifies full pipeline:
ingest `.rs` → Citation::Code hit with lang/symbol/line_start.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends CountSummary with media_breakdown, lang_breakdown, stale_doc_count
fields populated via stats_ext::breakdowns(). Adds count_summary_with_threshold
for callers that need real stale counts. Mirrors all new fields onto the
wire-bound Stats struct in kebab-app::schema with #[serde(default)] for
backwards-compat. Also fixes search_budget_integration.rs for the trace field
added to SearchOpts in Task 1.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- wire schema: relax effective_end.minimum 1 → 0 + expand
description to cover line-clamp + out-of-range sentinel
(panic-fix R1 emits Some(0) when line_start=1 and range is
beyond doc end — schema must accept it)
- tests: tighten first-chunk-target boundary test to assert ≤ 2
total neighbors (3-chunk doc, N=2). Strict "first chunk →
context_before empty" not assertable until chunks.ordinal
column lands (R1 #9 architectural caveat)
- store: trim contradiction in list_chunk_ids_for_doc warning
comment — drop "good enough for sequentially chunked
markdown" phrase that conflicts with "hash sort dominates"
paragraph above
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Line-based slice over fmt_canonical_to_markdown output.
PDF / audio source_type → span_not_supported StructuredError.
Out-of-range line_end clamps to total; effective_end reflects
post-budget trim. invalid_input on zero / inverted bounds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walks CanonicalDocument blocks, serializes to markdown, applies
chars/4 budget when opts.max_tokens is set. doc_not_found
preserved through StructuredError.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Chunk mode + +-N context. doc / span modes return placeholder
errors (filled by subsequent tasks). fmt_canonical_to_markdown
helper introduced now since doc mode (Task 4) consumes it.
Errors are typed StructuredError so classify preserves
chunk_not_found / doc_not_found through the wire layer.
Adds SqliteStore::list_chunk_ids_for_doc so the facade can derive
+-N neighbors without leaking direct rusqlite usage into kebab-app.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous round-1 fix dropped the speculative cursor branch on
the truncated path, leaving a contradiction with the docs:
- snippet-only shrunk → cursor emitted (returned == k_effective)
- k-popped → cursor null (returned < k_effective)
But docs promised the opposite.
R2 resolution: emit cursor whenever more hits may be reachable
(either retriever filled the page OR budget popped hits — the
popped ones remain fetchable from offset+returned). Drop the
artificial "widen vs paginate" copy; truncated and next_cursor
are now independent signals — caller may do either or both.
Updates: app.rs::search_with_opts logic + SearchResponse doc +
schema description + SKILL.md two bullets + max_tokens=0 test
asserts cursor IS emitted on k-pop case.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- error_wire: StructuredError wrapper preserves ErrorV1 through
anyhow → classify pipeline. Adds downcast short-circuit so
cursor::decode's typed code = "stale_cursor" reaches the wire
instead of being string-formatted to code = "generic".
- app: search_with_opts now wraps cursor::decode error in
StructuredError instead of anyhow! string format.
- test: error_wire pins both negative (bare anyhow → not
stale_cursor) AND positive (StructuredError → stale_cursor)
invariants. CLI integration test runs end-to-end and asserts
error.v1.code on stderr.
- app: next_cursor only emitted on full-page (k-pop) path; drop
speculative emit on snippet-only truncation that would point at
a different page than the agent expected.
- cursor: differentiate malformed-base64 / malformed-payload /
revision-mismatch error messages; all keep code = stale_cursor.
- test: cursor_rejected fixture uses .expect() to fail loud on
cursor non-emission instead of silent skip.
- test: max_tokens=0 → 1-hit floor + truncated=true.
- docs: SKILL.md + schema description distinguish snippet-shrink
(widen) vs k-pop (paginate) truncated cases. HOTFIXES notes
--no-cache semantic shift (cached path + clear vs uncached path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Budget loop: snippet shorten → k pop → ≥1 hit floor. Cursor
encode/decode threads corpus_revision; mismatch surfaces as
stale_cursor anyhow error. App::search retained as thin wrapper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Opaque base64(JSON{offset, corpus_revision}). Mismatch or
malformed input returns ErrorV1 with code = stale_cursor.
base64 promoted to workspace dep.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
compute_stale: strict > boundary, threshold=0 disables, future
timestamps treated as fresh (clock skew safety). App::search
re-stamps on cache hit so config threshold changes take effect
without flushing the cache.
Also unblocks the workspace build by plugging placeholder
indexed_at/stale into the two AnswerCitation construction
sites in kebab-rag/pipeline.rs (the score-gate refusal path
forwards from SearchHit; the LLM-citation path uses
UNIX_EPOCH/false until Task 7 wires the real values through
pack_context).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- ingest_file_with_config: lowercase normalize ext (caller-side) +
early error on unsupported extension (`.docx` etc. now Err with
helpful message instead of silent skipped_by_extension counter).
New test ingest_file_errors_on_unsupported_extension.
- ingest_stdin_with_config: doc comment explaining intentional
double-call of ensure helpers (idempotent + ~ms negligible).
- external::inject_frontmatter: simplify precheck via single
trim_start binding + add CR-only line ending edge case.
- external::inject_frontmatter: doc note on yaml_quote escape
contract (agent-supplied titles with special chars are safe).
Round 1 review summary: #111 (comment)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three scenarios — copies external md + reports new=1, idempotent on
second call (unchanged=1), errors on missing path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plug coverage hole flagged in code review — test 1 was asserting only
doc_count + last_ingest_at, leaving count_summary's chunk_count and
asset_count queries un-pinned.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two scenarios: freshly-ingested 2-doc KB (stats reflect counts +
last_ingest_at populated) and empty-but-initialized KB (counts zero,
last_ingest_at None). The empty case runs ingest_with_config over an
empty workspace dir to seed kebab.sqlite before calling schema_with_config,
since open_existing (used internally) returns NotIndexed if the DB is absent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reviewer-flagged: task 1 missed test files using cfg.workspace.include.
- crates/kebab-app/tests/common/mod.rs: SourceScope literal switched
to ..Default::default().
- crates/kebab-app/tests/image_pipeline.rs (×3): drop dead-no-op
cfg.workspace.include.push(...) calls; comment explains removal.
- crates/kebab-app/tests/pdf_pipeline.rs: same treatment.
Pre-fb-25 these pushes were no-ops (include was dead config field
not enforced anywhere). Removal is purely mechanical.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the per-asset incremental-ingest skip block to all three flows
(markdown / image / pdf). When `IngestOpts::force_reingest = false`
AND the asset's blake3 checksum + parser/chunker/embedding versions
all match the existing DB record, ingest emits
`AssetFinished { result: Unchanged }`, bumps `aggregate.unchanged`,
and skips parse / chunk / embed / vector upsert entirely.
Shared `try_skip_unchanged` helper performs the four checks; per-flow
callers supply the active parser_version + chunker_version + optional
embedding_version. `force_reingest = true` bypasses the skip path so
`incremental_ingest::force_reingest_bypasses_skip` still sees `Updated`.
Tests:
- new `incremental_ingest.rs` covers both paths.
- existing `ingest_idempotent_on_second_run` /
`re_ingest_image_produces_*` / `re_ingest_identical_pdf_produces_*`
updated to assert `Unchanged` on identical-bytes re-ingest (the
pre-task behaviour was `Updated`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All three ingest flows (markdown, image, pdf) now set
last_chunker_version and last_embedding_version on the CanonicalDocument
before calling put_document, giving Task 7's skip detection the data it
needs on the second run. No skip path is added yet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
p9-fb-10: verifies that a Korean (Hangul) token survives the
ingest → FTS5 lexical search round-trip via the kebab-app facade.
NFC normalization is wired upstream in kebab-normalize; this test
only exercises end-to-end correctness — no AVX, no fastembed required.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
도그푸딩 item 15 — TUI / 같은 process 안에서 동일 query 반복 시 SQLite
FTS + Lance + RRF 재계산이 매번 발생하던 비용 해소. in-process LRU
캐시 + 모노토닉 corpus_revision 카운터로 ingest commit 발생 시 모든
entry 자동 stale.
## 핵심 변경
- **SQLite V004 migration**: `kv (key TEXT PRIMARY KEY, value TEXT)
STRICT` + `corpus_revision = '0'` seed. 미래의 다른 scalar 도 같은
테이블에 들어갈 수 있는 generic shape.
- **`SqliteStore::corpus_revision()` / `bump_corpus_revision()`** —
`UPDATE ... CAST AS INTEGER + 1` atomic. INSERT-OR-IGNORE 도 함께
실행 (V004 seed 가 무슨 이유로 누락된 케이스 paranoid).
- **`kebab-app::ingest_with_config_cancellable`** — `new + updated > 0`
시 bump, no-op (skipped-only) reingest 는 cache 보존.
- **`App.search_cache: Option<Mutex<LruCache<SearchCacheKey, Vec<
SearchHit>>>>`** — `config.search.cache_capacity` (default 256, 0
비활성). `lru = "0.12"` workspace dep 추가.
- **`SearchCacheKey`** = `query_norm` (NFKC + trim + lowercase) +
`mode` + `k` + `snippet_chars` + `embedding_version` (vector/hybrid
만, lexical 은 빈 문자열) + `chunker_version` + `corpus_revision`
snapshot.
- **`App::search`** rewrite — cache 활성 시 lookup → miss 면 기존
`search_uncached` 호출 후 put. cache 비활성이거나 lock 실패면
straight-line.
- **`App::search_uncached`** (rename of pre-fb-19 `search` body) +
`search_uncached_with_config` facade — CLI `kebab search --no-cache`
로 진입.
- **`Config.search.cache_capacity: usize`** field, `#[serde(default)]`
로 기존 config 호환.
- **CLI `--no-cache`** flag — 디버깅용 (CLI 는 매 호출이 새 process
라 사실상 no-op 이지만 spec 명시 + 향후 long-lived process 호환).
- **frozen design §9 versioning** 표에 `corpus_revision` row 추가
(기존 `index_version` 라벨과 다른 차원: 라벨은 retrieval 형상,
corpus_revision 은 ingest commit ack).
## 테스트
- `kebab-store-sqlite` 신규 3 unit (fresh=0, monotonic bump, persist
across reopen)
- `kebab-app` 신규 4 integration (cached repeat 같은 hits, NFKC 정규화
로 case/whitespace collapse, --no-cache parity, first ingest bumps
corpus_revision)
- 워크스페이스 전체 `cargo test --workspace --no-fail-fast -j 1` exit 0
- `cargo clippy --workspace --all-targets -- -D warnings` clean
## 문서
- README `kebab search` 행: 캐시 동작 + `--no-cache` 안내 + corpus_
revision 무효화 메커니즘
- docs/SMOKE.md `[search]` 절에 `cache_capacity` 라인 추가
- HANDOFF: 2026-05-03 entry
- spec status planned → in_progress
## Out of scope
- patch-and-merge incremental (RRF 정규화 전체 hit set 기준이라 어려움)
- SQLite 영속 cache (P+)
- 다른 process 간 cache 공유 (in-process 만 — corpus_revision 이
cross-process 무효화는 O(1))
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec PR #59 의 §3.8 multi-turn behaviour 구현. RAG facade 가 prior
turns 받아 prompt 에 prepend, retrieval query expansion 적용,
Answer 에 conversation_id / turn_index 채움.
신규 (kebab-core):
- Answer 에 conversation_id (Option<String>) / turn_index (Option<u32>)
field 추가. serde skip_serializing_if 로 single-shot 의 wire
output 변경 0 (기존 외부 wrapper 영향 없음).
- Turn struct (question + answer + citations + created_at).
- RefusalReason::LlmStreamAborted variant.
신규 (kebab-rag):
- AskOpts 에 history (Vec<Turn>) / conversation_id / turn_index 3 field.
- AskOpts::single_shot(mode) helper.
- RagPipeline::ask_with_history(query, history, conversation_id,
turn_index, opts) — combined opts 로 ask 호출.
- expand_query_with_history: history.last() 의 answer 첫 200 자
concat 해 SearchQuery.text 확장 (spec §3.8 의 \"cheap concat\";
LLM-based standalone-question rewriting 은 P+).
- serialize_history + remaining_history_budget_chars: spec 의 priority
enforcement — system+question 필수, retrieved chunks 가 차지한
뒤 남은 char budget 안에서 newest 우선, oldest drop.
- ask 본문: history 가 비어있지 않으면 [이전 대화] 블록을 user
prompt 위에 prepend. Answer 생성 site 3 곳 (정상 / NoChunks /
ScoreGate refuse) 모두 conversation_id / turn_index 채움.
신규 (kebab-store-sqlite):
- refusal_reason_label 가 LlmStreamAborted → 'llm_stream_aborted'.
기존 caller 변경 (single-shot 동작 동일):
- kebab-cli main.rs Cmd::Ask: AskOpts 에 history=Vec::new(),
conversation_id=None, turn_index=None 명시 (CLI multi-turn 은
p9-fb-18 의 --session/--repl 가 채움).
- kebab-tui src/ask.rs spawn site 동일 (multi-turn UI 는 p9-fb-16).
- kebab-eval runner.rs golden eval 동일 (single-shot per query).
- kebab-app tests/ask_smoke.rs / kebab-tui tests/ask.rs / kebab-rag
tests/pipeline.rs / kebab-eval metrics.rs Answer literal 갱신.
Test:
- 9 신규 lib unit (expand_query 4 / serialize_history 3 / remaining_budget 2).
- 기존 12 PASS 회귀 0.
Plan 갱신:
- p9-fb-15 status planned → in_progress. 머지 후 한 줄 commit
으로 completed flip.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
회차 1 actionable 2건 반영 + 1건 (CLI Ctrl-C integration test)
은 본 PR 에서 별도 task 로 미룸 (signal handler subprocess test 의
flaky 위험 + facade 3 PASS + tui lib 3 PASS 가 안정 surface).
- cancel.rs::install_sigint_cancel: SIGNAL_COUNT 위에 process-lifetime
invariant 코멘트 — multi-install 차단 (ctrlc::set_handler) 덕분에
reset 불필요. 미래 다중 caller 가 같은 cancel token 공유하려면
install 함수 분리 필요.
- ingest_cancel.rs::cancel_mid_loop: redundant `report.new == 1 || 0
|| 2` 제거, race timing 의도 코멘트로 대체 (0=listener 승, 1=first
only, 2=extra slipped in 모두 valid; 3 = cancel never propagated
= 유일한 fail).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ctrl-C / Esc 가 ingest 를 즉시 중단. 현재 in-flight asset 마무리 후
이후 asset 미실행, IngestEvent::Aborted { partial_counts } 발신,
Ok(IngestReport) 정상 반환 (Err 아님). 부분 commit 보존, 다음 ingest
가 idempotent 재개.
신규 facade: kebab-app::ingest_with_config_cancellable(.., progress,
cancel: Option<Arc<AtomicBool>>). 기존 _progress 가 cancel=None
forwarding wrapper. asset loop 시작 boundary 마다 atomic load —
true 면 break + Aborted emit + 정상 종료. Lock 없음.
CLI: ctrlc crate 신규 dep. SIGINT handler 가 첫 신호에 cancel.store(true)
+ stderr hint, 두 번째 신호에 std::process::exit(130) (canonical SIGINT
exit code). install_sigint_cancel() helper 가 Arc<AtomicBool> 반환,
Cmd::Ingest 가 facade 에 전달.
TUI: IngestState 에 cancel: Arc<AtomicBool> field 추가 (회차 1 review
결과의 reshape 정확). start_ingest 가 둘 다 만들어 worker 에 clone
move. cancel_running_ingest(&app) helper — Esc / Ctrl-C 가
ingest 진행 중일 때만 cancel 우선, 그 외에는 quit.
Test:
- 3 facade integration (cancel-before / cancel-mid / no-cancel
default).
- 3 tui lib unit (cancel_running_ingest no-state / in-flight /
terminated).
Plan 갱신: p9-fb-04 status planned → in_progress. 머지 후 한 줄
commit 으로 completed flip.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P7-3 통합 테스트가 노출한 storage 레이어 버그 fix.
`assets.workspace_path` 의 UNIQUE 제약과 `upsert_asset_row` 의
`ON CONFLICT(asset_id)` 만 처리하던 gap 사이 — byte 가 변경된 자산
re-ingest 시 새 asset_id 가 같은 workspace_path 에서 secondary UNIQUE
충돌. md / image / pdf 모두 영향.
Fix:
- 새 helper `purge_orphan_at_workspace_path` 가 같은 `workspace_path`
의 *다른* `asset_id` 를 발견하면 documents → assets 순서로 sweep.
documents 의 ON DELETE RESTRICT 회피 + CASCADE 로 blocks / chunks /
embedding_records 정리. copied 모드면 storage_path 의 byte 파일도
best-effort 삭제.
- `put_asset_with_bytes` 의 두 분기 (copy / reference) + `DocumentStore
::put_asset` 모두 호출.
- 회귀 테스트 `put_asset_with_bytes_sweeps_workspace_path_orphan` (이전
의 "UPSERT 실패시 orphan 청소" 테스트가 더 이상 doable 하지 않으므로
대체).
- `re_ingest_edited_pdf_produces_new_doc_id` integration `#[ignore]` 해제 →
9 통합 테스트 모두 default 로 통과.
Vector store orphan 은 별도 P+ task — LanceDB 가 SQLite cascade 와 무관하게
운영되므로 stale chunk_id vector 가 디스크에 남음. 검색에는 영향 없음 (search 가
SQLite join 통해 surface).
Smoke 검증 (release binary, markdown 2 + image 1 + PDF 2):
- doctor pass
- 첫 ingest: 5 new
- list docs: 5 docs all media types
- search lexical "pdf-page-v1 chunker" → whitepaper.pdf hit
- search hybrid → cross-media 결과
- inspect doc PDF: parser_version=pdf-text-v1, blocks 가 SourceSpan::Page
- 동일 byte re-ingest: 5 updated, 0 errors (P1 idempotency)
- byte 수정 후 re-ingest: 1 new (해당 PDF) + 4 updated, 0 errors (storage fix)
- corrupt PDF 추가: errors+=1 + IngestItem.error 메시지 정확, 다른 자산 영향 0
- 정리 후 다시 ingest: errors=0
- RAG ask: PDF 인용 + `citations[].citation` 에 `kind: "page"` + `page: <N>` +
`path: <pdf_path>` 정확히 노출
운영 fixture 보조:
- `crates/kebab-parse-pdf/examples/gen_smoke_pdf.rs` — `cargo run --release
--example gen_smoke_pdf -p kebab-parse-pdf -- <out.pdf> <text-pages>` 로
reportlab/qpdf 없이 in-tree PDF 생성.
- `crates/kebab-parse-image/examples/gen_smoke_png.rs` — 동일 방식의 PNG
fixture 생성.
- SMOKE.md 가 두 example 사용법 + 갱신된 HOTFIXES 동작 (byte 수정 시
errors+=1 → new+=1) 반영.
HOTFIXES `2026-05-02 P7-3` entry 가 \"deferred\" → \"fixed in same PR\" 로
업데이트, vector store orphan caveat 만 남음.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>