kebab

Author	SHA1	Message	Date
altair823	6ac7fea7b9	feat(v0.17.0/A5): trigram-aware build_match_string + SearchResponse.hint PR-A main body. plan Task A4 Step 1c + A5. - lexical.rs::build_match_string redesigned: whole-phrase + token-AND, OR-combined; tokens shorter than 3 characters are dropped; returns None when no candidates remain (avoids an empty MATCH). Raw single-quote mode is preserved. - SearchResponse.hint is additive: set by the short_query_hint helper in the empty result + trimmed < 3 chars + non-raw case. - CLI 'kebab search' prints one [hint] line to stderr (text mode). - TUI: SearchState.short_query_hint + stale-aware set in poll_worker + reset in fire_search/mark_input_changed + display in dynamic_status. - docs/wire-schema/v1/search_response.schema.json: hint additive. - New unit tests (lexical 9 PASS, 2 existing expectations updated) + integration regression (search_korean: multi_token + mixed, 3 PASS) + BM25 snapshot regen (trigram token stream). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 11:54:25 +00:00
altair823	1969c8e3b5	fix(dogfood): k8s multi-resource YAML chunk_id collision P10 dogfooding found that a k8s manifest with 2+ documents (e.g. Deployment + Service in one file) fails to ingest: UNIQUE constraint failed: chunks.chunk_id Root cause: tier2_shared::push_chunks_with_oversize's non-oversize branch hardcoded split_key = None. K8sManifestResourceV1Chunker calls it once per resource; with split_key None every resource from the same document gets the same id_hash (= base_policy_hash) → identical chunk_id. p10-3's code_text_paragraph_v1 had the same bug (fixed in `df3c5b8`) but it calls build_chunk_no_symbol directly — the push_chunks_with_oversize path was never fixed. Fix: push_chunks_with_oversize gains a base_split_key parameter for the non-oversize single-chunk case. k8s chunker passes Some(resource.line_start) so each resource gets a distinct chunk_id; dockerfile / manifest pass None (1 chunk per file — no sibling collision, chunk_id stays stable). Regression coverage: k8s_multi_doc_emits_one_chunk_per_resource now asserts chunk_id distinctness; new integration test tier2_k8s_multi_resource_yaml_ingests_without_collision ingests a real 2-document YAML end-to-end. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 23:49:37 +00:00
altair823	192835e5bf	test(p10-1d): integration smoke tests for C + C++ Verifies end-to-end ingest + search + Citation::Code shape: - tier1_c_ingest_searchable: .c file → --code-lang c search → symbol = function name (no nesting), lang = "c", chunker_version = "code-c-ast-v1". - tier1_cpp_ingest_searchable: .cpp file → --code-lang cpp search → symbol starts with namespace::Class prefix, lang = "cpp", chunker_version = "code-cpp-ast-v1". Brings code_ingest_smoke to 18 tests (Tier 1: 9 → 11, Tier 2: 3, Tier 3: 4). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 14:31:35 +00:00
altair823	1034de25a2	fix(p10-3+p10-1d): land the missing try_skip_unchanged fallback-aware fix PR #155 (p10-3) merged WITHOUT the reviewer's required Option B1 fix — the implementer reported a commit SHA (2a39513) that never made it to main. Result: every reingest of a Tier 3-fallback file (non-k8s YAML, invalid YAML, AST extractor failure) re-runs full extract + chunk + embed because the parser/chunker version comparison can never match (stored is code-text-paragraph-v1 / none-v1, but caller uses Tier 1/2 dispatch values). This commit: 1. Adds the 7th param `fallback_chunker_version: Option<&ChunkerVersion>` to try_skip_unchanged + the stored_is_tier3_fallback detection branch (skip parser/chunker equality, keep embedder check). 2. Threads `None` through non-code call sites (md / image / pdf). 3. Code call site computes tier3_fallback_cv covering all Tier 1/2 langs that can fall back: rust / python / ts / js / go / java / kotlin / yaml / dockerfile / toml / json / xml / groovy / go-mod / c / cpp (p10-1D additions). 4. Adds tier3_yaml_fallback_reingest_is_unchanged + tier3_shell_reingest_is_unchanged regression tests (the originally-promised PR #155 regression coverage that also never made it to main). Smoke tests: 14 + 2 = 16 PASS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 14:19:17 +00:00
altair823	df3c5b8caf	test(p10-3): integration smoke tests for Tier 3 (shell + yaml fallback) Two new tests verify end-to-end Tier 3 wiring: - tier3_shell_ingest_searchable: .sh file → --code-lang shell search → Citation::Code { symbol: None, lang: "shell" }, chunker_version "code-text-paragraph-v1". - tier3_yaml_fallback_picks_up_non_k8s_yaml: docker-compose-shaped yaml (no apiVersion/kind) triggers k8s chunker's Ok(vec![]) result, fallback retries with Tier 3 → Citation::Code { symbol: None, lang: "yaml" } and chunker_version "code-text-paragraph-v1". Also fixes a bug in CodeTextParagraphV1Chunker (Task B): short paragraphs (≤80 lines) were emitted with split_key=None, causing all paragraphs from the same document to share the same chunk_id (UNIQUE constraint violation at put_chunks). Fix: always use para.line_start as split_key so every paragraph gets a distinct id regardless of size. Brings code_ingest_smoke to 14 tests (Tier 1: 9, Tier 2: 3, Tier 3: 2). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 11:37:44 +00:00
altair823	166e1ddfaf	test(p10-2): integration smoke tests for Tier 2 (k8s yaml + Dockerfile + Cargo.toml) Three new tests in code_ingest_smoke.rs verifying isolated-TempDir ingest + --code-lang filter + Citation::Code.lang / .symbol shape for each Tier 2 chunker. Brings the suite to 12 tests (Rust 3 + Python 1 + TS 1 + JS 1 + Go 1 + Java 1 + Kotlin 1 + yaml 1 + dockerfile 1 + manifest 1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 13:23:01 +00:00
altair823	ff1bedbef5	feat(p10-1c-jk): activate Kotlin in ingest_one_code_asset dispatch Replaces Kotlin bail! arms with KotlinAstExtractor + CodeKotlinAstV1Chunker. Adds kotlin_file_ingests_and_searches_as_code_citation integration test — asserts citation.lang=kotlin, symbol=com.foo.Foo.bar, code_lang=kotlin. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 10:54:55 +00:00
altair823	ebc4ef2eea	feat(p10-1c-jk): activate Java in ingest_one_code_asset dispatch Replaces Java bail! arms with JavaAstExtractor + CodeJavaAstV1Chunker. Adds java_file_ingests_and_searches_as_code_citation integration test — asserts citation.lang=java, symbol=com.foo.Foo.bar, code_lang=java. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 10:44:05 +00:00
altair823	c19aa006d0	feat(p10-1c-go): activate Go in ingest_one_code_asset dispatch Replaces Go bail! arms with GoAstExtractor + CodeGoAstV1Chunker. Adds go_file_ingests_and_searches_as_code_citation integration test — asserts citation.lang=go, symbol=chunk.ParseDoc, code_lang=go. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 09:13:47 +00:00
altair823	453ec15df4	fix(dogfood): document-centric fetch_span + remove get_asset_by_workspace_path assets.workspace_path is INTENTIONALLY 'last-registered path' for twin files (identical content at different paths share one asset row PK'd by blake3 content hash). PR #146 made try_skip_unchanged document-centric; PR #149 made reset --orphans-only document-centric; this PR removes the last caller of get_asset_by_workspace_path (fetch.rs:193 in fetch_span, which used it to reject PDF/audio media — for twins this could read the wrong asset's media_type and pick the wrong branch). Replaced with the natural 2-step lookup: get_document_by_workspace_path (PR #146) → doc.source_asset_id → get_asset (NEW trait method, asset_id is PRIMARY KEY so flip-flop-immune by construction). Then removed get_asset_by_workspace_path trait method + SqliteStore impl — 0 callers after the refactor. UPSERT doc-comment refreshed in store.rs to make the 'last-registered' semantics explicit so future readers don't try to 'fix' the flip-flop. Dogfood follow-up (PR #142 1B + multi-root corpus). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 08:03:38 +00:00
altair823	5f2bd9e97e	feat(dogfood): kebab reset --orphans-only — purge stored docs outside walker scope PR #148 auto-purges only filesystem-missing files (conservative — leaves on-disk-but-out-of-scope docs alone for data safety). This is the explicit complement: when the user has narrowed include / widened exclude / removed a sub-directory from the workspace and WANTS the stored docs reconciled, they invoke 'kebab reset --orphans-only'. Confirm prompt with orphan count + sample paths; --yes required in non-TTY. SQLite purge via existing purge_deleted_workspace_path (PR #148) + vector store delete_by_chunk_ids when configured. No fs existence check — orphans-only is the explicit 'I know what I'm doing' variant. dogfood follow-up to PR #148 (file deletion auto-purge). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 07:38:10 +00:00
altair823	27baec82ea	fix(dogfood): auto-purge stored docs for filesystem-deleted files Files deleted from disk (rm a.md) were leaving stale documents + chunks + embeddings in the store, surfacing as ghost citations in search/ask. Existing purge_orphan_at_workspace_path only handled content-changed stale (WHERE workspace_path=? AND asset_id != ?) — file deletion has no new asset_id. Fix: post-walker-scan sweep. Compute (stored_paths - scanned_paths), for each candidate check filesystem existence — only purge when the file is TRULY missing. Scope-narrowing case (file on disk but outside include glob) is explicitly NOT purged to protect users from accidental data loss via config edits. Adds: - DocumentStore::all_workspace_paths trait method + SqliteStore impl - purge_deleted_workspace_path in store-sqlite (returns chunk_ids for vector delete; deletes doc CASCADE + asset row + copied storage file) - sweep_deleted_files in kebab-app::ingest path; called once per ingest before the per-asset loop - IngestReport.purged_deleted_files counter (additive, serde default) - CLI ingest summary mentions purge count when > 0 - 2 integration tests: file_deletion_auto_purge + include_scope_narrowing_does_NOT_purge dogfood discovery (PR #142 1B + multi-root: kebab-docs + httpx + zod + lodash). Per user decision: only filesystem deletion auto-purges; scope narrowing requires explicit kebab reset. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 06:51:07 +00:00
altair823	641b92af7d	fix(dogfood): document-centric try_skip_unchanged for twin-file idempotency Identical-content files at different workspace paths share one assets row (assets.asset_id = blake3 content hash, PRIMARY KEY). The UPSERT `ON CONFLICT(asset_id) DO UPDATE SET workspace_path = excluded` made twin files overwrite each other's workspace_path on every ingest, so `get_asset_by_workspace_path(path1)` returned the OTHER twin's row (or None), breaking idempotent unchanged-detection for both files. Fix: switch try_skip_unchanged to document-centric lookup. `documents.workspace_path` is already UNIQUE (V001) and `id_for_doc(path, ...)` includes path, so each twin has its own stable document row. Compare `doc.source_asset_id` with the new asset's checksum instead of going through the assets table. Dogfood (multi-root: kebab-docs + httpx + zod + lodash) showed 27 of 726 docs marked Updated on every idempotent re-ingest; all 27 are twin-file victims (empty `__init__.py` ×3, AGENTS.md ↔ CLAUDE.md same content, duplicate logo PDFs/JPGs). After: re-ingest reports 0 new / 0 updated / 726 unchanged. No schema migration needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 05:27:21 +00:00
altair823	d53995a6d4	feat(p10-1b): code-js-ast-v1 chunker + activate JavaScript in app dispatch Chunker: duplicate-with-substitution from code-ts-ast-v1 / code-rust-ast-v1. Dispatch: replaces JS bail! arms with JavascriptAstExtractor + CodeJsAstV1Chunker. Integration test javascript_file_ingests_and_searches_as_code_citation asserts citation.lang=javascript, symbol=src/Bar.Bar.baz, code_lang=javascript. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:16:07 +00:00
altair823	31245a4328	fix(p10-1b): TS parser_version code-typescript-v1 → code-ts-v1 (naming consistency) Task H implementer chose code-typescript-v1 but plan + design §3.3 use the short form (chunker is code-ts-ast-v1 / code-js-ast-v1). Aligning parser versions to match: rust=code-rust-v1 / python=code-python-v1 / ts=code-ts-v1 / js=code-js-v1 (Task K). Fixes 2 sites: const PARSER_VERSION + integration test assertion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:05:17 +00:00
altair823	acb61b6830	feat(p10-1b): activate TypeScript in ingest_one_code_asset dispatch Replaces TS bail! arms with TypescriptAstExtractor + CodeTsAstV1Chunker. Adds typescript_file_ingests_and_searches_as_code_citation integration test — asserts citation.lang=typescript, symbol=src/Foo.Foo.bar, code_lang=typescript. JS arms remain bail!() (Task L). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 00:59:41 +00:00
altair823	1815091247	feat(p10-1b): activate Python in ingest_one_code_asset dispatch Replaces Python bail! arms with PythonAstExtractor + CodePythonAstV1Chunker. Adds python_file_ingests_and_searches_as_code_citation integration test — asserts citation.lang=python, symbol=kebab_eval.metrics.compute_mrr, code_lang=python. TS/JS arms remain bail!() (Tasks J/L). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 00:49:01 +00:00
altair823	b5d1fe8c1e	feat(p10-1a-2): backfill SearchHit.repo from doc metadata (Task 8b) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 21:13:01 +00:00
altair823	580576c2c6	feat(p10-1a-2): wire code ingest dispatch (ingest_one_code_asset) Add `MediaType::Code("rust")` dispatch arm in `ingest_one_asset`, `ingest_one_code_asset` fn (faithful mirror of `ingest_one_pdf_asset`), and `backfill_code_lang` post-processing in `App::search_uncached`. Integration test `code_ingest_smoke.rs` verifies full pipeline: ingest `.rs` → Citation::Code hit with lang/symbol/line_start. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 20:14:59 +00:00
th-kim0823	231d80e82d	feat(stats): media/lang/bytes/stale fields on schema.v1.stats (fb-37) Extends CountSummary with media_breakdown, lang_breakdown, stale_doc_count fields populated via stats_ext::breakdowns(). Adds count_summary_with_threshold for callers that need real stale counts. Mirrors all new fields onto the wire-bound Stats struct in kebab-app::schema with #[serde(default)] for backwards-compat. Also fixes search_budget_integration.rs for the trace field added to SearchOpts in Task 1. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-10 12:34:57 +09:00
th-kim0823	b86b763dfb	fix(fb-35): address PR #126 round 2 review - wire schema: relax effective_end.minimum 1 → 0 + expand description to cover line-clamp + out-of-range sentinel (panic-fix R1 emits Some(0) when line_start=1 and range is beyond doc end — schema must accept it) - tests: tighten first-chunk-target boundary test to assert ≤ 2 total neighbors (3-chunk doc, N=2). Strict "first chunk → context_before empty" not assertable until chunks.ordinal column lands (R1 #9 architectural caveat) - store: trim contradiction in list_chunk_ids_for_doc warning comment — drop "good enough for sequentially chunked markdown" phrase that conflicts with "hash sort dominates" paragraph above Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 00:55:29 +09:00
th-kim0823	7dddc1d706	fix(fb-35): address PR #126 round 1 review - fetch_span: panic-fix on line_start > total / empty doc (return empty text + effective_end = line_start - 1 instead of out-of-bounds slice) - truncated: reserved for budget-driven truncation only; line range clamp signaled via effective_end < line_end - spec / SKILL.md / README: align rejection wording to "PDF / audio" (matches code; Image OCR allowed for span) - store: warning comment on list_chunk_ids_for_doc — chunk_id hash sort does NOT preserve document position; real fix is a chunks.ordinal column, tracked as follow-up - surrounding_chunks: saturating_add to defend against u32::MAX context arg on 32-bit targets - tests: line_start > total returns empty + chunk context at doc boundary clamps lower bound Deferred nits (follow-up): table-separator strict CommonMark form; MCP per-mode strict validation; CLI chunk_id truncation in plain output. None block correctness. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 00:45:29 +09:00
th-kim0823	1b9d89eb3a	feat(app): App::fetch span mode + PDF/audio rejection (fb-35) Line-based slice over fmt_canonical_to_markdown output. PDF / audio source_type → span_not_supported StructuredError. Out-of-range line_end clamps to total; effective_end reflects post-budget trim. invalid_input on zero / inverted bounds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 23:54:22 +09:00
th-kim0823	7d1f855f7e	feat(app): App::fetch doc mode with budget (fb-35) Walks CanonicalDocument blocks, serializes to markdown, applies chars/4 budget when opts.max_tokens is set. doc_not_found preserved through StructuredError. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 23:48:40 +09:00
th-kim0823	610d29f053	feat(app): App::fetch chunk mode + markdown serializer (fb-35) Chunk mode + +-N context. doc / span modes return placeholder errors (filled by subsequent tasks). fmt_canonical_to_markdown helper introduced now since doc mode (Task 4) consumes it. Errors are typed StructuredError so classify preserves chunk_not_found / doc_not_found through the wire layer. Adds SqliteStore::list_chunk_ids_for_doc so the facade can derive +-N neighbors without leaking direct rusqlite usage into kebab-app. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 23:44:51 +09:00
th-kim0823	e084b306e5	fix(fb-34): align next_cursor semantics with docs (PR #125 round 2) Previous round-1 fix dropped the speculative cursor branch on the truncated path, leaving a contradiction with the docs: - snippet-only shrunk → cursor emitted (returned == k_effective) - k-popped → cursor null (returned < k_effective) But docs promised the opposite. R2 resolution: emit cursor whenever more hits may be reachable (either retriever filled the page OR budget popped hits — the popped ones remain fetchable from offset+returned). Drop the artificial "widen vs paginate" copy; truncated and next_cursor are now independent signals — caller may do either or both. Updates: app.rs::search_with_opts logic + SearchResponse doc + schema description + SKILL.md two bullets + max_tokens=0 test asserts cursor IS emitted on k-pop case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 21:07:04 +09:00
th-kim0823	f485608108	fix(fb-34): address PR #125 round 1 review - error_wire: StructuredError wrapper preserves ErrorV1 through anyhow → classify pipeline. Adds downcast short-circuit so cursor::decode's typed code = "stale_cursor" reaches the wire instead of being string-formatted to code = "generic". - app: search_with_opts now wraps cursor::decode error in StructuredError instead of anyhow! string format. - test: error_wire pins both negative (bare anyhow → not stale_cursor) AND positive (StructuredError → stale_cursor) invariants. CLI integration test runs end-to-end and asserts error.v1.code on stderr. - app: next_cursor only emitted on full-page (k-pop) path; drop speculative emit on snippet-only truncation that would point at a different page than the agent expected. - cursor: differentiate malformed-base64 / malformed-payload / revision-mismatch error messages; all keep code = stale_cursor. - test: cursor_rejected fixture uses .expect() to fail loud on cursor non-emission instead of silent skip. - test: max_tokens=0 → 1-hit floor + truncated=true. - docs: SKILL.md + schema description distinguish snippet-shrink (widen) vs k-pop (paginate) truncated cases. HOTFIXES notes --no-cache semantic shift (cached path + clear vs uncached path). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 20:49:27 +09:00
th-kim0823	af80cedd81	feat(app): App::search_with_opts + SearchResponse (fb-34) Budget loop: snippet shorten → k pop → ≥1 hit floor. Cursor encode/decode threads corpus_revision; mismatch surfaces as stale_cursor anyhow error. App::search retained as thin wrapper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 17:59:48 +09:00
th-kim0823	ebbc3a46ae	feat(app): cursor encode/decode for paginated search (fb-34) Opaque base64(JSON{offset, corpus_revision}). Mismatch or malformed input returns ErrorV1 with code = stale_cursor. base64 promoted to workspace dep. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 17:49:23 +09:00
th-kim0823	dfef65f196	feat(app): staleness module + post-process search hits (fb-32) compute_stale: strict > boundary, threshold=0 disables, future timestamps treated as fresh (clock skew safety). App::search re-stamps on cache hit so config threshold changes take effect without flushing the cache. Also unblocks the workspace build by plugging placeholder indexed_at/stale into the two AnswerCitation construction sites in kebab-rag/pipeline.rs (the score-gate refusal path forwards from SearchHit; the LLM-citation path uses UNIX_EPOCH/false until Task 7 wires the real values through pack_context). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 01:30:10 +09:00
th-kim0823	7f5739d8fb	🏗️ refactor(fb-31): apply round 1 review nits - ingest_file_with_config: lowercase normalize ext (caller-side) + early error on unsupported extension (`.docx` etc. now Err with helpful message instead of silent skipped_by_extension counter). New test ingest_file_errors_on_unsupported_extension. - ingest_stdin_with_config: doc comment explaining intentional double-call of ensure helpers (idempotent + ~ms negligible). - external::inject_frontmatter: simplify precheck via single trim_start binding + add CR-only line ending edge case. - external::inject_frontmatter: doc note on yaml_quote escape contract (agent-supplied titles with special chars are safe). Round 1 review summary: #111 (comment) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 18:46:55 +09:00
th-kim0823	a42f907640	🧪 test(kebab-app): ingest_stdin_with_config integration (fb-31) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 18:04:52 +09:00
th-kim0823	73ee64c73f	🧪 test(kebab-app): ingest_file_with_config integration (fb-31) Three scenarios — copies external md + reports new=1, idempotent on second call (unchanged=1), errors on missing path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 18:02:52 +09:00
th-kim0823	366b647a1a	✨ feat(kebab-app): capability flag mcp_server: false → true (fb-30) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 16:12:23 +09:00
th-kim0823	3e33daaa9b	🧪 test(kebab-app): assert chunk_count + asset_count in schema_report (fb-27) Plug coverage hole flagged in code review — test 1 was asserting only doc_count + last_ingest_at, leaving count_summary's chunk_count and asset_count queries un-pinned. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 12:06:04 +09:00
th-kim0823	ab96335174	🧪 test(kebab-app): schema_with_config integration coverage (fb-27) Two scenarios: freshly-ingested 2-doc KB (stats reflect counts + last_ingest_at populated) and empty-but-initialized KB (counts zero, last_ingest_at None). The empty case runs ingest_with_config over an empty workspace dir to seed kebab.sqlite before calling schema_with_config, since open_existing (used internally) returns NotIndexed if the DB is absent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 12:02:20 +09:00
altair823	9545367904	feat(kebab-app): p9-fb-25 task 5 — Skipped warnings + skipped_by_extension aggregation Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 12:13:13 +00:00
altair823	d64282433c	feat(kebab-app): p9-fb-25 task 3 — init_workspace header lists supported extensions Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 11:55:38 +00:00
altair823	ef5d0770ae	review(p9-fb-25-task1): fix kebab-app test references to removed WorkspaceCfg.include reviewer-flagged: task 1 missed test files using cfg.workspace.include. - crates/kebab-app/tests/common/mod.rs: SourceScope literal switched to ..Default::default(). - crates/kebab-app/tests/image_pipeline.rs (×3): drop dead-no-op cfg.workspace.include.push(...) calls; comment explains removal. - crates/kebab-app/tests/pdf_pipeline.rs: same treatment. Pre-fb-25 these pushes were no-ops (include was dead config field not enforced anywhere). Removal is purely mechanical. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 11:53:19 +00:00
altair823	0e6d6073e7	feat(kebab-app): p9-fb-23 task 7 — early-skip Unchanged path in ingest Adds the per-asset incremental-ingest skip block to all three flows (markdown / image / pdf). When `IngestOpts::force_reingest = false` AND the asset's blake3 checksum + parser/chunker/embedding versions all match the existing DB record, ingest emits `AssetFinished { result: Unchanged }`, bumps `aggregate.unchanged`, and skips parse / chunk / embed / vector upsert entirely. Shared `try_skip_unchanged` helper performs the four checks; per-flow callers supply the active parser_version + chunker_version + optional embedding_version. `force_reingest = true` bypasses the skip path so `incremental_ingest::force_reingest_bypasses_skip` still sees `Updated`. Tests: - new `incremental_ingest.rs` covers both paths. - existing `ingest_idempotent_on_second_run` / `re_ingest_image_produces_` / `re_ingest_identical_pdf_produces_` updated to assert `Unchanged` on identical-bytes re-ingest (the pre-task behaviour was `Updated`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:12:47 +00:00
altair823	4874304d5d	refactor(kebab-app): p9-fb-23 task 6 — IngestOpts struct + ingest_with_config_opts entry Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:04:50 +00:00
altair823	a16e9c9215	feat(kebab-app): p9-fb-23 task 5 — stamp chunker + embedding versions on CanonicalDocument before put_document All three ingest flows (markdown, image, pdf) now set last_chunker_version and last_embedding_version on the CanonicalDocument before calling put_document, giving Task 7's skip detection the data it needs on the second run. No skip path is added yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:01:48 +00:00
altair823	3f0b00439a	review(p9-fb-10-task5): promote lexical_query to common + tighten Korean hit assertion Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-03 10:14:17 +00:00
altair823	60e583252e	test(kebab-app): Korean query → FTS5 smoke pin p9-fb-10: verifies that a Korean (Hangul) token survives the ingest → FTS5 lexical search round-trip via the kebab-app facade. NFC normalization is wired upstream in kebab-normalize; this test only exercises end-to-end correctness — no AVX, no fastembed required. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-03 10:08:32 +00:00
altair823	0e408fb1b5	feat(kebab-app + kebab-store-sqlite): p9-fb-19 search LRU cache + corpus_revision Dogfooding item 15: eliminates the cost of recomputing SQLite FTS + Lance + RRF every time the same query is repeated in the TUI or within the same process. An in-process LRU cache plus a monotonic corpus_revision counter automatically marks every entry stale when an ingest commit occurs. ## Key changes - SQLite V004 migration: `kv (key TEXT PRIMARY KEY, value TEXT) STRICT` + `corpus_revision = '0'` seed. A generic shape so that other scalars can go into the same table in the future. - `SqliteStore::corpus_revision()` / `bump_corpus_revision()`: atomic via `UPDATE ... CAST AS INTEGER + 1`. INSERT-OR-IGNORE also runs alongside (paranoia for the case where the V004 seed is missing for some reason). - `kebab-app::ingest_with_config_cancellable`: bumps when `new + updated > 0`; a no-op (skipped-only) reingest preserves the cache. - `App.search_cache: Option<Mutex<LruCache<SearchCacheKey, Vec<SearchHit>>>>`: sized by `config.search.cache_capacity` (default 256, 0 disables). Adds `lru = "0.12"` as a workspace dep. - `SearchCacheKey` = `query_norm` (NFKC + trim + lowercase) + `mode` + `k` + `snippet_chars` + `embedding_version` (vector/hybrid only; empty string for lexical) + `chunker_version` + `corpus_revision` snapshot. - `App::search` rewrite: when the cache is enabled, lookup → on miss, call the existing `search_uncached`, then put. Straight-line when the cache is disabled or the lock fails. - `App::search_uncached` (rename of pre-fb-19 `search` body) + `search_uncached_with_config` facade: entered via CLI `kebab search --no-cache`. - `Config.search.cache_capacity: usize` field, compatible with existing configs via `#[serde(default)]`. - CLI `--no-cache` flag: for debugging (effectively a no-op because every CLI call is a new process, but it is specified in the spec and stays compatible with future long-lived processes). - Adds a `corpus_revision` row to the frozen design §9 versioning table (a different dimension from the existing `index_version` label: the label is the retrieval shape, corpus_revision is the ingest commit ack). ## Tests - `kebab-store-sqlite`: 3 new unit tests (fresh=0, monotonic bump, persist across reopen) - `kebab-app`: 4 new integration tests (cached repeat returns the same hits, NFKC normalization collapses case/whitespace, --no-cache parity, first ingest bumps corpus_revision) - Workspace-wide `cargo test --workspace --no-fail-fast -j 1` exit 0 - `cargo clippy --workspace --all-targets -- -D warnings` clean ## Docs - README `kebab search` row: cache behaviour + `--no-cache` guidance + corpus_revision invalidation mechanism - docs/SMOKE.md: adds a `cache_capacity` line to the `[search]` section - HANDOFF: 2026-05-03 entry - spec status planned → in_progress ## Out of scope - patch-and-merge incremental (hard because RRF normalization is based on the entire hit set) - SQLite persistent cache (P+) - cache sharing across processes (in-process only; cross-process invalidation via corpus_revision is O(1)) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 05:01:31 +00:00
altair823	2c058ab175	feat(rag): multi-turn ask: Turn struct + ask_with_history + token budget (p9-fb-15) Implements the §3.8 multi-turn behaviour from spec PR #59. The RAG facade accepts prior turns and prepends them to the prompt, applies retrieval query expansion, and fills conversation_id / turn_index on the Answer. New (kebab-core): - Added conversation_id (Option<String>) / turn_index (Option<u32>) fields to Answer. With serde skip_serializing_if, single-shot wire output is unchanged (no impact on existing external wrappers). - Turn struct (question + answer + citations + created_at). - RefusalReason::LlmStreamAborted variant. New (kebab-rag): - AskOpts gains 3 fields: history (Vec<Turn>) / conversation_id / turn_index. - AskOpts::single_shot(mode) helper. - RagPipeline::ask_with_history(query, history, conversation_id, turn_index, opts): calls ask with the combined opts. - expand_query_with_history: concatenates the first 200 characters of the answer from history.last() to expand SearchQuery.text (the "cheap concat" of spec §3.8; LLM-based standalone-question rewriting is P+). - serialize_history + remaining_history_budget_chars: the spec's priority enforcement. System + question are mandatory; within the char budget left after the retrieved chunks, the newest turns are kept first and the oldest are dropped. - ask body: when history is non-empty, a [이전 대화] ("previous conversation") block is prepended above the user prompt. All 3 Answer construction sites (normal / NoChunks / ScoreGate refuse) fill conversation_id / turn_index. New (kebab-store-sqlite): - refusal_reason_label maps LlmStreamAborted → 'llm_stream_aborted'. Existing callers changed (single-shot behaviour identical): - kebab-cli main.rs Cmd::Ask: AskOpts explicitly sets history=Vec::new(), conversation_id=None, turn_index=None (CLI multi-turn is filled in by --session/--repl in p9-fb-18). - kebab-tui src/ask.rs spawn site: same (multi-turn UI is p9-fb-16). - kebab-eval runner.rs golden eval: same (single-shot per query). - kebab-app tests/ask_smoke.rs / kebab-tui tests/ask.rs / kebab-rag tests/pipeline.rs / kebab-eval metrics.rs: Answer literals updated. Test: - 9 new lib unit tests (expand_query 4 / serialize_history 3 / remaining_budget 2). - Existing 12 PASS, 0 regressions. Plan update: - p9-fb-15 status planned → in_progress. Flipped to completed in a one-line commit after merge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 23:09:46 +00:00
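The two history helpers named in the entry above might look something like this. It is a sketch under stated assumptions, not the kebab-rag code: the `Turn` here carries only question and answer (the real struct also has citations and created_at), the `Q:`/`A:` line format and the space-joined concat order are invented for illustration, and the budget is passed in directly rather than derived from retrieved chunks.

```rust
/// Trimmed-down stand-in for the real Turn struct (assumption).
#[derive(Debug, Clone)]
pub struct Turn {
    pub question: String,
    pub answer: String,
}

/// "Cheap concat" expansion: append the first 200 characters of the most
/// recent answer to the query text. With no usable history, the query is
/// returned unchanged.
pub fn expand_query_with_history(query: &str, history: &[Turn]) -> String {
    match history.last() {
        Some(turn) if !turn.answer.is_empty() => {
            // chars() keeps the cut on a character boundary, which matters
            // for non-ASCII answers.
            let prefix: String = turn.answer.chars().take(200).collect();
            format!("{query} {prefix}")
        }
        _ => query.to_string(),
    }
}

/// Serialize as many turns as fit into `budget_chars`, preferring the newest
/// turns and dropping the oldest. Output stays in chronological order.
pub fn serialize_history(history: &[Turn], budget_chars: usize) -> String {
    let mut kept: Vec<String> = Vec::new();
    let mut used = 0;
    // Walk newest to oldest so the oldest turns are the first to be dropped.
    for turn in history.iter().rev() {
        let line = format!("Q: {}\nA: {}\n", turn.question, turn.answer);
        let len = line.chars().count();
        if used + len > budget_chars {
            break;
        }
        used += len;
        kept.push(line);
    }
    kept.reverse();
    kept.concat()
}
```

Stopping at the first turn that does not fit (rather than skipping it and trying older ones) keeps the retained history contiguous, which seems closer to the "newest first, oldest dropped" wording.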
altair823	6260df5b30	review(round 1): make SIGNAL_COUNT lifetime explicit + cancel-mid race comment Addresses 2 actionable items from review round 1; 1 item (CLI Ctrl-C integration test) is deferred from this PR to a separate task (signal handler subprocess tests risk flakiness, and facade 3 PASS + tui lib 3 PASS already cover the stable surface). - cancel.rs::install_sigint_cancel: added a process-lifetime invariant comment above SIGNAL_COUNT. No reset is needed because multi-install is blocked (ctrlc::set_handler). If multiple callers ever need to share the same cancel token, the install function will have to be split. - ingest_cancel.rs::cancel_mid_loop: removed the redundant `report.new == 1 || 0 || 2` and replaced it with a comment on the race-timing intent (0 = listener won, 1 = first only, 2 = extra slipped in are all valid; 3 = cancel never propagated = the only failure). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 21:39:39 +00:00
altair823	fa02a7c68d	feat: ingest cooperative cancellation (p9-fb-04) Ctrl-C / Esc stops ingest right away: the current in-flight asset finishes, later assets are not run, IngestEvent::Aborted { partial_counts } is emitted, and Ok(IngestReport) is returned normally (not Err). Partial commits are preserved and the next ingest resumes idempotently. New facade: kebab-app::ingest_with_config_cancellable(.., progress, cancel: Option<Arc<AtomicBool>>). The existing _progress variant is a forwarding wrapper with cancel=None. An atomic load runs at each asset-loop start boundary: if true, break + emit Aborted + exit normally. No locks. CLI: new ctrlc crate dep. The SIGINT handler does cancel.store(true) + a stderr hint on the first signal, and std::process::exit(130) (canonical SIGINT exit code) on the second. The install_sigint_cancel() helper returns an Arc<AtomicBool>, which Cmd::Ingest passes to the facade. TUI: added a cancel: Arc<AtomicBool> field to IngestState (the exact reshape from the round 1 review). start_ingest creates both and moves clones into the worker. cancel_running_ingest(&app) helper: Esc / Ctrl-C prioritizes cancel only while an ingest is in progress, otherwise it quits. Test: - 3 facade integration tests (cancel-before / cancel-mid / no-cancel default). - 3 tui lib unit tests (cancel_running_ingest no-state / in-flight / terminated). Plan update: p9-fb-04 status planned → in_progress. Flipped to completed in a one-line commit after merge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 21:36:17 +00:00
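The boundary check described in the entry above (one atomic load per asset, no locks, abort reported as a normal return) can be illustrated with a small stand-alone loop. The names `ingest_cancellable`, `Outcome`, and `PartialCounts` are hypothetical; the real facade returns Ok(IngestReport) and emits IngestEvent::Aborted instead of returning an enum.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the partial counts reported on abort.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct PartialCounts {
    pub processed: usize,
}

/// Hypothetical outcome type; both variants model a successful return.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Completed(PartialCounts),
    Aborted(PartialCounts),
}

/// Processes assets in order, checking the cancel flag once at the start of
/// each iteration. An asset that has already started always runs to the end.
pub fn ingest_cancellable<F>(
    assets: &[&str],
    cancel: Option<Arc<AtomicBool>>,
    mut process: F,
) -> Outcome
where
    F: FnMut(&str),
{
    let mut counts = PartialCounts::default();
    for asset in assets {
        // Asset-boundary check: a single atomic load, no lock taken.
        if let Some(flag) = &cancel {
            if flag.load(Ordering::SeqCst) {
                return Outcome::Aborted(counts);
            }
        }
        process(asset);
        counts.processed += 1;
    }
    Outcome::Completed(counts)
}
```

Passing `None` for `cancel` reproduces the non-cancellable behaviour, which is how a forwarding wrapper could reuse the same loop.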
altair823	eb331f9b29	feat(app): add IngestEvent + ingest_with_config_progress (p9-fb-01) Streaming progress channel for ingest. Facade emits one IngestEvent per step boundary into an optional `mpsc::Sender<IngestEvent>` injected by the caller. CLI (p9-fb-02), TUI (p9-fb-03), and future desktop UI all consume the same stream. New: - crates/kebab-app/src/ingest_progress.rs: `IngestEvent` enum (`#[serde(tag = "kind", rename_all = "snake_case")]` matching wire schema ingest_progress.v1) + `AggregateCounts` struct + `media_label` helper + best-effort `emit` helper. - ingest_with_config_progress(cfg, scope, summary_only, progress): when a `mpsc::Sender<IngestEvent>` is present, emits ScanStarted → ScanCompleted → (AssetStarted < AssetFinished)* → Completed into it. A dropped receiver is silently absorbed (the hot path must not stall). - The existing ingest_with_config becomes a forwarding wrapper with `progress=None`. Not implemented (to be filled in by future tasks per the contract): - IngestEvent::Aborted: cancel token wiring is p9-fb-04. - embed_batch_started / embed_batch_finished: these fall under the spec's "anywhere between asset events". Simplified for v1: asset-level resolution is enough for the CLI / TUI. Test: - 6 lib unit tests (media_label / serde discriminator / emit corner cases). - 3 integration tests (event sequence obeys the §2.4a invariant / forwarding wrapper / dropped receiver tolerance). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 19:44:34 +00:00
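A best-effort emit helper of the kind described in the entry above might be sketched like this. The variant payloads (`total`, `path`) and the `run_ingest` driver are assumptions made for illustration; the real enum also carries serde attributes, which are omitted here to keep the example on the standard library only.

```rust
use std::sync::mpsc::Sender;

/// Simplified event enum; payload fields are illustrative assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum IngestEvent {
    ScanStarted,
    ScanCompleted { total: usize },
    AssetStarted { path: String },
    AssetFinished { path: String },
    Completed,
}

/// Best-effort emit: does nothing without a sender and ignores send errors,
/// so a dropped receiver never stalls or fails the ingest loop.
pub fn emit(progress: Option<&Sender<IngestEvent>>, event: IngestEvent) {
    if let Some(tx) = progress {
        // send() only fails when the receiver is gone; that is tolerated.
        let _ = tx.send(event);
    }
}

/// Hypothetical driver showing the event order around each asset.
/// Returns the number of assets handled.
pub fn run_ingest(assets: &[&str], progress: Option<&Sender<IngestEvent>>) -> usize {
    emit(progress, IngestEvent::ScanStarted);
    emit(progress, IngestEvent::ScanCompleted { total: assets.len() });
    let mut done = 0;
    for path in assets {
        emit(progress, IngestEvent::AssetStarted { path: path.to_string() });
        done += 1; // real per-asset work would happen here
        emit(progress, IngestEvent::AssetFinished { path: path.to_string() });
    }
    emit(progress, IngestEvent::Completed);
    done
}
```

Because `std::sync::mpsc::channel` is unbounded, `send` never blocks on a slow consumer, which fits the "hot path must not stall" requirement in this sketch.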
altair823	3a57cab1eb	fix(kebab-store-sqlite): purge stale assets row on workspace_path orphan + smoke Fixes a storage-layer bug exposed by the P7-3 integration tests. There was a gap between the UNIQUE constraint on `assets.workspace_path` and `upsert_asset_row`, which handled only `ON CONFLICT(asset_id)`: re-ingesting an asset whose bytes changed produces a new asset_id that hits the secondary UNIQUE conflict on the same workspace_path. md / image / pdf are all affected. Fix: - New helper `purge_orphan_at_workspace_path` sweeps documents → assets, in that order, when it finds a different `asset_id` at the same `workspace_path`. This sidesteps ON DELETE RESTRICT on documents and cleans up blocks / chunks / embedding_records via CASCADE. In copied mode the byte file at storage_path is also deleted best-effort. - Called from both branches of `put_asset_with_bytes` (copy / reference) and from `DocumentStore::put_asset`. - Regression test `put_asset_with_bytes_sweeps_workspace_path_orphan` (replaces the earlier "clean up orphan on UPSERT failure" test, which is no longer doable). - Removed `#[ignore]` from the `re_ingest_edited_pdf_produces_new_doc_id` integration test → all 9 integration tests now pass by default. The vector store orphan is a separate P+ task: LanceDB operates independently of the SQLite cascade, so stale chunk_id vectors remain on disk. Search is unaffected (search surfaces results through a SQLite join). Smoke verification (release binary, markdown 2 + image 1 + PDF 2): - doctor pass - first ingest: 5 new - list docs: 5 docs all media types - search lexical "pdf-page-v1 chunker" → whitepaper.pdf hit - search hybrid → cross-media results - inspect doc PDF: parser_version=pdf-text-v1, blocks are SourceSpan::Page - re-ingest of identical bytes: 5 updated, 0 errors (P1 idempotency) - re-ingest after a byte edit: 1 new (that PDF) + 4 updated, 0 errors (storage fix) - corrupt PDF added: errors+=1 + accurate IngestItem.error message, 0 impact on other assets - ingest again after cleanup: errors=0 - RAG ask: PDF citation + `citations[].citation` correctly exposes `kind: "page"` + `page: <N>` + `path: <pdf_path>` Operational fixture helpers: - `crates/kebab-parse-pdf/examples/gen_smoke_pdf.rs`: generates an in-tree PDF without reportlab/qpdf via `cargo run --release --example gen_smoke_pdf -p kebab-parse-pdf -- <out.pdf> <text-pages>`. - `crates/kebab-parse-image/examples/gen_smoke_png.rs`: generates a PNG fixture the same way. - SMOKE.md reflects the usage of both examples + the updated HOTFIXES behaviour (on byte edit, errors+=1 → new+=1). The HOTFIXES `2026-05-02 P7-3` entry is updated from "deferred" → "fixed in same PR"; only the vector store orphan caveat remains. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:41:23 +00:00


57 Commits