kebab

Author	SHA1	Message	Date
altair823	e03d03cb26	test: delete alias-only tests + update affected tests/fixtures Remove the alias-channel tests and the insert_chunk_with_aliases helper from kebab-search/tests/lexical.rs (replaced by a body-recall regression test). Remove aliases: None from Chunk literals (embedding_records_fk/idempotency/inspect). Remove the aliases key from the chunk snapshot fixture. config_migrate now anchors on ingest.code; the corpus_revision/search_lexical comments now state explicitly that V013 does not bump. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-02 21:37:58 +00:00
altair823	ecaf224381	refactor(chunk): remove aliases literals from Chunk construction sites + store column Delete the aliases: None literal from the kebab-chunk/* AST, md, tier2, and pdf chunkers; remove aliases from the chunks INSERT columns/bindings and the get_chunk mapping in store-sqlite documents.rs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-02 21:36:44 +00:00
altair823	b1c5feb3f3	refactor(core): remove Chunk.aliases field Removes doc-side expansion (aliases): drops the aliases: Option<String> field on Chunk and its serde default test. Metadata.aliases (Vec, document metadata) is kept. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-02 21:36:44 +00:00
altair823	e9b520216e	fix(expansion): per-alias sentinel orphan cleanup + cache robustness (PR #195 review) MAJOR: the chunk_id of alias dense vectors changed from the legacy single `{id}#alias` to per-line `{id}#alias#0`, `#alias#1`, …, but orphan cleanup deleted only the single sentinel, so `#alias#N` vectors leaked in LanceDB / embedding_records. - kebab-app: add the `alias_sentinel_ids_to_delete` helper (approach A): the delete-set includes the body id + legacy `{id}#alias` + `{id}#alias#0`..`{id}#alias#{max-1}`. max matches expansion.max_aliases_per_chunk (= the hard cap in parse_aliases). All three LanceDB cleanup paths (parser-bump / edited-asset / deleted-file) use this helper. - kebab-store-sqlite: switch the 4 explicit embedding_records DELETE paths (put_chunks / purge_*_except_doc_id / purge_orphan_at_workspace_path / purge_deleted_workspace_path) from an exact match (`|| '#alias'`) to a `{id}#alias%` prefix LIKE. Body chunk_ids are 32-char hex, so they contain no LIKE wildcards. MINOR 1: on an alias cache hit, demote a non-UTF8 payload to a miss (falls into the regeneration branch), matching the decode-failure→miss demotion on the embedding path. MINOR 2: prepend a kind token ("doc") to the embedding version_key; the embedder adds a per-kind prefix, so this prevents collisions if query embeddings ever use the same cache. Regression tests: - kebab-app: 2 unit tests for alias_sentinel_ids_to_delete. - kebab-store-sqlite: 3 integration tests pinning that per-alias sentinel embedding_records disappear on all three cleanup paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 09:14:34 +00:00
altair823	a8fd76499c	feat(expansion): doc-side expansion per-alias dense vectors + derived-artifact cache (V012) Index aliases as one dense vector per line (sentinel `{chunk}#alias#N`) and skip alias generation for boilerplate chunks. The bundled single-vector approach is dropped: averaging diluted specific phrasings and actually regressed (13/18). Variant consistency 14/18 → 16/18, mean_spread@10 0.222 → 0.111 (Namuwiki ~1000-document CS corpus). `kebab-core::strip_alias_suffix` handles both the suffix form and the per-alias form. Derived-artifact cache (V012): caches embedding vectors + alias LLM results keyed by chunk content hash, so re-indexing skips recomputation for chunks whose content is unchanged. cache_key = blake3(kind ‖ text_blake3 ‖ version_key)[:32]; version_key includes model/prompt/dimensions → consistent with the §9 cascade (a version bump misses automatically). Measured: 3 answers, cold 1879s → warm 13s ≈ 145x. Purely additive, so no corpus_revision bump. search/ask run on kebab.sqlite + lancedb alone → enables a portable workflow of indexing on an external server and copying only the DB. Workspace version bumped 0.20.2 → 0.21.0 (minor) for the V012 schema migration + new surface. README/HANDOFF/ARCHITECTURE/HOTFIXES synced. Known limitation: 2 descriptive-type cases (stack, svm) remain + the grounded check misclassifies partial citations as grounded (follow-up candidate). Measurement details: docs/superpowers/handoffs/2026-05-31-namu-wiki-alias-cache-study.md Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 08:24:04 +00:00
altair823	0282a81c67	fix(store): fourth CASCADE-replacement path + restore V011 CHECK (Task 4.5 review) Review MAJOR: purge_document_at_workspace_path_except_doc_id (the parser-bump path) was missing the explicit DELETE of original + sentinel embedding_records → tombstones accumulated. Added, with a regression test. MINOR: restore the V011 status CHECK (pending/committed/tombstone). NIT: comment that the foreign_keys PRAGMA is a no-op. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 14:02:46 +00:00
altair823	f3587b7143	feat(store): filter_chunks strips sentinel alias candidates (passes the committed check) A LanceDB candidate's sentinel chunk_id ({orig}#alias) drops out of the chunks JOIN and disappears before the VectorRetriever strip. Each candidate is stripped to its original chunk_id with kebab_core::strip_alias_suffix before it goes into the IN-list/JOIN (the committed check is based on the original body chunk) so that it passes, but candidates are returned in their input form (sentinel kept); VectorRetriever receives that sentinel and does strip+dedup. Chose (b) Rust strip over SQL replace (clearer). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 13:41:28 +00:00
altair823	483b1ec06b	feat(store): V011 remove embedding_records FK + explicit DELETE replacing CASCADE (sentinel alias vectors) Indexing alias dense vectors under a sentinel chunk_id ({orig}#alias) requires embedding_records to hold chunk_ids that do not exist in chunks. The V001 FK (chunk_id REFERENCES chunks ON DELETE CASCADE) blocks this with SQLite 787, so the table is recreated without the FK. status/vector_committed (V003) + the 3 indexes are preserved; the chunks_bd_tombstone_embeddings trigger is unmodified. legacy_alter_table=ON avoids re-parsing the dangling trigger during DROP→RENAME. The lost CASCADE is replaced by explicit DELETEs in put_chunks + the two purge paths (purge_orphan_at_workspace_path, purge_deleted_workspace_path): just before chunks are deleted, the original + {id}#alias sentinel embedding_records are cleaned up together. corpus_revision baseline 2→3. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 13:41:20 +00:00
altair823	0df47febf0	test(store): doc-side expansion Task 2 review hardening (M1/M2/N1) - M1: add AND aliases <> '' to the chunk_aliases trigger guard (empty strings are not indexed) - M2: re-index idempotency test (1 alias row after a re-put) - N1: negative assertion for body isolation (alias terms do not leak into chunks_fts) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 02:24:24 +00:00
altair823	b12a616ab2	feat(store): V010 chunk_aliases_fts + alias persistence in put_chunks Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 02:15:27 +00:00
altair823	848b75c069	feat(core): Chunk.aliases 필드 (doc-side expansion) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-30 02:09:39 +00:00
altair823	21b52bc285	style: cargo fmt --all (S3+S4+S5+S7 follow-up) Formatting cleanup for the V009 morphological tokenizer work (S3 chunk + S4 backfill + S5 short_query_hint removal + S7 new tests). No behavior change. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S11)	2026-05-28 12:06:01 +00:00
altair823	c5de5f812b	test(fts,app): V009 morphological tokenizer integration tests Add 4 new tests: - crates/kebab-store-sqlite/tests/fts.rs: - fts_v009_korean_morphological_2char_query_hits: the 2-char query '한국' hits a chunk whose tokenized_korean_text column is populated. - fts_v009_english_whole_token_only: regression of V007 trigram substring matching (Path A): the query 'token' gets 0 hits on a 'tokenizer' chunk. - crates/kebab-app/tests/search_korean.rs: - korean_morphological_2char_query_lexical_mode: end-to-end ingest of a Korean wiki fixture → the queries '한국' / '서울' hit. - korean_morphological_mixed_english_korean_query: the English whole token 'Rust' + the Korean morpheme '최적화' hit. crates/kebab-search/src/lexical.rs: - build_match_string(): MIN_TRIGRAM_CHARS(3) → MIN_QUERY_CHARS(2). V009 unicode61 has no minimum token length, so 2-char Korean morpheme queries must pass. A lone 1-char query is still filtered. - 2 related unit tests updated to V009 behavior. The fixture text depends on the actual segmentation behavior of lindera ko-dic (spec Appendix B prior-knowledge prediction). Fixtures may be adjusted after measurement. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §9.1, §9.2 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S7) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 11:38:52 +00:00
altair823	b63af20b72	feat(app): first-boot eager backfill for tokenized_korean_text On a V007 → V009 upgrade, tokenized_korean_text is NULL for existing chunks; the first App::open_with_config call automatically segments them with lindera ko-dic and UPDATEs. The chunks_au trigger re-indexes chunks_fts automatically. No user re-ingest needed. - crates/kebab-store-sqlite/src/store.rs: backfill_tokenized_korean_text(progress_cb, tokenize) API. Commits + calls the progress callback every 1000 rows. Idempotent (the IS NULL filter makes re-running after partial completion safe). Takes the tokenizer as a parameter to keep the §8 dep boundary. - crates/kebab-app/src/app.rs::open_with_config: call the backfill right after run_migrations. On failure, only a warn log (App open still succeeds; vector/hybrid mode keep working). Info-log progress every 500 rows. - crates/kebab-store-sqlite/tests/fts.rs: backfill_tokenized_korean_text_populates_nullable_rows unit test (covers idempotency). - Fix pre-existing clippy errors (redundant_closure, map_unwrap_or, cast_lossless, uninlined_format_args in kebab-app/ingest_log.rs, pdf_ocr_apply.rs, app.rs, tests/ocr_inspect_smoke.rs). Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §8.1, §8.2 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S4) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 11:01:00 +00:00
altair823	4b4a8cbb3a	fix(test): retire V007 trigram-only fts tests, add V009 unicode61 sanity The 3 tests that verified V007 trigram substring matching are obsolete, because V009 unicode61 introduces an intended regression (spec §3 Non-Goals Path A): - fts_trigram_korean_3char_substring_hits: the '발생한' → '발생한다' hit relies on trigram substring matching, so it fails under V009 whole-token matching. - fts_trigram_korean_short_query_zero_hit_pinned: the 0-hit behavior for 2-char Korean queries is resolved by the V009 morpheme column, so the pin itself is obsolete (S7 replaces it with a new 2-char hit test). - fts_trigram_english_substring_hits: the 'token' → 'tokenizer' hit fails under V009 unicode61 whole-token-only matching. Newly added: - fts_v009_unicode61_space_separated_korean_token_hits: sanity check of V009 unicode61 whole-token matching (token '충돌은' hits, substring '발생한' gets 0 hits). A baseline separate from the morphological verification tests that S7 will add. S7 (plan §2 Step 7) will add v009_korean_morphological_2char_query_hits + v009_english_whole_token_only to pin both the regression and the new behavior. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §3, §9.2 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 follow-up)	2026-05-28 10:36:32 +00:00
altair823	4dc1c10be1	fix(test): update corpus_revision baseline to post-V009 migration V009 migration bumps corpus_revision by 1 at apply time (spec §5.2 — invalidates pre-V009 LRU search cache). Existing tests assumed V004 seed (0) was the final baseline; updated to expect 1 after migration: - fresh_store_starts_at_zero → fresh_store_starts_at_post_migration_baseline - bump_increments_monotonically: expected 1,2,3 → 2,3,4 (post-baseline) - revision_persists_across_reopen: 2 → 3 (V009 +1, +bump,+bump) Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §5.2 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 follow-up)	2026-05-28 10:33:50 +00:00
altair823	bd86f61c9c	fix(chunk): close S3 reviewer blockers: get_chunk read + AST chunker cascade The S3 spec compliance reviewer (sonnet) found 2 blockers: 1. crates/kebab-store-sqlite/src/documents.rs: the get_chunk SELECT did not read the tokenized_korean_text column → the DB value was lost on read. Added it to the SELECT column list + the row.get index in the row → Chunk conversion. Cascades through the ChunkRow struct + chunk_row_from_sql + Chunk construction in get_chunk. 2. crates/kebab-chunk/src/code_*_ast_v1.rs (9 files): make_chunk hard-coded tokenized_korean_text: None → code files with Korean comments got no FTS hits. Cascaded a tokenize_korean_morphological(text) call, same pattern as tier2_shared. This commit is a rework of S3: a separate commit rather than an amend (keeps the S3 boundary). Satisfies the spec §6.2 invariant ("every chunker calls tokenize right before emitting a chunk"). Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 rework) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 10:30:53 +00:00
altair823	b134ae9dd5	feat(chunk): integrate lindera korean morphological tokenizer Segment the morpheme sequence that goes into V009's tokenized_korean_text column with lindera ko-dic. Called right after chunk creation in the chunk builder pipeline → pre-fills the field on the chunk struct → the store's put_chunks INSERTs within a single transaction. - crates/kebab-core/src/chunk.rs: add the tokenized_korean_text: Option<String> field (#[serde(default)]) to the Chunk struct. - crates/kebab-chunk/src/lib.rs: tokenize_korean_morphological() helper + OnceLock caching + fallback (None) policy. - crates/kebab-chunk/Cargo.toml: add lindera features = ["embed-ko-dic"] (required to enable DictionaryKind::KoDic). - All chunkers (tier2_shared, md_heading_v1, pdf_page_v1, 9 code AST v1): pre-fill tokenized_korean_text in the Chunk literal. - crates/kebab-store-sqlite/src/documents.rs::put_chunks: update the INSERT SQL column list + placeholders + bindings (12th column). - crates/kebab-chunk/tests/tokenize_korean.rs: 2 unit tests. lindera 3.0.7 API corrections: load_dictionary_from_kind → load_embedded_dictionary, Token.text → Token.surface. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 10:22:15 +00:00
altair823	b106120e93	feat(fts): add V009 korean morphological tokenizer migration Add the V009 migration to resolve the V007 trigram tokenizer's 0-hit limitation for 2-char Korean queries (Bug #8). Reverts to the unicode61 tokenizer + pre-fills Korean morphological segmentation results into a separate column, `tokenized_korean_text`. - migrations/V009__fts_korean_morphological.sql (new): column ADD, chunks_fts DROP + redefinition, CASE expressions in 3 triggers, backfill INSERT, corpus_revision bump. - design §5.5 updated: trigram → unicode61 + morpheme column. Trigger bodies use CASE expressions. - crates/kebab-store-sqlite/tests/fts.rs: rename the V007 verbatim test to use V009 as the source of truth. Add the v009_bumps_corpus_revision unit test. - store.rs: fix existing clippy bool_to_int_with_if + cast_lossless warnings (pdf_ocr_events-related code, found during S1 work). English substring matching regresses to V002 behavior (whole-token only); this is stated plainly in spec §3 Non-Goals + the upcoming release notes (v0.20.1). Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S1) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 09:48:46 +00:00
altair823	9a36a06f97	style: cargo fmt --all (v0.20.x logging r2 feature follow-up) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 06:34:01 +00:00
altair823	6482bf1321	feat(store): V008 pdf_ocr_events migration + record/prune API (Enhancement 2) Add migrations/V008__pdf_ocr_events.sql with the events table + 3 indices (doc_id, run_id, ts). SqliteStore gains two pub fn: record_pdf_ocr_event (insert one OCR sample) and prune_pdf_ocr_events (delete rows older than retention_days; returns the affected row count). Both follow the existing Mutex<Connection> lock pattern. Wiring into ingest path lands in the next commit. Closure r1 F2: explicit lock acquisition in both methods. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 05:56:54 +00:00
altair823	685007789a	style: cargo fmt --all (round 4 ingest log feature follow-up) The Phase C4 executor's last `fix(test): clippy + fmt fixes` commit applied fmt only to the test files. Workspace-wide fmt was found to be missing → applied cargo fmt --all. All imports reordered alphabetically + line wrapping aligned. Untracked artifacts committed at the same time: - docs/superpowers/specs/2026-05-28-v0.20-ingest-log-spec.md (491 lines, ACCEPT) - docs/superpowers/plans/2026-05-28-v0.20-ingest-log-plan.md (616 lines, ACCEPT) workspace test: 1370 passed / 0 failed / 50 ignored, ingest_log_smoke green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 04:18:40 +00:00
altair823	d9c7aabce1	feat(schema): add active_parsers + active_chunkers arrays to schema.v1.models (Bug #13) Before: schema.v1.models reported only single parser_version / chunker_version values → risk of gaps in the version cascade audit for a multi-medium corpus (md + pdf + code Rust/Python + dockerfile + k8s + manifest). After: additive minor change that adds active_parsers + active_chunkers Vec<String> to the Models struct. Backward compat: the existing single fields are preserved (markdown default); the new arrays are optional (#[serde(default)] + not in the JSON schema's required list). Source: - kebab_store_sqlite::fetch_distinct_parser_versions() returns documents.parser_version DISTINCT + ORDER BY. - fetch_distinct_chunker_versions() follows the same pattern for chunks.chunker_version. - collect_models recomputes on every schema call (no cache, which resolves R-3 automatically). The wire schema change is additive only; no major bump needed. A v0.20.1 minor is sufficient. integrations/claude-code/kebab/SKILL.md updated in sync. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 23:15:58 +00:00
altair823	b9ee09f176	feat(app): wire PDF OCR enrichment + cancel propagation into ingest_one_pdf_asset (H-5 eager init + post-extract hook + per-page cancel) + workspace lopdf dep (Step 4 M-4) Step 6 (Group E) of v0.20.0 sub-item 1 (scanned PDF OCR) plan + Step 7 spillover (IngestEvent variant + IngestItem field for compile boundary) + Step 4 reviewer Minor M-4 fix. E1: eager PDF OCR engine build at `ingest_with_config_opts` entry, mirror of image OCR pattern (lib.rs:338-347). When `pdf.ocr.enabled || always_on`, call `OllamaVisionOcr::from_parts(endpoint, model, ...)` + fail-fast `?`. 0 App fields added (local var only, consistent with spec L-1 / Step 1 A1 cosmetic fix). E2: `ingest_one_pdf_asset` signature extension: +3 params (`pdf_ocr_engine: Option<&OllamaVisionOcr>`, `progress: Option<&mpsc::Sender<IngestEvent>>`, `cancel: Option<&Arc<AtomicBool>>`). `ingest_one_asset` dispatch wrapper + caller (dispatch loop) update. E3: post-extract enrichment block right after `extract_for` (line 1779). When `pdf.ocr.enabled || always_on`, call `apply_ocr_to_pdf_pages`, emit PdfOcrProgress → IngestEvent (PdfOcrStarted / PdfOcrFinished with ocr_engine), and carry the summary's pages_ocrd/ms_total in IngestItem fields. Preserves the PR #187 registry dispatch invariant (`extract_for(&asset.media_type, ...)` unchanged). E4: cancel handle propagation: ingest_with_config_cancellable → IngestOpts.cancel → ingest_with_config_opts → ingest_one_asset → ingest_one_pdf_asset (new `cancel` param) → PdfOcrOpts.cancel chain. spec §4.8 line 1159 production wiring. Step 7 spillover (compile boundary): - `kebab_app::ingest_progress::IngestEvent`: PdfOcrStarted { page } + PdfOcrFinished { page, ms, chars, ocr_engine }. serde discriminant `pdf_ocr_started` / `pdf_ocr_finished` (matches the Step 7 G3 wire schema). - `kebab_core::IngestItem`: pdf_ocr_pages: Option<u32> + pdf_ocr_ms_total: Option<u64> (between warnings/error). The 11 non-PDF IngestItem construction sites fill in `None`. 
- `kebab-cli/src/progress.rs` + `kebab-tui/src/ingest_progress.rs`: no-op handlers for the new variants (per-page progress is not exposed in v1; can be enabled in a future refinement). - `kebab-store-sqlite/tests/ingest_report_snapshot.rs` + snapshot `ingest_report.snapshot.json`: add the new fields to the 2 IngestItem fixtures. - The Step 7 JSON Schema update + CLI printer activation + snapshot regeneration land in a separate commit (G3/H1/H2 deliverable). M-4 (Step 4 reviewer Minor): lopdf workspace dep consolidation: - workspace `Cargo.toml [workspace.dependencies] lopdf = "0.32"`. - kebab-app + kebab-parse-pdf direct deps → `{ workspace = true }`. Verifier evidence: - workspace test (`cargo test --workspace --no-fail-fast -j 1`): 175 test result summary lines, 0 failures, 0 FAILED. - workspace clippy (`-D warnings`): exit 0, 0 warning. - dep graph baseline (`.omc/state/pdf-ocr-{parse-pdf,app-parse}-deps.baseline.txt`): empty diff for both. spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.4 + §4.6 + §4.8) plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 6 E1-E4 + Step 7 partial G1+G2) prior: `4672cba` (Step 5 fix) + `fd918a6` (Step 5) + `9f003ef` (Step 4 helper) contract: §9 (additive minor wire bump, once the Step 7 JSON Schema is complete) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 08:18:34 +00:00
altair823	710945c4b0	refactor(parse-md): absorb kebab-normalize + kebab-parse-types (24 → 22 crates) + §3.7b rewrite In practice, only 1 of the 4 parsers (markdown) goes through lift in the thin layer of design §3.7b (ParsedBlock et al.), so fan-in and fan-out are both 1 → the layer has lost its purpose. Absorb both kebab-normalize (1097 LOC) + kebab-parse-types (98 LOC) into kebab-parse-md. Design: docs/superpowers/specs/2026-05-26-normalize-absorption-spec.md Plan: docs/superpowers/plans/2026-05-26-normalize-absorption-plan.md HOTFIXES: the 2026-05-26 entry in tasks/HOTFIXES.md (design deviation) - 5 used types + 3 forward-declared structs → pub explicit re-exports from the kebab-parse-md::types module. - build_canonical_document + derive_title + warning_agent → kebab-parse-md::normalize module. - The 4 hard-coded agent literals (lib.rs:122/128/134/153) + the warning_agent body return + the tracing target literal are all preserved, for stage label consistency. - Remove the kebab-app call sites (lib.rs:51 use + :1119 context string) + the 2 deps in Cargo.toml (regular + dead). - Remove the kebab-normalize [dev-dependencies] from kebab-chunk + kebab-store-sqlite (replaced by kebab-parse-md). Shift the use lines in integration test sources. - Move the test file (kebab-normalize/tests/normalize_snapshot.rs → kebab-parse-md/tests/). - workspace Cargo.toml: Hunk (a) delete 2 members entries + Hunk (b) version 0.18.0 → 0.19.0 (frozen contract change). - Rewrite design §3.7b as 4 paragraphs (original intent preserved + current state + preserved surface + future re-extraction trigger). - Update the design §8 graph (3 edges removed + 2 forbidden bullets reworded + commentary). - Mechanical update of the ARCHITECTURE.md crate graph + directory tree. - tasks/INDEX.md L169 closure mention + new "Future work / deferred" section (image/pdf normalize integration entry). - New tasks/HOTFIXES.md entry (4-block, design deviation Symptom). - One cross-link line in HANDOFF.md. - The 3 dead structs (ParsedImageRegion / ParsedPdfPage / ParsedAudioSegment) are kept as the future surface for v0.20+ image/pdf normalize integration (spec §11). Wire / surface impact: none. CLI / TUI / MCP / --json output / config / XDG path / parser_version all unchanged. 
The wire-invisible provenance.events[].agent + tracing target literal "kb-normalize" are also preserved, keeping the audit log consistent between old and new DB rows. Verification: cargo test --workspace --no-fail-fast -j 1 → 1313 passed / 0 failed (172 result blocks). cargo clippy --workspace --all-targets -j 1 -- -D warnings → 0 warnings (5m 46s). cargo metadata --no-deps --format-version 1 | jq '.workspace_members | length' = 22. cargo tree -p kebab-app --depth 2 | grep -E "kebab_(parse_types|normalize)" = 0 lines.	2026-05-26 15:00:59 +00:00
altair823	7c85de065a	chore: workspace-wide cleanup (clippy::pedantic baseline + auto-fix cut) Final cleanup before PR v0.18.0. User request: "make the whole codebase clean and easy to read". ## Workspace lints - Add `pedantic = "warn"` (priority -1) + an intentional allow-list to `[workspace.lints.clippy]` in `Cargo.toml`: - cast_possible_truncation / cast_possible_wrap / cast_sign_loss / cast_precision_loss: intentional truncation (ONNX i64, hash modular reduction, etc.). - doc_markdown / missing_errors_doc / missing_panics_doc: cosmetic doc style. - too_many_lines / module_name_repetitions / must_use_candidate / needless_pass_by_value / manual_let_else / items_after_statements / similar_names: informational only. - format_collect / match_wildcard_for_single_variants / trivially_copy_pass_by_ref / unnecessary_wraps: intentional patterns (exhaustive match, future Result variants, etc.). - default_trait_access: `Foo::default()` is idiomatic. - float_cmp: explicit threshold comparisons of NLI / RRF scores are intended. - struct_excessive_bools / case_sensitive_file_extension_comparisons / naive_bytecount / ignore_without_reason: domain-specific intent. - format_push_string / return_self_not_must_use / match_same_arms: preserves builder / wire-label / hot-path patterns. - needless_continue / used_underscore_binding / nonminimal_bool / unreadable_literal / many_single_char_names / doc_link_with_quotes / assigning_clones / collapsible_str_replace / trivial_regex / elidable_lifetime_names / range_plus_one / explicit_iter_loop / implicit_hasher / ref_option: remaining low-value style. - Add `[lints] workspace = true` to each of the 24 crates' `Cargo.toml`. ## Auto-fix Applied `cargo clippy --workspace --all-targets --fix`: 128 files changed, 552 insertions / 472 deletions. Mainly: - uninlined_format_args (~18): `format!("{}", x)` → `format!("{x}")`. - redundant_closure_for_method_calls (~33): `.map(|x| x.foo())` → `.map(T::foo)`. - Other mechanical refactors. ## Verification - `cargo clippy --workspace --all-targets -j 1 -- -D warnings` clean (pedantic + all lint groups). 
- `cargo test --workspace --no-fail-fast -j 1`: 1293 tests pass + 1 pre-existing flaky failure (`kebab-mcp::tools_call_ask_multi_hop::ask_tool_routes_multi_hop_true_to_decompose_first`, HOTFIX candidate, unrelated to the cleanup). 0 regressions. Wire impact: none. Behavior impact: none (mechanical refactor only). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 03:01:58 +00:00
altair823	546c1564b0	feat(rag): fb-41 PR-9c-1: core types + wire scaffolding (NLI verification) Surface-only PR (no behavior wiring; that's PR-9c-2): - kebab-core: RefusalReason::NliVerificationFailed + NliModelUnavailable (serde rename_all="snake_case", wire = identical strings). - kebab-core: Answer.verification: Option<VerificationSummary> field (additive minor wire; no effect on pre-v0.18 readers). - kebab-core: VerificationSummary { nli_score: f32, nli_threshold: f32, nli_passed: bool } struct + lib.rs re-export. - kebab-config: NliCfg { model, provider } + ModelsCfg.nli (default Xenova/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7). - kebab-config: RagCfg.nli_threshold: f32 (default 0.0 = disabled, spec §2.6 single gate). - kebab-config: env overrides KEBAB_MODELS_NLI_MODEL/PROVIDER + KEBAB_RAG_NLI_THRESHOLD (on parse failure, tracing::warn + keep the default). - kebab-rag: RagPipeline.verifier: Option<Arc<dyn NliVerifier>> field + with_verifier builder (all #[allow(dead_code)], to be removed when PR-9c-2's step 8.5 hook activates them). RagPipeline::new signature unchanged (round-2 NEW-M1 Option B). - kebab-rag: add the kebab-nli path dependency to Cargo.toml. - kebab-store-sqlite + kebab-tui: add exhaustive match arms for the two new RefusalReason variants (snake_case label / display text). - Add verification: None to every Answer construction site (rag 6 + cli/tui/eval 3 fixtures). - wire schemas: answer.schema.json adds the verification field + $defs.VerificationSummary + 2 refusal_reason.enum values. error.schema.json: 2 additions to code.enum + details.description (forward-looking reserved). - docs/ARCHITECTURE.md: nli node in the Mermaid Adapters subgraph + rag→nli + app→nli (forward-looking) + nli→config edges. The nli→core edge is skipped (kebab-nli/Cargo.toml's only direct dep is config; the ARCHITECTURE convention is direct deps only). Add crates/kebab-nli/ to the directory tree. Tests: kebab-core 3 (serde rename + verification skip + struct shape) + kebab-config 6 (defaults + legacy + env + malformed env) + kebab-cli wire 5 (schema verification + enum validation). 
Verification: cargo test --workspace -j 1 shows 0 regressions (the same 1 pre-existing flaky kebab-mcp::tools_call_ask_multi_hop test, listed as known-flaky in the spec). cargo clippy --workspace --all-targets -D warnings clean. Wire impact: additive minor (answer.v1 verification is optional + refusal_reason.enum extended + error.v1.code extended). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 23:27:36 +00:00
altair823	cf35f36f88	feat(rag): fb-41 PR-2: RagPipeline::ask_multi_hop skeleton (fixed depth=2) PR-2 of fb-41 multi-hop RAG. The 3-stage decompose + retrieve + synthesize pipeline is dispatched when `opts.multi_hop=true`. The dynamic decide loop comes in PR-3. - Add the `AskOpts.multi_hop: bool` field + introduce `impl Default for AskOpts` (resolves the known limitation from HOTFIXES 2026-05-07). All 9 explicit init sites add `multi_hop: false`; with Default in place they can migrate gradually to `..Default::default()`. - One dispatcher line at the entry of `RagPipeline::ask` (`if opts.multi_hop { return self.ask_multi_hop(...) }`). - New method `RagPipeline::ask_multi_hop`. 1) decompose LLM call → parse a JSON array of strings, 2) retrieve with each sub-query + dedup the pool by chunk_id, 3) score gate / no-chunks guard, 4) pack_context (helper shared with single-pass), 5) synthesize LLM call w/ MULTI_HOP_SYNTHESIZE_SYSTEM_PROMPT, 6) citation extract + Answer build. `prompt_template_version` is stamped "rag-multi-hop-v1" so that eval `compare` can separate single-pass from multi-hop. - New prompt consts: MULTI_HOP_DECOMPOSE_SYSTEM_PROMPT + MULTI_HOP_DECOMPOSE_USER_TEMPLATE + MULTI_HOP_SYNTHESIZE_SYSTEM_PROMPT + PROMPT_TEMPLATE_VERSION_MULTI_HOP + MULTI_HOP_MAX_SUB_QUERIES_DEFAULT. - New `kebab_core::RefusalReason::MultiHopDecomposeFailed` variant. Cascade: update the exhaustive matches in kebab-store-sqlite `refusal_reason_label` + kebab-tui `ask refusal render`. - `parse_decompose_response` + `strip_markdown_json_fence` helpers: strip markdown code fences (```json / ```) + parse JSON array of strings + trim + drop empty + cap at MULTI_HOP_MAX_SUB_QUERIES_DEFAULT. On None, the caller refuses with `MultiHopDecomposeFailed`. Tests (55 passing total, 8 new): - 6 unit (regression pins for parse_decompose_response: bare array / fence variants / garbage / cap / trim). - 2 integration: `ask_multi_hop_dispatches_and_decompose_garbage_refuses` (decompose garbage → MultiHopDecomposeFailed + exactly 1 LLM call) + `ask_with_multi_hop_false_keeps_single_pass_path` (regression pin; existing callers stay backwards-compatible automatically). 
A happy-path multi-hop integration test (decompose succeeds → synthesize) will be added when the ScriptedLm helper is introduced together with PR-3's decide loop. The current `MockLanguageModel` returns a single canned response, so it cannot pin a 2-LLM-call sequence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 06:45:32 +00:00
altair823	0def913abd	feat(v0.17.0/PR-C): code_lang_chunk_breakdown additive wire field Closes the HOTFIXES 2026-05-22 "code_lang_breakdown chunk granularity" LOW item. Chunk-level companion of the existing doc-count metric. - crates/kebab-store-sqlite/src/store.rs: code_lang_chunk_breakdown() method. chunks INNER JOIN documents → COUNT(c.chunk_id) GROUP BY metadata_json.code_lang, NULL skipped. BTreeMap<String, u32>. + lib unit test code_lang_chunk_breakdown_counts_chunks_not_docs (1 rust doc + 3 chunks → rust=3 chunks vs rust=1 doc). - crates/kebab-app/src/schema.rs: Stats.code_lang_chunk_breakdown additive field + collect_stats builder. stats_includes_code_lang_and_repo_breakdown_fields in tests_stats_ext also verifies the new field. - docs/wire-schema/v1/schema.schema.json: spec for the new additive field + corrected descriptions of the existing code_lang_breakdown / repo_breakdown ("code chunk count" → "doc count", per the Gemini round 2 recommendation). - tasks/HOTFIXES.md: 2026-05-24 PR-C closure entry. Wire additive, no schema_version bump needed. Compatible with v0.16.x callers. cargo test --workspace --no-fail-fast -j 1 + clippy pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 20:35:01 +00:00
altair823	93ddece111	feat(v0.17.0/PR-B/B1): C typedef extractor + parser_version bump + orphan purge cascade Closes HOTFIXES 2026-05-21. A C typedef-wrapped anonymous struct/enum/union now emits a symbol unit named after the typedef alias. - crates/kebab-parse-code/src/c.rs: add a type_definition branch. Detect an inner anonymous struct_specifier / enum_specifier / union_specifier → recursively extract the type_identifier from the declarator field → synthetic unit (typedef alias). Named inner aggregates / plain aliases stay glue as before. PARSER_VERSION code-c-v1 → code-c-v2. Add the recover_typedef_alias + extract_typedef_alias_name helpers. - crates/kebab-store-sqlite/src/store.rs: two new helpers (doc-id-based orphan purge for the parser_version bump cascade). - stale_chunk_ids_for_workspace_path_except_doc_id(workspace_path, keep_doc_id): sister of stale_chunk_ids_at, doc_id-based. - purge_document_at_workspace_path_except_doc_id(workspace_path, keep_doc_id): CASCADE-removes document/chunks, keeps assets. keep_doc_id="" means "remove every doc". - crates/kebab-app/src/lib.rs: call purge_workspace_path_for_parser_bump in the parser_mismatch branch of try_skip_unchanged. The helper accesses app.vector() lazily + delete_by_chunk_ids + removes the SQLite document row. Cleanup finishes before Ok(None) is returned, so the caller's new INSERT avoids an idx_docs_workspace_path UNIQUE conflict. - tests: - c.rs: 4 new unit tests: typedef_struct_emits_unit / typedef_enum_emits_unit / typedef_union_emits_unit / typedef_to_existing_type_stays_glue (negative). - tier1_c_ingest_searchable: parser_version assertion code-c-v1 → code-c-v2. - Regression: the existing purge_orphan_at_workspace_path + purge_vector_orphans_for_workspace_path on the bytes-edit path (asset_id change) are unchanged; they coexist with the new branch, and all existing tests PASS. Unresolved (Risks): for a nested typedef (typedef struct { struct {...} inner; } Outer;), the inner anonymous struct is still glue; v2's initial scope covers top-level typedef aliases only. cargo test --workspace --no-fail-fast -j 1 + clippy pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 20:30:57 +00:00
altair823	fe123c0c6d	test(A4): korean + english trigram matching at FTS level 3 new unit tests in tests/fts.rs §7: 1. fts_trigram_korean_3char_substring_hits: pins, in 5 asserts, the behavior Codex verified on sqlite 3.45.1: raw 3-character substring hit (충돌은/발생한), quoted phrase hit (\"해시 충돌\"/\"시 충\"), raw 해시충 0-hit (not present in the source text). 2. fts_trigram_korean_short_query_zero_hit_pinned: regression detection for the 0-hit result on 2-character Korean queries (충돌·키). Fails first if the trigram structure changes. 3. fts_trigram_english_substring_hits: pins the changed substring recall behavior (token→tokenizer, to 0-hit). Verification: cargo test -p kebab-store-sqlite --test fts → 13/13 PASS (3 new + 10 existing). Step 1c (multi-token Korean query e.g. \"해시 충돌\") and Step 5 (lexical BM25 snapshot update) will proceed after the build_match_string() redesign in Task A5. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 00:57:37 +00:00
altair823	8dcedc4b11	feat(p10-r2): V007 trigram migration + design §5.5 + fts diff-check Task A2 + A3 in one bundle. migrations/V007__fts_trigram.sql (new): - DROP + recreate the chunks_fts shadow (tokenize = trigram). - Recreate the chunks_ai/ad/au triggers (same as V002). - Backfill INSERT from chunks; no user re-ingest needed, V007 does it automatically. - V002 is kept as-is for historical cold-upgrade replay. design §5.5 updates: - Replace only the tokenize setting in the verbatim block with trigram. - Add one prose paragraph at the top of the §5.5 body covering the rationale for adopting it for Korean + the trade-offs (English lexical change, BM25 distribution, disk ~2-10x, not contentless). crates/kebab-store-sqlite/tests/fts.rs: - Rename fts_v002_matches_design_section_5_5_verbatim → fts_v007_matches_design_section_5_5_verbatim. - Change the include_str! path in extract_migration_5_5_verbatim_block() to V007__fts_trigram.sql. Comments/assertion msgs updated to V007. - The V002 cold-upgrade tests (fts_v002_backfill_*) are kept as-is. Verification: cargo test -p kebab-store-sqlite --test fts → 10/10 PASS (including `fts_v007_matches_design_section_5_5_verbatim`). Reflects the Codex round 1/2 findings: the design §5.5 contentless correction and an explicit rationale for adopting the trigram tokenizer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 00:52:40 +00:00
altair823	453ec15df4	fix(dogfood): document-centric fetch_span + remove get_asset_by_workspace_path assets.workspace_path is INTENTIONALLY 'last-registered path' for twin files (identical content at different paths share one asset row PK'd by blake3 content hash). PR #146 made try_skip_unchanged document-centric; PR #149 made reset --orphans-only document-centric; this PR removes the last caller of get_asset_by_workspace_path (fetch.rs:193 in fetch_span, which used it to reject PDF/audio media — for twins this could read the wrong asset's media_type and pick the wrong branch). Replaced with the natural 2-step lookup: get_document_by_workspace_path (PR #146) → doc.source_asset_id → get_asset (NEW trait method, asset_id is PRIMARY KEY so flip-flop-immune by construction). Then removed get_asset_by_workspace_path trait method + SqliteStore impl — 0 callers after the refactor. UPSERT doc-comment refreshed in store.rs to make the 'last-registered' semantics explicit so future readers don't try to 'fix' the flip-flop. Dogfood follow-up (PR #142 1B + multi-root corpus). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 08:03:38 +00:00
altair823	27baec82ea	fix(dogfood): auto-purge stored docs for filesystem-deleted files Files deleted from disk (rm a.md) were leaving stale documents + chunks + embeddings in the store, surfacing as ghost citations in search/ask. Existing purge_orphan_at_workspace_path only handled content-changed stale (WHERE workspace_path=? AND asset_id != ?) — file deletion has no new asset_id. Fix: post-walker-scan sweep. Compute (stored_paths - scanned_paths), for each candidate check filesystem existence — only purge when the file is TRULY missing. Scope-narrowing case (file on disk but outside include glob) is explicitly NOT purged to protect users from accidental data loss via config edits. Adds: - DocumentStore::all_workspace_paths trait method + SqliteStore impl - purge_deleted_workspace_path in store-sqlite (returns chunk_ids for vector delete; deletes doc CASCADE + asset row + copied storage file) - sweep_deleted_files in kebab-app::ingest path; called once per ingest before the per-asset loop - IngestReport.purged_deleted_files counter (additive, serde default) - CLI ingest summary mentions purge count when > 0 - 2 integration tests: file_deletion_auto_purge + include_scope_narrowing_does_NOT_purge dogfood discovery (PR #142 1B + multi-root: kebab-docs + httpx + zod + lodash). Per user decision: only filesystem deletion auto-purges; scope narrowing requires explicit kebab reset. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 06:51:07 +00:00
altair823	641b92af7d	fix(dogfood): document-centric try_skip_unchanged for twin-file idempotency Identical-content files at different workspace paths share one assets row (assets.asset_id = blake3 content hash, PRIMARY KEY). The UPSERT `ON CONFLICT(asset_id) DO UPDATE SET workspace_path = excluded` made twin files overwrite each other's workspace_path on every ingest, so `get_asset_by_workspace_path(path1)` returned the OTHER twin's row (or None), breaking idempotent unchanged-detection for both files. Fix: switch try_skip_unchanged to document-centric lookup. `documents.workspace_path` is already UNIQUE (V001) and `id_for_doc(path, ...)` includes path, so each twin has its own stable document row. Compare `doc.source_asset_id` with the new asset's checksum instead of going through the assets table. Dogfood (multi-root: kebab-docs + httpx + zod + lodash) showed 27 of 726 docs marked Updated on every idempotent re-ingest; all 27 are twin-file victims (empty `__init__.py` ×3, AGENTS.md ↔ CLAUDE.md same content, duplicate logo PDFs/JPGs). After: re-ingest reports 0 new / 0 updated / 726 unchanged. No schema migration needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 05:27:21 +00:00
altair823	4e8b84c4e0	fix(dogfood): populate schema.v1.repo_breakdown (Task 9 follow-up) Dogfooding (PR #142 1B + multi-root corpus: kebab-docs + httpx + zod + lodash) revealed schema.v1.repo_breakdown is always {} despite the 1A-2 Task 9 having added the code_lang_breakdown sibling. The schema.rs:171 placeholder `BTreeMap::new()` was left in place. Mirror Task 9's code_lang_breakdown query for the repo field — same metadata_json JSON-path pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 05:09:19 +00:00
altair823	918ee6c0be	fix(p10-1a-1): apply code_lang + repo filters in lexical SQL and filter_chunks (dogfood-discovered) p10-1A-1 (PR #139) added SearchFilters.code_lang + .repo fields and the CLI --code-lang / --repo flags propagate them correctly into SearchFilters, but neither the lexical retriever's FTS SQL nor the shared filter_chunks helper (used by the vector retriever) ever applied them — so a code-lang-filtered search returned all-doc hits (markdown / pdf / code mixed). Discovered while dogfooding p10-1B with httpx + zod + lodash clones: `kebab search 'AsyncClient' --code-lang python --json` returned markdown hits from httpx/docs/ first. Fix: add IN-list filters on json_extract(d.metadata_json, '$.code_lang') and '$.repo' to both lexical.rs and filters.rs, mirroring the existing media filter pattern. Two regression tests added in each crate covering the new filter behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 03:27:01 +00:00
altair823	da51e59081	feat(p10-1a-2): populate schema.v1 code_lang_breakdown Add `SqliteStore::code_lang_breakdown()` that queries `json_extract(metadata_json, '$.code_lang')`, groups by it, and skips NULL rows — returning `BTreeMap<String, u32>`. Wire it into `collect_stats` in `kebab-app::schema`, replacing the `BTreeMap::new()` placeholder inserted by 1A-1. Test: `store::tests::code_lang_breakdown_counts_by_code_lang` asserts rust=1 and that a null-code_lang doc does NOT appear in the map. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 21:41:52 +00:00
th-kim0823	bf4ebf8d2a	feat(p10-1a-1): add Metadata.repo / git_branch / git_commit / code_lang Four optional, serde-skipped-when-None fields added to `Metadata` for code ingest context. All 11 downstream construction sites patched with `repo: None, git_branch: None, git_commit: None, code_lang: None`. Full workspace check (`--tests`) and per-crate test suite pass clean. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-15 15:44:18 +09:00
th-kim0823	351c7a0826	feat(p10-1a-1): add IngestReport skip counters + SkipExamples Adds five new u32 counters (skipped_gitignore, skipped_kebabignore, skipped_builtin_blacklist, skipped_generated, skipped_size_exceeded) and a SkipExamples struct (≤5 sample paths per category) to IngestReport. All new fields are #[serde(default)] for backward-compat deserialization. Downstream literal construction sites patched with zeros/empty; snapshot re-baked. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-15 15:28:19 +09:00
th-kim0823	126559ce7a	fix(fb-40): update test fixtures for rag-v2 default	2026-05-10 19:15:15 +09:00
th-kim0823	231d80e82d	feat(stats): media/lang/bytes/stale fields on schema.v1.stats (fb-37) Extends CountSummary with media_breakdown, lang_breakdown, stale_doc_count fields populated via stats_ext::breakdowns(). Adds count_summary_with_threshold for callers that need real stale counts. Mirrors all new fields onto the wire-bound Stats struct in kebab-app::schema with #[serde(default)] for backwards-compat. Also fixes search_budget_integration.rs for the trace field added to SearchOpts in Task 1. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-10 12:34:57 +09:00
th-kim0823	69c6e23432	feat(store): breakdowns + index_bytes helpers (fb-37) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-10 12:24:43 +09:00
th-kim0823	84287d0ef6	fix(fb-36): address PR #127 round 1 review - ingested_after: convert OffsetDateTime to UTC before formatting so non-Z offsets compare correctly against UTC TEXT storage (lexical.rs + filters.rs) - README: --tag is repeatable-only, not csv (only --media is csv) - test(cli): add multi-value --tag OR-within IN-list coverage - test(store): add UTC-offset regression test for ingested_after - mcp: use ERROR_V1_ID const instead of hardcoded "error.v1" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 04:47:55 +09:00
th-kim0823	c6cc1e2bfe	feat(search/vector): media / ingested_after / doc_id filters (fb-36) filter_chunks helper in kebab-store-sqlite extended with the same 3 WHERE clauses as lexical. Vector still over-fetches k*2 then post-filters via SqliteStore::filter_chunks; small k can return < k hits when filters drop a lot — agent is expected to widen k or paginate. AND combinator with existing filters. - kebab-store-sqlite/src/filters.rs: media IN-list subquery, ingested_after lexicographic >= compare, doc_id equality; mirrors lexical SQL arms - 3 direct unit tests (filter_chunks_media_type/ingested_after/doc_id) that run without AVX/Lance - common/mod.rs: insert_doc / insert_doc_with_media / run_vector_search helpers on HybridEnv for integration-test use - hybrid.rs: 2 new #[ignore = "requires AVX..."] integration tests (vector_filter_by_media, vector_filter_by_doc_id) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 03:50:56 +09:00
th-kim0823	b86b763dfb	fix(fb-35): address PR #126 round 2 review - wire schema: relax effective_end.minimum 1 → 0 + expand description to cover line-clamp + out-of-range sentinel (panic-fix R1 emits Some(0) when line_start=1 and range is beyond doc end — schema must accept it) - tests: tighten first-chunk-target boundary test to assert ≤ 2 total neighbors (3-chunk doc, N=2). Strict "first chunk → context_before empty" not assertable until chunks.ordinal column lands (R1 #9 architectural caveat) - store: trim contradiction in list_chunk_ids_for_doc warning comment — drop "good enough for sequentially chunked markdown" phrase that conflicts with "hash sort dominates" paragraph above Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 00:55:29 +09:00
th-kim0823	7dddc1d706	fix(fb-35): address PR #126 round 1 review - fetch_span: panic-fix on line_start > total / empty doc (return empty text + effective_end = line_start - 1 instead of out-of-bounds slice) - truncated: reserved for budget-driven truncation only; line range clamp signaled via effective_end < line_end - spec / SKILL.md / README: align rejection wording to "PDF / audio" (matches code; Image OCR allowed for span) - store: warning comment on list_chunk_ids_for_doc — chunk_id hash sort does NOT preserve document position; real fix is a chunks.ordinal column, tracked as follow-up - surrounding_chunks: saturating_add to defend against u32::MAX context arg on 32-bit targets - tests: line_start > total returns empty + chunk context at doc boundary clamps lower bound Deferred nits (follow-up): table-separator strict CommonMark form; MCP per-mode strict validation; CLI chunk_id truncation in plain output. None block correctness. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 00:45:29 +09:00
th-kim0823	610d29f053	feat(app): App::fetch chunk mode + markdown serializer (fb-35) Chunk mode + +-N context. doc / span modes return placeholder errors (filled by subsequent tasks). fmt_canonical_to_markdown helper introduced now since doc mode (Task 4) consumes it. Errors are typed StructuredError so classify preserves chunk_not_found / doc_not_found through the wire layer. Adds SqliteStore::list_chunk_ids_for_doc so the facade can derive +-N neighbors without leaking direct rusqlite usage into kebab-app. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 23:44:51 +09:00
th-kim0823	61aae1c1d5	🏗️ refactor(kebab-app): consolidate PARSER_VERSION + clarify intent (fb-27) Replace kebab-app's private `KEBAB_PARSE_MD_VERSION` literal with a direct reference to `kebab_parse_md::PARSER_VERSION` so the parser version cascade has a single source of truth (design §9 invariant). Add maintenance comment on schema.rs WIRE_SCHEMAS const pointing to docs/wire-schema/v1/ + kebab-cli wire helpers as the authoritative sources to keep in sync. Tighten open_existing doc comment to match the actual SQLITE_OPEN_READ_WRITE flag (needed for WAL pragma application) — callers should still avoid issuing mutations through this connection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 11:58:06 +09:00
th-kim0823	39b4433549	feat(kebab-app): schema_with_config facade (fb-27) New `SchemaV1` struct + `schema_with_config(&Config)` builder. Surfaces wire schemas list, capabilities (current + future placeholders), model versions (parser/chunker/embedding/prompt_template/index/corpus_revision), and stats (doc/chunk/asset counts + last ingest). kebab-store-sqlite gains `count_summary()` to back the stats block. Deviations from plan: - `cfg.models.embedding.id` → `cfg.models.embedding.model` (actual field name) - No `Config::expand_path` method → free fn `kebab_config::expand_path(&cfg.storage.data_dir, "")` - `PARSER_VERSION` added to `kebab-parse-md/src/lib.rs` (was absent; synced with `KEBAB_PARSE_MD_VERSION` literal in kebab-app) - `INDEX_VERSION_STR` added to `kebab-store-vector/src/store.rs` + re-exported from `lib.rs` (was a private `const`) - `corpus_revision()` returns `u64` directly (not `Result<u64>`) — no `?` in collect_models - `SchemaV1` carries `schema_version: "schema.v1"` field (wire schema convention) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 11:46:37 +09:00
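
The purge rule described in commit 27baec82ea (purge only paths that are stored, were not seen by the walker scan, and are truly missing on disk) can be sketched as below. This is a minimal illustration under stated assumptions, not the code in kebab-app: `deleted_paths` and the `exists_on_disk` predicate are hypothetical names, and the real `sweep_deleted_files` also deletes documents, chunks, vectors, and storage files.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper (not the kebab API): returns the stored workspace
/// paths that should be purged after a walker scan.
///
/// A path is a purge candidate when it is stored but was not scanned.
/// It is purged only when `exists_on_disk` reports the file as missing,
/// so a file that still exists but fell outside the include glob
/// (scope narrowing) is kept.
fn deleted_paths<F>(
    stored: &BTreeSet<String>,
    scanned: &BTreeSet<String>,
    exists_on_disk: F,
) -> Vec<String>
where
    F: Fn(&str) -> bool,
{
    stored
        .difference(scanned) // stored_paths - scanned_paths
        .filter(|p| !exists_on_disk(p.as_str())) // keep only truly missing files
        .cloned()
        .collect()
}
```

In real use the predicate would be something like `|p| std::path::Path::new(p).exists()`; taking it as a parameter keeps the set logic testable without touching the filesystem.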

69 Commits