kebab

Author	SHA1	Message	Date
altair823	58ac62d53a	feat(search): provenance source filter: [[workspace.sources]] multi-source + --source/--source-type. A provenance lever for mixed-source KBs (wiki + jira, etc.) that indexes everything but narrows by source at query time. Global trust multiplicative weighting (weighted-RRF) was refuted in A/B testing (θ=0.85 alone dropped incident MRR 0.918→0.340, a cliff caused by score compression); the filter is the correct lever, with no see-saw effect. - config [[workspace.sources]] (each with id/root/exclude/trust_level/source_type); a single root is normalized to an implicit `default` source. validate: id is unique and non-empty. - config schema v3→v4 (step_3_to_4, mirrors root→[[workspace.sources]] id=default, idempotent) - V014 documents.source_id column + index (additive, DEFAULT 'default', zero reindexing) - Metadata.source_id + BodyHints trust precedence (frontmatter > source default > Primary) - ingest: when --root is not given, iterates resolved_sources() and stamps source_id/trust on each doc - search SearchFilters.source_type/source_id → both the lexical and vector sites (IN, OR) - CLI kebab search --source <id> / --source-type <type> (repeatable/comma-separated) Dogfooding (620 docs, jira 400 + wiki 220): --source wiki raises concept-query MRR 0.780→0.810, --source jira raises incident MRR 0.918→0.975. Trust precedence measured (jira=secondary default). version bump 0.28.0 → 0.29.0 (new CLI flags + config key + V014 migration → minor). follow-up: MCP search filter not exposed · kebab list does not show source_id · RAG provenance labels. Details: tasks/HOTFIXES.md (2026-06-21), docs/release-notes/v0.29.0-draft.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_012Mc6W1fgsrbFKTsqA6P8La	2026-06-21 08:35:19 +00:00
altair823	3d45994693	refactor(config): make signature paddle paths per-media + byte-invariant golden. ocr_engine_version_for_sig is parameterized to receive det/rec/dict from the caller (per media): image uses [ingest.image.ocr], pdf uses [ingest.pdf.ocr]. Removes the v2 pdf↔image paddle asymmetry. Adds engine_version_for_paths (kebab-parse-image). The output string is value-based, so it is byte-identical to v2 (invariant #1). Adds a test seam + golden. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 12:44:27 +00:00
altair823	d5c69f6715	refactor(config): v3 path call-site sweep (kebab-app/kebab-eval/kebab-parse-image). Inserts .ingest into the parent path (leaf structs unchanged). Covers all src + test call sites. The v2 TOML fixtures in the kebab-cli tests are kept to verify the from_file auto-conversion (T6) path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 12:40:06 +00:00
altair823	f3a7222ec5	fix(ocr): address PR #206 round-1 review: golden CI tests + PDF tuning docs + threshold const + mutex recovery. - [MEDIUM] Adds 2 golden CI unit tests: ctc_greedy_decode_golden (argmax_idx one-hot → verifies the decoded string), det_box_score_golden (verifies box_score/unclip_rect golden corners). No model/ONNX needed; runs in CI. Extracts ctc_greedy_decode into a free function (ctc_greedy_decode_with_dict) to make it testable. - [MEDIUM] Documents the PDF paddle tuning asymmetry: build_pdf_ocr_engine now states why paddle-onnx uses image.ocr.* (not pdf.ocr.*) + updates the PdfOcrCfg.engine field doc. - [MEDIUM] DBNet binarization magic number 0.3 → extracted as the DET_BIN_THRESH const + a 1-line comment on why the score_thresh default is loose. - [LOW] Mutex poison recovery: det/rec .expect("poisoned") → .unwrap_or_else(PoisonError::into_inner), so that a panic on one asset does not abort the ingest. - [LOW] Removes the dead DetBox.score field (the box_score result is used only for filtering and need not be stored). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-04 09:13:27 +00:00
altair823	375a0693e4	chore(ocr): T11/T12 — clippy clean + docs + v0.27.0 bump T11: fix 12 clippy lints in paddle_onnx.rs/paddle_e2e.rs (doc overindent, finish_non_exhaustive, map_or_else, RangeInclusive::contains, cast_lossless, is_some_and, usize::from). Full-workspace clippy -D warnings = 0. Smoke (paddle-onnx, real binary): clean_paragraph OCR verbatim-correct, real per-region confidence (0.99/0.96/0.95), FTS5 lexical hit on Korean(검색)+ English(embedding), parser_version folds \|ocr:1:paddle-onnx:<ver>. Big page <4s inference (5.6s ingest incl. one-time session load). T12: README [image.ocr].engine + ARCHITECTURE OCR row + SMOKE paddle-onnx config + HANDOFF + HOTFIXES dated entry. Workspace version 0.26.2 → 0.27.0 (minor: new engine value + config keys). .gitattributes: onnx as plain blobs (no git-lfs). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 08:36:19 +00:00
altair823	901416d8e9	feat(ocr): T7-T9 — config overrides + engine factory + signature cascade T7: OcrCfg gains det_model/rec_model/dict overrides + score_thresh/ unclip_ratio/max_boxes (serde default, KEBAB_IMAGE_OCR_* env). OnnxPaddleOcr::new threads them via ModelPaths::from_config. T8: build_image_ocr_engine / build_pdf_ocr_engine factories return Box<dyn OcrEngine>; match on engine string (ollama-vision\|paddle-onnx\|err). ImagePipeline.ocr_engine + pdf_ocr_engine signatures switched to &dyn OcrEngine. OcrEngine gains model() for the progress label. T9: ingest_config_signature image/pdf branches emit \|ocr:1:{engine}:{engine_version} (memoized blake3 per asset-triple, m3-safe). Unit tests (a)(b)(c) added. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 08:15:30 +00:00
altair823	03b0745e9d	test(ingest): config invalidation e2e + parser_version assert update. - config_invalidation.rs (new): end-to-end checks that same config = all skipped / chunking change = md+code reindexed / [ingest.code] change = code only / search change = 0 reindexed (regression guard). - code_ingest_smoke / pdf_pipeline: the stored parser_version is now the composite "{base}\|{sig}", so the exact asserts are updated to compare the base prefix (split('\|').next()). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 14:14:15 +00:00
altair823	e7cb20990a	feat(ingest): automatically reindex affected assets when ingest settings change (signature folding). A deterministic signature of the settings that affect ingest output (chunking / image OCR and caption / pdf.ocr / [ingest.code]) is folded into the effective parser_version, so a change reindexes only the affected assets without --force-reingest. - ingest_config_signature(config, media_type): serializes only the output-affecting settings per type. Non-output settings (search/rag/ui/log + max_pixels/languages/timeout) are excluded. - effective_parser_version(config, asset, base) = "{base}\|{signature}". - md/image/pdf/code paths: the composite is used as (a) the comparison value for try_skip_unchanged and (b) the canonical.parser_version override before persist. - doc_id is still derived from the base parser_version → stable across setting changes (avoids orphan churn). - The code Tier-3 fallback keeps the bare "none-v1" sentinel (the skip bypass depends on it). - 8 unit tests: determinism / chunking = all types / image, pdf, code toggles / regression guard for unrelated settings. spec: docs/superpowers/specs/2026-06-03-ocr-toggle-invalidation-spec.md Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 14:14:06 +00:00
altair823	6c9c8df43e	chore(version): 0.27.0 → 0.26.1 (patch under the new bump rule). The progress-log improvement leaves search and indexing results unchanged, adds no new commands/flags/config, and is additive-only on the wire (asset_phase), so under the new CLAUDE.md rule (feature/interface change = minor, otherwise patch) a patch is correct. Corrects the version, labels, and HOTFIXES header to 0.26.1. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 11:02:16 +00:00
altair823	4918983d9c	chore(ingest): address PR #204 round-1 review: version label v0.26.0 → v0.27.0. The new progress-logging surface (asset_phase / ocr_ms / caption_ms + progress.rs heartbeat/slowest comments) was mislabeled as v0.26.0; corrected to v0.27.0 (the version in which it was actually added). Keeps the wire schema's "added in" version accurate (referenced by external integrations). No logic change (comments/docs only). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 10:57:17 +00:00
altair823	aeaa18a564	feat(ingest): progress log improvements: file name/phase/heartbeat/slowest summary. Fixes the TTY progress bar looking "stuck" when ingest of a vault with OCR/caption enabled slows down partway, because it did not show the file name, phase, model, or elapsed time. - New wire AssetPhase{idx,total,phase,model} + AssetTimings.ocr_ms/caption_ms (additive, ingest_progress.v1 kept) - app: emits AssetPhase on entering apply_ocr/apply_caption/embed + measures ocr/caption time - cli: the TTY progress bar shows the current file name + phase(model) + elapsed seconds for the asset (heartbeat); on exit, a top-5 summary of the slowest files (printed even when quiet, not printed with --json) - wire schema / README / HANDOFF / HOTFIXES synced, version 0.26.0 → 0.27.0. Verification (lead): clippy 0, kebab-app/cli 61 groups and parse-image/tui 14 groups with 0 failures (-j8). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 10:52:26 +00:00
altair823	72c99c452c	feat(config,app): wire embedding provider=ollama + endpoint, version 0.26.0. kebab-config: EmbeddingModelCfg.endpoint: Option<String> (serde default, for ollama, None → falls back to models.llm.endpoint) + ollama added to the provider docs + env KEBAB_MODELS_EMBEDDING_ENDPOINT. kebab-app embedder(): ollama branch in the provider match (via the facade). workspace member += kebab-embed-ollama, app dep added. version 0.25.0 → 0.26.0 (minor, +Cargo.lock): a new embedding backend/model is a surface-change trigger under CLAUDE.md §Release. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 04:59:23 +00:00
altair823	e03d03cb26	test: remove alias-only tests + update affected tests/fixtures. Removes the alias channel test + insert_chunk_with_aliases helper in kebab-search/tests/lexical.rs (replaced by a body-recall regression test). Removes aliases: None from Chunk literals (embedding_records_fk/idempotency/inspect). Removes the aliases key from the chunk snapshot fixtures. config_migrate now anchors on ingest.code, and the corpus_revision/search_lexical comments are updated to state that V013 does not bump. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-02 21:37:58 +00:00
altair823	a48c405826	refactor(wire): remove the ExpansionProgress event + rendering. Removes the IngestEvent::ExpansionProgress variant + its serialization test (AssetChunked/AssetTimings kept). Removes expansion rendering from the CLI/TUI and the expand segment from the AssetTimings line. Removes the expansion_progress kind from the ingest_progress.v1 schema and updates the expansion_ms description to "value stays 0". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-02 21:37:44 +00:00
altair823	21e02d8a93	refactor(app): remove the ingest alias generation, cache, and sentinel vector loop. Removes from ingest_one_asset the per-chunk alias LLM generation, the derivation_cache lookup/store, and the embed_aliases sentinel vector (`{orig}#alias#N`) upsert loop. expansion_ms is fixed at 0 for wire compatibility. Simplifies alias_sentinel_ids_to_delete and the 3 orphan purge call sites to delete body chunk_ids directly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-02 21:37:43 +00:00
altair823	b1c5feb3f3	refactor(core): remove the Chunk.aliases field. Removes doc-side expansion (aliases): drops the aliases: Option<String> field on Chunk and its serde default test. Metadata.aliases (Vec, document metadata) is kept. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-02 21:36:44 +00:00
altair823	8bfa4ba76e	fix(ingest-progress): address review: store_ms boundary fix + guard against duplicate expansion frames. - Removes the stale-vector orphan purge (LanceDB I/O) from store_ms → moved to the embed/vector phase (embed_ms). store_ms now covers only SQLite put_* (diagnostic accuracy; removes a 920ms misattribution when reindexing an edited asset). The purge is still unconditional and runs before upsert. - Guards the final expansion_progress frame with done != last_done (removes the duplicate frame when done is a multiple of the throttle, and the 0/0 frame when chunks==0). - schema/HOTFIXES: corrects the store_ms/embed_ms descriptions + removes a dangling IMPL_REPORT reference. clippy -D warnings 0, test 312 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-02 14:49:02 +00:00
altair823	a48b055358	feat(ingest): intra-asset phase progress logging (asset_chunked/expansion_progress/asset_timings) + v0.24.0. Adds visibility into the phases inside a document to ingest progress events, which were previously per asset (document) only. Fixes the progress bar appearing stuck at 1/N while a large document spent tens of minutes in expansion (alias LLM, sequential per chunk). wire ingest_progress.v1 additive (backward-compat): - asset_chunked {idx,total,chunks}: right after chunking, on all markdown/image/pdf paths - expansion_progress {idx,total,done,chunks}: throttled in the expansion loop (every 25 chunks or 1s; done==chunks at the end). Cache hits also count toward done - asset_timings {idx,total,parse_ms,chunk_ms,expansion_ms,embed_ms,store_ms}: per-phase wall-clock on the markdown path. Design: to avoid changing kebab_core::IngestItem (wire-stable), timing is emitted directly by ingest_one_asset as a new AssetTimings event (AssetFinished unchanged). CLI (progress.rs): progress-bar sub-message (→ N chunks / alias expansion done/chunks) + one line of phase timing at asset end (fmt_ms). TUI reducer no-op arm. Verification: clippy -D warnings exit 0; cargo test -p kebab-app -p kebab-cli 312 passed/0 failed. Rewrote the ordering-invariant test + added new serialization tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 13:58:27 +00:00
altair823	369aeb3d24	feat(embed): candle Metal (Apple Silicon GPU) opt-in build feature + v0.23.0 - kebab-embed-candle: `metal` feature → candle metal backend; select_device() picks Device::new_metal(0) (CPU fallback) under the feature, else Device::Cpu. .contiguous() before to_vec2 (Metal rejects strided views; CPU tolerates). - feature passthrough: kebab-app/embed_metal → kebab-cli/embed_metal. Build on macOS: cargo build --release --features embed_metal. - default (non-metal) path unchanged: clippy 0, candle units + thread_cap + parity pass. - README + HOTFIXES: Mac-GPU-ingest → copy sqlite+lancedb → server CPU-query workflow. - version 0.22.0 → 0.23.0 (opt-in build surface). macOS-only compile; Metal execution/speed/parity validated by user on M4 Pro (not buildable on the Linux CI/dev machine). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-02 11:37:08 +00:00
altair823	8f7b6ee538	feat(embed): candle embedding provider (NUMA-safe, opt-in) + v0.22.0. On a dual-socket NUMA server, fastembed (onnxruntime) hardcodes 48 intra-op threads, which corrupts the NUMA heap and kills ingest with a double-free. To avoid this, adds an opt-in embedding provider that runs the same multilingual-e5-large model in pure Rust (candle). - New crate kebab-embed-candle: CandleEmbedder (kebab_core::Embedder). hf-hub safetensors → XLMRobertaModel forward → mask mean-pool → L2 → e5 prefix. The candle dependency tree is isolated in this crate (0 kebab-* dependencies other than core/config). - Thread cap: [models.embedding].num_threads + env KEBAB_EMBED_THREADS → caps the global rayon pool once (the NUMA-safe lever). - kebab-app::embedder() branches on provider (fastembed/onnx/"" → existing path unchanged, candle → CandleEmbedder, unknown value → error). - Removes the Phase 0 spike crate (absorbed into production). - Version 0.21.1 → 0.22.0 (new config surface, pre-1.0 minor bump). Parity: cosine_min=1.000000, max abs diff=2.01e-7 (< 1e-5) → embedding_version kept, zero reindexing. fastembed default behavior/vectors unchanged. No wire schema change. Verification (files + exit code): clippy -D warnings EXIT=0 (0 warnings), test EXIT=0 (candle unit 5 + thread_cap rayon=4 + config 68), parity #[ignore] EXIT=0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-01 14:52:25 +00:00
altair823	f2cc325cf3	feat(cli): kebab config migrate subcommand + wire config_migration.v1. - Cmd::Config { Migrate { --dry-run } }; emits config_migration.v1 with --json. - wire_config_migration (ConfigMigrationReport carries schema_version itself). - Registers config_migration.v1 in WIRE_SCHEMAS in schema.rs + adds the JSON schema file. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 12:09:31 +00:00
altair823	b7e022a5e3	feat(app): config migrate facade + shared init comments + doctor check. - config_migrate_with_config_path: backup (.bak) + atomic write (tmp→rename) + dry-run; round-trip validation preserves the original on failure. Returns ConfigMigrationReport. - init_workspace uses annotated_default_document() (includes section comments). - Adds a config_migration check to doctor (ok=false + hint when out of sync). - The 4 tests in tests/config_migrate.rs (backup/atomic/dry-run/idempotent/doctor) pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 12:09:31 +00:00
altair823	e9b520216e	fix(expansion): per-alias sentinel orphan cleanup + cache robustness (PR #195 review). MAJOR: the chunk_id of alias dense vectors changed from the legacy single `{id}#alias` to per-line `{id}#alias#0`, `#alias#1`, …, but orphan cleanup deleted only the single sentinel, so `#alias#N` vectors leaked in LanceDB / embedding_records. - kebab-app: adds the `alias_sentinel_ids_to_delete` helper (approach A): the delete set includes the body + legacy `{id}#alias` + `{id}#alias#0`..`{id}#alias#{max-1}`. max matches expansion.max_aliases_per_chunk (= the hard cap in parse_aliases). All three LanceDB cleanup paths (parser-bump / edited-asset / deleted-file) use this helper. - kebab-store-sqlite: the 4 explicit embedding_records DELETE paths (put_chunks / purge_*_except_doc_id / purge_orphan_at_workspace_path / purge_deleted_workspace_path) switch from exact match (`\|\| '#alias'`) to a `{id}#alias%` prefix LIKE. Body chunk_ids are 32-char hex, so they contain no LIKE wildcards. MINOR 1: on an alias cache hit, a non-UTF8 payload is demoted to a miss (falls through to the regenerate branch), matching the embedding path's decode-failure→miss demotion. MINOR 2: adds a kind token ("doc") at the front of the embedding version_key; the embedder applies a per-kind prefix, so this prevents collisions if query embeddings ever use the same cache. Regression tests: - kebab-app: 2 unit tests for alias_sentinel_ids_to_delete. - kebab-store-sqlite: 3 integration tests pinning that per-alias sentinel embedding_records disappear on all three cleanup paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 09:14:34 +00:00
altair823	a8fd76499c	feat(expansion): doc-side expansion: individual dense vectors per alias + derivation cache (V012). Aliases are indexed as individual per-line dense vectors (sentinel `{chunk}#alias#N`), and boilerplate chunks skip alias generation. The bundled single-vector approach is dropped because averaging diluted specific expressions and actually regressed (13/18). Variant consistency 14/18 → 16/18, mean_spread@10 0.222 → 0.111 (CS corpus of ~1000 Namuwiki documents). `kebab-core::strip_alias_suffix` handles both the suffix form and the per-alias form. Derivation cache (V012): caches embedding vectors + alias LLM results keyed by chunk content hash, so reindexing skips recomputation for chunks whose content is unchanged. cache_key = blake3(kind ‖ text_blake3 ‖ version_key)[:32]; version_key includes model/prompt/dimensions → consistent with the §9 cascade (automatic miss on version bump). Measured: 3 reference answers, cold 1879s → warm 13s ≈ 145x. Purely additive, so no corpus_revision bump. search/ask work with only kebab.sqlite + lancedb → enables a portable workflow of indexing on an external server and copying only the DB. Workspace version 0.20.2 → 0.21.0 (minor) bump for the V012 schema migration + new surface. README/HANDOFF/ARCHITECTURE/HOTFIXES sync. known limitation: 2 descriptive-style cases (stack, svm) remain + the grounded check misclassifies partial citations as grounded (follow-up candidate). Measurement details: docs/superpowers/handoffs/2026-05-31-namu-wiki-alias-cache-study.md Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 08:24:04 +00:00
altair823	afa8af0f88	feat(app): index alias dense vectors separately + purge (sentinel)	2026-05-30 10:48:58 +00:00
altair823	116b3e6377	fix(app): clippy unused_self: make build_request an associated fn. Passes the CI gate (clippy --workspace --all-targets -D warnings). Behavior unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 03:47:06 +00:00
altair823	cde4d75f6b	feat(app): ingest alias-generation hook (flag off by default, fail-soft)	2026-05-30 03:03:09 +00:00
altair823	bddcd53688	fix(app): parse_aliases prefix stripping corrupted aliases that start with a digit or hyphen (Task 4 review MAJOR-1). Greedy trim_start_matches → explicit strip_list_marker (strips only a marker + space pattern, once). Preserves "3D 렌더링"/"2단계"/"-fast"; removes only the "- "/"1. " markers. 2 regression tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 02:49:25 +00:00
altair823	2a207f9868	feat(app): ExpansionGenerator: per-chunk alias generation (fail-soft) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 02:36:20 +00:00
altair823	dece5e89fc	feat(bulk): document bulk search input schema + error shape hint (Todo #2 ) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 02:16:21 +00:00
altair823	21b52bc285	style: cargo fmt --all (S3+S4+S5+S7 follow-up). Formatting cleanup for the V009 morphological tokenizer work (S3 chunk + S4 backfill + S5 short_query_hint removal + S7 new tests). No behavior change. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S11)	2026-05-28 12:06:01 +00:00
altair823	c5de5f812b	test(fts,app): V009 morphological tokenizer integration tests. Adds 4 new tests: - crates/kebab-store-sqlite/tests/fts.rs: - fts_v009_korean_morphological_2char_query_hits: the 2-char query '한국' hits a chunk whose tokenized_korean_text column is populated. - fts_v009_english_whole_token_only: V007 trigram substring matching regresses (Path A): the query 'token' gets 0 hits on a 'tokenizer' chunk. - crates/kebab-app/tests/search_korean.rs: - korean_morphological_2char_query_lexical_mode: end-to-end ingest of a Korean wiki fixture → the queries '한국' / '서울' hit. - korean_morphological_mixed_english_korean_query: 'Rust' English whole-token + '최적화' Korean morpheme hit. crates/kebab-search/src/lexical.rs: - build_match_string(): MIN_TRIGRAM_CHARS(3) → MIN_QUERY_CHARS(2). V009 unicode61 has no minimum token length, so 2-char Korean morpheme queries must pass. A lone 1-char query is still filtered. - 2 related unit tests updated to V009 behavior. The fixture text depends on the actual segmentation behavior of lindera ko-dic (spec Appendix B prior-knowledge prediction). Fixtures may be adjusted after measurement. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §9.1, §9.2 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S7) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 11:38:52 +00:00
altair823	f94e0c4a9b	feat(app): bump lexical_index_version to V009 (fts5-v009-korean-morphological). In V009 the FTS5 tokenizer changes from trigram to unicode61 + a Korean morpheme-decomposition column. Adds the `fts5-v009-korean-morphological` suffix to the lexical_index_version format to distinguish it from the V007 baseline. The eval runner's config_snapshot and search cache invalidation pick this up automatically. Old format: lex:{chunker_version} New format: lex:{chunker_version}:fts5-v009-korean-morphological. No wire schema shape change (only the string content of SearchHit.index_version changes). The lexical_index_version_is_returned_unchanged test uses an arbitrary IndexVersion string, so it is unchanged. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §11.1, §11.3 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S6)	2026-05-28 11:23:13 +00:00
altair823	923b959610	refactor(app): retire short_query_hint helper, keep wire field as None. With the V009 unicode61 + morphological tokenizer, 2-char Korean queries can now hit, so the V007-era "3 or more characters recommended" hint is obsolete. The SearchResponse.hint field stays in the struct to preserve the wire schema and is always None. - kebab-app/src/app.rs: removes the short_query_hint function + doc-comment. The 2 call sites now set hint = None. - kebab-app/src/lib.rs: removes short_query_hint from the re-exports. - kebab-tui/{app.rs,search.rs,run.rs}: removes the short_query_hint field + the 4-call cascade. - kebab-cli/tests/wire_search_response.rs: removes the search_plain_emits_short_query_hint_to_stderr test. Replaces search_json_emits_hint_field_for_short_query with search_json_hint_absent_for_short_query_v009 (verifies hint is always None). - kebab-search/src/lexical.rs::build_match_string: the V007 trigram multi-token OR-combine branch is redundant under V009 but kept (future extensibility); adds a 1-line doc-comment. No wire schema shape change (the hint field at search_response.schema.json:33 is preserved and always set to None in the struct). Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §7.2, §7.3, §11.3 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S5) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 11:13:45 +00:00
altair823	b63af20b72	feat(app): first-boot eager backfill for tokenized_korean_text. After a V007 → V009 upgrade, tokenized_korean_text is NULL for existing chunks; the first App::open_with_config call automatically tokenizes them with lindera ko-dic and issues an UPDATE. The chunks_au trigger re-indexes chunks_fts automatically. No user re-ingest needed. - crates/kebab-store-sqlite/src/store.rs: backfill_tokenized_korean_text(progress_cb, tokenize) API. Commits and invokes the progress callback every 1000 rows. idempotent (the IS NULL filter makes re-running after partial completion safe). Takes the tokenizer as a parameter to keep the §8 dep boundary. - crates/kebab-app/src/app.rs::open_with_config: calls backfill right after run_migrations. On failure only a warn log is emitted (App open still succeeds; vector/hybrid mode keeps working). Info-log progress every 500 rows. - crates/kebab-store-sqlite/tests/fts.rs: backfill_tokenized_korean_text_populates_nullable_rows unit test (including idempotency). - Fixes pre-existing clippy errors (redundant_closure, map_unwrap_or, cast_lossless, uninlined_format_args in kebab-app/ingest_log.rs, pdf_ocr_apply.rs, app.rs, tests/ocr_inspect_smoke.rs). Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §8.1, §8.2 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S4) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 11:01:00 +00:00
altair823	e8f44a57e3	fix(test): update first_ingest_bumps_corpus_revision baseline for V009 V004 seeds corpus_revision=0, V009 migration bumps to 1 (spec §5.2 — LRU cache invalidation). Test previously asserted fresh store = 0; now reads post-migration baseline dynamically and verifies that the ingest commit increments past it. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §5.2 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 follow-up)	2026-05-28 10:47:09 +00:00
altair823	597d8b70ad	feat(deps): add lindera + lindera-ko-dic for korean morphological tokenizer. Only adds the workspace dependencies; actual use comes in S3 via the kebab-chunk tokenize_korean_morphological() helper. - Cargo.toml (workspace): adds lindera = "3", lindera-ko-dic = "3". - crates/kebab-chunk/Cargo.toml: per-crate dep (the embed-ko-dic feature on lindera-ko-dic enables the embedded KO-DIC dictionary blob). - crates/kebab-app/Cargo.toml: adds fts_korean_morphological to [features] (spec §6.3 Option A: marker role only, no disable path). License: lindera = MIT, lindera-ko-dic = MIT (confirmed via cargo info). Introducing cargo deny is a P9 follow-up. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.1, §10.1 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S2) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 10:03:58 +00:00
altair823	9a36a06f97	style: cargo fmt --all (v0.20.x logging r2 feature follow-up) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 06:34:01 +00:00
altair823	35c987df1c	feat(app): log retention — keep_recent_runs + retention_days (Enhancement 4) LoggingCfg gains two fields with serde defaults: keep_recent_runs (default 100, top-N file retention) and retention_days (default 30, time-based retention for both ndjson files and the SQLite mirror). IngestLogWriter::open now runs cleanup_old_logs before creating a new ingest-*.ndjson — delete iff (idx >= keep_recent) OR (modified <= cutoff). ingest_with_config_opts also calls SqliteStore::prune_pdf_ocr_events(retention_days) at ingest start so the SQLite mirror tracks the same retention window. Backward compat (AC-9): both new fields use #[serde(default = ...)], so a pre-v0.20.x config with only [logging] ingest_log_enabled + ingest_log_dir parses unchanged. kebab init writes the new defaults automatically via Config::default() -> toml::to_string_pretty (AC-12). docs/SMOKE.md config example synced. Closure r1 F5: explicit OR-on-stale comment inside cleanup_old_logs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 06:17:47 +00:00
altair823	d9ec7b8dc3	feat(cli): kebab inspect ocr-stats + ocr-failures (Enhancement 3 + wire schema additive minor) Two new wire schemas land as additive minor: ocr_stats.v1 (corpus-wide aggregate — total_events, success_rate, p50/p90/p99/max_ms, by_engine, top-10 by_doc by failure count) and ocr_failures.v1 (per-doc or corpus-wide recent failures, with --doc-id + --limit). Both ship via new CLI subcommands `kebab inspect ocr-stats` / `inspect ocr-failures`. App gains four facade methods: inspect_ocr_stats / inspect_ocr_failures plus their _with_config companions — required by CLAUDE.md "the facade rule" so `--config <path>` is honored. The CLI dispatch arms thread cfg explicitly into the _with_config form. Runtime introspection emit (WIRE_SCHEMAS in schema.rs) gains two entries; the meta JSON Schema (schema.schema.json) is untouched because its wire.schemas is pattern-based, not enum-based. ingest_log::percentiles extended to (p50, p90, p99, max). p99 surfaces only via inspect ocr-stats; IngestSummary (round 1) stays 3-percentile. SKILL.md synced with the two new schemas (AC-13). Closure r2 G2 (facade _with_config pair) + G3 (runtime emit, not meta schema file) + closure r1 F4 (p99) resolved. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 06:13:08 +00:00
altair823	4e451c9f7c	feat(app): dual-write PDF OCR events to SQLite + ndjson (Enhancement 2 wiring) Pre-capture canonical.doc_id and Arc<SqliteStore> before the OCR emit_progress closure so both the ndjson file and the SQLite mirror carry the same doc_id for every event. File write is durable (errors propagate); SQLite insert is non-critical (tracing::warn on failure, ingest does not abort) per spec R-1. LogEvent::Ocr gains a doc_id: Option<&str> field as an additive Serde change — round 1 ndjson logs deserialize with doc_id=None. Closure r1 F1: doc_id NULL in dual-write resolved via let doc_id_for_log = canonical.doc_id.0.clone() pre-capture. Closure r2 G1: Arc::clone(&app.sqlite) reused instead of opening a second SqliteStore — eliminates double-open lock contention and duplicate migration runs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 06:06:03 +00:00
altair823	5977c8cdf1	feat(app): capture image_width/height in PDF OCR raster decode (Enhancement 1) Add extract_image_dimensions(bytes) helper using image::ImageReader and fill the 2 PdfOcrProgress::Finished emit points in pdf_ocr_apply.rs where page_image_bytes is in scope (OCR error path + success path). The no-DCTDecode skip path leaves None as page_image_bytes is absent. Result: LogEvent::Ocr carries non-null image_width/image_height on successful raster decode, enabling future size-conditioned timeout tuning. Closure r1 F3: kebab-app/Cargo.toml image features += "jpeg" added as direct [dependencies] entry (not relying on feature unification via kebab-parse-image). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 05:54:55 +00:00
altair823	685007789a	style: cargo fmt --all (round 4 ingest log feature follow-up). The last `fix(test): clippy + fmt fixes` commit from the Phase C4 executor applied fmt only to the test files. Found that workspace-wide fmt was missed → applied cargo fmt --all. All imports reordered alphabetically + line wrapping aligned. Untracked artifacts committed at the same time: - docs/superpowers/specs/2026-05-28-v0.20-ingest-log-spec.md (491 lines, ACCEPT) - docs/superpowers/plans/2026-05-28-v0.20-ingest-log-plan.md (616 lines, ACCEPT) workspace test: 1370 passed / 0 failed / 50 ignored, ingest_log_smoke green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 04:18:40 +00:00
altair823	445b096215	fix(test): clippy + fmt fixes for logging_roundtrip and ingest_log_smoke * kebab-config/tests/logging_roundtrip.rs: r#"..."# → plain string (clippy::unnecessary_hashes). * kebab-app/tests/ingest_log_smoke.rs: \|e\| e.ok() → Result::ok, \|s\| s.as_u64() → Value::as_u64 (clippy::redundant_closure). * cargo fmt --all applied to pre-existing formatting drift. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 03:26:42 +00:00
altair823	415227bf76	test(app): ingest_log_smoke integration test (AC-9). New crates/kebab-app/tests/ingest_log_smoke.rs: * ingest_log_smoke (AC-9): tempdir + 1 md + 1 scanned PDF → ingest → assert log file exists + each line is valid JSON + each kind ∈ {ocr,parse_error,skip,error,summary} + last line kind=summary + scanned>0. * ingest_log_disabled_emits_no_file (AC-6): verifies that with enabled=false there are 0 ingest-*.ndjson files in log_dir. fixture: reuses ../kebab-parse-pdf/tests/fixtures/scanned_page1.pdf (OCR disabled; the smoke test runs without Ollama). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 03:06:43 +00:00
altair823	f9dc0f749f	feat(app): wire IngestLogWriter into 5 ingest emit hooks (Arc<Mutex> sync). Ingest pipeline wiring for the v0.20.x ingest log feature. 5 emit hooks: Hook 1: ingest_with_config_opts entry/exit (writer init + summary write + flush) Hook 2: apply_ocr_to_pdf_pages closure (PdfOcrProgress::Finished → LogEvent::Ocr) Hook 3: ingest_one_*_asset Err arm (LogEvent::Error) Hook 4: enumerate fs_skips.events right after scan (LogEvent::Skip) Hook 5: (merged into Hook 3) per-asset fatal error → LogEvent::Error. To carry Hook 4's skip events, adds an events: Vec<FsSkipEvent> field to FsScanSkips in kebab-source-fs (kebab-source-fs does not call back into kebab-app, avoiding a cycle). Ownership: a single Option<Arc<Mutex<IngestLogWriter>>> binding; the 5 hooks clone+lock+write. ocr_ms_samples (Vec<u64>, success-only) is shared via Arc<Mutex>, and the summary stage sorts it and computes p50/p90/max. The per-asset loop is single-threaded, so there is no deadlock/contention risk. Writer failures do not fail the ingest itself (tracing::warn + continue). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 03:05:07 +00:00
altair823	bef0c98867	feat(wire): PdfOcrProgress.Finished + ingest_progress.v1 additive 4 fields. Wire side of the v0.20.x ingest log feature. additive minor cascade: * 4 fields on PdfOcrProgress::Finished + IngestEvent::PdfOcrFinished: - image_byte_size: Option<u64> - image_width: Option<u32> - image_height: Option<u32> - failure_reason: Option<String> * docs/wire-schema/v1/ingest_progress.schema.json: 4 added properties (all optional, required unchanged = additive minor) * integrations/claude-code/kebab/SKILL.md: wire schema description synced. Existing ingest_progress.v1 consumers (CLI wire dump, integration test fixtures, kebab-cli wire_search/wire_ask) stay backward-compatible because the 4 added fields are Option::None. No version bump (additive minor is not a binary-version cascade trigger per CLAUDE.md §Versioning cascade). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 02:57:59 +00:00
altair823	f8a4c79727	feat(app): IngestLogWriter + LogEvent enum (per-ingest-run ndjson log). Module side of the v0.20.x ingest log surface. New crates/kebab-app/src/ingest_log.rs: * IngestLogWriter: open/write_event/write_summary/flush + Drop flush * LogEvent enum with 4 variants (ocr / parse_error / skip / error) * IngestSummary struct (kind="summary" literal + 11 stat fields) * generate_run_id (ISO 8601 prefix + last 8 hex of a uuid v7) * expand_log_dir (hand-rolled expansion of the {state_dir} placeholder) * now_ts (Rfc3339 UTC helper) * percentiles helper (sorted Vec p50/p90/max) uuid v7 is a workspace dep, which avoids a new rand dependency (spec §6 R-5). This step is the self-contained writer + 5 unit tests. Wiring the 5 emit hooks into the ingest pipeline is step 4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 02:53:09 +00:00
altair823	9b44e27dfe	test(app): update schema_report assertion for streaming_ask=true (Bug #9 follow-up). schema_report_reflects_freshly_ingested_kb asserted `!streaming_ask`, but the Bug #9 fix (`760eee8`) corrected streaming_ask to true. The assertion is inverted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 23:58:10 +00:00
altair823	d9c7aabce1	feat(schema): add active_parsers + active_chunkers arrays to schema.v1.models (Bug #13). Before: schema.v1.models reported only single parser_version / chunker_version values → risk of missing entries in the version cascade audit for a multi-medium corpus (md + pdf + code Rust/Python + dockerfile + k8s + manifest). After: additive minor: the Models struct gains active_parsers + active_chunkers Vec<String>. backward compat: the existing single fields are kept (markdown default), and the new arrays are optional (#[serde(default)] + not in the JSON schema required list). source: - kebab_store_sqlite::fetch_distinct_parser_versions() returns documents.parser_version DISTINCT + ORDER BY. - fetch_distinct_chunker_versions() follows the same pattern for chunks.chunker_version. - collect_models recomputes on every schema call (no cache; R-3 resolved automatically). wire schema additive only: no major bump needed. The v0.20.1 minor is sufficient. integrations/claude-code/kebab/SKILL.md synced. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 23:15:58 +00:00
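
Commits e7cb20990a and 03b0745e9d describe a composite parser version of the form "{base}|{signature}", with tests comparing only the base prefix via split('|').next(). A minimal sketch of that idea follows. The function signatures are simplified assumptions for illustration: the commit lists effective_parser_version(config, asset, base), whereas this sketch takes an already computed signature string, and base_parser_version and can_skip are hypothetical helpers.

```rust
/// Fold a settings signature into the base parser version.
/// Simplified, hypothetical signature: the real function is described
/// as taking (config, asset, base); here the signature is precomputed.
fn effective_parser_version(base: &str, signature: &str) -> String {
    format!("{base}|{signature}")
}

/// Recover the base version: everything before the first '|'.
/// A bare value without '|' is returned unchanged.
fn base_parser_version(stored: &str) -> &str {
    stored.split('|').next().unwrap_or(stored)
}

/// Hypothetical skip check: an asset is unchanged only when the stored
/// composite equals the composite computed from the current settings.
fn can_skip(stored: &str, base: &str, signature: &str) -> bool {
    stored == effective_parser_version(base, signature)
}
```

Because only the composite changes when a setting changes, anything derived from the base alone (the commit mentions doc_id) stays stable across setting changes.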


184 Commits
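
Commit 35c987df1c states the log retention rule as "delete iff (idx >= keep_recent) OR (modified <= cutoff)". The sketch below shows one way that rule could be applied to a list of log files. It assumes that idx is the file's rank when sorted newest first and that the cutoff is now minus retention_days; should_delete and select_for_deletion are hypothetical names, not the project's cleanup_old_logs implementation.

```rust
use std::time::{Duration, SystemTime};

/// Retention rule quoted in the commit message:
/// delete iff (idx >= keep_recent) OR (modified <= cutoff).
fn should_delete(idx: usize, modified: SystemTime, keep_recent: usize, cutoff: SystemTime) -> bool {
    idx >= keep_recent || modified <= cutoff
}

/// Hypothetical driver: rank logs newest first, then apply the rule.
/// Returns the names of the logs that would be deleted.
fn select_for_deletion(
    mut logs: Vec<(String, SystemTime)>,
    keep_recent: usize,
    retention_days: u64,
    now: SystemTime,
) -> Vec<String> {
    // Assumption: idx 0 is the most recently modified file.
    logs.sort_by(|a, b| b.1.cmp(&a.1));
    // Assumption: cutoff = now - retention_days, clamped at the epoch.
    let cutoff = now
        .checked_sub(Duration::from_secs(retention_days.saturating_mul(86_400)))
        .unwrap_or(SystemTime::UNIX_EPOCH);
    logs.into_iter()
        .enumerate()
        .filter(|(idx, (_, modified))| should_delete(*idx, *modified, keep_recent, cutoff))
        .map(|(_, (name, _))| name)
        .collect()
}
```

The OR means a stale file is removed even when it is within the newest keep_recent files, which matches the "explicit OR-on-stale comment" the commit mentions.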