kebab

Author	SHA1	Message	Date
altair823	d13eb87401	docs(v0.20.x): sync README + HANDOFF + ARCH + SKILL + HOTFIXES for V009 Cascade the user-visible surface changes of the V009 Korean morphological tokenizer and the release notes scope to 5 docs. - README.md: note 2-character Korean query support in the kebab search command row. - integrations/claude-code/kebab/SKILL.md: remove the V007 3-char hint + add 1 line on V009 2-character Korean query support. - HANDOFF.md: flip the C task status to done + add this change to the v0.20.1 release notes scope + add a post-merge findings summary row. - docs/ARCHITECTURE.md: add the embedding upgrade (e5-small → e5-large), lindera-ko-dic FTS5 Korean support, and version notes. - tasks/HOTFIXES.md: 2026-05-28 entry covering Bug #8 resolved by V009, lindera-ko-dic as the actual crate name (spec deviation), cargo-deny deferred, and the Path A English substring regression. Spec: tasks/p9/p9-9-v0.20.x-korean-morphological-tokenizer-spec.md §7.4 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-05-28 11:55:25 +00:00
altair823	26f3a7756c	test(eval): regenerate runner_per_query_snapshot for V009 baseline Intended cascade of the BM25 distribution + hit ordering changes caused by the V009 FTS5 tokenizer (trigram → unicode61 + morphemes). Regenerate the eval runner's per-query snapshot (run-1.json) against the V009 baseline. Regeneration procedure: `UPDATE_SNAPSHOTS=1 cargo test -p kebab-eval --test runner runner_per_query_snapshot_matches_fixture -j 4`. 0 regressions: all tests in the kebab-eval workspace (runner / loader / metrics_and_compare) pass. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §11.3 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S8)	2026-05-28 11:51:49 +00:00
altair823	881f949fcb	test(search): regenerate lexical_snapshot_run_1 for V009 BM25 distribution The V009 FTS5 tokenizer changed from trigram to unicode61 + a tokenized_korean_text column, which changes the BM25 IDF computation, so the floating-point fusion_score values differ slightly (hit order, snippets, and structure are identical). Regenerate the snapshot, whose update was missed right after the S1 merge, against the V009 baseline. No regressions: `cargo test -p kebab-search --test lexical lexical_snapshot_run_1 -j 4` exits 0. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §11.3 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S1 follow-up via S7 reviewer)	2026-05-28 11:49:15 +00:00
altair823	c5de5f812b	test(fts,app): V009 morphological tokenizer integration tests Add 4 new tests: - crates/kebab-store-sqlite/tests/fts.rs: - fts_v009_korean_morphological_2char_query_hits: the 2-char query '한국' hits a chunk whose tokenized_korean_text column is populated. - fts_v009_english_whole_token_only: pins the regression from V007 trigram substring matching (Path A), where the query 'token' gets 0 hits on a 'tokenizer' chunk. - crates/kebab-app/tests/search_korean.rs: - korean_morphological_2char_query_lexical_mode: end-to-end Korean wiki fixture ingest → the queries '한국' / '서울' hit. - korean_morphological_mixed_english_korean_query: 'Rust' English whole-token + '최적화' Korean morpheme hit. crates/kebab-search/src/lexical.rs: - build_match_string(): MIN_TRIGRAM_CHARS(3) → MIN_QUERY_CHARS(2). V009 unicode61 has no minimum token length, so 2-character Korean morpheme queries must pass through. A lone 1-character token is still filtered out. - Update 2 related unit tests to the V009 behavior. The fixture text depends on the actual segmentation behavior of lindera ko-dic (predicted from prior knowledge in spec Appendix B). The fixtures may be adjusted after real measurement. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §9.1, §9.2 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S7) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 11:38:52 +00:00
altair823	f94e0c4a9b	feat(app): bump lexical_index_version to V009 (fts5-v009-korean-morphological) The V009 FTS5 tokenizer changed from trigram to unicode61 + a Korean morpheme-decomposition column. Add the `fts5-v009-korean-morphological` suffix to the lexical_index_version format to distinguish it from the V007 baseline. The eval runner's config_snapshot and search cache invalidation pick this up automatically. Old format: `lex:{chunker_version}`. New format: `lex:{chunker_version}:fts5-v009-korean-morphological`. No wire schema shape change (only the string content of SearchHit.index_version changes). The lexical_index_version_is_returned_unchanged test uses an arbitrary IndexVersion string, so it is unchanged. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §11.1, §11.3 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S6)	2026-05-28 11:23:13 +00:00
altair823	923b959610	refactor(app): retire short_query_hint helper, keep wire field as None With the V009 unicode61 + morphological tokenizer, 2-char Korean queries can now hit, so the V007-era "3 or more characters recommended" hint is obsolete. The SearchResponse.hint field stays in the struct to preserve the wire schema and is always None. - kebab-app/src/app.rs: delete the short_query_hint function + doc-comment. The 2 call sites now set hint = None. - kebab-app/src/lib.rs: remove short_query_hint from the re-exports. - kebab-tui/{app.rs,search.rs,run.rs}: remove the short_query_hint field + the cascade of 4 calls. - kebab-cli/tests/wire_search_response.rs: delete the search_plain_emits_short_query_hint_to_stderr test. Replace search_json_emits_hint_field_for_short_query with search_json_hint_absent_for_short_query_v009 (verifies that hint is always None). - kebab-search/src/lexical.rs::build_match_string: the V007 trigram multi-token OR-combine branch is redundant under V009 but is kept (for future extensibility); add 1 doc-comment line. No wire schema shape change (the hint field at search_response.schema.json:33 is preserved and always set to None in the struct). Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §7.2, §7.3, §11.3 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S5) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 11:13:45 +00:00
altair823	b63af20b72	feat(app): first-boot eager backfill for tokenized_korean_text On a V007 → V009 upgrade, tokenized_korean_text is NULL for existing chunks, so the first App::open_with_config call automatically decomposes them with lindera ko-dic and runs UPDATE. The chunks_au trigger re-indexes chunks_fts automatically. No user re-ingest is needed. - crates/kebab-store-sqlite/src/store.rs: backfill_tokenized_korean_text(progress_cb, tokenize) API. Commits and fires the progress callback every 1000 rows. Idempotent (the IS NULL filter makes re-running after partial completion safe). Takes the tokenizer as a parameter to keep the §8 dep boundary. - crates/kebab-app/src/app.rs::open_with_config: call the backfill right after run_migrations. On failure, only a warn log is emitted (App open still succeeds, so vector/hybrid mode keeps working). Info-log progress every 500 rows. - crates/kebab-store-sqlite/tests/fts.rs: backfill_tokenized_korean_text_populates_nullable_rows unit test (including idempotency). - Fix pre-existing clippy errors (redundant_closure, map_unwrap_or, cast_lossless, uninlined_format_args in kebab-app/ingest_log.rs, pdf_ocr_apply.rs, app.rs, tests/ocr_inspect_smoke.rs). Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §8.1, §8.2 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S4) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 11:01:00 +00:00
altair823	e8f44a57e3	fix(test): update first_ingest_bumps_corpus_revision baseline for V009 V004 seeds corpus_revision=0, V009 migration bumps to 1 (spec §5.2 — LRU cache invalidation). Test previously asserted fresh store = 0; now reads post-migration baseline dynamically and verifies that the ingest commit increments past it. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §5.2 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 follow-up)	2026-05-28 10:47:09 +00:00
altair823	4b4a8cbb3a	fix(test): retire V007 trigram-only fts tests, add V009 unicode61 sanity The 3 tests that verified substring matching under the V007 trigram tokenizer are obsolete, because V009 unicode61 introduces an intended regression (spec §3 Non-Goals Path A): - fts_trigram_korean_3char_substring_hits: the '발생한' → '발생한다' hit relied on trigram substring matching, so it fails under V009 whole-token matching. - fts_trigram_korean_short_query_zero_hit_pinned: the 0-hit behavior for 2-char Korean queries is resolved by the V009 morpheme column, so the pin itself is obsolete (S7 replaces it with a new 2-char hit test). - fts_trigram_english_substring_hits: the 'token' → 'tokenizer' hit fails under V009 unicode61 whole-token-only matching. Newly added: - fts_v009_unicode61_space_separated_korean_token_hits: sanity check for V009 unicode61 whole-token matching (token '충돌은' hits, substring '발생한' gets 0 hits). This is a baseline separate from the morphological verification tests that S7 will add. S7 (plan §2 Step 7) will add v009_korean_morphological_2char_query_hits + v009_english_whole_token_only to pin both the regression and the new behavior. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §3, §9.2 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 follow-up)	2026-05-28 10:36:32 +00:00
altair823	4dc1c10be1	fix(test): update corpus_revision baseline to post-V009 migration V009 migration bumps corpus_revision by 1 at apply time (spec §5.2 — invalidates pre-V009 LRU search cache). Existing tests assumed V004 seed (0) was the final baseline; updated to expect 1 after migration: - fresh_store_starts_at_zero → fresh_store_starts_at_post_migration_baseline - bump_increments_monotonically: expected 1,2,3 → 2,3,4 (post-baseline) - revision_persists_across_reopen: 2 → 3 (V009 +1, +bump,+bump) Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §5.2 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 follow-up)	2026-05-28 10:33:50 +00:00
altair823	bd86f61c9c	fix(chunk): close S3 reviewer blockers: get_chunk read + AST chunker cascade The S3 spec compliance reviewer (sonnet) found 2 blockers: 1. crates/kebab-store-sqlite/src/documents.rs: the get_chunk SELECT did not fetch the tokenized_korean_text column, so the DB value was lost on read. Add the column to the SELECT list and the row.get index in the row → Chunk conversion. Cascades through the ChunkRow struct + chunk_row_from_sql + get_chunk Chunk construction. 2. crates/kebab-chunk/src/code_*_ast_v1.rs (9 files): make_chunk hardcoded tokenized_korean_text: None, so code files with Korean comments got no FTS hits. Cascade a tokenize_korean_morphological(text) call using the same pattern as tier2_shared. This commit is a rework of S3, made as a separate commit rather than an amend (keeps the S3 boundary). Satisfies the spec §6.2 invariant ("every chunker calls tokenize right before emitting a chunk"). Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 rework) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 10:30:53 +00:00
altair823	b134ae9dd5	feat(chunk): integrate lindera korean morphological tokenizer Decompose text with lindera ko-dic into the morpheme sequence stored in the V009 tokenized_korean_text column. Called in the chunk builder pipeline right after chunk creation → pre-fills the chunk struct field → the store's put_chunks INSERTs it within a single transaction. - crates/kebab-core/src/chunk.rs: add the tokenized_korean_text: Option<String> field to the Chunk struct (#[serde(default)]). - crates/kebab-chunk/src/lib.rs: tokenize_korean_morphological() helper + OnceLock caching + fallback (None) policy. - crates/kebab-chunk/Cargo.toml: add lindera features = ["embed-ko-dic"] (required to enable DictionaryKind::KoDic). - All chunkers (tier2_shared, md_heading_v1, pdf_page_v1, 9 code AST v1): pre-fill tokenized_korean_text in the Chunk literal. - crates/kebab-store-sqlite/src/documents.rs::put_chunks: update the INSERT SQL column list + placeholders + bindings (12th column). - crates/kebab-chunk/tests/tokenize_korean.rs: 2 unit tests. lindera 3.0.7 API corrections: load_dictionary_from_kind → load_embedded_dictionary, Token.text → Token.surface. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 10:22:15 +00:00
altair823	597d8b70ad	feat(deps): add lindera + lindera-ko-dic for korean morphological tokenizer Adds workspace dependencies only; the actual use is the kebab-chunk tokenize_korean_morphological() helper in S3. - Cargo.toml (workspace): add lindera = "3", lindera-ko-dic = "3". - crates/kebab-chunk/Cargo.toml: per-crate dep (the embed-ko-dic feature on lindera-ko-dic enables the embedded KO-DIC dictionary blob). - crates/kebab-app/Cargo.toml: add fts_korean_morphological to [features] (spec §6.3 Option A: marker role only, no disable path). License: lindera = MIT, lindera-ko-dic = MIT (confirmed via cargo info). Adopting cargo deny is a P9 follow-up. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.1, §10.1 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S2) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 10:03:58 +00:00
altair823	b106120e93	feat(fts): add V009 korean morphological tokenizer migration Add the V009 migration to resolve the V007 trigram tokenizer's 0-hit limitation for 2-character Korean queries (Bug #8). The approach reverts to the unicode61 tokenizer and pre-fills the Korean morpheme-decomposition result into a separate `tokenized_korean_text` column. - migrations/V009__fts_korean_morphological.sql (new): column ADD, chunks_fts DROP + redefinition, CASE expressions in 3 triggers, backfill INSERT, corpus_revision bump. - design §5.5 updated: trigram → unicode61 + morpheme column. CASE-expression trigger bodies. - crates/kebab-store-sqlite/tests/fts.rs: rename the V007 verbatim test to use V009 as the source of truth. Add the v009_bumps_corpus_revision unit test. - store.rs: fix pre-existing clippy bool_to_int_with_if + cast_lossless warnings (pdf_ocr_events-related code, found during S1 work). English substring matching regresses to the V002 behavior (whole-token only); this is stated honestly in spec §3 Non-Goals and in the upcoming release notes (v0.20.1). Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S1) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 09:48:46 +00:00
altair823	43366b1b15	docs(handoff): new-session handoff for C, Korean morphological tokenizer (Bug #8) Self-contained context for the next priority C (Korean morphological tokenizer), after merging v0.20.0 sub-item 1 + bugfix 1~4 + ingest log r1+r2. Covers the new session's first step + workflow patterns + environment/memory references + cascade risk + possible fix paths (Option A jieba-rs / B bi-gram supplement / C query-side expansion). Same spec/plan/executor cycle pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 08:18:04 +00:00
altair823	70507e94ca	docs(handoff): logging round 2 closed (PR #190 merged) + v0.20.1 release notes scope update After merging PR #190 (2026-05-28 commit `7bbdc89a`): - "Logging feature future enhancements" section → ✅ closed (all 4 enhancements shipped in PR #190). - Update the release notes scope in the G (v0.20.1 release) section: add V008 migration + pdf_ocr_events + CLI inspect + retention. Next work priorities remain C → B → A → G. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 08:15:07 +00:00
altair823	7bbdc89ae3	Merge pull request 'feat: ingest log round 2 — image_w/h + V008 SQLite mirror + CLI inspect + retention' (#190 ) from feat/ingest-log-round2-enhancements into main Reviewed-on: #190	2026-05-28 08:12:25 +00:00
altair823	7c24734cc7	docs(superpowers): v0.20.x logging r2 spec + plan artifacts Round artifacts after spec/plan ACCEPT for logging round 2 (4 enhancements: image_w/h + V008 SQLite mirror + CLI inspect + retention). - spec: 751 lines (ACCEPT, 7/7 critic round 1 findings + 7/7 closure r2 traceability) - plan: 576 lines (ACCEPT, 6/6 steps + 13/13 AC + G1/G2/G3 resolved at plan level) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 08:04:32 +00:00
altair823	9a36a06f97	style: cargo fmt --all (v0.20.x logging r2 feature follow-up) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 06:34:01 +00:00
altair823	35c987df1c	feat(app): log retention — keep_recent_runs + retention_days (Enhancement 4) LoggingCfg gains two fields with serde defaults: keep_recent_runs (default 100, top-N file retention) and retention_days (default 30, time-based retention for both ndjson files and the SQLite mirror). IngestLogWriter::open now runs cleanup_old_logs before creating a new ingest-*.ndjson — delete iff (idx >= keep_recent) OR (modified <= cutoff). ingest_with_config_opts also calls SqliteStore::prune_pdf_ocr_events(retention_days) at ingest start so the SQLite mirror tracks the same retention window. Backward compat (AC-9): both new fields use #[serde(default = ...)], so a pre-v0.20.x config with only [logging] ingest_log_enabled + ingest_log_dir parses unchanged. kebab init writes the new defaults automatically via Config::default() -> toml::to_string_pretty (AC-12). docs/SMOKE.md config example synced. Closure r1 F5: explicit OR-on-stale comment inside cleanup_old_logs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 06:17:47 +00:00
altair823	d9ec7b8dc3	feat(cli): kebab inspect ocr-stats + ocr-failures (Enhancement 3 + wire schema additive minor) Two new wire schemas land as additive minor: ocr_stats.v1 (corpus-wide aggregate — total_events, success_rate, p50/p90/p99/max_ms, by_engine, top-10 by_doc by failure count) and ocr_failures.v1 (per-doc or corpus-wide recent failures, with --doc-id + --limit). Both ship via new CLI subcommands `kebab inspect ocr-stats` / `inspect ocr-failures`. App gains four facade methods: inspect_ocr_stats / inspect_ocr_failures plus their _with_config companions — required by CLAUDE.md "the facade rule" so `--config <path>` is honored. The CLI dispatch arms thread cfg explicitly into the _with_config form. Runtime introspection emit (WIRE_SCHEMAS in schema.rs) gains two entries; the meta JSON Schema (schema.schema.json) is untouched because its wire.schemas is pattern-based, not enum-based. ingest_log::percentiles extended to (p50, p90, p99, max). p99 surfaces only via inspect ocr-stats; IngestSummary (round 1) stays 3-percentile. SKILL.md synced with the two new schemas (AC-13). Closure r2 G2 (facade _with_config pair) + G3 (runtime emit, not meta schema file) + closure r1 F4 (p99) resolved. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 06:13:08 +00:00
altair823	4e451c9f7c	feat(app): dual-write PDF OCR events to SQLite + ndjson (Enhancement 2 wiring) Pre-capture canonical.doc_id and Arc<SqliteStore> before the OCR emit_progress closure so both the ndjson file and the SQLite mirror carry the same doc_id for every event. File write is durable (errors propagate); SQLite insert is non-critical (tracing::warn on failure, ingest does not abort) per spec R-1. LogEvent::Ocr gains a doc_id: Option<&str> field as an additive Serde change — round 1 ndjson logs deserialize with doc_id=None. Closure r1 F1: doc_id NULL in dual-write resolved via let doc_id_for_log = canonical.doc_id.0.clone() pre-capture. Closure r2 G1: Arc::clone(&app.sqlite) reused instead of opening a second SqliteStore — eliminates double-open lock contention and duplicate migration runs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 06:06:03 +00:00
altair823	6482bf1321	feat(store): V008 pdf_ocr_events migration + record/prune API (Enhancement 2) Add migrations/V008__pdf_ocr_events.sql with the events table + 3 indices (doc_id, run_id, ts). SqliteStore gains two pub fn: record_pdf_ocr_event (insert one OCR sample) and prune_pdf_ocr_events (delete rows older than retention_days; returns the affected row count). Both follow the existing Mutex<Connection> lock pattern. Wiring into ingest path lands in the next commit. Closure r1 F2: explicit lock acquisition in both methods. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 05:56:54 +00:00
altair823	5977c8cdf1	feat(app): capture image_width/height in PDF OCR raster decode (Enhancement 1) Add extract_image_dimensions(bytes) helper using image::ImageReader and fill the 2 PdfOcrProgress::Finished emit points in pdf_ocr_apply.rs where page_image_bytes is in scope (OCR error path + success path). The no-DCTDecode skip path leaves None as page_image_bytes is absent. Result: LogEvent::Ocr carries non-null image_width/image_height on successful raster decode, enabling future size-conditioned timeout tuning. Closure r1 F3: kebab-app/Cargo.toml image features += "jpeg" added as direct [dependencies] entry (not relying on feature unification via kebab-parse-image). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 05:54:55 +00:00
altair823	89d334a92b	docs(handoff): next priorities after v0.20.0 sub-item 1 merge (C/B/A/G order) Record the next work priorities after merging PR #189 (2026-05-28 commit `09333d0`): - C (next up): Korean morphological tokenizer (follow-up to Bug #8, the V007 trigram 2-char limitation) - B: OCR dense page coverage (metro-korea.pdf page 8/13 timeout: dynamic max_pixels / column-level OCR / gradual timeout reduction) - A: deferred v0.20 sub-items (sub-item 2 multi-region image / sub-item 3 PDF normalize integration / TODO #4 figure-table / TODO #5 Enricher trait) - G: v0.20.1 patch release + release notes (180s timeout + [logging] section + wire schema additive + CLI fix) Non-blocking known issues: Bug #12 falsified, ask phrasing-sensitive refusal. Follow-up enhancements for the logging feature (low priority): image_width/height capture, SQLite mirror, CLI inspect query, log retention. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 04:48:44 +00:00
altair823	09333d0b05	Merge pull request 'feat(pdf): scanned PDF OCR via qwen2.5vl:3b vision LLM (v0.20.0 sub-item 1)' (#189) from feat/pdf-scanned-ocr into main Reviewed-on: #189	2026-05-28 04:37:41 +00:00
altair823	685007789a	style: cargo fmt --all (round 4 ingest log feature follow-up) The Phase C4 executor's last commit, `fix(test): clippy + fmt fixes`, applied fmt only to the test files. Workspace-wide fmt was found to be missing → applied cargo fmt --all. All imports reordered alphabetically + line wrapping aligned. Additional untracked artifacts committed together: - docs/superpowers/specs/2026-05-28-v0.20-ingest-log-spec.md (491 lines, ACCEPT) - docs/superpowers/plans/2026-05-28-v0.20-ingest-log-plan.md (616 lines, ACCEPT) workspace test: 1370 passed / 0 failed / 50 ignored, ingest_log_smoke green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 04:18:40 +00:00
altair823	445b096215	fix(test): clippy + fmt fixes for logging_roundtrip and ingest_log_smoke * kebab-config/tests/logging_roundtrip.rs: r#"..."# → plain string (clippy::unnecessary_hashes). * kebab-app/tests/ingest_log_smoke.rs: |e| e.ok() → Result::ok, |s| s.as_u64() → Value::as_u64 (clippy::redundant_closure). * cargo fmt --all applied to pre-existing formatting drift. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 03:26:42 +00:00
altair823	415227bf76	test(app): ingest_log_smoke integration test (AC-9) New file crates/kebab-app/tests/ingest_log_smoke.rs: * ingest_log_smoke (AC-9): tempdir + 1 md + 1 scanned PDF → ingest → assert the log file exists + each line is valid JSON + each kind ∈ {ocr,parse_error,skip,error,summary} + the last line has kind=summary + scanned>0. * ingest_log_disabled_emits_no_file (AC-6): verify that log_dir contains 0 ingest-*.ndjson files when enabled=false. fixture: reuses ../kebab-parse-pdf/tests/fixtures/scanned_page1.pdf (OCR disabled, so the smoke test runs without Ollama). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 03:06:43 +00:00
altair823	f9dc0f749f	feat(app): wire IngestLogWriter into 5 ingest emit hooks (Arc<Mutex> sync) Ingest pipeline wiring for the v0.20.x ingest log feature. 5 emit hooks: Hook 1: ingest_with_config_opts entry/exit (writer init + summary write + flush). Hook 2: apply_ocr_to_pdf_pages closure (PdfOcrProgress::Finished → LogEvent::Ocr). Hook 3: ingest_one_*_asset Err arm (LogEvent::Error). Hook 4: enumerate fs_skips.events right after the scan (LogEvent::Skip). Hook 5: (merged into Hook 3) per-asset fatal error → LogEvent::Error. To carry the skip events for Hook 4, add an events: Vec<FsSkipEvent> field to FsScanSkips in kebab-source-fs (kebab-source-fs does not call back into kebab-app, which avoids a cycle). Ownership: a single Option<Arc<Mutex<IngestLogWriter>>> binding; the 5 hooks clone + lock + write. ocr_ms_samples (Vec<u64>, success-only) is shared via Arc<Mutex>, and the summary stage sorts it and computes p50/p90/max. The per-asset loop is single-threaded, so there is no deadlock/contention risk. A writer failure does not fail the ingest itself (tracing::warn + continue). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 03:05:07 +00:00
altair823	bef0c98867	feat(wire): PdfOcrProgress.Finished + ingest_progress.v1 additive 4 fields Wire side of the v0.20.x ingest log feature. Additive minor cascade: * 4 fields on PdfOcrProgress::Finished + IngestEvent::PdfOcrFinished: - image_byte_size: Option<u64> - image_width: Option<u32> - image_height: Option<u32> - failure_reason: Option<String> * docs/wire-schema/v1/ingest_progress.schema.json: 4 added properties (all optional, no change to required = additive minor) * integrations/claude-code/kebab/SKILL.md: wire schema description synced. Existing ingest_progress.v1 consumers (CLI wire dump, integration test fixture, kebab-cli wire_search/wire_ask) stay backward-compatible because the 4 added fields are Option::None. No version bump (additive minor does not trigger the binary-version cascade, per CLAUDE.md §Versioning cascade). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 02:57:59 +00:00
altair823	f8a4c79727	feat(app): IngestLogWriter + LogEvent enum (per-ingest-run ndjson log) Module side of the v0.20.x ingest log surface. New file crates/kebab-app/src/ingest_log.rs: * IngestLogWriter: open/write_event/write_summary/flush + flush on Drop * LogEvent enum with 4 variants (ocr / parse_error / skip / error) * IngestSummary struct (kind="summary" literal + 11 stat fields) * generate_run_id (ISO 8601 prefix + last 8 hex digits of a uuid v7) * expand_log_dir (hand-rolled expansion of the {state_dir} placeholder) * now_ts (Rfc3339 UTC helper) * percentiles helper (p50/p90/max over a sorted Vec) uuid v7 is a workspace dep, which avoids a new rand dependency (spec §6 R-5). This step covers the self-contained writer + 5 unit tests. Wiring the 5 emit hooks into the ingest pipeline is step 4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 02:53:09 +00:00
altair823	f60304beb4	feat(config): add [logging] section (ingest_log_enabled + ingest_log_dir) Config side of the v0.20.x ingest log surface. New LoggingCfg struct: * ingest_log_enabled (bool, default true) * ingest_log_dir (PathBuf, default "{state_dir}/logs") With the #[serde(default)] tag, a pre-v0.20 config without a [logging] section is initialized automatically with LoggingCfg::default() (AC-10 backward compat). The actual expansion of the {state_dir} placeholder is handled by the expand_log_dir helper in step 2 (IngestLogWriter), since kebab-config's expand_path_with_base does not support {state_dir} (spec §6 R-3). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 02:44:21 +00:00
altair823	6a9551e0fa	fix(config): pdf.ocr.request_timeout_secs default 60 → 180 (Bug #11 follow-up) In the Round 3 final dogfood (2026-05-28), the 60s default forced OCR timeouts on dense Korean pages (metro-korea.pdf page 8/9/13), losing 1 more indexed page than round 2. From the user perspective, at 60s the cost vs coverage trade-off cuts too far into coverage. Adopt a policy of gradually narrowing toward the sweet spot: start from a conservative 180s and reduce gradually based on dogfood evidence (distribution of average OCR ms). Do not jump straight to a short default such as 60s. - crates/kebab-config/src/lib.rs::default_pdf_ocr_request_timeout_secs() = 180 - unit test rename (_is_60s → _is_180s) + assertion 180 - crates/kebab-config/tests/pdf_ocr.rs assert_eq 180 - add a 2026-05-28 follow-up entry to tasks/HOTFIXES.md. The user override path is preserved: users can tune it directly via config.toml [pdf.ocr] request_timeout_secs = N. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 01:40:23 +00:00
altair823	46e99470eb	docs(superpowers): v0.20 sub-item 1 bugfix1/2/3 specs + plans + DOGFOOD.md Artifacts of the 3-round dogfood-driven fix cycle: - bugfix1 (Bug #2/#3/#4): spec 964 lines + plan 848 lines - bugfix2 (Bug #6/#7, #8 falsified): spec 308 lines + plan 388 lines - bugfix3 (Bug #9/#10/#11/#13/#14, #12 falsified): spec 410 lines + plan 1043 lines - docs/DOGFOOD.md: the full all-around dogfood checklist (§0 environment ~ §13 reference corpus). Each round's spec/plan was frozen after critic + verifier round 2 closure ACCEPT. Based on dogfood-driven evidence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 01:21:34 +00:00
altair823	9b44e27dfe	test(app): update schema_report assertion for streaming_ask=true (Bug #9 follow-up) schema_report_reflects_freshly_ingested_kb asserted `!streaming_ask`, but the Bug #9 fix (`760eee8`) corrected streaming_ask to true. Invert the assertion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 23:58:10 +00:00
altair823	854a180365	fix(cli): add active_parsers + active_chunkers to Models test fixture in wire.rs (Bug #13) The Models struct extension in Step 4 (adding active_parsers / active_chunkers) missed the test fixture initialization in crates/kebab-cli/src/wire.rs → E0063 compile error. #[serde(default)] applies only to serde deserialization; a struct literal must still initialize every field. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 23:18:27 +00:00
altair823	5bba95fd71	docs(spec): HOTFIXES entry + parent spec cross-link for Bug #11 timeout deviation Frozen-spec deviation handoff for Bug #11 (previous commit `fix(config): pdf.ocr.request_timeout_secs default 600 → 60`). - tasks/HOTFIXES.md: subsection dated 2026-05-27 in the 5-field format Discovered / Symptom / Root cause / Fix / Amends (matches existing entries). - docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md: 2 inline HTML comment lines as cross-links at PDF OCR config block line 1000 (default value) + OQ-1 line 1628. 0 prose changes: the parent spec's frozen contract is preserved, and HTML comments are invisible when markdown is rendered. The HOTFIXES entry is the live source of truth (CLAUDE.md "Spec contract" rule). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 23:16:18 +00:00
altair823	2c7fa7142a	fix(cli): empty query emits error.v1 invalid_input for search + ask (Bug #14) Before: `kebab search "" --json` / `kebab search " " --json` / `kebab ask "" --json` all returned exit=0 with silent 0 hits (search) or an empty-prompt LLM round-trip (ask). User mistakes (typos, shell expansion errors) stayed silent → debugging cost. After: in both arms, `query.trim().is_empty()` → kebab_app::StructuredError (ErrorV1, code=invalid_input, with hint). exit=2 (StructuredError → the generic non-zero path of the existing exit_code()). --bulk mode is unaffected (the bulk arm ignores query). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 23:16:08 +00:00
altair823	d9c7aabce1	feat(schema): add active_parsers + active_chunkers arrays to schema.v1.models (Bug #13) Before: schema.v1.models reported only the single parser_version / chunker_version values → risk of missing entries in a version cascade audit of a multi-medium corpus (md + pdf + code Rust/Python + dockerfile + k8s + manifest). After (additive minor): add active_parsers + active_chunkers Vec<String> to the Models struct. backward compat: the existing single fields are preserved (markdown default), and the new arrays are optional (#[serde(default)] + not listed in JSON schema required). source: - kebab_store_sqlite::fetch_distinct_parser_versions() returns documents.parser_version with DISTINCT + ORDER BY. - fetch_distinct_chunker_versions() follows the same pattern for chunks.chunker_version. - collect_models recomputes on every schema call (no cache, so R-3 is resolved automatically). The wire schema change is additive only, so no major bump is needed. A v0.20.1 minor is enough. integrations/claude-code/kebab/SKILL.md synced. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 23:15:58 +00:00
altair823	10b0e2f4f2	fix(config): pdf.ocr.request_timeout_secs default 600 → 60 per dogfood evidence (Bug #11) metro-korea.pdf v0.20 final-dogfood (2026-05-27): - page 8 + page 13 both ran all the way to the 600s default timeout (`ms: 600000, chars: 0, skipped: true`) - result: body text not indexed + 20 minutes of cost wasted per page. The measured per-page throughput of cloud GPU Ollama is 6-32s (much faster than the 105s assumed in the parent spec). 60s is a production-friendly upper bound. For dense/high-resolution pages, users can raise it via a config.toml override (`[pdf.ocr] request_timeout_secs = N`); Step 6 adds the HOTFIXES + parent spec cross-link. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 23:15:21 +00:00
altair823	28f513795e	fix(config): emit error.v1 code=config_not_found for missing --config path (Bug #10) Before: `kebab search "rust" --config /tmp/nonexistent.toml --json` returned exit=0 + `{"hits":[]}` via a silent fallback to the XDG default. A typo / wrong path surfaced only as 0 hits, which made debugging a nightmare. After: add the kebab_config::ConfigNotFound thiserror::Error; the `Some(p) if !p.exists()` arm of Config::load returns anyhow::Error::new(ConfigNotFound { path }). kebab_app::error_wire::classify downcasts it → fills ErrorV1 code=config_not_found, hint, and details.path, and emits it as ndjson on stderr. R-1 (relative path): std::path::Path::exists() is cwd-relative, so both absolute + relative paths are covered with no extra work. Verified by two integration tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 23:14:54 +00:00
altair823	760eee89c8	fix(app): flip streaming_ask + single_file_ingest capabilities to actual surface (Bug #9) capabilities_snapshot() reported streaming_ask + single_file_ingest as hardcoded false, but the actual implementation proved production-grade in the v0.20 final-dogfood: - kebab ask --stream → answer_event.v1 ndjson, 191 events emitted correctly - kebab ingest-file <path> / kebab ingest-stdin --title <T> → ingest_report.v1 emitted correctly. When agents such as an MCP host or the Claude Code skill route on schema.capabilities, the false negative leads users to believe a working feature is unavailable. http_daemon stays false (not implemented; belongs to a separate sub-item). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 23:13:57 +00:00
altair823	f763049923	test(cli): assert 'code' in search --help output (Bug #7 regression pin) Why: the Step 2 doc-comment edit risks silently disappearing if someone later reorders the value list or splits it into an alias section. clap renders --help from the doc-comment's free-form text, so a grep-only smoke test is the only means of detection. Change: new test file (follows the kebab-cli `cli_*` prefix convention). Runs a fresh binary via CARGO_BIN_EXE_kebab and asserts that stdout contains the substring `code`. Maps 1:1 to the acceptance row in spec §4.4. Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md §4.4 / §5 (acceptance row 4). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 15:47:55 +00:00
altair823	8cf73d1f43	docs(cli): list 'code' in --media help string + SKILL.md (Bug #7) Why: kebab search --media code has been functionally supported since v0.18.0 (handled as first-class through a path outside MEDIA_KINDS; schema.v1.media_breakdown.code exists). However, the value lists in the SearchArgs clap doc-comment and in SKILL.md line 57 were stale: `code` was missing. A user reading only --help could conclude that code is unsupported. Change: two surfaces synced: the multi-line clap doc-comment at main.rs line 158-160 and integrations/claude-code/kebab/SKILL.md line 57. No change to the Rust binary surface or wire schema. Out of scope (follow-up): crates/kebab-mcp/tools/search.rs:44, crates/kebab-core/src/search.rs:32+52, crates/kebab-app/src/ingest_progress.rs:69, and crates/kebab-cli/tests/wire_schema_breakdowns.rs:35 carry the same stale list. They fall outside the grep boundary of spec ACCEPT (round 1c), so they are not included in this round. Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md §4.3 / §4.3a. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 15:47:16 +00:00
altair823	a58ee10dfb	fix(parse-pdf): strip Identity-H Unimplemented marker + dominance heuristic in compute_valid_char_ratio (Bug #6) Why: ingest of metro-korea.pdf (Identity-H CID font without ToUnicode CMap) wrongly finished with pdf_ocr_pages=0. The `?Identity-H Unimplemented?` marker emitted by lopdf 0.32.0 (28 ASCII chars) passes the 0x0020..=0x007E range in is_valid_text_char() → ratio=1.0 → the 0.5 OCR fallback threshold is bypassed. Change: MOJIBAKE_MARKERS const + a 4-stage compute_valid_char_ratio() (strip → trim-empty zero → dominance cap-0.3 → existing ratio). The marker list is extensible. No change to the body of is_valid_text_char(). Tests: +2 unit (dominance + minority) on top of the existing 8. No change to parser_version or the wire schema. Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md §4.1 / §4.2 / §6 R-1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 15:42:59 +00:00
altair823	e674ff474b	fix(parse-pdf): F4 mojibake.pdf via pikepdf surgery; preserve 1-page invariant (Bug #4) Bug #4 from the v0.20.0 sub-item 1 dogfood report: lopdf `get_pages()` count = 0 for F4 mojibake.pdf (Pages tree broken). Root cause: the previous byte-level `re.sub` + manual startxref edit passed lopdf's strict load but broke the `/Kids` reference in the Pages dict. - `tests/fixtures/_synth/mojibake.py`: full rewrite; replace byte-level `re.sub` + manual startxref with pikepdf open+inject-dummy-ToUnicode+del+save (auto xref regen). HYSMyeongJo-Medium CID font: the CID font does not generate ToUnicode on its own, so a dummy stream is injected and then stripped (removed=1 invariant). Exit codes 2/3/4 for invariant fail. - `crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf`: regenerate via pikepdf; 1 valid page, no /ToUnicode marker, byte-identical and reproducible thereafter. - `crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json`: regen via 2-run cargo test pattern (hand-rolled unwrap_or_else baseline bootstrap, no insta crate). - `crates/kebab-parse-pdf/tests/text_extractor_regression.rs`: append 3 invariant tests: (1) lopdf 1-page, (2) /ToUnicode marker absent, (3) PdfTextExtractor 1-block invariant. - `crates/kebab-parse-pdf/src/text_quality.rs`: f4_fixture_ratio_under_threshold threshold 0.3 → 0.5 (the production valid_ratio_threshold default). The old broken fixture (pages=0) gave extract_text="" → ratio=0.0; the new fixed fixture goes through CID 2-byte fallback decode → ratio≈0.375, which still satisfies the OCR trigger condition. spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§5) plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 4) prior: `241ded5` (Step 3 integration test) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 14:02:17 +00:00
altair823	241ded59df	test(app): multi-scanned PDF chunk_id collision-free integration test (Bug #3 regression) v0.20.0 sub-item 1 bugfix Step 3 (Group C): integration-level regression for Bug #3 (intra-doc chunk_id collision under aggressive overlap). - `crates/kebab-app/tests/common/mod.rs`: `pub mod mock_ocr;` 1 line append. - `crates/kebab-app/tests/common/mock_ocr.rs` (new): MockOcrEngine lift + `single` / `per_page` ctor (backward-compat single + per-page cursor). - `crates/kebab-app/tests/pdf_ocr_apply.rs`: remove inline MockOcrEngine + `mod common; use common::mock_ocr::MockOcrEngine;` import. 10 ctor call site migration (`MockOcrEngine { .. }` → `MockOcrEngine::single(...)`). - `crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs` (new): F1 + F2 scanned PDF + Bug #3 trigger shape (10 char "가" + ". " + 500 char "나") via mock OCR. assertion: chunk_id global uniqueness (HashSet dedup) across F1 + F2; F2 trigger text produces ≥2 chunks (collision shape). - C1 decision: Option A (share via tests/common/mock_ocr.rs). Facade mock injection unavailable (OllamaVisionOcr hardcoded), so the helper-level chain test (apply_ocr_to_pdf_pages → PdfPageV1Chunker) adds value beyond unit B5. spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§4.5) plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 3) prior: `436fd01` (Step 2 Bug #3 chunk_id fix) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 13:45:38 +00:00
altair823	436fd015a2	fix(chunk): chunk_id collision under aggressive overlap; bump pdf-page-v1 → pdf-page-v1.1 (Bug #3) Bug #3 (Critical) from the v0.20.0 sub-item 1 dogfood report. Ingesting scanned_page2.pdf (1580 char OCR text) hit a `chunks.chunk_id` PRIMARY KEY violation: `per_chunk_hash = #c{char_start}` used the post-overlap `actual_start`, and the overlap walk floor collapsed to `prev_min` → segments 1 and 2 both got `#c0`. - `crates/kebab-chunk/src/pdf_page_v1.rs`: `chunk_page` returns 4-tuple (segment_start, actual_start, chunk_end, slice); caller `per_chunk_hash` suffix uses `segment_start` (pre-overlap boundary, strictly increasing) instead of `char_start` (post-overlap, may collapse to prev_min). - VERSION_LABEL `"pdf-page-v1"` → `"pdf-page-v1.1"` (design §9 cascade, explicit user-facing audit trail). The hardcoded literals at `crates/kebab-app/tests/pdf_pipeline.rs: 168, 368` are also updated to v1.1. - module docs (`pdf_page_v1.rs:47-60`): the `#c{char_start}` reference in the workaround description is updated to `#c{segment_start}`, the segment_start invariant is stated explicitly, and a HOTFIXES.md cross-ref is added. - `pdf_page_v1.rs::tests`: `multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids` regression pin (10 char "가" + ". " + 500 char "나"; multi-chunk + overlap walk collapse trigger). - `tasks/HOTFIXES.md`: 2026-05-27 entry (symptom F2 1580 char OCR, intra-doc collision root cause, second-iteration patch rationale). spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§4) plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 2) prior: `d9acda5` (Step 1 Bug #2 walker fix) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 13:32:09 +00:00
altair823	d9acda517a	fix(source-fs): apply size limit only to code files; PDF/image/markdown bypass walker cap (Bug #2) Bug #2 from the v0.20.0 sub-item 1 dogfood report: `[ingest.code].max_file_bytes` was applied uniformly to every file at the walker stage → most PDF/image/markdown files (256 KB+) were skipped by the walker before extraction. fix: - `crates/kebab-source-fs/src/code_meta.rs`: add `pub(crate) fn is_code_file(path) -> bool` helper (= `code_lang_for_path(path).is_some()`). - `crates/kebab-source-fs/src/connector.rs:168-190`: the walker size-cap check now short-circuits as `is_code_file(&abs_path) && is_oversized(...)`. PDF/image/markdown bypass the walker cap; the parsers' own size controls (lopdf load_mem, image OCR max_pixels) cover them. - `crates/kebab-source-fs/src/connector.rs`, added inside the existing mod tests: `size_cap_skips_only_code_files`, which verifies the walker result for a 300 KB PDF + MD + .rs. No regression in the existing sibling tests (huge.rs / longfile.rs, fixtures named `.rs`). spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§3) plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 1) prior: `b4d9e60` (PR #189) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 13:20:38 +00:00
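
The walker short-circuit described in d9acda517a (Bug #2) can be pictured with a minimal sketch. This is not the repository's code: the extension table in `code_lang_for_path`, the strict greater-than comparison in `is_oversized`, the simplified signatures, and the `walker_skips` wrapper are all assumptions made for illustration. Only the `is_code_file(..) && is_oversized(..)` shape and the `code_lang_for_path(path).is_some()` definition come from the commit message.

```rust
use std::path::Path;

/// Hypothetical stand-in: the real extension-to-language table is not shown
/// in the log, so this one is an assumption.
fn code_lang_for_path(path: &Path) -> Option<&'static str> {
    match path.extension()?.to_str()? {
        "rs" => Some("rust"),
        "py" => Some("python"),
        _ => None,
    }
}

/// As described in the commit: a file is "code" iff a language is detected.
fn is_code_file(path: &Path) -> bool {
    code_lang_for_path(path).is_some()
}

/// Assumed comparison: strictly larger than the configured cap.
fn is_oversized(size_bytes: u64, max_file_bytes: u64) -> bool {
    size_bytes > max_file_bytes
}

/// Hypothetical wrapper: the size cap applies only to code files, so
/// PDF/image/markdown files are never skipped here regardless of size.
fn walker_skips(path: &Path, size_bytes: u64, max_file_bytes: u64) -> bool {
    is_code_file(path) && is_oversized(size_bytes, max_file_bytes)
}
```

Under these assumptions, with a 256 KB cap, a 300 KB `.rs` file is skipped while a 300 KB `.pdf` or `.md` passes through to its parser, which is the situation `size_cap_skips_only_code_files` is said to verify.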

959 Commits