kebab

Author	SHA1	Message	Date
altair823	fe20be8195	feat(chunk): N-gram supplement (Option β): sub-token emit for Korean compounds #4 (user request): promote Option β of spec §6.2 (emit additional sub-tokens) from a v0.21.x P9 follow-up to the v0.20.1 implementation. Resolves the ko-dic compound-noun limitation found in dogfooding (compounds such as `대한민국`, `한국정부`, and `주민등록번호` are kept as single tokens). Implementation (`crates/kebab-chunk/src/lib.rs::tokenize_korean_morphological`): - New helper `is_hangul()`: tests for Hangul syllables (U+AC00..D7A3) and jamo (U+1100..11FF, U+3130..318F). - For each morpheme in the lindera output that is Hangul-only and at least 3 characters long, additionally emit sliding-window 2-grams, expanding the token list to the form `[한국정부, 한국, 국정, 정부]`. - English, numeric, and mixed tokens get no supplement (avoids false positives). Tests (`crates/kebab-chunk/tests/tokenize_korean.rs`): - `tokenize_korean_morphological_emits_2gram_for_long_morpheme`: checks the cases among the 5 probe fixtures where the supplement fires (measured: `서울특별시` → `[서울, 특별시, 특별, 별시]`, `대한민국` → `[대한민국, 대한, 한민, 민국]`). - `tokenize_korean_morphological_no_2gram_for_english`: guarantees that no English substrings (`Rus`, `ust`, `imi`) are emitted for the Rust optimization fixture. Dogfood evidence (added to the 2026-05-28 entry in `tasks/HOTFIXES.md`): - Queries '대한', '한민', and '민국' all hit (sliding window over 대한민국). - Sub-token queries such as '특별', '주민', and '등록' hit. - The English query 'tokenizer' returns 0 hits because the term is absent from the corpus (no supplement). - Trade-off: DB size +20-30% (Korean-heavy corpora), plus a small false-positive risk. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2 (Option β promote) Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (post-implementation enhancement)	2026-05-28 13:48:05 +00:00
altair823	53ec9b4dc5	test(chunk): regenerate AST + long-section snapshots for V009 chunk field The S3 update to the Chunk struct (adding the tokenized_korean_text: Option<String> field in kebab-core) changed the serde-serialized output of every chunk snapshot JSON. Regenerate the baselines of 10 snapshot fixtures (9 AST chunkers + markdown long-section) in V009 form. The change in each snapshot is an added `"tokenized_korean_text": null` field per chunk JSON (most fixtures are English code, so lindera falls back to None). No behavior change; only the serde representation cascades. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 follow-up via S11 sanity)	2026-05-28 12:27:37 +00:00
altair823	21b52bc285	style: cargo fmt --all (S3+S4+S5+S7 follow-up) Formatting cleanup for the V009 morphological tokenizer work (S3 chunk + S4 backfill + S5 short_query_hint removal + S7 new tests). No behavior change. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S11)	2026-05-28 12:06:01 +00:00
altair823	bd86f61c9c	fix(chunk): close S3 reviewer blockers: get_chunk read + AST chunker cascade The S3 spec compliance reviewer (sonnet) found 2 blockers: 1. crates/kebab-store-sqlite/src/documents.rs: the get_chunk SELECT did not query the tokenized_korean_text column, so the DB value was lost on read. Add the column to the SELECT list and the row.get index to the row → Chunk conversion, cascading through the ChunkRow struct, chunk_row_from_sql, and the Chunk construction in get_chunk. 2. crates/kebab-chunk/src/code_*_ast_v1.rs (9 files): make_chunk hardcoded tokenized_korean_text: None, so code files with Korean comments got no FTS hits. Cascade a tokenize_korean_morphological(text) call, following the same pattern as tier2_shared. This commit is a rework of S3, made as a separate commit rather than an amend (keeps the S3 boundary). Satisfies the spec §6.2 invariant ("every chunker calls tokenize immediately before emitting a chunk"). Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 rework) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 10:30:53 +00:00
altair823	b134ae9dd5	feat(chunk): integrate lindera korean morphological tokenizer Use lindera ko-dic to produce the morpheme sequence stored in V009's tokenized_korean_text column. The tokenizer is called right after chunk creation in the chunk builder pipeline → the result is pre-filled into the chunk struct field → the store's put_chunks INSERTs it within a single transaction. - crates/kebab-core/src/chunk.rs: add the tokenized_korean_text: Option<String> field to the Chunk struct (#[serde(default)]). - crates/kebab-chunk/src/lib.rs: tokenize_korean_morphological() helper + OnceLock caching + fallback (None) policy. - crates/kebab-chunk/Cargo.toml: add lindera features = ["embed-ko-dic"] (required to enable DictionaryKind::KoDic). - All chunkers (tier2_shared, md_heading_v1, pdf_page_v1, 9 code AST v1): pre-fill tokenized_korean_text in the Chunk literal. - crates/kebab-store-sqlite/src/documents.rs::put_chunks: update the INSERT SQL column list + placeholders + bindings (12th column). - crates/kebab-chunk/tests/tokenize_korean.rs: 2 unit tests. lindera 3.0.7 API corrections: load_dictionary_from_kind → load_embedded_dictionary, Token.text → Token.surface. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 10:22:15 +00:00
altair823	597d8b70ad	feat(deps): add lindera + lindera-ko-dic for korean morphological tokenizer Adds workspace dependencies only; actual use comes in S3 with the kebab-chunk tokenize_korean_morphological() helper. - Cargo.toml (workspace): add lindera = "3", lindera-ko-dic = "3". - crates/kebab-chunk/Cargo.toml: per-crate dep (the embed-ko-dic feature on lindera-ko-dic enables the embedded KO-DIC dictionary blob). - crates/kebab-app/Cargo.toml: add fts_korean_morphological to [features] (spec §6.3 Option A: marker role only, no disable path). License: lindera = MIT, lindera-ko-dic = MIT (confirmed via cargo info). Adopting cargo deny is a P9 follow-up. Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.1, §10.1 Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S2) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 10:03:58 +00:00
altair823	685007789a	style: cargo fmt --all (round 4 ingest log feature follow-up) The Phase C4 executor's final `fix(test): clippy + fmt fixes` commit applied fmt only to the test files. Workspace-wide fmt turned out to have been missed → applied cargo fmt --all. All imports are reordered alphabetically and line wrapping is aligned. Also commits previously untracked artifacts: - docs/superpowers/specs/2026-05-28-v0.20-ingest-log-spec.md (491 lines, ACCEPT) - docs/superpowers/plans/2026-05-28-v0.20-ingest-log-plan.md (616 lines, ACCEPT) workspace test: 1370 passed / 0 failed / 50 ignored, ingest_log_smoke green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 04:18:40 +00:00
altair823	436fd015a2	fix(chunk): chunk_id collision under aggressive overlap; bump pdf-page-v1 → pdf-page-v1.1 (Bug #3) Bug #3 (Critical) from the v0.20.0 sub-item 1 dogfood report. Ingesting scanned_page2.pdf (1580 chars of OCR text) violated the `chunks.chunk_id` PRIMARY KEY: `per_chunk_hash = #c{char_start}` used the post-overlap `actual_start`, and the overlap walk floor collapsed to `prev_min`, so segments 1 and 2 both got `#c0`. - `crates/kebab-chunk/src/pdf_page_v1.rs`: `chunk_page` returns 4-tuple (segment_start, actual_start, chunk_end, slice); caller `per_chunk_hash` suffix uses `segment_start` (pre-overlap boundary, strictly increasing) instead of `char_start` (post-overlap, may collapse to prev_min). - VERSION_LABEL `"pdf-page-v1"` → `"pdf-page-v1.1"` (design §9 cascade, explicit user-facing audit trail). The hardcoded literals at `crates/kebab-app/tests/pdf_pipeline.rs: 168, 368` are also updated to v1.1. - module docs (`pdf_page_v1.rs:47-60`): update the `#c{char_start}` reference in the workaround description to `#c{segment_start}`, state the segment_start invariant explicitly, and cross-reference HOTFIXES.md. - `pdf_page_v1.rs::tests`: `multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids` regression pin (10 chars of "가" + ". " + 500 chars of "나"; triggers multi-chunk + overlap walk collapse). - `tasks/HOTFIXES.md`: 2026-05-27 entry (symptom F2 1580 char OCR, intra-doc collision root cause, second-iteration patch rationale). spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§4) plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 2) prior: `d9acda5` (Step 1 Bug #2 walker fix) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 13:32:09 +00:00
altair823	710945c4b0	refactor(parse-md): absorb kebab-normalize + kebab-parse-types: 24 → 22 crates + §3.7b rewrite In practice only 1 of the 4 parsers (markdown) goes through the lift in design §3.7b's thin layer (ParsedBlock and friends); with fan-in and fan-out both 1, the layer has lost its purpose. Absorb both kebab-normalize (1097 LOC) and kebab-parse-types (98 LOC) into kebab-parse-md. Design: docs/superpowers/specs/2026-05-26-normalize-absorption-spec.md Plan: docs/superpowers/plans/2026-05-26-normalize-absorption-plan.md HOTFIXES: the 2026-05-26 entry in tasks/HOTFIXES.md (design deviation) - 5 used types + 3 forward-declared structs → explicit pub re-exports from the kebab-parse-md::types module. - build_canonical_document + derive_title + warning_agent → kebab-parse-md::normalize module. - The 4 hard-coded agent literals (lib.rs:122/128/134/153), the warning_agent body return, and the tracing target literal are all preserved for stage-label consistency. - Remove the kebab-app callsite (lib.rs:51 use + :1119 context string) and the 2 deps in Cargo.toml (regular + dead). - Remove kebab-normalize from the [dev-dependencies] of kebab-chunk + kebab-store-sqlite (replaced by kebab-parse-md); shift the `use` lines in the integration test sources. - Move the test file (kebab-normalize/tests/normalize_snapshot.rs → kebab-parse-md/tests/). - workspace Cargo.toml: Hunk (a) deletes 2 members entries + Hunk (b) version 0.18.0 → 0.19.0 (frozen contract change). - Rewrite design §3.7b in 4 paragraphs (original intent preserved + current state + preserved surface + future re-extraction trigger). - Update the design §8 graph (3 edges removed + meaning of 2 forbidden bullets updated + commentary). - Mechanical update of the ARCHITECTURE.md crate graph + directory tree. - tasks/INDEX.md L169 closure mention + new "Future work / deferred" section (image/pdf normalize integration entry). - New tasks/HOTFIXES.md entry (4-block; design deviation Symptom). - One cross-link line in HANDOFF.md. - The 3 dead structs (ParsedImageRegion / ParsedPdfPage / ParsedAudioSegment) are kept as the future surface for v0.20+ image/pdf normalize integration (spec §11). Wire / surface impact: none. CLI / TUI / MCP / --json output / config / XDG path / parser_version all unchanged. The wire-invisible provenance.events[].agent + tracing target literal "kb-normalize" are also preserved, keeping the audit log consistent between old and new DB rows. Verification: cargo test --workspace --no-fail-fast -j 1 → 1313 passed / 0 failed (172 result blocks). cargo clippy --workspace --all-targets -j 1 -- -D warnings → 0 warnings (5m 46s). cargo metadata --no-deps --format-version 1 | jq '.workspace_members | length' = 22. cargo tree -p kebab-app --depth 2 | grep -E "kebab_(parse_types|normalize)" = 0 lines.	2026-05-26 15:00:59 +00:00
altair823	7c85de065a	chore: workspace-wide cleanup: clippy::pedantic baseline + auto-fix Final cleanup before cut PR v0.18.0. User request: "make the whole codebase clean and easy to read". ## Workspace lints - Add `pedantic = "warn"` (priority -1) plus an intentional allow-list to `[workspace.lints.clippy]` in `Cargo.toml`: - cast_possible_truncation / cast_possible_wrap / cast_sign_loss / cast_precision_loss: intentional truncation (ONNX i64, hash modular reduction, etc.). - doc_markdown / missing_errors_doc / missing_panics_doc: cosmetic doc style. - too_many_lines / module_name_repetitions / must_use_candidate / needless_pass_by_value / manual_let_else / items_after_statements / similar_names: informational only. - format_collect / match_wildcard_for_single_variants / trivially_copy_pass_by_ref / unnecessary_wraps: intentional patterns (exhaustive match, future Result variants, etc.). - default_trait_access: `Foo::default()` is idiomatic. - float_cmp: explicit threshold comparisons on NLI / RRF scores are intended. - struct_excessive_bools / case_sensitive_file_extension_comparisons / naive_bytecount / ignore_without_reason: domain-specific, intentional. - format_push_string / return_self_not_must_use / match_same_arms: preserves builder / wire-label / hot-path patterns. - needless_continue / used_underscore_binding / nonminimal_bool / unreadable_literal / many_single_char_names / doc_link_with_quotes / assigning_clones / collapsible_str_replace / trivial_regex / elidable_lifetime_names / range_plus_one / explicit_iter_loop / implicit_hasher / ref_option: remaining low-value style. - Add `[lints] workspace = true` to the `Cargo.toml` of each of the 24 crates. ## Auto-fix Applied `cargo clippy --workspace --all-targets --fix`: 128 files changed, 552 insertions / 472 deletions. Mainly: - uninlined_format_args (~18): `format!("{}", x)` → `format!("{x}")`. - redundant_closure_for_method_calls (~33): `.map(|x| x.foo())` → `.map(T::foo)`. - Other mechanical refactors. ## Verification - `cargo clippy --workspace --all-targets -j 1 -- -D warnings` clean (pedantic + all lint groups). - `cargo test --workspace --no-fail-fast -j 1`: 1293 tests pass + 1 pre-existing flaky fail (`kebab-mcp::tools_call_ask_multi_hop::ask_tool_routes_multi_hop_true_to_decompose_first`, HOTFIX candidate, unrelated to the cleanup). 0 regressions. Wire impact: none. Behavior impact: none (mechanical refactor only). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 03:01:58 +00:00
altair823	1969c8e3b5	fix(dogfood): k8s multi-resource YAML chunk_id collision P10 dogfooding found that a k8s manifest with 2+ documents (e.g. Deployment + Service in one file) fails to ingest: UNIQUE constraint failed: chunks.chunk_id Root cause: tier2_shared::push_chunks_with_oversize's non-oversize branch hardcoded split_key = None. K8sManifestResourceV1Chunker calls it once per resource; with split_key None every resource from the same document gets the same id_hash (= base_policy_hash) → identical chunk_id. p10-3's code_text_paragraph_v1 had the same bug (fixed in `df3c5b8`) but it calls build_chunk_no_symbol directly — the push_chunks_with_oversize path was never fixed. Fix: push_chunks_with_oversize gains a base_split_key parameter for the non-oversize single-chunk case. k8s chunker passes Some(resource.line_start) so each resource gets a distinct chunk_id; dockerfile / manifest pass None (1 chunk per file — no sibling collision, chunk_id stays stable). Regression coverage: k8s_multi_doc_emits_one_chunk_per_resource now asserts chunk_id distinctness; new integration test tier2_k8s_multi_resource_yaml_ingests_without_collision ingests a real 2-document YAML end-to-end. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 23:49:37 +00:00
altair823	840c6c40a6	test(p10-1d-followup): cpp snapshot exercises actual CppAstExtractor Reviewer nit #3: the hand-built fixed_doc() only verified chunker 1:1 mapping. New tests invoke CppAstExtractor against tests/fixtures/sample.cpp and snapshot the real extractor → chunker pipeline (14 blocks emitted covering namespace::chunk::Class, ctor/dtor/operator/template/free-fn convention, glue <top-level> blocks between units). Adds kebab-parse-code as a dev-dep of kebab-chunk (same precedent as kebab-parse-md). Both the existing hand-built test AND the new extractor-driven tests are kept — the former for fast chunker-only validation, the latter for end-to-end regression detection. Added tests: - code_cpp_ast_extractor_snapshot: asserts all 8 named symbol units are present - code_cpp_ast_extractor_chunks_deterministic: chunker output is stable Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 22:43:57 +00:00
altair823	b2a2902e38	feat(p10-1d): code-cpp-ast-v1 chunker + snapshot test Identical chunker body to code-c-ast-v1 (per-language work happens in the CppAstExtractor, Task C). Snapshot fixture covers nested namespace + class + ctor/dtor + method + operator overload + template fn + free fn + top-level main, verifying namespace::Class::method symbol convention per design §3.4. 5 chunks emitted: - <top-level> (includes, namespace opening) - kebab::chunk::MdHeadingV1Chunker (class unit) - kebab::identity (template function) - kebab::global_helper (free function in namespace) - main (top-level main function) Template function symbols emit without <T> parameters per spec convention. Namespace::Class::method pattern verified. All tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 13:46:12 +00:00
altair823	03cd41c48f	feat(p10-1d): code-c-ast-v1 chunker + snapshot test Mirrors code-go-ast-v1's chunker pattern. Snapshot test against tests/fixtures/sample.c (function + typedef struct + typedef enum + preprocessor) verifies symbol list + lang=c stamping. Chunks produced (4 total): - <top-level> glue: includes, defines, static vars, typedefs (lines 1-18) - parse_record function (lines 20-23) - print_record function (lines 25-27) - main function (lines 29-33) All chunks stamped with lang=c and chunker_version=code-c-ast-v1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 13:41:19 +00:00
altair823	df3c5b8caf	test(p10-3): integration smoke tests for Tier 3 (shell + yaml fallback) Two new tests verify end-to-end Tier 3 wiring: - tier3_shell_ingest_searchable: .sh file → --code-lang shell search → Citation::Code { symbol: None, lang: "shell" }, chunker_version "code-text-paragraph-v1". - tier3_yaml_fallback_picks_up_non_k8s_yaml: docker-compose-shaped yaml (no apiVersion/kind) triggers k8s chunker's Ok(vec![]) result, fallback retries with Tier 3 → Citation::Code { symbol: None, lang: "yaml" } and chunker_version "code-text-paragraph-v1". Also fixes a bug in CodeTextParagraphV1Chunker (Task B): short paragraphs (≤80 lines) were emitted with split_key=None, causing all paragraphs from the same document to share the same chunk_id (UNIQUE constraint violation at put_chunks). Fix: always use para.line_start as split_key so every paragraph gets a distinct id regardless of size. Brings code_ingest_smoke to 14 tests (Tier 1: 9, Tier 2: 3, Tier 3: 2). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 11:37:44 +00:00
altair823	0b7d8af759	feat(p10-3): code-text-paragraph-v1 chunker — paragraph + line-window fallback Blank-line paragraph segmentation (whitespace-only lines as boundaries, blank lines themselves never in any chunk's range). Paragraphs > 80 lines split into 80-line windows with 20-line overlap (stride 60), sharing the input lang and symbol=None per spec §9.3. tier2_shared exposes a new build_chunk_no_symbol helper so Chunk id/hash/token semantics stay identical with Tier 1/2. Extracts build_chunk_from_span as private core so build_chunk and build_chunk_no_symbol share mechanics without drift. 4 unit tests cover multi-paragraph shell (4 paragraphs, blank-line boundaries verified), 200-line oversize line-window split (chunks 1-80 / 61-140 / 121-200), empty file, and lang preservation when input is yaml. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 11:22:48 +00:00
altair823	9342b9543f	refactor(p10-3): expose tier2_shared::build_chunk as pub(crate) Tier 3 chunker (next task) needs to call the same Chunk-construction helper to keep id / hash / token-count / policy_hash semantics identical with Tier 2. Visibility-only change; signature and body unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 11:17:51 +00:00
altair823	75c1c7b911	test(p10-2-followup): cover tier2_shared oversize fallback with >200-line k8s ConfigMap Spec p10-2 risks section calls out "거대 ConfigMap" but no test exercised the line-window split branch of tier2_shared::push_chunks_with_oversize. This adds a 256-line ConfigMap fixture (generated inline) and asserts: - ≥2 chunks emitted (split happened), - all chunks share symbol `ConfigMap/prod/big`, - chunk_ids all distinct (id_for_chunk's #L{k} suffix disambiguation), - line ranges form a contiguous partition (prev.line_end + 1 == next.line_start). Reviewer nit #1 (PR #153 code-reviewer). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 14:41:16 +00:00
altair823	22d4161728	feat(p10-2): manifest-file-v1 chunker (whole-file 1 chunk, symbol <manifest>) Emits 1 Chunk per manifest file (Cargo.toml / pyproject.toml / package.json / tsconfig.json / pom.xml / build.gradle / go.mod). Symbol unified to "<manifest>"; manifest type distinguished by code_lang (toml / json / xml / groovy / go-mod) read from Block::Code.lang. Oversize >200 lines splits via tier2_shared::push_chunks_with_oversize. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 13:11:46 +00:00
altair823	51004ac593	feat(p10-2): dockerfile-file-v1 chunker (whole-file 1 chunk, symbol <dockerfile>) Reads entire Dockerfile / Dockerfile.* / *.dockerfile content and emits a single Chunk with symbol "<dockerfile>", code_lang "dockerfile", line range 1..EOF. Oversize >200 lines splits into line-windows sharing the symbol via tier2_shared::push_chunks_with_oversize. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 13:09:13 +00:00
altair823	8996e73282	feat(p10-2): k8s-manifest-resource-v1 chunker + tier2_shared helper Splits multi-document YAML by ^---\s*$, requires apiVersion + kind string fields per document, emits 1 chunk per recognized k8s resource. Symbol = <kind>/<namespace>/<name> or <kind>/<name> (cluster-scoped). Invalid YAML returns 0 chunks (handled by p10-3 paragraph fallback). Oversize >200 lines splits into line-windows sharing the same symbol. tier2_shared module hosts the oversize fallback + Chunk-construction helper mirroring code_rust_ast_v1's Chunk shape. Task E (dockerfile) and Task F (manifest) will reuse it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 13:06:47 +00:00
altair823	077f92f41e	build(p10-2): add serde_yaml dep to kebab-chunk for k8s-manifest-resource-v1 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 12:57:06 +00:00
altair823	813bdd1a16	test(p10-1c-jk): code-java-ast-v1 + code-kotlin-ast-v1 chunker snapshots Mirrors code_go_ast_snapshot pattern. In-memory CanonicalDocument (no kebab-parse-code dep — boundary §6.3). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 10:57:37 +00:00
altair823	30e03c7a12	feat(p10-1c-jk): code-kotlin-ast-v1 chunker (1:1 + oversize split) Duplicate of code-java-ast-v1 with language-agnostic body unchanged. Cross- chunker policy_hash identity asserted vs md-heading-v1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 10:52:24 +00:00
altair823	7bda1509b7	feat(p10-1c-jk): code-java-ast-v1 chunker (1:1 + oversize split) Duplicate of code-rust-ast-v1 / code-go-ast-v1 with language-agnostic body unchanged. Cross-chunker policy_hash identity asserted vs md-heading-v1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 10:41:27 +00:00
altair823	ab288135e9	test(p10-1c-go): code-go-ast-v1 chunker snapshot + full-suite gate Mirrors code_python_ast_snapshot / code_ts_ast_snapshot patterns. In-memory CanonicalDocument (no kebab-parse-code dep; boundary §6.3 respected). verify: - cargo test -p kebab-chunk --test code_go_ast_snapshot → 2/2 - cargo test --workspace --no-fail-fast -j 1 → 0 failures (all green) - cargo clippy --workspace --all-targets -- -D warnings → clean - SMOKE: confirmed chunk.ParseDoc symbol + code_lang_breakdown {"go": 1} Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 09:54:17 +00:00
altair823	f1a4f67e12	feat(p10-1c-go): code-go-ast-v1 chunker (1:1 + oversize split) Duplicate of code-rust-ast-v1 / code-{python,ts,js}-ast-v1 with language-agnostic body unchanged. Cross-chunker policy_hash identity asserted vs md-heading-v1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 09:11:14 +00:00
altair823	d6bb6cfd3b	test(p10-1b): per-language chunker snapshots (python/ts/js) Mirrors code_rust_ast_snapshot pattern. In-memory CanonicalDocument build so no kebab-parse-code dep (boundary §6.3 respected). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:39:17 +00:00
altair823	d53995a6d4	feat(p10-1b): code-js-ast-v1 chunker + activate JavaScript in app dispatch Chunker: duplicate-with-substitution from code-ts-ast-v1 / code-rust-ast-v1. Dispatch: replaces JS bail! arms with JavascriptAstExtractor + CodeJsAstV1Chunker. Integration test javascript_file_ingests_and_searches_as_code_citation asserts citation.lang=javascript, symbol=src/Bar.Bar.baz, code_lang=javascript. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:16:07 +00:00
altair823	20feb3133e	feat(p10-1b): code-ts-ast-v1 chunker (1:1 + oversize split) Duplicate of code-rust-ast-v1 / code-python-ast-v1 with language-agnostic body unchanged. Cross-chunker policy_hash identity asserted vs md-heading-v1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 00:56:41 +00:00
altair823	6a0b340941	feat(p10-1b): code-python-ast-v1 chunker (1:1 + oversize split) Duplicate of code-rust-ast-v1 with language-agnostic body unchanged. Cross-chunker policy_hash identity asserted vs md-heading-v1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 00:46:17 +00:00
altair823	97e9f558f4	test(p10-1a-2): code-rust-ast-v1 chunker snapshot + full-suite gate Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:14:57 +00:00
altair823	808b92a6c5	feat(p10-1a-2): code-rust-ast-v1 chunker (1:1 + oversize split) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 17:40:11 +00:00
th-kim0823	bf4ebf8d2a	feat(p10-1a-1): add Metadata.repo / git_branch / git_commit / code_lang Four optional, serde-skipped-when-None fields added to `Metadata` for code ingest context. All 11 downstream construction sites patched with `repo: None, git_branch: None, git_commit: None, code_lang: None`. Full workspace check (`--tests`) and per-crate test suite pass clean. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-15 15:44:18 +09:00
altair823	f867b36afb	feat(kebab-core): p9-fb-23 task 2 — CanonicalDocument gains last_chunker_version + last_embedding_version Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 17:50:25 +00:00
altair823	8181fd91e7	review(p7-2): address round 1 review comments - Add an overlap clamp at chunk entry (upper bound `target_bytes / 2`). Under a pathological policy (`overlap_tokens >= target_tokens`), this blocks the risk of a chunk fully re-emitting the previous chunk's text. Matches md-heading-v1's `seed_budget = overlap_tokens.min(target/2)` guard pattern. Add regression test `overlap_clamped_when_overlap_exceeds_target`, which verifies that `actual_start` strictly increases across adjacent chunks. - Silent truncation in `char_start as u32` / `char_end as u32` → `try_from::expect`, an explicit panic on corrupted input. - Document two items in the module doc's `## Splitting policy`: abbreviation cases (`Mr.` / `i.e.`, etc.) and the overlap clamp. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 08:55:07 +00:00
altair823	7ee0ac9894	feat(kebab-chunk): P7-2 pdf-page-v1 chunker: page-aware splitting `PdfPageV1Chunker` takes the `CanonicalDocument` emitted by `kebab-parse-pdf` (one page per block, all `SourceSpan::Page`) and produces `Chunk`s that never cross a page boundary. `chunker_version = "pdf-page-v1"`. Core behavior: - If the page text fits within `target_tokens × BYTES_PER_TOKEN` (`BYTES_PER_TOKEN` = 3), it is emitted as one chunk. Otherwise, `\n\n` (paragraph) boundaries or sentence-end punctuation + whitespace boundaries delimit segments, which are accumulated greedily, with at least one segment per chunk by default. - Prepend the previous tail, up to `overlap_tokens × BYTES_PER_TOKEN`, to the next chunk as a prefix (char granularity; never backtracks past the start of the previous chunk). - Empty / whitespace-only pages yield 0 chunks (already flagged at the `kebab-parse-pdf` stage via the page's `Provenance::Warning`). - Non-PDF docs (block is not Block::Paragraph, or SourceSpan is not Page) → explicit error. Spec deviation (HOTFIXES 2026-05-02 P7-2): - `chunk_id` collision guard: when one page yields multiple chunks, their `block_ids` are all identical, so the §4.2 recipe collides. Avoided by varying the `policy_hash` input to `id_for_chunk` per chunk as `format!("{base}#c{char_start}")`. The recipe itself is unchanged. The `Chunk.policy_hash` field keeps the base value. - `BYTES_PER_TOKEN = 3` (matches md-heading-v1's actual code). The spec text says "/ 4", but that itself disagrees with md-heading-v1's real code, so consistency was chosen. Locked in by a cross-chunker `policy_hash` identity unit test. Tests (10 new): - chunker_version label, 3-page small, 1-page huge + overlap + chunk_id uniqueness, empty page skip, whitespace-only skip, non-PDF error, no cross-page boundary ever created, determinism (1000 runs), stable snapshot shape, policy_hash identical to md-heading-v1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 08:51:44 +00:00
altair823	ca0567c72b	feat(kebab-app): P6-4 image ingest wiring: kebab ingest now also handles PNG/JPEG assets Completes the unfinished stretch where the P6-1/P6-2/P6-3 libraries (`ImageExtractor`, `OllamaVisionOcr`, `apply_caption`) were not yet visible from the CLI. `kebab ingest` now indexes image assets end-to-end in addition to markdown, and `kebab search` / `kebab ask` match and cite images via OCR text + caption. ## kebab-app - Add `kebab-parse-image` to `[dependencies]`. - On entry to `ingest_with_config`, build `OllamaVisionOcr` / `OllamaLanguageModel` once per ingest session according to the `image.ocr.enabled` / `image.caption.enabled` flags, and share them as trait objects across the asset loop. Thanks to the internal Arc in reqwest::blocking::Client, allocation cost is independent of the number of assets. - Bundle the two adapters + ImageExtractor into an `ImagePipeline` struct to keep the `ingest_one_asset` parameter list from exploding (addresses clippy::too_many_arguments). - Replace the markdown-only guard in `ingest_one_asset` with `match media_type`: Markdown takes the existing path, Image(_) branches to the new `ingest_one_image_asset`, and PDF/Audio/Other are skipped as before. - New `ingest_one_image_asset`: - read bytes → `ImageExtractor::extract` (on failure the caller does errors+=1) - `apply_ocr` (Lenient: on failure, a ProvenanceKind::Warning event + \"ocr_failed: ...\" in `IngestItem.warnings`; `block.ocr` stays None) - `apply_caption` (same Lenient policy) - call the existing `MdHeadingV1Chunker`; the chunker already emits `Block::ImageRef` as a single chunk - existing persist + embed sequence unchanged (byte-identical to markdown) - `lang_hint_from_doc`: maps `Lang(\"und\")` or an empty string to None (done up front on the caller side so that the image-pipeline adapter's build_prompt does not silently drop \"und\"). ## kebab-chunk - Replace the `Block::ImageRef` branch of `render_block_text` with the P6-4 (β) plain concat policy: join `[alt, ocr.joined, caption.text]` with `\\n\\n`, dropping empty parts. If alt is empty, fall back to the basename of `src` (a defensive guard for the P6-1 contract). - New unit test `image_ref_p6_4_plain_concat_drops_empty_parts`: verifies all five cases (alt-only / alt+ocr / alt+caption / alt+ocr+caption / empty alt → src fallback). - The existing `image_ref_emits_own_chunk_zero_tokens` still passes unchanged; the chunker's per-block dispatch is untouched, only text rendering is updated. ## Integration tests (kebab-app/tests/image_pipeline.rs) Ollama is stubbed with wiremock. 5 cases: 1. OCR-only happy path: 1 PNG + ocr.enabled → 1 doc + 1 chunk emitted; `block.ocr.joined` equals the mock's \"Hello World 2026\". 2. OCR + caption both enabled: both fields are filled and the chunk text contains all three parts (alt + ocr + caption). 3. Lenient failure check: on OCR 503 the asset is still indexed (kind=New), `errors=0`, a ProvenanceKind::Warning is attributed to \"kb-app\", and `IngestItem.warnings` carries an \"ocr_failed:\" note. 4. Both disabled: even with `image.ocr.enabled=false && image.caption.enabled=false`, the asset is indexed as 1 chunk (chunk text=filename), with EXIF + dimensions still populated. 5. Determinism (re-ingest): ingesting the same PNG twice makes the second run `Updated` with the same `doc_id`. ## SMOKE.md Add the `kebab search --mode lexical \"Hello World\"` step to the command sequence. Add example `[image.ocr]` / `[image.caption]` config sections + an ingest time estimate (~5-10 s per asset). Put the \"books go through the P7 PDF line\" guidance in both the verification checklist and \"Known behavior\". ## Verification against a real Ollama Against 192.168.0.47 + gemma4:e4b: ``` $ kebab --config /tmp/kebab-smoke/config.toml ingest scanned 2 new 2 updated 0 skipped 0 errors 0 (18395 ms) $ kebab inspect doc <image_doc_id> parser_version: image-meta-v1 blocks: [{ alt: \"hello.png\", ocr: \"Hello World 2026\", caption: \"The image displays the text \\\"Hello World 2026\\\" in a large, black, sans-serif font.\" }] $ kebab --json ask \"Hello World 텍스트가 어디에 있나?\" --mode hybrid grounded: true citations: [{marker: \"[1]\", doc_path: \"hello.png\"}] ``` ## Verification - `cargo test --workspace --no-fail-fast -j 1`: all pass - `cargo clippy --workspace --all-targets -- -D warnings`: pass - `cargo test -p kebab-chunk image_ref`: 2 pass (P1-5 regression + new P6-4 unit) - `cargo test -p kebab-app --test image_pipeline`: 5 pass ## Dependency boundaries - `kebab-app` adds `kebab-parse-image`, exactly as the spec's Allowed deps permit. - No new forbidden violations (`kebab-tui` / `kebab-desktop` / `kebab-eval` remain unreferenced). - This task adds 0 lines of image-specific business logic; all of it is delegated to `kebab-parse-image`. `tasks/p6/p6-4-image-ingest-wiring.md` status: planned → completed. contract: docs/superpowers/specs/2026-04-27-kebab-final-form-design.md sections: §3.4 ImageRefBlock, §6.1 ingest pipeline, §7.2 Extractor/Chunker traits, §9.1 image extraction policy.	2026-05-02 07:37:56 +00:00
altair823	f1a448d6dc	refactor(rename): kb → kebab: binary, env vars, XDG paths, file renames Second commit. Bulk rename of the user-facing surface (CLI binary, env vars, XDG paths) + the short tokens in code (`KB_`, `kb.sqlite`, `/kb/`, tracing target). Plus 3 file renames: - design doc `2026-04-27-kb-final-form-design.md` → `2026-04-27-kebab-final-form-design.md` - initial report `kb_local_rust_report.md` → `kebab_local_rust_report.md` - workspace ignore `.kbignore` → `.kebabignore` ## Changes - `crates/kebab-cli/Cargo.toml`: `[[bin]] name = "kb"` → `"kebab"`. - `crates/kebab-cli/src/main.rs`: `#[command(name = "kb", …)]` → `name = "kebab"`. - All `KB_` env vars (code + docs + tests) → `KEBAB_`: apply_env prefix matching + all 30+ setting keys. - XDG paths: `~/.config/kb` / `~/.local/share/kb` / `~/.cache/kb` / `~/.local/state/kb` → `~/.config/kebab` etc.: config defaults + expand_path tests + all hardcoded values in paths.rs. - SQLite filename: `kb.sqlite` → `kebab.sqlite` (`SQLITE_FILE` const + all hardcoded test values). - tracing target: `target: "kb-"` → `"kebab-"` (10+ places). - snapshot fixture: `.kbignore` → `.kebabignore` (updates `fixtures/source-fs/tree-1.snapshot.json`). ## Verification - `cargo test --workspace -j 1` clean (run serially to avoid linker OOM). - `cargo clippy --workspace --all-targets -- -D warnings` clean. Docs sweep in the next commit. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 04:01:35 +00:00
altair823	911fb49550	refactor(rename): kb crates → kebab: Cargo packages, folders, Rust modules First step of renaming the project from `kb` to `kebab`. - workspace `Cargo.toml`: members `crates/kb-` → `crates/kebab-`, repository URL `altair823/kb` → `altair823/kebab`. - Rename 18 crate folders via `git mv` (preserves history). - Each crate `Cargo.toml`: `name = "kb-"` → `"kebab-"`, path deps `../kb-` → `../kebab-`. - All `.rs`: bulk sed of the 18 snake-case `kb_<id>` module paths (`kb_core`, `kb_config`, `kb_app`, `kb_cli`, `kb_eval`, `kb_search`, `kb_chunk`, `kb_normalize`, `kb_source_fs`, `kb_parse_md`, `kb_parse_types`, `kb_store_sqlite`, `kb_store_vector`, `kb_embed`, `kb_embed_local`, `kb_llm`, `kb_llm_local`, `kb_rag`) → `kebab_<id>` (using the word boundary \\b so the "kb" abbreviation inside English sentences is not touched). The CLI binary name (`[[bin]] name = "kb"`), `KB_*` environment variables, XDG paths, tracing target, and the docs sweep come in the next commit. ## Verification - `cargo check --workspace` clean: committed after every crate built successfully. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 03:28:08 +00:00
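
The 2-gram supplement described in commit fe20be8195 can be sketched roughly as below. This is an illustration only, not the code in `tokenize_korean_morphological`: `supplement_bigrams` is a hypothetical name, the input is assumed to be morphemes already produced by lindera, and the per-morpheme emission order is an assumption chosen to agree with the examples quoted in that commit. The Hangul ranges and the length threshold (3 or more characters) are taken from the commit message.

```rust
/// True for Hangul syllables (U+AC00..D7A3) and jamo (U+1100..11FF, U+3130..318F),
/// the ranges listed in commit fe20be8195.
fn is_hangul(c: char) -> bool {
    matches!(
        c,
        '\u{AC00}'..='\u{D7A3}' | '\u{1100}'..='\u{11FF}' | '\u{3130}'..='\u{318F}'
    )
}

/// Hypothetical helper: emits each morpheme and, for Hangul-only morphemes of
/// at least 3 characters, its sliding-window 2-grams right after it.
fn supplement_bigrams(morphemes: &[&str]) -> Vec<String> {
    let mut out = Vec::new();
    for m in morphemes {
        out.push((*m).to_string());
        let chars: Vec<char> = m.chars().collect();
        // English, numeric, and mixed tokens are skipped to avoid false positives.
        if chars.len() >= 3 && chars.iter().all(|&c| is_hangul(c)) {
            for pair in chars.windows(2) {
                out.push(pair.iter().collect());
            }
        }
    }
    out
}
```

Under these assumptions, the morpheme list `["서울", "특별시"]` expands to `[서울, 특별시, 특별, 별시]`, and an English token such as `Rust` passes through unchanged.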

40 Commits
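
The line-window fallback from commit 0b7d8af759 (80-line windows with a 20-line overlap, so a stride of 60) can be sketched as follows. This is a sketch under stated assumptions, not the chunker's actual code: `line_windows` and its signature are hypothetical, line numbers are taken to be 1-based and inclusive, and clamping the last window to the end of the range is an assumption that happens to agree with the 200-line case quoted in the commit (1-80 / 61-140 / 121-200).

```rust
/// Hypothetical helper: splits the 1-based inclusive range
/// `line_start..=line_end` into windows of `window` lines that overlap by
/// `overlap` lines. Ranges that fit in one window yield a single span.
fn line_windows(
    line_start: usize,
    line_end: usize,
    window: usize,
    overlap: usize,
) -> Vec<(usize, usize)> {
    assert!(window > overlap, "window must be larger than overlap");
    let mut out = Vec::new();
    if line_end < line_start {
        return out; // empty range, nothing to emit
    }
    let stride = window - overlap;
    let mut start = line_start;
    loop {
        // Clamp the final window to the end of the range (assumption).
        let end = (start + window - 1).min(line_end);
        out.push((start, end));
        if end == line_end {
            break;
        }
        start += stride;
    }
    out
}
```

With `window = 80` and `overlap = 20`, a 200-line paragraph yields the three spans listed in the commit, while a 50-line paragraph stays in one piece.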