Compare commits
132 Commits
16dc02cfa2
...
v0.17.2
| Author | SHA1 | Date | |
|---|---|---|---|
| 1640ecf288 | |||
| 90e77631a8 | |||
| fa251db48f | |||
| 3114c31841 | |||
| 271329efbd | |||
| f2867540d2 | |||
| e118844256 | |||
| 41c5edc517 | |||
| d02149c010 | |||
| 0c69b9621b | |||
| 0d69d85757 | |||
| a67300317b | |||
| abb05ebc23 | |||
| 26fdc4f344 | |||
| 3f5e0e6e90 | |||
| 578a60e3bb | |||
| 64f518e08e | |||
| fa9f91ead4 | |||
| 9ee89c2a94 | |||
| 13a3361ba2 | |||
| 0def913abd | |||
| ff9d5f5f86 | |||
| 70a5068c0d | |||
| 93ddece111 | |||
| 67559fb3ce | |||
| d79e432916 | |||
| 0ee18149e7 | |||
| 8a68289499 | |||
| 6ac7fea7b9 | |||
| fe123c0c6d | |||
| 753b1ff5e5 | |||
| 8dcedc4b11 | |||
| 8781c6112b | |||
| 14197b5e02 | |||
| 584247f1ea | |||
| a0c0dca321 | |||
| 667495ae6a | |||
| 08d72a12e0 | |||
| 1969c8e3b5 | |||
| c6207d196e | |||
| 840c6c40a6 | |||
| b81574afa9 | |||
| 6beff35a2f | |||
| 75a4207aa1 | |||
| 86aa180ad7 | |||
| 802c573c07 | |||
| 438870ee25 | |||
| 192835e5bf | |||
| 1034de25a2 | |||
| d1560be80d | |||
| b2a2902e38 | |||
| 03cd41c48f | |||
| 926042049c | |||
| e0a29225da | |||
| b541567946 | |||
| a58d400abd | |||
| 8add684ffc | |||
| 7a90df1485 | |||
| 46f408dc0f | |||
| 49e60fb314 | |||
| 6bc7a83d3c | |||
| df3c5b8caf | |||
| 5051ea7534 | |||
| 88d7fbc182 | |||
| 0b7d8af759 | |||
| 9342b9543f | |||
| a8aa03042f | |||
| 9d4a60aac5 | |||
| 8ce7a911ee | |||
| 75c1c7b911 | |||
| b5c12ecb6f | |||
| a1192ce3b2 | |||
| 17ee400fd5 | |||
| 217dddb4ba | |||
| 308666dbd5 | |||
| 522ae7b8bc | |||
| 166e1ddfaf | |||
| 226ce8b744 | |||
| 22d4161728 | |||
| 51004ac593 | |||
| 8996e73282 | |||
| 22dba09857 | |||
| aaa90b1754 | |||
| 077f92f41e | |||
| 5ce7f60932 | |||
| 47857b2622 | |||
| 1e4cff879b | |||
| 2d7a566624 | |||
| 813bdd1a16 | |||
| ff1bedbef5 | |||
| 30e03c7a12 | |||
| 2ce6ae47c5 | |||
| ebc4ef2eea | |||
| 7bda1509b7 | |||
| 61d48d67a3 | |||
| f4c840b994 | |||
| 15244b7494 | |||
| a7f7ab9f93 | |||
| 1b19e33a4f | |||
| 9c9e391b15 | |||
| f95cd55484 | |||
| ab288135e9 | |||
| c19aa006d0 | |||
| f1a4f67e12 | |||
| 6463c52827 | |||
| 2559d0d95a | |||
| 4524830306 | |||
| 8cdd3903c7 | |||
| 8b89961ada | |||
| eec90996aa | |||
| ce1c778b4a | |||
| 453ec15df4 | |||
| 1e6de9fe9f | |||
| 9fa2a1ebac | |||
| 749c6ae240 | |||
| 5f2bd9e97e | |||
| 1ce06c1e2d | |||
| d26efe167f | |||
| d6d165df01 | |||
| 2baa846c6b | |||
| 27baec82ea | |||
| acf8cf3be2 | |||
| ea5f7b22c8 | |||
| 5497c6e7b5 | |||
| 5a90940f1c | |||
| 4389b887f0 | |||
| 360f825f3a | |||
| 641b92af7d | |||
| 08fb743598 | |||
| 0a2a7ae214 | |||
| 803d02b68b | |||
| 4e8b84c4e0 |
104
Cargo.lock
generated
104
Cargo.lock
generated
@@ -4127,7 +4127,7 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "kebab-app"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"base64 0.22.1",
|
||||
@@ -4172,22 +4172,24 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "kebab-chunk"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"blake3",
|
||||
"kebab-core",
|
||||
"kebab-normalize",
|
||||
"kebab-parse-code",
|
||||
"kebab-parse-md",
|
||||
"serde_json",
|
||||
"serde_json_canonicalizer",
|
||||
"serde_yaml",
|
||||
"time",
|
||||
"tracing",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "kebab-cli"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"clap",
|
||||
@@ -4208,7 +4210,7 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "kebab-config"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"dirs 5.0.1",
|
||||
@@ -4223,7 +4225,7 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "kebab-core"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"blake3",
|
||||
@@ -4237,7 +4239,7 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "kebab-embed"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"blake3",
|
||||
@@ -4251,7 +4253,7 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "kebab-embed-local"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"fastembed",
|
||||
@@ -4264,7 +4266,7 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "kebab-eval"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"kebab-app",
|
||||
@@ -4283,7 +4285,7 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "kebab-llm"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"kebab-core",
|
||||
@@ -4292,7 +4294,7 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "kebab-llm-local"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"kebab-config",
|
||||
@@ -4309,7 +4311,7 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "kebab-mcp"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"kebab-app",
|
||||
@@ -4327,7 +4329,7 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "kebab-normalize"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"kebab-core",
|
||||
@@ -4342,7 +4344,7 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "kebab-parse-code"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"gix",
|
||||
@@ -4352,7 +4354,12 @@ dependencies = [
|
||||
"time",
|
||||
"tracing",
|
||||
"tree-sitter",
|
||||
"tree-sitter-c",
|
||||
"tree-sitter-cpp",
|
||||
"tree-sitter-go",
|
||||
"tree-sitter-java",
|
||||
"tree-sitter-javascript",
|
||||
"tree-sitter-kotlin-ng",
|
||||
"tree-sitter-python",
|
||||
"tree-sitter-rust",
|
||||
"tree-sitter-typescript",
|
||||
@@ -4360,7 +4367,7 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "kebab-parse-image"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"ab_glyph",
|
||||
"anyhow",
|
||||
@@ -4384,7 +4391,7 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "kebab-parse-md"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"kebab-core",
|
||||
@@ -4401,7 +4408,7 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "kebab-parse-pdf"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"blake3",
|
||||
@@ -4414,7 +4421,7 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "kebab-parse-types"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"kebab-core",
|
||||
"serde",
|
||||
@@ -4422,7 +4429,7 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "kebab-rag"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"blake3",
|
||||
@@ -4443,7 +4450,7 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "kebab-search"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"globset",
|
||||
@@ -4462,10 +4469,11 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "kebab-source-fs"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"blake3",
|
||||
"globset",
|
||||
"ignore",
|
||||
"kebab-config",
|
||||
"kebab-core",
|
||||
@@ -4480,7 +4488,7 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "kebab-store-sqlite"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"blake3",
|
||||
@@ -4501,7 +4509,7 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "kebab-store-vector"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"arrow",
|
||||
@@ -4525,7 +4533,7 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "kebab-tui"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"crossterm",
|
||||
@@ -8526,6 +8534,46 @@ dependencies = [
|
||||
"tree-sitter-language",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "tree-sitter-c"
|
||||
version = "0.24.2"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "a9b2eb57a55fed6b00812912e730b7a275cf4fe98bfd6a5d76263d4438371728"
|
||||
dependencies = [
|
||||
"cc",
|
||||
"tree-sitter-language",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "tree-sitter-cpp"
|
||||
version = "0.23.4"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "df2196ea9d47b4ab4a31b9297eaa5a5d19a0b121dceb9f118f6790ad0ab94743"
|
||||
dependencies = [
|
||||
"cc",
|
||||
"tree-sitter-language",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "tree-sitter-go"
|
||||
version = "0.25.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "c8560a4d2f835cc0d4d2c2e03cbd0dde2f6114b43bc491164238d333e28b16ea"
|
||||
dependencies = [
|
||||
"cc",
|
||||
"tree-sitter-language",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "tree-sitter-java"
|
||||
version = "0.23.5"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "0aa6cbcdc8c679b214e616fd3300da67da0e492e066df01bcf5a5921a71e90d6"
|
||||
dependencies = [
|
||||
"cc",
|
||||
"tree-sitter-language",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "tree-sitter-javascript"
|
||||
version = "0.25.0"
|
||||
@@ -8536,6 +8584,16 @@ dependencies = [
|
||||
"tree-sitter-language",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "tree-sitter-kotlin-ng"
|
||||
version = "1.1.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "e800ebbda938acfbf224f4d2c34947a31994b1295ee6e819b65226c7b51b4450"
|
||||
dependencies = [
|
||||
"cc",
|
||||
"tree-sitter-language",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "tree-sitter-language"
|
||||
version = "0.1.7"
|
||||
|
||||
10
Cargo.toml
10
Cargo.toml
@@ -31,7 +31,7 @@ edition = "2024"
|
||||
rust-version = "1.85"
|
||||
license = "MIT OR Apache-2.0"
|
||||
repository = "https://github.com/altair823/kebab"
|
||||
version = "0.8.1"
|
||||
version = "0.17.2"
|
||||
|
||||
[workspace.dependencies]
|
||||
anyhow = "1"
|
||||
@@ -94,6 +94,14 @@ tree-sitter-rust = "0.24"
|
||||
tree-sitter-python = "0.25.0"
|
||||
tree-sitter-typescript = "0.23.2"
|
||||
tree-sitter-javascript = "0.25.0"
|
||||
# Go grammar for code ingest (kebab-parse-code, p10-1C-Go).
|
||||
tree-sitter-go = "0.25.0"
|
||||
# JVM family grammars for code ingest (kebab-parse-code, p10-1C-JK).
|
||||
tree-sitter-java = "0.23.5"
|
||||
tree-sitter-kotlin-ng = "1.1.0" # bare tree-sitter-kotlin requires ts <0.23; -ng uses tree-sitter-language 0.1 (ts 0.26 compat)
|
||||
# C/C++ family grammars for code ingest (kebab-parse-code, p10-1D).
|
||||
tree-sitter-c = "0.24.2"
|
||||
tree-sitter-cpp = "0.23.4"
|
||||
|
||||
# Disk-footprint trim for dev / test builds. Codegen, opt-level, and
|
||||
# behavior are unchanged — only DWARF debug info is reduced (line
|
||||
|
||||
35
HANDOFF.md
35
HANDOFF.md
@@ -4,7 +4,7 @@
|
||||
|
||||
## 한 줄 요약
|
||||
|
||||
P0–P5 + P6 + P7 + P9-1/2/3/4 (Library / Search / Ask / Inspect) 머지 완료. `kebab ingest` 가 markdown / image / PDF 모두 처리. `kebab search` / `kebab ask` 가 매체 가로질러 결과 + page citation 반환. `kebab tui` 가 4 패널 (Library + Search + Ask + Inspect) 제공 — 사용자가 `?` 로 ask, `/` 로 search, Library Enter / Search `i` 로 inspect, Search `g` 로 editor jump. 다음 후보 = P9-5 (desktop tauri) 또는 보류 중인 P8 (audio) 의 시스템 dep brainstorm.
|
||||
P0–P5 + P6 + P7 + P9-1/2/3/4 (Library / Search / Ask / Inspect) + P10 전체 머지 완료 (현재 **v0.17.2**). `kebab ingest` 가 markdown / image / PDF / 소스코드 (Rust / Python / TS / JS / Go / Java / Kotlin / C / C++) / Tier 2 리소스 파일 (yaml/k8s / dockerfile / toml / json / xml / groovy / go-mod) + Tier 3 paragraph fallback (shell / 비-k8s YAML / AST 실패 케이스) 처리. `kebab search` / `kebab ask` 가 매체 가로질러 결과 + page / code citation 반환. `kebab tui` 가 4 패널 (Library + Search + Ask + Inspect) 제공. **v0.17.0 cut (2026-05-24)**: 한국어 trigram FTS5 tokenizer (PR #159) + C typedef alias unit (PR #160) + `code_lang_chunk_breakdown` additive (PR #161). **v0.17.1 cut (2026-05-25)**: 확장 도그푸딩 후 `[models.llm] request_timeout_secs` config 노브 (PR #162) + sudo 없이 ollama 설치 + `kebab ask --stream` UX 권장 docs (PR #163). **v0.17.2 cut (2026-05-25)**: v0.17.1 post-dogfood polish — `[image.ocr] request_timeout_secs` 별 노브 (PR #164, v0.17.1 미진행 closure) + `heading_path` FTS5 column filter 로 text-only 매칭 + raw-mode escape hatch (PR #165, 2026-05-24 v0.17.0 trigram entry 의 JSON 노이즈 closure). 자세한 영향은 [v0.17.0 release notes](https://gitea.altair823.xyz/altair823-org/kebab/releases/tag/v0.17.0) + [v0.17.1 release notes](https://gitea.altair823.xyz/altair823-org/kebab/releases/tag/v0.17.1) + [v0.17.2 release notes](https://gitea.altair823.xyz/altair823-org/kebab/releases/tag/v0.17.2). 구조적으로 남은 component 는 P9-5 (desktop tauri) 하나뿐, P8 (audio) 는 사용자 보류.
|
||||
|
||||
## Phase 로드맵
|
||||
|
||||
@@ -20,7 +20,7 @@ P0–P5 + P6 + P7 + P9-1/2/3/4 (Library / Search / Ask / Inspect) 머지 완료.
|
||||
| **P7** | PDF text + page citation | `kebab-parse-pdf` | P5 | ✅ 완료 (3/3 component, page-level chunker + ingest wiring) |
|
||||
| **P8** | 음성 transcription + timestamp citation | `kebab-parse-audio` | P5 | ⏸ 보류 (whisper-rs 시스템 dep brainstorm 필요) |
|
||||
| **P9** | TUI + desktop app | `kebab-tui`, `kebab-desktop` | P5 | 🟡 진행 (4/5 component — P9-1/2/3/4 완료 [Library / Search / Ask / Inspect], P9-5 desktop 예정 · 도그푸딩 피드백 **20/20 ✅**) |
|
||||
| **P10** | code ingest framework | `kebab-parse-code` | P5 | 🟡 진행 중 — 1A-1 ✅ (wire schema + parse-code skeleton + filter flags), 1A-2 ✅ (Rust AST chunker, tree-sitter-rust, `code-rust-ast-v1` — v0.7.0), **1B 🟡 PR 오픈** (Python `code-python-ast-v1` + TypeScript `code-ts-ast-v1` + JavaScript `code-js-ast-v1` — 3 언어 dogfooding 가능, v0.8.0 대기) |
|
||||
| **P10** | code ingest framework | `kebab-parse-code` | P5 | 🟡 진행 중 — 1A-1 ✅ (wire schema + parse-code skeleton + filter flags), 1A-2 ✅ (Rust AST chunker, `code-rust-ast-v1` — v0.7.0), 1B ✅ (Python/TS/JS AST chunkers — v0.8.0 이후), **1C-Go ✅ (Go AST chunker, `code-go-ast-v1` — v0.12.0)**, **1C-JavaKotlin ✅ (Java + Kotlin AST chunkers, `code-java-ast-v1` / `code-kotlin-ast-v1` — v0.13.0)**, **2 ✅ (Tier 2 resource-aware: yaml/k8s + dockerfile + manifest, `k8s-manifest-resource-v1` / `dockerfile-file-v1` / `manifest-file-v1` — v0.14.0)**, **3 ✅ (Tier 3 paragraph fallback: code-text-paragraph-v1 — v0.15.0)**, **1D ✅ (C + C++ AST chunkers, code-c-ast-v1 + code-cpp-ast-v1 — v0.16.0)** |
|
||||
|
||||
P0~P5 직렬. P6~P9 P5 이후 병렬 가능.
|
||||
|
||||
@@ -32,6 +32,12 @@ P0~P5 직렬. P6~P9 P5 이후 병렬 가능.
|
||||
|
||||
머지 후 발견된 모든 deviation / hotfix 의 dated 로그는 [tasks/HOTFIXES.md](tasks/HOTFIXES.md). 본 요약은 \"누군가가 인수받을 때 알아두면 시간을 많이 절약하는\" 항목만:
|
||||
|
||||
- **2026-05-25 v0.17.2 post-v0.17.1 polish (PR #164 + #165)** — v0.17.1 의 두 follow-up closure. (1) `[image.ocr] request_timeout_secs` 별 노브 — `crates/kebab-parse-image/src/ocr.rs::REQUEST_TIMEOUT` hard 300s 제거, LLM 쪽 패턴 (PR #162) 을 OCR 어댑터에 동일 적용. 사용자 결정으로 별 노브 분리 (OCR vs LLM 의 cold start 패턴이 달라 독립 조절). v0.17.1 미진행 항목 closure. (2) `chunks_fts` 의 `heading_path` 컬럼이 JSON 표기 + path 세그먼트 까지 trigram 색인 → query false positive 가능 문제 closure. `lexical.rs::build_match_string` 가 non-raw 분기 결과를 `text : (<expr>)` 로 wrap — heading 색인 V007 verbatim 유지, 매칭만 text 한정. 사용자가 명시 heading 검색 하려면 raw mode `'heading_path : <token>'` escape hatch (SKILL.md 갱신). 둘 다 additive (옛 config 호환) / re-ingest 불필요. 자세한 내용: `tasks/HOTFIXES.md` (2026-05-25 v0.17.2 두 entry).
|
||||
- **2026-05-25 v0.17.1 post-dogfood (PR #162 + #163)** — 확장 도그푸딩 (16 GB CPU only, gemma4:e4b 시도) 에서 발견된 두 follow-up 한 묶음. (1) `crates/kebab-llm-local/src/ollama.rs::REQUEST_TIMEOUT` hard 300s → `[models.llm] request_timeout_secs` config + env override (additive, default 300, `=0` 은 disable 아닌 "즉시 timeout" 이라 doc 명시). (2) README + SMOKE 에 sudo / systemd 없이 ollama 설치 + ≤4B Q4 권장 모델 + `kebab ask --stream` UX 권장 docs. additive only — 옛 config / wire 호환. 자세한 내용: `tasks/HOTFIXES.md` (2026-05-25).
|
||||
- **2026-05-24 v0.17.0 PR-C `code_lang_chunk_breakdown` additive (closure of 2026-05-22 LOW)** — `schema.v1.stats` 에 chunk 수 집계 신규 키. 기존 `code_lang_breakdown` (doc count) 와 sister. 또 기존 두 필드 JSON schema description 의 "chunk count" 오기재 → "doc count" 로 정정. wire additive — schema_version bump 불필요. 자세한 내용: `tasks/HOTFIXES.md` (2026-05-24 PR-C).
|
||||
- **2026-05-24 v0.17.0 PR-B C typedef alias unit (closure of 2026-05-21)** — `kebab-parse-code::c::extract_blocks` 의 `type_definition` 분기로 inner anonymous struct/enum/union → declarator 의 typedef alias 이름으로 synthetic unit 방출. `PARSER_VERSION code-c-v1` → `code-c-v2` bump + 같은-asset/다른-doc_id 케이스용 `purge_workspace_path_for_parser_bump` cascade (`stale_chunk_ids_for_workspace_path_except_doc_id` + `purge_document_at_workspace_path_except_doc_id` helper 신규). 사용자 작업 불필요 (다음 ingest 가 자동 재처리). 자세한 내용: `tasks/HOTFIXES.md` (2026-05-24 PR-B).
|
||||
- **2026-05-24 v0.17.0 PR-A 한국어 trigram tokenizer 채택 (closure of 2026-05-22 한국어 lexical)** — `chunks_fts` 가 FTS5 `unicode61` → `trigram` 으로 V007 migration (자동 backfill, re-ingest 불필요). `lexical.rs::build_match_string` trigram-aware 재설계 — multi-token 한국어 query (`해시 충돌`) 가 whole-phrase 후보로 hit, 한영 혼합 (`Rust 충돌은`) 도 OR-combined. 2자 이하 query 는 0-hit + CLI/TUI/wire `hint` 안내. 영어 lexical 도 substring 매칭으로 바뀜 (recall ↑ / 단어 경계 ↓). `kebab.sqlite` 크기 ~2-5배 증가 (trigram index). 자세한 내용: `tasks/HOTFIXES.md` (2026-05-24).
|
||||
- **2026-05-22 P10 종합 도그푸딩 round 2 (한국어 lexical 검색 한계)** — `kebab search --mode lexical` 의 한국어 query 가 FTS5 `unicode61` 토크나이저에서 거의 0 hit (어절 단위 토큰화 → 부분 매칭 불가). 기본 hybrid 모드는 `multilingual-e5-small` vector 가 carry 해 한국어 검색 정상. **closure**: 위 2026-05-24 v0.17.0 entry.
|
||||
- **2026-05-20 P10-1B (Rust 1A symbol path 비일관 + expression-level 함수 미방출)** — (a) Rust `code-rust-ast-v1` 은 file-scope nesting 만 (workspace path prefix 없음), 1B 의 Python/TypeScript/JavaScript 는 workspace 경로 → module path prefix 사용 (비일관 수용, retrofit = chunker_version bump + reindex 필요, 사용자 명시 요청까지 보류); (b) TS/JS 의 `const foo = () => {...}` 같은 expression-level 함수는 `<top-level>` glue 로 처리됨 (declaration-level 단위만 1B 1차 범위). 자세한 내용: `tasks/HOTFIXES.md` (2026-05-20) 두 항목.
|
||||
- **2026-05-19 P10-1A-2 (code_rust_ast_v1.rs + SourceType)** — `AST_CHUNK_MAX_LINES` 상수가 `IngestCodeCfg.ast_chunk_max_lines` 를 읽지 않고 모듈 상수 200 고정 (Chunker trait 이 per-medium config 미노출); `SourceType::Code` variant 부재로 code 파일이 `SourceType::Note` 로 분류됨 — 두 항목 모두 `tasks/HOTFIXES.md` (2026-05-19) 에 기록.
|
||||
- **2026-05-07 fb-26 (progress.rs)** — `Aborted` unconditional writeln (TTY duplicate) + `Completed` TTY no summary fixed; `KEBAB_PROGRESS=plain` env + quiet suppression added
|
||||
@@ -81,13 +87,13 @@ P0~P5 직렬. P6~P9 P5 이후 병렬 가능.
|
||||
|
||||
## 다음 task 후보
|
||||
|
||||
- **P9-2 TUI search** — `App.search` slot 채움. Library 의 `/` 가 enable 됨.
|
||||
- **P9-3 TUI ask** — `App.ask` slot 채움. `?` enable.
|
||||
- **P9-4 TUI inspect** — `App.inspect` slot 채움. `Enter` enable.
|
||||
- **P9-5 desktop tauri** — 별도 분기. PDF citation rendering UI 가치 큼.
|
||||
- **P8 audio brainstorm** — whisper-rs 시스템 dep 받을지 / 외부 transcription endpoint 사용할지 사용자 결정 필요. 사용자 패턴 (책+PDF 위주, audio 의향 없음) 상 후순위.
|
||||
구조적으로 미완인 component 는 P9-5 하나뿐. 나머지는 도그푸딩 follow-up (아래 "P10 dogfooding 백로그") 또는 사용자 결정 대기.
|
||||
|
||||
P9-2/3/4 는 P9-1 의 parallel-safety contract (sub-state slot 패턴) 덕에 병렬 진행 가능 — 같은 `App` 손대지 않음.
|
||||
- **P9-5 desktop tauri** — 마지막 남은 P9 component. `kebab-desktop` crate + Tauri 앱, 별도 분기. PDF citation rendering UI 가치 큼. 사용자 우선순위 (P9 우선 · 책/PDF 위주) 와 부합.
|
||||
- **P10 도그푸딩 round 2 follow-up** — ✅ v0.17.0 cut (2026-05-24) 으로 세 항목 모두 closure (한국어 trigram PR-A + C typedef alias PR-B + code_lang_chunk_breakdown additive PR-C). 상세 cross-link: 아래 "P10 dogfooding 백로그" 절 + `tasks/HOTFIXES.md` (2026-05-24 PR-A/B/C).
|
||||
- **P8 audio brainstorm** — whisper-rs 시스템 dep 받을지 / 외부 transcription endpoint 사용할지 사용자 결정 필요. 사용자 패턴 (책+PDF 위주, audio 의향 없음) 상 보류.
|
||||
- **fb-41 multi-hop reasoning** — ⏳ 미구현, XL, eval 인프라 선행 + brainstorm 필요.
|
||||
- **Rust symbol path retrofit** — Rust `code-rust-ast-v1` symbol 이 file-scope-only (1B+ 는 module prefix). `code-rust-ast-v2` bump + Rust corpus re-ingest 비용 → 사용자 명시 요청까지 보류. HOTFIXES `2026-05-20`.
|
||||
|
||||
### P9 dogfooding 백로그 (fb-26 ~ fb-42) — release 분할
|
||||
|
||||
@@ -96,11 +102,20 @@ P9-2/3/4 는 P9-1 의 parallel-safety contract (sub-state slot 패턴) 덕에
|
||||
- **0.3.0 — agent foundation** ✅ cut 2026-05-07: fb-26 (log), fb-27 (introspection/error wire), fb-28 (readonly/quiet). ~~fb-29 (daemon)~~ → 🚫 **deferred** — fb-30 stdio MCP 가 동일 가치를 daemon 복잡도 없이 제공.
|
||||
- **0.4.0 — agent integration (MCP)** ✅ cut: fb-30 (MCP stdio), fb-31 (single-file/stdin ingest).
|
||||
- **0.5.0 — agent surface refinement (additive)** ✅ cut 2026-05-10: fb-32 (stale doc indicator), fb-33 (streaming ask), fb-34 (output budget controls), fb-35 (verbatim fetch), fb-36 (search filter args), fb-37 (trace + stats). 모두 wire schema additive minor.
|
||||
- **0.6.0 — RAG quality** 🟡 진행: fb-38 (score semantics) ✅ 머지 (2026-05-10), fb-40 (fact-grounded answer / rag-v2 prompt) ✅ 머지 (2026-05-10), fb-39 (retrieval precision tuning, embedding_version cascade) — 미진행 (eval golden set 선행 필요).
|
||||
- **0.7.0 또는 P+**: fb-41 (multi-hop reasoning, XL), fb-42 (bulk multi-query / rerank, Nice).
|
||||
- **0.6.0 — RAG quality** ✅ 대부분 머지 (2026-05-10): fb-38 (score semantics) ✅, fb-39 (eval foundation — `precision_at_k_chunk` metric) ✅, fb-39b (embedding upgrade — multilingual-e5-large default) ✅, fb-40 (fact-grounded answer / rag-v2 prompt) ✅. 잔여 = fb-39 의 retrieval precision lever 실제 적용 (eval golden set 확장 선행 필요).
|
||||
- **0.7.0 또는 P+**: fb-41 (multi-hop reasoning, XL) — ⏳ 미구현 · brainstorm 필요; fb-42 (bulk multi-query) ✅ 머지 (2026-05-10, bulk only — rerank hint 은 deferred).
|
||||
|
||||
각 fb spec frontmatter 의 `target_version` 필드가 source of truth. INDEX.md 의 release subheader 도 동일 grouping.
|
||||
|
||||
### P10 dogfooding 백로그 (2026-05-22 round 2)
|
||||
|
||||
P10 종합 도그푸딩 round 2 (`/build/cache/dogfood-p10b/`, OSS 8 repo + 한국어 위키 문서 10편) 에서 발견된 follow-up 후보. 자세한 내용 + 우선순위 근거는 `tasks/HOTFIXES.md` (2026-05-22).
|
||||
|
||||
- **한국어 lexical tokenizer** — ✅ v0.17.0 (2026-05-24) PR-A 머지 (#159). V007 trigram migration 자동 backfill + `build_match_string` 재설계 + CLI/TUI/wire hint. HOTFIXES `2026-05-24 PR-A` 참조.
|
||||
- **code_lang_chunk_breakdown chunk 단위 집계 (LOW)** — ✅ v0.17.0 (2026-05-24) PR-C 머지 (#161). `schema.v1.stats` additive 필드. HOTFIXES `2026-05-24 PR-C` 참조.
|
||||
- **C typedef-wrapped struct (LOW)** — ✅ v0.17.0 (2026-05-24) PR-B 머지 (#160). `type_definition` 분기 + `PARSER_VERSION code-c-v2` bump + orphan purge cascade. HOTFIXES `2026-05-24 PR-B` 참조.
|
||||
- **ranking glue chunk 편향 (deferred)** — 자동 heuristic 은 user intent misalignment 위험. 사용자 명시 요청 전까지 surface 변경 0 유지. 1주+ 실사용 후 재 brainstorm.
|
||||
|
||||
## 검증된 운영 동작 (release binary, fastembed enabled)
|
||||
|
||||
P7-3 머지 직후 25 시나리오 smoke 통과 — markdown + image + PDF 5 자산 워크스페이스에서 doctor / ingest / list / inspect / search (lex/vec/hybrid) / re-ingest / byte-edit re-ingest / corrupt PDF / RAG ask + page citation 모두. 자세한 시나리오 표는 conversation 기록 참조; 워크스페이스에 직접 돌려보는 절차는 [docs/SMOKE.md](docs/SMOKE.md).
|
||||
|
||||
24
README.md
24
README.md
@@ -6,6 +6,20 @@
|
||||
|
||||
- **Rust toolchain** ≥ 1.85 (workspace 가 edition 2024 + resolver 3 사용). [rustup](https://rustup.rs) 권장.
|
||||
- **Ollama** — `kebab ask` 와 이미지 OCR/caption 가 사용. `https://ollama.com/download` 에서 설치 후 `ollama serve` 실행. 기본 LLM 은 gemma4 계열 (`ollama pull gemma4:e4b`) — OCR / caption 도 같은 family 라 모델 하나만 pull 하면 됨. 더 큰 variant 원하면 `gemma4:26b` 등으로 config override. config 의 `[models.llm].endpoint` 에 host:port 명시.
|
||||
- **CPU only / RAM ≤ 16 GB 환경 권장 모델**: gemma4:e4b (8B) 는 CPU 추론에 무거워 RAG 한 답변이 5분을 넘기기 쉽다 — `[models.llm] request_timeout_secs` 의 기본 300 s 한도에 걸려 `error: kb-rag: llm.generate_stream` 으로 떨어진다 (HOTFIXES 2026-05-25). `gemma3:4b` / `qwen2.5:3b` / `phi3:mini` 같은 ≤ 4B Q4 모델로 바꾸면 답변 1-3 분에 안정 동작 (확장 도그푸딩에서 검증). 모델 storage 가 부담이면 `OLLAMA_MODELS=/path` env 로 위치 분리 가능.
|
||||
- **`request_timeout_secs` 노브 (v0.17.0)**: `[models.llm] request_timeout_secs = 1200` (또는 `KEBAB_MODELS_LLM_REQUEST_TIMEOUT_SECS=1200`) 로 한도를 늘려 큰 모델도 시도 가능. 단 응답 동안 RAM 점유가 길어진다. **`= 0` 은 disable 이 아니라 "즉시 timeout"** (reqwest 의 의미상) — "사실상 무제한" 의도면 `u64::MAX` 또는 `86400` 같이 큰 finite 값 사용.
|
||||
- **sudo 없이 설치 (격리 디렉토리 사용)**: `install.sh` 가 `/usr/local/bin/ollama` + `systemd` 유닛까지 건드리는 게 부담이면 binary tarball 만 받아 사용자 디렉토리에 풀고 env 로 모델 위치 분리하면 된다.
|
||||
```bash
|
||||
mkdir -p /opt/ollama/{models,logs}
|
||||
curl -fL https://ollama.com/download/ollama-linux-amd64.tar.zst -o /tmp/ollama.tar.zst
|
||||
zstd -d /tmp/ollama.tar.zst -o /tmp/ollama.tar && tar -xf /tmp/ollama.tar -C /opt/ollama/
|
||||
# bin/ollama + lib/ollama/ 가 풀린다. 모델 디렉토리는 OLLAMA_MODELS 로 분리.
|
||||
OLLAMA_MODELS=/opt/ollama/models OLLAMA_HOST=127.0.0.1:11434 \
|
||||
/opt/ollama/bin/ollama serve > /opt/ollama/logs/serve.log 2>&1 &
|
||||
/opt/ollama/bin/ollama pull gemma3:4b
|
||||
```
|
||||
루트 디스크 부담을 분리하고 싶을 때 (`~/.ollama/models` 가 기본) 그대로 활용. systemd 가 없는 컨테이너 / WSL2 / 회사 머신 등에서 유용.
|
||||
- **`kebab ask --stream` 권장 (fb-33)**: 모델 cold start 가 길 때 (8B+ 또는 첫 호출) `--stream` 으로 토큰을 stderr 에 ndjson 으로 흘려 받으면 5 분 timeout 한도 안에서도 첫 토큰이 빨리 보여 사용자 체감이 개선된다. 동일 inference 시간이라도 wait-and-pray 보다 progressive 가 안정적. CLI: `kebab ask "..." --stream 2> events.ndjson > final.json`. MCP host 도 `streaming_ask` capability flag 가 `true` 면 자동 사용 권장.
|
||||
- **빌드 디스크** — 첫 빌드 시 `target/` 가 6–10 GB (Lance + DataFusion + fastembed). 여유 확인.
|
||||
- **fastembed 모델** — 첫 `kebab ingest` 시 `multilingual-e5-large` (~1.3 GB, fb-39b) 자동 다운로드. `config.toml` 에서 `model = "multilingual-e5-small"` 로 명시하면 이전 모델 사용.
|
||||
|
||||
@@ -34,7 +48,7 @@ cargo install --git https://gitea.altair823.xyz/altair823-org/kebab.git --bin ke
|
||||
|
||||
업데이트는 `git pull && cargo install --path crates/kebab-cli --locked --force` 또는 git URL 형식의 경우 `cargo install --git ... --force`.
|
||||
|
||||
제거는 `cargo uninstall kebab-cli`. 이 명령은 binary 만 지우고 워크스페이스 데이터는 그대로 남는다. 데이터까지 정리하려면 `kebab reset --all --yes` (config + data + cache + state 4 개 XDG 경로 모두 wipe — **irreversible**, 재시작 시 `kebab init` 다시 실행). 부분 wipe 는 `kebab reset --data-only` (config 보존), `kebab reset --vector-only` (Lance + `embedding_records` 만, 다음 ingest 가 re-embed) 등.
|
||||
제거는 `cargo uninstall kebab-cli`. 이 명령은 binary 만 지우고 워크스페이스 데이터는 그대로 남는다. 데이터까지 정리하려면 `kebab reset --all --yes` (config + data + cache + state 4 개 XDG 경로 모두 wipe — **irreversible**, 재시작 시 `kebab init` 다시 실행). 부분 wipe 는 `kebab reset --data-only` (config 보존), `kebab reset --vector-only` (Lance + `embedding_records` 만, 다음 ingest 가 re-embed), **`kebab reset --orphans-only`** (현재 walker scope 밖에 있는 stored doc 만 정리 — `config.workspace.include` 좁히거나 sub-dir 옮긴 후 explicit reconcile; fs 의 file 은 건드리지 않음) 등.
|
||||
|
||||
## Quick start
|
||||
|
||||
@@ -42,7 +56,7 @@ cargo install --git https://gitea.altair823.xyz/altair823-org/kebab.git --bin ke
|
||||
# 첫 실행 — XDG 경로에 데이터 디렉토리 + config.toml 생성
|
||||
kebab init
|
||||
|
||||
# config 손보고 — workspace.root, 모델 endpoint 등 설정 (지원 형식: md / png / jpg / pdf / rs / py / ts / js)
|
||||
# config 손보고 — workspace.root, 모델 endpoint 등 설정 (지원 형식: md / png / jpg / pdf / rs / py / ts / js / go)
|
||||
${EDITOR:-vi} ~/.config/kebab/config.toml
|
||||
|
||||
# 색인 (Markdown / 이미지 / PDF 모두 한 번에)
|
||||
@@ -70,8 +84,8 @@ kebab doctor
|
||||
| 명령 | 동작 |
|
||||
|------|------|
|
||||
| `kebab init` | XDG 경로에 데이터 디렉토리 + config.toml 생성 |
|
||||
| `kebab ingest [<path>]` | Markdown / 이미지 / PDF / Rust 소스코드 색인 (idempotent). TTY 에서는 stderr 진행 바, non-TTY (CI / pipe) 는 stderr 한 줄씩, `--json` 은 stdout 에 `ingest_progress.v1` 라인 streaming 후 마지막에 `ingest_report.v1`. Ctrl-C 한 번이면 현재 asset 마무리 후 abort (부분 commit 보존, idempotent re-run), 두 번째 Ctrl-C 는 hard exit. Markdown title 이 frontmatter 에 없어도 첫 H1 → H2 → 첫 paragraph 80 자 → 파일명 순으로 자동 채움 (parser_version `md-frontmatter-v2`) — 기존 색인된 doc 도 다음 ingest 에서 새 title 로 갱신. **Incremental** (p9-fb-23): 두 번째 이후의 ingest 는 변하지 않은 doc (blake3 + parser/chunker/embedder version 모두 동일) 의 parse/chunk/embed/vector upsert 를 자동 스킵. final summary 에 `N unchanged` 카운트 표시. `--force-reingest` 로 skip 무시 강제 재처리. **지원 형식** (extractor 자동 결정 — config 에 명시 불가): Markdown (`.md`), 이미지 (`.png` / `.jpg` / `.jpeg`, OCR + caption), PDF (`.pdf`), **소스코드** (`.rs` → `code-rust-ast-v1`, `.py` → `code-python-ast-v1`, `.ts`/`.tsx` → `code-ts-ast-v1`, `.js`/`.mjs`/`.cjs`/`.jsx` → `code-js-ast-v1` — 모두 tree-sitter AST chunker). 다른 확장자는 자동 skip — `IngestItem.warnings` 에 사유 (`"unsupported media type: .docx"` 등), `IngestReport.skipped_by_extension` 에 카운트 분류, CLI / TUI summary 에 breakdown 표시. 코드 chunk 는 `citation.kind = "code"` 에 `citation.lang = "<lang>"` + `symbol` + line range 를 담고, SearchHit top-level 에 `code_lang` + `repo` (`.git/` walk-up 의 디렉토리 이름) 가 backfill 됨. `--code-lang rust` / `--code-lang python` / `--code-lang typescript` / `--code-lang javascript` / `--media code` filter 로 언어별·코드 전용 검색 가능 (p10-1A-1 filter flags). Python symbol 은 workspace 경로 → dotted module path prefix (예: `kebab_eval.metrics.compute_mrr`), TS/JS symbol 은 slash-style module path prefix (예: `src/Foo.Foo.search`). |
|
||||
| `kebab search --mode {lexical,vector,hybrid} "<query>" [--no-cache] [--max-tokens N] [--snippet-chars N] [--cursor <opaque>] [--tag T] [--lang L] [--path-glob G] [--trust-min LEVEL] [--media TYPE] [--ingested-after RFC3339] [--doc-id ID] [--trace] [--bulk] [--repo NAME ...] [--code-lang LIST]` | 검색. hybrid는 RRF fusion, citation 포함. 같은 process 안에서 동일 query (NFKC + trim + lowercase 정규화) 반복 시 in-process LRU 캐시 hit (capacity = `[search] cache_capacity`, default 256). `--no-cache` 로 강제 bypass — 디버깅용. ingest commit 발생 시 `kv['corpus_revision']` bump 으로 모든 entry 자동 stale. **`--max-tokens` / `--snippet-chars` / `--cursor` (p9-fb-34)** — agent budget controls. `--json` 출력은 `search_response.v1` wrapper (`{hits, next_cursor, truncated}`) — pre-fb-34 의 bare array 와 호환 안 됨. mismatched cursor → `error.v1.code = stale_cursor`. **filter flags (p9-fb-36):** `--tag` 는 반복 가능 flag (`--tag rust --tag async`) 로 OR 매칭, `--media` 는 `,` 구분 다중 값 OR 매칭, 나머지 flags 간은 AND 조합. `--trust-min` 은 `primary\|secondary\|generated` 중 하나 (해당 level 이상 포함). `--ingested-after` 는 RFC3339 UTC — 파싱 실패 시 `error.v1.code = config_invalid` (exit 2). `--media md` 는 `markdown` alias 로 정규화. 알 수 없는 `--media` 값은 무조건 empty hits (오류 아님). **`--trace` (p9-fb-37)** — `search_response.v1.trace` 에 lexical / vector pre-fusion 후보 + RRF union + per-stage timing (`lexical_ms` / `vector_ms` / `fusion_ms` / `total_ms`) 노출. trace 요청은 캐시 우회 (`--no-cache` 없이도 항상 cold). **`--bulk` (p9-fb-42)** — stdin ndjson 으로 N query 한 번에 실행. `--json` 면 stdout per-query ndjson (`bulk_search_item.v1`) + stderr summary (`bulk_summary: total=N succeeded=S failed=F`). Cap 100. agent 가 query decomposition 후 sub-query 일괄 실행 시 single round-trip — App instance 재사용으로 캐시 / embedder cold-start 비용 한 번만. Per-query failure 는 item 의 `error` (error.v1) 에 격리, 다른 query 계속 진행. **code corpus filters (p10-1A-1):** `--repo` 는 반복 가능 (`--repo kebab --repo other`) OR 매칭. `--code-lang` 는 반복 또는 comma 다중 값 (`--code-lang rust,python`), 알 수 없는 값은 빈 hits. `--media code` 는 Tier 1/2/3 모든 code chunk 포함. 1A-1 시점에서는 indexed 된 code chunk 가 없어 filter 가 항상 빈 결과 — 1A-2 (Rust AST chunker) 머지 이후 실효. |
|
||||
| `kebab ingest [<path>]` | Markdown / 이미지 / PDF / Rust 소스코드 색인 (idempotent). TTY 에서는 stderr 진행 바, non-TTY (CI / pipe) 는 stderr 한 줄씩, `--json` 은 stdout 에 `ingest_progress.v1` 라인 streaming 후 마지막에 `ingest_report.v1`. Ctrl-C 한 번이면 현재 asset 마무리 후 abort (부분 commit 보존, idempotent re-run), 두 번째 Ctrl-C 는 hard exit. Markdown title 이 frontmatter 에 없어도 첫 H1 → H2 → 첫 paragraph 80 자 → 파일명 순으로 자동 채움 (parser_version `md-frontmatter-v2`) — 기존 색인된 doc 도 다음 ingest 에서 새 title 로 갱신. **Incremental** (p9-fb-23): 두 번째 이후의 ingest 는 변하지 않은 doc (blake3 + parser/chunker/embedder version 모두 동일) 의 parse/chunk/embed/vector upsert 를 자동 스킵. final summary 에 `N unchanged` 카운트 표시. `--force-reingest` 로 skip 무시 강제 재처리. **지원 형식** (extractor 자동 결정 — config 에 명시 불가): Markdown (`.md`), 이미지 (`.png` / `.jpg` / `.jpeg`, OCR + caption), PDF (`.pdf`), **소스코드** (`.rs` → `code-rust-ast-v1`, `.py` → `code-python-ast-v1`, `.ts`/`.tsx` → `code-ts-ast-v1`, `.js`/`.mjs`/`.cjs`/`.jsx` → `code-js-ast-v1`, `.go` → `code-go-ast-v1`, `.java` → `code-java-ast-v1`, `.kt`/`.kts` → `code-kotlin-ast-v1`, `.c`/`.h` → `code-c-ast-v1`, `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx` → `code-cpp-ast-v1` — 모두 tree-sitter AST chunker; **Tier 2 리소스 파일**: `.yaml`/`.yml` → `k8s-manifest-resource-v1` (apiVersion+kind 파싱), `Dockerfile`/`Dockerfile.*`/`*.dockerfile` → `dockerfile-file-v1` (전체 파일), `Cargo.toml`/`pyproject.toml`/`.toml`/`package.json`/`tsconfig.json`/`.json`/`pom.xml`/`.xml`/`build.gradle`/`.gradle`/`go.mod` → `manifest-file-v1` (전체 파일) — yaml (k8s) / dockerfile / toml / json / xml / groovy / go-mod 지원); **Tier 3 paragraph fallback** (`.sh`/`.bash`/`.zsh` → `code-text-paragraph-v1`, blank-line paragraph split + 80-line/20-overlap line-window. Tier 1/2 가 0 chunk 또는 Err 시 자동 fallback — 비-k8s YAML 같은 케이스 picked up. symbol = None, lang 은 원본 보존.). 다른 확장자는 자동 skip — `IngestItem.warnings` 에 사유 (`"unsupported media type: .docx"` 등), `IngestReport.skipped_by_extension` 에 카운트 분류, CLI / TUI summary 에 breakdown 표시. 코드 chunk 는 `citation.kind = "code"` 에 `citation.lang = "<lang>"` + `symbol` + line range 를 담고, SearchHit top-level 에 `code_lang` + `repo` (`.git/` walk-up 의 디렉토리 이름) 가 backfill 됨. `--code-lang rust` / `--code-lang python` / `--code-lang typescript` / `--code-lang javascript` / `--code-lang go` / `--code-lang java` / `--code-lang kotlin` / `--code-lang yaml` / `--code-lang dockerfile` / `--code-lang toml` / `--code-lang json` / `--code-lang xml` / `--code-lang groovy` / `--code-lang go-mod` / `--code-lang shell` / `--code-lang c` / `--code-lang cpp` / `--media code` filter 로 언어별·코드 전용 검색 가능 (p10-1A-1 filter flags). Python symbol 은 workspace 경로 → dotted module path prefix (예: `kebab_eval.metrics.compute_mrr`), TS/JS symbol 은 slash-style module path prefix (예: `src/Foo.Foo.search`), Go symbol 은 `package.Func` / `package.(*Receiver).Method` 형식, Java / Kotlin symbol 은 `com.foo.Foo.bar` 형식 (패키지 + 클래스 + 메서드/필드). |
|
||||
| `kebab search --mode {lexical,vector,hybrid} "<query>" [--no-cache] [--max-tokens N] [--snippet-chars N] [--cursor <opaque>] [--tag T] [--lang L] [--path-glob G] [--trust-min LEVEL] [--media TYPE] [--ingested-after RFC3339] [--doc-id ID] [--trace] [--bulk] [--repo NAME ...] [--code-lang LIST]` | 검색. hybrid는 RRF fusion, citation 포함. 같은 process 안에서 동일 query (NFKC + trim + lowercase 정규화) 반복 시 in-process LRU 캐시 hit (capacity = `[search] cache_capacity`, default 256). `--no-cache` 로 강제 bypass — 디버깅용. ingest commit 발생 시 `kv['corpus_revision']` bump 으로 모든 entry 자동 stale. **`--max-tokens` / `--snippet-chars` / `--cursor` (p9-fb-34)** — agent budget controls. `--json` 출력은 `search_response.v1` wrapper (`{hits, next_cursor, truncated}`) — pre-fb-34 의 bare array 와 호환 안 됨. mismatched cursor → `error.v1.code = stale_cursor`. **filter flags (p9-fb-36):** `--tag` 는 반복 가능 flag (`--tag rust --tag async`) 로 OR 매칭, `--media` 는 `,` 구분 다중 값 OR 매칭, 나머지 flags 간은 AND 조합. `--trust-min` 은 `primary\|secondary\|generated` 중 하나 (해당 level 이상 포함). `--ingested-after` 는 RFC3339 UTC — 파싱 실패 시 `error.v1.code = config_invalid` (exit 2). `--media md` 는 `markdown` alias 로 정규화. 알 수 없는 `--media` 값은 무조건 empty hits (오류 아님). **`--trace` (p9-fb-37)** — `search_response.v1.trace` 에 lexical / vector pre-fusion 후보 + RRF union + per-stage timing (`lexical_ms` / `vector_ms` / `fusion_ms` / `total_ms`) 노출. trace 요청은 캐시 우회 (`--no-cache` 없이도 항상 cold). **`--bulk` (p9-fb-42)** — stdin ndjson 으로 N query 한 번에 실행. `--json` 면 stdout per-query ndjson (`bulk_search_item.v1`) + stderr summary (`bulk_summary: total=N succeeded=S failed=F`). Cap 100. agent 가 query decomposition 후 sub-query 일괄 실행 시 single round-trip — App instance 재사용으로 캐시 / embedder cold-start 비용 한 번만. Per-query failure 는 item 의 `error` (error.v1) 에 격리, 다른 query 계속 진행. **code corpus filters (p10-1A-1):** `--repo` 는 반복 가능 (`--repo kebab --repo other`) OR 매칭. `--code-lang` 는 반복 또는 comma 다중 값 (`--code-lang rust,python`), 알 수 없는 값은 빈 hits. `--media code` 는 Tier 1/2/3 모든 code chunk 포함. 1A-1 시점에서는 indexed 된 code chunk 가 없어 filter 가 항상 빈 결과 — 1A-2 (Rust AST chunker) 머지 이후 실효. **v0.17.0 trigram tokenizer (한국어 + 영어 동작 변경):** `chunks_fts` 가 FTS5 `trigram` 으로 동작 — 한국어 query 는 3자 이상 substring 매칭 (`해시 충돌` 같은 multi-token 도 whole-phrase 후보로 hit), 영어도 substring 매칭 (`token` 이 `tokenizer` 도 hit, recall ↑ / 단어 경계 ↓). 2자 이하 query 는 0-hit + stderr `[hint] 3자 이상 키워드 권장` + `search_response.v1.hint` 필드 (raw FTS5 mode `'...'` 제외). `kebab.sqlite` 파일 크기는 trigram index 비대화로 ~2-5배 또는 수백 MB 증가 (V007 자동 backfill, re-ingest 불필요). |
|
||||
| `kebab list docs` | 색인된 문서 목록 |
|
||||
| `kebab inspect doc <id>` / `kebab inspect chunk <id>` | raw record 보기 |
|
||||
| `kebab fetch chunk <id> [--context N]` / `kebab fetch doc <id> [--max-tokens N]` / `kebab fetch span <doc_id> <ls> <le> [--max-tokens N]` | (p9-fb-35) verbatim text fetch from indexed corpus. wire = `fetch_result.v1` (kind discriminator). chunk: target + ±N ordinal-context chunks. doc: full normalized markdown. span: 1-based line range (PDF/audio rejected as `error.v1.code = span_not_supported`). chars/4 budget on doc/span. |
|
||||
@@ -132,7 +146,7 @@ flowchart TB
|
||||
|
||||
subgraph Pipeline["도메인 + 파이프라인"]
|
||||
parse["parse-md / parse-pdf / parse-image / parse-code"]
|
||||
chunker["chunker (md-heading-v1, pdf-page-v1, code-rust-ast-v1, code-python-ast-v1, code-ts-ast-v1, code-js-ast-v1)"]
|
||||
chunker["chunker (md-heading-v1, pdf-page-v1, code-{rust,python,ts,js,go,java,kotlin,c,cpp}-ast-v1, k8s-manifest-resource-v1, dockerfile-file-v1, manifest-file-v1, code-text-paragraph-v1)"]
|
||||
embedder["embedder (fastembed multilingual-e5-large)"]
|
||||
retriever["retriever (lexical / vector / hybrid RRF)"]
|
||||
rag["RAG pipeline"]
|
||||
|
||||
@@ -73,6 +73,37 @@ pub struct SearchResponse {
|
||||
/// p9-fb-37: present when caller passed `SearchOpts.trace = true`.
|
||||
/// Consumers that ignore trace should leave this `None`.
|
||||
pub trace: Option<kebab_core::SearchTrace>,
|
||||
/// v0.17.0 A5 Step 4b: human / agent-readable advisory string set
|
||||
/// when the empty hit list is likely due to a query shorter than the
|
||||
/// FTS5 trigram tokenizer's 3-char minimum. `None` otherwise. CLI
|
||||
/// surfaces it on stderr (text mode); MCP / `--json` consumers
|
||||
/// surface it however they prefer. See
|
||||
/// `docs/superpowers/specs/2026-05-22-korean-trigram-tokenizer-design.md`
|
||||
/// §3.3.
|
||||
pub hint: Option<String>,
|
||||
}
|
||||
|
||||
/// v0.17.0 A5 Step 4b: decide whether to attach a "3자 이상 키워드 권장"
|
||||
/// hint to a `SearchResponse`. Fires only when the result set is empty
|
||||
/// *and* the trimmed query is shorter than the trigram tokenizer can
|
||||
/// resolve. Raw FTS5 mode (`'...'`) opts out — the user explicitly
|
||||
/// invoked FTS5 syntax. Identical condition powers the CLI stderr line
|
||||
/// and (separately) the TUI status bar.
|
||||
pub fn short_query_hint(query_text: &str, hits_empty: bool) -> Option<String> {
|
||||
if !hits_empty {
|
||||
return None;
|
||||
}
|
||||
let trimmed = query_text.trim();
|
||||
let bytes = trimmed.as_bytes();
|
||||
// Raw single-quote mode: user opted into FTS5 syntax, no advisory.
|
||||
if bytes.len() >= 2 && bytes[0] == b'\'' && bytes[bytes.len() - 1] == b'\'' {
|
||||
return None;
|
||||
}
|
||||
if trimmed.chars().count() < 3 {
|
||||
Some("3자 이상 키워드 권장 (trigram tokenizer 제약)".to_string())
|
||||
} else {
|
||||
None
|
||||
}
|
||||
}
|
||||
|
||||
/// Facade state — see module docs for lifetime rules.
|
||||
@@ -418,11 +449,13 @@ impl App {
|
||||
|
||||
// Trace path skips the budget loop. Caller will inspect
|
||||
// `hits.len()` and `trace.timing` rather than paginate.
|
||||
let hint = short_query_hint(&query.text, hits.is_empty());
|
||||
return Ok(SearchResponse {
|
||||
hits,
|
||||
next_cursor: None,
|
||||
truncated: false,
|
||||
trace: Some(trace),
|
||||
hint,
|
||||
});
|
||||
}
|
||||
|
||||
@@ -505,11 +538,13 @@ impl App {
|
||||
None
|
||||
};
|
||||
|
||||
let hint = short_query_hint(&query.text, hits.is_empty());
|
||||
Ok(SearchResponse {
|
||||
hits,
|
||||
next_cursor,
|
||||
truncated,
|
||||
trace: None,
|
||||
hint,
|
||||
})
|
||||
}
|
||||
|
||||
|
||||
@@ -96,6 +96,11 @@ fn serialize_search_response(r: &SearchResponse) -> Value {
|
||||
None => Value::Null,
|
||||
};
|
||||
map.insert("trace".to_string(), trace_v);
|
||||
// v0.17.0 A5 Step 4b: only emit `hint` when set — matches
|
||||
// the CLI wire wrapper's additive emit pattern.
|
||||
if let Some(hint) = &r.hint {
|
||||
map.insert("hint".to_string(), Value::String(hint.clone()));
|
||||
}
|
||||
}
|
||||
v
|
||||
}
|
||||
|
||||
@@ -189,10 +189,12 @@ fn fetch_span(
|
||||
// (markdown / note / paper / reference / inbox) is the *user-facing*
|
||||
// category, not the rendering format — the actual byte-level format
|
||||
// lives on the source `RawAsset.media_type`. Look it up via
|
||||
// workspace_path (unique key per asset).
|
||||
if let Some(asset) = <kebab_store_sqlite::SqliteStore as DocumentStore>::get_asset_by_workspace_path(
|
||||
// doc.source_asset_id (PRIMARY KEY) so twin files (identical content
|
||||
// at different paths) always read *this* document's own asset row,
|
||||
// not whichever twin last wrote `assets.workspace_path`.
|
||||
if let Some(asset) = <kebab_store_sqlite::SqliteStore as DocumentStore>::get_asset(
|
||||
&app.sqlite,
|
||||
&doc.workspace_path,
|
||||
&doc.source_asset_id,
|
||||
)? {
|
||||
if matches!(
|
||||
asset.media_type,
|
||||
|
||||
@@ -39,7 +39,7 @@ use std::sync::Arc;
|
||||
use anyhow::{Context, anyhow};
|
||||
use serde::{Deserialize, Serialize};
|
||||
|
||||
use kebab_chunk::{CodeJsAstV1Chunker, CodePythonAstV1Chunker, CodeRustAstV1Chunker, CodeTsAstV1Chunker, MdHeadingV1Chunker, PdfPageV1Chunker};
|
||||
use kebab_chunk::{CodeCAstV1Chunker, CodeCppAstV1Chunker, CodeGoAstV1Chunker, CodeJavaAstV1Chunker, CodeJsAstV1Chunker, CodeKotlinAstV1Chunker, CodePythonAstV1Chunker, CodeRustAstV1Chunker, CodeTextParagraphV1Chunker, CodeTsAstV1Chunker, DockerfileFileV1Chunker, K8sManifestResourceV1Chunker, ManifestFileV1Chunker, MdHeadingV1Chunker, PdfPageV1Chunker};
|
||||
use kebab_core::{
|
||||
Answer, Block, CanonicalDocument, Chunk, ChunkId, ChunkPolicy, ChunkerVersion, Chunker,
|
||||
DocFilter, DocSummary, DocumentId, DocumentStore, Embedder, EmbeddingInput,
|
||||
@@ -50,7 +50,7 @@ use kebab_core::{
|
||||
use kebab_llm_local::OllamaLanguageModel;
|
||||
use kebab_normalize::build_canonical_document;
|
||||
use kebab_parse_image::{ImageExtractor, OllamaVisionOcr, apply_caption, apply_ocr};
|
||||
use kebab_parse_code::{JavascriptAstExtractor, PythonAstExtractor, RustAstExtractor, TypescriptAstExtractor};
|
||||
use kebab_parse_code::{CAstExtractor, CppAstExtractor, GoAstExtractor, JavaAstExtractor, JavascriptAstExtractor, KotlinAstExtractor, PythonAstExtractor, RustAstExtractor, TypescriptAstExtractor};
|
||||
use kebab_parse_pdf::PdfTextExtractor;
|
||||
use kebab_parse_md::{BodyHints, parse_blocks, parse_frontmatter};
|
||||
use kebab_source_fs::FsSourceConnector;
|
||||
@@ -69,9 +69,9 @@ pub mod reset;
|
||||
pub mod schema;
|
||||
mod staleness;
|
||||
|
||||
pub use app::{App, SearchResponse};
|
||||
pub use app::{App, SearchResponse, short_query_hint};
|
||||
pub use ingest_progress::{AggregateCounts, IngestEvent, render_skipped_breakdown};
|
||||
pub use reset::{ResetReport, ResetScope};
|
||||
pub use reset::{ResetReport, ResetScope, enumerate_orphans};
|
||||
pub use error_wire::{ERROR_V1_ID, ErrorV1, StructuredError, classify};
|
||||
pub use fetch::fetch_with_config;
|
||||
#[doc(hidden)]
|
||||
@@ -375,6 +375,28 @@ pub fn ingest_with_config_opts(
|
||||
.map(|d| d.doc_id.0)
|
||||
.collect();
|
||||
|
||||
// Dogfood: post-walker sweep to remove stored docs whose source
|
||||
// file has been deleted from the filesystem. Must run BEFORE the
|
||||
// per-asset loop so the loop's New/Updated labelling is based on
|
||||
// the post-purge store state (the purged doc_ids won't be in
|
||||
// `existing_doc_ids` above — they were already removed, OR the
|
||||
// sweep here removes them before we start counting).
|
||||
//
|
||||
// Critical design invariant: only purge when the file is TRULY
|
||||
// absent from disk. A file that is still on disk but outside the
|
||||
// current walker scope (config narrowing / include-glob change) is
|
||||
// NOT purged — we leave it in place to protect against accidental
|
||||
// data loss via config edits.
|
||||
let scanned_paths: std::collections::HashSet<kebab_core::WorkspacePath> = assets
|
||||
.iter()
|
||||
.map(|a| a.workspace_path.clone())
|
||||
.collect();
|
||||
let purged_deleted_files = sweep_deleted_files(
|
||||
&app,
|
||||
&scanned_paths,
|
||||
vector_store.as_ref().map(|v| v.as_ref()),
|
||||
)?;
|
||||
|
||||
let started_at = time::OffsetDateTime::now_utc();
|
||||
|
||||
let mut items: Vec<kebab_core::IngestItem> = Vec::new();
|
||||
@@ -647,11 +669,11 @@ pub fn ingest_with_config_opts(
|
||||
crate::ingest_progress::emit(progress, terminal_event);
|
||||
|
||||
// p9-fb-19: bump the persistent corpus_revision counter when a
|
||||
// commit landed (any new / updated). This invalidates every
|
||||
// commit landed (any new / updated / purged). This invalidates every
|
||||
// entry in any in-process LRU search cache (in this process or
|
||||
// a sibling) on the next lookup. No-op when nothing changed
|
||||
// (skipped-only run) — the cache stays valid.
|
||||
if new_count > 0 || updated_count > 0 {
|
||||
if new_count > 0 || updated_count > 0 || purged_deleted_files > 0 {
|
||||
match app.sqlite.bump_corpus_revision() {
|
||||
Ok(rev) => tracing::debug!(
|
||||
target: "kebab-app",
|
||||
@@ -682,6 +704,7 @@ pub fn ingest_with_config_opts(
|
||||
skipped_generated: fs_skips.skipped_generated,
|
||||
skipped_size_exceeded: fs_skips.skipped_size_exceeded,
|
||||
skip_examples: fs_skips.skip_examples,
|
||||
purged_deleted_files,
|
||||
items: if summary_only { None } else { Some(items) },
|
||||
})
|
||||
}
|
||||
@@ -748,15 +771,18 @@ struct ImagePipeline<'a> {
|
||||
/// hold (per design §9 cascade rule):
|
||||
///
|
||||
/// 1. `force_reingest == false` — caller hasn't asked to bypass skip.
|
||||
/// 2. The freshly-scanned asset's blake3 checksum equals what the
|
||||
/// existing `assets` row stores at the same `workspace_path`.
|
||||
/// 3. The doc keyed on `(workspace_path, asset_id, current_parser_version)`
|
||||
/// exists. If the parser_version changed, `id_for_doc` produces a
|
||||
/// different `doc_id` so the lookup misses → no skip → re-process.
|
||||
/// 4. The existing doc's stamped `last_chunker_version` AND
|
||||
/// `last_embedding_version` match the values the caller is about
|
||||
/// to use (`Some(v) == Some(v)` and `None == None` — see design
|
||||
/// doc for the `None == None` rule when no embedder is configured).
|
||||
/// 2. A document already exists at this `workspace_path`
|
||||
/// (`get_document_by_workspace_path`). The lookup is document-side, not
|
||||
/// asset-side, so twin files (identical content at different paths) each
|
||||
/// hit their own stable doc row — `documents.workspace_path` is UNIQUE
|
||||
/// while `assets` may dedupe content into a single row with a flip-flop
|
||||
/// `workspace_path` column (dogfood bug #4, see `tasks/HOTFIXES.md`).
|
||||
/// 3. The existing doc's `source_asset_id` equals the freshly-scanned
|
||||
/// asset's blake3 checksum (content unchanged).
|
||||
/// 4. The existing doc's `parser_version` matches the current extractor's
|
||||
/// `parser_version` (extractor not upgraded). Combined with `chunker_version`
|
||||
/// and `last_embedding_version` checks immediately below — full cascade
|
||||
/// per design §9.
|
||||
///
|
||||
/// Returns `Ok(None)` (proceed with full re-process) when any check
|
||||
/// fails or any DB read errors out — the skip path is opportunistic;
|
||||
@@ -769,35 +795,24 @@ fn try_skip_unchanged(
|
||||
current_chunker_version: &ChunkerVersion,
|
||||
current_embedding_version: Option<&kebab_core::EmbeddingVersion>,
|
||||
force_reingest: bool,
|
||||
fallback_chunker_version: Option<&ChunkerVersion>, // p10-3 fix
|
||||
) -> anyhow::Result<Option<kebab_core::IngestItem>> {
|
||||
if force_reingest {
|
||||
return Ok(None);
|
||||
}
|
||||
let existing_asset = match app
|
||||
// Document-centric skip: look up the existing document row by
|
||||
// workspace_path directly. This avoids the twin-file flip-flop
|
||||
// that the old asset-side lookup suffers from — multiple files
|
||||
// with identical content share one `assets` row whose
|
||||
// `workspace_path` is overwritten on every UPSERT, so
|
||||
// `get_asset_by_workspace_path(path1)` could return the OTHER
|
||||
// twin's path (or None) after any ingest of the twin. The
|
||||
// `documents` table has a UNIQUE index on `workspace_path` (V001),
|
||||
// so each twin has its own stable row regardless of asset de-dup.
|
||||
let existing_doc = match app
|
||||
.sqlite
|
||||
.get_asset_by_workspace_path(&asset.workspace_path)
|
||||
.get_document_by_workspace_path(&asset.workspace_path)
|
||||
{
|
||||
Ok(Some(a)) => a,
|
||||
Ok(None) => return Ok(None),
|
||||
Err(e) => {
|
||||
tracing::debug!(
|
||||
target: "kebab-app",
|
||||
path = %asset.workspace_path.0,
|
||||
error = %e,
|
||||
"skip-check: get_asset_by_workspace_path failed; falling through to re-process"
|
||||
);
|
||||
return Ok(None);
|
||||
}
|
||||
};
|
||||
if existing_asset.checksum != asset.checksum {
|
||||
return Ok(None);
|
||||
}
|
||||
let candidate_doc_id = kebab_core::id_for_doc(
|
||||
&asset.workspace_path,
|
||||
&asset.asset_id,
|
||||
current_parser_version,
|
||||
);
|
||||
let existing_doc = match app.sqlite.get_document(&candidate_doc_id) {
|
||||
Ok(Some(d)) => d,
|
||||
Ok(None) => return Ok(None),
|
||||
Err(e) => {
|
||||
@@ -805,21 +820,97 @@ fn try_skip_unchanged(
|
||||
target: "kebab-app",
|
||||
path = %asset.workspace_path.0,
|
||||
error = %e,
|
||||
"skip-check: get_document failed; falling through to re-process"
|
||||
"skip-check: get_document_by_workspace_path failed; falling through to re-process"
|
||||
);
|
||||
return Ok(None);
|
||||
}
|
||||
};
|
||||
// 1. Content unchanged: the freshly-computed asset_id (blake3
|
||||
// content hash) must match what this document was ingested from.
|
||||
if existing_doc.source_asset_id != asset.asset_id {
|
||||
return Ok(None);
|
||||
}
|
||||
// p10-3 fix: detect "stored doc was previously Tier 3 fallback".
|
||||
// When a Tier 1/2 extractor emits empty chunks, the fallback wrapper
|
||||
// retries with CodeTextParagraphV1Chunker and stores
|
||||
// last_chunker_version = "code-text-paragraph-v1" + parser_version = "none-v1".
|
||||
// On the next ingest the caller computes current_parser_version /
|
||||
// current_chunker_version from the Tier 1/2 dispatch (e.g.
|
||||
// "k8s-manifest-resource-v1"), which can never match the stored
|
||||
// fallback values, causing spurious re-ingests. Detect this case
|
||||
// and bypass the parser/chunker equality checks — only the embedder
|
||||
// version still must match.
|
||||
let stored_is_tier3_fallback = fallback_chunker_version.is_some_and(|fbv| {
|
||||
existing_doc.last_chunker_version.as_ref() == Some(fbv)
|
||||
&& existing_doc.parser_version.0 == "none-v1"
|
||||
});
|
||||
|
||||
if stored_is_tier3_fallback {
|
||||
// Embedder version still must match.
|
||||
let embedder_match = existing_doc.last_embedding_version.as_ref()
|
||||
== current_embedding_version;
|
||||
if !embedder_match {
|
||||
return Ok(None);
|
||||
}
|
||||
let candidate_doc_id = existing_doc.doc_id.clone();
|
||||
tracing::debug!(
|
||||
target: "kebab-app::ingest",
|
||||
path = %asset.workspace_path.0,
|
||||
doc_id = %candidate_doc_id.0,
|
||||
"skip-unchanged: tier 3 fallback state detected; bypassing parser/chunker equality"
|
||||
);
|
||||
return Ok(Some(kebab_core::IngestItem {
|
||||
kind: kebab_core::IngestItemKind::Unchanged,
|
||||
doc_id: Some(candidate_doc_id),
|
||||
doc_path: asset.workspace_path.clone(),
|
||||
asset_id: Some(asset.asset_id.clone()),
|
||||
byte_len: Some(asset.byte_len),
|
||||
block_count: u32::try_from(existing_doc.blocks.len()).ok(),
|
||||
chunk_count: None,
|
||||
parser_version: Some(existing_doc.parser_version.clone()),
|
||||
chunker_version: existing_doc.last_chunker_version.clone(),
|
||||
warnings: Vec::new(),
|
||||
error: None,
|
||||
}));
|
||||
}
|
||||
|
||||
// 2. Parser unchanged: parser_version is baked into id_for_doc so
|
||||
// a version bump yields a different doc_id and the row above
|
||||
// would have been missing. Checking here explicitly keeps the
|
||||
// logic self-documenting and guards against future id_for_doc
|
||||
// changes.
|
||||
if existing_doc.parser_version != *current_parser_version {
|
||||
// v0.17.0 PR-B: parser_version bump cascade. Same bytes (same
|
||||
// asset_id) → asset-keyed `stale_chunk_ids_at` is a no-op, but
|
||||
// the stale `documents` row at this workspace_path still
|
||||
// collides with `idx_docs_workspace_path` on the next INSERT
|
||||
// and the LanceDB rows under the old chunk_ids orphan. Sweep
|
||||
// both stores here, before returning Ok(None), so the caller's
|
||||
// full-ingest path lands a clean slate. The `keep_doc_id = ""`
|
||||
// sentinel removes every doc at this path (the new doc_id is
|
||||
// not yet known here — it's computed downstream from the new
|
||||
// PARSER_VERSION).
|
||||
purge_workspace_path_for_parser_bump(app, asset).with_context(|| {
|
||||
format!(
|
||||
"parser-bump orphan purge at {}",
|
||||
asset.workspace_path.0
|
||||
)
|
||||
})?;
|
||||
return Ok(None);
|
||||
}
|
||||
// 3. Chunker unchanged.
|
||||
let chunker_match = existing_doc.last_chunker_version.as_ref()
|
||||
== Some(current_chunker_version);
|
||||
if !chunker_match {
|
||||
return Ok(None);
|
||||
}
|
||||
// 4. Embedder unchanged.
|
||||
let embedder_match = existing_doc.last_embedding_version.as_ref()
|
||||
== current_embedding_version;
|
||||
if !embedder_match {
|
||||
return Ok(None);
|
||||
}
|
||||
let candidate_doc_id = existing_doc.doc_id.clone();
|
||||
tracing::debug!(
|
||||
target: "kebab-app::ingest",
|
||||
path = %asset.workspace_path.0,
|
||||
@@ -918,9 +1009,12 @@ fn ingest_one_asset(
|
||||
force_reingest,
|
||||
);
|
||||
}
|
||||
// p10-1A-2 / 1B: code ingest dispatch.
|
||||
// p10-1A-2 / 1B: code ingest dispatch. p10-2: Tier 2 langs added. p10-3: shell added. p10-1D: c/cpp added.
|
||||
MediaType::Code(lang)
|
||||
if matches!(lang.as_str(), "rust" | "python" | "typescript" | "javascript") =>
|
||||
if matches!(lang.as_str(),
|
||||
"rust" | "python" | "typescript" | "javascript" | "go" | "java" | "kotlin"
|
||||
| "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod"
|
||||
| "shell" | "c" | "cpp") =>
|
||||
{
|
||||
return ingest_one_code_asset(
|
||||
app,
|
||||
@@ -984,6 +1078,7 @@ fn ingest_one_asset(
|
||||
&MdHeadingV1Chunker.chunker_version(),
|
||||
embedder.map(|e| e.model_version()).as_ref(),
|
||||
force_reingest,
|
||||
None,
|
||||
)? {
|
||||
return Ok(item);
|
||||
}
|
||||
@@ -1178,6 +1273,7 @@ fn ingest_one_image_asset(
|
||||
&MdHeadingV1Chunker.chunker_version(),
|
||||
embedder.map(|e| e.model_version()).as_ref(),
|
||||
force_reingest,
|
||||
None,
|
||||
)? {
|
||||
return Ok(item);
|
||||
}
|
||||
@@ -1406,6 +1502,53 @@ fn record_image_analysis_failure(
|
||||
warning_notes.push(note);
|
||||
}
|
||||
|
||||
/// v0.17.0 PR-B: parser-bump cascade. When a code extractor ships a
|
||||
/// new `PARSER_VERSION` (e.g. `code-c-v1` → `code-c-v2`), the same
|
||||
/// (workspace_path, asset_id) pair re-emerges with a fresh `doc_id`.
|
||||
/// The existing asset-keyed [`purge_vector_orphans_for_workspace_path`]
|
||||
/// only fires on asset_id changes (file bytes edited) and is a no-op
|
||||
/// here. Without an explicit doc-keyed sweep the next INSERT raises
|
||||
/// `idx_docs_workspace_path` UNIQUE and the LanceDB rows under the
|
||||
/// stale chunk_ids orphan. This helper:
|
||||
///
|
||||
/// 1. Fetches every stale chunk_id at `workspace_path` from SQLite
|
||||
/// (`keep_doc_id = ""` means "all existing docs are stale" —
|
||||
/// `try_skip_unchanged` calls this before the new doc_id is
|
||||
/// computed).
|
||||
/// 2. Deletes the matching vectors from every Lance table (no-op if
|
||||
/// embeddings are disabled).
|
||||
/// 3. Sweeps the SQLite `documents` row (CASCADE drops `blocks` /
|
||||
/// `chunks` / `embedding_records`). The `assets` row stays — same
|
||||
/// bytes, same asset_id, only the derived `doc_id` changed.
|
||||
fn purge_workspace_path_for_parser_bump(
|
||||
app: &App,
|
||||
asset: &RawAsset,
|
||||
) -> anyhow::Result<()> {
|
||||
let path = &asset.workspace_path.0;
|
||||
let stale = app
|
||||
.sqlite
|
||||
.stale_chunk_ids_for_workspace_path_except_doc_id(path, "")
|
||||
.context("SqliteStore::stale_chunk_ids_for_workspace_path_except_doc_id")?;
|
||||
if !stale.is_empty() {
|
||||
if let Some(vec_store) = app.vector().context("App::vector")? {
|
||||
use kebab_core::VectorStore as _;
|
||||
vec_store
|
||||
.delete_by_chunk_ids(&stale)
|
||||
.context("VectorStore::delete_by_chunk_ids (parser-bump orphans)")?;
|
||||
}
|
||||
}
|
||||
app.sqlite
|
||||
.purge_document_at_workspace_path_except_doc_id(path, "")
|
||||
.context("SqliteStore::purge_document_at_workspace_path_except_doc_id")?;
|
||||
tracing::debug!(
|
||||
target: "kebab-app",
|
||||
path = %path,
|
||||
count = stale.len(),
|
||||
"purged orphan vectors + document for parser_version bump"
|
||||
);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// HOTFIXES 2026-05-02 P7-3 follow-up: when a tracked file's bytes
|
||||
/// change, `purge_orphan_at_workspace_path` (in `kebab-store-sqlite`)
|
||||
/// sweeps the SQLite chain (documents → blocks / chunks / embedding_records)
|
||||
@@ -1446,6 +1589,120 @@ fn purge_vector_orphans_for_workspace_path(
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Dogfood: post-walker sweep that purges stored documents whose source
|
||||
/// file has been physically deleted from the filesystem.
|
||||
///
|
||||
/// Algorithm:
|
||||
/// 1. Query `documents` for every `workspace_path` currently stored.
|
||||
/// 2. Compute `orphan_candidates = stored_paths - scanned_paths`.
|
||||
/// 3. For each candidate: resolve to an absolute path and call
|
||||
/// `Path::try_exists().unwrap_or(true)` — transient FS errors
|
||||
/// (EACCES, NFS hiccup, ownership change) conservatively count as
|
||||
/// "still present" so we never purge on uncertain signal. If the
|
||||
/// file still exists on disk it was merely out-of-scope this run
|
||||
/// (config narrowing / include-glob change) — leave it alone. Only
|
||||
/// files that are truly absent trigger a purge.
|
||||
/// 4. For absent files: call `purge_deleted_workspace_path` (SQLite
|
||||
/// cascade delete + optional copied-asset file removal) and, if a
|
||||
/// vector store is present, delete the associated vectors.
|
||||
///
|
||||
/// Returns the number of documents purged.
|
||||
///
|
||||
/// Non-fatal design: individual purge failures are logged and counted
|
||||
/// as errors on the per-file level but do NOT abort the sweep — a
|
||||
/// partial failure is preferable to blocking the rest of ingest. The
|
||||
/// return value only counts successful purges.
|
||||
fn sweep_deleted_files(
|
||||
app: &App,
|
||||
scanned_paths: &std::collections::HashSet<kebab_core::WorkspacePath>,
|
||||
vector_store: Option<&kebab_store_vector::LanceVectorStore>,
|
||||
) -> anyhow::Result<u32> {
|
||||
use kebab_core::DocumentStore as _;
|
||||
|
||||
let stored_paths = app
|
||||
.sqlite
|
||||
.all_workspace_paths()
|
||||
.context("sweep_deleted_files: all_workspace_paths")?;
|
||||
|
||||
if stored_paths.is_empty() {
|
||||
return Ok(0);
|
||||
}
|
||||
|
||||
let workspace_root = app.config.resolve_workspace_root();
|
||||
let mut purged: u32 = 0;
|
||||
|
||||
for stored_path in stored_paths {
|
||||
if scanned_paths.contains(&stored_path) {
|
||||
continue; // still in scope — skip
|
||||
}
|
||||
|
||||
// Resolve to an absolute path and check existence on disk.
|
||||
// Use `try_exists` + `unwrap_or(true)` so transient FS errors
|
||||
// (EACCES on a path we lack read on, NFS hiccups, ownership
|
||||
// change) are CONSERVATIVELY treated as "file still present" —
|
||||
// never purge on uncertain signal (data-safety: PR #148 review).
|
||||
// `exists()` would return false on Err and trigger a wrongful
|
||||
// purge. Files whose path cannot be joined (theoretically
|
||||
// impossible for non-empty workspace_path strings, but
|
||||
// defense-in-depth) are likewise treated as still present.
|
||||
let abs = workspace_root.join(&stored_path.0);
|
||||
if abs.try_exists().unwrap_or(true) {
|
||||
// File is on disk but not in this scan's scope (config
|
||||
// narrowing). DO NOT purge — critical design constraint.
|
||||
tracing::debug!(
|
||||
target: "kebab-app",
|
||||
path = %stored_path.0,
|
||||
"sweep_deleted_files: file on disk but out of scope — leaving in store"
|
||||
);
|
||||
continue;
|
||||
}
|
||||
|
||||
// File is truly absent → purge.
|
||||
let chunk_ids = match kebab_store_sqlite::purge_deleted_workspace_path(
|
||||
&app.sqlite,
|
||||
&stored_path,
|
||||
) {
|
||||
Ok(ids) => ids,
|
||||
Err(e) => {
|
||||
tracing::warn!(
|
||||
target: "kebab-app",
|
||||
path = %stored_path.0,
|
||||
error = %e,
|
||||
"sweep_deleted_files: purge failed; skipping this path"
|
||||
);
|
||||
continue;
|
||||
}
|
||||
};
|
||||
|
||||
// Purge associated vectors (best-effort; partial failure
|
||||
// acceptable — orphan vectors get cleaned by `kebab reset
|
||||
// --vector-only` if they accumulate).
|
||||
if let Some(vec) = vector_store {
|
||||
if !chunk_ids.is_empty() {
|
||||
use kebab_core::VectorStore as _;
|
||||
if let Err(e) = vec.delete_by_chunk_ids(&chunk_ids) {
|
||||
tracing::warn!(
|
||||
target: "kebab-app",
|
||||
path = %stored_path.0,
|
||||
count = chunk_ids.len(),
|
||||
error = %e,
|
||||
"sweep_deleted_files: vector delete failed; SQLite side already cleaned"
|
||||
);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
tracing::info!(
|
||||
target: "kebab-app",
|
||||
path = %stored_path.0,
|
||||
"sweep_deleted_files: purged document for deleted file"
|
||||
);
|
||||
purged = purged.saturating_add(1);
|
||||
}
|
||||
|
||||
Ok(purged)
|
||||
}
|
||||
|
||||
/// P7-3: process one `MediaType::Pdf` asset end-to-end.
|
||||
///
|
||||
/// - Reads bytes from disk.
|
||||
@@ -1510,6 +1767,7 @@ fn ingest_one_pdf_asset(
|
||||
&PdfPageV1Chunker.chunker_version(),
|
||||
embedder.map(|e| e.model_version()).as_ref(),
|
||||
force_reingest,
|
||||
None,
|
||||
)? {
|
||||
return Ok(item);
|
||||
}
|
||||
@@ -1683,18 +1941,54 @@ fn ingest_one_code_asset(
|
||||
"python" => ParserVersion(kebab_parse_code::PYTHON_PARSER_VERSION.to_string()),
|
||||
"typescript" => ParserVersion(kebab_parse_code::TS_PARSER_VERSION.to_string()),
|
||||
"javascript" => ParserVersion(kebab_parse_code::JS_PARSER_VERSION.to_string()),
|
||||
"go" => ParserVersion(kebab_parse_code::GO_PARSER_VERSION.to_string()),
|
||||
"java" => ParserVersion(kebab_parse_code::JAVA_PARSER_VERSION.to_string()),
|
||||
"kotlin" => ParserVersion(kebab_parse_code::KOTLIN_PARSER_VERSION.to_string()),
|
||||
// p10-2: Tier 2 has no parse step — sentinel "none-v1".
|
||||
"yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod"
|
||||
=> ParserVersion("none-v1".to_string()),
|
||||
// p10-3: shell direct routes to Tier 3 (no parse step).
|
||||
"shell" => ParserVersion("none-v1".to_string()),
|
||||
// p10-1D: C + C++ AST extractors.
|
||||
"c" => ParserVersion(kebab_parse_code::C_PARSER_VERSION.to_string()),
|
||||
"cpp" => ParserVersion(kebab_parse_code::CPP_PARSER_VERSION.to_string()),
|
||||
other => anyhow::bail!("unsupported code_lang: {other}"),
|
||||
};
|
||||
|
||||
// p10-1b Task D/G/J/L: chunker_version per-lang.
|
||||
let chunker_version = match code_lang {
|
||||
let mut chunker_version = match code_lang {
|
||||
"rust" => CodeRustAstV1Chunker.chunker_version(),
|
||||
"python" => CodePythonAstV1Chunker.chunker_version(),
|
||||
"typescript" => CodeTsAstV1Chunker.chunker_version(),
|
||||
"javascript" => CodeJsAstV1Chunker.chunker_version(),
|
||||
"go" => CodeGoAstV1Chunker.chunker_version(),
|
||||
"java" => CodeJavaAstV1Chunker.chunker_version(),
|
||||
"kotlin" => CodeKotlinAstV1Chunker.chunker_version(),
|
||||
// p10-2 Tier 2:
|
||||
"yaml" => K8sManifestResourceV1Chunker.chunker_version(),
|
||||
"dockerfile" => DockerfileFileV1Chunker.chunker_version(),
|
||||
"toml" | "json" | "xml" | "groovy" | "go-mod"
|
||||
=> ManifestFileV1Chunker.chunker_version(),
|
||||
// p10-3:
|
||||
"shell" => CodeTextParagraphV1Chunker.chunker_version(),
|
||||
// p10-1D: C + C++ AST chunkers.
|
||||
"c" => CodeCAstV1Chunker.chunker_version(),
|
||||
"cpp" => CodeCppAstV1Chunker.chunker_version(),
|
||||
other => anyhow::bail!("unreachable chunker_version: {other}"),
|
||||
};
|
||||
|
||||
// p10-3 fix: if this lang can fall back to Tier 3, compute the fallback
|
||||
// chunker_version so try_skip_unchanged can detect the stored-as-Tier-3
|
||||
// state and skip parser/chunker equality checks.
|
||||
let tier3_fallback_cv: Option<ChunkerVersion> = match code_lang {
|
||||
"rust" | "python" | "typescript" | "javascript"
|
||||
| "go" | "java" | "kotlin"
|
||||
| "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod"
|
||||
| "c" | "cpp" // p10-1D
|
||||
=> Some(CodeTextParagraphV1Chunker.chunker_version()),
|
||||
_ => None,
|
||||
};
|
||||
|
||||
if let Some(item) = try_skip_unchanged(
|
||||
app,
|
||||
asset,
|
||||
@@ -1702,6 +1996,7 @@ fn ingest_one_code_asset(
|
||||
&chunker_version,
|
||||
embedder.map(|e| e.model_version()).as_ref(),
|
||||
force_reingest,
|
||||
tier3_fallback_cv.as_ref(),
|
||||
)? {
|
||||
return Ok(item);
|
||||
}
|
||||
@@ -1717,37 +2012,159 @@ fn ingest_one_code_asset(
|
||||
};
|
||||
|
||||
// p10-1b Task D/G/J/L: extractor per-lang.
|
||||
let mut canonical = match code_lang {
|
||||
// p10-3: capture Result so Tier 1 extractor errors can fall back to Tier 3.
|
||||
let canonical_result: anyhow::Result<kebab_core::CanonicalDocument> = match code_lang {
|
||||
"rust" => RustAstExtractor::new()
|
||||
.extract(&ctx, &bytes)
|
||||
.context("kb-parse-code::RustAstExtractor::extract (code:rust)")?,
|
||||
.context("kb-parse-code::RustAstExtractor::extract (code:rust)"),
|
||||
"python" => PythonAstExtractor::new()
|
||||
.extract(&ctx, &bytes)
|
||||
.context("kb-parse-code::PythonAstExtractor::extract (code:python)")?,
|
||||
.context("kb-parse-code::PythonAstExtractor::extract (code:python)"),
|
||||
"typescript" => TypescriptAstExtractor::new()
|
||||
.extract(&ctx, &bytes)
|
||||
.context("kb-parse-code::TypescriptAstExtractor::extract (code:typescript)")?,
|
||||
.context("kb-parse-code::TypescriptAstExtractor::extract (code:typescript)"),
|
||||
"javascript" => JavascriptAstExtractor::new()
|
||||
.extract(&ctx, &bytes)
|
||||
.context("kb-parse-code::JavascriptAstExtractor::extract (code:javascript)")?,
|
||||
.context("kb-parse-code::JavascriptAstExtractor::extract (code:javascript)"),
|
||||
"go" => GoAstExtractor::new()
|
||||
.extract(&ctx, &bytes)
|
||||
.context("kb-parse-code::GoAstExtractor::extract (code:go)"),
|
||||
"java" => JavaAstExtractor::new()
|
||||
.extract(&ctx, &bytes)
|
||||
.context("kb-parse-code::JavaAstExtractor::extract (code:java)"),
|
||||
"kotlin" => KotlinAstExtractor::new()
|
||||
.extract(&ctx, &bytes)
|
||||
.context("kb-parse-code::KotlinAstExtractor::extract (code:kotlin)"),
|
||||
// p10-2 Tier 2: no extractor — synthesize Document directly from raw bytes.
|
||||
"yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod" => {
|
||||
synthesize_tier2_document(asset, &bytes, code_lang, &parser_version)
|
||||
}
|
||||
// p10-3: shell reuses the same synthesizer.
|
||||
"shell" => synthesize_tier2_document(asset, &bytes, "shell", &parser_version),
|
||||
// p10-1D: C + C++ AST extractors.
|
||||
"c" => CAstExtractor::new()
|
||||
.extract(&ctx, &bytes)
|
||||
.context("kebab-parse-code::CAstExtractor::extract (code:c)"),
|
||||
"cpp" => CppAstExtractor::new()
|
||||
.extract(&ctx, &bytes)
|
||||
.context("kebab-parse-code::CppAstExtractor::extract (code:cpp)"),
|
||||
other => anyhow::bail!("unreachable (extract): {other}"),
|
||||
};
|
||||
|
||||
// p10-3: Tier 1 extractor failure → fall back to Tier 3 synthesized doc.
|
||||
// Tier 2 (yaml/dockerfile/…) and shell errors are real (e.g. non-UTF-8) — propagate.
|
||||
let mut canonical = match canonical_result {
|
||||
Ok(d) => d,
|
||||
Err(e) if code_lang == "shell"
|
||||
|| matches!(code_lang, "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod") =>
|
||||
{
|
||||
return Err(e).context("synthesize_tier2_document failed for tier 2/3 lang");
|
||||
}
|
||||
Err(e) => {
|
||||
// Tier 1 extractor errored — fall back to Tier 3 synthesized doc.
|
||||
tracing::warn!(
|
||||
workspace_path = %asset.workspace_path.0,
|
||||
code_lang = code_lang,
|
||||
error = %e,
|
||||
"tier1 extract errored; falling back to tier 3 synthesized doc"
|
||||
);
|
||||
chunker_version = CodeTextParagraphV1Chunker.chunker_version();
|
||||
let tier3_parser_version = ParserVersion("none-v1".to_string());
|
||||
synthesize_tier2_document(asset, &bytes, code_lang, &tier3_parser_version)
|
||||
.context("synthesize_tier2_document for tier 3 fallback after extract error")?
|
||||
}
|
||||
};
|
||||
|
||||
// p10-1b Task D/G/J/L: chunker per-lang.
|
||||
let chunks = match code_lang {
|
||||
"rust" => CodeRustAstV1Chunker
|
||||
// p10-3: track whether the extract stage already fell back to Tier 3.
|
||||
// Tier 2 langs already have "none-v1" parser_version normally, so exclude them
|
||||
// from the extract_fell_back guard with the !matches! exclusion.
|
||||
let extract_fell_back = canonical.parser_version.0 == "none-v1"
|
||||
&& !matches!(code_lang, "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod" | "shell");
|
||||
|
||||
let chunks_result: anyhow::Result<Vec<Chunk>> = if extract_fell_back {
|
||||
// Tier 1 lang whose extractor errored — go straight to Tier 3 chunker.
|
||||
CodeTextParagraphV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::CodeRustAstV1Chunker::chunk (code:rust)")?,
|
||||
"python" => CodePythonAstV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::CodePythonAstV1Chunker::chunk (code:python)")?,
|
||||
"typescript" => CodeTsAstV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::CodeTsAstV1Chunker::chunk (code:typescript)")?,
|
||||
"javascript" => CodeJsAstV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::CodeJsAstV1Chunker::chunk (code:javascript)")?,
|
||||
other => anyhow::bail!("unreachable (chunk): {other}"),
|
||||
.context("kb-chunk::CodeTextParagraphV1Chunker::chunk (tier 3 after extract fallback)")
|
||||
} else {
|
||||
match code_lang {
|
||||
"rust" => CodeRustAstV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::CodeRustAstV1Chunker::chunk (code:rust)"),
|
||||
"python" => CodePythonAstV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::CodePythonAstV1Chunker::chunk (code:python)"),
|
||||
"typescript" => CodeTsAstV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::CodeTsAstV1Chunker::chunk (code:typescript)"),
|
||||
"javascript" => CodeJsAstV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::CodeJsAstV1Chunker::chunk (code:javascript)"),
|
||||
"go" => CodeGoAstV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::CodeGoAstV1Chunker::chunk (code:go)"),
|
||||
"java" => CodeJavaAstV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::CodeJavaAstV1Chunker::chunk (code:java)"),
|
||||
"kotlin" => CodeKotlinAstV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::CodeKotlinAstV1Chunker::chunk (code:kotlin)"),
|
||||
// p10-2 Tier 2:
|
||||
"yaml" => K8sManifestResourceV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::K8sManifestResourceV1Chunker::chunk"),
|
||||
"dockerfile" => DockerfileFileV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::DockerfileFileV1Chunker::chunk"),
|
||||
"toml" | "json" | "xml" | "groovy" | "go-mod" => ManifestFileV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::ManifestFileV1Chunker::chunk"),
|
||||
// p10-3:
|
||||
"shell" => CodeTextParagraphV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::CodeTextParagraphV1Chunker::chunk (code:shell)"),
|
||||
// p10-1D: C + C++ AST chunkers.
|
||||
"c" => CodeCAstV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kebab-chunk::CodeCAstV1Chunker::chunk (code:c)"),
|
||||
"cpp" => CodeCppAstV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kebab-chunk::CodeCppAstV1Chunker::chunk (code:cpp)"),
|
||||
other => anyhow::bail!("unreachable (chunk): {other}"),
|
||||
}
|
||||
};
|
||||
|
||||
// p10-3: Tier 1/2 0-chunk OR error → Tier 3 fallback retry.
|
||||
// "shell" direct path is already Tier 3 — don't retry-double-up.
|
||||
let chunks: Vec<Chunk> = match chunks_result {
|
||||
Ok(v) if !v.is_empty() => v,
|
||||
other if code_lang == "shell" => other?, // shell propagates directly
|
||||
Ok(_empty) => {
|
||||
tracing::warn!(
|
||||
workspace_path = %asset.workspace_path.0,
|
||||
code_lang = code_lang,
|
||||
"tier1/2 emitted 0 chunks; falling back to tier 3 (code-text-paragraph-v1)"
|
||||
);
|
||||
chunker_version = CodeTextParagraphV1Chunker.chunker_version();
|
||||
canonical.parser_version = ParserVersion("none-v1".to_string());
|
||||
CodeTextParagraphV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::CodeTextParagraphV1Chunker::chunk (tier 3 fallback)")?
|
||||
}
|
||||
Err(e) => {
|
||||
tracing::warn!(
|
||||
workspace_path = %asset.workspace_path.0,
|
||||
code_lang = code_lang,
|
||||
error = %e,
|
||||
"tier1/2 chunker errored; falling back to tier 3 (code-text-paragraph-v1)"
|
||||
);
|
||||
chunker_version = CodeTextParagraphV1Chunker.chunker_version();
|
||||
canonical.parser_version = ParserVersion("none-v1".to_string());
|
||||
CodeTextParagraphV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::CodeTextParagraphV1Chunker::chunk (tier 3 fallback after error)")?
|
||||
}
|
||||
};
|
||||
|
||||
// Stamp chunker + embedding versions so incremental skip detection has
|
||||
@@ -1842,6 +2259,139 @@ fn ingest_one_code_asset(
|
||||
})
|
||||
}
|
||||
|
||||
/// p10-2: Build a minimal [`CanonicalDocument`] for Tier 2 code assets
|
||||
/// (yaml / dockerfile / toml / json / xml / groovy / go-mod) that have
|
||||
/// no AST extractor. Produces a single `Block::Code` whose source span
|
||||
/// covers the entire file, mirroring the shape the Tier 1 extractors
|
||||
/// produce for glue / top-level regions.
|
||||
fn synthesize_tier2_document(
|
||||
asset: &RawAsset,
|
||||
bytes: &[u8],
|
||||
code_lang: &str,
|
||||
parser_version: &ParserVersion,
|
||||
) -> anyhow::Result<kebab_core::CanonicalDocument> {
|
||||
use anyhow::Context as _;
|
||||
use kebab_core::{
|
||||
BlockId, CodeBlock, CommonBlock, Lang, Metadata, Provenance, ProvenanceEvent,
|
||||
ProvenanceKind, SourceSpan, SourceType, TrustLevel, id_for_block, id_for_doc,
|
||||
};
|
||||
|
||||
let text = std::str::from_utf8(bytes)
|
||||
.with_context(|| format!("tier2 doc not utf-8: {}", asset.workspace_path.0))?
|
||||
.to_string();
|
||||
|
||||
let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, parser_version);
|
||||
|
||||
let n_lines = text.lines().count().max(1) as u32;
|
||||
let span = SourceSpan::Code {
|
||||
line_start: 1,
|
||||
line_end: n_lines,
|
||||
symbol: Some("<file>".to_string()),
|
||||
lang: Some(code_lang.to_string()),
|
||||
};
|
||||
let block_id: BlockId = id_for_block(
|
||||
&doc_id,
|
||||
"code",
|
||||
&[],
|
||||
0,
|
||||
&span,
|
||||
);
|
||||
let block = kebab_core::Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id,
|
||||
heading_path: vec![],
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some(code_lang.to_string()),
|
||||
code: text,
|
||||
});
|
||||
|
||||
let now = time::OffsetDateTime::now_utc();
|
||||
let events = vec![
|
||||
ProvenanceEvent {
|
||||
at: asset.discovered_at,
|
||||
agent: "kb-source-fs".to_string(),
|
||||
kind: ProvenanceKind::Discovered,
|
||||
note: None,
|
||||
},
|
||||
ProvenanceEvent {
|
||||
at: now,
|
||||
agent: "kb-app".to_string(),
|
||||
kind: ProvenanceKind::Parsed,
|
||||
note: Some(format!(
|
||||
"parser_version={}; tier2_synthesized; lang={}",
|
||||
parser_version.0, code_lang
|
||||
)),
|
||||
},
|
||||
];
|
||||
|
||||
// Resolve absolute path for repo detection. FsSourceConnector always
|
||||
// emits absolute paths in SourceUri::File (verified in connector.rs); Kb
|
||||
// URIs were rejected earlier in ingest_one_code_asset (returns Skipped),
|
||||
// so the fallback below is purely defensive. This does NOT mirror
|
||||
// RustAstExtractor — that extractor joins ctx.workspace_root for relative
|
||||
// paths, but Tier 2 trusts the connector invariant.
|
||||
let abs_path = match &asset.source_uri {
|
||||
kebab_core::SourceUri::File(p) => p.clone(),
|
||||
kebab_core::SourceUri::Kb(_) => std::path::PathBuf::new(),
|
||||
};
|
||||
let (repo, git_branch, git_commit) = match kebab_parse_code::detect_repo(&abs_path) {
|
||||
Some(r) => (Some(r.name), r.branch, r.commit),
|
||||
None => (None, None, None),
|
||||
};
|
||||
|
||||
let title = {
|
||||
let fname = asset.workspace_path.0
|
||||
.rsplit('/')
|
||||
.next()
|
||||
.unwrap_or(&asset.workspace_path.0);
|
||||
// strip extension
|
||||
match fname.rfind('.') {
|
||||
Some(i) => fname[..i].to_string(),
|
||||
None => fname.to_string(),
|
||||
}
|
||||
};
|
||||
|
||||
let metadata = Metadata {
|
||||
aliases: vec![],
|
||||
tags: vec![],
|
||||
created_at: asset.discovered_at,
|
||||
updated_at: asset.discovered_at,
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: serde_json::Map::new(),
|
||||
repo,
|
||||
git_branch,
|
||||
git_commit,
|
||||
code_lang: Some(code_lang.to_string()),
|
||||
};
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-app",
|
||||
"synthesized tier2 doc_id={} workspace_path={} lang={}",
|
||||
doc_id.0,
|
||||
asset.workspace_path.0,
|
||||
code_lang,
|
||||
);
|
||||
|
||||
Ok(kebab_core::CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: asset.asset_id.clone(),
|
||||
workspace_path: asset.workspace_path.clone(),
|
||||
title,
|
||||
lang: Lang("und".to_string()),
|
||||
blocks: vec![block],
|
||||
metadata,
|
||||
provenance: Provenance { events },
|
||||
parser_version: parser_version.clone(),
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
})
|
||||
}
|
||||
|
||||
/// Pull the BCP-47 language hint from the canonical document. P6-1
|
||||
/// stamps `Lang("und")` by default; image-pipeline OCR / caption
|
||||
/// adapters special-case "und" so the hint is intentionally dropped
|
||||
|
||||
@@ -9,13 +9,19 @@
|
||||
//!
|
||||
//! `--vector-only` additionally truncates `embedding_records` in SQLite
|
||||
//! so the next `kebab ingest` re-embeds cleanly without orphan rows.
|
||||
//!
|
||||
//! `--orphans-only` purges stored docs that are outside the current walker
|
||||
//! scope (config narrowing / removed sub-directory). No filesystem paths are
|
||||
//! removed — this is purely a store-level reconciliation.
|
||||
|
||||
use std::collections::HashSet;
|
||||
use std::path::PathBuf;
|
||||
|
||||
use anyhow::{Context, Result};
|
||||
use serde::{Deserialize, Serialize};
|
||||
|
||||
use kebab_config::{Config, expand_path};
|
||||
use kebab_core::WorkspacePath;
|
||||
|
||||
/// What the user asked to remove. Mutually exclusive — picked by the CLI
|
||||
/// from a clap `ArgGroup`.
|
||||
@@ -32,6 +38,13 @@ pub enum ResetScope {
|
||||
VectorOnly,
|
||||
/// Wipe only the config dir.
|
||||
ConfigOnly,
|
||||
/// Purge stored docs that are outside the current walker scope (no
|
||||
/// filesystem paths are removed). Filesystem existence is NOT checked —
|
||||
/// anything the current walker would not visit is considered an orphan.
|
||||
/// The explicit complement to the conservative `sweep_deleted_files`
|
||||
/// that runs during ingest (which leaves on-disk-but-out-of-scope docs
|
||||
/// alone for data safety).
|
||||
OrphansOnly,
|
||||
}
|
||||
|
||||
/// Result of a successful wipe — emitted as `reset_report.v1` by the
|
||||
@@ -41,6 +54,16 @@ pub struct ResetReport {
|
||||
pub scope: ResetScope,
|
||||
pub removed_paths: Vec<PathBuf>,
|
||||
pub embedding_rows_truncated: u64,
|
||||
/// Number of stored docs purged because they are outside the current
|
||||
/// walker scope. Non-zero only when `scope == OrphansOnly`.
|
||||
/// `#[serde(default)]` preserves back-compat with older callers that
|
||||
/// do not include this field.
|
||||
#[serde(default)]
|
||||
pub orphans_purged: u32,
|
||||
/// Paths of the orphaned docs that were purged. Sorted for deterministic
|
||||
/// output. Non-empty only when `scope == OrphansOnly`.
|
||||
#[serde(default)]
|
||||
pub purged_paths: Vec<WorkspacePath>,
|
||||
}
|
||||
|
||||
/// Compute the absolute on-disk paths a given scope will wipe, given a
|
||||
@@ -67,6 +90,10 @@ pub fn enumerate_paths(scope: ResetScope, cfg: &Config) -> Vec<PathBuf> {
|
||||
vec![vector_dir]
|
||||
}
|
||||
ResetScope::ConfigOnly => vec![cfg_dir],
|
||||
// OrphansOnly operates purely at the store level — no filesystem paths
|
||||
// are removed. Return empty so `estimate_size_bytes` stays zero and
|
||||
// the existing confirm UI path for directory wipes is skipped.
|
||||
ResetScope::OrphansOnly => vec![],
|
||||
}
|
||||
}
|
||||
|
||||
@@ -96,16 +123,82 @@ pub fn estimate_size_bytes(paths: &[PathBuf]) -> u64 {
|
||||
paths.iter().map(|p| walk(p)).sum()
|
||||
}
|
||||
|
||||
/// Compute the workspace paths stored in SQLite that are NOT visited by
|
||||
/// the current walker scope (i.e. they are "orphans" — on disk but
|
||||
/// outside the configured include/exclude rules, or from a sub-directory
|
||||
/// that has since been removed from the workspace).
|
||||
///
|
||||
/// Does NOT check filesystem existence — `OrphansOnly` is the explicit
|
||||
/// "I know what I'm doing" variant; callers that want the conservative
|
||||
/// fs-aware sweep should use `sweep_deleted_files` inside ingest.
|
||||
///
|
||||
/// Returns the list sorted for deterministic output. Called twice by the
|
||||
/// CLI path (once for the confirm UI preview, once inside `execute`);
|
||||
/// the double scan is acceptable for a rare destructive operation.
|
||||
pub fn enumerate_orphans(cfg: &Config) -> Result<Vec<WorkspacePath>> {
|
||||
use kebab_core::DocumentStore as _;
|
||||
use kebab_source_fs::FsSourceConnector;
|
||||
use kebab_core::SourceScope;
|
||||
|
||||
let store = kebab_store_sqlite::SqliteStore::open(cfg)
|
||||
.context("enumerate_orphans: open SqliteStore")?;
|
||||
|
||||
let stored = store
|
||||
.all_workspace_paths()
|
||||
.context("enumerate_orphans: all_workspace_paths")?;
|
||||
|
||||
if stored.is_empty() {
|
||||
return Ok(Vec::new());
|
||||
}
|
||||
|
||||
// Build the same SourceScope the CLI's ingest path uses: root from
|
||||
// config, exclude list from config, no include override (full scope).
|
||||
let root = cfg.resolve_workspace_root();
|
||||
let scope = SourceScope {
|
||||
root: root.clone(),
|
||||
exclude: cfg.workspace.exclude.clone(),
|
||||
..Default::default()
|
||||
};
|
||||
|
||||
let connector = FsSourceConnector::new(cfg)
|
||||
.context("enumerate_orphans: build FsSourceConnector")?;
|
||||
let (assets, _skips) = connector
|
||||
.scan_with_skips(&scope)
|
||||
.context("enumerate_orphans: scan workspace")?;
|
||||
|
||||
let scanned: HashSet<WorkspacePath> = assets
|
||||
.into_iter()
|
||||
.map(|a| a.workspace_path)
|
||||
.collect();
|
||||
|
||||
let mut orphans: Vec<WorkspacePath> = stored
|
||||
.into_iter()
|
||||
.filter(|p| !scanned.contains(p))
|
||||
.collect();
|
||||
orphans.sort_by(|a, b| a.0.cmp(&b.0));
|
||||
Ok(orphans)
|
||||
}
|
||||
|
||||
/// Wipe every path from `enumerate_paths(scope, cfg)`. For
|
||||
/// `ResetScope::VectorOnly`, also truncates the SQLite
|
||||
/// `embedding_records` table so the store doesn't point at the Lance
|
||||
/// rows we just removed off-disk.
|
||||
///
|
||||
/// For `ResetScope::OrphansOnly`, no filesystem directories are removed.
|
||||
/// Instead the store is reconciled: stored docs outside the current walker
|
||||
/// scope are purged from SQLite (+ vector store when configured). The
|
||||
/// caller is expected to have already shown the confirm UI using
|
||||
/// `enumerate_orphans`.
|
||||
///
|
||||
/// Idempotent: a missing path is treated as already-removed (success).
|
||||
/// Returns a `ResetReport` listing exactly what was removed (paths that
|
||||
/// existed before the call) so `--json` callers see the truth, not the
|
||||
/// request.
|
||||
pub fn execute(scope: ResetScope, cfg: &Config) -> Result<ResetReport> {
|
||||
if matches!(scope, ResetScope::OrphansOnly) {
|
||||
return execute_orphans_only(cfg);
|
||||
}
|
||||
|
||||
let paths = enumerate_paths(scope, cfg);
|
||||
let mut removed = Vec::new();
|
||||
|
||||
@@ -128,9 +221,100 @@ pub fn execute(scope: ResetScope, cfg: &Config) -> Result<ResetReport> {
|
||||
scope,
|
||||
removed_paths: removed,
|
||||
embedding_rows_truncated,
|
||||
orphans_purged: 0,
|
||||
purged_paths: Vec::new(),
|
||||
})
|
||||
}
|
||||
|
||||
/// Execute the `OrphansOnly` variant: reconcile stored docs against the
|
||||
/// current walker scope without touching any filesystem directory.
|
||||
fn execute_orphans_only(cfg: &Config) -> Result<ResetReport> {
|
||||
let orphans = enumerate_orphans(cfg)
|
||||
.context("execute_orphans_only: enumerate orphans")?;
|
||||
|
||||
if orphans.is_empty() {
|
||||
return Ok(ResetReport {
|
||||
scope: ResetScope::OrphansOnly,
|
||||
removed_paths: Vec::new(),
|
||||
embedding_rows_truncated: 0,
|
||||
orphans_purged: 0,
|
||||
purged_paths: Vec::new(),
|
||||
});
|
||||
}
|
||||
|
||||
let store = std::sync::Arc::new(
|
||||
kebab_store_sqlite::SqliteStore::open(cfg)
|
||||
.context("execute_orphans_only: open SqliteStore")?,
|
||||
);
|
||||
|
||||
// Open vector store if configured. Mirror the same guard the ingest
|
||||
// path uses: only construct when the provider is not "none" / dims > 0.
|
||||
let vector_store: Option<kebab_store_vector::LanceVectorStore> =
|
||||
open_vector_store_if_configured(cfg, store.clone())?;
|
||||
|
||||
let mut purged_paths: Vec<WorkspacePath> = Vec::new();
|
||||
|
||||
for path in &orphans {
|
||||
let chunk_ids = kebab_store_sqlite::purge_deleted_workspace_path(&store, path)
|
||||
.with_context(|| format!("execute_orphans_only: purge {}", path.0))?;
|
||||
|
||||
if let Some(ref vs) = vector_store {
|
||||
if !chunk_ids.is_empty() {
|
||||
use kebab_core::VectorStore as _;
|
||||
if let Err(e) = vs.delete_by_chunk_ids(&chunk_ids) {
|
||||
tracing::warn!(
|
||||
target: "kebab-app",
|
||||
path = %path.0,
|
||||
count = chunk_ids.len(),
|
||||
error = %e,
|
||||
"reset --orphans-only: vector delete failed; SQLite side already cleaned"
|
||||
);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
tracing::info!(
|
||||
target: "kebab-app",
|
||||
path = %path.0,
|
||||
"reset --orphans-only: purged orphan document"
|
||||
);
|
||||
purged_paths.push(path.clone());
|
||||
}
|
||||
|
||||
let orphans_purged = u32::try_from(purged_paths.len()).unwrap_or(u32::MAX);
|
||||
|
||||
Ok(ResetReport {
|
||||
scope: ResetScope::OrphansOnly,
|
||||
removed_paths: Vec::new(),
|
||||
embedding_rows_truncated: 0,
|
||||
orphans_purged,
|
||||
purged_paths,
|
||||
})
|
||||
}
|
||||
|
||||
/// Open the Lance vector store if the configured embedding provider is
|
||||
/// active (non-"none", dimensions > 0). Returns `None` for lexical-only
|
||||
/// configs. Mirrors the guard in `App::vector`.
|
||||
fn open_vector_store_if_configured(
|
||||
cfg: &Config,
|
||||
store: std::sync::Arc<kebab_store_sqlite::SqliteStore>,
|
||||
) -> Result<Option<kebab_store_vector::LanceVectorStore>> {
|
||||
if cfg.models.embedding.provider == "none" || cfg.models.embedding.dimensions == 0 {
|
||||
return Ok(None);
|
||||
}
|
||||
match kebab_store_vector::LanceVectorStore::new(cfg, store) {
|
||||
Ok(vs) => Ok(Some(vs)),
|
||||
Err(e) => {
|
||||
tracing::warn!(
|
||||
target: "kebab-app",
|
||||
error = %e,
|
||||
"reset --orphans-only: could not open vector store; skipping vector delete"
|
||||
);
|
||||
Ok(None)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Open the SQLite store at the configured path and run
|
||||
/// `truncate_embedding_records`. Returns the count of truncated rows
|
||||
/// (the helper itself reports `DELETE` rowcount). If the SQLite file
|
||||
@@ -200,4 +384,14 @@ mod tests {
|
||||
let bytes = estimate_size_bytes(&[dir.path().to_path_buf()]);
|
||||
assert_eq!(bytes, 5 + 6);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn enumerate_orphans_only_returns_empty_paths() {
|
||||
let cfg = Config::defaults();
|
||||
let paths = enumerate_paths(ResetScope::OrphansOnly, &cfg);
|
||||
assert!(
|
||||
paths.is_empty(),
|
||||
"OrphansOnly must return empty vec from enumerate_paths"
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -63,14 +63,26 @@ pub struct Stats {
|
||||
/// p9-fb-37: docs whose `updated_at` exceeds the staleness threshold.
|
||||
#[serde(default)]
|
||||
pub stale_doc_count: u64,
|
||||
/// p10-1A-1: code language breakdown (chunk counts by canonical lowercase
|
||||
/// language identifier). Empty until 1A-2 produces code chunks.
|
||||
/// p10-1A-1: code language breakdown (**doc** counts by canonical
|
||||
/// lowercase language identifier). Empty until 1A-2 produces code
|
||||
/// docs. v0.17.0 PR-C: doc-count semantics corrected here (the
|
||||
/// previous "chunk counts" wording was a longstanding mis-label —
|
||||
/// implementation has always been `COUNT(*) FROM documents
|
||||
/// GROUP BY code_lang`). Use `code_lang_chunk_breakdown` for the
|
||||
/// chunk-level companion.
|
||||
#[serde(default)]
|
||||
pub code_lang_breakdown: std::collections::BTreeMap<String, u32>,
|
||||
/// p10-1A-1: repo breakdown (chunk counts by `metadata.repo` value).
|
||||
/// Empty until 1A-2 produces code chunks.
|
||||
/// p10-1A-1: repo breakdown (**doc** counts by `metadata.repo`
|
||||
/// value). Empty until 1A-2 produces code docs. v0.17.0 PR-C:
|
||||
/// doc-count wording corrected (mirror of code_lang_breakdown).
|
||||
#[serde(default)]
|
||||
pub repo_breakdown: std::collections::BTreeMap<String, u32>,
|
||||
/// v0.17.0 PR-C: sister of [`Self::code_lang_breakdown`] returning
|
||||
/// chunk counts instead of doc counts. Indexing-pressure metric —
|
||||
/// one PDF spec → 200 chunks vs one Rust file → 5 chunks shows up
|
||||
/// here in a way `code_lang_breakdown` (doc count) hides.
|
||||
#[serde(default)]
|
||||
pub code_lang_chunk_breakdown: std::collections::BTreeMap<String, u32>,
|
||||
}
|
||||
|
||||
const KEBAB_VERSION: &str = env!("CARGO_PKG_VERSION");
|
||||
@@ -168,7 +180,12 @@ fn collect_stats(
|
||||
stale_doc_count: counts.stale_doc_count,
|
||||
// p10-1A-2: populated by the store query added in this task.
|
||||
code_lang_breakdown: store.code_lang_breakdown()?,
|
||||
repo_breakdown: std::collections::BTreeMap::new(),
|
||||
// p10-1A-2 follow-up: dogfooding (2026-05-20) revealed this was a
|
||||
// placeholder — mirror of code_lang_breakdown for the repo field.
|
||||
repo_breakdown: store.repo_breakdown()?,
|
||||
// v0.17.0 PR-C: chunk-level companion (closes HOTFIXES
|
||||
// 2026-05-22 "code_lang_breakdown chunk granularity" LOW).
|
||||
code_lang_chunk_breakdown: store.code_lang_chunk_breakdown()?,
|
||||
})
|
||||
}
|
||||
|
||||
@@ -208,6 +225,11 @@ mod tests_stats_ext {
|
||||
v.get("repo_breakdown").is_some(),
|
||||
"Stats JSON must include repo_breakdown: {v}"
|
||||
);
|
||||
// v0.17.0 PR-C: chunk-level companion field.
|
||||
assert!(
|
||||
v.get("code_lang_chunk_breakdown").is_some(),
|
||||
"Stats JSON must include code_lang_chunk_breakdown (v0.17.0 PR-C): {v}"
|
||||
);
|
||||
// Empty BTreeMap serializes as `{}` — confirm it's an object, not null.
|
||||
assert!(
|
||||
v["code_lang_breakdown"].is_object(),
|
||||
@@ -217,6 +239,10 @@ mod tests_stats_ext {
|
||||
v["repo_breakdown"].is_object(),
|
||||
"repo_breakdown must be an object: {v}"
|
||||
);
|
||||
assert!(
|
||||
v["code_lang_chunk_breakdown"].is_object(),
|
||||
"code_lang_chunk_breakdown must be an object: {v}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
|
||||
@@ -390,6 +390,641 @@ fn javascript_file_ingests_and_searches_as_code_citation() {
|
||||
);
|
||||
}
|
||||
|
||||
/// p10-1c-go Task F: a `.go` file in a sub-directory is ingested and the
|
||||
/// resulting `Citation::Code` hit must carry `lang="go"`,
|
||||
/// `symbol="chunk.ParseDoc"`, and `line_start >= 1`.
|
||||
/// The sub-directory (`chunk/`) ensures the Go package-prefix wiring
|
||||
/// produces a non-empty module prefix so the fully-qualified symbol assertion
|
||||
/// exercises that path end-to-end.
|
||||
#[test]
|
||||
fn go_file_ingests_and_searches_as_code_citation() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
let pkg_dir = env.workspace_root.join("chunk");
|
||||
std::fs::create_dir_all(&pkg_dir).unwrap();
|
||||
std::fs::write(
|
||||
pkg_dir.join("ast.go"),
|
||||
"package chunk\n\nfunc ParseDoc(input string) string {\n return input\n}\n",
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("ingest must succeed");
|
||||
assert_eq!(report.errors, 0);
|
||||
assert!(report.new >= 1);
|
||||
|
||||
let go_item = report
|
||||
.items
|
||||
.as_ref()
|
||||
.expect("items present")
|
||||
.iter()
|
||||
.find(|i| i.doc_path.0.ends_with("ast.go"))
|
||||
.expect("ast.go item present");
|
||||
assert_eq!(
|
||||
go_item.parser_version.as_ref().map(|p| p.0.as_str()),
|
||||
Some("code-go-v1"),
|
||||
"parser_version must be code-go-v1"
|
||||
);
|
||||
assert_eq!(
|
||||
go_item.chunker_version.as_ref().map(|c| c.0.as_str()),
|
||||
Some("code-go-ast-v1"),
|
||||
"chunker_version must be code-go-ast-v1"
|
||||
);
|
||||
|
||||
let hits = kebab_app::search_with_config(env.config.clone(), lexical_query("ParseDoc"))
|
||||
.expect("search must succeed");
|
||||
let h = hits
|
||||
.iter()
|
||||
.find(|h| matches!(&h.citation, kebab_core::Citation::Code { .. }))
|
||||
.expect("Citation::Code hit");
|
||||
match &h.citation {
|
||||
kebab_core::Citation::Code {
|
||||
lang,
|
||||
symbol,
|
||||
line_start,
|
||||
..
|
||||
} => {
|
||||
assert_eq!(lang.as_deref(), Some("go"), "citation.lang must be 'go'");
|
||||
assert_eq!(
|
||||
symbol.as_deref(),
|
||||
Some("chunk.ParseDoc"),
|
||||
"citation.symbol must be 'chunk.ParseDoc'"
|
||||
);
|
||||
assert!(*line_start >= 1, "line_start must be >=1");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
assert_eq!(
|
||||
h.code_lang.as_deref(),
|
||||
Some("go"),
|
||||
"SearchHit.code_lang must be 'go'"
|
||||
);
|
||||
}
|
||||
|
||||
/// p10-1c-jk Task F: a `.java` file in a package directory is ingested and the
|
||||
/// resulting `Citation::Code` hit must carry `lang="java"`,
|
||||
/// `symbol="com.foo.Foo.bar"`, and `line_start >= 1`.
|
||||
/// The sub-directory (`com/foo/`) ensures the Java package-prefix wiring
|
||||
/// produces a non-empty module prefix so the fully-qualified symbol assertion
|
||||
/// exercises that path end-to-end.
|
||||
#[test]
|
||||
fn java_file_ingests_and_searches_as_code_citation() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
let pkg_dir = env.workspace_root.join("com").join("foo");
|
||||
std::fs::create_dir_all(&pkg_dir).unwrap();
|
||||
std::fs::write(
|
||||
pkg_dir.join("Foo.java"),
|
||||
"package com.foo;\n\npublic class Foo {\n public String bar() { return \"x\"; }\n}\n",
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("ingest must succeed");
|
||||
assert_eq!(report.errors, 0);
|
||||
assert!(report.new >= 1);
|
||||
|
||||
let java_item = report
|
||||
.items
|
||||
.as_ref()
|
||||
.expect("items present")
|
||||
.iter()
|
||||
.find(|i| i.doc_path.0.ends_with("Foo.java"))
|
||||
.expect("Foo.java item present");
|
||||
assert_eq!(
|
||||
java_item.parser_version.as_ref().map(|p| p.0.as_str()),
|
||||
Some("code-java-v1"),
|
||||
"parser_version must be code-java-v1"
|
||||
);
|
||||
assert_eq!(
|
||||
java_item.chunker_version.as_ref().map(|c| c.0.as_str()),
|
||||
Some("code-java-ast-v1"),
|
||||
"chunker_version must be code-java-ast-v1"
|
||||
);
|
||||
|
||||
let hits = kebab_app::search_with_config(env.config.clone(), lexical_query("bar"))
|
||||
.expect("search must succeed");
|
||||
let h = hits
|
||||
.iter()
|
||||
.find(|h| matches!(&h.citation, kebab_core::Citation::Code { .. }))
|
||||
.expect("Citation::Code hit");
|
||||
match &h.citation {
|
||||
kebab_core::Citation::Code {
|
||||
lang,
|
||||
symbol,
|
||||
line_start,
|
||||
..
|
||||
} => {
|
||||
assert_eq!(lang.as_deref(), Some("java"), "citation.lang must be 'java'");
|
||||
assert_eq!(
|
||||
symbol.as_deref(),
|
||||
Some("com.foo.Foo.bar"),
|
||||
"citation.symbol must be 'com.foo.Foo.bar'"
|
||||
);
|
||||
assert!(*line_start >= 1, "line_start must be >=1");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
assert_eq!(
|
||||
h.code_lang.as_deref(),
|
||||
Some("java"),
|
||||
"SearchHit.code_lang must be 'java'"
|
||||
);
|
||||
}
|
||||
|
||||
/// p10-1c-jk Task I: a `.kt` file in a package directory is ingested and the
|
||||
/// resulting `Citation::Code` hit must carry `lang="kotlin"`,
|
||||
/// `symbol="com.foo.Foo.bar"`, and `line_start >= 1`.
|
||||
/// The sub-directory (`com/foo/`) ensures the Kotlin package-prefix wiring
|
||||
/// produces a non-empty module prefix so the fully-qualified symbol assertion
|
||||
/// exercises that path end-to-end.
|
||||
#[test]
|
||||
fn kotlin_file_ingests_and_searches_as_code_citation() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
let pkg_dir = env.workspace_root.join("com").join("foo");
|
||||
std::fs::create_dir_all(&pkg_dir).unwrap();
|
||||
std::fs::write(
|
||||
pkg_dir.join("Foo.kt"),
|
||||
"package com.foo\n\nclass Foo {\n fun bar(): String = \"x\"\n}\n",
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("ingest must succeed");
|
||||
assert_eq!(report.errors, 0);
|
||||
assert!(report.new >= 1);
|
||||
|
||||
let kt_item = report
|
||||
.items
|
||||
.as_ref()
|
||||
.expect("items present")
|
||||
.iter()
|
||||
.find(|i| i.doc_path.0.ends_with("Foo.kt"))
|
||||
.expect("Foo.kt item present");
|
||||
assert_eq!(
|
||||
kt_item.parser_version.as_ref().map(|p| p.0.as_str()),
|
||||
Some("code-kotlin-v1"),
|
||||
"parser_version must be code-kotlin-v1"
|
||||
);
|
||||
assert_eq!(
|
||||
kt_item.chunker_version.as_ref().map(|c| c.0.as_str()),
|
||||
Some("code-kotlin-ast-v1"),
|
||||
"chunker_version must be code-kotlin-ast-v1"
|
||||
);
|
||||
|
||||
let hits = kebab_app::search_with_config(env.config.clone(), lexical_query("bar"))
|
||||
.expect("search must succeed");
|
||||
let h = hits
|
||||
.iter()
|
||||
.find(|h| matches!(&h.citation, kebab_core::Citation::Code { .. }))
|
||||
.expect("Citation::Code hit");
|
||||
match &h.citation {
|
||||
kebab_core::Citation::Code {
|
||||
lang,
|
||||
symbol,
|
||||
line_start,
|
||||
..
|
||||
} => {
|
||||
assert_eq!(lang.as_deref(), Some("kotlin"), "citation.lang must be 'kotlin'");
|
||||
assert_eq!(
|
||||
symbol.as_deref(),
|
||||
Some("com.foo.Foo.bar"),
|
||||
"citation.symbol must be 'com.foo.Foo.bar'"
|
||||
);
|
||||
assert!(*line_start >= 1, "line_start must be >=1");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
assert_eq!(
|
||||
h.code_lang.as_deref(),
|
||||
Some("kotlin"),
|
||||
"SearchHit.code_lang must be 'kotlin'"
|
||||
);
|
||||
}
|
||||
|
||||
/// p10-2 Task H: a `k8s/deploy.yaml` file with a Deployment resource is
|
||||
/// ingested and the resulting `Citation::Code` hit must carry
|
||||
/// `lang="yaml"`, `symbol="Deployment/prod/api"`, and `line_start >= 1`.
|
||||
/// Exercises the k8s-manifest-resource-v1 chunker end-to-end.
|
||||
#[test]
|
||||
fn tier2_k8s_yaml_ingest_searchable() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
let k8s_dir = env.workspace_root.join("k8s");
|
||||
std::fs::create_dir_all(&k8s_dir).unwrap();
|
||||
std::fs::write(
|
||||
k8s_dir.join("deploy.yaml"),
|
||||
"apiVersion: apps/v1\nkind: Deployment\nmetadata:\n name: api\n namespace: prod\nspec:\n replicas: 1\n",
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("ingest must succeed");
|
||||
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
|
||||
assert!(report.new >= 1, "yaml file ingested: {report:?}");
|
||||
|
||||
let yaml_item = report
|
||||
.items
|
||||
.as_ref()
|
||||
.expect("items present")
|
||||
.iter()
|
||||
.find(|i| i.doc_path.0.ends_with("deploy.yaml"))
|
||||
.expect("deploy.yaml item present");
|
||||
assert_eq!(
|
||||
yaml_item.parser_version.as_ref().map(|p| p.0.as_str()),
|
||||
Some("none-v1"),
|
||||
"parser_version must be none-v1"
|
||||
);
|
||||
assert_eq!(
|
||||
yaml_item.chunker_version.as_ref().map(|c| c.0.as_str()),
|
||||
Some("k8s-manifest-resource-v1"),
|
||||
"chunker_version must be k8s-manifest-resource-v1"
|
||||
);
|
||||
|
||||
let query = kebab_core::SearchQuery {
|
||||
text: "api".to_string(),
|
||||
mode: kebab_core::SearchMode::Lexical,
|
||||
k: 10,
|
||||
filters: kebab_core::SearchFilters {
|
||||
code_lang: vec!["yaml".to_string()],
|
||||
..Default::default()
|
||||
},
|
||||
};
|
||||
let hits = kebab_app::search_with_config(env.config.clone(), query)
|
||||
.expect("search must succeed");
|
||||
|
||||
let h = hits
|
||||
.iter()
|
||||
.find(|h| matches!(&h.citation, Citation::Code { .. }))
|
||||
.expect("at least one Citation::Code hit for 'api'");
|
||||
|
||||
match &h.citation {
|
||||
Citation::Code {
|
||||
lang,
|
||||
symbol,
|
||||
line_start,
|
||||
..
|
||||
} => {
|
||||
assert_eq!(lang.as_deref(), Some("yaml"), "citation.lang must be 'yaml'");
|
||||
assert_eq!(
|
||||
symbol.as_deref(),
|
||||
Some("Deployment/prod/api"),
|
||||
"citation.symbol must be 'Deployment/prod/api'"
|
||||
);
|
||||
assert!(*line_start >= 1, "line_start must be >=1");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
|
||||
assert_eq!(
|
||||
h.code_lang.as_deref(),
|
||||
Some("yaml"),
|
||||
"SearchHit.code_lang must be 'yaml'"
|
||||
);
|
||||
}
|
||||
|
||||
/// p10-2 Task H: a `Dockerfile` is ingested and the resulting
|
||||
/// `Citation::Code` hit must carry `lang="dockerfile"`,
|
||||
/// `symbol="<dockerfile>"`, and `line_start >= 1`.
|
||||
/// Exercises the dockerfile-file-v1 chunker end-to-end.
|
||||
#[test]
|
||||
fn tier2_dockerfile_ingest_searchable() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
std::fs::write(
|
||||
env.workspace_root.join("Dockerfile"),
|
||||
"FROM rust:1.94\nRUN cargo install foo\n",
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("ingest must succeed");
|
||||
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
|
||||
assert!(report.new >= 1, "Dockerfile ingested: {report:?}");
|
||||
|
||||
let df_item = report
|
||||
.items
|
||||
.as_ref()
|
||||
.expect("items present")
|
||||
.iter()
|
||||
.find(|i| i.doc_path.0.ends_with("Dockerfile"))
|
||||
.expect("Dockerfile item present");
|
||||
assert_eq!(
|
||||
df_item.parser_version.as_ref().map(|p| p.0.as_str()),
|
||||
Some("none-v1"),
|
||||
"parser_version must be none-v1"
|
||||
);
|
||||
assert_eq!(
|
||||
df_item.chunker_version.as_ref().map(|c| c.0.as_str()),
|
||||
Some("dockerfile-file-v1"),
|
||||
"chunker_version must be dockerfile-file-v1"
|
||||
);
|
||||
|
||||
let query = kebab_core::SearchQuery {
|
||||
text: "cargo".to_string(),
|
||||
mode: kebab_core::SearchMode::Lexical,
|
||||
k: 10,
|
||||
filters: kebab_core::SearchFilters {
|
||||
code_lang: vec!["dockerfile".to_string()],
|
||||
..Default::default()
|
||||
},
|
||||
};
|
||||
let hits = kebab_app::search_with_config(env.config.clone(), query)
|
||||
.expect("search must succeed");
|
||||
|
||||
let h = hits
|
||||
.iter()
|
||||
.find(|h| matches!(&h.citation, Citation::Code { .. }))
|
||||
.expect("at least one Citation::Code hit for 'cargo'");
|
||||
|
||||
match &h.citation {
|
||||
Citation::Code {
|
||||
lang,
|
||||
symbol,
|
||||
line_start,
|
||||
..
|
||||
} => {
|
||||
assert_eq!(
|
||||
lang.as_deref(),
|
||||
Some("dockerfile"),
|
||||
"citation.lang must be 'dockerfile'"
|
||||
);
|
||||
assert_eq!(
|
||||
symbol.as_deref(),
|
||||
Some("<dockerfile>"),
|
||||
"citation.symbol must be '<dockerfile>'"
|
||||
);
|
||||
assert!(*line_start >= 1, "line_start must be >=1");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
|
||||
assert_eq!(
|
||||
h.code_lang.as_deref(),
|
||||
Some("dockerfile"),
|
||||
"SearchHit.code_lang must be 'dockerfile'"
|
||||
);
|
||||
}
|
||||
|
||||
/// p10-2 Task H: a `Cargo.toml` manifest is ingested and the resulting
|
||||
/// `Citation::Code` hit must carry `lang="toml"`, `symbol="<manifest>"`,
|
||||
/// and `line_start >= 1`.
|
||||
/// Exercises the manifest-file-v1 chunker end-to-end.
|
||||
#[test]
|
||||
fn tier2_cargo_toml_ingest_searchable() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
std::fs::write(
|
||||
env.workspace_root.join("Cargo.toml"),
|
||||
"[package]\nname = \"demo\"\nversion = \"0.1.0\"\n",
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("ingest must succeed");
|
||||
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
|
||||
assert!(report.new >= 1, "Cargo.toml ingested: {report:?}");
|
||||
|
||||
let toml_item = report
|
||||
.items
|
||||
.as_ref()
|
||||
.expect("items present")
|
||||
.iter()
|
||||
.find(|i| i.doc_path.0.ends_with("Cargo.toml"))
|
||||
.expect("Cargo.toml item present");
|
||||
assert_eq!(
|
||||
toml_item.parser_version.as_ref().map(|p| p.0.as_str()),
|
||||
Some("none-v1"),
|
||||
"parser_version must be none-v1"
|
||||
);
|
||||
assert_eq!(
|
||||
toml_item.chunker_version.as_ref().map(|c| c.0.as_str()),
|
||||
Some("manifest-file-v1"),
|
||||
"chunker_version must be manifest-file-v1"
|
||||
);
|
||||
|
||||
let query = kebab_core::SearchQuery {
|
||||
text: "demo".to_string(),
|
||||
mode: kebab_core::SearchMode::Lexical,
|
||||
k: 10,
|
||||
filters: kebab_core::SearchFilters {
|
||||
code_lang: vec!["toml".to_string()],
|
||||
..Default::default()
|
||||
},
|
||||
};
|
||||
let hits = kebab_app::search_with_config(env.config.clone(), query)
|
||||
.expect("search must succeed");
|
||||
|
||||
let h = hits
|
||||
.iter()
|
||||
.find(|h| matches!(&h.citation, Citation::Code { .. }))
|
||||
.expect("at least one Citation::Code hit for 'demo'");
|
||||
|
||||
match &h.citation {
|
||||
Citation::Code {
|
||||
lang,
|
||||
symbol,
|
||||
line_start,
|
||||
..
|
||||
} => {
|
||||
assert_eq!(
|
||||
lang.as_deref(),
|
||||
Some("toml"),
|
||||
"citation.lang must be 'toml'"
|
||||
);
|
||||
assert_eq!(
|
||||
symbol.as_deref(),
|
||||
Some("<manifest>"),
|
||||
"citation.symbol must be '<manifest>'"
|
||||
);
|
||||
assert!(*line_start >= 1, "line_start must be >=1");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
|
||||
assert_eq!(
|
||||
h.code_lang.as_deref(),
|
||||
Some("toml"),
|
||||
"SearchHit.code_lang must be 'toml'"
|
||||
);
|
||||
}
|
||||
|
||||
/// p10-3 Task E: a `.sh` file is ingested via the shell direct-Tier-3 path
|
||||
/// and the resulting `Citation::Code` hit must carry `lang="shell"`,
|
||||
/// `symbol=None`, `line_start >= 1`, and
|
||||
/// `chunker_version = "code-text-paragraph-v1"`.
|
||||
#[test]
|
||||
fn tier3_shell_ingest_searchable() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
std::fs::write(
|
||||
env.workspace_root.join("deploy.sh"),
|
||||
"#!/usr/bin/env bash\nset -e\necho hello\n\nkebab ingest --json\n",
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("ingest must succeed");
|
||||
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
|
||||
assert!(report.new >= 1, "shell file ingested: {report:?}");
|
||||
|
||||
let sh_item = report
|
||||
.items
|
||||
.as_ref()
|
||||
.expect("items present")
|
||||
.iter()
|
||||
.find(|i| i.doc_path.0.ends_with("deploy.sh"))
|
||||
.expect("deploy.sh item present");
|
||||
assert_eq!(
|
||||
sh_item.parser_version.as_ref().map(|p| p.0.as_str()),
|
||||
Some("none-v1"),
|
||||
"parser_version must be none-v1 for shell (Tier 3 direct)"
|
||||
);
|
||||
assert_eq!(
|
||||
sh_item.chunker_version.as_ref().map(|c| c.0.as_str()),
|
||||
Some("code-text-paragraph-v1"),
|
||||
"chunker_version must be code-text-paragraph-v1 for shell"
|
||||
);
|
||||
|
||||
let query = kebab_core::SearchQuery {
|
||||
text: "kebab".to_string(),
|
||||
mode: kebab_core::SearchMode::Lexical,
|
||||
k: 10,
|
||||
filters: kebab_core::SearchFilters {
|
||||
code_lang: vec!["shell".to_string()],
|
||||
..Default::default()
|
||||
},
|
||||
};
|
||||
let hits = kebab_app::search_with_config(env.config.clone(), query)
|
||||
.expect("search must succeed");
|
||||
|
||||
let h = hits
|
||||
.iter()
|
||||
.find(|h| matches!(&h.citation, Citation::Code { .. }))
|
||||
.expect("at least one Citation::Code hit for 'kebab'");
|
||||
|
||||
match &h.citation {
|
||||
Citation::Code {
|
||||
lang,
|
||||
symbol,
|
||||
line_start,
|
||||
..
|
||||
} => {
|
||||
assert_eq!(
|
||||
lang.as_deref(),
|
||||
Some("shell"),
|
||||
"citation.lang must be 'shell'"
|
||||
);
|
||||
assert_eq!(*symbol, None, "Tier 3 symbol must be None");
|
||||
assert!(*line_start >= 1, "line_start must be >=1");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
|
||||
assert_eq!(
|
||||
h.code_lang.as_deref(),
|
||||
Some("shell"),
|
||||
"SearchHit.code_lang must be 'shell'"
|
||||
);
|
||||
assert_eq!(
|
||||
h.chunker_version.0.as_str(),
|
||||
"code-text-paragraph-v1",
|
||||
"shell chunks must be stamped with the Tier 3 chunker_version"
|
||||
);
|
||||
}
|
||||
|
||||
/// p10-3 Task E: a docker-compose-shaped YAML file (no `apiVersion`/`kind`)
|
||||
/// is ingested; the k8s chunker returns `Ok(vec![])` and the Tier 3 fallback
|
||||
/// wrapper retries with `CodeTextParagraphV1Chunker`. The resulting
|
||||
/// `Citation::Code` hit must carry `lang="yaml"`, `symbol=None`,
|
||||
/// `line_start >= 1`, and `chunker_version = "code-text-paragraph-v1"`.
|
||||
#[test]
|
||||
fn tier3_yaml_fallback_picks_up_non_k8s_yaml() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
// docker-compose-shaped YAML — version + services but no apiVersion/kind.
|
||||
// The k8s chunker returns Ok(vec![]); Tier 3 fallback should pick this up.
|
||||
std::fs::write(
|
||||
env.workspace_root.join("docker-compose.yml"),
|
||||
"version: '3'\nservices:\n api:\n image: nginx:latest\n ports:\n - 8080:80\n",
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("ingest must succeed");
|
||||
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
|
||||
assert!(
|
||||
report.new >= 1,
|
||||
"expected non-k8s yaml ingested via Tier 3, got {} new docs",
|
||||
report.new
|
||||
);
|
||||
|
||||
let yaml_item = report
|
||||
.items
|
||||
.as_ref()
|
||||
.expect("items present")
|
||||
.iter()
|
||||
.find(|i| i.doc_path.0.ends_with("docker-compose.yml"))
|
||||
.expect("docker-compose.yml item present");
|
||||
assert_eq!(
|
||||
yaml_item.parser_version.as_ref().map(|p| p.0.as_str()),
|
||||
Some("none-v1"),
|
||||
"parser_version must be none-v1 after Tier 3 fallback"
|
||||
);
|
||||
assert_eq!(
|
||||
yaml_item.chunker_version.as_ref().map(|c| c.0.as_str()),
|
||||
Some("code-text-paragraph-v1"),
|
||||
"chunker_version must be code-text-paragraph-v1 after Tier 3 fallback"
|
||||
);
|
||||
|
||||
let query = kebab_core::SearchQuery {
|
||||
text: "nginx".to_string(),
|
||||
mode: kebab_core::SearchMode::Lexical,
|
||||
k: 10,
|
||||
filters: kebab_core::SearchFilters {
|
||||
code_lang: vec!["yaml".to_string()],
|
||||
..Default::default()
|
||||
},
|
||||
};
|
||||
let hits = kebab_app::search_with_config(env.config.clone(), query)
|
||||
.expect("search must succeed");
|
||||
|
||||
let h = hits
|
||||
.iter()
|
||||
.find(|h| matches!(&h.citation, Citation::Code { .. }))
|
||||
.expect("at least one Citation::Code hit for 'nginx'");
|
||||
|
||||
match &h.citation {
|
||||
Citation::Code {
|
||||
lang,
|
||||
symbol,
|
||||
line_start,
|
||||
..
|
||||
} => {
|
||||
assert_eq!(
|
||||
lang.as_deref(),
|
||||
Some("yaml"),
|
||||
"citation.lang must be 'yaml'"
|
||||
);
|
||||
assert_eq!(*symbol, None, "Tier 3 fallback symbol must be None");
|
||||
assert!(*line_start >= 1, "line_start must be >=1");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
|
||||
assert_eq!(
|
||||
h.code_lang.as_deref(),
|
||||
Some("yaml"),
|
||||
"SearchHit.code_lang must be 'yaml'"
|
||||
);
|
||||
assert_eq!(
|
||||
h.chunker_version.0.as_str(),
|
||||
"code-text-paragraph-v1",
|
||||
"non-k8s yaml fallback must be stamped code-text-paragraph-v1"
|
||||
);
|
||||
}
|
||||
|
||||
/// Re-ingesting the same `.rs` file without changes must report
|
||||
/// `Unchanged` (incremental-skip path exercised).
|
||||
#[test]
|
||||
@@ -429,3 +1064,328 @@ fn rust_file_re_ingest_is_unchanged() {
|
||||
);
|
||||
assert_eq!(item2.doc_id, item1.doc_id);
|
||||
}
|
||||
|
||||
/// p10-3 fix regression: a docker-compose YAML that falls back to Tier 3
|
||||
/// (k8s chunker returns empty, CodeTextParagraphV1Chunker retries) must
|
||||
/// report Unchanged on the second ingest rather than re-processing.
|
||||
/// Before the fix, try_skip_unchanged returned None because the stored
|
||||
/// last_chunker_version ("code-text-paragraph-v1" / parser_version
|
||||
/// "none-v1") never matched the caller's dispatch values.
|
||||
#[test]
|
||||
fn tier3_yaml_fallback_reingest_is_unchanged() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
std::fs::write(
|
||||
env.workspace_root.join("docker-compose.yml"),
|
||||
"version: '3'\nservices:\n api:\n image: nginx:latest\n",
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let report1 =
|
||||
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("first ingest");
|
||||
let item1 = report1
|
||||
.items
|
||||
.as_ref()
|
||||
.expect("items present")
|
||||
.iter()
|
||||
.find(|i| i.doc_path.0.ends_with("docker-compose.yml"))
|
||||
.expect("docker-compose.yml in first report");
|
||||
assert!(
|
||||
matches!(item1.kind, IngestItemKind::New),
|
||||
"first ingest must be New, got {:?}", item1.kind
|
||||
);
|
||||
assert_eq!(
|
||||
item1.chunker_version.as_ref().map(|c| c.0.as_str()),
|
||||
Some("code-text-paragraph-v1"),
|
||||
"first ingest must use Tier 3 fallback chunker"
|
||||
);
|
||||
|
||||
let report2 =
|
||||
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("second ingest");
|
||||
let item2 = report2
|
||||
.items
|
||||
.as_ref()
|
||||
.expect("items present")
|
||||
.iter()
|
||||
.find(|i| i.doc_path.0.ends_with("docker-compose.yml"))
|
||||
.expect("docker-compose.yml in second report");
|
||||
assert!(
|
||||
matches!(item2.kind, IngestItemKind::Unchanged),
|
||||
"second ingest must be Unchanged, got {:?}", item2.kind
|
||||
);
|
||||
}
|
||||
|
||||
/// p10-1d Task G: a `.c` file with a single top-level function is ingested
|
||||
/// and the resulting `Citation::Code` hit must carry `lang="c"`,
|
||||
/// `symbol="parse_record"` (function name only — no nesting in C), and
|
||||
/// `chunker_version = "code-c-ast-v1"`.
|
||||
#[test]
|
||||
fn tier1_c_ingest_searchable() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
std::fs::write(
|
||||
env.workspace_root.join("parser.c"),
|
||||
"#include <stdio.h>\n\nint parse_record(const char *line) {\n if (line == NULL) return -1;\n return 0;\n}\n",
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("ingest must succeed");
|
||||
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
|
||||
assert!(report.new >= 1, "c file ingested: {report:?}");
|
||||
|
||||
let c_item = report
|
||||
.items
|
||||
.as_ref()
|
||||
.expect("items present")
|
||||
.iter()
|
||||
.find(|i| i.doc_path.0.ends_with("parser.c"))
|
||||
.expect("parser.c item present");
|
||||
assert_eq!(
|
||||
c_item.parser_version.as_ref().map(|p| p.0.as_str()),
|
||||
Some("code-c-v2"),
|
||||
"parser_version must be code-c-v2 (v0.17.0 PR-B: typedef-wrapped struct/enum/union 이 typedef alias unit 으로 방출)"
|
||||
);
|
||||
assert_eq!(
|
||||
c_item.chunker_version.as_ref().map(|c| c.0.as_str()),
|
||||
Some("code-c-ast-v1"),
|
||||
"chunker_version must be code-c-ast-v1"
|
||||
);
|
||||
|
||||
let query = kebab_core::SearchQuery {
|
||||
text: "parse_record".to_string(),
|
||||
mode: kebab_core::SearchMode::Lexical,
|
||||
k: 10,
|
||||
filters: kebab_core::SearchFilters {
|
||||
code_lang: vec!["c".to_string()],
|
||||
..Default::default()
|
||||
},
|
||||
};
|
||||
let hits = kebab_app::search_with_config(env.config.clone(), query)
|
||||
.expect("search must succeed");
|
||||
|
||||
let h = hits
|
||||
.iter()
|
||||
.find(|h| matches!(&h.citation, Citation::Code { .. }))
|
||||
.expect("at least one Citation::Code hit for 'parse_record'");
|
||||
|
||||
match &h.citation {
|
||||
Citation::Code {
|
||||
lang,
|
||||
symbol,
|
||||
line_start,
|
||||
..
|
||||
} => {
|
||||
assert_eq!(lang.as_deref(), Some("c"), "citation.lang must be 'c'");
|
||||
assert_eq!(
|
||||
symbol.as_deref(),
|
||||
Some("parse_record"),
|
||||
"C symbol must be function name only (no nesting)"
|
||||
);
|
||||
assert!(*line_start >= 1, "line_start must be >=1");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
|
||||
assert_eq!(
|
||||
h.code_lang.as_deref(),
|
||||
Some("c"),
|
||||
"SearchHit.code_lang must be 'c'"
|
||||
);
|
||||
assert_eq!(
|
||||
h.chunker_version.0.as_str(),
|
||||
"code-c-ast-v1",
|
||||
"C chunks must be stamped with code-c-ast-v1"
|
||||
);
|
||||
}
|
||||
|
||||
/// p10-1d Task G: a `.cpp` file with nested namespace + class is ingested
|
||||
/// and the resulting `Citation::Code` hit must carry `lang="cpp"`, a
|
||||
/// `symbol` that starts with `"kebab::chunk::Foo"` (namespace::Class or
|
||||
/// namespace::Class::method), and `chunker_version = "code-cpp-ast-v1"`.
|
||||
#[test]
|
||||
fn tier1_cpp_ingest_searchable() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
std::fs::write(
|
||||
env.workspace_root.join("chunker.cpp"),
|
||||
"namespace kebab {\nnamespace chunk {\nclass Foo {\npublic:\n void bar() { /* impl */ }\n};\n}\n}\n",
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("ingest must succeed");
|
||||
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
|
||||
assert!(report.new >= 1, "cpp file ingested: {report:?}");
|
||||
|
||||
let cpp_item = report
|
||||
.items
|
||||
.as_ref()
|
||||
.expect("items present")
|
||||
.iter()
|
||||
.find(|i| i.doc_path.0.ends_with("chunker.cpp"))
|
||||
.expect("chunker.cpp item present");
|
||||
assert_eq!(
|
||||
cpp_item.parser_version.as_ref().map(|p| p.0.as_str()),
|
||||
Some("code-cpp-v1"),
|
||||
"parser_version must be code-cpp-v1"
|
||||
);
|
||||
assert_eq!(
|
||||
cpp_item.chunker_version.as_ref().map(|c| c.0.as_str()),
|
||||
Some("code-cpp-ast-v1"),
|
||||
"chunker_version must be code-cpp-ast-v1"
|
||||
);
|
||||
|
||||
let query = kebab_core::SearchQuery {
|
||||
text: "bar".to_string(),
|
||||
mode: kebab_core::SearchMode::Lexical,
|
||||
k: 10,
|
||||
filters: kebab_core::SearchFilters {
|
||||
code_lang: vec!["cpp".to_string()],
|
||||
..Default::default()
|
||||
},
|
||||
};
|
||||
let hits = kebab_app::search_with_config(env.config.clone(), query)
|
||||
.expect("search must succeed");
|
||||
|
||||
let h = hits
|
||||
.iter()
|
||||
.find(|h| matches!(&h.citation, Citation::Code { .. }))
|
||||
.expect("at least one Citation::Code hit for 'bar'");
|
||||
|
||||
match &h.citation {
|
||||
Citation::Code {
|
||||
lang,
|
||||
symbol,
|
||||
line_start,
|
||||
..
|
||||
} => {
|
||||
assert_eq!(lang.as_deref(), Some("cpp"), "citation.lang must be 'cpp'");
|
||||
// Symbol could be "kebab::chunk::Foo" (class) or "kebab::chunk::Foo::bar"
|
||||
// (method) depending on which chunk ranks first.
|
||||
assert!(
|
||||
symbol.as_deref().is_some_and(|s| s.starts_with("kebab::chunk::Foo")),
|
||||
"C++ symbol must start with namespace::Class prefix, got {:?}", symbol
|
||||
);
|
||||
assert!(*line_start >= 1, "line_start must be >=1");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
|
||||
assert_eq!(
|
||||
h.code_lang.as_deref(),
|
||||
Some("cpp"),
|
||||
"SearchHit.code_lang must be 'cpp'"
|
||||
);
|
||||
assert_eq!(
|
||||
h.chunker_version.0.as_str(),
|
||||
"code-cpp-ast-v1",
|
||||
"C++ chunks must be stamped with code-cpp-ast-v1"
|
||||
);
|
||||
}
|
||||
|
||||
/// P10 dogfood regression: a k8s YAML with 2 documents (Deployment + Service
|
||||
/// separated by `---`) must ingest without a UNIQUE constraint violation.
|
||||
/// Before the fix, push_chunks_with_oversize emitted split_key=None for each
|
||||
/// resource, giving every resource chunk the same id_hash → identical chunk_id
|
||||
/// → SQLite UNIQUE constraint failure on the second resource.
|
||||
#[test]
|
||||
fn tier2_k8s_multi_resource_yaml_ingests_without_collision() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
let k8s_dir = env.workspace_root.join("k8s");
|
||||
std::fs::create_dir_all(&k8s_dir).unwrap();
|
||||
std::fs::write(
|
||||
k8s_dir.join("k8s-multi.yaml"),
|
||||
"apiVersion: apps/v1\nkind: Deployment\nmetadata:\n name: api\n namespace: prod\nspec:\n replicas: 2\n---\napiVersion: v1\nkind: Service\nmetadata:\n name: api\n namespace: prod\nspec:\n selector:\n app: api\n",
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("ingest must succeed");
|
||||
|
||||
// The bug: this would land in report with an error + UNIQUE constraint message.
|
||||
let item = report
|
||||
.items
|
||||
.as_ref()
|
||||
.expect("items present")
|
||||
.iter()
|
||||
.find(|i| i.doc_path.0.ends_with("k8s-multi.yaml"))
|
||||
.expect("k8s-multi.yaml in report");
|
||||
assert!(
|
||||
item.error.is_none(),
|
||||
"multi-resource k8s yaml must ingest without error, got: {:?}",
|
||||
item.error
|
||||
);
|
||||
assert!(
|
||||
matches!(item.kind, IngestItemKind::New),
|
||||
"expected New, got {:?}",
|
||||
item.kind
|
||||
);
|
||||
|
||||
// Both resources must be searchable (≥2 hits: Deployment/prod/api + Service/prod/api).
|
||||
let query = kebab_core::SearchQuery {
|
||||
text: "api".to_string(),
|
||||
mode: kebab_core::SearchMode::Lexical,
|
||||
k: 10,
|
||||
filters: kebab_core::SearchFilters {
|
||||
code_lang: vec!["yaml".to_string()],
|
||||
..Default::default()
|
||||
},
|
||||
};
|
||||
let hits = kebab_app::search_with_config(env.config.clone(), query)
|
||||
.expect("search must succeed");
|
||||
assert!(
|
||||
hits.len() >= 2,
|
||||
"expected ≥2 hits (Deployment + Service), got {}",
|
||||
hits.len()
|
||||
);
|
||||
}
|
||||
|
||||
/// p10-3 fix regression: a shell file (direct Tier 3, not a fallback)
|
||||
/// must also report Unchanged on re-ingest. Shell goes straight to
|
||||
/// CodeTextParagraphV1Chunker so `stored_is_tier3_fallback` is false
|
||||
/// (parser_version is "none-v1" and chunker matches the current dispatch),
|
||||
/// but the normal equality path should pass regardless.
|
||||
#[test]
|
||||
fn tier3_shell_reingest_is_unchanged() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
std::fs::write(
|
||||
env.workspace_root.join("deploy.sh"),
|
||||
"#!/usr/bin/env bash\nset -e\necho hello\n",
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let report1 =
|
||||
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("first ingest");
|
||||
let item1 = report1
|
||||
.items
|
||||
.as_ref()
|
||||
.expect("items present")
|
||||
.iter()
|
||||
.find(|i| i.doc_path.0.ends_with("deploy.sh"))
|
||||
.expect("deploy.sh in first report");
|
||||
assert!(
|
||||
matches!(item1.kind, IngestItemKind::New),
|
||||
"first ingest must be New, got {:?}", item1.kind
|
||||
);
|
||||
|
||||
let report2 =
|
||||
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("second ingest");
|
||||
let item2 = report2
|
||||
.items
|
||||
.as_ref()
|
||||
.expect("items present")
|
||||
.iter()
|
||||
.find(|i| i.doc_path.0.ends_with("deploy.sh"))
|
||||
.expect("deploy.sh in second report");
|
||||
assert!(
|
||||
matches!(item2.kind, IngestItemKind::Unchanged),
|
||||
"shell reingest must be Unchanged, got {:?}", item2.kind
|
||||
);
|
||||
}
|
||||
|
||||
@@ -38,12 +38,16 @@ fn fetch_chunk_returns_target_only_when_no_context() {
|
||||
#[test]
|
||||
fn fetch_chunk_with_context_returns_neighbors() {
|
||||
let env = common::TestEnv::new();
|
||||
let body = "# H1\n\nA1\n\n# H2\n\nA2\n\n# H3\n\nA3\n\n# H4\n\nA4\n\n# H5\n\nA5\n";
|
||||
// v0.17.0 trigram tokenizer: terms must be ≥3 Unicode chars to
|
||||
// match. The earlier fixture used 2-char tokens like `A1`/`A3` for
|
||||
// section bodies — those zero-hit under trigram. Use 5-char unique
|
||||
// words per section so the query can pin one chunk deterministically.
|
||||
let body = "# H1\n\napples\n\n# H2\n\nbanana\n\n# H3\n\ncherry\n\n# H4\n\ndurian\n\n# H5\n\nelder\n";
|
||||
common::ingest_md(&env, "multi.md", body);
|
||||
let app = env.app();
|
||||
|
||||
let q = kebab_core::SearchQuery {
|
||||
text: "A3".to_string(),
|
||||
text: "cherry".to_string(),
|
||||
mode: kebab_core::SearchMode::Lexical,
|
||||
k: 1,
|
||||
filters: kebab_core::SearchFilters::default(),
|
||||
|
||||
178
crates/kebab-app/tests/file_deletion_auto_purge.rs
Normal file
178
crates/kebab-app/tests/file_deletion_auto_purge.rs
Normal file
@@ -0,0 +1,178 @@
|
||||
//! Dogfood: auto-purge stored docs for filesystem-deleted files.
|
||||
//!
|
||||
//! Two tests:
|
||||
//!
|
||||
//! 1. `file_deletion_auto_purge` — ingest 2 files, delete one, re-ingest.
|
||||
//! The re-ingest must report `purged_deleted_files = 1`, the deleted
|
||||
//! file must no longer appear in `list_docs`, and lexical search for
|
||||
//! its unique content must return no hits.
|
||||
//!
|
||||
//! 2. `include_scope_narrowing_does_not_purge` — ingest 2 files under a
|
||||
//! wide glob, narrow the walker scope to only one file, re-ingest.
|
||||
//! The narrowed ingest must NOT purge the out-of-scope file because
|
||||
//! the file is still on disk (just excluded from this run). Protects
|
||||
//! users against accidental data loss via config edits.
|
||||
|
||||
mod common;
|
||||
|
||||
use common::TestEnv;
|
||||
use kebab_app::ingest_with_config_opts;
|
||||
use kebab_app::IngestOpts;
|
||||
use kebab_core::{DocFilter, DocumentStore, SearchMode, SearchQuery, SourceScope};
|
||||
|
||||
/// Helper: open the store via `TestEnv` and run `list_documents`.
|
||||
fn list_doc_paths(env: &TestEnv) -> Vec<String> {
|
||||
use kebab_store_sqlite::SqliteStore;
|
||||
let store = SqliteStore::open(&env.config).unwrap();
|
||||
store.run_migrations().unwrap();
|
||||
store
|
||||
.list_documents(&DocFilter::default())
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.map(|d| d.doc_path.0)
|
||||
.collect()
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn file_deletion_auto_purge() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
// Write two .rs files into the workspace.
|
||||
let a_path = env.workspace_root.join("a.rs");
|
||||
let b_path = env.workspace_root.join("b.rs");
|
||||
std::fs::write(&a_path, "// file a\nfn alpha() {}\n").unwrap();
|
||||
std::fs::write(&b_path, "// file b\nfn bravo() {}\n").unwrap();
|
||||
|
||||
// First ingest — both must be New.
|
||||
let first = ingest_with_config_opts(
|
||||
env.config.clone(),
|
||||
env.scope(),
|
||||
false,
|
||||
IngestOpts::default(),
|
||||
)
|
||||
.expect("first ingest must succeed");
|
||||
// Only count the .rs files we added (there may be fixture files too).
|
||||
let first_new = first.new;
|
||||
assert!(first_new >= 2, "expected at least 2 new docs: {first:?}");
|
||||
assert_eq!(
|
||||
first.purged_deleted_files, 0,
|
||||
"no purges on first ingest: {first:?}"
|
||||
);
|
||||
assert_eq!(first.errors, 0, "no errors on first ingest: {first:?}");
|
||||
|
||||
// Delete one file from the filesystem.
|
||||
std::fs::remove_file(&b_path).expect("remove b.rs");
|
||||
|
||||
// Second ingest — scanned count drops by 1; b.rs should be purged.
|
||||
let second = ingest_with_config_opts(
|
||||
env.config.clone(),
|
||||
env.scope(),
|
||||
false,
|
||||
IngestOpts::default(),
|
||||
)
|
||||
.expect("second ingest must succeed");
|
||||
|
||||
assert_eq!(
|
||||
second.purged_deleted_files, 1,
|
||||
"exactly 1 file should be purged: {second:?}"
|
||||
);
|
||||
assert_eq!(second.new, 0, "no new docs after deletion: {second:?}");
|
||||
assert_eq!(second.updated, 0, "no updated docs: {second:?}");
|
||||
assert_eq!(second.errors, 0, "no errors: {second:?}");
|
||||
|
||||
// b.rs must no longer appear in list_docs.
|
||||
let doc_paths = list_doc_paths(&env);
|
||||
let b_ws_path = "b.rs";
|
||||
assert!(
|
||||
!doc_paths.iter().any(|p| p == b_ws_path),
|
||||
"b.rs must be gone from list_docs; got: {doc_paths:?}"
|
||||
);
|
||||
// a.rs must still be present.
|
||||
let a_ws_path = "a.rs";
|
||||
assert!(
|
||||
doc_paths.iter().any(|p| p == a_ws_path),
|
||||
"a.rs must still be in list_docs; got: {doc_paths:?}"
|
||||
);
|
||||
|
||||
// Lexical search for b.rs's unique content returns no hits.
|
||||
let app = env.app();
|
||||
let query = SearchQuery {
|
||||
text: "bravo".to_string(),
|
||||
mode: SearchMode::Lexical,
|
||||
k: 10,
|
||||
filters: kebab_core::SearchFilters::default(),
|
||||
};
|
||||
let hits = app.search(query).expect("search must not error");
|
||||
assert!(
|
||||
hits.is_empty(),
|
||||
"search for deleted file's content must return no hits; got: {hits:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn include_scope_narrowing_does_not_purge() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
// Write two .rs files.
|
||||
let a_path = env.workspace_root.join("a_narrow.rs");
|
||||
let b_path = env.workspace_root.join("b_narrow.rs");
|
||||
std::fs::write(&a_path, "// narrow a\nfn alpha_narrow() {}\n").unwrap();
|
||||
std::fs::write(&b_path, "// narrow b\nfn bravo_narrow() {}\n").unwrap();
|
||||
|
||||
// Wide scope: first ingest — both must be New.
|
||||
let wide_scope = SourceScope {
|
||||
root: env.workspace_root.clone(),
|
||||
include: vec!["**/*.rs".to_string()],
|
||||
exclude: env.config.workspace.exclude.clone(),
|
||||
};
|
||||
let first = ingest_with_config_opts(
|
||||
env.config.clone(),
|
||||
wide_scope,
|
||||
false,
|
||||
IngestOpts::default(),
|
||||
)
|
||||
.expect("first ingest (wide) must succeed");
|
||||
assert!(
|
||||
first.new >= 2,
|
||||
"expected at least 2 new docs: {first:?}"
|
||||
);
|
||||
assert_eq!(
|
||||
first.purged_deleted_files, 0,
|
||||
"no purges on first ingest: {first:?}"
|
||||
);
|
||||
|
||||
// Narrow scope: only a_narrow.rs in include — b_narrow.rs is still
|
||||
// on disk but excluded from the walker scope.
|
||||
let narrow_scope = SourceScope {
|
||||
root: env.workspace_root.clone(),
|
||||
include: vec!["a_narrow.rs".to_string()],
|
||||
exclude: env.config.workspace.exclude.clone(),
|
||||
};
|
||||
let second = ingest_with_config_opts(
|
||||
env.config.clone(),
|
||||
narrow_scope,
|
||||
false,
|
||||
IngestOpts::default(),
|
||||
)
|
||||
.expect("second ingest (narrow) must succeed");
|
||||
|
||||
// CRITICAL: b_narrow.rs is still on disk — must NOT be purged.
|
||||
assert_eq!(
|
||||
second.purged_deleted_files, 0,
|
||||
"scope-narrowing must NOT purge on-disk files; got: {second:?}"
|
||||
);
|
||||
assert_eq!(second.errors, 0, "no errors: {second:?}");
|
||||
|
||||
// b_narrow.rs must still exist in the store.
|
||||
let doc_paths = list_doc_paths(&env);
|
||||
let b_ws_path = "b_narrow.rs";
|
||||
assert!(
|
||||
doc_paths.iter().any(|p| p == b_ws_path),
|
||||
"b_narrow.rs must still be in list_docs after scope narrowing; got: {doc_paths:?}"
|
||||
);
|
||||
// And the file must still be on disk.
|
||||
assert!(
|
||||
b_path.exists(),
|
||||
"b_narrow.rs must still be on disk (we didn't delete it)"
|
||||
);
|
||||
}
|
||||
141
crates/kebab-app/tests/reset_orphans.rs
Normal file
141
crates/kebab-app/tests/reset_orphans.rs
Normal file
@@ -0,0 +1,141 @@
|
||||
//! Integration test for `kebab reset --orphans-only`.
|
||||
//!
|
||||
//! Verifies that stored docs outside the current walker scope are purged
|
||||
//! from the store without removing any files from the filesystem.
|
||||
//!
|
||||
//! Test outline:
|
||||
//! 1. Ingest 3 .rs files (a.rs, b.rs, c.rs) — all New.
|
||||
//! 2. Narrow the config `include` to `["a.rs"]` only; b.rs and c.rs are
|
||||
//! still on disk but outside the walker scope.
|
||||
//! 3. Run `execute(ResetScope::OrphansOnly, &cfg)` — report must show
|
||||
//! `orphans_purged == 2` and `purged_paths` contains b.rs + c.rs.
|
||||
//! 4. `list docs` must show only a.rs.
|
||||
//! 5. b.rs and c.rs must still exist on disk (no filesystem removal).
|
||||
//! 6. Second reset → `orphans_purged == 0` (idempotent).
|
||||
|
||||
mod common;
|
||||
|
||||
use common::TestEnv;
|
||||
use kebab_app::IngestOpts;
|
||||
use kebab_app::reset::{ResetScope, execute};
|
||||
use kebab_core::{DocFilter, DocumentStore, SourceScope};
|
||||
|
||||
/// Open the SqliteStore and list all `workspace_path` values.
|
||||
fn list_doc_paths(env: &TestEnv) -> Vec<String> {
|
||||
use kebab_store_sqlite::SqliteStore;
|
||||
let store = SqliteStore::open(&env.config).unwrap();
|
||||
store.run_migrations().unwrap();
|
||||
store
|
||||
.list_documents(&DocFilter::default())
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.map(|d| d.doc_path.0)
|
||||
.collect()
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn reset_orphans_only_purges_out_of_scope_docs() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
// Write three .rs files into the workspace.
|
||||
let a_path = env.workspace_root.join("a.rs");
|
||||
let b_path = env.workspace_root.join("b.rs");
|
||||
let c_path = env.workspace_root.join("c.rs");
|
||||
std::fs::write(&a_path, "// file a\nfn alpha() {}\n").unwrap();
|
||||
std::fs::write(&b_path, "// file b\nfn bravo() {}\n").unwrap();
|
||||
std::fs::write(&c_path, "// file c\nfn charlie() {}\n").unwrap();
|
||||
|
||||
// Ingest all three with a wide scope.
|
||||
let wide_scope = SourceScope {
|
||||
root: env.workspace_root.clone(),
|
||||
include: vec!["**/*.rs".to_string()],
|
||||
exclude: env.config.workspace.exclude.clone(),
|
||||
};
|
||||
let first = kebab_app::ingest_with_config_opts(
|
||||
env.config.clone(),
|
||||
wide_scope,
|
||||
false,
|
||||
IngestOpts::default(),
|
||||
)
|
||||
.expect("first ingest must succeed");
|
||||
// The fixture workspace may contain other .rs files — just assert we
|
||||
// got at least 3 new docs (our a.rs, b.rs, c.rs).
|
||||
assert!(first.new >= 3, "expected at least 3 new docs: {first:?}");
|
||||
assert_eq!(first.errors, 0, "no errors on first ingest");
|
||||
|
||||
// Narrow config to include only a.rs; b.rs + c.rs are still on disk.
|
||||
let mut narrow_cfg = env.config.clone();
|
||||
narrow_cfg.workspace.exclude.clear();
|
||||
// Re-point workspace root (already correct) and restrict include via
|
||||
// the SourceScope in the connector. The config's `workspace.root` is
|
||||
// used by `enumerate_orphans` to build its scope — we keep that
|
||||
// pointing at the workspace root. We simulate narrowing by setting a
|
||||
// glob that only matches a.rs.
|
||||
//
|
||||
// NOTE: `kebab_config::WorkspaceCfg` does not have an `include` field
|
||||
// (it was removed in p9-fb-25). We narrow the scope via the walker
|
||||
// exclude list: exclude b.rs and c.rs explicitly.
|
||||
narrow_cfg.workspace.exclude = vec!["b.rs".to_string(), "c.rs".to_string()];
|
||||
|
||||
// Run orphans-only reset.
|
||||
let report = execute(ResetScope::OrphansOnly, &narrow_cfg)
|
||||
.expect("orphans-only reset must succeed");
|
||||
|
||||
assert_eq!(
|
||||
report.orphans_purged, 2,
|
||||
"expected 2 orphans purged (b.rs + c.rs): {report:?}"
|
||||
);
|
||||
|
||||
let mut purged: Vec<String> = report
|
||||
.purged_paths
|
||||
.iter()
|
||||
.map(|p| p.0.clone())
|
||||
.collect();
|
||||
purged.sort();
|
||||
assert_eq!(
|
||||
purged,
|
||||
vec!["b.rs".to_string(), "c.rs".to_string()],
|
||||
"purged_paths must list b.rs and c.rs in sorted order: {purged:?}"
|
||||
);
|
||||
|
||||
// list docs must show only a.rs (and any pre-existing fixture files
|
||||
// that are not excluded by the narrow config).
|
||||
let doc_paths = list_doc_paths(&env);
|
||||
// The narrow_cfg excludes b.rs + c.rs — they must no longer be in store.
|
||||
assert!(
|
||||
!doc_paths.iter().any(|p| p == "b.rs"),
|
||||
"b.rs must be gone from store after orphans-only reset; got: {doc_paths:?}"
|
||||
);
|
||||
assert!(
|
||||
!doc_paths.iter().any(|p| p == "c.rs"),
|
||||
"c.rs must be gone from store after orphans-only reset; got: {doc_paths:?}"
|
||||
);
|
||||
assert!(
|
||||
doc_paths.iter().any(|p| p == "a.rs"),
|
||||
"a.rs must still be in store; got: {doc_paths:?}"
|
||||
);
|
||||
|
||||
// Both b.rs and c.rs must still exist on the filesystem — no file
|
||||
// removal is performed by orphans-only.
|
||||
assert!(
|
||||
b_path.exists(),
|
||||
"b.rs must still be on disk after orphans-only reset"
|
||||
);
|
||||
assert!(
|
||||
c_path.exists(),
|
||||
"c.rs must still be on disk after orphans-only reset"
|
||||
);
|
||||
|
||||
// Second reset must be idempotent: nothing left to purge.
|
||||
let second = execute(ResetScope::OrphansOnly, &narrow_cfg)
|
||||
.expect("second orphans-only reset must succeed");
|
||||
assert_eq!(
|
||||
second.orphans_purged, 0,
|
||||
"second reset must be idempotent (orphans_purged == 0): {second:?}"
|
||||
);
|
||||
assert!(
|
||||
second.purged_paths.is_empty(),
|
||||
"second reset purged_paths must be empty: {:?}",
|
||||
second.purged_paths
|
||||
);
|
||||
}
|
||||
@@ -46,3 +46,88 @@ fn korean_lexical_query_returns_korean_document() {
|
||||
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
|
||||
);
|
||||
}
|
||||
|
||||
/// A4 Step 1c — multi-token Korean query (`해시 충돌`) must hit when
|
||||
/// the lexical builder routes it through a whole-phrase MATCH candidate.
|
||||
///
|
||||
/// Expected: FAIL until A5 (`build_match_string` redesign) lands — the
|
||||
/// current builder emits `"해시" "충돌"` AND, but FTS5 trigram tokenizer
|
||||
/// has no 2-char terms so each side is 0-hit. A5 introduces a whole-
|
||||
/// phrase candidate (`"해시 충돌"`) OR'd with the token AND, restoring
|
||||
/// hits for the dominant Korean usage pattern.
|
||||
#[test]
|
||||
fn lexical_multi_token_korean_query_hits() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
// Copy the synthetic Korean fixture (introduced in A4 Step 0) into
|
||||
// the test workspace. The fixture contains the exact phrase
|
||||
// "해시 충돌" multiple times.
|
||||
let dest = env.workspace_root.join("hash-table.md");
|
||||
let src = std::path::PathBuf::from(env!("CARGO_MANIFEST_DIR"))
|
||||
.join("..")
|
||||
.join("..")
|
||||
.join("fixtures")
|
||||
.join("search")
|
||||
.join("korean")
|
||||
.join("hash-table.md");
|
||||
std::fs::copy(&src, &dest).expect("copy korean fixture");
|
||||
|
||||
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true)
|
||||
.expect("ingest must succeed");
|
||||
|
||||
let hits = kebab_app::search_with_config(
|
||||
env.config.clone(),
|
||||
common::lexical_query("해시 충돌"),
|
||||
)
|
||||
.expect("search must succeed");
|
||||
|
||||
assert!(
|
||||
!hits.is_empty(),
|
||||
"multi-token Korean query '해시 충돌' must hit the hash-table fixture; got {:?}",
|
||||
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
|
||||
);
|
||||
let any_hash_table = hits.iter().any(|h| h.doc_path.0.contains("hash-table"));
|
||||
assert!(
|
||||
any_hash_table,
|
||||
"expected at least one hit on the hash-table fixture, got: {:?}",
|
||||
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
|
||||
);
|
||||
}
|
||||
|
||||
/// A4 Step 1c — mixed Korean+English multi-token query (`Rust 충돌은`).
|
||||
/// Both tokens are ≥3 chars, so the redesigned builder (A5) emits
|
||||
/// `("Rust 충돌은") OR ("Rust" AND "충돌은")`. With trigram tokenizer
|
||||
/// each side has substring coverage in the document, so the AND branch
|
||||
/// alone is enough. Expected: FAIL pre-A5, PASS post-A5.
|
||||
#[test]
|
||||
fn lexical_mixed_korean_english_multi_token_query_hits() {
|
||||
let env = TestEnv::lexical_only();
|
||||
let doc_path = env.workspace_root.join("rust-hash.md");
|
||||
std::fs::write(
|
||||
&doc_path,
|
||||
"# Rust 해시 테이블\n\nRust 의 std::collections::HashMap 에서 \
|
||||
해시 충돌은 SipHash 로 완화한다.\n",
|
||||
)
|
||||
.expect("write rust-hash fixture");
|
||||
|
||||
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true)
|
||||
.expect("ingest must succeed");
|
||||
|
||||
let hits = kebab_app::search_with_config(
|
||||
env.config.clone(),
|
||||
common::lexical_query("Rust 충돌은"),
|
||||
)
|
||||
.expect("search must succeed");
|
||||
|
||||
assert!(
|
||||
!hits.is_empty(),
|
||||
"mixed Korean+English multi-token query 'Rust 충돌은' must hit the rust-hash fixture; got {:?}",
|
||||
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
|
||||
);
|
||||
let any_rust_hash = hits.iter().any(|h| h.doc_path.0.contains("rust-hash"));
|
||||
assert!(
|
||||
any_rust_hash,
|
||||
"expected at least one hit on the rust-hash fixture, got: {:?}",
|
||||
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
|
||||
);
|
||||
}
|
||||
|
||||
176
crates/kebab-app/tests/twin_files_fetch_span.rs
Normal file
176
crates/kebab-app/tests/twin_files_fetch_span.rs
Normal file
@@ -0,0 +1,176 @@
|
||||
//! Regression test for the twin-file fetch_span media-type lookup bug.
|
||||
//!
|
||||
//! Twin files (identical content at different workspace paths) share one
|
||||
//! `assets` row whose PRIMARY KEY is the blake3 content hash. The old
|
||||
//! `fetch_span` implementation called
|
||||
//! `get_asset_by_workspace_path(&doc.workspace_path)` to check whether the
|
||||
//! media type was PDF/audio (and therefore reject span fetch). For a twin
|
||||
//! file that lookup could silently return the *other* twin's asset row if
|
||||
//! `assets.workspace_path` had been overwritten on the most recent ingest of
|
||||
//! the sibling — making the media-type branch decision incorrect.
|
||||
//!
|
||||
//! Fix: `fetch_span` now uses the 2-step lookup
|
||||
//! `get_document_by_workspace_path` → `doc.source_asset_id` → `get_asset`
|
||||
//! so the result is always anchored to the requesting document, not
|
||||
//! whichever twin last updated `assets.workspace_path`.
|
||||
//!
|
||||
//! This test builds a twin-file scenario (two .md files at different paths
|
||||
//! with identical content), ingests both, then calls `fetch_span` on each
|
||||
//! twin's `doc_id` and asserts it succeeds. Before the fix, if the asset
|
||||
//! row's workspace_path happened to point at the wrong twin the span could
|
||||
//! return an incorrect `span_not_supported` for a non-PDF/audio file, or
|
||||
//! conversely allow span on a PDF twin by accident. After the fix, the
|
||||
//! lookup is always doc-specific.
|
||||
|
||||
mod common;
|
||||
|
||||
use common::TestEnv;
|
||||
use kebab_app::ingest_with_config;
|
||||
use kebab_core::{DocumentStore, FetchKind, FetchOpts, FetchQuery, IngestItemKind};
|
||||
|
||||
#[test]
|
||||
fn twin_files_fetch_span_uses_correct_asset() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
// Write two markdown files with identical content at different paths.
|
||||
let dir_a = env.workspace_root.join("src_a");
|
||||
let dir_b = env.workspace_root.join("src_b");
|
||||
std::fs::create_dir_all(&dir_a).unwrap();
|
||||
std::fs::create_dir_all(&dir_b).unwrap();
|
||||
|
||||
// The content must produce at least 1 line so span fetch is non-trivial.
|
||||
let content = "# Twin\n\nLine one.\n\nLine two.\n\nLine three.\n";
|
||||
std::fs::write(dir_a.join("note.md"), content).unwrap();
|
||||
std::fs::write(dir_b.join("note.md"), content).unwrap();
|
||||
|
||||
// Ingest all files (fixture workspace + our two new twins).
|
||||
let report = ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("ingest must succeed");
|
||||
assert_eq!(report.errors, 0, "no ingest errors; report={report:?}");
|
||||
|
||||
// Both twin paths must appear as New in the report.
|
||||
let items = report.items.as_ref().expect("items must be present");
|
||||
let twin_items: Vec<_> = items
|
||||
.iter()
|
||||
.filter(|i| {
|
||||
i.doc_path.0.ends_with("src_a/note.md")
|
||||
|| i.doc_path.0.ends_with("src_b/note.md")
|
||||
})
|
||||
.collect();
|
||||
assert_eq!(
|
||||
twin_items.len(),
|
||||
2,
|
||||
"exactly 2 twin items expected; items={items:?}"
|
||||
);
|
||||
for item in &twin_items {
|
||||
assert_eq!(
|
||||
item.kind,
|
||||
IngestItemKind::New,
|
||||
"each twin must be New; item={item:?}"
|
||||
);
|
||||
}
|
||||
|
||||
// Resolve doc_ids for both workspace paths.
|
||||
// The ingest layer normalises workspace_path to the path relative to
|
||||
// workspace_root (e.g. "src_a/note.md"), so we look up by that form.
|
||||
let store = kebab_store_sqlite::SqliteStore::open(&env.config).unwrap();
|
||||
store.run_migrations().unwrap();
|
||||
|
||||
// Find the twin items by matching on suffix so the test is robust to
|
||||
// however the workspace root is represented.
|
||||
let items = report.items.as_ref().expect("items must be present");
|
||||
let path_a_str = items
|
||||
.iter()
|
||||
.find(|i| i.doc_path.0.ends_with("src_a/note.md"))
|
||||
.map(|i| i.doc_path.0.clone())
|
||||
.expect("src_a/note.md must appear in ingest report");
|
||||
let path_b_str = items
|
||||
.iter()
|
||||
.find(|i| i.doc_path.0.ends_with("src_b/note.md"))
|
||||
.map(|i| i.doc_path.0.clone())
|
||||
.expect("src_b/note.md must appear in ingest report");
|
||||
|
||||
let path_a = kebab_core::WorkspacePath(path_a_str);
|
||||
let path_b = kebab_core::WorkspacePath(path_b_str);
|
||||
|
||||
let doc_a = store
|
||||
.get_document_by_workspace_path(&path_a)
|
||||
.expect("get_document_by_workspace_path path_a")
|
||||
.expect("doc_a must exist after ingest");
|
||||
let doc_b = store
|
||||
.get_document_by_workspace_path(&path_b)
|
||||
.expect("get_document_by_workspace_path path_b")
|
||||
.expect("doc_b must exist after ingest");
|
||||
|
||||
// Both twins share one asset_id (same content hash).
|
||||
assert_eq!(
|
||||
doc_a.source_asset_id, doc_b.source_asset_id,
|
||||
"twin files must share one asset_id"
|
||||
);
|
||||
|
||||
// Open App and issue span fetch on each twin's doc_id.
|
||||
let app = env.app();
|
||||
|
||||
let result_a = app
|
||||
.fetch(
|
||||
FetchQuery::Span {
|
||||
doc_id: doc_a.doc_id.clone(),
|
||||
line_start: 1,
|
||||
line_end: 2,
|
||||
},
|
||||
FetchOpts::default(),
|
||||
)
|
||||
.expect("fetch_span on twin A must succeed for a markdown file");
|
||||
assert_eq!(result_a.kind, FetchKind::Span);
|
||||
assert!(
|
||||
result_a.text.as_deref().is_some_and(|t| !t.is_empty()),
|
||||
"span text for twin A must not be empty"
|
||||
);
|
||||
|
||||
let result_b = app
|
||||
.fetch(
|
||||
FetchQuery::Span {
|
||||
doc_id: doc_b.doc_id.clone(),
|
||||
line_start: 1,
|
||||
line_end: 2,
|
||||
},
|
||||
FetchOpts::default(),
|
||||
)
|
||||
.expect("fetch_span on twin B must succeed for a markdown file");
|
||||
assert_eq!(result_b.kind, FetchKind::Span);
|
||||
assert!(
|
||||
result_b.text.as_deref().is_some_and(|t| !t.is_empty()),
|
||||
"span text for twin B must not be empty"
|
||||
);
|
||||
|
||||
// Ingest again to force the asset.workspace_path flip-flop, then
|
||||
// re-check. Pre-fix this was the scenario that triggered the bug:
|
||||
// after the second ingest the asset row's workspace_path could point
|
||||
// at either twin, making one twin's span fetch behave incorrectly.
|
||||
let report2 = ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("second ingest must succeed");
|
||||
assert_eq!(report2.errors, 0, "no ingest errors on second run; report={report2:?}");
|
||||
|
||||
// Re-open app after second ingest and verify span still works on both.
|
||||
let app2 = env.app();
|
||||
|
||||
app2.fetch(
|
||||
FetchQuery::Span {
|
||||
doc_id: doc_a.doc_id.clone(),
|
||||
line_start: 1,
|
||||
line_end: 3,
|
||||
},
|
||||
FetchOpts::default(),
|
||||
)
|
||||
.expect("fetch_span on twin A after flip-flop must still succeed");
|
||||
|
||||
app2.fetch(
|
||||
FetchQuery::Span {
|
||||
doc_id: doc_b.doc_id.clone(),
|
||||
line_start: 1,
|
||||
line_end: 3,
|
||||
},
|
||||
FetchOpts::default(),
|
||||
)
|
||||
.expect("fetch_span on twin B after flip-flop must still succeed");
|
||||
}
|
||||
90
crates/kebab-app/tests/twin_files_idempotent.rs
Normal file
90
crates/kebab-app/tests/twin_files_idempotent.rs
Normal file
@@ -0,0 +1,90 @@
|
||||
//! Regression test for the twin-file idempotency bug.
|
||||
//!
|
||||
//! Identical-content files at different workspace paths share one
|
||||
//! `assets` row (`asset_id` = blake3 content hash, PRIMARY KEY). The
|
||||
//! old UPSERT `ON CONFLICT(asset_id) DO UPDATE SET workspace_path =
|
||||
//! excluded.workspace_path` made each twin overwrite the other's path
|
||||
//! on every ingest, so `get_asset_by_workspace_path(path1)` returned
|
||||
//! None (or the wrong twin) → re-process every time.
|
||||
//!
|
||||
//! Fix: `try_skip_unchanged` now uses `get_document_by_workspace_path`
|
||||
//! instead. `documents.workspace_path` is UNIQUE (V001) so each twin
|
||||
//! has its own stable document row.
|
||||
//!
|
||||
//! Assertion contract:
|
||||
//! 1st ingest → 2 New (one per twin)
|
||||
//! 2nd ingest → 0 New, 0 Updated, 2 Unchanged
|
||||
|
||||
mod common;
|
||||
|
||||
use common::TestEnv;
|
||||
use kebab_app::ingest_with_config;
|
||||
use kebab_core::IngestItemKind;
|
||||
|
||||
#[test]
|
||||
fn twin_files_second_ingest_is_unchanged() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
// Write two files with identical content at different paths.
|
||||
let pkg_a = env.workspace_root.join("pkg_a");
|
||||
let pkg_b = env.workspace_root.join("pkg_b");
|
||||
std::fs::create_dir_all(&pkg_a).unwrap();
|
||||
std::fs::create_dir_all(&pkg_b).unwrap();
|
||||
|
||||
let content = b"# shared\nThis content is identical in both files.\n";
|
||||
std::fs::write(pkg_a.join("__init__.py"), content).unwrap();
|
||||
std::fs::write(pkg_b.join("__init__.py"), content).unwrap();
|
||||
|
||||
// First ingest — both files must be New.
|
||||
let first = ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("first ingest must succeed");
|
||||
assert_eq!(first.errors, 0, "first ingest: no errors; report={first:?}");
|
||||
|
||||
let items = first.items.as_ref().expect("items must be present");
|
||||
let twin_items: Vec<_> = items
|
||||
.iter()
|
||||
.filter(|i| {
|
||||
i.doc_path.0.ends_with("__init__.py")
|
||||
})
|
||||
.collect();
|
||||
assert_eq!(
|
||||
twin_items.len(),
|
||||
2,
|
||||
"first ingest: expected exactly 2 __init__.py items; items={items:?}"
|
||||
);
|
||||
for item in &twin_items {
|
||||
assert_eq!(
|
||||
item.kind,
|
||||
IngestItemKind::New,
|
||||
"first ingest: each twin must be New; item={item:?}"
|
||||
);
|
||||
}
|
||||
|
||||
// Second ingest — same files, same content → both must be Unchanged.
|
||||
let second = ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("second ingest must succeed");
|
||||
assert_eq!(second.errors, 0, "second ingest: no errors; report={second:?}");
|
||||
assert_eq!(second.new, 0, "second ingest: no new docs; report={second:?}");
|
||||
assert_eq!(
|
||||
second.updated, 0,
|
||||
"second ingest: no updated docs (twin-file bug would set this to 2); report={second:?}"
|
||||
);
|
||||
|
||||
let second_items = second.items.as_ref().expect("items must be present");
|
||||
let twin_items2: Vec<_> = second_items
|
||||
.iter()
|
||||
.filter(|i| i.doc_path.0.ends_with("__init__.py"))
|
||||
.collect();
|
||||
assert_eq!(
|
||||
twin_items2.len(),
|
||||
2,
|
||||
"second ingest: expected exactly 2 __init__.py items; items={second_items:?}"
|
||||
);
|
||||
for item in &twin_items2 {
|
||||
assert_eq!(
|
||||
item.kind,
|
||||
IngestItemKind::Unchanged,
|
||||
"second ingest: each twin must be Unchanged; item={item:?}"
|
||||
);
|
||||
}
|
||||
}
|
||||
@@ -13,14 +13,16 @@ serde_json_canonicalizer = "0.3"
|
||||
blake3 = { workspace = true }
|
||||
anyhow = { workspace = true }
|
||||
tracing = { workspace = true }
|
||||
serde_yaml = { workspace = true }
|
||||
|
||||
[dev-dependencies]
|
||||
# kb-parse-md / kb-normalize are dev-only — used by the snapshot integration
|
||||
# test to build a CanonicalDocument from a fixture Markdown file. Forbidden as
|
||||
# regular deps per design §8 (chunker consumes CanonicalDocument from kb-core
|
||||
# only); `cargo tree -p kb-chunk --depth 1` (default scope, excludes dev-deps)
|
||||
# confirms this.
|
||||
kebab-parse-md = { path = "../kebab-parse-md" }
|
||||
kebab-normalize = { path = "../kebab-normalize" }
|
||||
serde_json = { workspace = true }
|
||||
time = { workspace = true }
|
||||
# kb-parse-md / kb-normalize / kb-parse-code are dev-only — used by the
|
||||
# snapshot integration tests to build a CanonicalDocument from fixture files.
|
||||
# Forbidden as regular deps per design §8 (chunker consumes CanonicalDocument
|
||||
# from kb-core only); `cargo tree -p kb-chunk --depth 1` (default scope,
|
||||
# excludes dev-deps) confirms this.
|
||||
kebab-parse-md = { path = "../kebab-parse-md" }
|
||||
kebab-parse-code = { path = "../kebab-parse-code" }
|
||||
kebab-normalize = { path = "../kebab-normalize" }
|
||||
serde_json = { workspace = true }
|
||||
time = { workspace = true }
|
||||
|
||||
322
crates/kebab-chunk/src/code_c_ast_v1.rs
Normal file
322
crates/kebab-chunk/src/code_c_ast_v1.rs
Normal file
@@ -0,0 +1,322 @@
|
||||
//! `code-c-ast-v1` — maps a tree-sitter-derived C AST
|
||||
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
|
||||
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
|
||||
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
|
||||
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
|
||||
//!
|
||||
//! tree-sitter is intentionally NOT a dependency here: AST work is
|
||||
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
|
||||
//! consumes the `CanonicalDocument`.
|
||||
//!
|
||||
//! `AST_CHUNK_MAX_LINES` is a constant matching
|
||||
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
|
||||
//! config threading needs a chunker registry (P+); same deviation
|
||||
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
|
||||
//! (`tasks/HOTFIXES.md`).
|
||||
|
||||
use kebab_core::{
|
||||
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
|
||||
SourceSpan, id_for_chunk,
|
||||
};
|
||||
|
||||
const VERSION_LABEL: &str = "code-c-ast-v1";
|
||||
const BYTES_PER_TOKEN: usize = 3;
|
||||
const POLICY_HASH_HEX_LEN: usize = 16;
|
||||
const AST_CHUNK_MAX_LINES: u32 = 200;
|
||||
|
||||
#[derive(Clone, Copy, Debug, Default)]
|
||||
pub struct CodeCAstV1Chunker;
|
||||
|
||||
impl Chunker for CodeCAstV1Chunker {
|
||||
fn chunker_version(&self) -> ChunkerVersion {
|
||||
ChunkerVersion(VERSION_LABEL.to_string())
|
||||
}
|
||||
|
||||
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
|
||||
let bytes = serde_json_canonicalizer::to_vec(policy)
|
||||
.expect("canonical JSON serialization of ChunkPolicy must not fail");
|
||||
let hex = blake3::hash(&bytes).to_hex().to_string();
|
||||
hex[..POLICY_HASH_HEX_LEN].to_string()
|
||||
}
|
||||
|
||||
fn chunk(
|
||||
&self,
|
||||
doc: &CanonicalDocument,
|
||||
policy: &ChunkPolicy,
|
||||
) -> anyhow::Result<Vec<Chunk>> {
|
||||
for b in &doc.blocks {
|
||||
let c = match b {
|
||||
Block::Code(c) => c,
|
||||
_ => anyhow::bail!(
|
||||
"CodeCAstV1Chunker only handles code docs (got non-Code block)"
|
||||
),
|
||||
};
|
||||
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
|
||||
anyhow::bail!(
|
||||
"CodeCAstV1Chunker only handles code docs (got non-Code source_span)"
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
let base_policy_hash = self.policy_hash(policy);
|
||||
let chunker_version = self.chunker_version();
|
||||
let mut out: Vec<Chunk> = Vec::new();
|
||||
|
||||
for b in &doc.blocks {
|
||||
let cb = match b {
|
||||
Block::Code(c) => c,
|
||||
_ => unreachable!("validated above"),
|
||||
};
|
||||
let (ls, le, symbol, lang) = match &cb.common.source_span {
|
||||
SourceSpan::Code { line_start, line_end, symbol, lang } => {
|
||||
(*line_start, *line_end, symbol.clone(), lang.clone())
|
||||
}
|
||||
_ => unreachable!("validated above"),
|
||||
};
|
||||
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
|
||||
let span_lines = le.saturating_sub(ls) + 1;
|
||||
|
||||
if span_lines <= AST_CHUNK_MAX_LINES {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: ls,
|
||||
line_end: le,
|
||||
symbol: symbol.clone(),
|
||||
lang: lang.clone(),
|
||||
};
|
||||
out.push(make_chunk(
|
||||
doc, &chunker_version, &block_ids, &base_policy_hash,
|
||||
None, span, cb.code.clone(),
|
||||
));
|
||||
} else {
|
||||
let parts = split_oversize(&cb.code);
|
||||
let n = parts.len();
|
||||
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
|
||||
let part_ls = ls + off_start;
|
||||
let part_le = ls + off_end;
|
||||
let part_sym = symbol
|
||||
.as_ref()
|
||||
.map(|s| format!("{s} [part {}/{n}]", i + 1));
|
||||
let span = SourceSpan::Code {
|
||||
line_start: part_ls,
|
||||
line_end: part_le,
|
||||
symbol: part_sym,
|
||||
lang: lang.clone(),
|
||||
};
|
||||
out.push(make_chunk(
|
||||
doc, &chunker_version, &block_ids, &base_policy_hash,
|
||||
Some(part_ls), span, text,
|
||||
));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-chunk",
|
||||
doc_id = %doc.doc_id,
|
||||
chunks = out.len(),
|
||||
"code-c-ast-v1 chunked",
|
||||
);
|
||||
Ok(out)
|
||||
}
|
||||
}
|
||||
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
fn make_chunk(
|
||||
doc: &CanonicalDocument,
|
||||
chunker_version: &ChunkerVersion,
|
||||
block_ids: &[BlockId],
|
||||
base_policy_hash: &str,
|
||||
split_key: Option<u32>,
|
||||
span: SourceSpan,
|
||||
text: String,
|
||||
) -> Chunk {
|
||||
let id_hash = match split_key {
|
||||
Some(k) => format!("{base_policy_hash}#L{k}"),
|
||||
None => base_policy_hash.to_string(),
|
||||
};
|
||||
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
|
||||
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
|
||||
Chunk {
|
||||
chunk_id,
|
||||
doc_id: DocumentId(doc.doc_id.0.clone()),
|
||||
block_ids: block_ids.to_vec(),
|
||||
text,
|
||||
heading_path: Vec::new(),
|
||||
source_spans: vec![span],
|
||||
token_estimate,
|
||||
chunker_version: chunker_version.clone(),
|
||||
policy_hash: base_policy_hash.to_string(),
|
||||
}
|
||||
}
|
||||
|
||||
/// Split an oversize unit at blank-line paragraph boundaries, greedily
|
||||
/// gluing paragraphs until ~`AST_CHUNK_MAX_LINES` lines accumulate.
|
||||
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
|
||||
/// 0-based within the unit (caller adds the unit's absolute `line_start`).
|
||||
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
|
||||
let lines: Vec<&str> = code.split('\n').collect();
|
||||
let total = lines.len() as u32;
|
||||
let mut out: Vec<(u32, u32, String)> = Vec::new();
|
||||
let mut start: u32 = 0;
|
||||
while start < total {
|
||||
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
|
||||
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
|
||||
if end < total {
|
||||
if let Some(b) = (floor.min(end)..end)
|
||||
.rev()
|
||||
.find(|&i| lines[i as usize].trim().is_empty())
|
||||
{
|
||||
end = b + 1;
|
||||
}
|
||||
}
|
||||
let text = lines[start as usize..end as usize].join("\n");
|
||||
out.push((start, end.saturating_sub(1), text));
|
||||
start = end;
|
||||
}
|
||||
if out.is_empty() {
|
||||
out.push((0, total.saturating_sub(1), code.to_string()));
|
||||
}
|
||||
out
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use kebab_core::{
|
||||
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
|
||||
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
|
||||
SourceType, TrustLevel, WorkspacePath,
|
||||
};
|
||||
use time::OffsetDateTime;
|
||||
|
||||
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
|
||||
let wp = WorkspacePath("crates/x/src/a.c".into());
|
||||
let aid = AssetId("a".repeat(64));
|
||||
let pv = ParserVersion("code-c-v1".into());
|
||||
let doc_id = id_for_doc(&wp, &aid, &pv);
|
||||
let blocks = units
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, (sym, ls, le, code))| {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: *ls,
|
||||
line_end: *le,
|
||||
symbol: Some((*sym).to_string()),
|
||||
lang: Some("c".into()),
|
||||
};
|
||||
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
|
||||
Block::Code(CodeBlock {
|
||||
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
|
||||
lang: Some("c".into()),
|
||||
code: (*code).to_string(),
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
CanonicalDocument {
|
||||
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
|
||||
lang: Lang("und".into()), blocks,
|
||||
metadata: Metadata {
|
||||
aliases: vec![], tags: vec![],
|
||||
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None, user: Default::default(),
|
||||
repo: Some("kebab".into()), git_branch: Some("main".into()),
|
||||
git_commit: Some("0".repeat(40)), code_lang: Some("c".into()),
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: pv, schema_version: 1, doc_version: 1,
|
||||
last_chunker_version: None, last_embedding_version: None,
|
||||
}
|
||||
}
|
||||
fn policy() -> ChunkPolicy {
|
||||
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
|
||||
respect_markdown_headings: false,
|
||||
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn chunker_version_is_code_c_ast_v1() {
|
||||
assert_eq!(CodeCAstV1Chunker.chunker_version(),
|
||||
ChunkerVersion("code-c-ast-v1".into()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn one_chunk_per_unit_preserves_code_span() {
|
||||
let doc = code_doc(&[
|
||||
("parse", 1, 3, "int parse() {\n\t// x\n}"),
|
||||
("print", 5, 7, "void print() {\n\t//\n\treturn;\n}"),
|
||||
]);
|
||||
let chunks = CodeCAstV1Chunker.chunk(&doc, &policy()).unwrap();
|
||||
assert_eq!(chunks.len(), 2);
|
||||
for c in &chunks {
|
||||
assert_eq!(c.source_spans.len(), 1);
|
||||
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
|
||||
assert_eq!(c.heading_path, Vec::<String>::new());
|
||||
assert_eq!(c.chunker_version.0, "code-c-ast-v1");
|
||||
}
|
||||
match &chunks[0].source_spans[0] {
|
||||
SourceSpan::Code { symbol, line_start, line_end, .. } => {
|
||||
assert_eq!(symbol.as_deref(), Some("parse"));
|
||||
assert_eq!((*line_start, *line_end), (1, 3));
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn oversize_unit_splits_into_parts_with_unique_ids() {
|
||||
let body = (0..500).map(|i| format!("\tx{i} = {i};\n")).collect::<Vec<_>>().join("");
|
||||
let code = format!("int big() {{\n{body}\n}}");
|
||||
let doc = code_doc(&[("big", 1, 502, &code)]);
|
||||
let chunks = CodeCAstV1Chunker.chunk(&doc, &policy()).unwrap();
|
||||
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
|
||||
for c in &chunks {
|
||||
match &c.source_spans[0] {
|
||||
SourceSpan::Code { symbol, .. } => {
|
||||
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
|
||||
"part-numbered symbol, got {symbol:?}");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
}
|
||||
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
|
||||
let n = ids.len(); ids.sort(); ids.dedup();
|
||||
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn non_code_doc_errors() {
|
||||
use kebab_core::TextBlock;
|
||||
let mut doc = code_doc(&[("parse", 1, 1, "int parse() {}")]);
|
||||
doc.blocks = vec![Block::Paragraph(TextBlock {
|
||||
common: CommonBlock {
|
||||
block_id: kebab_core::BlockId("b".into()),
|
||||
heading_path: vec![],
|
||||
source_span: SourceSpan::Line { start: 1, end: 1 },
|
||||
},
|
||||
text: "x".into(), inlines: vec![],
|
||||
})];
|
||||
let err = CodeCAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
|
||||
assert!(err.to_string().contains("CodeCAstV1Chunker"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deterministic_chunk_ids_1000() {
|
||||
let doc = code_doc(&[("parse", 1, 2, "int parse() {}\n")]);
|
||||
let base: Vec<String> = CodeCAstV1Chunker.chunk(&doc, &policy())
|
||||
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
|
||||
for _ in 0..1000 {
|
||||
let again: Vec<String> = CodeCAstV1Chunker.chunk(&doc, &policy())
|
||||
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
|
||||
assert_eq!(again, base);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn policy_hash_matches_md_heading_v1() {
|
||||
let p = policy();
|
||||
assert_eq!(CodeCAstV1Chunker.policy_hash(&p),
|
||||
crate::MdHeadingV1Chunker.policy_hash(&p));
|
||||
}
|
||||
}
|
||||
322
crates/kebab-chunk/src/code_cpp_ast_v1.rs
Normal file
322
crates/kebab-chunk/src/code_cpp_ast_v1.rs
Normal file
@@ -0,0 +1,322 @@
|
||||
//! `code-cpp-ast-v1` — maps a tree-sitter-derived C++ AST
|
||||
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
|
||||
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
|
||||
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
|
||||
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
|
||||
//!
|
||||
//! tree-sitter is intentionally NOT a dependency here: AST work is
|
||||
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
|
||||
//! consumes the `CanonicalDocument`.
|
||||
//!
|
||||
//! `AST_CHUNK_MAX_LINES` is a constant matching
|
||||
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
|
||||
//! config threading needs a chunker registry (P+); same deviation
|
||||
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
|
||||
//! (`tasks/HOTFIXES.md`).
|
||||
|
||||
use kebab_core::{
|
||||
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
|
||||
SourceSpan, id_for_chunk,
|
||||
};
|
||||
|
||||
const VERSION_LABEL: &str = "code-cpp-ast-v1";
|
||||
const BYTES_PER_TOKEN: usize = 3;
|
||||
const POLICY_HASH_HEX_LEN: usize = 16;
|
||||
const AST_CHUNK_MAX_LINES: u32 = 200;
|
||||
|
||||
#[derive(Clone, Copy, Debug, Default)]
|
||||
pub struct CodeCppAstV1Chunker;
|
||||
|
||||
impl Chunker for CodeCppAstV1Chunker {
|
||||
fn chunker_version(&self) -> ChunkerVersion {
|
||||
ChunkerVersion(VERSION_LABEL.to_string())
|
||||
}
|
||||
|
||||
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
|
||||
let bytes = serde_json_canonicalizer::to_vec(policy)
|
||||
.expect("canonical JSON serialization of ChunkPolicy must not fail");
|
||||
let hex = blake3::hash(&bytes).to_hex().to_string();
|
||||
hex[..POLICY_HASH_HEX_LEN].to_string()
|
||||
}
|
||||
|
||||
fn chunk(
|
||||
&self,
|
||||
doc: &CanonicalDocument,
|
||||
policy: &ChunkPolicy,
|
||||
) -> anyhow::Result<Vec<Chunk>> {
|
||||
for b in &doc.blocks {
|
||||
let c = match b {
|
||||
Block::Code(c) => c,
|
||||
_ => anyhow::bail!(
|
||||
"CodeCppAstV1Chunker only handles code docs (got non-Code block)"
|
||||
),
|
||||
};
|
||||
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
|
||||
anyhow::bail!(
|
||||
"CodeCppAstV1Chunker only handles code docs (got non-Code source_span)"
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
let base_policy_hash = self.policy_hash(policy);
|
||||
let chunker_version = self.chunker_version();
|
||||
let mut out: Vec<Chunk> = Vec::new();
|
||||
|
||||
for b in &doc.blocks {
|
||||
let cb = match b {
|
||||
Block::Code(c) => c,
|
||||
_ => unreachable!("validated above"),
|
||||
};
|
||||
let (ls, le, symbol, lang) = match &cb.common.source_span {
|
||||
SourceSpan::Code { line_start, line_end, symbol, lang } => {
|
||||
(*line_start, *line_end, symbol.clone(), lang.clone())
|
||||
}
|
||||
_ => unreachable!("validated above"),
|
||||
};
|
||||
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
|
||||
let span_lines = le.saturating_sub(ls) + 1;
|
||||
|
||||
if span_lines <= AST_CHUNK_MAX_LINES {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: ls,
|
||||
line_end: le,
|
||||
symbol: symbol.clone(),
|
||||
lang: lang.clone(),
|
||||
};
|
||||
out.push(make_chunk(
|
||||
doc, &chunker_version, &block_ids, &base_policy_hash,
|
||||
None, span, cb.code.clone(),
|
||||
));
|
||||
} else {
|
||||
let parts = split_oversize(&cb.code);
|
||||
let n = parts.len();
|
||||
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
|
||||
let part_ls = ls + off_start;
|
||||
let part_le = ls + off_end;
|
||||
let part_sym = symbol
|
||||
.as_ref()
|
||||
.map(|s| format!("{s} [part {}/{n}]", i + 1));
|
||||
let span = SourceSpan::Code {
|
||||
line_start: part_ls,
|
||||
line_end: part_le,
|
||||
symbol: part_sym,
|
||||
lang: lang.clone(),
|
||||
};
|
||||
out.push(make_chunk(
|
||||
doc, &chunker_version, &block_ids, &base_policy_hash,
|
||||
Some(part_ls), span, text,
|
||||
));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-chunk",
|
||||
doc_id = %doc.doc_id,
|
||||
chunks = out.len(),
|
||||
"code-cpp-ast-v1 chunked",
|
||||
);
|
||||
Ok(out)
|
||||
}
|
||||
}
|
||||
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
fn make_chunk(
|
||||
doc: &CanonicalDocument,
|
||||
chunker_version: &ChunkerVersion,
|
||||
block_ids: &[BlockId],
|
||||
base_policy_hash: &str,
|
||||
split_key: Option<u32>,
|
||||
span: SourceSpan,
|
||||
text: String,
|
||||
) -> Chunk {
|
||||
let id_hash = match split_key {
|
||||
Some(k) => format!("{base_policy_hash}#L{k}"),
|
||||
None => base_policy_hash.to_string(),
|
||||
};
|
||||
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
|
||||
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
|
||||
Chunk {
|
||||
chunk_id,
|
||||
doc_id: DocumentId(doc.doc_id.0.clone()),
|
||||
block_ids: block_ids.to_vec(),
|
||||
text,
|
||||
heading_path: Vec::new(),
|
||||
source_spans: vec![span],
|
||||
token_estimate,
|
||||
chunker_version: chunker_version.clone(),
|
||||
policy_hash: base_policy_hash.to_string(),
|
||||
}
|
||||
}
|
||||
|
||||
/// Split an oversize unit at blank-line paragraph boundaries, greedily
|
||||
/// gluing paragraphs until ~`AST_CHUNK_MAX_LINES` lines accumulate.
|
||||
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
|
||||
/// 0-based within the unit (caller adds the unit's absolute `line_start`).
|
||||
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
|
||||
let lines: Vec<&str> = code.split('\n').collect();
|
||||
let total = lines.len() as u32;
|
||||
let mut out: Vec<(u32, u32, String)> = Vec::new();
|
||||
let mut start: u32 = 0;
|
||||
while start < total {
|
||||
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
|
||||
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
|
||||
if end < total {
|
||||
if let Some(b) = (floor.min(end)..end)
|
||||
.rev()
|
||||
.find(|&i| lines[i as usize].trim().is_empty())
|
||||
{
|
||||
end = b + 1;
|
||||
}
|
||||
}
|
||||
let text = lines[start as usize..end as usize].join("\n");
|
||||
out.push((start, end.saturating_sub(1), text));
|
||||
start = end;
|
||||
}
|
||||
if out.is_empty() {
|
||||
out.push((0, total.saturating_sub(1), code.to_string()));
|
||||
}
|
||||
out
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use kebab_core::{
|
||||
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
|
||||
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
|
||||
SourceType, TrustLevel, WorkspacePath,
|
||||
};
|
||||
use time::OffsetDateTime;
|
||||
|
||||
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
|
||||
let wp = WorkspacePath("crates/x/src/a.cpp".into());
|
||||
let aid = AssetId("a".repeat(64));
|
||||
let pv = ParserVersion("code-cpp-v1".into());
|
||||
let doc_id = id_for_doc(&wp, &aid, &pv);
|
||||
let blocks = units
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, (sym, ls, le, code))| {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: *ls,
|
||||
line_end: *le,
|
||||
symbol: Some((*sym).to_string()),
|
||||
lang: Some("cpp".into()),
|
||||
};
|
||||
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
|
||||
Block::Code(CodeBlock {
|
||||
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
|
||||
lang: Some("cpp".into()),
|
||||
code: (*code).to_string(),
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
CanonicalDocument {
|
||||
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
|
||||
lang: Lang("und".into()), blocks,
|
||||
metadata: Metadata {
|
||||
aliases: vec![], tags: vec![],
|
||||
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None, user: Default::default(),
|
||||
repo: Some("kebab".into()), git_branch: Some("main".into()),
|
||||
git_commit: Some("0".repeat(40)), code_lang: Some("cpp".into()),
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: pv, schema_version: 1, doc_version: 1,
|
||||
last_chunker_version: None, last_embedding_version: None,
|
||||
}
|
||||
}
|
||||
fn policy() -> ChunkPolicy {
|
||||
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
|
||||
respect_markdown_headings: false,
|
||||
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn chunker_version_is_code_cpp_ast_v1() {
|
||||
assert_eq!(CodeCppAstV1Chunker.chunker_version(),
|
||||
ChunkerVersion("code-cpp-ast-v1".into()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn one_chunk_per_unit_preserves_code_span() {
|
||||
let doc = code_doc(&[
|
||||
("parse", 1, 3, "int parse() {\n\t// x\n}"),
|
||||
("print", 5, 7, "void print() {\n\t//\n\treturn;\n}"),
|
||||
]);
|
||||
let chunks = CodeCppAstV1Chunker.chunk(&doc, &policy()).unwrap();
|
||||
assert_eq!(chunks.len(), 2);
|
||||
for c in &chunks {
|
||||
assert_eq!(c.source_spans.len(), 1);
|
||||
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
|
||||
assert_eq!(c.heading_path, Vec::<String>::new());
|
||||
assert_eq!(c.chunker_version.0, "code-cpp-ast-v1");
|
||||
}
|
||||
match &chunks[0].source_spans[0] {
|
||||
SourceSpan::Code { symbol, line_start, line_end, .. } => {
|
||||
assert_eq!(symbol.as_deref(), Some("parse"));
|
||||
assert_eq!((*line_start, *line_end), (1, 3));
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn oversize_unit_splits_into_parts_with_unique_ids() {
|
||||
let body = (0..500).map(|i| format!("\tx{i} = {i};\n")).collect::<Vec<_>>().join("");
|
||||
let code = format!("int big() {{\n{body}\n}}");
|
||||
let doc = code_doc(&[("big", 1, 502, &code)]);
|
||||
let chunks = CodeCppAstV1Chunker.chunk(&doc, &policy()).unwrap();
|
||||
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
|
||||
for c in &chunks {
|
||||
match &c.source_spans[0] {
|
||||
SourceSpan::Code { symbol, .. } => {
|
||||
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
|
||||
"part-numbered symbol, got {symbol:?}");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
}
|
||||
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
|
||||
let n = ids.len(); ids.sort(); ids.dedup();
|
||||
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn non_code_doc_errors() {
|
||||
use kebab_core::TextBlock;
|
||||
let mut doc = code_doc(&[("parse", 1, 1, "int parse() {}")]);
|
||||
doc.blocks = vec![Block::Paragraph(TextBlock {
|
||||
common: CommonBlock {
|
||||
block_id: kebab_core::BlockId("b".into()),
|
||||
heading_path: vec![],
|
||||
source_span: SourceSpan::Line { start: 1, end: 1 },
|
||||
},
|
||||
text: "x".into(), inlines: vec![],
|
||||
})];
|
||||
let err = CodeCppAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
|
||||
assert!(err.to_string().contains("CodeCppAstV1Chunker"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deterministic_chunk_ids_1000() {
|
||||
let doc = code_doc(&[("parse", 1, 2, "int parse() {}\n")]);
|
||||
let base: Vec<String> = CodeCppAstV1Chunker.chunk(&doc, &policy())
|
||||
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
|
||||
for _ in 0..1000 {
|
||||
let again: Vec<String> = CodeCppAstV1Chunker.chunk(&doc, &policy())
|
||||
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
|
||||
assert_eq!(again, base);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn policy_hash_matches_md_heading_v1() {
|
||||
let p = policy();
|
||||
assert_eq!(CodeCppAstV1Chunker.policy_hash(&p),
|
||||
crate::MdHeadingV1Chunker.policy_hash(&p));
|
||||
}
|
||||
}
|
||||
322
crates/kebab-chunk/src/code_go_ast_v1.rs
Normal file
322
crates/kebab-chunk/src/code_go_ast_v1.rs
Normal file
@@ -0,0 +1,322 @@
|
||||
//! `code-go-ast-v1` — maps a tree-sitter-derived Go AST
|
||||
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
|
||||
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
|
||||
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
|
||||
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
|
||||
//!
|
||||
//! tree-sitter is intentionally NOT a dependency here: AST work is
|
||||
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
|
||||
//! consumes the `CanonicalDocument`.
|
||||
//!
|
||||
//! `AST_CHUNK_MAX_LINES` is a constant matching
|
||||
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
|
||||
//! config threading needs a chunker registry (P+); same deviation
|
||||
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
|
||||
//! (`tasks/HOTFIXES.md`).
|
||||
|
||||
use kebab_core::{
|
||||
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
|
||||
SourceSpan, id_for_chunk,
|
||||
};
|
||||
|
||||
const VERSION_LABEL: &str = "code-go-ast-v1";
|
||||
const BYTES_PER_TOKEN: usize = 3;
|
||||
const POLICY_HASH_HEX_LEN: usize = 16;
|
||||
const AST_CHUNK_MAX_LINES: u32 = 200;
|
||||
|
||||
#[derive(Clone, Copy, Debug, Default)]
|
||||
pub struct CodeGoAstV1Chunker;
|
||||
|
||||
impl Chunker for CodeGoAstV1Chunker {
|
||||
fn chunker_version(&self) -> ChunkerVersion {
|
||||
ChunkerVersion(VERSION_LABEL.to_string())
|
||||
}
|
||||
|
||||
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
|
||||
let bytes = serde_json_canonicalizer::to_vec(policy)
|
||||
.expect("canonical JSON serialization of ChunkPolicy must not fail");
|
||||
let hex = blake3::hash(&bytes).to_hex().to_string();
|
||||
hex[..POLICY_HASH_HEX_LEN].to_string()
|
||||
}
|
||||
|
||||
fn chunk(
|
||||
&self,
|
||||
doc: &CanonicalDocument,
|
||||
policy: &ChunkPolicy,
|
||||
) -> anyhow::Result<Vec<Chunk>> {
|
||||
for b in &doc.blocks {
|
||||
let c = match b {
|
||||
Block::Code(c) => c,
|
||||
_ => anyhow::bail!(
|
||||
"CodeGoAstV1Chunker only handles code docs (got non-Code block)"
|
||||
),
|
||||
};
|
||||
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
|
||||
anyhow::bail!(
|
||||
"CodeGoAstV1Chunker only handles code docs (got non-Code source_span)"
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
let base_policy_hash = self.policy_hash(policy);
|
||||
let chunker_version = self.chunker_version();
|
||||
let mut out: Vec<Chunk> = Vec::new();
|
||||
|
||||
for b in &doc.blocks {
|
||||
let cb = match b {
|
||||
Block::Code(c) => c,
|
||||
_ => unreachable!("validated above"),
|
||||
};
|
||||
let (ls, le, symbol, lang) = match &cb.common.source_span {
|
||||
SourceSpan::Code { line_start, line_end, symbol, lang } => {
|
||||
(*line_start, *line_end, symbol.clone(), lang.clone())
|
||||
}
|
||||
_ => unreachable!("validated above"),
|
||||
};
|
||||
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
|
||||
let span_lines = le.saturating_sub(ls) + 1;
|
||||
|
||||
if span_lines <= AST_CHUNK_MAX_LINES {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: ls,
|
||||
line_end: le,
|
||||
symbol: symbol.clone(),
|
||||
lang: lang.clone(),
|
||||
};
|
||||
out.push(make_chunk(
|
||||
doc, &chunker_version, &block_ids, &base_policy_hash,
|
||||
None, span, cb.code.clone(),
|
||||
));
|
||||
} else {
|
||||
let parts = split_oversize(&cb.code);
|
||||
let n = parts.len();
|
||||
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
|
||||
let part_ls = ls + off_start;
|
||||
let part_le = ls + off_end;
|
||||
let part_sym = symbol
|
||||
.as_ref()
|
||||
.map(|s| format!("{s} [part {}/{n}]", i + 1));
|
||||
let span = SourceSpan::Code {
|
||||
line_start: part_ls,
|
||||
line_end: part_le,
|
||||
symbol: part_sym,
|
||||
lang: lang.clone(),
|
||||
};
|
||||
out.push(make_chunk(
|
||||
doc, &chunker_version, &block_ids, &base_policy_hash,
|
||||
Some(part_ls), span, text,
|
||||
));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-chunk",
|
||||
doc_id = %doc.doc_id,
|
||||
chunks = out.len(),
|
||||
"code-go-ast-v1 chunked",
|
||||
);
|
||||
Ok(out)
|
||||
}
|
||||
}
|
||||
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
fn make_chunk(
|
||||
doc: &CanonicalDocument,
|
||||
chunker_version: &ChunkerVersion,
|
||||
block_ids: &[BlockId],
|
||||
base_policy_hash: &str,
|
||||
split_key: Option<u32>,
|
||||
span: SourceSpan,
|
||||
text: String,
|
||||
) -> Chunk {
|
||||
let id_hash = match split_key {
|
||||
Some(k) => format!("{base_policy_hash}#L{k}"),
|
||||
None => base_policy_hash.to_string(),
|
||||
};
|
||||
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
|
||||
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
|
||||
Chunk {
|
||||
chunk_id,
|
||||
doc_id: DocumentId(doc.doc_id.0.clone()),
|
||||
block_ids: block_ids.to_vec(),
|
||||
text,
|
||||
heading_path: Vec::new(),
|
||||
source_spans: vec![span],
|
||||
token_estimate,
|
||||
chunker_version: chunker_version.clone(),
|
||||
policy_hash: base_policy_hash.to_string(),
|
||||
}
|
||||
}
|
||||
|
||||
/// Split an oversize unit at blank-line paragraph boundaries, greedily
|
||||
/// gluing paragraphs until ~`AST_CHUNK_MAX_LINES` lines accumulate.
|
||||
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
|
||||
/// 0-based within the unit (caller adds the unit's absolute `line_start`).
|
||||
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
|
||||
let lines: Vec<&str> = code.split('\n').collect();
|
||||
let total = lines.len() as u32;
|
||||
let mut out: Vec<(u32, u32, String)> = Vec::new();
|
||||
let mut start: u32 = 0;
|
||||
while start < total {
|
||||
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
|
||||
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
|
||||
if end < total {
|
||||
if let Some(b) = (floor.min(end)..end)
|
||||
.rev()
|
||||
.find(|&i| lines[i as usize].trim().is_empty())
|
||||
{
|
||||
end = b + 1;
|
||||
}
|
||||
}
|
||||
let text = lines[start as usize..end as usize].join("\n");
|
||||
out.push((start, end.saturating_sub(1), text));
|
||||
start = end;
|
||||
}
|
||||
if out.is_empty() {
|
||||
out.push((0, total.saturating_sub(1), code.to_string()));
|
||||
}
|
||||
out
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use kebab_core::{
|
||||
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
|
||||
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
|
||||
SourceType, TrustLevel, WorkspacePath,
|
||||
};
|
||||
use time::OffsetDateTime;
|
||||
|
||||
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
|
||||
let wp = WorkspacePath("crates/x/src/a.go".into());
|
||||
let aid = AssetId("a".repeat(64));
|
||||
let pv = ParserVersion("code-go-v1".into());
|
||||
let doc_id = id_for_doc(&wp, &aid, &pv);
|
||||
let blocks = units
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, (sym, ls, le, code))| {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: *ls,
|
||||
line_end: *le,
|
||||
symbol: Some((*sym).to_string()),
|
||||
lang: Some("go".into()),
|
||||
};
|
||||
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
|
||||
Block::Code(CodeBlock {
|
||||
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
|
||||
lang: Some("go".into()),
|
||||
code: (*code).to_string(),
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
CanonicalDocument {
|
||||
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
|
||||
lang: Lang("und".into()), blocks,
|
||||
metadata: Metadata {
|
||||
aliases: vec![], tags: vec![],
|
||||
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None, user: Default::default(),
|
||||
repo: Some("kebab".into()), git_branch: Some("main".into()),
|
||||
git_commit: Some("0".repeat(40)), code_lang: Some("go".into()),
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: pv, schema_version: 1, doc_version: 1,
|
||||
last_chunker_version: None, last_embedding_version: None,
|
||||
}
|
||||
}
|
||||
fn policy() -> ChunkPolicy {
|
||||
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
|
||||
respect_markdown_headings: false,
|
||||
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn chunker_version_is_code_go_ast_v1() {
|
||||
assert_eq!(CodeGoAstV1Chunker.chunker_version(),
|
||||
ChunkerVersion("code-go-ast-v1".into()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn one_chunk_per_unit_preserves_code_span() {
|
||||
let doc = code_doc(&[
|
||||
("parse", 1, 3, "func parse() {\n\t// x\n}"),
|
||||
("Foo.double", 5, 7, "func double() int {\n\t//\n\treturn 0\n}"),
|
||||
]);
|
||||
let chunks = CodeGoAstV1Chunker.chunk(&doc, &policy()).unwrap();
|
||||
assert_eq!(chunks.len(), 2);
|
||||
for c in &chunks {
|
||||
assert_eq!(c.source_spans.len(), 1);
|
||||
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
|
||||
assert_eq!(c.heading_path, Vec::<String>::new());
|
||||
assert_eq!(c.chunker_version.0, "code-go-ast-v1");
|
||||
}
|
||||
match &chunks[0].source_spans[0] {
|
||||
SourceSpan::Code { symbol, line_start, line_end, .. } => {
|
||||
assert_eq!(symbol.as_deref(), Some("parse"));
|
||||
assert_eq!((*line_start, *line_end), (1, 3));
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn oversize_unit_splits_into_parts_with_unique_ids() {
|
||||
let body = (0..500).map(|i| format!("\tx{i} := {i}")).collect::<Vec<_>>().join("\n");
|
||||
let code = format!("func big() {{\n{body}\n}}");
|
||||
let doc = code_doc(&[("big", 1, 502, &code)]);
|
||||
let chunks = CodeGoAstV1Chunker.chunk(&doc, &policy()).unwrap();
|
||||
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
|
||||
for c in &chunks {
|
||||
match &c.source_spans[0] {
|
||||
SourceSpan::Code { symbol, .. } => {
|
||||
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
|
||||
"part-numbered symbol, got {symbol:?}");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
}
|
||||
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
|
||||
let n = ids.len(); ids.sort(); ids.dedup();
|
||||
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn non_code_doc_errors() {
|
||||
use kebab_core::TextBlock;
|
||||
let mut doc = code_doc(&[("parse", 1, 1, "func parse() {}")]);
|
||||
doc.blocks = vec![Block::Paragraph(TextBlock {
|
||||
common: CommonBlock {
|
||||
block_id: kebab_core::BlockId("b".into()),
|
||||
heading_path: vec![],
|
||||
source_span: SourceSpan::Line { start: 1, end: 1 },
|
||||
},
|
||||
text: "x".into(), inlines: vec![],
|
||||
})];
|
||||
let err = CodeGoAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
|
||||
assert!(err.to_string().contains("CodeGoAstV1Chunker"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deterministic_chunk_ids_1000() {
|
||||
let doc = code_doc(&[("parse", 1, 2, "func parse() {}\n")]);
|
||||
let base: Vec<String> = CodeGoAstV1Chunker.chunk(&doc, &policy())
|
||||
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
|
||||
for _ in 0..1000 {
|
||||
let again: Vec<String> = CodeGoAstV1Chunker.chunk(&doc, &policy())
|
||||
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
|
||||
assert_eq!(again, base);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn policy_hash_matches_md_heading_v1() {
|
||||
let p = policy();
|
||||
assert_eq!(CodeGoAstV1Chunker.policy_hash(&p),
|
||||
crate::MdHeadingV1Chunker.policy_hash(&p));
|
||||
}
|
||||
}
|
||||
322
crates/kebab-chunk/src/code_java_ast_v1.rs
Normal file
322
crates/kebab-chunk/src/code_java_ast_v1.rs
Normal file
@@ -0,0 +1,322 @@
|
||||
//! `code-java-ast-v1` — maps a tree-sitter-derived Java AST
|
||||
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
|
||||
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
|
||||
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
|
||||
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
|
||||
//!
|
||||
//! tree-sitter is intentionally NOT a dependency here: AST work is
|
||||
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
|
||||
//! consumes the `CanonicalDocument`.
|
||||
//!
|
||||
//! `AST_CHUNK_MAX_LINES` is a constant matching
|
||||
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
|
||||
//! config threading needs a chunker registry (P+); same deviation
|
||||
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
|
||||
//! (`tasks/HOTFIXES.md`).
|
||||
|
||||
use kebab_core::{
|
||||
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
|
||||
SourceSpan, id_for_chunk,
|
||||
};
|
||||
|
||||
const VERSION_LABEL: &str = "code-java-ast-v1";
|
||||
const BYTES_PER_TOKEN: usize = 3;
|
||||
const POLICY_HASH_HEX_LEN: usize = 16;
|
||||
const AST_CHUNK_MAX_LINES: u32 = 200;
|
||||
|
||||
#[derive(Clone, Copy, Debug, Default)]
|
||||
pub struct CodeJavaAstV1Chunker;
|
||||
|
||||
impl Chunker for CodeJavaAstV1Chunker {
|
||||
fn chunker_version(&self) -> ChunkerVersion {
|
||||
ChunkerVersion(VERSION_LABEL.to_string())
|
||||
}
|
||||
|
||||
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
|
||||
let bytes = serde_json_canonicalizer::to_vec(policy)
|
||||
.expect("canonical JSON serialization of ChunkPolicy must not fail");
|
||||
let hex = blake3::hash(&bytes).to_hex().to_string();
|
||||
hex[..POLICY_HASH_HEX_LEN].to_string()
|
||||
}
|
||||
|
||||
fn chunk(
|
||||
&self,
|
||||
doc: &CanonicalDocument,
|
||||
policy: &ChunkPolicy,
|
||||
) -> anyhow::Result<Vec<Chunk>> {
|
||||
for b in &doc.blocks {
|
||||
let c = match b {
|
||||
Block::Code(c) => c,
|
||||
_ => anyhow::bail!(
|
||||
"CodeJavaAstV1Chunker only handles code docs (got non-Code block)"
|
||||
),
|
||||
};
|
||||
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
|
||||
anyhow::bail!(
|
||||
"CodeJavaAstV1Chunker only handles code docs (got non-Code source_span)"
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
let base_policy_hash = self.policy_hash(policy);
|
||||
let chunker_version = self.chunker_version();
|
||||
let mut out: Vec<Chunk> = Vec::new();
|
||||
|
||||
for b in &doc.blocks {
|
||||
let cb = match b {
|
||||
Block::Code(c) => c,
|
||||
_ => unreachable!("validated above"),
|
||||
};
|
||||
let (ls, le, symbol, lang) = match &cb.common.source_span {
|
||||
SourceSpan::Code { line_start, line_end, symbol, lang } => {
|
||||
(*line_start, *line_end, symbol.clone(), lang.clone())
|
||||
}
|
||||
_ => unreachable!("validated above"),
|
||||
};
|
||||
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
|
||||
let span_lines = le.saturating_sub(ls) + 1;
|
||||
|
||||
if span_lines <= AST_CHUNK_MAX_LINES {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: ls,
|
||||
line_end: le,
|
||||
symbol: symbol.clone(),
|
||||
lang: lang.clone(),
|
||||
};
|
||||
out.push(make_chunk(
|
||||
doc, &chunker_version, &block_ids, &base_policy_hash,
|
||||
None, span, cb.code.clone(),
|
||||
));
|
||||
} else {
|
||||
let parts = split_oversize(&cb.code);
|
||||
let n = parts.len();
|
||||
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
|
||||
let part_ls = ls + off_start;
|
||||
let part_le = ls + off_end;
|
||||
let part_sym = symbol
|
||||
.as_ref()
|
||||
.map(|s| format!("{s} [part {}/{n}]", i + 1));
|
||||
let span = SourceSpan::Code {
|
||||
line_start: part_ls,
|
||||
line_end: part_le,
|
||||
symbol: part_sym,
|
||||
lang: lang.clone(),
|
||||
};
|
||||
out.push(make_chunk(
|
||||
doc, &chunker_version, &block_ids, &base_policy_hash,
|
||||
Some(part_ls), span, text,
|
||||
));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-chunk",
|
||||
doc_id = %doc.doc_id,
|
||||
chunks = out.len(),
|
||||
"code-java-ast-v1 chunked",
|
||||
);
|
||||
Ok(out)
|
||||
}
|
||||
}
|
||||
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
fn make_chunk(
|
||||
doc: &CanonicalDocument,
|
||||
chunker_version: &ChunkerVersion,
|
||||
block_ids: &[BlockId],
|
||||
base_policy_hash: &str,
|
||||
split_key: Option<u32>,
|
||||
span: SourceSpan,
|
||||
text: String,
|
||||
) -> Chunk {
|
||||
let id_hash = match split_key {
|
||||
Some(k) => format!("{base_policy_hash}#L{k}"),
|
||||
None => base_policy_hash.to_string(),
|
||||
};
|
||||
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
|
||||
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
|
||||
Chunk {
|
||||
chunk_id,
|
||||
doc_id: DocumentId(doc.doc_id.0.clone()),
|
||||
block_ids: block_ids.to_vec(),
|
||||
text,
|
||||
heading_path: Vec::new(),
|
||||
source_spans: vec![span],
|
||||
token_estimate,
|
||||
chunker_version: chunker_version.clone(),
|
||||
policy_hash: base_policy_hash.to_string(),
|
||||
}
|
||||
}
|
||||
|
||||
/// Split an oversize unit at blank-line paragraph boundaries, greedily
|
||||
/// gluing paragraphs until ~`AST_CHUNK_MAX_LINES` lines accumulate.
|
||||
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
|
||||
/// 0-based within the unit (caller adds the unit's absolute `line_start`).
|
||||
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
|
||||
let lines: Vec<&str> = code.split('\n').collect();
|
||||
let total = lines.len() as u32;
|
||||
let mut out: Vec<(u32, u32, String)> = Vec::new();
|
||||
let mut start: u32 = 0;
|
||||
while start < total {
|
||||
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
|
||||
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
|
||||
if end < total {
|
||||
if let Some(b) = (floor.min(end)..end)
|
||||
.rev()
|
||||
.find(|&i| lines[i as usize].trim().is_empty())
|
||||
{
|
||||
end = b + 1;
|
||||
}
|
||||
}
|
||||
let text = lines[start as usize..end as usize].join("\n");
|
||||
out.push((start, end.saturating_sub(1), text));
|
||||
start = end;
|
||||
}
|
||||
if out.is_empty() {
|
||||
out.push((0, total.saturating_sub(1), code.to_string()));
|
||||
}
|
||||
out
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use kebab_core::{
|
||||
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
|
||||
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
|
||||
SourceType, TrustLevel, WorkspacePath,
|
||||
};
|
||||
use time::OffsetDateTime;
|
||||
|
||||
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
|
||||
let wp = WorkspacePath("crates/x/src/Main.java".into());
|
||||
let aid = AssetId("a".repeat(64));
|
||||
let pv = ParserVersion("code-java-v1".into());
|
||||
let doc_id = id_for_doc(&wp, &aid, &pv);
|
||||
let blocks = units
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, (sym, ls, le, code))| {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: *ls,
|
||||
line_end: *le,
|
||||
symbol: Some((*sym).to_string()),
|
||||
lang: Some("java".into()),
|
||||
};
|
||||
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
|
||||
Block::Code(CodeBlock {
|
||||
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
|
||||
lang: Some("java".into()),
|
||||
code: (*code).to_string(),
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
CanonicalDocument {
|
||||
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
|
||||
lang: Lang("und".into()), blocks,
|
||||
metadata: Metadata {
|
||||
aliases: vec![], tags: vec![],
|
||||
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None, user: Default::default(),
|
||||
repo: Some("kebab".into()), git_branch: Some("main".into()),
|
||||
git_commit: Some("0".repeat(40)), code_lang: Some("java".into()),
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: pv, schema_version: 1, doc_version: 1,
|
||||
last_chunker_version: None, last_embedding_version: None,
|
||||
}
|
||||
}
|
||||
fn policy() -> ChunkPolicy {
|
||||
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
|
||||
respect_markdown_headings: false,
|
||||
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn chunker_version_is_code_java_ast_v1() {
|
||||
assert_eq!(CodeJavaAstV1Chunker.chunker_version(),
|
||||
ChunkerVersion("code-java-ast-v1".into()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn one_chunk_per_unit_preserves_code_span() {
|
||||
let doc = code_doc(&[
|
||||
("parse", 1, 3, "void parse() {\n\t// x\n}"),
|
||||
("Foo.double", 5, 7, "int double() {\n\t//\n\treturn 0;\n}"),
|
||||
]);
|
||||
let chunks = CodeJavaAstV1Chunker.chunk(&doc, &policy()).unwrap();
|
||||
assert_eq!(chunks.len(), 2);
|
||||
for c in &chunks {
|
||||
assert_eq!(c.source_spans.len(), 1);
|
||||
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
|
||||
assert_eq!(c.heading_path, Vec::<String>::new());
|
||||
assert_eq!(c.chunker_version.0, "code-java-ast-v1");
|
||||
}
|
||||
match &chunks[0].source_spans[0] {
|
||||
SourceSpan::Code { symbol, line_start, line_end, .. } => {
|
||||
assert_eq!(symbol.as_deref(), Some("parse"));
|
||||
assert_eq!((*line_start, *line_end), (1, 3));
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn oversize_unit_splits_into_parts_with_unique_ids() {
|
||||
let body = (0..500).map(|i| format!("\tint x{i} = {i};")).collect::<Vec<_>>().join("\n");
|
||||
let code = format!("void big() {{\n{body}\n}}");
|
||||
let doc = code_doc(&[("big", 1, 502, &code)]);
|
||||
let chunks = CodeJavaAstV1Chunker.chunk(&doc, &policy()).unwrap();
|
||||
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
|
||||
for c in &chunks {
|
||||
match &c.source_spans[0] {
|
||||
SourceSpan::Code { symbol, .. } => {
|
||||
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
|
||||
"part-numbered symbol, got {symbol:?}");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
}
|
||||
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
|
||||
let n = ids.len(); ids.sort(); ids.dedup();
|
||||
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn non_code_doc_errors() {
|
||||
use kebab_core::TextBlock;
|
||||
let mut doc = code_doc(&[("parse", 1, 1, "void parse() {}")]);
|
||||
doc.blocks = vec![Block::Paragraph(TextBlock {
|
||||
common: CommonBlock {
|
||||
block_id: kebab_core::BlockId("b".into()),
|
||||
heading_path: vec![],
|
||||
source_span: SourceSpan::Line { start: 1, end: 1 },
|
||||
},
|
||||
text: "x".into(), inlines: vec![],
|
||||
})];
|
||||
let err = CodeJavaAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
|
||||
assert!(err.to_string().contains("CodeJavaAstV1Chunker"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deterministic_chunk_ids_1000() {
|
||||
let doc = code_doc(&[("parse", 1, 2, "void parse() {}\n")]);
|
||||
let base: Vec<String> = CodeJavaAstV1Chunker.chunk(&doc, &policy())
|
||||
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
|
||||
for _ in 0..1000 {
|
||||
let again: Vec<String> = CodeJavaAstV1Chunker.chunk(&doc, &policy())
|
||||
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
|
||||
assert_eq!(again, base);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn policy_hash_matches_md_heading_v1() {
|
||||
let p = policy();
|
||||
assert_eq!(CodeJavaAstV1Chunker.policy_hash(&p),
|
||||
crate::MdHeadingV1Chunker.policy_hash(&p));
|
||||
}
|
||||
}
|
||||
322
crates/kebab-chunk/src/code_kotlin_ast_v1.rs
Normal file
322
crates/kebab-chunk/src/code_kotlin_ast_v1.rs
Normal file
@@ -0,0 +1,322 @@
|
||||
//! `code-kotlin-ast-v1` — maps a tree-sitter-derived Kotlin AST
|
||||
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
|
||||
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
|
||||
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
|
||||
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
|
||||
//!
|
||||
//! tree-sitter is intentionally NOT a dependency here: AST work is
|
||||
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
|
||||
//! consumes the `CanonicalDocument`.
|
||||
//!
|
||||
//! `AST_CHUNK_MAX_LINES` is a constant matching
|
||||
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
|
||||
//! config threading needs a chunker registry (P+); same deviation
|
||||
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
|
||||
//! (`tasks/HOTFIXES.md`).
|
||||
|
||||
use kebab_core::{
|
||||
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
|
||||
SourceSpan, id_for_chunk,
|
||||
};
|
||||
|
||||
const VERSION_LABEL: &str = "code-kotlin-ast-v1";
|
||||
const BYTES_PER_TOKEN: usize = 3;
|
||||
const POLICY_HASH_HEX_LEN: usize = 16;
|
||||
const AST_CHUNK_MAX_LINES: u32 = 200;
|
||||
|
||||
#[derive(Clone, Copy, Debug, Default)]
|
||||
pub struct CodeKotlinAstV1Chunker;
|
||||
|
||||
impl Chunker for CodeKotlinAstV1Chunker {
|
||||
fn chunker_version(&self) -> ChunkerVersion {
|
||||
ChunkerVersion(VERSION_LABEL.to_string())
|
||||
}
|
||||
|
||||
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
|
||||
let bytes = serde_json_canonicalizer::to_vec(policy)
|
||||
.expect("canonical JSON serialization of ChunkPolicy must not fail");
|
||||
let hex = blake3::hash(&bytes).to_hex().to_string();
|
||||
hex[..POLICY_HASH_HEX_LEN].to_string()
|
||||
}
|
||||
|
||||
fn chunk(
|
||||
&self,
|
||||
doc: &CanonicalDocument,
|
||||
policy: &ChunkPolicy,
|
||||
) -> anyhow::Result<Vec<Chunk>> {
|
||||
for b in &doc.blocks {
|
||||
let c = match b {
|
||||
Block::Code(c) => c,
|
||||
_ => anyhow::bail!(
|
||||
"CodeKotlinAstV1Chunker only handles code docs (got non-Code block)"
|
||||
),
|
||||
};
|
||||
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
|
||||
anyhow::bail!(
|
||||
"CodeKotlinAstV1Chunker only handles code docs (got non-Code source_span)"
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
let base_policy_hash = self.policy_hash(policy);
|
||||
let chunker_version = self.chunker_version();
|
||||
let mut out: Vec<Chunk> = Vec::new();
|
||||
|
||||
for b in &doc.blocks {
|
||||
let cb = match b {
|
||||
Block::Code(c) => c,
|
||||
_ => unreachable!("validated above"),
|
||||
};
|
||||
let (ls, le, symbol, lang) = match &cb.common.source_span {
|
||||
SourceSpan::Code { line_start, line_end, symbol, lang } => {
|
||||
(*line_start, *line_end, symbol.clone(), lang.clone())
|
||||
}
|
||||
_ => unreachable!("validated above"),
|
||||
};
|
||||
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
|
||||
let span_lines = le.saturating_sub(ls) + 1;
|
||||
|
||||
if span_lines <= AST_CHUNK_MAX_LINES {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: ls,
|
||||
line_end: le,
|
||||
symbol: symbol.clone(),
|
||||
lang: lang.clone(),
|
||||
};
|
||||
out.push(make_chunk(
|
||||
doc, &chunker_version, &block_ids, &base_policy_hash,
|
||||
None, span, cb.code.clone(),
|
||||
));
|
||||
} else {
|
||||
let parts = split_oversize(&cb.code);
|
||||
let n = parts.len();
|
||||
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
|
||||
let part_ls = ls + off_start;
|
||||
let part_le = ls + off_end;
|
||||
let part_sym = symbol
|
||||
.as_ref()
|
||||
.map(|s| format!("{s} [part {}/{n}]", i + 1));
|
||||
let span = SourceSpan::Code {
|
||||
line_start: part_ls,
|
||||
line_end: part_le,
|
||||
symbol: part_sym,
|
||||
lang: lang.clone(),
|
||||
};
|
||||
out.push(make_chunk(
|
||||
doc, &chunker_version, &block_ids, &base_policy_hash,
|
||||
Some(part_ls), span, text,
|
||||
));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-chunk",
|
||||
doc_id = %doc.doc_id,
|
||||
chunks = out.len(),
|
||||
"code-kotlin-ast-v1 chunked",
|
||||
);
|
||||
Ok(out)
|
||||
}
|
||||
}
|
||||
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
fn make_chunk(
|
||||
doc: &CanonicalDocument,
|
||||
chunker_version: &ChunkerVersion,
|
||||
block_ids: &[BlockId],
|
||||
base_policy_hash: &str,
|
||||
split_key: Option<u32>,
|
||||
span: SourceSpan,
|
||||
text: String,
|
||||
) -> Chunk {
|
||||
let id_hash = match split_key {
|
||||
Some(k) => format!("{base_policy_hash}#L{k}"),
|
||||
None => base_policy_hash.to_string(),
|
||||
};
|
||||
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
|
||||
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
|
||||
Chunk {
|
||||
chunk_id,
|
||||
doc_id: DocumentId(doc.doc_id.0.clone()),
|
||||
block_ids: block_ids.to_vec(),
|
||||
text,
|
||||
heading_path: Vec::new(),
|
||||
source_spans: vec![span],
|
||||
token_estimate,
|
||||
chunker_version: chunker_version.clone(),
|
||||
policy_hash: base_policy_hash.to_string(),
|
||||
}
|
||||
}
|
||||
|
||||
/// Split an oversize unit at blank-line paragraph boundaries, greedily
|
||||
/// gluing paragraphs until ~`AST_CHUNK_MAX_LINES` lines accumulate.
|
||||
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
|
||||
/// 0-based within the unit (caller adds the unit's absolute `line_start`).
|
||||
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
|
||||
let lines: Vec<&str> = code.split('\n').collect();
|
||||
let total = lines.len() as u32;
|
||||
let mut out: Vec<(u32, u32, String)> = Vec::new();
|
||||
let mut start: u32 = 0;
|
||||
while start < total {
|
||||
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
|
||||
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
|
||||
if end < total {
|
||||
if let Some(b) = (floor.min(end)..end)
|
||||
.rev()
|
||||
.find(|&i| lines[i as usize].trim().is_empty())
|
||||
{
|
||||
end = b + 1;
|
||||
}
|
||||
}
|
||||
let text = lines[start as usize..end as usize].join("\n");
|
||||
out.push((start, end.saturating_sub(1), text));
|
||||
start = end;
|
||||
}
|
||||
if out.is_empty() {
|
||||
out.push((0, total.saturating_sub(1), code.to_string()));
|
||||
}
|
||||
out
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use kebab_core::{
|
||||
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
|
||||
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
|
||||
SourceType, TrustLevel, WorkspacePath,
|
||||
};
|
||||
use time::OffsetDateTime;
|
||||
|
||||
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
|
||||
let wp = WorkspacePath("crates/x/src/Main.kt".into());
|
||||
let aid = AssetId("a".repeat(64));
|
||||
let pv = ParserVersion("code-kotlin-v1".into());
|
||||
let doc_id = id_for_doc(&wp, &aid, &pv);
|
||||
let blocks = units
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, (sym, ls, le, code))| {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: *ls,
|
||||
line_end: *le,
|
||||
symbol: Some((*sym).to_string()),
|
||||
lang: Some("kotlin".into()),
|
||||
};
|
||||
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
|
||||
Block::Code(CodeBlock {
|
||||
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
|
||||
lang: Some("kotlin".into()),
|
||||
code: (*code).to_string(),
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
CanonicalDocument {
|
||||
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
|
||||
lang: Lang("und".into()), blocks,
|
||||
metadata: Metadata {
|
||||
aliases: vec![], tags: vec![],
|
||||
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None, user: Default::default(),
|
||||
repo: Some("kebab".into()), git_branch: Some("main".into()),
|
||||
git_commit: Some("0".repeat(40)), code_lang: Some("kotlin".into()),
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: pv, schema_version: 1, doc_version: 1,
|
||||
last_chunker_version: None, last_embedding_version: None,
|
||||
}
|
||||
}
|
||||
fn policy() -> ChunkPolicy {
|
||||
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
|
||||
respect_markdown_headings: false,
|
||||
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn chunker_version_is_code_kotlin_ast_v1() {
|
||||
assert_eq!(CodeKotlinAstV1Chunker.chunker_version(),
|
||||
ChunkerVersion("code-kotlin-ast-v1".into()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn one_chunk_per_unit_preserves_code_span() {
|
||||
let doc = code_doc(&[
|
||||
("parse", 1, 3, "fun parse() {\n\t// x\n}"),
|
||||
("Foo.double", 5, 7, "fun double(): Int {\n\t//\n\treturn 0\n}"),
|
||||
]);
|
||||
let chunks = CodeKotlinAstV1Chunker.chunk(&doc, &policy()).unwrap();
|
||||
assert_eq!(chunks.len(), 2);
|
||||
for c in &chunks {
|
||||
assert_eq!(c.source_spans.len(), 1);
|
||||
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
|
||||
assert_eq!(c.heading_path, Vec::<String>::new());
|
||||
assert_eq!(c.chunker_version.0, "code-kotlin-ast-v1");
|
||||
}
|
||||
match &chunks[0].source_spans[0] {
|
||||
SourceSpan::Code { symbol, line_start, line_end, .. } => {
|
||||
assert_eq!(symbol.as_deref(), Some("parse"));
|
||||
assert_eq!((*line_start, *line_end), (1, 3));
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn oversize_unit_splits_into_parts_with_unique_ids() {
|
||||
let body = (0..500).map(|i| format!("\tval x{i} = {i}")).collect::<Vec<_>>().join("\n");
|
||||
let code = format!("fun big() {{\n{body}\n}}");
|
||||
let doc = code_doc(&[("big", 1, 502, &code)]);
|
||||
let chunks = CodeKotlinAstV1Chunker.chunk(&doc, &policy()).unwrap();
|
||||
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
|
||||
for c in &chunks {
|
||||
match &c.source_spans[0] {
|
||||
SourceSpan::Code { symbol, .. } => {
|
||||
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
|
||||
"part-numbered symbol, got {symbol:?}");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
}
|
||||
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
|
||||
let n = ids.len(); ids.sort(); ids.dedup();
|
||||
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn non_code_doc_errors() {
|
||||
use kebab_core::TextBlock;
|
||||
let mut doc = code_doc(&[("parse", 1, 1, "fun parse() {}")]);
|
||||
doc.blocks = vec![Block::Paragraph(TextBlock {
|
||||
common: CommonBlock {
|
||||
block_id: kebab_core::BlockId("b".into()),
|
||||
heading_path: vec![],
|
||||
source_span: SourceSpan::Line { start: 1, end: 1 },
|
||||
},
|
||||
text: "x".into(), inlines: vec![],
|
||||
})];
|
||||
let err = CodeKotlinAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
|
||||
assert!(err.to_string().contains("CodeKotlinAstV1Chunker"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deterministic_chunk_ids_1000() {
|
||||
let doc = code_doc(&[("parse", 1, 2, "fun parse() {}\n")]);
|
||||
let base: Vec<String> = CodeKotlinAstV1Chunker.chunk(&doc, &policy())
|
||||
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
|
||||
for _ in 0..1000 {
|
||||
let again: Vec<String> = CodeKotlinAstV1Chunker.chunk(&doc, &policy())
|
||||
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
|
||||
assert_eq!(again, base);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn policy_hash_matches_md_heading_v1() {
|
||||
let p = policy();
|
||||
assert_eq!(CodeKotlinAstV1Chunker.policy_hash(&p),
|
||||
crate::MdHeadingV1Chunker.policy_hash(&p));
|
||||
}
|
||||
}
|
||||
170
crates/kebab-chunk/src/code_text_paragraph_v1.rs
Normal file
170
crates/kebab-chunk/src/code_text_paragraph_v1.rs
Normal file
@@ -0,0 +1,170 @@
|
||||
//! p10-3: Tier 3 paragraph + line-window fallback chunker.
|
||||
//!
|
||||
//! Splits code/text files on blank-line paragraph boundaries. Paragraphs
|
||||
//! with more than 80 lines are further split into 80-line windows with a
|
||||
//! 20-line overlap (stride 60) — the same oversize pattern used by Tier 1/2
|
||||
//! chunkers but without AST structure, hence no symbol.
|
||||
//!
|
||||
//! Per spec §9.3: all emitted chunks carry `symbol: None`.
|
||||
|
||||
use crate::tier2_shared::{build_chunk_no_symbol, policy_hash};
|
||||
use anyhow::Result;
|
||||
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, ChunkerVersion, Chunker};
|
||||
|
||||
pub const VERSION_LABEL: &str = "code-text-paragraph-v1";
|
||||
|
||||
/// Lines-per-window for the oversize fallback (Tier 3).
|
||||
const FALLBACK_LINES_PER_CHUNK: usize = 80;
|
||||
/// Overlap between consecutive windows.
|
||||
const FALLBACK_LINES_OVERLAP: usize = 20;
|
||||
// stride = FALLBACK_LINES_PER_CHUNK - FALLBACK_LINES_OVERLAP = 60.
|
||||
|
||||
#[derive(Clone, Copy, Debug, Default)]
|
||||
pub struct CodeTextParagraphV1Chunker;
|
||||
|
||||
impl Chunker for CodeTextParagraphV1Chunker {
|
||||
fn chunker_version(&self) -> ChunkerVersion {
|
||||
ChunkerVersion(VERSION_LABEL.to_string())
|
||||
}
|
||||
|
||||
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
|
||||
policy_hash(policy)
|
||||
}
|
||||
|
||||
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
|
||||
// Expect a single Block::Code carrying the full source text.
|
||||
let (text, lang_str) = match doc.blocks.first() {
|
||||
Some(Block::Code(cb)) => (cb.code.as_str(), cb.lang.as_deref().unwrap_or("")),
|
||||
_ => return Ok(vec![]),
|
||||
};
|
||||
|
||||
let mut chunks = Vec::new();
|
||||
for para in split_paragraphs(text) {
|
||||
push_paragraph(&mut chunks, doc, policy, ¶, lang_str)?;
|
||||
}
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-chunk",
|
||||
doc_id = %doc.doc_id,
|
||||
chunks = chunks.len(),
|
||||
"code-text-paragraph-v1 chunked",
|
||||
);
|
||||
|
||||
Ok(chunks)
|
||||
}
|
||||
}
|
||||
|
||||
/// A contiguous run of non-blank lines from the source text.
|
||||
struct Paragraph {
|
||||
/// Lines joined with `\n` (no trailing newline).
|
||||
text: String,
|
||||
/// 1-indexed line number of the first line in the source file.
|
||||
line_start: u32,
|
||||
/// 1-indexed line number of the last line in the source file.
|
||||
line_end: u32,
|
||||
}
|
||||
|
||||
/// Split `text` into `Paragraph`s separated by blank (all-whitespace) lines.
|
||||
///
|
||||
/// Blank lines are treated as boundaries and are NOT included in any
|
||||
/// paragraph's line range. Paragraphs that would consist entirely of blank
|
||||
/// lines are skipped.
|
||||
fn split_paragraphs(text: &str) -> Vec<Paragraph> {
|
||||
let mut paragraphs = Vec::new();
|
||||
let mut current: Vec<&str> = Vec::new();
|
||||
let mut current_start: Option<u32> = None;
|
||||
|
||||
for (idx, line) in text.lines().enumerate() {
|
||||
let line_no = (idx + 1) as u32;
|
||||
let is_blank = line.trim().is_empty();
|
||||
if is_blank {
|
||||
if let Some(start) = current_start.take() {
|
||||
let end = start + current.len() as u32 - 1;
|
||||
paragraphs.push(Paragraph {
|
||||
text: current.join("\n"),
|
||||
line_start: start,
|
||||
line_end: end,
|
||||
});
|
||||
current.clear();
|
||||
}
|
||||
} else {
|
||||
if current_start.is_none() {
|
||||
current_start = Some(line_no);
|
||||
}
|
||||
current.push(line);
|
||||
}
|
||||
}
|
||||
// Flush any trailing paragraph not terminated by a blank line.
|
||||
if let Some(start) = current_start {
|
||||
let end = start + current.len() as u32 - 1;
|
||||
paragraphs.push(Paragraph {
|
||||
text: current.join("\n"),
|
||||
line_start: start,
|
||||
line_end: end,
|
||||
});
|
||||
}
|
||||
paragraphs
|
||||
}
|
||||
|
||||
/// Emit one or more chunks for a single paragraph.
|
||||
///
|
||||
/// Paragraphs with ≤ `FALLBACK_LINES_PER_CHUNK` lines become a single chunk.
|
||||
/// Larger paragraphs are split into overlapping windows of
|
||||
/// `FALLBACK_LINES_PER_CHUNK` lines with stride `FALLBACK_LINES_PER_CHUNK -
|
||||
/// FALLBACK_LINES_OVERLAP`. The last window may be shorter. Window starts
|
||||
/// are passed as `split_key` so `id_for_chunk` can produce distinct ids
|
||||
/// across windows.
|
||||
fn push_paragraph(
|
||||
out: &mut Vec<Chunk>,
|
||||
doc: &CanonicalDocument,
|
||||
policy: &ChunkPolicy,
|
||||
para: &Paragraph,
|
||||
lang: &str,
|
||||
) -> Result<()> {
|
||||
let n_lines = (para.line_end - para.line_start + 1) as usize;
|
||||
|
||||
if n_lines <= FALLBACK_LINES_PER_CHUNK {
|
||||
// Use line_start as split_key so each paragraph gets a distinct
|
||||
// chunk_id even when block_ids is empty (no symbol, no AST structure).
|
||||
// Without this, all short paragraphs from the same doc share the same
|
||||
// base_policy_hash and therefore the same id_for_chunk result.
|
||||
out.push(build_chunk_no_symbol(
|
||||
doc,
|
||||
policy,
|
||||
¶.text,
|
||||
para.line_start,
|
||||
para.line_end,
|
||||
lang,
|
||||
VERSION_LABEL,
|
||||
Some(para.line_start),
|
||||
));
|
||||
return Ok(());
|
||||
}
|
||||
|
||||
// Oversize: line-window split with overlap.
|
||||
let stride = FALLBACK_LINES_PER_CHUNK - FALLBACK_LINES_OVERLAP;
|
||||
let lines: Vec<&str> = para.text.lines().collect();
|
||||
let mut i = 0usize;
|
||||
loop {
|
||||
let end = (i + FALLBACK_LINES_PER_CHUNK).min(lines.len());
|
||||
let window_text = lines[i..end].join("\n");
|
||||
let window_start = para.line_start + i as u32;
|
||||
let window_end = para.line_start + (end as u32) - 1;
|
||||
// Use window_start as split_key so chunk_ids are unique across windows.
|
||||
out.push(build_chunk_no_symbol(
|
||||
doc,
|
||||
policy,
|
||||
&window_text,
|
||||
window_start,
|
||||
window_end,
|
||||
lang,
|
||||
VERSION_LABEL,
|
||||
Some(window_start),
|
||||
));
|
||||
if end == lines.len() {
|
||||
break;
|
||||
}
|
||||
i += stride;
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
58
crates/kebab-chunk/src/dockerfile_file_v1.rs
Normal file
58
crates/kebab-chunk/src/dockerfile_file_v1.rs
Normal file
@@ -0,0 +1,58 @@
|
||||
//! p10-2: dockerfile whole-file chunker (Tier 2).
|
||||
//!
|
||||
//! Reads entire Dockerfile content and emits a single Chunk with symbol
|
||||
//! "<dockerfile>", code_lang "dockerfile", line range 1..EOF.
|
||||
//! Oversize >200 lines splits into line-windows sharing the symbol via
|
||||
//! tier2_shared::push_chunks_with_oversize.
|
||||
|
||||
use crate::tier2_shared::{policy_hash, push_chunks_with_oversize};
|
||||
use anyhow::Result;
|
||||
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, ChunkerVersion, Chunker};
|
||||
|
||||
pub const VERSION_LABEL: &str = "dockerfile-file-v1";
|
||||
|
||||
#[derive(Clone, Copy, Debug, Default)]
|
||||
pub struct DockerfileFileV1Chunker;
|
||||
|
||||
impl Chunker for DockerfileFileV1Chunker {
|
||||
fn chunker_version(&self) -> ChunkerVersion {
|
||||
ChunkerVersion(VERSION_LABEL.to_string())
|
||||
}
|
||||
|
||||
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
|
||||
policy_hash(policy)
|
||||
}
|
||||
|
||||
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
|
||||
// Expect a single Block::Code carrying the full Dockerfile text.
|
||||
let text = match doc.blocks.first() {
|
||||
Some(Block::Code(cb)) => cb.code.as_str(),
|
||||
_ => return Ok(vec![]),
|
||||
};
|
||||
|
||||
let total_lines = text.lines().count().max(1) as u32;
|
||||
let mut chunks = Vec::new();
|
||||
|
||||
push_chunks_with_oversize(
|
||||
&mut chunks,
|
||||
doc,
|
||||
policy,
|
||||
text,
|
||||
1,
|
||||
total_lines,
|
||||
"<dockerfile>",
|
||||
"dockerfile",
|
||||
VERSION_LABEL,
|
||||
None,
|
||||
)?;
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-chunk",
|
||||
doc_id = %doc.doc_id,
|
||||
chunks = chunks.len(),
|
||||
"dockerfile-file-v1 chunked",
|
||||
);
|
||||
|
||||
Ok(chunks)
|
||||
}
|
||||
}
|
||||
170
crates/kebab-chunk/src/k8s_manifest_resource_v1.rs
Normal file
170
crates/kebab-chunk/src/k8s_manifest_resource_v1.rs
Normal file
@@ -0,0 +1,170 @@
|
||||
//! p10-2: k8s manifest resource-aware chunker.
|
||||
//!
|
||||
//! Splits a multi-document YAML file on `^---\s*$` boundaries, recognises
|
||||
//! documents that have both `apiVersion` and `kind` string fields as k8s
|
||||
//! resources, and emits one `Chunk` per resource (with oversize >200-line
|
||||
//! fallback). Non-k8s documents are skipped; invalid YAML yields 0 chunks
|
||||
//! for the entire file.
|
||||
|
||||
use crate::tier2_shared::{policy_hash, push_chunks_with_oversize};
|
||||
use anyhow::Result;
|
||||
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, ChunkerVersion, Chunker};
|
||||
|
||||
pub const VERSION_LABEL: &str = "k8s-manifest-resource-v1";
|
||||
|
||||
#[derive(Clone, Copy, Debug, Default)]
|
||||
pub struct K8sManifestResourceV1Chunker;
|
||||
|
||||
impl Chunker for K8sManifestResourceV1Chunker {
|
||||
fn chunker_version(&self) -> ChunkerVersion {
|
||||
ChunkerVersion(VERSION_LABEL.to_string())
|
||||
}
|
||||
|
||||
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
|
||||
policy_hash(policy)
|
||||
}
|
||||
|
||||
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
|
||||
// Expect a single Block::Code carrying the full YAML text.
|
||||
let text = match doc.blocks.first() {
|
||||
Some(Block::Code(cb)) => cb.code.as_str(),
|
||||
_ => return Ok(vec![]),
|
||||
};
|
||||
|
||||
let slices = split_yaml_documents(text);
|
||||
let mut chunks: Vec<Chunk> = Vec::new();
|
||||
|
||||
for slice in slices {
|
||||
// Invalid YAML in any document → return 0 chunks for the file.
|
||||
let value: serde_yaml::Value = match serde_yaml::from_str(slice.text) {
|
||||
Ok(v) => v,
|
||||
Err(_) => return Ok(vec![]),
|
||||
};
|
||||
|
||||
let Some(mapping) = value.as_mapping() else {
|
||||
continue;
|
||||
};
|
||||
|
||||
let api = mapping
|
||||
.get("apiVersion")
|
||||
.and_then(|v| v.as_str())
|
||||
.unwrap_or("");
|
||||
let kind = mapping
|
||||
.get("kind")
|
||||
.and_then(|v| v.as_str())
|
||||
.unwrap_or("");
|
||||
|
||||
// Skip non-k8s documents.
|
||||
if api.is_empty() || kind.is_empty() {
|
||||
continue;
|
||||
}
|
||||
|
||||
let metadata = mapping
|
||||
.get("metadata")
|
||||
.and_then(|v| v.as_mapping());
|
||||
let name = metadata
|
||||
.and_then(|m| m.get("name"))
|
||||
.and_then(|v| v.as_str())
|
||||
.unwrap_or("<unnamed>");
|
||||
let namespace = metadata
|
||||
.and_then(|m| m.get("namespace"))
|
||||
.and_then(|v| v.as_str());
|
||||
|
||||
let symbol = match namespace {
|
||||
Some(ns) if !ns.is_empty() => format!("{kind}/{ns}/{name}"),
|
||||
_ => format!("{kind}/{name}"),
|
||||
};
|
||||
|
||||
push_chunks_with_oversize(
|
||||
&mut chunks,
|
||||
doc,
|
||||
policy,
|
||||
slice.text,
|
||||
slice.line_start,
|
||||
slice.line_end,
|
||||
&symbol,
|
||||
"yaml",
|
||||
VERSION_LABEL,
|
||||
Some(slice.line_start),
|
||||
)?;
|
||||
}
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-chunk",
|
||||
doc_id = %doc.doc_id,
|
||||
chunks = chunks.len(),
|
||||
"k8s-manifest-resource-v1 chunked",
|
||||
);
|
||||
|
||||
Ok(chunks)
|
||||
}
|
||||
}
|
||||
|
||||
struct YamlSlice<'a> {
|
||||
text: &'a str,
|
||||
line_start: u32,
|
||||
line_end: u32,
|
||||
}
|
||||
|
||||
/// Split raw YAML text into per-document slices on `---` separator lines.
|
||||
/// Line numbers are 1-indexed.
|
||||
fn split_yaml_documents(text: &str) -> Vec<YamlSlice<'_>> {
|
||||
let lines: Vec<&str> = text.lines().collect();
|
||||
|
||||
// Collect indices of separator lines (0-based), then append a sentinel at
|
||||
// the end so the last slice is always terminated.
|
||||
let mut separators: Vec<usize> = lines
|
||||
.iter()
|
||||
.enumerate()
|
||||
.filter_map(|(i, l)| {
|
||||
let trimmed = l.trim_end();
|
||||
if trimmed == "---"
|
||||
|| trimmed.starts_with("--- ")
|
||||
|| trimmed.starts_with("---\t")
|
||||
{
|
||||
Some(i)
|
||||
} else {
|
||||
None
|
||||
}
|
||||
})
|
||||
.collect();
|
||||
separators.push(lines.len());
|
||||
|
||||
let mut slices: Vec<YamlSlice<'_>> = Vec::new();
|
||||
let mut doc_start_line: usize = 0; // 0-based index of current doc start
|
||||
|
||||
for sep_line in separators {
|
||||
if sep_line > doc_start_line {
|
||||
let start_byte = byte_offset_of_line(text, doc_start_line);
|
||||
let end_byte = byte_offset_of_line(text, sep_line);
|
||||
let slice_text = &text[start_byte..end_byte];
|
||||
if !slice_text.trim().is_empty() {
|
||||
slices.push(YamlSlice {
|
||||
text: slice_text,
|
||||
line_start: (doc_start_line + 1) as u32,
|
||||
line_end: sep_line as u32,
|
||||
});
|
||||
}
|
||||
}
|
||||
doc_start_line = sep_line + 1;
|
||||
}
|
||||
|
||||
slices
|
||||
}
|
||||
|
||||
/// Return the byte offset of the start of `line_idx` (0-based line index).
|
||||
fn byte_offset_of_line(text: &str, line_idx: usize) -> usize {
|
||||
if line_idx == 0 {
|
||||
return 0;
|
||||
}
|
||||
let mut count = 0usize;
|
||||
for (i, c) in text.char_indices() {
|
||||
if c == '\n' {
|
||||
count += 1;
|
||||
if count == line_idx {
|
||||
return i + 1;
|
||||
}
|
||||
}
|
||||
}
|
||||
text.len()
|
||||
}
|
||||
@@ -15,16 +15,35 @@
|
||||
//! embedder, the retriever, the LLM, the RAG layer, or the UI layers.
|
||||
//! It consumes `CanonicalDocument` purely through `kb-core` types.
|
||||
|
||||
mod code_c_ast_v1;
|
||||
mod code_cpp_ast_v1;
|
||||
mod code_go_ast_v1;
|
||||
mod code_java_ast_v1;
|
||||
mod code_js_ast_v1;
|
||||
mod code_kotlin_ast_v1;
|
||||
mod code_python_ast_v1;
|
||||
mod code_rust_ast_v1;
|
||||
mod code_ts_ast_v1;
|
||||
mod md_heading_v1;
|
||||
mod pdf_page_v1;
|
||||
mod tier2_shared;
|
||||
pub mod k8s_manifest_resource_v1;
|
||||
pub mod dockerfile_file_v1;
|
||||
pub mod manifest_file_v1;
|
||||
pub mod code_text_paragraph_v1;
|
||||
|
||||
pub use code_c_ast_v1::CodeCAstV1Chunker;
|
||||
pub use code_cpp_ast_v1::CodeCppAstV1Chunker;
|
||||
pub use code_go_ast_v1::CodeGoAstV1Chunker;
|
||||
pub use code_java_ast_v1::CodeJavaAstV1Chunker;
|
||||
pub use code_js_ast_v1::CodeJsAstV1Chunker;
|
||||
pub use code_kotlin_ast_v1::CodeKotlinAstV1Chunker;
|
||||
pub use code_python_ast_v1::CodePythonAstV1Chunker;
|
||||
pub use code_rust_ast_v1::CodeRustAstV1Chunker;
|
||||
pub use code_ts_ast_v1::CodeTsAstV1Chunker;
|
||||
pub use md_heading_v1::MdHeadingV1Chunker;
|
||||
pub use pdf_page_v1::PdfPageV1Chunker;
|
||||
pub use k8s_manifest_resource_v1::K8sManifestResourceV1Chunker;
|
||||
pub use dockerfile_file_v1::DockerfileFileV1Chunker;
|
||||
pub use manifest_file_v1::ManifestFileV1Chunker;
|
||||
pub use code_text_paragraph_v1::CodeTextParagraphV1Chunker;
|
||||
|
||||
59
crates/kebab-chunk/src/manifest_file_v1.rs
Normal file
59
crates/kebab-chunk/src/manifest_file_v1.rs
Normal file
@@ -0,0 +1,59 @@
|
||||
//! p10-2: manifest whole-file chunker (Tier 2).
|
||||
//!
|
||||
//! Reads entire manifest file (Cargo.toml / package.json / pom.xml / go.mod /
|
||||
//! build.gradle / pyproject.toml / tsconfig.json) and emits a single Chunk
|
||||
//! with symbol "<manifest>", code_lang read from Block::Code.lang, line range
|
||||
//! 1..EOF. Oversize >200 lines splits into line-windows sharing the symbol via
|
||||
//! tier2_shared::push_chunks_with_oversize.
|
||||
|
||||
use crate::tier2_shared::{policy_hash, push_chunks_with_oversize};
|
||||
use anyhow::Result;
|
||||
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, ChunkerVersion, Chunker};
|
||||
|
||||
pub const VERSION_LABEL: &str = "manifest-file-v1";
|
||||
|
||||
#[derive(Clone, Copy, Debug, Default)]
|
||||
pub struct ManifestFileV1Chunker;
|
||||
|
||||
impl Chunker for ManifestFileV1Chunker {
|
||||
fn chunker_version(&self) -> ChunkerVersion {
|
||||
ChunkerVersion(VERSION_LABEL.to_string())
|
||||
}
|
||||
|
||||
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
|
||||
policy_hash(policy)
|
||||
}
|
||||
|
||||
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
|
||||
// Expect a single Block::Code carrying the full manifest text.
|
||||
let (text, lang) = match doc.blocks.first() {
|
||||
Some(Block::Code(cb)) => (cb.code.as_str(), cb.lang.as_deref().unwrap_or("")),
|
||||
_ => return Ok(vec![]),
|
||||
};
|
||||
|
||||
let total_lines = text.lines().count().max(1) as u32;
|
||||
let mut chunks = Vec::new();
|
||||
|
||||
push_chunks_with_oversize(
|
||||
&mut chunks,
|
||||
doc,
|
||||
policy,
|
||||
text,
|
||||
1,
|
||||
total_lines,
|
||||
"<manifest>",
|
||||
lang,
|
||||
VERSION_LABEL,
|
||||
None,
|
||||
)?;
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-chunk",
|
||||
doc_id = %doc.doc_id,
|
||||
chunks = chunks.len(),
|
||||
"manifest-file-v1 chunked",
|
||||
);
|
||||
|
||||
Ok(chunks)
|
||||
}
|
||||
}
|
||||
192
crates/kebab-chunk/src/tier2_shared.rs
Normal file
192
crates/kebab-chunk/src/tier2_shared.rs
Normal file
@@ -0,0 +1,192 @@
|
||||
//! p10-2: Tier 2 chunker shared helpers (oversize fallback + Chunk build).
|
||||
//!
|
||||
//! Mirrors `code_rust_ast_v1`'s Chunk-construction pattern exactly so that
|
||||
//! id / hashes / token-count / ChunkPolicy semantics stay identical across
|
||||
//! Tier 1 (AST) and Tier 2 (resource-aware) chunkers.
|
||||
|
||||
use anyhow::Result;
|
||||
use kebab_core::{
|
||||
BlockId, CanonicalDocument, Chunk, ChunkPolicy, ChunkerVersion, DocumentId, SourceSpan,
|
||||
id_for_chunk,
|
||||
};
|
||||
|
||||
pub(crate) const AST_CHUNK_MAX_LINES: u32 = 200;
|
||||
const BYTES_PER_TOKEN: usize = 3;
|
||||
const POLICY_HASH_HEX_LEN: usize = 16;
|
||||
|
||||
/// Compute the policy hash the same way `code_rust_ast_v1` does.
|
||||
pub(crate) fn policy_hash(policy: &ChunkPolicy) -> String {
|
||||
let bytes = serde_json_canonicalizer::to_vec(policy)
|
||||
.expect("canonical JSON serialization of ChunkPolicy must not fail");
|
||||
let hex = blake3::hash(&bytes).to_hex().to_string();
|
||||
hex[..POLICY_HASH_HEX_LEN].to_string()
|
||||
}
|
||||
|
||||
/// Emit one chunk for `(text, line_start..=line_end, symbol, lang)`, splitting
|
||||
/// into line-windows of at most `AST_CHUNK_MAX_LINES` if the slice is oversize.
|
||||
/// Mirrors the oversize path in `code_rust_ast_v1`'s `chunk` impl.
|
||||
///
|
||||
/// `base_split_key` is used as the `split_key` for the non-oversize single-chunk
|
||||
/// case. Callers that emit multiple chunks from the same document (e.g.
|
||||
/// `K8sManifestResourceV1Chunker` — one call per k8s resource) MUST pass
|
||||
/// `Some(line_start)` so that each call produces a distinct `chunk_id`.
|
||||
/// Single-chunk callers (dockerfile-file-v1, manifest-file-v1) pass `None` to
|
||||
/// keep chunk_ids stable (no sibling can collide when there's only one chunk).
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
pub(crate) fn push_chunks_with_oversize(
|
||||
out: &mut Vec<Chunk>,
|
||||
doc: &CanonicalDocument,
|
||||
policy: &ChunkPolicy,
|
||||
text: &str,
|
||||
line_start: u32,
|
||||
line_end: u32,
|
||||
symbol: &str,
|
||||
lang: &str,
|
||||
chunker_version: &str,
|
||||
base_split_key: Option<u32>,
|
||||
) -> Result<()> {
|
||||
let n_lines = (line_end - line_start + 1).max(1);
|
||||
let cv = ChunkerVersion(chunker_version.to_string());
|
||||
let base_policy_hash = policy_hash(policy);
|
||||
|
||||
if n_lines <= AST_CHUNK_MAX_LINES {
|
||||
out.push(build_chunk(
|
||||
doc,
|
||||
&cv,
|
||||
&base_policy_hash,
|
||||
text,
|
||||
line_start,
|
||||
line_end,
|
||||
symbol,
|
||||
lang,
|
||||
base_split_key,
|
||||
));
|
||||
return Ok(());
|
||||
}
|
||||
|
||||
let lines: Vec<&str> = text.lines().collect();
|
||||
let total = lines.len();
|
||||
let mut window_start = line_start;
|
||||
let mut i = 0usize;
|
||||
while i < total {
|
||||
let take = (AST_CHUNK_MAX_LINES as usize).min(total - i);
|
||||
let window_text = lines[i..i + take].join("\n");
|
||||
let window_end = window_start + take as u32 - 1;
|
||||
out.push(build_chunk(
|
||||
doc,
|
||||
&cv,
|
||||
&base_policy_hash,
|
||||
&window_text,
|
||||
window_start,
|
||||
window_end,
|
||||
symbol,
|
||||
lang,
|
||||
Some(window_start),
|
||||
));
|
||||
i += take;
|
||||
window_start = window_end + 1;
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Build a single `Chunk`, mirroring `make_chunk` in `code_rust_ast_v1.rs`
|
||||
/// exactly (same id recipe, same token estimate, same field set).
|
||||
///
|
||||
/// `split_key` is `Some(line_start_of_window)` for oversize splits, `None`
|
||||
/// for normal single-chunk emission. Mirrors the `Some(part_ls)` / `None`
|
||||
/// split_key pattern in 1A-2.
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
pub(crate) fn build_chunk(
|
||||
doc: &CanonicalDocument,
|
||||
chunker_version: &ChunkerVersion,
|
||||
base_policy_hash: &str,
|
||||
text: &str,
|
||||
line_start: u32,
|
||||
line_end: u32,
|
||||
symbol: &str,
|
||||
lang: &str,
|
||||
split_key: Option<u32>,
|
||||
) -> Chunk {
|
||||
let span = SourceSpan::Code {
|
||||
line_start,
|
||||
line_end,
|
||||
symbol: Some(symbol.to_string()),
|
||||
lang: Some(lang.to_string()),
|
||||
};
|
||||
build_chunk_from_span(doc, chunker_version, base_policy_hash, text, span, split_key)
|
||||
}
|
||||
|
||||
/// Like `build_chunk` but emits `symbol: None`. Used by Tier 3 (per spec §9.3).
|
||||
///
|
||||
/// Accepts `policy: &ChunkPolicy` and `chunker_version: &str` (string slice)
|
||||
/// so callers don't need to pre-compute the hash and version wrapper.
|
||||
/// `split_key` is `Some(window_start)` for oversize line-window splits.
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
pub(crate) fn build_chunk_no_symbol(
|
||||
doc: &CanonicalDocument,
|
||||
policy: &ChunkPolicy,
|
||||
text: &str,
|
||||
line_start: u32,
|
||||
line_end: u32,
|
||||
lang: &str,
|
||||
chunker_version: &str,
|
||||
split_key: Option<u32>,
|
||||
) -> Chunk {
|
||||
let cv = ChunkerVersion(chunker_version.to_string());
|
||||
let base_policy_hash = policy_hash(policy);
|
||||
let span = SourceSpan::Code {
|
||||
line_start,
|
||||
line_end,
|
||||
symbol: None,
|
||||
lang: Some(lang.to_string()),
|
||||
};
|
||||
build_chunk_from_span(doc, &cv, &base_policy_hash, text, span, split_key)
|
||||
}
|
||||
|
||||
/// Core chunk-building logic shared by `build_chunk` and `build_chunk_no_symbol`.
|
||||
///
|
||||
/// Takes a pre-built `SourceSpan` so the only difference between the two
|
||||
/// public helpers is whether `symbol` is `Some` or `None`. All id/hash/
|
||||
/// token mechanics are identical.
|
||||
fn build_chunk_from_span(
|
||||
doc: &CanonicalDocument,
|
||||
chunker_version: &ChunkerVersion,
|
||||
base_policy_hash: &str,
|
||||
text: &str,
|
||||
span: SourceSpan,
|
||||
split_key: Option<u32>,
|
||||
) -> Chunk {
|
||||
// id_hash mirrors code_rust_ast_v1's make_chunk logic:
|
||||
// split_key Some(k) => "{base_policy_hash}#L{k}"
|
||||
// split_key None => base_policy_hash
|
||||
let id_hash = match split_key {
|
||||
Some(k) => format!("{base_policy_hash}#L{k}"),
|
||||
None => base_policy_hash.to_string(),
|
||||
};
|
||||
|
||||
// block_ids: Tier 2/3 chunkers have no per-block structure (the whole file
|
||||
// is one Block::Code), so we pass an empty slice — same as using the doc-
|
||||
// level slice without explicit block granularity.
|
||||
let block_ids: Vec<BlockId> = vec![];
|
||||
|
||||
let chunk_id = id_for_chunk(
|
||||
&DocumentId(doc.doc_id.0.clone()),
|
||||
chunker_version,
|
||||
&block_ids,
|
||||
&id_hash,
|
||||
);
|
||||
|
||||
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
|
||||
|
||||
Chunk {
|
||||
chunk_id,
|
||||
doc_id: DocumentId(doc.doc_id.0.clone()),
|
||||
block_ids,
|
||||
text: text.to_string(),
|
||||
heading_path: Vec::new(),
|
||||
source_spans: vec![span],
|
||||
token_estimate,
|
||||
chunker_version: chunker_version.clone(),
|
||||
policy_hash: base_policy_hash.to_string(),
|
||||
}
|
||||
}
|
||||
196
crates/kebab-chunk/tests/code_c_ast_snapshot.rs
Normal file
196
crates/kebab-chunk/tests/code_c_ast_snapshot.rs
Normal file
@@ -0,0 +1,196 @@
|
||||
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
|
||||
//! representative C code `CanonicalDocument`.
|
||||
//!
|
||||
//! This is an integration test. `kebab-parse-code` is intentionally NOT
|
||||
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
|
||||
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
|
||||
//! units, which is the same pattern used in `code_go_ast_v1.rs`'s
|
||||
//! internal `code_doc` test helper.
|
||||
//!
|
||||
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
|
||||
|
||||
use std::path::PathBuf;
|
||||
|
||||
use kebab_chunk::CodeCAstV1Chunker;
|
||||
use kebab_core::{
|
||||
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
|
||||
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
|
||||
id_for_block, id_for_doc,
|
||||
};
|
||||
use serde_json::Value;
|
||||
use time::OffsetDateTime;
|
||||
|
||||
fn fixtures_dir() -> PathBuf {
|
||||
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
|
||||
.join("tests")
|
||||
.join("fixtures")
|
||||
}
|
||||
|
||||
fn fixed_doc() -> CanonicalDocument {
|
||||
let wp = WorkspacePath("projects/record.c".into());
|
||||
let aid = AssetId("c".repeat(64));
|
||||
// Pin parser_version so doc_id / block_ids are reproducible.
|
||||
let pv = ParserVersion("code-c-v1".into());
|
||||
let doc_id = id_for_doc(&wp, &aid, &pv);
|
||||
|
||||
// Representative units:
|
||||
// 0. imports + defines (lines 1–4, ≤200)
|
||||
// 1. status_t enum typedef (lines 6–9, ≤200)
|
||||
// 2. record_t struct typedef (lines 11–16, ≤200)
|
||||
// 3. static counter decl glue (line 18, ≤200)
|
||||
// 4. parse_record fn (lines 20–23, ≤200)
|
||||
// 5. print_record fn (lines 25–27, ≤200)
|
||||
// 6. main fn (lines 29–33, ≤200)
|
||||
let raw_units: Vec<(&str, u32, u32, String)> = vec![
|
||||
(
|
||||
"<top-level>",
|
||||
1,
|
||||
18,
|
||||
"#include <stdio.h>\n#include <stdlib.h>\n\n#define MAX_BUF 4096\n\ntypedef enum {\n OK = 0,\n ERR_PARSE,\n ERR_IO,\n} status_t;\n\ntypedef struct {\n int id;\n char name[64];\n status_t status;\n} record_t;\n\nstatic int counter = 0;".to_string(),
|
||||
),
|
||||
(
|
||||
"parse_record",
|
||||
20,
|
||||
23,
|
||||
"int parse_record(const char *line, record_t *out) {\n if (line == NULL || out == NULL) return ERR_PARSE;\n return OK;\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"print_record",
|
||||
25,
|
||||
27,
|
||||
"void print_record(const record_t *r) {\n printf(\"[%d] %s (status=%d)\\n\", r->id, r->name, r->status);\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"main",
|
||||
29,
|
||||
33,
|
||||
"int main(void) {\n record_t r = { .id = 1, .name = \"foo\", .status = OK };\n print_record(&r);\n return 0;\n}".to_string(),
|
||||
),
|
||||
];
|
||||
|
||||
let blocks: Vec<Block> = raw_units
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, (sym, ls, le, code))| {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: *ls,
|
||||
line_end: *le,
|
||||
symbol: Some((*sym).to_string()),
|
||||
lang: Some("c".into()),
|
||||
};
|
||||
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
|
||||
Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id: bid,
|
||||
heading_path: vec![],
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some("c".into()),
|
||||
code: code.clone(),
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
|
||||
CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: aid,
|
||||
workspace_path: wp,
|
||||
title: "record.c".into(),
|
||||
lang: Lang("und".into()),
|
||||
blocks,
|
||||
metadata: Metadata {
|
||||
aliases: vec![],
|
||||
tags: vec![],
|
||||
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Default::default(),
|
||||
repo: Some("kebab".into()),
|
||||
git_branch: Some("main".into()),
|
||||
git_commit: Some("0".repeat(40)),
|
||||
code_lang: Some("c".into()),
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: pv,
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
}
|
||||
}
|
||||
|
||||
fn fixed_policy() -> ChunkPolicy {
|
||||
ChunkPolicy {
|
||||
target_tokens: 500,
|
||||
overlap_tokens: 80,
|
||||
respect_markdown_headings: false,
|
||||
chunker_version: ChunkerVersion("code-c-ast-v1".into()),
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn code_c_ast_chunks_snapshot() {
|
||||
let doc = fixed_doc();
|
||||
let policy = fixed_policy();
|
||||
|
||||
let chunks = CodeCAstV1Chunker.chunk(&doc, &policy).expect("chunk");
|
||||
let actual = serde_json::to_value(&chunks).unwrap();
|
||||
|
||||
let dir = fixtures_dir();
|
||||
let baseline_path = dir.join("code-sample.c.chunks.snapshot.json");
|
||||
let baseline_text = match std::fs::read_to_string(&baseline_path) {
|
||||
Ok(s) => s,
|
||||
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
|
||||
std::fs::create_dir_all(&dir).unwrap();
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
|
||||
return;
|
||||
}
|
||||
Err(e) => panic!(
|
||||
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
|
||||
baseline_path.display()
|
||||
),
|
||||
};
|
||||
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
|
||||
|
||||
if actual != expected {
|
||||
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
|
||||
eprintln!("updated baseline {}", baseline_path.display());
|
||||
return;
|
||||
}
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
panic!(
|
||||
"code-c-ast-v1 chunks snapshot drift\n\
|
||||
--- expected ({}) ---\n{baseline_text}\n\
|
||||
--- actual ---\n{pretty}\n\
|
||||
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
|
||||
baseline_path.display()
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
/// Determinism cross-check: re-running the same pipeline yields the same
|
||||
/// chunk_ids byte-for-byte.
|
||||
#[test]
|
||||
fn code_c_ast_chunks_are_deterministic() {
|
||||
let policy = fixed_policy();
|
||||
let baseline: Vec<String> = CodeCAstV1Chunker
|
||||
.chunk(&fixed_doc(), &policy)
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.map(|c| c.chunk_id.0)
|
||||
.collect();
|
||||
for _ in 0..5 {
|
||||
let again: Vec<String> = CodeCAstV1Chunker
|
||||
.chunk(&fixed_doc(), &policy)
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.map(|c| c.chunk_id.0)
|
||||
.collect();
|
||||
assert_eq!(again, baseline);
|
||||
}
|
||||
}
|
||||
325
crates/kebab-chunk/tests/code_cpp_ast_snapshot.rs
Normal file
325
crates/kebab-chunk/tests/code_cpp_ast_snapshot.rs
Normal file
@@ -0,0 +1,325 @@
|
||||
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
|
||||
//! representative C++ code `CanonicalDocument`.
|
||||
//!
|
||||
//! Two complementary tests:
|
||||
//! 1. `code_cpp_ast_chunks_snapshot` — hand-built `fixed_doc()` validates the
|
||||
//! chunker's 1:1 mapping (design §6.3 / §8 boundary: no parse-code dep needed).
|
||||
//! 2. `code_cpp_ast_extractor_snapshot` — invokes `CppAstExtractor` against the
|
||||
//! real `tests/fixtures/sample.cpp` fixture, validating the extractor → chunker
|
||||
//! end-to-end pipeline. `kebab-parse-code` is a dev-dep (same pattern as
|
||||
//! `kebab-parse-md` in Markdown snapshot tests).
|
||||
//!
|
||||
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
|
||||
|
||||
use std::path::PathBuf;
|
||||
|
||||
use kebab_chunk::CodeCppAstV1Chunker;
|
||||
use kebab_core::{
|
||||
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
|
||||
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
|
||||
id_for_block, id_for_doc,
|
||||
};
|
||||
use kebab_parse_code::CppAstExtractor;
|
||||
use serde_json::Value;
|
||||
use time::OffsetDateTime;
|
||||
|
||||
fn fixtures_dir() -> PathBuf {
|
||||
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
|
||||
.join("tests")
|
||||
.join("fixtures")
|
||||
}
|
||||
|
||||
fn fixed_doc() -> CanonicalDocument {
|
||||
let wp = WorkspacePath("projects/record.cpp".into());
|
||||
let aid = AssetId("c".repeat(64));
|
||||
// Pin parser_version so doc_id / block_ids are reproducible.
|
||||
let pv = ParserVersion("code-cpp-v1".into());
|
||||
let doc_id = id_for_doc(&wp, &aid, &pv);
|
||||
|
||||
// Representative units (C++ specific):
|
||||
// 0. includes + namespace opening (lines 1–4, ≤200)
|
||||
// 1. class definition (lines 6–20, ≤200)
|
||||
// 2. template function (lines 22–25, ≤200)
|
||||
// 3. namespace closing + free fn (lines 27–29, ≤200)
|
||||
// 4. main fn (lines 31–34, ≤200)
|
||||
let raw_units: Vec<(&str, u32, u32, String)> = vec![
|
||||
(
|
||||
"<top-level>",
|
||||
1,
|
||||
4,
|
||||
"#include <string>\n#include <vector>\n\nnamespace kebab {".to_string(),
|
||||
),
|
||||
(
|
||||
"kebab::chunk::MdHeadingV1Chunker",
|
||||
6,
|
||||
20,
|
||||
"class MdHeadingV1Chunker {\npublic:\n MdHeadingV1Chunker() = default;\n ~MdHeadingV1Chunker() = default;\n\n std::string chunk_doc(const std::string& doc) {\n return doc;\n }\n\n int operator()(int x) const {\n return x * 2;\n }\n\nprivate:\n int counter_ = 0;\n};".to_string(),
|
||||
),
|
||||
(
|
||||
"kebab::identity",
|
||||
22,
|
||||
25,
|
||||
"template <typename T>\nT identity(T value) {\n return value;\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"kebab::global_helper",
|
||||
27,
|
||||
29,
|
||||
"void global_helper() {\n // free function in kebab namespace\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"main",
|
||||
31,
|
||||
34,
|
||||
"int main() {\n kebab::chunk::MdHeadingV1Chunker c;\n return 0;\n}".to_string(),
|
||||
),
|
||||
];
|
||||
|
||||
let blocks: Vec<Block> = raw_units
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, (sym, ls, le, code))| {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: *ls,
|
||||
line_end: *le,
|
||||
symbol: Some((*sym).to_string()),
|
||||
lang: Some("cpp".into()),
|
||||
};
|
||||
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
|
||||
Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id: bid,
|
||||
heading_path: vec![],
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some("cpp".into()),
|
||||
code: code.clone(),
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
|
||||
CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: aid,
|
||||
workspace_path: wp,
|
||||
title: "record.cpp".into(),
|
||||
lang: Lang("und".into()),
|
||||
blocks,
|
||||
metadata: Metadata {
|
||||
aliases: vec![],
|
||||
tags: vec![],
|
||||
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Default::default(),
|
||||
repo: Some("kebab".into()),
|
||||
git_branch: Some("main".into()),
|
||||
git_commit: Some("0".repeat(40)),
|
||||
code_lang: Some("cpp".into()),
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: pv,
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
}
|
||||
}
|
||||
|
||||
fn fixed_policy() -> ChunkPolicy {
|
||||
ChunkPolicy {
|
||||
target_tokens: 500,
|
||||
overlap_tokens: 80,
|
||||
respect_markdown_headings: false,
|
||||
chunker_version: ChunkerVersion("code-cpp-ast-v1".into()),
|
||||
}
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Helper: run the real CppAstExtractor against tests/fixtures/sample.cpp
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
fn extract_cpp_fixture() -> CanonicalDocument {
|
||||
use kebab_core::{
|
||||
AssetId, AssetStorage, Checksum, ExtractConfig, ExtractContext, Extractor, RawAsset,
|
||||
SourceUri, WorkspacePath,
|
||||
};
|
||||
use std::path::PathBuf;
|
||||
|
||||
let bytes = std::fs::read(fixtures_dir().join("sample.cpp")).expect("read sample.cpp fixture");
|
||||
let src = String::from_utf8(bytes).expect("fixture is valid UTF-8");
|
||||
let wp = WorkspacePath("tests/fixtures/sample.cpp".to_string());
|
||||
let asset = RawAsset {
|
||||
asset_id: AssetId("e".repeat(64)),
|
||||
source_uri: SourceUri::File(PathBuf::from("tests/fixtures/sample.cpp")),
|
||||
workspace_path: wp,
|
||||
media_type: kebab_core::MediaType::Code("cpp".to_string()),
|
||||
byte_len: src.len() as u64,
|
||||
checksum: Checksum("f".repeat(64)),
|
||||
discovered_at: time::OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
stored: AssetStorage::Reference {
|
||||
path: PathBuf::from("tests/fixtures/sample.cpp"),
|
||||
sha: Checksum("f".repeat(64)),
|
||||
},
|
||||
};
|
||||
let cfg = ExtractConfig::default();
|
||||
let root = PathBuf::from("/tmp");
|
||||
let ctx = ExtractContext {
|
||||
asset: &asset,
|
||||
workspace_root: &root,
|
||||
config: &cfg,
|
||||
};
|
||||
CppAstExtractor::new().extract(&ctx, src.as_bytes()).unwrap()
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Test 1 (hand-built): chunker-only 1:1 mapping validation
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
#[test]
|
||||
fn code_cpp_ast_chunks_snapshot() {
|
||||
let doc = fixed_doc();
|
||||
let policy = fixed_policy();
|
||||
|
||||
let chunks = CodeCppAstV1Chunker.chunk(&doc, &policy).expect("chunk");
|
||||
let actual = serde_json::to_value(&chunks).unwrap();
|
||||
|
||||
let dir = fixtures_dir();
|
||||
let baseline_path = dir.join("code-sample.cpp.chunks.snapshot.json");
|
||||
let baseline_text = match std::fs::read_to_string(&baseline_path) {
|
||||
Ok(s) => s,
|
||||
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
|
||||
std::fs::create_dir_all(&dir).unwrap();
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
|
||||
return;
|
||||
}
|
||||
Err(e) => panic!(
|
||||
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
|
||||
baseline_path.display()
|
||||
),
|
||||
};
|
||||
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
|
||||
|
||||
if actual != expected {
|
||||
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
|
||||
eprintln!("updated baseline {}", baseline_path.display());
|
||||
return;
|
||||
}
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
panic!(
|
||||
"code-cpp-ast-v1 chunks snapshot drift\n\
|
||||
--- expected ({}) ---\n{baseline_text}\n\
|
||||
--- actual ---\n{pretty}\n\
|
||||
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
|
||||
baseline_path.display()
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
/// Determinism cross-check: re-running the same pipeline yields the same
|
||||
/// chunk_ids byte-for-byte.
|
||||
#[test]
|
||||
fn code_cpp_ast_chunks_are_deterministic() {
|
||||
let policy = fixed_policy();
|
||||
let baseline: Vec<String> = CodeCppAstV1Chunker
|
||||
.chunk(&fixed_doc(), &policy)
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.map(|c| c.chunk_id.0)
|
||||
.collect();
|
||||
for _ in 0..5 {
|
||||
let again: Vec<String> = CodeCppAstV1Chunker
|
||||
.chunk(&fixed_doc(), &policy)
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.map(|c| c.chunk_id.0)
|
||||
.collect();
|
||||
assert_eq!(again, baseline);
|
||||
}
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Test 2 (real extractor): end-to-end extractor → chunker pipeline
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
/// Validates that the real `CppAstExtractor` processes `sample.cpp` and
|
||||
/// emits the expected set of symbols through the full chunker pipeline.
|
||||
///
|
||||
/// `sample.cpp` contains:
|
||||
/// - `#include` directives + nested namespace `kebab::chunk` → glue + struct unit
|
||||
/// - `class MdHeadingV1Chunker` with methods (ctor, dtor, chunk_doc, operator())
|
||||
/// - `template <typename T> T identity(T value)` (template fn)
|
||||
/// - `void kebab::global_helper()` (free fn in namespace)
|
||||
/// - `int main()` (global free fn)
|
||||
#[test]
|
||||
fn code_cpp_ast_extractor_snapshot() {
|
||||
let doc = extract_cpp_fixture();
|
||||
|
||||
// Verify the extractor emits all expected named units.
|
||||
let block_syms: Vec<Option<String>> = doc.blocks.iter().filter_map(|b| match b {
|
||||
Block::Code(c) => match &c.common.source_span {
|
||||
SourceSpan::Code { symbol, .. } => Some(symbol.clone()),
|
||||
_ => None,
|
||||
},
|
||||
_ => None,
|
||||
}).collect();
|
||||
|
||||
// Must include namespace-qualified class and its methods
|
||||
assert!(
|
||||
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker")),
|
||||
"class unit missing: {block_syms:?}"
|
||||
);
|
||||
assert!(
|
||||
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::MdHeadingV1Chunker")),
|
||||
"ctor unit missing: {block_syms:?}"
|
||||
);
|
||||
assert!(
|
||||
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::~MdHeadingV1Chunker")),
|
||||
"dtor unit missing: {block_syms:?}"
|
||||
);
|
||||
assert!(
|
||||
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::chunk_doc")),
|
||||
"chunk_doc unit missing: {block_syms:?}"
|
||||
);
|
||||
assert!(
|
||||
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::operator()")),
|
||||
"operator() unit missing: {block_syms:?}"
|
||||
);
|
||||
// Template function (inside kebab::chunk namespace in the fixture)
|
||||
assert!(
|
||||
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::identity")),
|
||||
"identity template fn unit missing: {block_syms:?}"
|
||||
);
|
||||
// Free function in outer namespace
|
||||
assert!(
|
||||
block_syms.iter().any(|s| s.as_deref() == Some("kebab::global_helper")),
|
||||
"global_helper unit missing: {block_syms:?}"
|
||||
);
|
||||
// Global main
|
||||
assert!(
|
||||
block_syms.iter().any(|s| s.as_deref() == Some("main")),
|
||||
"main unit missing: {block_syms:?}"
|
||||
);
|
||||
}
|
||||
|
||||
/// End-to-end chunker output from real extractor is deterministic.
|
||||
#[test]
|
||||
fn code_cpp_ast_extractor_chunks_deterministic() {
|
||||
let doc1 = extract_cpp_fixture();
|
||||
let doc2 = extract_cpp_fixture();
|
||||
assert_eq!(doc1.blocks, doc2.blocks, "extractor output non-deterministic");
|
||||
|
||||
let policy = fixed_policy();
|
||||
let chunks1 = CodeCppAstV1Chunker.chunk(&doc1, &policy).unwrap();
|
||||
let chunks2 = CodeCppAstV1Chunker.chunk(&doc2, &policy).unwrap();
|
||||
assert_eq!(
|
||||
chunks1.iter().map(|c| c.chunk_id.0.clone()).collect::<Vec<_>>(),
|
||||
chunks2.iter().map(|c| c.chunk_id.0.clone()).collect::<Vec<_>>(),
|
||||
"chunker output non-deterministic"
|
||||
);
|
||||
}
|
||||
221
crates/kebab-chunk/tests/code_go_ast_snapshot.rs
Normal file
221
crates/kebab-chunk/tests/code_go_ast_snapshot.rs
Normal file
@@ -0,0 +1,221 @@
|
||||
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
|
||||
//! representative Go code `CanonicalDocument`.
|
||||
//!
|
||||
//! This is an integration test. `kebab-parse-code` is intentionally NOT
|
||||
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
|
||||
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
|
||||
//! units, which is the same pattern used in `code_rust_ast_v1.rs`'s
|
||||
//! internal `code_doc` test helper.
|
||||
//!
|
||||
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
|
||||
|
||||
use std::path::PathBuf;
|
||||
|
||||
use kebab_chunk::CodeGoAstV1Chunker;
|
||||
use kebab_core::{
|
||||
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
|
||||
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
|
||||
id_for_block, id_for_doc,
|
||||
};
|
||||
use serde_json::Value;
|
||||
use time::OffsetDateTime;
|
||||
|
||||
fn fixtures_dir() -> PathBuf {
|
||||
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
|
||||
.join("tests")
|
||||
.join("fixtures")
|
||||
}
|
||||
|
||||
fn fixed_doc() -> CanonicalDocument {
|
||||
let wp = WorkspacePath("kebab_eval/metrics.go".into());
|
||||
let aid = AssetId("b".repeat(64));
|
||||
// Pin parser_version so doc_id / block_ids are reproducible.
|
||||
let pv = ParserVersion("code-go-v1".into());
|
||||
let doc_id = id_for_doc(&wp, &aid, &pv);
|
||||
|
||||
// Build a >200-line function body to force split_oversize.
|
||||
let big_body: String = {
|
||||
let header = "func BigCompute(data []int) int {\n";
|
||||
let body: String = (0..210u32)
|
||||
.map(|i| format!("\tv{i} := 0\n\tif {i} < len(data) {{\n\t\tv{i} = data[{i}]\n\t}}\n"))
|
||||
.collect();
|
||||
let footer = "\treturn len(data)\n}";
|
||||
format!("{header}{body}{footer}")
|
||||
};
|
||||
let big_line_count = big_body.lines().count() as u32;
|
||||
let big_line_end = 48 + big_line_count - 1;
|
||||
|
||||
// Representative units:
|
||||
// 0. import block (lines 1–5, ≤200)
|
||||
// 1. free fn `ComputeMRR` (lines 7–12, ≤200)
|
||||
// 2. struct `MetricsCollector` (lines 14–20, ≤200)
|
||||
// 3. struct `BaseEvaluator` (lines 22–30, ≤200)
|
||||
// 4. method `Run` (lines 32–38, ≤200)
|
||||
// 5. method `Report` (lines 40–46, ≤200)
|
||||
// 6. BigCompute (>200 lines) to force split_oversize
|
||||
let raw_units: Vec<(&str, u32, u32, String)> = vec![
|
||||
(
|
||||
"imports",
|
||||
1,
|
||||
5,
|
||||
"import (\n\t\"fmt\"\n\t\"os\"\n\t\"strings\"\n)".to_string(),
|
||||
),
|
||||
(
|
||||
"ComputeMRR",
|
||||
7,
|
||||
12,
|
||||
"func ComputeMRR(scores []float64) float64 {\n\tif len(scores) == 0 {\n\t\treturn 0.0\n\t}\n\t_ = fmt.Sprintf(\"%v\", scores)\n\treturn 1.0 / float64(len(scores))\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"MetricsCollector",
|
||||
14,
|
||||
20,
|
||||
"type MetricsCollector struct {\n\tScores []float64\n\tLabels []string\n\tCounts map[string]int\n\tTotals map[string]float64\n\tTags []string\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"BaseEvaluator",
|
||||
22,
|
||||
30,
|
||||
"type BaseEvaluator struct {\n\tName string\n}\n\nfunc (e *BaseEvaluator) Evaluate(data []string) error {\n\t_ = os.Stderr\n\t_ = strings.Join(data, \",\")\n\treturn nil\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"MetricsCollector.Run",
|
||||
32,
|
||||
38,
|
||||
"func (m *MetricsCollector) Run(inputs []float64) {\n\tfor _, inp := range inputs {\n\t\tm.Scores = append(\n\t\t\tm.Scores,\n\t\t\tinp,\n\t\t)\n\t}\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"MetricsCollector.Report",
|
||||
40,
|
||||
46,
|
||||
"func (m *MetricsCollector) Report() map[string]interface{} {\n\treturn map[string]interface{}{\n\t\t\"mean\": 0.0,\n\t\t\"count\": len(m.Scores),\n\t\t\"tags\": m.Tags,\n\t}\n}".to_string(),
|
||||
),
|
||||
("BigCompute", 48, big_line_end, big_body),
|
||||
];
|
||||
|
||||
let blocks: Vec<Block> = raw_units
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, (sym, ls, le, code))| {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: *ls,
|
||||
line_end: *le,
|
||||
symbol: Some((*sym).to_string()),
|
||||
lang: Some("go".into()),
|
||||
};
|
||||
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
|
||||
Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id: bid,
|
||||
heading_path: vec![],
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some("go".into()),
|
||||
code: code.clone(),
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
|
||||
CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: aid,
|
||||
workspace_path: wp,
|
||||
title: "metrics.go".into(),
|
||||
lang: Lang("und".into()),
|
||||
blocks,
|
||||
metadata: Metadata {
|
||||
aliases: vec![],
|
||||
tags: vec![],
|
||||
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Default::default(),
|
||||
repo: Some("kebab".into()),
|
||||
git_branch: Some("main".into()),
|
||||
git_commit: Some("0".repeat(40)),
|
||||
code_lang: Some("go".into()),
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: pv,
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
}
|
||||
}
|
||||
|
||||
fn fixed_policy() -> ChunkPolicy {
|
||||
ChunkPolicy {
|
||||
target_tokens: 500,
|
||||
overlap_tokens: 80,
|
||||
respect_markdown_headings: false,
|
||||
chunker_version: ChunkerVersion("code-go-ast-v1".into()),
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn code_go_ast_chunks_snapshot() {
|
||||
let doc = fixed_doc();
|
||||
let policy = fixed_policy();
|
||||
|
||||
let chunks = CodeGoAstV1Chunker.chunk(&doc, &policy).expect("chunk");
|
||||
let actual = serde_json::to_value(&chunks).unwrap();
|
||||
|
||||
let dir = fixtures_dir();
|
||||
let baseline_path = dir.join("code-sample.go.chunks.snapshot.json");
|
||||
let baseline_text = match std::fs::read_to_string(&baseline_path) {
|
||||
Ok(s) => s,
|
||||
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
|
||||
std::fs::create_dir_all(&dir).unwrap();
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
|
||||
return;
|
||||
}
|
||||
Err(e) => panic!(
|
||||
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
|
||||
baseline_path.display()
|
||||
),
|
||||
};
|
||||
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
|
||||
|
||||
if actual != expected {
|
||||
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
|
||||
eprintln!("updated baseline {}", baseline_path.display());
|
||||
return;
|
||||
}
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
panic!(
|
||||
"code-go-ast-v1 chunks snapshot drift\n\
|
||||
--- expected ({}) ---\n{baseline_text}\n\
|
||||
--- actual ---\n{pretty}\n\
|
||||
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
|
||||
baseline_path.display()
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
/// Determinism cross-check: re-running the same pipeline yields the same
|
||||
/// chunk_ids byte-for-byte.
|
||||
#[test]
|
||||
fn code_go_ast_chunks_are_deterministic() {
|
||||
let policy = fixed_policy();
|
||||
let baseline: Vec<String> = CodeGoAstV1Chunker
|
||||
.chunk(&fixed_doc(), &policy)
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.map(|c| c.chunk_id.0)
|
||||
.collect();
|
||||
for _ in 0..5 {
|
||||
let again: Vec<String> = CodeGoAstV1Chunker
|
||||
.chunk(&fixed_doc(), &policy)
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.map(|c| c.chunk_id.0)
|
||||
.collect();
|
||||
assert_eq!(again, baseline);
|
||||
}
|
||||
}
|
||||
221
crates/kebab-chunk/tests/code_java_ast_snapshot.rs
Normal file
221
crates/kebab-chunk/tests/code_java_ast_snapshot.rs
Normal file
@@ -0,0 +1,221 @@
|
||||
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
|
||||
//! representative Java code `CanonicalDocument`.
|
||||
//!
|
||||
//! This is an integration test. `kebab-parse-code` is intentionally NOT
|
||||
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
|
||||
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
|
||||
//! units, which is the same pattern used in `code_rust_ast_v1.rs`'s
|
||||
//! internal `code_doc` test helper.
|
||||
//!
|
||||
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
|
||||
|
||||
use std::path::PathBuf;
|
||||
|
||||
use kebab_chunk::CodeJavaAstV1Chunker;
|
||||
use kebab_core::{
|
||||
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
|
||||
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
|
||||
id_for_block, id_for_doc,
|
||||
};
|
||||
use serde_json::Value;
|
||||
use time::OffsetDateTime;
|
||||
|
||||
fn fixtures_dir() -> PathBuf {
|
||||
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
|
||||
.join("tests")
|
||||
.join("fixtures")
|
||||
}
|
||||
|
||||
fn fixed_doc() -> CanonicalDocument {
|
||||
let wp = WorkspacePath("src/main/java/com/example/Metrics.java".into());
|
||||
let aid = AssetId("b".repeat(64));
|
||||
// Pin parser_version so doc_id / block_ids are reproducible.
|
||||
let pv = ParserVersion("code-java-v1".into());
|
||||
let doc_id = id_for_doc(&wp, &aid, &pv);
|
||||
|
||||
// Build a >200-line method body to force split_oversize.
|
||||
let big_body: String = {
|
||||
let header = "public class BigCompute {\n public int compute(int[] data) {\n";
|
||||
let body: String = (0..210u32)
|
||||
.map(|i| format!(" int v{i} = {i} < data.length ? data[{i}] : 0;\n"))
|
||||
.collect();
|
||||
let footer = " return data.length;\n }\n}";
|
||||
format!("{header}{body}{footer}")
|
||||
};
|
||||
let big_line_count = big_body.lines().count() as u32;
|
||||
let big_line_end = 48 + big_line_count - 1;
|
||||
|
||||
// Representative units:
|
||||
// 0. import block (lines 1–5, ≤200)
|
||||
// 1. free method `computeMRR` (lines 7–12, ≤200)
|
||||
// 2. class `MetricsCollector` (lines 14–20, ≤200)
|
||||
// 3. class `BaseEvaluator` (lines 22–30, ≤200)
|
||||
// 4. method `MetricsCollector.run` (lines 32–38, ≤200)
|
||||
// 5. method `MetricsCollector.report` (lines 40–46, ≤200)
|
||||
// 6. BigCompute (>200 lines) to force split_oversize
|
||||
let raw_units: Vec<(&str, u32, u32, String)> = vec![
|
||||
(
|
||||
"imports",
|
||||
1,
|
||||
5,
|
||||
"import java.util.List;\nimport java.util.Map;\nimport java.util.ArrayList;\nimport java.util.HashMap;\nimport java.util.stream.Collectors;".to_string(),
|
||||
),
|
||||
(
|
||||
"computeMRR",
|
||||
7,
|
||||
12,
|
||||
"public static double computeMRR(List<Double> scores) {\n if (scores.isEmpty()) {\n return 0.0;\n }\n return 1.0 / scores.size();\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"MetricsCollector",
|
||||
14,
|
||||
20,
|
||||
"public class MetricsCollector {\n private List<Double> scores;\n private List<String> labels;\n private Map<String, Integer> counts;\n private Map<String, Double> totals;\n private List<String> tags;\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"BaseEvaluator",
|
||||
22,
|
||||
30,
|
||||
"public class BaseEvaluator {\n private String name;\n\n public BaseEvaluator(String name) {\n this.name = name;\n }\n\n public void evaluate(List<String> data) throws Exception {\n String joined = String.join(\",\", data);\n }\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"MetricsCollector.run",
|
||||
32,
|
||||
38,
|
||||
"public void run(List<Double> inputs) {\n for (Double inp : inputs) {\n scores.add(\n inp\n );\n }\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"MetricsCollector.report",
|
||||
40,
|
||||
46,
|
||||
"public Map<String, Object> report() {\n Map<String, Object> result = new HashMap<>();\n result.put(\"mean\", 0.0);\n result.put(\"count\", scores.size());\n result.put(\"tags\", tags);\n return result;\n}".to_string(),
|
||||
),
|
||||
("BigCompute", 48, big_line_end, big_body),
|
||||
];
|
||||
|
||||
let blocks: Vec<Block> = raw_units
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, (sym, ls, le, code))| {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: *ls,
|
||||
line_end: *le,
|
||||
symbol: Some((*sym).to_string()),
|
||||
lang: Some("java".into()),
|
||||
};
|
||||
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
|
||||
Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id: bid,
|
||||
heading_path: vec![],
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some("java".into()),
|
||||
code: code.clone(),
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
|
||||
CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: aid,
|
||||
workspace_path: wp,
|
||||
title: "Metrics.java".into(),
|
||||
lang: Lang("und".into()),
|
||||
blocks,
|
||||
metadata: Metadata {
|
||||
aliases: vec![],
|
||||
tags: vec![],
|
||||
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Default::default(),
|
||||
repo: Some("kebab".into()),
|
||||
git_branch: Some("main".into()),
|
||||
git_commit: Some("0".repeat(40)),
|
||||
code_lang: Some("java".into()),
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: pv,
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
}
|
||||
}
|
||||
|
||||
fn fixed_policy() -> ChunkPolicy {
|
||||
ChunkPolicy {
|
||||
target_tokens: 500,
|
||||
overlap_tokens: 80,
|
||||
respect_markdown_headings: false,
|
||||
chunker_version: ChunkerVersion("code-java-ast-v1".into()),
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn code_java_ast_chunks_snapshot() {
|
||||
let doc = fixed_doc();
|
||||
let policy = fixed_policy();
|
||||
|
||||
let chunks = CodeJavaAstV1Chunker.chunk(&doc, &policy).expect("chunk");
|
||||
let actual = serde_json::to_value(&chunks).unwrap();
|
||||
|
||||
let dir = fixtures_dir();
|
||||
let baseline_path = dir.join("code-sample.java.chunks.snapshot.json");
|
||||
let baseline_text = match std::fs::read_to_string(&baseline_path) {
|
||||
Ok(s) => s,
|
||||
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
|
||||
std::fs::create_dir_all(&dir).unwrap();
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
|
||||
return;
|
||||
}
|
||||
Err(e) => panic!(
|
||||
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
|
||||
baseline_path.display()
|
||||
),
|
||||
};
|
||||
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
|
||||
|
||||
if actual != expected {
|
||||
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
|
||||
eprintln!("updated baseline {}", baseline_path.display());
|
||||
return;
|
||||
}
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
panic!(
|
||||
"code-java-ast-v1 chunks snapshot drift\n\
|
||||
--- expected ({}) ---\n{baseline_text}\n\
|
||||
--- actual ---\n{pretty}\n\
|
||||
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
|
||||
baseline_path.display()
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
/// Determinism cross-check: re-running the same pipeline yields the same
|
||||
/// chunk_ids byte-for-byte.
|
||||
#[test]
|
||||
fn code_java_ast_chunks_are_deterministic() {
|
||||
let policy = fixed_policy();
|
||||
let baseline: Vec<String> = CodeJavaAstV1Chunker
|
||||
.chunk(&fixed_doc(), &policy)
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.map(|c| c.chunk_id.0)
|
||||
.collect();
|
||||
for _ in 0..5 {
|
||||
let again: Vec<String> = CodeJavaAstV1Chunker
|
||||
.chunk(&fixed_doc(), &policy)
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.map(|c| c.chunk_id.0)
|
||||
.collect();
|
||||
assert_eq!(again, baseline);
|
||||
}
|
||||
}
|
||||
221
crates/kebab-chunk/tests/code_kotlin_ast_snapshot.rs
Normal file
221
crates/kebab-chunk/tests/code_kotlin_ast_snapshot.rs
Normal file
@@ -0,0 +1,221 @@
|
||||
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
|
||||
//! representative Kotlin code `CanonicalDocument`.
|
||||
//!
|
||||
//! This is an integration test. `kebab-parse-code` is intentionally NOT
|
||||
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
|
||||
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
|
||||
//! units, which is the same pattern used in `code_rust_ast_v1.rs`'s
|
||||
//! internal `code_doc` test helper.
|
||||
//!
|
||||
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
|
||||
|
||||
use std::path::PathBuf;
|
||||
|
||||
use kebab_chunk::CodeKotlinAstV1Chunker;
|
||||
use kebab_core::{
|
||||
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
|
||||
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
|
||||
id_for_block, id_for_doc,
|
||||
};
|
||||
use serde_json::Value;
|
||||
use time::OffsetDateTime;
|
||||
|
||||
fn fixtures_dir() -> PathBuf {
|
||||
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
|
||||
.join("tests")
|
||||
.join("fixtures")
|
||||
}
|
||||
|
||||
fn fixed_doc() -> CanonicalDocument {
|
||||
let wp = WorkspacePath("src/main/kotlin/com/example/Metrics.kt".into());
|
||||
let aid = AssetId("b".repeat(64));
|
||||
// Pin parser_version so doc_id / block_ids are reproducible.
|
||||
let pv = ParserVersion("code-kotlin-v1".into());
|
||||
let doc_id = id_for_doc(&wp, &aid, &pv);
|
||||
|
||||
// Build a >200-line function body to force split_oversize.
|
||||
let big_body: String = {
|
||||
let header = "class BigCompute {\n fun compute(data: IntArray): Int {\n";
|
||||
let body: String = (0..210u32)
|
||||
.map(|i| format!(" val v{i} = if ({i} < data.size) data[{i}] else 0\n"))
|
||||
.collect();
|
||||
let footer = " return data.size\n }\n}";
|
||||
format!("{header}{body}{footer}")
|
||||
};
|
||||
let big_line_count = big_body.lines().count() as u32;
|
||||
let big_line_end = 48 + big_line_count - 1;
|
||||
|
||||
// Representative units:
|
||||
// 0. import block (lines 1–5, ≤200)
|
||||
// 1. top-level fn `computeMRR` (lines 7–12, ≤200)
|
||||
// 2. data class `MetricsCollector` (lines 14–20, ≤200)
|
||||
// 3. class `BaseEvaluator` (lines 22–30, ≤200)
|
||||
// 4. method `MetricsCollector.run` (lines 32–38, ≤200)
|
||||
// 5. method `MetricsCollector.report` (lines 40–46, ≤200)
|
||||
// 6. BigCompute (>200 lines) to force split_oversize
|
||||
let raw_units: Vec<(&str, u32, u32, String)> = vec![
|
||||
(
|
||||
"imports",
|
||||
1,
|
||||
5,
|
||||
"import kotlin.collections.List\nimport kotlin.collections.Map\nimport kotlin.collections.MutableList\nimport kotlin.collections.MutableMap\nimport kotlin.collections.mutableListOf".to_string(),
|
||||
),
|
||||
(
|
||||
"computeMRR",
|
||||
7,
|
||||
12,
|
||||
"fun computeMRR(scores: List<Double>): Double {\n if (scores.isEmpty()) {\n return 0.0\n }\n return 1.0 / scores.size\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"MetricsCollector",
|
||||
14,
|
||||
20,
|
||||
"data class MetricsCollector(\n val scores: MutableList<Double> = mutableListOf(),\n val labels: MutableList<String> = mutableListOf(),\n val counts: MutableMap<String, Int> = mutableMapOf(),\n val totals: MutableMap<String, Double> = mutableMapOf(),\n val tags: MutableList<String> = mutableListOf(),\n)".to_string(),
|
||||
),
|
||||
(
|
||||
"BaseEvaluator",
|
||||
22,
|
||||
30,
|
||||
"open class BaseEvaluator(val name: String) {\n\n fun evaluate(data: List<String>) {\n val joined = data.joinToString(\",\")\n println(joined)\n }\n\n open fun describe(): String = name\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"MetricsCollector.run",
|
||||
32,
|
||||
38,
|
||||
"fun MetricsCollector.run(inputs: List<Double>) {\n for (inp in inputs) {\n scores.add(\n inp\n )\n }\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"MetricsCollector.report",
|
||||
40,
|
||||
46,
|
||||
"fun MetricsCollector.report(): Map<String, Any> {\n return mapOf(\n \"mean\" to 0.0,\n \"count\" to scores.size,\n \"tags\" to tags,\n )\n}".to_string(),
|
||||
),
|
||||
("BigCompute", 48, big_line_end, big_body),
|
||||
];
|
||||
|
||||
let blocks: Vec<Block> = raw_units
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, (sym, ls, le, code))| {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: *ls,
|
||||
line_end: *le,
|
||||
symbol: Some((*sym).to_string()),
|
||||
lang: Some("kotlin".into()),
|
||||
};
|
||||
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
|
||||
Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id: bid,
|
||||
heading_path: vec![],
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some("kotlin".into()),
|
||||
code: code.clone(),
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
|
||||
CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: aid,
|
||||
workspace_path: wp,
|
||||
title: "Metrics.kt".into(),
|
||||
lang: Lang("und".into()),
|
||||
blocks,
|
||||
metadata: Metadata {
|
||||
aliases: vec![],
|
||||
tags: vec![],
|
||||
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Default::default(),
|
||||
repo: Some("kebab".into()),
|
||||
git_branch: Some("main".into()),
|
||||
git_commit: Some("0".repeat(40)),
|
||||
code_lang: Some("kotlin".into()),
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: pv,
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
}
|
||||
}
|
||||
|
||||
fn fixed_policy() -> ChunkPolicy {
|
||||
ChunkPolicy {
|
||||
target_tokens: 500,
|
||||
overlap_tokens: 80,
|
||||
respect_markdown_headings: false,
|
||||
chunker_version: ChunkerVersion("code-kotlin-ast-v1".into()),
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn code_kotlin_ast_chunks_snapshot() {
|
||||
let doc = fixed_doc();
|
||||
let policy = fixed_policy();
|
||||
|
||||
let chunks = CodeKotlinAstV1Chunker.chunk(&doc, &policy).expect("chunk");
|
||||
let actual = serde_json::to_value(&chunks).unwrap();
|
||||
|
||||
let dir = fixtures_dir();
|
||||
let baseline_path = dir.join("code-sample.kt.chunks.snapshot.json");
|
||||
let baseline_text = match std::fs::read_to_string(&baseline_path) {
|
||||
Ok(s) => s,
|
||||
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
|
||||
std::fs::create_dir_all(&dir).unwrap();
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
|
||||
return;
|
||||
}
|
||||
Err(e) => panic!(
|
||||
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
|
||||
baseline_path.display()
|
||||
),
|
||||
};
|
||||
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
|
||||
|
||||
if actual != expected {
|
||||
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
|
||||
eprintln!("updated baseline {}", baseline_path.display());
|
||||
return;
|
||||
}
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
panic!(
|
||||
"code-kotlin-ast-v1 chunks snapshot drift\n\
|
||||
--- expected ({}) ---\n{baseline_text}\n\
|
||||
--- actual ---\n{pretty}\n\
|
||||
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
|
||||
baseline_path.display()
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
/// Determinism cross-check: re-running the same pipeline yields the same
|
||||
/// chunk_ids byte-for-byte.
|
||||
#[test]
|
||||
fn code_kotlin_ast_chunks_are_deterministic() {
|
||||
let policy = fixed_policy();
|
||||
let baseline: Vec<String> = CodeKotlinAstV1Chunker
|
||||
.chunk(&fixed_doc(), &policy)
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.map(|c| c.chunk_id.0)
|
||||
.collect();
|
||||
for _ in 0..5 {
|
||||
let again: Vec<String> = CodeKotlinAstV1Chunker
|
||||
.chunk(&fixed_doc(), &policy)
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.map(|c| c.chunk_id.0)
|
||||
.collect();
|
||||
assert_eq!(again, baseline);
|
||||
}
|
||||
}
|
||||
270
crates/kebab-chunk/tests/code_text_paragraph_v1.rs
Normal file
270
crates/kebab-chunk/tests/code_text_paragraph_v1.rs
Normal file
@@ -0,0 +1,270 @@
|
||||
//! Behavioural tests for `CodeTextParagraphV1Chunker`.
|
||||
//!
|
||||
//! Documents are constructed manually (no kebab-parse-code dependency) by
|
||||
//! placing raw text into a single `Block::Code`, mirroring the pattern used
|
||||
//! in `k8s_manifest_resource_v1.rs`.
|
||||
|
||||
use std::path::PathBuf;
|
||||
|
||||
use kebab_chunk::CodeTextParagraphV1Chunker;
|
||||
use kebab_core::{
|
||||
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
|
||||
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
|
||||
WorkspacePath, id_for_block, id_for_doc,
|
||||
};
|
||||
use time::OffsetDateTime;
|
||||
|
||||
// ── helpers ──────────────────────────────────────────────────────────────────
|
||||
|
||||
fn fixtures_dir() -> PathBuf {
|
||||
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
|
||||
.join("tests")
|
||||
.join("fixtures")
|
||||
}
|
||||
|
||||
/// Build a `CanonicalDocument` with a single `Block::Code` containing `text`
|
||||
/// and the supplied `lang` label.
|
||||
fn text_doc(lang: &str, text: &str) -> CanonicalDocument {
|
||||
let wp = WorkspacePath("scripts/sample.sh".into());
|
||||
let aid = AssetId("d".repeat(64));
|
||||
let pv = ParserVersion("code-text-paragraph-v1".into());
|
||||
let doc_id = id_for_doc(&wp, &aid, &pv);
|
||||
|
||||
let line_count = text.lines().count() as u32;
|
||||
let span = SourceSpan::Code {
|
||||
line_start: 1,
|
||||
line_end: line_count.max(1),
|
||||
symbol: None,
|
||||
lang: Some(lang.into()),
|
||||
};
|
||||
let bid = id_for_block(&doc_id, "code", &[], 0, &span);
|
||||
let block = Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id: bid,
|
||||
heading_path: vec![],
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some(lang.into()),
|
||||
code: text.to_string(),
|
||||
});
|
||||
|
||||
CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: aid,
|
||||
workspace_path: wp,
|
||||
title: "sample.sh".into(),
|
||||
lang: Lang("und".into()),
|
||||
blocks: vec![block],
|
||||
metadata: Metadata {
|
||||
aliases: vec![],
|
||||
tags: vec![],
|
||||
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Default::default(),
|
||||
repo: Some("kebab".into()),
|
||||
git_branch: Some("main".into()),
|
||||
git_commit: Some("0".repeat(40)),
|
||||
code_lang: Some(lang.into()),
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: pv,
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
}
|
||||
}
|
||||
|
||||
fn policy() -> ChunkPolicy {
|
||||
ChunkPolicy {
|
||||
target_tokens: 500,
|
||||
overlap_tokens: 80,
|
||||
respect_markdown_headings: false,
|
||||
chunker_version: ChunkerVersion("code-text-paragraph-v1".into()),
|
||||
}
|
||||
}
|
||||
|
||||
// ── tests ─────────────────────────────────────────────────────────────────────
|
||||
|
||||
/// `sample_shell.sh` has 4 paragraphs separated by 3 blank lines:
|
||||
/// - paragraph 1: lines 1-2 (shebang + set -euo pipefail)
|
||||
/// - paragraph 2: lines 4-7 (env setup block)
|
||||
/// - paragraph 3: lines 9-11 (ingest block)
|
||||
/// - paragraph 4: lines 13-15 (report block)
|
||||
///
|
||||
/// We assert:
|
||||
/// - exactly 4 chunks (one per paragraph)
|
||||
/// - all symbols are None (Tier 3 spec §9.3)
|
||||
/// - all langs are "shell"
|
||||
/// - line ranges are strictly ascending and do NOT include the blank lines
|
||||
/// (lines 3, 8, 12 must not appear in any range)
|
||||
#[test]
|
||||
fn shell_multi_paragraph_splits_on_blank_lines() {
|
||||
let fixture_path = fixtures_dir().join("sample_shell.sh");
|
||||
let text = std::fs::read_to_string(&fixture_path)
|
||||
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
|
||||
|
||||
let doc = text_doc("shell", &text);
|
||||
let chunks = CodeTextParagraphV1Chunker
|
||||
.chunk(&doc, &policy())
|
||||
.expect("chunk");
|
||||
|
||||
assert_eq!(
|
||||
chunks.len(),
|
||||
4,
|
||||
"expected 4 chunks (one per paragraph), got {}: {chunks:#?}",
|
||||
chunks.len()
|
||||
);
|
||||
|
||||
// All symbols must be None (Tier 3 requirement).
|
||||
for (i, chunk) in chunks.iter().enumerate() {
|
||||
match &chunk.source_spans[0] {
|
||||
SourceSpan::Code { symbol, .. } => {
|
||||
assert!(
|
||||
symbol.is_none(),
|
||||
"chunk[{i}] symbol must be None for Tier 3 chunker, got {symbol:?}"
|
||||
);
|
||||
}
|
||||
other => panic!("chunk[{i}]: expected Code span, got {other:?}"),
|
||||
}
|
||||
}
|
||||
|
||||
// All langs must be "shell".
|
||||
for (i, chunk) in chunks.iter().enumerate() {
|
||||
match &chunk.source_spans[0] {
|
||||
SourceSpan::Code { lang, .. } => {
|
||||
assert_eq!(
|
||||
lang.as_deref(),
|
||||
Some("shell"),
|
||||
"chunk[{i}] lang must be 'shell', got {lang:?}"
|
||||
);
|
||||
}
|
||||
other => panic!("chunk[{i}]: expected Code span, got {other:?}"),
|
||||
}
|
||||
}
|
||||
|
||||
// Line ranges must be strictly ascending with no overlap,
|
||||
// and blank lines (3, 8, 12) must not be included in any range.
|
||||
let expected_ranges: &[(u32, u32)] = &[(1, 2), (4, 7), (9, 11), (13, 15)];
|
||||
let actual_ranges: Vec<(u32, u32)> = chunks
|
||||
.iter()
|
||||
.map(|c| match &c.source_spans[0] {
|
||||
SourceSpan::Code {
|
||||
line_start,
|
||||
line_end,
|
||||
..
|
||||
} => (*line_start, *line_end),
|
||||
other => panic!("expected Code span, got {other:?}"),
|
||||
})
|
||||
.collect();
|
||||
|
||||
assert_eq!(
|
||||
actual_ranges, expected_ranges,
|
||||
"line ranges mismatch: got {actual_ranges:?}, expected {expected_ranges:?}"
|
||||
);
|
||||
}
|
||||
|
||||
/// `sample_long_paragraph.txt` has exactly 200 non-blank lines and no blank
|
||||
/// lines, so the entire file is one paragraph. 200 > 80 (FALLBACK_LINES_PER_CHUNK),
|
||||
/// so the oversize window split fires with stride 60:
|
||||
/// - window 1: lines 1-80
|
||||
/// - window 2: lines 61-140
|
||||
/// - window 3: lines 121-200
|
||||
///
|
||||
/// All chunk_ids must be distinct (the #L{window_start} split_key suffix).
|
||||
#[test]
|
||||
fn single_long_paragraph_line_window_split() {
|
||||
let fixture_path = fixtures_dir().join("sample_long_paragraph.txt");
|
||||
let text = std::fs::read_to_string(&fixture_path)
|
||||
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
|
||||
|
||||
assert_eq!(
|
||||
text.lines().count(),
|
||||
200,
|
||||
"fixture must have exactly 200 lines"
|
||||
);
|
||||
|
||||
let doc = text_doc("shell", &text);
|
||||
let chunks = CodeTextParagraphV1Chunker
|
||||
.chunk(&doc, &policy())
|
||||
.expect("chunk");
|
||||
|
||||
assert_eq!(
|
||||
chunks.len(),
|
||||
3,
|
||||
"expected 3 window chunks for 200-line paragraph, got {}: {chunks:#?}",
|
||||
chunks.len()
|
||||
);
|
||||
|
||||
let expected_ranges: &[(u32, u32)] = &[(1, 80), (61, 140), (121, 200)];
|
||||
let actual_ranges: Vec<(u32, u32)> = chunks
|
||||
.iter()
|
||||
.map(|c| match &c.source_spans[0] {
|
||||
SourceSpan::Code {
|
||||
line_start,
|
||||
line_end,
|
||||
..
|
||||
} => (*line_start, *line_end),
|
||||
other => panic!("expected Code span, got {other:?}"),
|
||||
})
|
||||
.collect();
|
||||
|
||||
assert_eq!(
|
||||
actual_ranges, expected_ranges,
|
||||
"window ranges mismatch: got {actual_ranges:?}, expected {expected_ranges:?}"
|
||||
);
|
||||
|
||||
// All chunk_ids must be distinct (#L{window_start} suffix differentiates them).
|
||||
let ids: std::collections::HashSet<_> = chunks.iter().map(|c| c.chunk_id.clone()).collect();
|
||||
assert_eq!(
|
||||
ids.len(),
|
||||
chunks.len(),
|
||||
"oversize window chunks must have distinct chunk_ids"
|
||||
);
|
||||
}
|
||||
|
||||
/// An empty source file (no non-blank lines) must yield zero chunks.
|
||||
#[test]
|
||||
fn empty_file_emits_zero_chunks() {
|
||||
let doc = text_doc("shell", "");
|
||||
let chunks = CodeTextParagraphV1Chunker
|
||||
.chunk(&doc, &policy())
|
||||
.expect("chunk");
|
||||
|
||||
assert_eq!(
|
||||
chunks.len(),
|
||||
0,
|
||||
"empty file must yield 0 chunks, got {}: {chunks:#?}",
|
||||
chunks.len()
|
||||
);
|
||||
}
|
||||
|
||||
/// The `lang` field on each emitted chunk must match the `lang` passed to
|
||||
/// `text_doc`, regardless of content. `symbol` must be `None` (Tier 3 spec).
|
||||
#[test]
|
||||
fn lang_field_preserved_from_input_doc() {
|
||||
let doc = text_doc("yaml", "key1: value1\nkey2: value2\n");
|
||||
let chunks = CodeTextParagraphV1Chunker
|
||||
.chunk(&doc, &policy())
|
||||
.expect("chunk");
|
||||
|
||||
assert!(!chunks.is_empty(), "expected at least one chunk");
|
||||
|
||||
match &chunks[0].source_spans[0] {
|
||||
SourceSpan::Code { lang, symbol, .. } => {
|
||||
assert_eq!(
|
||||
lang.as_deref(),
|
||||
Some("yaml"),
|
||||
"lang must be 'yaml', got {lang:?}"
|
||||
);
|
||||
assert!(
|
||||
symbol.is_none(),
|
||||
"symbol must be None for Tier 3 chunker, got {symbol:?}"
|
||||
);
|
||||
}
|
||||
other => panic!("expected Code span, got {other:?}"),
|
||||
}
|
||||
}
|
||||
134
crates/kebab-chunk/tests/dockerfile_file_v1.rs
Normal file
134
crates/kebab-chunk/tests/dockerfile_file_v1.rs
Normal file
@@ -0,0 +1,134 @@
|
||||
//! Behavioural tests for `DockerfileFileV1Chunker`.
|
||||
//!
|
||||
//! Documents are constructed manually (no kebab-parse-code dependency) by
|
||||
//! placing the raw Dockerfile text into a single `Block::Code`, mirroring the
|
||||
//! pattern used in `k8s_manifest_resource_v1.rs`.
|
||||
|
||||
use std::path::PathBuf;
|
||||
|
||||
use kebab_chunk::DockerfileFileV1Chunker;
|
||||
use kebab_core::{
|
||||
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
|
||||
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
|
||||
WorkspacePath, id_for_block, id_for_doc,
|
||||
};
|
||||
use time::OffsetDateTime;
|
||||
|
||||
// ── helpers ──────────────────────────────────────────────────────────────────
|
||||
|
||||
fn fixtures_dir() -> PathBuf {
|
||||
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
|
||||
.join("tests")
|
||||
.join("fixtures")
|
||||
}
|
||||
|
||||
/// Build a `CanonicalDocument` with a single `Block::Code` containing `dockerfile_text`.
|
||||
fn dockerfile_doc(dockerfile_text: &str) -> CanonicalDocument {
|
||||
let wp = WorkspacePath("build/Dockerfile".into());
|
||||
let aid = AssetId("d".repeat(64));
|
||||
let pv = ParserVersion("code-dockerfile-v1".into());
|
||||
let doc_id = id_for_doc(&wp, &aid, &pv);
|
||||
|
||||
let line_count = dockerfile_text.lines().count() as u32;
|
||||
let span = SourceSpan::Code {
|
||||
line_start: 1,
|
||||
line_end: line_count.max(1),
|
||||
symbol: None,
|
||||
lang: Some("dockerfile".into()),
|
||||
};
|
||||
let bid = id_for_block(&doc_id, "code", &[], 0, &span);
|
||||
let block = Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id: bid,
|
||||
heading_path: vec![],
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some("dockerfile".into()),
|
||||
code: dockerfile_text.to_string(),
|
||||
});
|
||||
|
||||
CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: aid,
|
||||
workspace_path: wp,
|
||||
title: "Dockerfile".into(),
|
||||
lang: Lang("und".into()),
|
||||
blocks: vec![block],
|
||||
metadata: Metadata {
|
||||
aliases: vec![],
|
||||
tags: vec![],
|
||||
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Default::default(),
|
||||
repo: Some("kebab".into()),
|
||||
git_branch: Some("main".into()),
|
||||
git_commit: Some("0".repeat(40)),
|
||||
code_lang: Some("dockerfile".into()),
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: pv,
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
}
|
||||
}
|
||||
|
||||
fn policy() -> ChunkPolicy {
|
||||
ChunkPolicy {
|
||||
target_tokens: 500,
|
||||
overlap_tokens: 80,
|
||||
respect_markdown_headings: false,
|
||||
chunker_version: ChunkerVersion("dockerfile-file-v1".into()),
|
||||
}
|
||||
}
|
||||
|
||||
// ── tests ─────────────────────────────────────────────────────────────────────
|
||||
|
||||
/// A simple 5-line Dockerfile fixture must emit exactly 1 chunk with the
|
||||
/// correct symbol, lang, and line range.
|
||||
#[test]
|
||||
fn dockerfile_emits_single_chunk() {
|
||||
let fixture_path = fixtures_dir().join("sample.dockerfile");
|
||||
let text = std::fs::read_to_string(&fixture_path)
|
||||
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
|
||||
|
||||
let doc = dockerfile_doc(&text);
|
||||
let chunks = DockerfileFileV1Chunker
|
||||
.chunk(&doc, &policy())
|
||||
.expect("chunk");
|
||||
|
||||
assert_eq!(
|
||||
chunks.len(),
|
||||
1,
|
||||
"expected 1 chunk, got {}: {chunks:#?}",
|
||||
chunks.len()
|
||||
);
|
||||
|
||||
// Inspect the Chunk's source_spans for symbol / lang / line range.
|
||||
let span = chunks[0].source_spans.first().expect("at least one span");
|
||||
match span {
|
||||
SourceSpan::Code {
|
||||
line_start,
|
||||
line_end,
|
||||
symbol,
|
||||
lang,
|
||||
} => {
|
||||
assert_eq!(*line_start, 1, "line_start must be 1");
|
||||
assert_eq!(*line_end, 5, "line_end must be 5 (5-line fixture)");
|
||||
assert_eq!(
|
||||
symbol.as_deref(),
|
||||
Some("<dockerfile>"),
|
||||
"symbol must be '<dockerfile>'"
|
||||
);
|
||||
assert_eq!(lang.as_deref(), Some("dockerfile"), "lang must be 'dockerfile'");
|
||||
}
|
||||
other => panic!("expected SourceSpan::Code, got {other:?}"),
|
||||
}
|
||||
|
||||
// Verify chunker_version label.
|
||||
assert_eq!(chunks[0].chunker_version.0, "dockerfile-file-v1");
|
||||
}
|
||||
86
crates/kebab-chunk/tests/fixtures/code-sample.c.chunks.snapshot.json
vendored
Normal file
86
crates/kebab-chunk/tests/fixtures/code-sample.c.chunks.snapshot.json
vendored
Normal file
@@ -0,0 +1,86 @@
|
||||
[
|
||||
{
|
||||
"block_ids": [
|
||||
"8149e12ca002489acb4a0f74c97a061a"
|
||||
],
|
||||
"chunk_id": "ec3cf06ae56c8e9796bbc9196438b7c5",
|
||||
"chunker_version": "code-c-ast-v1",
|
||||
"doc_id": "6bec42dd593920a060541db16c4e8e45",
|
||||
"heading_path": [],
|
||||
"policy_hash": "ecfad2ec1223662d",
|
||||
"source_spans": [
|
||||
{
|
||||
"kind": "code",
|
||||
"lang": "c",
|
||||
"line_end": 18,
|
||||
"line_start": 1,
|
||||
"symbol": "<top-level>"
|
||||
}
|
||||
],
|
||||
"text": "#include <stdio.h>\n#include <stdlib.h>\n\n#define MAX_BUF 4096\n\ntypedef enum {\n OK = 0,\n ERR_PARSE,\n ERR_IO,\n} status_t;\n\ntypedef struct {\n int id;\n char name[64];\n status_t status;\n} record_t;\n\nstatic int counter = 0;",
|
||||
"token_estimate": 78
|
||||
},
|
||||
{
|
||||
"block_ids": [
|
||||
"1baaa89f21a47b2f32d6396a24a85454"
|
||||
],
|
||||
"chunk_id": "c2d7a81c898106733ef2e703774a6a4a",
|
||||
"chunker_version": "code-c-ast-v1",
|
||||
"doc_id": "6bec42dd593920a060541db16c4e8e45",
|
||||
"heading_path": [],
|
||||
"policy_hash": "ecfad2ec1223662d",
|
||||
"source_spans": [
|
||||
{
|
||||
"kind": "code",
|
||||
"lang": "c",
|
||||
"line_end": 23,
|
||||
"line_start": 20,
|
||||
"symbol": "parse_record"
|
||||
}
|
||||
],
|
||||
"text": "int parse_record(const char *line, record_t *out) {\n if (line == NULL || out == NULL) return ERR_PARSE;\n return OK;\n}",
|
||||
"token_estimate": 41
|
||||
},
|
||||
{
|
||||
"block_ids": [
|
||||
"8d0e14cbcc6d1e92d7878ab796ea68b8"
|
||||
],
|
||||
"chunk_id": "0e4d7b131ab64eba03b51903b5d8f96d",
|
||||
"chunker_version": "code-c-ast-v1",
|
||||
"doc_id": "6bec42dd593920a060541db16c4e8e45",
|
||||
"heading_path": [],
|
||||
"policy_hash": "ecfad2ec1223662d",
|
||||
"source_spans": [
|
||||
{
|
||||
"kind": "code",
|
||||
"lang": "c",
|
||||
"line_end": 27,
|
||||
"line_start": 25,
|
||||
"symbol": "print_record"
|
||||
}
|
||||
],
|
||||
"text": "void print_record(const record_t *r) {\n printf(\"[%d] %s (status=%d)\\n\", r->id, r->name, r->status);\n}",
|
||||
"token_estimate": 35
|
||||
},
|
||||
{
|
||||
"block_ids": [
|
||||
"9c2ede84423871b615d48c38fefb1853"
|
||||
],
|
||||
"chunk_id": "e076f8edb2ff141d7e99b4106bb95157",
|
||||
"chunker_version": "code-c-ast-v1",
|
||||
"doc_id": "6bec42dd593920a060541db16c4e8e45",
|
||||
"heading_path": [],
|
||||
"policy_hash": "ecfad2ec1223662d",
|
||||
"source_spans": [
|
||||
{
|
||||
"kind": "code",
|
||||
"lang": "c",
|
||||
"line_end": 33,
|
||||
"line_start": 29,
|
||||
"symbol": "main"
|
||||
}
|
||||
],
|
||||
"text": "int main(void) {\n record_t r = { .id = 1, .name = \"foo\", .status = OK };\n print_record(&r);\n return 0;\n}",
|
||||
"token_estimate": 38
|
||||
}
|
||||
]
|
||||
107
crates/kebab-chunk/tests/fixtures/code-sample.cpp.chunks.snapshot.json
vendored
Normal file
107
crates/kebab-chunk/tests/fixtures/code-sample.cpp.chunks.snapshot.json
vendored
Normal file
@@ -0,0 +1,107 @@
|
||||
[
|
||||
{
|
||||
"block_ids": [
|
||||
"53292605459065d170cd36c118e20546"
|
||||
],
|
||||
"chunk_id": "50a5b324300d9082eac4ce2a422810e1",
|
||||
"chunker_version": "code-cpp-ast-v1",
|
||||
"doc_id": "fff1e1f0a7ff70ef682937470e5d1d28",
|
||||
"heading_path": [],
|
||||
"policy_hash": "71f3c07bb9ec1d09",
|
||||
"source_spans": [
|
||||
{
|
||||
"kind": "code",
|
||||
"lang": "cpp",
|
||||
"line_end": 4,
|
||||
"line_start": 1,
|
||||
"symbol": "<top-level>"
|
||||
}
|
||||
],
|
||||
"text": "#include <string>\n#include <vector>\n\nnamespace kebab {",
|
||||
"token_estimate": 18
|
||||
},
|
||||
{
|
||||
"block_ids": [
|
||||
"f349acad94c9fa4cf9ad1c0a93e83610"
|
||||
],
|
||||
"chunk_id": "0e6bc7c522665af8a4b0f66afb9d29c8",
|
||||
"chunker_version": "code-cpp-ast-v1",
|
||||
"doc_id": "fff1e1f0a7ff70ef682937470e5d1d28",
|
||||
"heading_path": [],
|
||||
"policy_hash": "71f3c07bb9ec1d09",
|
||||
"source_spans": [
|
||||
{
|
||||
"kind": "code",
|
||||
"lang": "cpp",
|
||||
"line_end": 20,
|
||||
"line_start": 6,
|
||||
"symbol": "kebab::chunk::MdHeadingV1Chunker"
|
||||
}
|
||||
],
|
||||
"text": "class MdHeadingV1Chunker {\npublic:\n MdHeadingV1Chunker() = default;\n ~MdHeadingV1Chunker() = default;\n\n std::string chunk_doc(const std::string& doc) {\n return doc;\n }\n\n int operator()(int x) const {\n return x * 2;\n }\n\nprivate:\n int counter_ = 0;\n};",
|
||||
"token_estimate": 95
|
||||
},
|
||||
{
|
||||
"block_ids": [
|
||||
"8b9811387717d0bd4abf84abcc35b8b1"
|
||||
],
|
||||
"chunk_id": "d9326d252905b665b2adb9a416c20451",
|
||||
"chunker_version": "code-cpp-ast-v1",
|
||||
"doc_id": "fff1e1f0a7ff70ef682937470e5d1d28",
|
||||
"heading_path": [],
|
||||
"policy_hash": "71f3c07bb9ec1d09",
|
||||
"source_spans": [
|
||||
{
|
||||
"kind": "code",
|
||||
"lang": "cpp",
|
||||
"line_end": 25,
|
||||
"line_start": 22,
|
||||
"symbol": "kebab::identity"
|
||||
}
|
||||
],
|
||||
"text": "template <typename T>\nT identity(T value) {\n return value;\n}",
|
||||
"token_estimate": 21
|
||||
},
|
||||
{
|
||||
"block_ids": [
|
||||
"1754cb6b971f6a4cb292f144a4f0570b"
|
||||
],
|
||||
"chunk_id": "56ee5f991de4a413c016da8dc4acfc35",
|
||||
"chunker_version": "code-cpp-ast-v1",
|
||||
"doc_id": "fff1e1f0a7ff70ef682937470e5d1d28",
|
||||
"heading_path": [],
|
||||
"policy_hash": "71f3c07bb9ec1d09",
|
||||
"source_spans": [
|
||||
{
|
||||
"kind": "code",
|
||||
"lang": "cpp",
|
||||
"line_end": 29,
|
||||
"line_start": 27,
|
||||
"symbol": "kebab::global_helper"
|
||||
}
|
||||
],
|
||||
"text": "void global_helper() {\n // free function in kebab namespace\n}",
|
||||
"token_estimate": 22
|
||||
},
|
||||
{
|
||||
"block_ids": [
|
||||
"14b5f3393d6d25f822f5b70763d24acd"
|
||||
],
|
||||
"chunk_id": "c0d7c043cdd575c530db3909b54cc906",
|
||||
"chunker_version": "code-cpp-ast-v1",
|
||||
"doc_id": "fff1e1f0a7ff70ef682937470e5d1d28",
|
||||
"heading_path": [],
|
||||
"policy_hash": "71f3c07bb9ec1d09",
|
||||
"source_spans": [
|
||||
{
|
||||
"kind": "code",
|
||||
"lang": "cpp",
|
||||
"line_end": 34,
|
||||
"line_start": 31,
|
||||
"symbol": "main"
|
||||
}
|
||||
],
|
||||
"text": "int main() {\n kebab::chunk::MdHeadingV1Chunker c;\n return 0;\n}",
|
||||
"token_estimate": 23
|
||||
}
|
||||
]
|
||||
233
crates/kebab-chunk/tests/fixtures/code-sample.go.chunks.snapshot.json
vendored
Normal file
233
crates/kebab-chunk/tests/fixtures/code-sample.go.chunks.snapshot.json
vendored
Normal file
@@ -0,0 +1,233 @@
|
||||
[
|
||||
{
|
||||
"block_ids": [
|
||||
"c182bf37e32c7fc1b868bd617f8eaf66"
|
||||
],
|
||||
"chunk_id": "43de518d946dc18ec040ae20d74e0cff",
|
||||
"chunker_version": "code-go-ast-v1",
|
||||
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
|
||||
"heading_path": [],
|
||||
"policy_hash": "6cfe77abe2b0e5c3",
|
||||
"source_spans": [
|
||||
{
|
||||
"kind": "code",
|
||||
"lang": "go",
|
||||
"line_end": 5,
|
||||
"line_start": 1,
|
||||
"symbol": "imports"
|
||||
}
|
||||
],
|
||||
"text": "import (\n\t\"fmt\"\n\t\"os\"\n\t\"strings\"\n)",
|
||||
"token_estimate": 12
|
||||
},
|
||||
{
|
||||
"block_ids": [
|
||||
"c9992cdcfdf3c2a7700a4abc4782a8a4"
|
||||
],
|
||||
"chunk_id": "af4c382a83f1e8cdea495d8b33c11abc",
|
||||
"chunker_version": "code-go-ast-v1",
|
||||
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
|
||||
"heading_path": [],
|
||||
"policy_hash": "6cfe77abe2b0e5c3",
|
||||
"source_spans": [
|
||||
{
|
||||
"kind": "code",
|
||||
"lang": "go",
|
||||
"line_end": 12,
|
||||
"line_start": 7,
|
||||
"symbol": "ComputeMRR"
|
||||
}
|
||||
],
|
||||
"text": "func ComputeMRR(scores []float64) float64 {\n\tif len(scores) == 0 {\n\t\treturn 0.0\n\t}\n\t_ = fmt.Sprintf(\"%v\", scores)\n\treturn 1.0 / float64(len(scores))\n}",
|
||||
"token_estimate": 50
|
||||
},
|
||||
{
|
||||
"block_ids": [
|
||||
"5f18dc3e79fe946ba05d32c3bfc00684"
|
||||
],
|
||||
"chunk_id": "4be6d8f180bc19b8651877e5264852ac",
|
||||
"chunker_version": "code-go-ast-v1",
|
||||
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
|
||||
"heading_path": [],
|
||||
"policy_hash": "6cfe77abe2b0e5c3",
|
||||
"source_spans": [
|
||||
{
|
||||
"kind": "code",
|
||||
"lang": "go",
|
||||
"line_end": 20,
|
||||
"line_start": 14,
|
||||
"symbol": "MetricsCollector"
|
||||
}
|
||||
],
|
||||
"text": "type MetricsCollector struct {\n\tScores []float64\n\tLabels []string\n\tCounts map[string]int\n\tTotals map[string]float64\n\tTags []string\n}",
|
||||
"token_estimate": 45
|
||||
},
|
||||
{
|
||||
"block_ids": [
|
||||
"3009cc022ca832c323393e4f9bcdb388"
|
||||
],
|
||||
"chunk_id": "3ae182f4c6d304ee7f0aaf447142f948",
|
||||
"chunker_version": "code-go-ast-v1",
|
||||
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
|
||||
"heading_path": [],
|
||||
"policy_hash": "6cfe77abe2b0e5c3",
|
||||
"source_spans": [
|
||||
{
|
||||
"kind": "code",
|
||||
"lang": "go",
|
||||
"line_end": 30,
|
||||
"line_start": 22,
|
||||
"symbol": "BaseEvaluator"
|
||||
}
|
||||
],
|
||||
"text": "type BaseEvaluator struct {\n\tName string\n}\n\nfunc (e *BaseEvaluator) Evaluate(data []string) error {\n\t_ = os.Stderr\n\t_ = strings.Join(data, \",\")\n\treturn nil\n}",
|
||||
"token_estimate": 53
|
||||
},
|
||||
{
|
||||
"block_ids": [
|
||||
"e0e83d1d7f9327a1902ae9a8f67c1f1c"
|
||||
],
|
||||
"chunk_id": "b962f14980e756bb8ba514e2282756cd",
|
||||
"chunker_version": "code-go-ast-v1",
|
||||
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
|
||||
"heading_path": [],
|
||||
"policy_hash": "6cfe77abe2b0e5c3",
|
||||
"source_spans": [
|
||||
{
|
||||
"kind": "code",
|
||||
"lang": "go",
|
||||
"line_end": 38,
|
||||
"line_start": 32,
|
||||
"symbol": "MetricsCollector.Run"
|
||||
}
|
||||
],
|
||||
"text": "func (m *MetricsCollector) Run(inputs []float64) {\n\tfor _, inp := range inputs {\n\t\tm.Scores = append(\n\t\t\tm.Scores,\n\t\t\tinp,\n\t\t)\n\t}\n}",
|
||||
"token_estimate": 44
|
||||
},
|
||||
{
|
||||
"block_ids": [
|
||||
"0e6a572bc3fe2bd6d173fe614bd1b763"
|
||||
],
|
||||
"chunk_id": "441c695e990e7f49188068433e313e87",
|
||||
"chunker_version": "code-go-ast-v1",
|
||||
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
|
||||
"heading_path": [],
|
||||
"policy_hash": "6cfe77abe2b0e5c3",
|
||||
"source_spans": [
|
||||
{
|
||||
"kind": "code",
|
||||
"lang": "go",
|
||||
"line_end": 46,
|
||||
"line_start": 40,
|
||||
"symbol": "MetricsCollector.Report"
|
||||
}
|
||||
],
|
||||
"text": "func (m *MetricsCollector) Report() map[string]interface{} {\n\treturn map[string]interface{}{\n\t\t\"mean\": 0.0,\n\t\t\"count\": len(m.Scores),\n\t\t\"tags\": m.Tags,\n\t}\n}",
|
||||
"token_estimate": 53
|
||||
},
|
||||
{
|
||||
"block_ids": [
|
||||
"5d269745b2e5dbdcbef0c09ba54b0bd6"
|
||||
],
|
||||
"chunk_id": "7a942d871c588ec69426290561f05179",
|
||||
"chunker_version": "code-go-ast-v1",
|
||||
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
|
||||
"heading_path": [],
|
||||
"policy_hash": "6cfe77abe2b0e5c3",
|
||||
"source_spans": [
|
||||
{
|
||||
"kind": "code",
|
||||
"lang": "go",
|
||||
"line_end": 247,
|
||||
"line_start": 48,
|
||||
"symbol": "BigCompute [part 1/5]"
|
||||
}
|
||||
],
|
||||
"text": "func BigCompute(data []int) int {\n\tv0 := 0\n\tif 0 < len(data) {\n\t\tv0 = data[0]\n\t}\n\tv1 := 0\n\tif 1 < len(data) {\n\t\tv1 = data[1]\n\t}\n\tv2 := 0\n\tif 2 < len(data) {\n\t\tv2 = data[2]\n\t}\n\tv3 := 0\n\tif 3 < len(data) {\n\t\tv3 = data[3]\n\t}\n\tv4 := 0\n\tif 4 < len(data) {\n\t\tv4 = data[4]\n\t}\n\tv5 := 0\n\tif 5 < len(data) {\n\t\tv5 = data[5]\n\t}\n\tv6 := 0\n\tif 6 < len(data) {\n\t\tv6 = data[6]\n\t}\n\tv7 := 0\n\tif 7 < len(data) {\n\t\tv7 = data[7]\n\t}\n\tv8 := 0\n\tif 8 < len(data) {\n\t\tv8 = data[8]\n\t}\n\tv9 := 0\n\tif 9 < len(data) {\n\t\tv9 = data[9]\n\t}\n\tv10 := 0\n\tif 10 < len(data) {\n\t\tv10 = data[10]\n\t}\n\tv11 := 0\n\tif 11 < len(data) {\n\t\tv11 = data[11]\n\t}\n\tv12 := 0\n\tif 12 < len(data) {\n\t\tv12 = data[12]\n\t}\n\tv13 := 0\n\tif 13 < len(data) {\n\t\tv13 = data[13]\n\t}\n\tv14 := 0\n\tif 14 < len(data) {\n\t\tv14 = data[14]\n\t}\n\tv15 := 0\n\tif 15 < len(data) {\n\t\tv15 = data[15]\n\t}\n\tv16 := 0\n\tif 16 < len(data) {\n\t\tv16 = data[16]\n\t}\n\tv17 := 0\n\tif 17 < len(data) {\n\t\tv17 = data[17]\n\t}\n\tv18 := 0\n\tif 18 < len(data) {\n\t\tv18 = data[18]\n\t}\n\tv19 := 0\n\tif 19 < len(data) {\n\t\tv19 = data[19]\n\t}\n\tv20 := 0\n\tif 20 < len(data) {\n\t\tv20 = data[20]\n\t}\n\tv21 := 0\n\tif 21 < len(data) {\n\t\tv21 = data[21]\n\t}\n\tv22 := 0\n\tif 22 < len(data) {\n\t\tv22 = data[22]\n\t}\n\tv23 := 0\n\tif 23 < len(data) {\n\t\tv23 = data[23]\n\t}\n\tv24 := 0\n\tif 24 < len(data) {\n\t\tv24 = data[24]\n\t}\n\tv25 := 0\n\tif 25 < len(data) {\n\t\tv25 = data[25]\n\t}\n\tv26 := 0\n\tif 26 < len(data) {\n\t\tv26 = data[26]\n\t}\n\tv27 := 0\n\tif 27 < len(data) {\n\t\tv27 = data[27]\n\t}\n\tv28 := 0\n\tif 28 < len(data) {\n\t\tv28 = data[28]\n\t}\n\tv29 := 0\n\tif 29 < len(data) {\n\t\tv29 = data[29]\n\t}\n\tv30 := 0\n\tif 30 < len(data) {\n\t\tv30 = data[30]\n\t}\n\tv31 := 0\n\tif 31 < len(data) {\n\t\tv31 = data[31]\n\t}\n\tv32 := 0\n\tif 32 < len(data) {\n\t\tv32 = data[32]\n\t}\n\tv33 := 0\n\tif 33 < len(data) {\n\t\tv33 = data[33]\n\t}\n\tv34 := 0\n\tif 34 < len(data) {\n\t\tv34 = data[34]\n\t}\n\tv35 := 0\n\tif 35 < len(data) {\n\t\tv35 = data[35]\n\t}\n\tv36 := 0\n\tif 36 < len(data) {\n\t\tv36 = data[36]\n\t}\n\tv37 := 0\n\tif 37 < len(data) {\n\t\tv37 = data[37]\n\t}\n\tv38 := 0\n\tif 38 < len(data) {\n\t\tv38 = data[38]\n\t}\n\tv39 := 0\n\tif 39 < len(data) {\n\t\tv39 = data[39]\n\t}\n\tv40 := 0\n\tif 40 < len(data) {\n\t\tv40 = data[40]\n\t}\n\tv41 := 0\n\tif 41 < len(data) {\n\t\tv41 = data[41]\n\t}\n\tv42 := 0\n\tif 42 < len(data) {\n\t\tv42 = data[42]\n\t}\n\tv43 := 0\n\tif 43 < len(data) {\n\t\tv43 = data[43]\n\t}\n\tv44 := 0\n\tif 44 < len(data) {\n\t\tv44 = data[44]\n\t}\n\tv45 := 0\n\tif 45 < len(data) {\n\t\tv45 = data[45]\n\t}\n\tv46 := 0\n\tif 46 < len(data) {\n\t\tv46 = data[46]\n\t}\n\tv47 := 0\n\tif 47 < len(data) {\n\t\tv47 = data[47]\n\t}\n\tv48 := 0\n\tif 48 < len(data) {\n\t\tv48 = data[48]\n\t}\n\tv49 := 0\n\tif 49 < len(data) {\n\t\tv49 = data[49]",
|
||||
"token_estimate": 847
|
||||
},
|
||||
{
|
||||
"block_ids": [
|
||||
"5d269745b2e5dbdcbef0c09ba54b0bd6"
|
||||
],
|
||||
"chunk_id": "3f44ba43c9415652e2705bb667776e76",
|
||||
"chunker_version": "code-go-ast-v1",
|
||||
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
|
||||
"heading_path": [],
|
||||
"policy_hash": "6cfe77abe2b0e5c3",
|
||||
"source_spans": [
|
||||
{
|
||||
"kind": "code",
|
||||
"lang": "go",
|
||||
"line_end": 447,
|
||||
"line_start": 248,
|
||||
"symbol": "BigCompute [part 2/5]"
|
||||
}
|
||||
],
|
||||
"text": "\t}\n\tv50 := 0\n\tif 50 < len(data) {\n\t\tv50 = data[50]\n\t}\n\tv51 := 0\n\tif 51 < len(data) {\n\t\tv51 = data[51]\n\t}\n\tv52 := 0\n\tif 52 < len(data) {\n\t\tv52 = data[52]\n\t}\n\tv53 := 0\n\tif 53 < len(data) {\n\t\tv53 = data[53]\n\t}\n\tv54 := 0\n\tif 54 < len(data) {\n\t\tv54 = data[54]\n\t}\n\tv55 := 0\n\tif 55 < len(data) {\n\t\tv55 = data[55]\n\t}\n\tv56 := 0\n\tif 56 < len(data) {\n\t\tv56 = data[56]\n\t}\n\tv57 := 0\n\tif 57 < len(data) {\n\t\tv57 = data[57]\n\t}\n\tv58 := 0\n\tif 58 < len(data) {\n\t\tv58 = data[58]\n\t}\n\tv59 := 0\n\tif 59 < len(data) {\n\t\tv59 = data[59]\n\t}\n\tv60 := 0\n\tif 60 < len(data) {\n\t\tv60 = data[60]\n\t}\n\tv61 := 0\n\tif 61 < len(data) {\n\t\tv61 = data[61]\n\t}\n\tv62 := 0\n\tif 62 < len(data) {\n\t\tv62 = data[62]\n\t}\n\tv63 := 0\n\tif 63 < len(data) {\n\t\tv63 = data[63]\n\t}\n\tv64 := 0\n\tif 64 < len(data) {\n\t\tv64 = data[64]\n\t}\n\tv65 := 0\n\tif 65 < len(data) {\n\t\tv65 = data[65]\n\t}\n\tv66 := 0\n\tif 66 < len(data) {\n\t\tv66 = data[66]\n\t}\n\tv67 := 0\n\tif 67 < len(data) {\n\t\tv67 = data[67]\n\t}\n\tv68 := 0\n\tif 68 < len(data) {\n\t\tv68 = data[68]\n\t}\n\tv69 := 0\n\tif 69 < len(data) {\n\t\tv69 = data[69]\n\t}\n\tv70 := 0\n\tif 70 < len(data) {\n\t\tv70 = data[70]\n\t}\n\tv71 := 0\n\tif 71 < len(data) {\n\t\tv71 = data[71]\n\t}\n\tv72 := 0\n\tif 72 < len(data) {\n\t\tv72 = data[72]\n\t}\n\tv73 := 0\n\tif 73 < len(data) {\n\t\tv73 = data[73]\n\t}\n\tv74 := 0\n\tif 74 < len(data) {\n\t\tv74 = data[74]\n\t}\n\tv75 := 0\n\tif 75 < len(data) {\n\t\tv75 = data[75]\n\t}\n\tv76 := 0\n\tif 76 < len(data) {\n\t\tv76 = data[76]\n\t}\n\tv77 := 0\n\tif 77 < len(data) {\n\t\tv77 = data[77]\n\t}\n\tv78 := 0\n\tif 78 < len(data) {\n\t\tv78 = data[78]\n\t}\n\tv79 := 0\n\tif 79 < len(data) {\n\t\tv79 = data[79]\n\t}\n\tv80 := 0\n\tif 80 < len(data) {\n\t\tv80 = data[80]\n\t}\n\tv81 := 0\n\tif 81 < len(data) {\n\t\tv81 = data[81]\n\t}\n\tv82 := 0\n\tif 82 < len(data) {\n\t\tv82 = data[82]\n\t}\n\tv83 := 0\n\tif 83 < len(data) {\n\t\tv83 = data[83]\n\t}\n\tv84 := 0\n\tif 84 < len(data) {\n\t\tv84 = data[84]\n\t}\n\tv85 := 0\n\tif 85 < len(data) {\n\t\tv85 = data[85]\n\t}\n\tv86 := 0\n\tif 86 < len(data) {\n\t\tv86 = data[86]\n\t}\n\tv87 := 0\n\tif 87 < len(data) {\n\t\tv87 = data[87]\n\t}\n\tv88 := 0\n\tif 88 < len(data) {\n\t\tv88 = data[88]\n\t}\n\tv89 := 0\n\tif 89 < len(data) {\n\t\tv89 = data[89]\n\t}\n\tv90 := 0\n\tif 90 < len(data) {\n\t\tv90 = data[90]\n\t}\n\tv91 := 0\n\tif 91 < len(data) {\n\t\tv91 = data[91]\n\t}\n\tv92 := 0\n\tif 92 < len(data) {\n\t\tv92 = data[92]\n\t}\n\tv93 := 0\n\tif 93 < len(data) {\n\t\tv93 = data[93]\n\t}\n\tv94 := 0\n\tif 94 < len(data) {\n\t\tv94 = data[94]\n\t}\n\tv95 := 0\n\tif 95 < len(data) {\n\t\tv95 = data[95]\n\t}\n\tv96 := 0\n\tif 96 < len(data) {\n\t\tv96 = data[96]\n\t}\n\tv97 := 0\n\tif 97 < len(data) {\n\t\tv97 = data[97]\n\t}\n\tv98 := 0\n\tif 98 < len(data) {\n\t\tv98 = data[98]\n\t}\n\tv99 := 0\n\tif 99 < len(data) {\n\t\tv99 = data[99]",
|
||||
"token_estimate": 850
|
||||
},
|
||||
{
|
||||
"block_ids": [
|
||||
"5d269745b2e5dbdcbef0c09ba54b0bd6"
|
||||
],
|
||||
"chunk_id": "e4763e10f059d97f40c2932761b56c3e",
|
||||
"chunker_version": "code-go-ast-v1",
|
||||
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
|
||||
"heading_path": [],
|
||||
"policy_hash": "6cfe77abe2b0e5c3",
|
||||
"source_spans": [
|
||||
{
|
||||
"kind": "code",
|
||||
"lang": "go",
|
||||
"line_end": 647,
|
||||
"line_start": 448,
|
||||
"symbol": "BigCompute [part 3/5]"
|
||||
}
|
||||
],
|
||||
"text": "\t}\n\tv100 := 0\n\tif 100 < len(data) {\n\t\tv100 = data[100]\n\t}\n\tv101 := 0\n\tif 101 < len(data) {\n\t\tv101 = data[101]\n\t}\n\tv102 := 0\n\tif 102 < len(data) {\n\t\tv102 = data[102]\n\t}\n\tv103 := 0\n\tif 103 < len(data) {\n\t\tv103 = data[103]\n\t}\n\tv104 := 0\n\tif 104 < len(data) {\n\t\tv104 = data[104]\n\t}\n\tv105 := 0\n\tif 105 < len(data) {\n\t\tv105 = data[105]\n\t}\n\tv106 := 0\n\tif 106 < len(data) {\n\t\tv106 = data[106]\n\t}\n\tv107 := 0\n\tif 107 < len(data) {\n\t\tv107 = data[107]\n\t}\n\tv108 := 0\n\tif 108 < len(data) {\n\t\tv108 = data[108]\n\t}\n\tv109 := 0\n\tif 109 < len(data) {\n\t\tv109 = data[109]\n\t}\n\tv110 := 0\n\tif 110 < len(data) {\n\t\tv110 = data[110]\n\t}\n\tv111 := 0\n\tif 111 < len(data) {\n\t\tv111 = data[111]\n\t}\n\tv112 := 0\n\tif 112 < len(data) {\n\t\tv112 = data[112]\n\t}\n\tv113 := 0\n\tif 113 < len(data) {\n\t\tv113 = data[113]\n\t}\n\tv114 := 0\n\tif 114 < len(data) {\n\t\tv114 = data[114]\n\t}\n\tv115 := 0\n\tif 115 < len(data) {\n\t\tv115 = data[115]\n\t}\n\tv116 := 0\n\tif 116 < len(data) {\n\t\tv116 = data[116]\n\t}\n\tv117 := 0\n\tif 117 < len(data) {\n\t\tv117 = data[117]\n\t}\n\tv118 := 0\n\tif 118 < len(data) {\n\t\tv118 = data[118]\n\t}\n\tv119 := 0\n\tif 119 < len(data) {\n\t\tv119 = data[119]\n\t}\n\tv120 := 0\n\tif 120 < len(data) {\n\t\tv120 = data[120]\n\t}\n\tv121 := 0\n\tif 121 < len(data) {\n\t\tv121 = data[121]\n\t}\n\tv122 := 0\n\tif 122 < len(data) {\n\t\tv122 = data[122]\n\t}\n\tv123 := 0\n\tif 123 < len(data) {\n\t\tv123 = data[123]\n\t}\n\tv124 := 0\n\tif 124 < len(data) {\n\t\tv124 = data[124]\n\t}\n\tv125 := 0\n\tif 125 < len(data) {\n\t\tv125 = data[125]\n\t}\n\tv126 := 0\n\tif 126 < len(data) {\n\t\tv126 = data[126]\n\t}\n\tv127 := 0\n\tif 127 < len(data) {\n\t\tv127 = data[127]\n\t}\n\tv128 := 0\n\tif 128 < len(data) {\n\t\tv128 = data[128]\n\t}\n\tv129 := 0\n\tif 129 < len(data) {\n\t\tv129 = data[129]\n\t}\n\tv130 := 0\n\tif 130 < len(data) {\n\t\tv130 = data[130]\n\t}\n\tv131 := 0\n\tif 131 < len(data) {\n\t\tv131 = data[131]\n\t}\n\tv132 := 0\n\tif 132 < len(data) {\n\t\tv132 = data[132]\n\t}\n\tv133 := 0\n\tif 133 < len(data) {\n\t\tv133 = data[133]\n\t}\n\tv134 := 0\n\tif 134 < len(data) {\n\t\tv134 = data[134]\n\t}\n\tv135 := 0\n\tif 135 < len(data) {\n\t\tv135 = data[135]\n\t}\n\tv136 := 0\n\tif 136 < len(data) {\n\t\tv136 = data[136]\n\t}\n\tv137 := 0\n\tif 137 < len(data) {\n\t\tv137 = data[137]\n\t}\n\tv138 := 0\n\tif 138 < len(data) {\n\t\tv138 = data[138]\n\t}\n\tv139 := 0\n\tif 139 < len(data) {\n\t\tv139 = data[139]\n\t}\n\tv140 := 0\n\tif 140 < len(data) {\n\t\tv140 = data[140]\n\t}\n\tv141 := 0\n\tif 141 < len(data) {\n\t\tv141 = data[141]\n\t}\n\tv142 := 0\n\tif 142 < len(data) {\n\t\tv142 = data[142]\n\t}\n\tv143 := 0\n\tif 143 < len(data) {\n\t\tv143 = data[143]\n\t}\n\tv144 := 0\n\tif 144 < len(data) {\n\t\tv144 = data[144]\n\t}\n\tv145 := 0\n\tif 145 < len(data) {\n\t\tv145 = data[145]\n\t}\n\tv146 := 0\n\tif 146 < len(data) {\n\t\tv146 = data[146]\n\t}\n\tv147 := 0\n\tif 147 < len(data) {\n\t\tv147 = data[147]\n\t}\n\tv148 := 0\n\tif 148 < len(data) {\n\t\tv148 = data[148]\n\t}\n\tv149 := 0\n\tif 149 < len(data) {\n\t\tv149 = data[149]",
|
||||
"token_estimate": 917
|
||||
},
|
||||
{
|
||||
"block_ids": [
|
||||
"5d269745b2e5dbdcbef0c09ba54b0bd6"
|
||||
],
|
||||
"chunk_id": "24176c911d0bacf9a29fa7f8251f5036",
|
||||
"chunker_version": "code-go-ast-v1",
|
||||
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
|
||||
"heading_path": [],
|
||||
"policy_hash": "6cfe77abe2b0e5c3",
|
||||
"source_spans": [
|
||||
{
|
||||
"kind": "code",
|
||||
"lang": "go",
|
||||
"line_end": 847,
|
||||
"line_start": 648,
|
||||
"symbol": "BigCompute [part 4/5]"
|
||||
}
|
||||
],
|
||||
"text": "\t}\n\tv150 := 0\n\tif 150 < len(data) {\n\t\tv150 = data[150]\n\t}\n\tv151 := 0\n\tif 151 < len(data) {\n\t\tv151 = data[151]\n\t}\n\tv152 := 0\n\tif 152 < len(data) {\n\t\tv152 = data[152]\n\t}\n\tv153 := 0\n\tif 153 < len(data) {\n\t\tv153 = data[153]\n\t}\n\tv154 := 0\n\tif 154 < len(data) {\n\t\tv154 = data[154]\n\t}\n\tv155 := 0\n\tif 155 < len(data) {\n\t\tv155 = data[155]\n\t}\n\tv156 := 0\n\tif 156 < len(data) {\n\t\tv156 = data[156]\n\t}\n\tv157 := 0\n\tif 157 < len(data) {\n\t\tv157 = data[157]\n\t}\n\tv158 := 0\n\tif 158 < len(data) {\n\t\tv158 = data[158]\n\t}\n\tv159 := 0\n\tif 159 < len(data) {\n\t\tv159 = data[159]\n\t}\n\tv160 := 0\n\tif 160 < len(data) {\n\t\tv160 = data[160]\n\t}\n\tv161 := 0\n\tif 161 < len(data) {\n\t\tv161 = data[161]\n\t}\n\tv162 := 0\n\tif 162 < len(data) {\n\t\tv162 = data[162]\n\t}\n\tv163 := 0\n\tif 163 < len(data) {\n\t\tv163 = data[163]\n\t}\n\tv164 := 0\n\tif 164 < len(data) {\n\t\tv164 = data[164]\n\t}\n\tv165 := 0\n\tif 165 < len(data) {\n\t\tv165 = data[165]\n\t}\n\tv166 := 0\n\tif 166 < len(data) {\n\t\tv166 = data[166]\n\t}\n\tv167 := 0\n\tif 167 < len(data) {\n\t\tv167 = data[167]\n\t}\n\tv168 := 0\n\tif 168 < len(data) {\n\t\tv168 = data[168]\n\t}\n\tv169 := 0\n\tif 169 < len(data) {\n\t\tv169 = data[169]\n\t}\n\tv170 := 0\n\tif 170 < len(data) {\n\t\tv170 = data[170]\n\t}\n\tv171 := 0\n\tif 171 < len(data) {\n\t\tv171 = data[171]\n\t}\n\tv172 := 0\n\tif 172 < len(data) {\n\t\tv172 = data[172]\n\t}\n\tv173 := 0\n\tif 173 < len(data) {\n\t\tv173 = data[173]\n\t}\n\tv174 := 0\n\tif 174 < len(data) {\n\t\tv174 = data[174]\n\t}\n\tv175 := 0\n\tif 175 < len(data) {\n\t\tv175 = data[175]\n\t}\n\tv176 := 0\n\tif 176 < len(data) {\n\t\tv176 = data[176]\n\t}\n\tv177 := 0\n\tif 177 < len(data) {\n\t\tv177 = data[177]\n\t}\n\tv178 := 0\n\tif 178 < len(data) {\n\t\tv178 = data[178]\n\t}\n\tv179 := 0\n\tif 179 < len(data) {\n\t\tv179 = data[179]\n\t}\n\tv180 := 0\n\tif 180 < len(data) {\n\t\tv180 = data[180]\n\t}\n\tv181 := 0\n\tif 181 < len(data) {\n\t\tv181 = data[181]\n\t}\n\tv182 := 0\n\tif 182 < len(data) {\n\t\tv182 = data[182]\n\t}\n\tv183 := 0\n\tif 183 < len(data) {\n\t\tv183 = data[183]\n\t}\n\tv184 := 0\n\tif 184 < len(data) {\n\t\tv184 = data[184]\n\t}\n\tv185 := 0\n\tif 185 < len(data) {\n\t\tv185 = data[185]\n\t}\n\tv186 := 0\n\tif 186 < len(data) {\n\t\tv186 = data[186]\n\t}\n\tv187 := 0\n\tif 187 < len(data) {\n\t\tv187 = data[187]\n\t}\n\tv188 := 0\n\tif 188 < len(data) {\n\t\tv188 = data[188]\n\t}\n\tv189 := 0\n\tif 189 < len(data) {\n\t\tv189 = data[189]\n\t}\n\tv190 := 0\n\tif 190 < len(data) {\n\t\tv190 = data[190]\n\t}\n\tv191 := 0\n\tif 191 < len(data) {\n\t\tv191 = data[191]\n\t}\n\tv192 := 0\n\tif 192 < len(data) {\n\t\tv192 = data[192]\n\t}\n\tv193 := 0\n\tif 193 < len(data) {\n\t\tv193 = data[193]\n\t}\n\tv194 := 0\n\tif 194 < len(data) {\n\t\tv194 = data[194]\n\t}\n\tv195 := 0\n\tif 195 < len(data) {\n\t\tv195 = data[195]\n\t}\n\tv196 := 0\n\tif 196 < len(data) {\n\t\tv196 = data[196]\n\t}\n\tv197 := 0\n\tif 197 < len(data) {\n\t\tv197 = data[197]\n\t}\n\tv198 := 0\n\tif 198 < len(data) {\n\t\tv198 = data[198]\n\t}\n\tv199 := 0\n\tif 199 < len(data) {\n\t\tv199 = data[199]",
|
||||
"token_estimate": 917
|
||||
},
|
||||
{
|
||||
"block_ids": [
|
||||
"5d269745b2e5dbdcbef0c09ba54b0bd6"
|
||||
],
|
||||
"chunk_id": "438127626378632c03780d10603de32c",
|
||||
"chunker_version": "code-go-ast-v1",
|
||||
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
|
||||
"heading_path": [],
|
||||
"policy_hash": "6cfe77abe2b0e5c3",
|
||||
"source_spans": [
|
||||
{
|
||||
"kind": "code",
|
||||
"lang": "go",
|
||||
"line_end": 890,
|
||||
"line_start": 848,
|
||||
"symbol": "BigCompute [part 5/5]"
|
||||
}
|
||||
],
|
||||
"text": "\t}\n\tv200 := 0\n\tif 200 < len(data) {\n\t\tv200 = data[200]\n\t}\n\tv201 := 0\n\tif 201 < len(data) {\n\t\tv201 = data[201]\n\t}\n\tv202 := 0\n\tif 202 < len(data) {\n\t\tv202 = data[202]\n\t}\n\tv203 := 0\n\tif 203 < len(data) {\n\t\tv203 = data[203]\n\t}\n\tv204 := 0\n\tif 204 < len(data) {\n\t\tv204 = data[204]\n\t}\n\tv205 := 0\n\tif 205 < len(data) {\n\t\tv205 = data[205]\n\t}\n\tv206 := 0\n\tif 206 < len(data) {\n\t\tv206 = data[206]\n\t}\n\tv207 := 0\n\tif 207 < len(data) {\n\t\tv207 = data[207]\n\t}\n\tv208 := 0\n\tif 208 < len(data) {\n\t\tv208 = data[208]\n\t}\n\tv209 := 0\n\tif 209 < len(data) {\n\t\tv209 = data[209]\n\t}\n\treturn len(data)\n}",
|
||||
"token_estimate": 191
|
||||
}
|
||||
]
|
||||
170
crates/kebab-chunk/tests/fixtures/code-sample.java.chunks.snapshot.json
vendored
Normal file
170
crates/kebab-chunk/tests/fixtures/code-sample.java.chunks.snapshot.json
vendored
Normal file
File diff suppressed because one or more lines are too long
170
crates/kebab-chunk/tests/fixtures/code-sample.kt.chunks.snapshot.json
vendored
Normal file
170
crates/kebab-chunk/tests/fixtures/code-sample.kt.chunks.snapshot.json
vendored
Normal file
File diff suppressed because one or more lines are too long
33
crates/kebab-chunk/tests/fixtures/sample.c
vendored
Normal file
33
crates/kebab-chunk/tests/fixtures/sample.c
vendored
Normal file
@@ -0,0 +1,33 @@
|
||||
#include <stdio.h>
|
||||
#include <stdlib.h>
|
||||
|
||||
#define MAX_BUF 4096
|
||||
|
||||
typedef enum {
|
||||
OK = 0,
|
||||
ERR_PARSE,
|
||||
ERR_IO,
|
||||
} status_t;
|
||||
|
||||
typedef struct {
|
||||
int id;
|
||||
char name[64];
|
||||
status_t status;
|
||||
} record_t;
|
||||
|
||||
static int counter = 0;
|
||||
|
||||
int parse_record(const char *line, record_t *out) {
|
||||
if (line == NULL || out == NULL) return ERR_PARSE;
|
||||
return OK;
|
||||
}
|
||||
|
||||
void print_record(const record_t *r) {
|
||||
printf("[%d] %s (status=%d)\n", r->id, r->name, r->status);
|
||||
}
|
||||
|
||||
int main(void) {
|
||||
record_t r = { .id = 1, .name = "foo", .status = OK };
|
||||
print_record(&r);
|
||||
return 0;
|
||||
}
|
||||
40
crates/kebab-chunk/tests/fixtures/sample.cpp
vendored
Normal file
40
crates/kebab-chunk/tests/fixtures/sample.cpp
vendored
Normal file
@@ -0,0 +1,40 @@
|
||||
#include <string>
|
||||
#include <vector>
|
||||
|
||||
namespace kebab {
|
||||
namespace chunk {
|
||||
|
||||
class MdHeadingV1Chunker {
|
||||
public:
|
||||
MdHeadingV1Chunker() = default;
|
||||
~MdHeadingV1Chunker() = default;
|
||||
|
||||
std::string chunk_doc(const std::string& doc) {
|
||||
return doc;
|
||||
}
|
||||
|
||||
int operator()(int x) const {
|
||||
return x * 2;
|
||||
}
|
||||
|
||||
private:
|
||||
int counter_ = 0;
|
||||
};
|
||||
|
||||
template <typename T>
|
||||
T identity(T value) {
|
||||
return value;
|
||||
}
|
||||
|
||||
} // namespace chunk
|
||||
|
||||
void global_helper() {
|
||||
// free function in kebab namespace
|
||||
}
|
||||
|
||||
} // namespace kebab
|
||||
|
||||
int main() {
|
||||
kebab::chunk::MdHeadingV1Chunker c;
|
||||
return 0;
|
||||
}
|
||||
5
crates/kebab-chunk/tests/fixtures/sample.dockerfile
vendored
Normal file
5
crates/kebab-chunk/tests/fixtures/sample.dockerfile
vendored
Normal file
@@ -0,0 +1,5 @@
|
||||
FROM rust:1.94-slim AS builder
|
||||
WORKDIR /app
|
||||
COPY . .
|
||||
RUN cargo build --release
|
||||
CMD ["/app/target/release/kebab"]
|
||||
7
crates/kebab-chunk/tests/fixtures/sample_cargo.toml
vendored
Normal file
7
crates/kebab-chunk/tests/fixtures/sample_cargo.toml
vendored
Normal file
@@ -0,0 +1,7 @@
|
||||
[package]
|
||||
name = "demo"
|
||||
version = "0.1.0"
|
||||
edition = "2021"
|
||||
|
||||
[dependencies]
|
||||
serde = "1"
|
||||
5
crates/kebab-chunk/tests/fixtures/sample_go.mod
vendored
Normal file
5
crates/kebab-chunk/tests/fixtures/sample_go.mod
vendored
Normal file
@@ -0,0 +1,5 @@
|
||||
module example.com/demo
|
||||
|
||||
go 1.22
|
||||
|
||||
require github.com/spf13/cobra v1.8.0
|
||||
34
crates/kebab-chunk/tests/fixtures/sample_k8s.yaml
vendored
Normal file
34
crates/kebab-chunk/tests/fixtures/sample_k8s.yaml
vendored
Normal file
@@ -0,0 +1,34 @@
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: api-server
|
||||
namespace: prod
|
||||
spec:
|
||||
replicas: 3
|
||||
selector:
|
||||
matchLabels:
|
||||
app: api-server
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: api-server
|
||||
spec:
|
||||
containers:
|
||||
- name: api
|
||||
image: example/api:1.2.3
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: api-server
|
||||
namespace: prod
|
||||
spec:
|
||||
selector:
|
||||
app: api-server
|
||||
ports:
|
||||
- port: 80
|
||||
targetPort: 8080
|
||||
---
|
||||
# Non-k8s document — apiVersion missing
|
||||
kind: ClusterIP
|
||||
foo: bar
|
||||
200
crates/kebab-chunk/tests/fixtures/sample_long_paragraph.txt
vendored
Normal file
200
crates/kebab-chunk/tests/fixtures/sample_long_paragraph.txt
vendored
Normal file
@@ -0,0 +1,200 @@
|
||||
line 001
|
||||
line 002
|
||||
line 003
|
||||
line 004
|
||||
line 005
|
||||
line 006
|
||||
line 007
|
||||
line 008
|
||||
line 009
|
||||
line 010
|
||||
line 011
|
||||
line 012
|
||||
line 013
|
||||
line 014
|
||||
line 015
|
||||
line 016
|
||||
line 017
|
||||
line 018
|
||||
line 019
|
||||
line 020
|
||||
line 021
|
||||
line 022
|
||||
line 023
|
||||
line 024
|
||||
line 025
|
||||
line 026
|
||||
line 027
|
||||
line 028
|
||||
line 029
|
||||
line 030
|
||||
line 031
|
||||
line 032
|
||||
line 033
|
||||
line 034
|
||||
line 035
|
||||
line 036
|
||||
line 037
|
||||
line 038
|
||||
line 039
|
||||
line 040
|
||||
line 041
|
||||
line 042
|
||||
line 043
|
||||
line 044
|
||||
line 045
|
||||
line 046
|
||||
line 047
|
||||
line 048
|
||||
line 049
|
||||
line 050
|
||||
line 051
|
||||
line 052
|
||||
line 053
|
||||
line 054
|
||||
line 055
|
||||
line 056
|
||||
line 057
|
||||
line 058
|
||||
line 059
|
||||
line 060
|
||||
line 061
|
||||
line 062
|
||||
line 063
|
||||
line 064
|
||||
line 065
|
||||
line 066
|
||||
line 067
|
||||
line 068
|
||||
line 069
|
||||
line 070
|
||||
line 071
|
||||
line 072
|
||||
line 073
|
||||
line 074
|
||||
line 075
|
||||
line 076
|
||||
line 077
|
||||
line 078
|
||||
line 079
|
||||
line 080
|
||||
line 081
|
||||
line 082
|
||||
line 083
|
||||
line 084
|
||||
line 085
|
||||
line 086
|
||||
line 087
|
||||
line 088
|
||||
line 089
|
||||
line 090
|
||||
line 091
|
||||
line 092
|
||||
line 093
|
||||
line 094
|
||||
line 095
|
||||
line 096
|
||||
line 097
|
||||
line 098
|
||||
line 099
|
||||
line 100
|
||||
line 101
|
||||
line 102
|
||||
line 103
|
||||
line 104
|
||||
line 105
|
||||
line 106
|
||||
line 107
|
||||
line 108
|
||||
line 109
|
||||
line 110
|
||||
line 111
|
||||
line 112
|
||||
line 113
|
||||
line 114
|
||||
line 115
|
||||
line 116
|
||||
line 117
|
||||
line 118
|
||||
line 119
|
||||
line 120
|
||||
line 121
|
||||
line 122
|
||||
line 123
|
||||
line 124
|
||||
line 125
|
||||
line 126
|
||||
line 127
|
||||
line 128
|
||||
line 129
|
||||
line 130
|
||||
line 131
|
||||
line 132
|
||||
line 133
|
||||
line 134
|
||||
line 135
|
||||
line 136
|
||||
line 137
|
||||
line 138
|
||||
line 139
|
||||
line 140
|
||||
line 141
|
||||
line 142
|
||||
line 143
|
||||
line 144
|
||||
line 145
|
||||
line 146
|
||||
line 147
|
||||
line 148
|
||||
line 149
|
||||
line 150
|
||||
line 151
|
||||
line 152
|
||||
line 153
|
||||
line 154
|
||||
line 155
|
||||
line 156
|
||||
line 157
|
||||
line 158
|
||||
line 159
|
||||
line 160
|
||||
line 161
|
||||
line 162
|
||||
line 163
|
||||
line 164
|
||||
line 165
|
||||
line 166
|
||||
line 167
|
||||
line 168
|
||||
line 169
|
||||
line 170
|
||||
line 171
|
||||
line 172
|
||||
line 173
|
||||
line 174
|
||||
line 175
|
||||
line 176
|
||||
line 177
|
||||
line 178
|
||||
line 179
|
||||
line 180
|
||||
line 181
|
||||
line 182
|
||||
line 183
|
||||
line 184
|
||||
line 185
|
||||
line 186
|
||||
line 187
|
||||
line 188
|
||||
line 189
|
||||
line 190
|
||||
line 191
|
||||
line 192
|
||||
line 193
|
||||
line 194
|
||||
line 195
|
||||
line 196
|
||||
line 197
|
||||
line 198
|
||||
line 199
|
||||
line 200
|
||||
7
crates/kebab-chunk/tests/fixtures/sample_package.json
vendored
Normal file
7
crates/kebab-chunk/tests/fixtures/sample_package.json
vendored
Normal file
@@ -0,0 +1,7 @@
|
||||
{
|
||||
"name": "demo",
|
||||
"version": "0.1.0",
|
||||
"dependencies": {
|
||||
"react": "^18.0.0"
|
||||
}
|
||||
}
|
||||
7
crates/kebab-chunk/tests/fixtures/sample_pom.xml
vendored
Normal file
7
crates/kebab-chunk/tests/fixtures/sample_pom.xml
vendored
Normal file
@@ -0,0 +1,7 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<project xmlns="http://maven.apache.org/POM/4.0.0">
|
||||
<modelVersion>4.0.0</modelVersion>
|
||||
<groupId>com.demo</groupId>
|
||||
<artifactId>demo</artifactId>
|
||||
<version>0.1.0</version>
|
||||
</project>
|
||||
15
crates/kebab-chunk/tests/fixtures/sample_shell.sh
vendored
Normal file
15
crates/kebab-chunk/tests/fixtures/sample_shell.sh
vendored
Normal file
@@ -0,0 +1,15 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
# First paragraph: env setup
|
||||
export KEBAB_HOME="${KEBAB_HOME:-$HOME/.local/share/kebab}"
|
||||
mkdir -p "$KEBAB_HOME"
|
||||
cd "$KEBAB_HOME"
|
||||
|
||||
# Second paragraph: ingest
|
||||
echo "ingesting workspace..."
|
||||
kebab ingest --config /etc/kebab/config.toml
|
||||
|
||||
# Third paragraph: report
|
||||
echo "done"
|
||||
kebab schema --json | jq '.stats'
|
||||
288
crates/kebab-chunk/tests/k8s_manifest_resource_v1.rs
Normal file
288
crates/kebab-chunk/tests/k8s_manifest_resource_v1.rs
Normal file
@@ -0,0 +1,288 @@
|
||||
//! Behavioural tests for `K8sManifestResourceV1Chunker`.
|
||||
//!
|
||||
//! Documents are constructed manually (no kebab-parse-code dependency) by
|
||||
//! placing the raw YAML text into a single `Block::Code`, mirroring the
|
||||
//! pattern used in `code_rust_ast_snapshot.rs`.
|
||||
|
||||
use std::path::PathBuf;
|
||||
|
||||
use kebab_chunk::K8sManifestResourceV1Chunker;
|
||||
use kebab_core::{
|
||||
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
|
||||
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
|
||||
WorkspacePath, id_for_block, id_for_doc,
|
||||
};
|
||||
use time::OffsetDateTime;
|
||||
|
||||
// ── helpers ──────────────────────────────────────────────────────────────────
|
||||
|
||||
fn fixtures_dir() -> PathBuf {
|
||||
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
|
||||
.join("tests")
|
||||
.join("fixtures")
|
||||
}
|
||||
|
||||
/// Build a `CanonicalDocument` with a single `Block::Code` containing `yaml_text`.
|
||||
fn yaml_doc(yaml_text: &str) -> CanonicalDocument {
|
||||
let wp = WorkspacePath("manifests/deploy.yaml".into());
|
||||
let aid = AssetId("c".repeat(64));
|
||||
let pv = ParserVersion("code-yaml-v1".into());
|
||||
let doc_id = id_for_doc(&wp, &aid, &pv);
|
||||
|
||||
let line_count = yaml_text.lines().count() as u32;
|
||||
let span = SourceSpan::Code {
|
||||
line_start: 1,
|
||||
line_end: line_count.max(1),
|
||||
symbol: None,
|
||||
lang: Some("yaml".into()),
|
||||
};
|
||||
let bid = id_for_block(&doc_id, "code", &[], 0, &span);
|
||||
let block = Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id: bid,
|
||||
heading_path: vec![],
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some("yaml".into()),
|
||||
code: yaml_text.to_string(),
|
||||
});
|
||||
|
||||
CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: aid,
|
||||
workspace_path: wp,
|
||||
title: "deploy.yaml".into(),
|
||||
lang: Lang("und".into()),
|
||||
blocks: vec![block],
|
||||
metadata: Metadata {
|
||||
aliases: vec![],
|
||||
tags: vec![],
|
||||
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Default::default(),
|
||||
repo: Some("kebab".into()),
|
||||
git_branch: Some("main".into()),
|
||||
git_commit: Some("0".repeat(40)),
|
||||
code_lang: Some("yaml".into()),
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: pv,
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
}
|
||||
}
|
||||
|
||||
fn policy() -> ChunkPolicy {
|
||||
ChunkPolicy {
|
||||
target_tokens: 500,
|
||||
overlap_tokens: 80,
|
||||
respect_markdown_headings: false,
|
||||
chunker_version: ChunkerVersion("k8s-manifest-resource-v1".into()),
|
||||
}
|
||||
}
|
||||
|
||||
// ── tests ─────────────────────────────────────────────────────────────────────
|
||||
|
||||
/// Three YAML documents: 2 valid k8s resources + 1 non-k8s (no apiVersion).
|
||||
/// The chunker must emit exactly 2 chunks with the correct symbols and lang.
|
||||
#[test]
|
||||
fn k8s_multi_doc_emits_one_chunk_per_resource() {
|
||||
let fixture_path = fixtures_dir().join("sample_k8s.yaml");
|
||||
let text = std::fs::read_to_string(&fixture_path)
|
||||
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
|
||||
|
||||
let doc = yaml_doc(&text);
|
||||
let chunks = K8sManifestResourceV1Chunker
|
||||
.chunk(&doc, &policy())
|
||||
.expect("chunk");
|
||||
|
||||
assert_eq!(
|
||||
chunks.len(),
|
||||
2,
|
||||
"expected 2 k8s chunks, got {}: {chunks:#?}",
|
||||
chunks.len()
|
||||
);
|
||||
|
||||
let symbols: Vec<&str> = chunks
|
||||
.iter()
|
||||
.map(|c| {
|
||||
match &c.source_spans[0] {
|
||||
SourceSpan::Code { symbol, .. } => {
|
||||
symbol.as_deref().expect("symbol must be Some for k8s chunks")
|
||||
}
|
||||
other => panic!("expected Code span, got {other:?}"),
|
||||
}
|
||||
})
|
||||
.collect();
|
||||
|
||||
assert_eq!(
|
||||
symbols,
|
||||
vec!["Deployment/prod/api-server", "Service/prod/api-server"],
|
||||
"symbols mismatch: {symbols:?}"
|
||||
);
|
||||
|
||||
// Verify lang = "yaml" on every chunk.
|
||||
for chunk in &chunks {
|
||||
match &chunk.source_spans[0] {
|
||||
SourceSpan::Code { lang, .. } => {
|
||||
assert_eq!(lang.as_deref(), Some("yaml"), "lang must be 'yaml'");
|
||||
}
|
||||
other => panic!("expected Code span, got {other:?}"),
|
||||
}
|
||||
}
|
||||
|
||||
// Verify chunker_version label.
|
||||
for chunk in &chunks {
|
||||
assert_eq!(chunk.chunker_version.0, "k8s-manifest-resource-v1");
|
||||
}
|
||||
|
||||
// Every chunk from a multi-resource file must have a distinct chunk_id.
|
||||
// Without the fix, all non-oversize resources get split_key=None which
|
||||
// collapses to the same id_hash (= base_policy_hash) → UNIQUE constraint
|
||||
// violation on the second resource.
|
||||
let ids: std::collections::HashSet<_> = chunks.iter().map(|c| c.chunk_id.clone()).collect();
|
||||
assert_eq!(
|
||||
ids.len(),
|
||||
chunks.len(),
|
||||
"every k8s resource chunk must have a distinct chunk_id (multi-resource collision regression)"
|
||||
);
|
||||
}
|
||||
|
||||
/// A YAML document with an indentation error (tab in a space-indented context)
|
||||
/// must cause the chunker to return 0 chunks for the entire file.
|
||||
#[test]
|
||||
fn k8s_invalid_yaml_emits_zero_chunks() {
|
||||
// serde_yaml 0.9 is lenient about duplicate keys (last wins), so use a
|
||||
// genuine YAML structural error (unclosed flow sequence) to force a parse
|
||||
// failure.
|
||||
let actually_bad = "apiVersion: v1\nkind: Service\nfoo: [\nbar\n";
|
||||
|
||||
let doc = yaml_doc(actually_bad);
|
||||
let chunks = K8sManifestResourceV1Chunker
|
||||
.chunk(&doc, &policy())
|
||||
.expect("chunk should not error — return Ok(vec![]) for invalid yaml");
|
||||
|
||||
assert_eq!(
|
||||
chunks.len(),
|
||||
0,
|
||||
"invalid YAML must yield 0 chunks, got {}: {chunks:#?}",
|
||||
chunks.len()
|
||||
);
|
||||
}
|
||||
|
||||
/// A cluster-scoped resource (no `metadata.namespace`) must produce a symbol
|
||||
/// of the form `<Kind>/<name>` (two components, no namespace segment).
|
||||
#[test]
|
||||
fn k8s_cluster_scoped_resource_symbol() {
|
||||
let yaml = "\
|
||||
apiVersion: rbac.authorization.k8s.io/v1
|
||||
kind: ClusterRole
|
||||
metadata:
|
||||
name: cluster-admin
|
||||
rules:
|
||||
- apiGroups: [\"*\"]
|
||||
resources: [\"*\"]
|
||||
verbs: [\"*\"]
|
||||
";
|
||||
|
||||
let doc = yaml_doc(yaml);
|
||||
let chunks = K8sManifestResourceV1Chunker
|
||||
.chunk(&doc, &policy())
|
||||
.expect("chunk");
|
||||
|
||||
assert_eq!(
|
||||
chunks.len(),
|
||||
1,
|
||||
"expected 1 chunk for cluster-scoped resource, got {}: {chunks:#?}",
|
||||
chunks.len()
|
||||
);
|
||||
|
||||
match &chunks[0].source_spans[0] {
|
||||
SourceSpan::Code { symbol, lang, .. } => {
|
||||
assert_eq!(
|
||||
symbol.as_deref(),
|
||||
Some("ClusterRole/cluster-admin"),
|
||||
"cluster-scoped symbol must be <Kind>/<name>"
|
||||
);
|
||||
assert_eq!(lang.as_deref(), Some("yaml"));
|
||||
}
|
||||
other => panic!("expected Code span, got {other:?}"),
|
||||
}
|
||||
}
|
||||
|
||||
/// 200+ line resource exercises `tier2_shared::push_chunks_with_oversize`'s
|
||||
/// line-window split branch. All chunks must share the same symbol
|
||||
/// (`<Kind>/<ns>/<name>`); their line ranges must form a contiguous
|
||||
/// partition; chunk_ids must all differ (the `#L{k}` suffix on `id_for_chunk`
|
||||
/// ensures uniqueness across windows). Spec p10-2 risks section explicitly
|
||||
/// flags "거대 ConfigMap" — this test covers that path.
|
||||
#[test]
|
||||
fn k8s_oversize_splits_into_line_windows_sharing_symbol() {
|
||||
// ConfigMap with 250 data keys → ~256 total lines, > AST_CHUNK_MAX_LINES (200).
|
||||
let mut yaml = String::from(
|
||||
"apiVersion: v1\nkind: ConfigMap\nmetadata:\n name: big\n namespace: prod\ndata:\n",
|
||||
);
|
||||
for i in 0..250 {
|
||||
yaml.push_str(&format!(" key{i}: value{i}\n"));
|
||||
}
|
||||
|
||||
let doc = yaml_doc(&yaml);
|
||||
let chunks = K8sManifestResourceV1Chunker
|
||||
.chunk(&doc, &policy())
|
||||
.expect("chunk");
|
||||
|
||||
assert!(
|
||||
chunks.len() >= 2,
|
||||
"expected ≥2 chunks for oversize resource, got {}",
|
||||
chunks.len()
|
||||
);
|
||||
|
||||
// Every chunk must share the same symbol + lang.
|
||||
let expected_symbol = "ConfigMap/prod/big";
|
||||
for (i, c) in chunks.iter().enumerate() {
|
||||
match &c.source_spans[0] {
|
||||
SourceSpan::Code { symbol, lang, .. } => {
|
||||
assert_eq!(
|
||||
symbol.as_deref(),
|
||||
Some(expected_symbol),
|
||||
"chunk[{i}] symbol must equal `{expected_symbol}`"
|
||||
);
|
||||
assert_eq!(lang.as_deref(), Some("yaml"));
|
||||
}
|
||||
other => panic!("chunk[{i}]: expected Code span, got {other:?}"),
|
||||
}
|
||||
}
|
||||
|
||||
// chunk_ids must all be distinct (oversize fallback's #L{k} suffix).
|
||||
let ids: std::collections::HashSet<_> = chunks.iter().map(|c| c.chunk_id.clone()).collect();
|
||||
assert_eq!(
|
||||
ids.len(),
|
||||
chunks.len(),
|
||||
"oversize chunks must have distinct chunk_ids (the #L{{k}} suffix should disambiguate)"
|
||||
);
|
||||
|
||||
// Line ranges must form a contiguous partition: chunk[i].line_end + 1 == chunk[i+1].line_start.
|
||||
let ranges: Vec<(u32, u32)> = chunks
|
||||
.iter()
|
||||
.map(|c| match &c.source_spans[0] {
|
||||
SourceSpan::Code { line_start, line_end, .. } => (*line_start, *line_end),
|
||||
other => panic!("expected Code span, got {other:?}"),
|
||||
})
|
||||
.collect();
|
||||
for w in ranges.windows(2) {
|
||||
let (_, prev_end) = w[0];
|
||||
let (next_start, _) = w[1];
|
||||
assert_eq!(
|
||||
prev_end + 1,
|
||||
next_start,
|
||||
"line ranges must be contiguous: {} → {} (got gap or overlap)",
|
||||
prev_end,
|
||||
next_start
|
||||
);
|
||||
}
|
||||
}
|
||||
267
crates/kebab-chunk/tests/manifest_file_v1.rs
Normal file
267
crates/kebab-chunk/tests/manifest_file_v1.rs
Normal file
@@ -0,0 +1,267 @@
|
||||
//! Behavioural tests for `ManifestFileV1Chunker`.
|
||||
//!
|
||||
//! Documents are constructed manually (no kebab-parse-code dependency) by
|
||||
//! placing the raw manifest text into a single `Block::Code`, mirroring the
|
||||
//! pattern used in `dockerfile_file_v1.rs`.
|
||||
|
||||
use std::path::PathBuf;
|
||||
|
||||
use kebab_chunk::ManifestFileV1Chunker;
|
||||
use kebab_core::{
|
||||
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
|
||||
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
|
||||
WorkspacePath, id_for_block, id_for_doc,
|
||||
};
|
||||
use time::OffsetDateTime;
|
||||
|
||||
// ── helpers ──────────────────────────────────────────────────────────────────
|
||||
|
||||
fn fixtures_dir() -> PathBuf {
|
||||
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
|
||||
.join("tests")
|
||||
.join("fixtures")
|
||||
}
|
||||
|
||||
/// Build a `CanonicalDocument` with a single `Block::Code` containing manifest text.
|
||||
fn manifest_doc(lang: &str, manifest_text: &str) -> CanonicalDocument {
|
||||
let wp = WorkspacePath(format!("build/{}", manifest_filename(lang)));
|
||||
let aid = AssetId("m".repeat(64));
|
||||
let pv = ParserVersion("code-manifest-v1".into());
|
||||
let doc_id = id_for_doc(&wp, &aid, &pv);
|
||||
|
||||
let line_count = manifest_text.lines().count() as u32;
|
||||
let span = SourceSpan::Code {
|
||||
line_start: 1,
|
||||
line_end: line_count.max(1),
|
||||
symbol: None,
|
||||
lang: Some(lang.into()),
|
||||
};
|
||||
let bid = id_for_block(&doc_id, "code", &[], 0, &span);
|
||||
let block = Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id: bid,
|
||||
heading_path: vec![],
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some(lang.into()),
|
||||
code: manifest_text.to_string(),
|
||||
});
|
||||
|
||||
CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: aid,
|
||||
workspace_path: wp,
|
||||
title: format!("Manifest ({})", lang),
|
||||
lang: Lang("und".into()),
|
||||
blocks: vec![block],
|
||||
metadata: Metadata {
|
||||
aliases: vec![],
|
||||
tags: vec![],
|
||||
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Default::default(),
|
||||
repo: Some("kebab".into()),
|
||||
git_branch: Some("main".into()),
|
||||
git_commit: Some("0".repeat(40)),
|
||||
code_lang: Some(lang.into()),
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: pv,
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
}
|
||||
}
|
||||
|
||||
fn manifest_filename(lang: &str) -> &'static str {
|
||||
match lang {
|
||||
"toml" => "Cargo.toml",
|
||||
"json" => "package.json",
|
||||
"xml" => "pom.xml",
|
||||
"go-mod" => "go.mod",
|
||||
_ => "manifest",
|
||||
}
|
||||
}
|
||||
|
||||
fn policy() -> ChunkPolicy {
|
||||
ChunkPolicy {
|
||||
target_tokens: 500,
|
||||
overlap_tokens: 80,
|
||||
respect_markdown_headings: false,
|
||||
chunker_version: ChunkerVersion("manifest-file-v1".into()),
|
||||
}
|
||||
}
|
||||
|
||||
// ── tests ─────────────────────────────────────────────────────────────────────
|
||||
|
||||
/// A Cargo.toml fixture must emit exactly 1 chunk with the correct symbol,
|
||||
/// lang, and line range.
|
||||
#[test]
|
||||
fn cargo_toml_single_chunk_with_toml_lang() {
|
||||
let fixture_path = fixtures_dir().join("sample_cargo.toml");
|
||||
let text = std::fs::read_to_string(&fixture_path)
|
||||
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
|
||||
|
||||
let doc = manifest_doc("toml", &text);
|
||||
let chunks = ManifestFileV1Chunker
|
||||
.chunk(&doc, &policy())
|
||||
.expect("chunk");
|
||||
|
||||
assert_eq!(
|
||||
chunks.len(),
|
||||
1,
|
||||
"expected 1 chunk, got {}: {chunks:#?}",
|
||||
chunks.len()
|
||||
);
|
||||
|
||||
let span = chunks[0].source_spans.first().expect("at least one span");
|
||||
match span {
|
||||
SourceSpan::Code {
|
||||
line_start,
|
||||
line_end: _,
|
||||
symbol,
|
||||
lang,
|
||||
} => {
|
||||
assert_eq!(*line_start, 1, "line_start must be 1");
|
||||
assert_eq!(
|
||||
symbol.as_deref(),
|
||||
Some("<manifest>"),
|
||||
"symbol must be '<manifest>'"
|
||||
);
|
||||
assert_eq!(lang.as_deref(), Some("toml"), "lang must be 'toml'");
|
||||
}
|
||||
other => panic!("expected SourceSpan::Code, got {other:?}"),
|
||||
}
|
||||
|
||||
assert_eq!(chunks[0].chunker_version.0, "manifest-file-v1");
|
||||
}
|
||||
|
||||
/// A package.json fixture must emit exactly 1 chunk with the correct symbol,
|
||||
/// lang, and line range.
|
||||
#[test]
|
||||
fn package_json_single_chunk_with_json_lang() {
|
||||
let fixture_path = fixtures_dir().join("sample_package.json");
|
||||
let text = std::fs::read_to_string(&fixture_path)
|
||||
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
|
||||
|
||||
let doc = manifest_doc("json", &text);
|
||||
let chunks = ManifestFileV1Chunker
|
||||
.chunk(&doc, &policy())
|
||||
.expect("chunk");
|
||||
|
||||
assert_eq!(
|
||||
chunks.len(),
|
||||
1,
|
||||
"expected 1 chunk, got {}: {chunks:#?}",
|
||||
chunks.len()
|
||||
);
|
||||
|
||||
let span = chunks[0].source_spans.first().expect("at least one span");
|
||||
match span {
|
||||
SourceSpan::Code {
|
||||
line_start,
|
||||
line_end: _,
|
||||
symbol,
|
||||
lang,
|
||||
} => {
|
||||
assert_eq!(*line_start, 1, "line_start must be 1");
|
||||
assert_eq!(
|
||||
symbol.as_deref(),
|
||||
Some("<manifest>"),
|
||||
"symbol must be '<manifest>'"
|
||||
);
|
||||
assert_eq!(lang.as_deref(), Some("json"), "lang must be 'json'");
|
||||
}
|
||||
other => panic!("expected SourceSpan::Code, got {other:?}"),
|
||||
}
|
||||
|
||||
assert_eq!(chunks[0].chunker_version.0, "manifest-file-v1");
|
||||
}
|
||||
|
||||
/// A pom.xml fixture must emit exactly 1 chunk with the correct symbol,
|
||||
/// lang, and line range.
|
||||
#[test]
|
||||
fn pom_xml_single_chunk_with_xml_lang() {
|
||||
let fixture_path = fixtures_dir().join("sample_pom.xml");
|
||||
let text = std::fs::read_to_string(&fixture_path)
|
||||
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
|
||||
|
||||
let doc = manifest_doc("xml", &text);
|
||||
let chunks = ManifestFileV1Chunker
|
||||
.chunk(&doc, &policy())
|
||||
.expect("chunk");
|
||||
|
||||
assert_eq!(
|
||||
chunks.len(),
|
||||
1,
|
||||
"expected 1 chunk, got {}: {chunks:#?}",
|
||||
chunks.len()
|
||||
);
|
||||
|
||||
let span = chunks[0].source_spans.first().expect("at least one span");
|
||||
match span {
|
||||
SourceSpan::Code {
|
||||
line_start,
|
||||
line_end: _,
|
||||
symbol,
|
||||
lang,
|
||||
} => {
|
||||
assert_eq!(*line_start, 1, "line_start must be 1");
|
||||
assert_eq!(
|
||||
symbol.as_deref(),
|
||||
Some("<manifest>"),
|
||||
"symbol must be '<manifest>'"
|
||||
);
|
||||
assert_eq!(lang.as_deref(), Some("xml"), "lang must be 'xml'");
|
||||
}
|
||||
other => panic!("expected SourceSpan::Code, got {other:?}"),
|
||||
}
|
||||
|
||||
assert_eq!(chunks[0].chunker_version.0, "manifest-file-v1");
|
||||
}
|
||||
|
||||
/// A go.mod fixture must emit exactly 1 chunk with the correct symbol,
|
||||
/// lang, and line range.
|
||||
#[test]
|
||||
fn go_mod_single_chunk_with_go_mod_lang() {
|
||||
let fixture_path = fixtures_dir().join("sample_go.mod");
|
||||
let text = std::fs::read_to_string(&fixture_path)
|
||||
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
|
||||
|
||||
let doc = manifest_doc("go-mod", &text);
|
||||
let chunks = ManifestFileV1Chunker
|
||||
.chunk(&doc, &policy())
|
||||
.expect("chunk");
|
||||
|
||||
assert_eq!(
|
||||
chunks.len(),
|
||||
1,
|
||||
"expected 1 chunk, got {}: {chunks:#?}",
|
||||
chunks.len()
|
||||
);
|
||||
|
||||
let span = chunks[0].source_spans.first().expect("at least one span");
|
||||
match span {
|
||||
SourceSpan::Code {
|
||||
line_start,
|
||||
line_end: _,
|
||||
symbol,
|
||||
lang,
|
||||
} => {
|
||||
assert_eq!(*line_start, 1, "line_start must be 1");
|
||||
assert_eq!(
|
||||
symbol.as_deref(),
|
||||
Some("<manifest>"),
|
||||
"symbol must be '<manifest>'"
|
||||
);
|
||||
assert_eq!(lang.as_deref(), Some("go-mod"), "lang must be 'go-mod'");
|
||||
}
|
||||
other => panic!("expected SourceSpan::Code, got {other:?}"),
|
||||
}
|
||||
|
||||
assert_eq!(chunks[0].chunker_version.0, "manifest-file-v1");
|
||||
}
|
||||
@@ -275,6 +275,14 @@ enum Cmd {
|
||||
#[arg(long, group = "reset_scope")]
|
||||
config_only: bool,
|
||||
|
||||
/// Purge stored docs that are outside the current walker scope
|
||||
/// (config narrowing / removed sub-directory). No filesystem paths
|
||||
/// are removed — this is purely a store-level reconciliation.
|
||||
/// Filesystem existence is NOT checked; anything the current walker
|
||||
/// would not visit is considered an orphan and removed from the store.
|
||||
#[arg(long, group = "reset_scope")]
|
||||
orphans_only: bool,
|
||||
|
||||
/// Skip the interactive confirm. Required in non-interactive
|
||||
/// contexts (CI, pipes).
|
||||
#[arg(long)]
|
||||
@@ -595,14 +603,20 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
|
||||
println!("{}", serde_json::to_string(&wire::wire_ingest(&report))?);
|
||||
} else {
|
||||
let skipped_breakdown = kebab_app::render_skipped_breakdown(&report.skipped_by_extension);
|
||||
let purged_suffix = if report.purged_deleted_files > 0 {
|
||||
format!(" purged {}", report.purged_deleted_files)
|
||||
} else {
|
||||
String::new()
|
||||
};
|
||||
println!(
|
||||
"scanned {} new {} updated {} skipped {}{} errors {} ({} ms)",
|
||||
"scanned {} new {} updated {} skipped {}{} errors {}{} ({} ms)",
|
||||
report.scanned,
|
||||
report.new,
|
||||
report.updated,
|
||||
report.skipped,
|
||||
skipped_breakdown,
|
||||
report.errors,
|
||||
purged_suffix,
|
||||
report.duration_ms
|
||||
);
|
||||
}
|
||||
@@ -919,6 +933,15 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
|
||||
let next = resp.next_cursor.as_deref().unwrap_or("(none)");
|
||||
eprintln!("[truncated; use --cursor {next} for the next page]");
|
||||
}
|
||||
// v0.17.0 A5 Step 4: short-query advisory. `resp.hint`
|
||||
// is `Some` only when the result list is empty and the
|
||||
// trimmed query is shorter than the trigram tokenizer
|
||||
// can resolve (raw FTS5 mode opts out). stderr so it
|
||||
// doesn't pollute the stdout hit list. `--json` skips
|
||||
// this branch entirely; the field rides the wire.
|
||||
if let Some(hint) = &resp.hint {
|
||||
eprintln!("[hint] {hint}");
|
||||
}
|
||||
if *trace {
|
||||
if let Some(t) = &resp.trace {
|
||||
eprintln!();
|
||||
@@ -1088,6 +1111,7 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
|
||||
data_only: _,
|
||||
vector_only,
|
||||
config_only,
|
||||
orphans_only,
|
||||
yes,
|
||||
} => {
|
||||
use kebab_app::ResetScope;
|
||||
@@ -1101,11 +1125,50 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
|
||||
ResetScope::VectorOnly
|
||||
} else if *config_only {
|
||||
ResetScope::ConfigOnly
|
||||
} else if *orphans_only {
|
||||
ResetScope::OrphansOnly
|
||||
} else {
|
||||
ResetScope::DataOnly
|
||||
};
|
||||
|
||||
let cfg = kebab_config::Config::load(cli.config.as_deref())?;
|
||||
|
||||
if matches!(scope, ResetScope::OrphansOnly) {
|
||||
// OrphansOnly: confirm UI shows orphan count + sample paths
|
||||
// rather than on-disk directory sizes.
|
||||
let orphan_paths = kebab_app::enumerate_orphans(&cfg)?;
|
||||
|
||||
if !*yes {
|
||||
use std::io::IsTerminal;
|
||||
if !std::io::stdin().is_terminal() {
|
||||
anyhow::bail!(
|
||||
"reset --orphans-only is destructive and stdin is non-interactive — pass --yes to proceed"
|
||||
);
|
||||
}
|
||||
if !confirm_orphans_only(&orphan_paths)? {
|
||||
if !cli.quiet {
|
||||
eprintln!("aborted.");
|
||||
}
|
||||
return Ok(());
|
||||
}
|
||||
}
|
||||
|
||||
let report = kebab_app::reset::execute(scope, &cfg)?;
|
||||
if cli.json {
|
||||
println!("{}", serde_json::to_string(&wire::wire_reset(&report))?);
|
||||
} else {
|
||||
if report.orphans_purged > 0 {
|
||||
println!("orphans purged: {}", report.orphans_purged);
|
||||
for p in &report.purged_paths {
|
||||
println!(" - {}", p.0);
|
||||
}
|
||||
} else {
|
||||
println!("no orphaned docs found — store is already in sync with walker scope");
|
||||
}
|
||||
}
|
||||
return Ok(());
|
||||
}
|
||||
|
||||
let paths = kebab_app::reset::enumerate_paths(scope, &cfg);
|
||||
let bytes = kebab_app::reset::estimate_size_bytes(&paths);
|
||||
|
||||
@@ -1444,6 +1507,46 @@ fn confirm_destructive(
|
||||
Ok(matches!(s.as_str(), "y" | "yes"))
|
||||
}
|
||||
|
||||
/// Confirm prompt for `--orphans-only`: shows the orphan count + a
|
||||
/// sample of up to 5 paths so the user knows what will be purged before
|
||||
/// committing. No filesystem paths are removed — only store records.
|
||||
fn confirm_orphans_only(
|
||||
orphan_paths: &[kebab_core::WorkspacePath],
|
||||
) -> anyhow::Result<bool> {
|
||||
use std::io::Write;
|
||||
let n = orphan_paths.len();
|
||||
let mut out = std::io::stderr().lock();
|
||||
|
||||
if n == 0 {
|
||||
writeln!(out, "no orphaned docs found — nothing to purge.")?;
|
||||
out.flush()?;
|
||||
// Nothing to do; treat as confirmed so the caller can emit the
|
||||
// "no orphans" report without prompting.
|
||||
return Ok(true);
|
||||
}
|
||||
|
||||
let sample: Vec<&str> = orphan_paths
|
||||
.iter()
|
||||
.take(5)
|
||||
.map(|p| p.0.as_str())
|
||||
.collect();
|
||||
let sample_str = sample.join(", ");
|
||||
let ellipsis = if n > 5 { ", …" } else { "" };
|
||||
|
||||
writeln!(
|
||||
out,
|
||||
"Purge {n} stored doc(s) outside the current walker scope? (no filesystem paths removed)"
|
||||
)?;
|
||||
writeln!(out, " sample: {sample_str}{ellipsis}")?;
|
||||
write!(out, "[y/N] ")?;
|
||||
out.flush()?;
|
||||
|
||||
let mut line = String::new();
|
||||
std::io::stdin().read_line(&mut line)?;
|
||||
let s = line.trim().to_ascii_lowercase();
|
||||
Ok(matches!(s.as_str(), "y" | "yes"))
|
||||
}
|
||||
|
||||
/// p9-fb-35: human-friendly plain output for `kebab fetch`.
|
||||
fn render_fetch_plain(r: &kebab_core::FetchResult) {
|
||||
println!("# {} ({})", r.doc_path.0, format_kind(r.kind));
|
||||
|
||||
@@ -92,6 +92,14 @@ pub fn wire_search_response(r: &kebab_app::SearchResponse) -> Value {
|
||||
map.insert("trace".to_string(), trace_v);
|
||||
}
|
||||
}
|
||||
// v0.17.0 A5 Step 4b: emit `hint` only when set. Keeps responses
|
||||
// that don't carry a hint backward-compatible with v0 consumers
|
||||
// that don't know the field.
|
||||
if let Some(hint) = &r.hint {
|
||||
if let Value::Object(ref mut map) = v {
|
||||
map.insert("hint".to_string(), Value::String(hint.clone()));
|
||||
}
|
||||
}
|
||||
tag_object(v, "search_response.v1")
|
||||
}
|
||||
|
||||
@@ -260,6 +268,7 @@ mod tests {
|
||||
skipped_generated: 0,
|
||||
skipped_size_exceeded: 0,
|
||||
skip_examples: SkipExamples::default(),
|
||||
purged_deleted_files: 0,
|
||||
items: None,
|
||||
};
|
||||
let v = wire_ingest(&r);
|
||||
@@ -291,6 +300,7 @@ mod tests {
|
||||
next_cursor: Some("opaque-cursor-abc".to_string()),
|
||||
truncated: true,
|
||||
trace: None,
|
||||
hint: None,
|
||||
};
|
||||
let v = wire_search_response(&r);
|
||||
assert_eq!(schema_of(&v), Some("search_response.v1"));
|
||||
@@ -364,6 +374,8 @@ mod tests {
|
||||
scope: kebab_app::ResetScope::DataOnly,
|
||||
removed_paths: vec![std::path::PathBuf::from("/tmp/x")],
|
||||
embedding_rows_truncated: 0,
|
||||
orphans_purged: 0,
|
||||
purged_paths: vec![],
|
||||
};
|
||||
let v = wire_reset(&r);
|
||||
assert_eq!(schema_of(&v), Some("reset_report.v1"));
|
||||
@@ -402,6 +414,7 @@ mod tests {
|
||||
}],
|
||||
timing: TraceTiming { lexical_ms: 5, vector_ms: 0, fusion_ms: 1, total_ms: 7 },
|
||||
}),
|
||||
hint: None,
|
||||
};
|
||||
let v = wire_search_response(&r);
|
||||
assert_eq!(schema_of(&v), Some("search_response.v1"));
|
||||
@@ -417,6 +430,7 @@ mod tests {
|
||||
next_cursor: None,
|
||||
truncated: false,
|
||||
trace: None,
|
||||
hint: None,
|
||||
};
|
||||
let v = wire_search_response(&r);
|
||||
assert!(v.get("trace").is_none(), "trace field absent when None");
|
||||
|
||||
@@ -47,8 +47,20 @@ fn search_json_emits_search_response_v1_wrapper() {
|
||||
fn search_json_truncates_with_max_tokens() {
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
|
||||
let body: String = "rust ownership is a memory model. ".repeat(10);
|
||||
fs::write(workspace.join("a.md"), format!("# T\n\n{body}\n")).unwrap();
|
||||
// v0.17.0 trigram tokenizer makes FTS5 snippet() tokens 3-char wide
|
||||
// (was full words under unicode61), so an individual snippet stays
|
||||
// around ~60 chars — too short to ever exceed the snippet-shorten
|
||||
// budget cap on a single-hit fixture. To still exercise the budget
|
||||
// loop deterministically, we ingest multiple hits and pick a budget
|
||||
// small enough that the loop has to *pop* hits, which flips
|
||||
// truncated=true regardless of snippet length.
|
||||
for i in 0..5 {
|
||||
fs::write(
|
||||
workspace.join(format!("d{i}.md")),
|
||||
format!("# T{i}\n\nrust ownership is a memory model.\n"),
|
||||
)
|
||||
.unwrap();
|
||||
}
|
||||
common::ingest(&cfg, &workspace);
|
||||
|
||||
let (stdout, _stderr) = common::run_search_with_args(
|
||||
@@ -211,8 +223,15 @@ fn search_stale_cursor_returns_error_v1_with_stale_cursor_code() {
|
||||
fn search_plain_emits_truncated_hint_to_stderr() {
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
|
||||
let body: String = "rust ownership is a memory model. ".repeat(10);
|
||||
fs::write(workspace.join("a.md"), format!("# T\n\n{body}\n")).unwrap();
|
||||
// v0.17.0 trigram tokenizer — same multi-doc rationale as
|
||||
// `search_json_truncates_with_max_tokens` above.
|
||||
for i in 0..5 {
|
||||
fs::write(
|
||||
workspace.join(format!("d{i}.md")),
|
||||
format!("# T{i}\n\nrust ownership is a memory model.\n"),
|
||||
)
|
||||
.unwrap();
|
||||
}
|
||||
common::ingest(&cfg, &workspace);
|
||||
|
||||
let (_stdout, stderr) = common::run_search_with_args(
|
||||
@@ -224,3 +243,76 @@ fn search_plain_emits_truncated_hint_to_stderr() {
|
||||
"stderr must carry truncated hint: {stderr:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn search_plain_emits_short_query_hint_to_stderr() {
|
||||
// v0.17.0 A5 Step 6: 2-char query under trigram tokenizer emits
|
||||
// empty hits + stderr `[hint]` advisory. Empty workspace is enough
|
||||
// — hits are always empty so the hint condition depends only on
|
||||
// query length (<3 chars trimmed) + non-raw mode + hits.is_empty.
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
|
||||
common::ingest(&cfg, &workspace);
|
||||
|
||||
let (_stdout, stderr) = common::run_search_with_args(
|
||||
&cfg,
|
||||
&["--mode", "lexical", "ab"],
|
||||
);
|
||||
assert!(
|
||||
stderr.contains("[hint]"),
|
||||
"stderr must carry short-query hint: {stderr:?}"
|
||||
);
|
||||
assert!(
|
||||
stderr.contains("3자 이상"),
|
||||
"hint message must mention '3자 이상' (Korean advisory): {stderr:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn search_json_emits_hint_field_for_short_query() {
|
||||
// v0.17.0 A5 Step 6: --json mode carries the same advisory on the
|
||||
// `search_response.v1.hint` additive field. Empty hits + 2-char
|
||||
// query + non-raw mode trips the helper. Verifies the MCP-visible
|
||||
// surface (agents read the field instead of parsing stderr).
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
|
||||
common::ingest(&cfg, &workspace);
|
||||
|
||||
let (stdout, _stderr) = common::run_search_with_args(
|
||||
&cfg,
|
||||
&["--json", "--mode", "lexical", "ab"],
|
||||
);
|
||||
let v: Value = serde_json::from_str(stdout.trim())
|
||||
.unwrap_or_else(|e| panic!("not JSON: {stdout:?}: {e}"));
|
||||
assert!(
|
||||
v["hits"].as_array().unwrap().is_empty(),
|
||||
"empty hits expected for short query in empty KB: {v}"
|
||||
);
|
||||
assert_eq!(
|
||||
v["hint"].as_str().expect("hint field set on short empty result"),
|
||||
"3자 이상 키워드 권장 (trigram tokenizer 제약)",
|
||||
"hint must carry the standard advisory: {v}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn search_json_omits_hint_field_when_query_is_long_enough() {
|
||||
// v0.17.0 A5 Step 6 (negative case): 3+ char query never trips
|
||||
// hint, even on an empty KB. Verifies `serialize_search_response`
|
||||
// omits the additive `hint` field when `None` so existing wire
|
||||
// consumers stay backward-compatible.
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
|
||||
common::ingest(&cfg, &workspace);
|
||||
|
||||
let (stdout, _stderr) = common::run_search_with_args(
|
||||
&cfg,
|
||||
&["--json", "--mode", "lexical", "abc"],
|
||||
);
|
||||
let v: Value = serde_json::from_str(stdout.trim())
|
||||
.unwrap_or_else(|e| panic!("not JSON: {stdout:?}: {e}"));
|
||||
assert!(
|
||||
v.get("hint").is_none(),
|
||||
"hint must be absent for ≥3-char queries: {v}"
|
||||
);
|
||||
}
|
||||
|
||||
@@ -122,6 +122,23 @@ pub struct LlmCfg {
|
||||
pub endpoint: String,
|
||||
pub temperature: f32,
|
||||
pub seed: u64,
|
||||
/// v0.17.0 post-dogfood: Hard ceiling on a single HTTP exchange to
|
||||
/// the LLM endpoint (Ollama, etc.). Cold-loading an 8B+ model on
|
||||
/// CPU-only hosts can spend 60-90s on model load + several minutes
|
||||
/// on a first inference, blowing past the old hard-coded 300s cap
|
||||
/// and surfacing as `error: kb-rag: llm.generate_stream` to the
|
||||
/// user. Config-driven so 16-GB / CPU-only deployments using small
|
||||
/// (≤4B) models can keep the original 300s and large-model dogfood
|
||||
/// can dial it up (e.g. 1200s) without rebuilding.
|
||||
///
|
||||
/// **Edge case — `0` is NOT a disable sentinel.**
|
||||
/// `reqwest::ClientBuilder::timeout(Duration::from_secs(0))` sets a
|
||||
/// 0-second read timeout, so every request fails *immediately* with
|
||||
/// `error: kb-rag: ollama timeout`. To approximate "no cap", use a
|
||||
/// large finite value (e.g. `u64::MAX` ≈ 5.8 × 10¹¹ years, or
|
||||
/// just a generous number like `86400`).
|
||||
#[serde(default = "default_llm_request_timeout_secs")]
|
||||
pub request_timeout_secs: u64,
|
||||
}
|
||||
|
||||
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
|
||||
@@ -147,6 +164,13 @@ fn default_cache_capacity() -> usize {
|
||||
256
|
||||
}
|
||||
|
||||
/// v0.17.0 post-dogfood: matches the legacy hard-coded ceiling so
|
||||
/// existing configs that omit the field keep behaving identically.
|
||||
/// Overridable per config / `KEBAB_MODELS_LLM_REQUEST_TIMEOUT_SECS`.
|
||||
fn default_llm_request_timeout_secs() -> u64 {
|
||||
300
|
||||
}
|
||||
|
||||
fn default_stale_threshold_days() -> u32 {
|
||||
30
|
||||
}
|
||||
@@ -204,6 +228,22 @@ pub struct OcrCfg {
|
||||
/// Cap the long edge of the image (in pixels) before sending. Larger
|
||||
/// images bloat prompt cost. Default `1600`.
|
||||
pub max_pixels: u32,
|
||||
/// v0.17.2 post-dogfood: Hard ceiling on a single HTTP exchange to
|
||||
/// the OCR endpoint. Sister knob to [`LlmCfg::request_timeout_secs`]
|
||||
/// — kept separate because OCR latency is typically shorter than
|
||||
/// chat-LLM cold start, and large vision models on CPU-only hosts
|
||||
/// occasionally need a different budget. See HOTFIXES 2026-05-25
|
||||
/// for the rationale.
|
||||
///
|
||||
/// **Edge case — `0` is NOT a disable sentinel.** Same semantics as
|
||||
/// [`LlmCfg::request_timeout_secs`]: `Duration::from_secs(0)` means
|
||||
/// "every request fails immediately" (reqwest 0.12.x — the read
|
||||
/// timeout is applied as a 0-second deadline), not "no timeout".
|
||||
/// To approximate "no cap", use a large finite value (e.g.
|
||||
/// `u64::MAX` ≈ 5.8 × 10¹¹ years, or just a generous number like
|
||||
/// `86400`).
|
||||
#[serde(default = "default_ocr_request_timeout_secs")]
|
||||
pub request_timeout_secs: u64,
|
||||
}
|
||||
|
||||
impl OcrCfg {
|
||||
@@ -215,10 +255,18 @@ impl OcrCfg {
|
||||
endpoint: None,
|
||||
languages: vec!["eng".to_string(), "kor".to_string()],
|
||||
max_pixels: 1600,
|
||||
request_timeout_secs: default_ocr_request_timeout_secs(),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// v0.17.2 post-dogfood: matches the legacy hard-coded ceiling so
|
||||
/// existing configs that omit the field keep behaving identically.
|
||||
/// Overridable per config / `KEBAB_IMAGE_OCR_REQUEST_TIMEOUT_SECS`.
|
||||
fn default_ocr_request_timeout_secs() -> u64 {
|
||||
300
|
||||
}
|
||||
|
||||
/// Caption settings (P6-3). Caption uses the same Ollama-vision /
|
||||
/// `LanguageModel` pipeline as the rest of the workspace; the trait
|
||||
/// abstraction is the part the spec demands. `enabled` defaults to
|
||||
@@ -363,12 +411,14 @@ impl Config {
|
||||
// gemma4 계열 통일 — OCR (P6-2) + caption (P6-3)
|
||||
// 어댑터가 같은 family 사용. 사용자가 더 큰
|
||||
// variant (gemma4:26b 등) 원하면 자기 config.toml
|
||||
// 에서 override.
|
||||
// 에서 override. CPU-only / ≤16 GB RAM 환경이면
|
||||
// gemma3:4b 같은 ≤4B Q4 모델 권장 (README 참조).
|
||||
model: "gemma4:e4b".to_string(),
|
||||
context_tokens: 32768,
|
||||
endpoint: "http://127.0.0.1:11434".to_string(),
|
||||
temperature: 0.0,
|
||||
seed: 0,
|
||||
request_timeout_secs: default_llm_request_timeout_secs(),
|
||||
},
|
||||
},
|
||||
search: SearchCfg {
|
||||
@@ -621,6 +671,11 @@ impl Config {
|
||||
self.models.llm.seed = n;
|
||||
}
|
||||
}
|
||||
"KEBAB_MODELS_LLM_REQUEST_TIMEOUT_SECS" => {
|
||||
if let Ok(n) = v.parse::<u64>() {
|
||||
self.models.llm.request_timeout_secs = n;
|
||||
}
|
||||
}
|
||||
|
||||
// search
|
||||
"KEBAB_SEARCH_DEFAULT_K" => {
|
||||
@@ -691,6 +746,11 @@ impl Config {
|
||||
self.image.ocr.max_pixels = n;
|
||||
}
|
||||
}
|
||||
"KEBAB_IMAGE_OCR_REQUEST_TIMEOUT_SECS" => {
|
||||
if let Ok(n) = v.parse::<u64>() {
|
||||
self.image.ocr.request_timeout_secs = n;
|
||||
}
|
||||
}
|
||||
|
||||
// image.caption (P6-3)
|
||||
"KEBAB_IMAGE_CAPTION_ENABLED" => {
|
||||
@@ -803,6 +863,83 @@ fn parse_bool(s: &str) -> bool {
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
/// Legacy TOML fixture written before the `request_timeout_secs`
|
||||
/// knobs (LLM in v0.17.1, OCR follow-up) existed. Shared by
|
||||
/// `legacy_config_without_request_timeout_secs_uses_default`
|
||||
/// (LLM-side) and `legacy_config_without_ocr_request_timeout_secs_uses_default`
|
||||
/// (OCR-side) so both invariants pin against the same on-disk
|
||||
/// shape — schema drift in the legacy form only needs one edit.
|
||||
const LEGACY_PRE_TIMEOUT_TOML: &str = r#"
|
||||
schema_version = 1
|
||||
|
||||
[workspace]
|
||||
root = "/tmp/x"
|
||||
exclude = []
|
||||
|
||||
[storage]
|
||||
data_dir = "/tmp/x"
|
||||
sqlite = "/tmp/x/kebab.sqlite"
|
||||
vector_dir = "/tmp/x/lancedb"
|
||||
asset_dir = "/tmp/x/assets"
|
||||
artifact_dir = "/tmp/x/artifacts"
|
||||
model_dir = "/tmp/x/models"
|
||||
runs_dir = "/tmp/x/runs"
|
||||
copy_threshold_mb = 100
|
||||
|
||||
[indexing]
|
||||
max_parallel_extractors = 2
|
||||
max_parallel_embeddings = 1
|
||||
watch_filesystem = false
|
||||
|
||||
[chunking]
|
||||
target_tokens = 500
|
||||
overlap_tokens = 80
|
||||
respect_markdown_headings = true
|
||||
chunker_version = "md-heading-v1"
|
||||
|
||||
[models.embedding]
|
||||
provider = "fastembed"
|
||||
model = "multilingual-e5-large"
|
||||
version = "v1"
|
||||
dimensions = 1024
|
||||
batch_size = 64
|
||||
|
||||
[models.llm]
|
||||
provider = "ollama"
|
||||
model = "gemma3:4b"
|
||||
context_tokens = 4096
|
||||
endpoint = "http://127.0.0.1:11434"
|
||||
temperature = 0.0
|
||||
seed = 0
|
||||
|
||||
[search]
|
||||
default_k = 10
|
||||
hybrid_fusion = "rrf"
|
||||
rrf_k = 60
|
||||
snippet_chars = 220
|
||||
|
||||
[rag]
|
||||
prompt_template_version = "rag-v2"
|
||||
score_gate = 0.3
|
||||
explain_default = false
|
||||
max_context_tokens = 8000
|
||||
|
||||
[image.ocr]
|
||||
enabled = false
|
||||
engine = "ollama-vision"
|
||||
model = "gemma3:4b"
|
||||
languages = ["eng"]
|
||||
max_pixels = 1600
|
||||
|
||||
[image.caption]
|
||||
enabled = false
|
||||
max_pixels = 768
|
||||
prompt_template_version = "caption-v1"
|
||||
|
||||
[ui]
|
||||
theme = "dark"
|
||||
"#;
|
||||
|
||||
#[test]
|
||||
fn defaults_are_serde_roundtrip_stable() {
|
||||
let c = Config::defaults();
|
||||
@@ -873,6 +1010,35 @@ mod tests {
|
||||
assert!((c.models.llm.temperature - 0.7).abs() < 1e-6);
|
||||
}
|
||||
|
||||
/// v0.17.0 post-dogfood: matches the legacy hard-coded 300s cap so
|
||||
/// existing configs that omit the new field are not affected.
|
||||
#[test]
|
||||
fn default_llm_request_timeout_secs_is_300() {
|
||||
assert_eq!(Config::defaults().models.llm.request_timeout_secs, 300);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn env_overrides_models_llm_request_timeout_secs() {
|
||||
let mut env = HashMap::new();
|
||||
env.insert(
|
||||
"KEBAB_MODELS_LLM_REQUEST_TIMEOUT_SECS".to_string(),
|
||||
"1200".to_string(),
|
||||
);
|
||||
let c = Config::defaults().apply_env(&env);
|
||||
assert_eq!(c.models.llm.request_timeout_secs, 1200);
|
||||
}
|
||||
|
||||
/// v0.17.0 post-dogfood: a config file written before the field
|
||||
/// existed (no `request_timeout_secs` key) must still parse and fall
|
||||
/// back to the 300s default — backwards-compat invariant. Fixture
|
||||
/// shared with the OCR-side invariant via [`LEGACY_PRE_TIMEOUT_TOML`].
|
||||
#[test]
|
||||
fn legacy_config_without_request_timeout_secs_uses_default() {
|
||||
let c: Config = toml::from_str(LEGACY_PRE_TIMEOUT_TOML)
|
||||
.expect("parse legacy config");
|
||||
assert_eq!(c.models.llm.request_timeout_secs, 300);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn env_overrides_indexing_watch_filesystem_bool() {
|
||||
let mut env = HashMap::new();
|
||||
@@ -894,6 +1060,38 @@ mod tests {
|
||||
assert_eq!(c.image.ocr.max_pixels, 1600);
|
||||
}
|
||||
|
||||
/// v0.17.2 post-dogfood: matches the legacy hard-coded 300s cap so
|
||||
/// existing configs that omit the new field keep behaving identically.
|
||||
#[test]
|
||||
fn default_ocr_request_timeout_secs_is_300() {
|
||||
assert_eq!(
|
||||
Config::defaults().image.ocr.request_timeout_secs,
|
||||
300
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn env_overrides_image_ocr_request_timeout_secs() {
|
||||
let mut env = HashMap::new();
|
||||
env.insert(
|
||||
"KEBAB_IMAGE_OCR_REQUEST_TIMEOUT_SECS".to_string(),
|
||||
"900".to_string(),
|
||||
);
|
||||
let c = Config::defaults().apply_env(&env);
|
||||
assert_eq!(c.image.ocr.request_timeout_secs, 900);
|
||||
}
|
||||
|
||||
/// post-v0.17.1 dogfood: a config file written before the OCR
|
||||
/// timeout field existed must still parse and fall back to the
|
||||
/// 300s default — backwards-compat invariant. Fixture shared
|
||||
/// with the LLM-side invariant via [`LEGACY_PRE_TIMEOUT_TOML`].
|
||||
#[test]
|
||||
fn legacy_config_without_ocr_request_timeout_secs_uses_default() {
|
||||
let c: Config = toml::from_str(LEGACY_PRE_TIMEOUT_TOML)
|
||||
.expect("parse legacy config");
|
||||
assert_eq!(c.image.ocr.request_timeout_secs, 300);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn image_ocr_env_overrides() {
|
||||
let mut env = HashMap::new();
|
||||
|
||||
@@ -47,6 +47,12 @@ pub struct IngestReport {
|
||||
/// p10-1A-1: sample file paths per skip category (≤ 5 each).
|
||||
#[serde(default)]
|
||||
pub skip_examples: SkipExamples,
|
||||
/// Dogfood: docs whose on-disk file was deleted since the last ingest
|
||||
/// and were therefore removed from the store. Additive field — older
|
||||
/// wire consumers that pre-date this field read it as 0 via
|
||||
/// `#[serde(default)]`.
|
||||
#[serde(default)]
|
||||
pub purged_deleted_files: u32,
|
||||
/// `None` ↔ wire `items: null` (`--summary-only`).
|
||||
pub items: Option<Vec<IngestItem>>,
|
||||
}
|
||||
@@ -136,6 +142,7 @@ mod tests {
|
||||
builtin_blacklist: vec!["node_modules/x.js".into()],
|
||||
gitignore: vec![],
|
||||
},
|
||||
purged_deleted_files: 0,
|
||||
items: None,
|
||||
};
|
||||
let v = serde_json::to_value(&r).unwrap();
|
||||
|
||||
@@ -8,7 +8,7 @@ use serde_json::Value;
|
||||
use crate::asset::{RawAsset, WorkspacePath};
|
||||
use crate::chunk::Chunk;
|
||||
use crate::document::{Block, CanonicalDocument};
|
||||
use crate::ids::{ChunkId, DocumentId};
|
||||
use crate::ids::{AssetId, ChunkId, DocumentId};
|
||||
use crate::jobs::{JobFilter, JobId, JobKind, JobRow, JobStatus};
|
||||
use crate::media::MediaType;
|
||||
use crate::search::{DocFilter, DocSummary, SearchFilters, SearchHit, SearchQuery};
|
||||
@@ -161,14 +161,51 @@ pub trait DocumentStore {
|
||||
fn get_document(&self, id: &DocumentId) -> anyhow::Result<Option<CanonicalDocument>>;
|
||||
fn get_chunk(&self, id: &ChunkId) -> anyhow::Result<Option<Chunk>>;
|
||||
fn list_documents(&self, filter: &DocFilter) -> anyhow::Result<Vec<DocSummary>>;
|
||||
/// Look up an asset row by its `asset_id` (PRIMARY KEY = blake3
|
||||
/// content hash). Twin-file safe: asset_id is PK so there is
|
||||
/// exactly one row per unique content hash, regardless of how many
|
||||
/// `documents` rows share it. Use this instead of
|
||||
/// `get_asset_by_workspace_path` when you already have a
|
||||
/// `CanonicalDocument` (which carries `source_asset_id`).
|
||||
fn get_asset(&self, id: &AssetId) -> anyhow::Result<Option<RawAsset>>;
|
||||
|
||||
/// p9-fb-23: look up an asset row by its workspace path. Used by
|
||||
/// the incremental-ingest skip path to compare the freshly
|
||||
/// computed blake3 checksum against what's already in SQLite. The
|
||||
/// schema enforces a unique workspace_path per asset.
|
||||
///
|
||||
/// NOTE: for twin files (identical content at different paths),
|
||||
/// `assets.workspace_path` is "last-registered path" — it
|
||||
/// flip-flops on every ingest. Prefer `get_asset` (by asset_id)
|
||||
/// when you have a `CanonicalDocument.source_asset_id`.
|
||||
fn get_asset_by_workspace_path(
|
||||
&self,
|
||||
path: &WorkspacePath,
|
||||
) -> anyhow::Result<Option<RawAsset>>;
|
||||
|
||||
/// Look up a document row by its workspace path. Used by the
|
||||
/// document-centric skip path in `try_skip_unchanged` to avoid the
|
||||
/// twin-file flip-flop that the asset-side lookup suffers from
|
||||
/// (multiple files with identical content share one `assets` row
|
||||
/// whose `workspace_path` is overwritten on every UPSERT, so
|
||||
/// `get_asset_by_workspace_path` returns the wrong twin's path).
|
||||
///
|
||||
/// `documents.workspace_path` is UNIQUE (V001), so each twin has
|
||||
/// its own stable document row regardless of the asset de-dup.
|
||||
fn get_document_by_workspace_path(
|
||||
&self,
|
||||
path: &WorkspacePath,
|
||||
) -> anyhow::Result<Option<CanonicalDocument>>;
|
||||
|
||||
/// Return every `workspace_path` stored in the `documents` table.
|
||||
///
|
||||
/// Used by the post-walker sweep in `kebab-app::ingest` to detect
|
||||
/// documents whose source file has been deleted from the filesystem.
|
||||
/// The set difference `(stored - scanned)` yields orphan candidates;
|
||||
/// each candidate is then existence-checked on disk so that
|
||||
/// out-of-scope files (config narrowing) are NOT purged — only truly
|
||||
/// absent files trigger the purge.
|
||||
fn all_workspace_paths(&self) -> anyhow::Result<Vec<WorkspacePath>>;
|
||||
}
|
||||
|
||||
pub trait VectorStore {
|
||||
|
||||
@@ -5,7 +5,7 @@
|
||||
"chunk_id": "chunk000000000000000000000000000000",
|
||||
"doc_id": "doc00000000000000000000000000000000",
|
||||
"heading_path": [],
|
||||
"score": 0.3429983854293823
|
||||
"score": 0.35202541947364807
|
||||
},
|
||||
"has_answer": false,
|
||||
"hits_count": 1,
|
||||
@@ -19,7 +19,7 @@
|
||||
"chunk_id": "chunk000000000000000000000000000002",
|
||||
"doc_id": "doc00000000000000000000000000000002",
|
||||
"heading_path": [],
|
||||
"score": 0.3585492968559265
|
||||
"score": 0.3414848744869232
|
||||
},
|
||||
"has_answer": false,
|
||||
"hits_count": 1,
|
||||
|
||||
@@ -48,10 +48,17 @@ use serde::{Deserialize, Serialize};
|
||||
|
||||
use crate::error::LlmError;
|
||||
|
||||
/// Hard ceiling on a single HTTP exchange. Cold-loading a 14B model on
|
||||
/// first call can take ~30s; 5 minutes is generous without being
|
||||
/// open-ended.
|
||||
const REQUEST_TIMEOUT: Duration = Duration::from_secs(300);
|
||||
// v0.17.0 post-dogfood: the per-request ceiling now lives in
|
||||
// `kebab_config::LlmCfg::request_timeout_secs` (default 300s) so users
|
||||
// running larger models on CPU-only hosts can extend it without a
|
||||
// rebuild. Cold-loading an 8B+ model on first call routinely takes
|
||||
// 60-90 s plus multi-minute inference; 300s was the legacy hard
|
||||
// ceiling and remains the default for back-compat.
|
||||
//
|
||||
// Edge case: `request_timeout_secs = 0` becomes
|
||||
// `Duration::from_secs(0)` which is reqwest's "fail immediately", NOT
|
||||
// "disable". The field doc explains the workaround (use u64::MAX or a
|
||||
// large finite value).
|
||||
|
||||
/// `reqwest::blocking` adapter implementing [`LanguageModel`] over Ollama's
|
||||
/// local HTTP API. Construction is cheap and offline; the first network
|
||||
@@ -79,7 +86,7 @@ impl OllamaLanguageModel {
|
||||
pub fn new(config: &kebab_config::Config) -> anyhow::Result<Self> {
|
||||
let llm = &config.models.llm;
|
||||
let client = reqwest::blocking::Client::builder()
|
||||
.timeout(REQUEST_TIMEOUT)
|
||||
.timeout(Duration::from_secs(llm.request_timeout_secs))
|
||||
.build()?;
|
||||
Ok(Self {
|
||||
client,
|
||||
@@ -262,9 +269,11 @@ struct OllamaLine {
|
||||
///
|
||||
/// Timeout invariant: the iterator has no inherent stop condition for an
|
||||
/// indefinitely-stalled server — only the underlying
|
||||
/// `reqwest::blocking::Client`'s read timeout (`REQUEST_TIMEOUT`, 300s)
|
||||
/// breaks the hang. Callers needing tighter cancellation should adjust
|
||||
/// the client timeout in [`OllamaLanguageModel::new`].
|
||||
/// `reqwest::blocking::Client`'s read timeout (configured via
|
||||
/// `kebab_config::LlmCfg::request_timeout_secs`, default 300 s) breaks
|
||||
/// the hang. Callers needing tighter / looser bounds should set
|
||||
/// `[models.llm] request_timeout_secs = N` (or
|
||||
/// `KEBAB_MODELS_LLM_REQUEST_TIMEOUT_SECS=N`) before building.
|
||||
struct OllamaStream {
|
||||
reader: BufReader<reqwest::blocking::Response>,
|
||||
line_buf: Vec<u8>,
|
||||
|
||||
@@ -19,6 +19,11 @@ tree-sitter-rust = { workspace = true }
|
||||
tree-sitter-python = { workspace = true }
|
||||
tree-sitter-typescript = { workspace = true }
|
||||
tree-sitter-javascript = { workspace = true }
|
||||
tree-sitter-go = { workspace = true }
|
||||
tree-sitter-java = { workspace = true }
|
||||
tree-sitter-kotlin-ng = { workspace = true }
|
||||
tree-sitter-c = { workspace = true }
|
||||
tree-sitter-cpp = { workspace = true }
|
||||
|
||||
[dev-dependencies]
|
||||
tempfile = { workspace = true }
|
||||
|
||||
720
crates/kebab-parse-code/src/c.rs
Normal file
720
crates/kebab-parse-code/src/c.rs
Normal file
@@ -0,0 +1,720 @@
|
||||
//! `kebab-parse-code::c` — tree-sitter C AST extractor (P10-1D Task B).
|
||||
//!
|
||||
//! Implements [`kebab_core::Extractor`] for [`MediaType::Code("c")`].
|
||||
//! Walks the tree-sitter parse tree and emits one [`Block::Code`] per
|
||||
//! top-level AST semantic unit:
|
||||
//!
|
||||
//! - `function_definition` → 1 unit, symbol = function name (extracted
|
||||
//! from the declarator's innermost `identifier`, handles pointer-returning
|
||||
//! functions where the declarator is wrapped in `pointer_declarator`).
|
||||
//! - `struct_specifier` (named) → 1 unit, symbol = struct name.
|
||||
//! - `enum_specifier` (named) → 1 unit, symbol = enum name.
|
||||
//! - `union_specifier` (named) → 1 unit, symbol = union name.
|
||||
//!
|
||||
//! Everything else (`declaration`, `preproc_*`, `type_definition`,
|
||||
//! `linkage_specification`, etc.) collapses into a single `<top-level>`
|
||||
//! glue chunk. If the file produces zero units **and** zero glue, the
|
||||
//! `<module>` post-pass emits one unit covering the whole file (1A-2
|
||||
//! pattern).
|
||||
//!
|
||||
//! C symbol = function name only — no namespace, no class nesting
|
||||
//! (design §3.4 C row). Per design §3.4 / §9.1 / §9 versioning.
|
||||
|
||||
use anyhow::Result;
|
||||
use kebab_core::{
|
||||
Block, CanonicalDocument, CodeBlock, CommonBlock, Extractor, Lang, MediaType, Metadata,
|
||||
ParserVersion, Provenance, ProvenanceEvent, ProvenanceKind, SourceSpan, SourceType, TrustLevel,
|
||||
id_for_block, id_for_doc,
|
||||
};
|
||||
use serde_json::Map;
|
||||
use time::OffsetDateTime;
|
||||
|
||||
use crate::scaffold::{filename_from_workspace_path, strip_extension};
|
||||
|
||||
pub const PARSER_VERSION: &str = "code-c-v2";
|
||||
|
||||
/// C AST extractor. Per-unit blocks via tree-sitter-c 0.24.2
|
||||
/// (`LANGUAGE: LanguageFn`) parsed by tree-sitter 0.26.
|
||||
pub struct CAstExtractor;
|
||||
|
||||
impl CAstExtractor {
|
||||
pub fn new() -> Self {
|
||||
Self
|
||||
}
|
||||
}
|
||||
|
||||
impl Default for CAstExtractor {
|
||||
fn default() -> Self {
|
||||
Self::new()
|
||||
}
|
||||
}
|
||||
|
||||
impl Extractor for CAstExtractor {
|
||||
fn supports(&self, m: &MediaType) -> bool {
|
||||
matches!(m, MediaType::Code(l) if l == "c")
|
||||
}
|
||||
|
||||
fn parser_version(&self) -> ParserVersion {
|
||||
ParserVersion(PARSER_VERSION.to_string())
|
||||
}
|
||||
|
||||
fn extract(
|
||||
&self,
|
||||
ctx: &kebab_core::ExtractContext<'_>,
|
||||
bytes: &[u8],
|
||||
) -> Result<CanonicalDocument> {
|
||||
let asset = ctx.asset;
|
||||
if !self.supports(&asset.media_type) {
|
||||
anyhow::bail!(
|
||||
"kebab-parse-code: unsupported media_type for CAstExtractor: {:?}",
|
||||
asset.media_type
|
||||
);
|
||||
}
|
||||
|
||||
let parser_version = self.parser_version();
|
||||
let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, &parser_version);
|
||||
|
||||
let source = String::from_utf8(bytes.to_vec())
|
||||
.map_err(|e| anyhow::anyhow!("kebab-parse-code: C source is not valid UTF-8: {e}"))?;
|
||||
|
||||
let blocks = build_blocks(&source, &doc_id)?;
|
||||
let unit_count = blocks.len() as u32;
|
||||
|
||||
let now = OffsetDateTime::now_utc();
|
||||
let mut events: Vec<ProvenanceEvent> = Vec::with_capacity(2);
|
||||
events.push(ProvenanceEvent {
|
||||
at: asset.discovered_at,
|
||||
agent: "kb-source-fs".to_string(),
|
||||
kind: ProvenanceKind::Discovered,
|
||||
note: None,
|
||||
});
|
||||
events.push(ProvenanceEvent {
|
||||
at: now,
|
||||
agent: "kb-parse-code".to_string(),
|
||||
kind: ProvenanceKind::Parsed,
|
||||
note: Some(format!(
|
||||
"parser_version={}; unit_count={}",
|
||||
parser_version.0, unit_count
|
||||
)),
|
||||
});
|
||||
|
||||
let title = {
|
||||
let fname = filename_from_workspace_path(&asset.workspace_path.0);
|
||||
strip_extension(&fname)
|
||||
};
|
||||
|
||||
let abs_path = match &asset.source_uri {
|
||||
kebab_core::SourceUri::File(p) => {
|
||||
if p.is_absolute() {
|
||||
p.clone()
|
||||
} else {
|
||||
ctx.workspace_root.join(p)
|
||||
}
|
||||
}
|
||||
kebab_core::SourceUri::Kb(_) => ctx.workspace_root.to_path_buf(),
|
||||
};
|
||||
let (repo, git_branch, git_commit) = match crate::repo::detect_repo(&abs_path) {
|
||||
Some(r) => (Some(r.name), r.branch, r.commit),
|
||||
None => (None, None, None),
|
||||
};
|
||||
|
||||
let metadata = Metadata {
|
||||
aliases: Vec::new(),
|
||||
tags: Vec::new(),
|
||||
created_at: asset.discovered_at,
|
||||
updated_at: asset.discovered_at,
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Map::new(),
|
||||
repo,
|
||||
git_branch,
|
||||
git_commit,
|
||||
code_lang: Some("c".to_string()),
|
||||
};
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-parse-code",
|
||||
"extracted C doc_id={} workspace_path={} units={}",
|
||||
doc_id.0,
|
||||
asset.workspace_path.0,
|
||||
unit_count
|
||||
);
|
||||
|
||||
Ok(CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: asset.asset_id.clone(),
|
||||
workspace_path: asset.workspace_path.clone(),
|
||||
title,
|
||||
lang: Lang("und".to_string()),
|
||||
blocks,
|
||||
metadata,
|
||||
provenance: Provenance { events },
|
||||
parser_version,
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
/// Walk down the declarator chain of a `function_definition` to find
|
||||
/// the innermost `identifier` — the function name.
|
||||
///
|
||||
/// The tree for `int *foo(int x) { ... }` looks like:
|
||||
/// ```text
|
||||
/// function_definition
|
||||
/// type: primitive_type "int"
|
||||
/// declarator: pointer_declarator
|
||||
/// declarator: function_declarator
|
||||
/// declarator: identifier "foo"
|
||||
/// parameters: parameter_list
|
||||
/// body: compound_statement
|
||||
/// ```
|
||||
/// We walk `declarator` fields recursively until we reach an `identifier`
|
||||
/// or run out of nodes. Returns `None` if no identifier is found
|
||||
/// (malformed / unsupported declarator shape).
|
||||
fn extract_fn_name<'a>(decl_node: tree_sitter::Node, src: &'a str) -> Option<&'a str> {
|
||||
let mut cur = decl_node;
|
||||
loop {
|
||||
match cur.kind() {
|
||||
"identifier" => return Some(&src[cur.start_byte()..cur.end_byte()]),
|
||||
// pointer_declarator, function_declarator, array_declarator,
|
||||
// attributed_declarator, parenthesized_declarator —
|
||||
// all carry a `declarator` field pointing deeper.
|
||||
_ => {
|
||||
if let Some(inner) = cur.child_by_field_name("declarator") {
|
||||
cur = inner;
|
||||
} else {
|
||||
// No further `declarator` field; give up.
|
||||
return None;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
fn build_blocks(
|
||||
source: &str,
|
||||
doc_id: &kebab_core::DocumentId,
|
||||
) -> anyhow::Result<Vec<kebab_core::Block>> {
|
||||
let mut parser = tree_sitter::Parser::new();
|
||||
parser
|
||||
.set_language(&tree_sitter_c::LANGUAGE.into())
|
||||
.map_err(|e| anyhow::anyhow!("set tree-sitter-c language: {e}"))?;
|
||||
let tree = parser
|
||||
.parse(source.as_bytes(), None)
|
||||
.ok_or_else(|| anyhow::anyhow!("tree-sitter failed to parse C source"))?;
|
||||
let lines: Vec<&str> = source.split('\n').collect();
|
||||
|
||||
let root = tree.root_node();
|
||||
|
||||
// units: (symbol, line_start, line_end, is_real_semantic_unit).
|
||||
// Glue is accumulated as (start, end) pairs and flushed into one
|
||||
// "<top-level>" block (or "<module>" if no real unit exists).
|
||||
let mut units: Vec<(String, u32, u32, bool)> = Vec::new();
|
||||
let mut glue: Vec<(u32, u32)> = Vec::new();
|
||||
|
||||
/// Walk preceding `comment` siblings to extend the unit's line range
|
||||
/// upward, folding doc / line comments into the unit (1B pattern).
|
||||
fn unit_start(n: &tree_sitter::Node) -> u32 {
|
||||
let mut start = n.start_position().row as u32 + 1;
|
||||
let mut prev = n.prev_sibling();
|
||||
while let Some(p) = prev {
|
||||
if p.kind() == "comment" {
|
||||
start = p.start_position().row as u32 + 1;
|
||||
prev = p.prev_sibling();
|
||||
} else {
|
||||
break;
|
||||
}
|
||||
}
|
||||
start
|
||||
}
|
||||
|
||||
let mut cur = root.walk();
|
||||
for child in root.named_children(&mut cur) {
|
||||
let s = unit_start(&child);
|
||||
let e = child.end_position().row as u32 + 1;
|
||||
|
||||
match child.kind() {
|
||||
"function_definition" => {
|
||||
if let Some(decl) = child.child_by_field_name("declarator") {
|
||||
if let Some(name) = extract_fn_name(decl, source) {
|
||||
flush_glue(&mut glue, &mut units);
|
||||
units.push((name.to_string(), s, e, true));
|
||||
} else {
|
||||
// Could not extract name — treat as glue.
|
||||
glue.push((s, e));
|
||||
}
|
||||
} else {
|
||||
glue.push((s, e));
|
||||
}
|
||||
}
|
||||
"struct_specifier" | "enum_specifier" | "union_specifier" => {
|
||||
if let Some(name_node) = child.child_by_field_name("name") {
|
||||
let name = &source[name_node.start_byte()..name_node.end_byte()];
|
||||
flush_glue(&mut glue, &mut units);
|
||||
units.push((name.to_string(), s, e, true));
|
||||
} else {
|
||||
// Anonymous struct/enum/union at the top level (not
|
||||
// wrapped in typedef) — glue. typedef-wrapped case
|
||||
// is recovered in the `type_definition` arm below.
|
||||
glue.push((s, e));
|
||||
}
|
||||
}
|
||||
"type_definition" => {
|
||||
// v0.17.0 PR-B: typedef-wrapped anonymous aggregate
|
||||
// recovery. `typedef struct { ... } Foo;` exposes only
|
||||
// the alias `Foo` as a useful symbol — the inner
|
||||
// struct_specifier has no `name` field. Pre-v0.17.0
|
||||
// this whole construct collapsed into glue and hid the
|
||||
// alias from search (HOTFIXES 2026-05-21). v2 recovers
|
||||
// the alias from the `declarator` field and emits a
|
||||
// synthetic unit so `Citation::Code.symbol = "Foo"`.
|
||||
// Plain `typedef int MyInt;` (no inner aggregate) stays
|
||||
// glue — there's no struct body to name.
|
||||
if let Some(name) = recover_typedef_alias(child, source) {
|
||||
flush_glue(&mut glue, &mut units);
|
||||
units.push((name, s, e, true));
|
||||
} else {
|
||||
glue.push((s, e));
|
||||
}
|
||||
}
|
||||
// Everything else: preprocessor directives, plain declarations
|
||||
// (global var / fn prototype), linkage_specification, etc.
|
||||
// — all collapse into glue.
|
||||
_ => {
|
||||
glue.push((s, e));
|
||||
}
|
||||
}
|
||||
}
|
||||
flush_glue(&mut glue, &mut units);
|
||||
|
||||
// Post-pass: if the file has no real semantic unit (only glue, or
|
||||
// completely empty), rename the single glue unit to "<module>" and
|
||||
// emit it. If there are zero units AND zero glue, synthesise a
|
||||
// one-line "<module>" covering the whole file.
|
||||
let has_real_unit = units.iter().any(|(_, _, _, is_real)| *is_real);
|
||||
|
||||
if units.is_empty() {
|
||||
// Completely empty file or whitespace/comments only.
|
||||
let total = lines.len() as u32;
|
||||
units.push((
|
||||
"<module>".to_string(),
|
||||
1,
|
||||
total.max(1),
|
||||
false,
|
||||
));
|
||||
}
|
||||
// If there is only glue (no real unit) the single pushed "<top-level>"
|
||||
// label should be "<module>" — rename it now.
|
||||
if !has_real_unit {
|
||||
for (sym, _, _, _) in units.iter_mut() {
|
||||
if sym == "<top-level>" {
|
||||
*sym = "<module>".to_string();
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let total_lines = lines.len() as u32;
|
||||
let mut blocks = Vec::with_capacity(units.len());
|
||||
for (ordinal, (symbol, ls, le, _is_real)) in units.into_iter().enumerate() {
|
||||
let line_start = ls.max(1);
|
||||
let line_end = le.min(total_lines.max(1));
|
||||
let span = SourceSpan::Code {
|
||||
line_start,
|
||||
line_end,
|
||||
symbol: Some(symbol),
|
||||
lang: Some("c".to_string()),
|
||||
};
|
||||
let block_id = id_for_block(doc_id, "code", &[], ordinal as u32, &span);
|
||||
let code = lines[(line_start as usize - 1)..=(line_end as usize - 1)].join("\n");
|
||||
blocks.push(Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id,
|
||||
heading_path: Vec::new(),
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some("c".to_string()),
|
||||
code,
|
||||
}));
|
||||
}
|
||||
Ok(blocks)
|
||||
}
|
||||
|
||||
/// v0.17.0 PR-B: try to recover the typedef alias name from a
|
||||
/// `type_definition` node *iff* the inner type-specifier is an
|
||||
/// anonymous struct/enum/union. Returns `None` for any other shape
|
||||
/// (named aggregate handled elsewhere, plain type alias has no body
|
||||
/// worth naming).
|
||||
fn recover_typedef_alias(node: tree_sitter::Node, source: &str) -> Option<String> {
|
||||
let mut has_anon_aggregate = false;
|
||||
let mut cursor = node.walk();
|
||||
for sub in node.children(&mut cursor) {
|
||||
match sub.kind() {
|
||||
"struct_specifier" | "enum_specifier" | "union_specifier" => {
|
||||
if sub.child_by_field_name("name").is_none() {
|
||||
has_anon_aggregate = true;
|
||||
} else {
|
||||
// Named inner aggregate (e.g. `typedef struct Pt {...} P;`)
|
||||
// — the named struct itself is the primary symbol and
|
||||
// is *not* extracted at the top level today (it lives
|
||||
// inside `type_definition`, not as a sibling
|
||||
// `struct_specifier`). For v2 we keep behavior conservative:
|
||||
// return None so the type_definition stays glue, matching
|
||||
// pre-v2 behavior for this minor case. Real-world C tends
|
||||
// to use one of: bare named struct, typedef alias only,
|
||||
// or typedef on anonymous body — the latter is what we fix.
|
||||
return None;
|
||||
}
|
||||
}
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
if !has_anon_aggregate {
|
||||
return None;
|
||||
}
|
||||
let decl = node.child_by_field_name("declarator")?;
|
||||
extract_typedef_alias_name(decl, source).map(str::to_string)
|
||||
}
|
||||
|
||||
/// Extract the typedef alias identifier from a declarator subtree.
|
||||
/// Handles the common shapes: direct `type_identifier`, or one wrapped
|
||||
/// in pointer / function declarator nodes (the alias is always the
|
||||
/// rightmost `type_identifier` descendant).
|
||||
fn extract_typedef_alias_name<'a>(
|
||||
decl: tree_sitter::Node,
|
||||
source: &'a str,
|
||||
) -> Option<&'a str> {
|
||||
if decl.kind() == "type_identifier" {
|
||||
return Some(&source[decl.start_byte()..decl.end_byte()]);
|
||||
}
|
||||
let mut cursor = decl.walk();
|
||||
for sub in decl.children(&mut cursor) {
|
||||
if let Some(found) = extract_typedef_alias_name(sub, source) {
|
||||
return Some(found);
|
||||
}
|
||||
}
|
||||
None
|
||||
}
|
||||
|
||||
fn flush_glue(glue: &mut Vec<(u32, u32)>, units: &mut Vec<(String, u32, u32, bool)>) {
|
||||
if glue.is_empty() {
|
||||
return;
|
||||
}
|
||||
let s = glue.iter().map(|(a, _)| *a).min().unwrap();
|
||||
let e = glue.iter().map(|(_, b)| *b).max().unwrap();
|
||||
units.push(("<top-level>".to_string(), s, e, false));
|
||||
glue.clear();
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Tests
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
#[cfg(test)]
|
||||
pub(crate) mod tests_support {
|
||||
use kebab_core::*;
|
||||
use std::path::PathBuf;
|
||||
use time::OffsetDateTime;
|
||||
|
||||
pub fn fixed_code_asset(workspace_path: &str, lang: &str) -> RawAsset {
|
||||
RawAsset {
|
||||
asset_id: AssetId("a".repeat(64)),
|
||||
source_uri: SourceUri::File(PathBuf::from(workspace_path)),
|
||||
workspace_path: WorkspacePath(workspace_path.to_string()),
|
||||
media_type: MediaType::Code(lang.to_string()),
|
||||
byte_len: 0,
|
||||
checksum: Checksum("b".repeat(64)),
|
||||
discovered_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
stored: AssetStorage::Reference {
|
||||
path: PathBuf::from(workspace_path),
|
||||
sha: Checksum("b".repeat(64)),
|
||||
},
|
||||
}
|
||||
}
|
||||
|
||||
pub fn extract_c(src: &str, path: &str) -> kebab_core::CanonicalDocument {
|
||||
use super::CAstExtractor;
|
||||
use kebab_core::Extractor;
|
||||
let asset = fixed_code_asset(path, "c");
|
||||
let cfg = ExtractConfig::default();
|
||||
let root = PathBuf::from("/tmp");
|
||||
let ctx = ExtractContext {
|
||||
asset: &asset,
|
||||
workspace_root: &root,
|
||||
config: &cfg,
|
||||
};
|
||||
CAstExtractor::new().extract(&ctx, src.as_bytes()).unwrap()
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use kebab_core::{Block, MediaType, SourceSpan};
|
||||
|
||||
fn syms(doc: &kebab_core::CanonicalDocument) -> Vec<String> {
|
||||
doc.blocks
|
||||
.iter()
|
||||
.filter_map(|b| match b {
|
||||
Block::Code(c) => match &c.common.source_span {
|
||||
SourceSpan::Code { symbol, .. } => symbol.clone(),
|
||||
_ => None,
|
||||
},
|
||||
_ => None,
|
||||
})
|
||||
.collect()
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn extractor_supports_only_media_code_c() {
|
||||
let e = CAstExtractor::new();
|
||||
assert!(e.supports(&MediaType::Code("c".into())));
|
||||
assert!(!e.supports(&MediaType::Code("cpp".into())));
|
||||
assert!(!e.supports(&MediaType::Code("rust".into())));
|
||||
assert!(!e.supports(&MediaType::Markdown));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn c_extractor_simple_function() {
|
||||
let src = "int add(int a, int b) { return a + b; }\n";
|
||||
let doc = tests_support::extract_c(src, "x/math.c");
|
||||
let s = syms(&doc);
|
||||
assert!(s.iter().any(|x| x == "add"), "got {s:?}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn c_extractor_pointer_return_function() {
|
||||
let src = "int *find(int *arr, int n) { return arr; }\n";
|
||||
let doc = tests_support::extract_c(src, "x/find.c");
|
||||
let s = syms(&doc);
|
||||
assert!(s.iter().any(|x| x == "find"), "ptr-return fn missing: {s:?}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn c_extractor_static_function() {
|
||||
let src = "static void helper(void) {}\n";
|
||||
let doc = tests_support::extract_c(src, "x/helper.c");
|
||||
let s = syms(&doc);
|
||||
assert!(s.iter().any(|x| x == "helper"), "static fn missing: {s:?}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn c_extractor_extern_function() {
|
||||
let src = "extern int compute(int x);\n";
|
||||
// extern prototype is a declaration → glue
|
||||
let doc = tests_support::extract_c(src, "x/compute.c");
|
||||
let s = syms(&doc);
|
||||
// declaration (prototype) falls into glue → "<module>"
|
||||
assert!(
|
||||
s.iter().any(|x| x == "<module>"),
|
||||
"expected <module> for extern proto: {s:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn c_extractor_inline_function() {
|
||||
let src = "inline int square(int x) { return x * x; }\n";
|
||||
let doc = tests_support::extract_c(src, "x/square.c");
|
||||
let s = syms(&doc);
|
||||
assert!(s.iter().any(|x| x == "square"), "inline fn missing: {s:?}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn c_extractor_named_struct() {
|
||||
let src = "struct Point { int x; int y; };\n";
|
||||
let doc = tests_support::extract_c(src, "x/point.c");
|
||||
let s = syms(&doc);
|
||||
assert!(s.iter().any(|x| x == "Point"), "struct missing: {s:?}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn c_extractor_named_enum() {
|
||||
let src = "enum Color { RED, GREEN, BLUE };\n";
|
||||
let doc = tests_support::extract_c(src, "x/color.c");
|
||||
let s = syms(&doc);
|
||||
assert!(s.iter().any(|x| x == "Color"), "enum missing: {s:?}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn c_extractor_named_union() {
|
||||
let src = "union Data { int i; float f; };\n";
|
||||
let doc = tests_support::extract_c(src, "x/data.c");
|
||||
let s = syms(&doc);
|
||||
assert!(s.iter().any(|x| x == "Data"), "union missing: {s:?}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn c_extractor_anonymous_struct_falls_into_glue() {
|
||||
// Anonymous struct (no name field) → glue → "<module>" (only glue, no real unit)
|
||||
let src = "struct { int x; int y; } origin;\n";
|
||||
let doc = tests_support::extract_c(src, "x/anon.c");
|
||||
let s = syms(&doc);
|
||||
// anonymous struct is a declaration containing anonymous struct_specifier → glue
|
||||
assert!(
|
||||
s.iter().any(|x| x == "<module>"),
|
||||
"expected <module> for anon struct: {s:?}"
|
||||
);
|
||||
// Must NOT emit a unit named after anything else
|
||||
assert!(
|
||||
!s.iter().any(|x| x == "origin"),
|
||||
"unexpected 'origin' unit: {s:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn c_extractor_typedef_struct_emits_unit() {
|
||||
// v0.17.0 PR-B: `typedef struct { ... } Foo;` was previously a
|
||||
// hotfix-tracked deviation (HOTFIXES.md 2026-05-21) — the inner
|
||||
// struct_specifier is anonymous so the named-struct arm didn't
|
||||
// fire, dropping the whole construct into glue and hiding the
|
||||
// `Foo` alias from symbol search. The v2 extractor recovers the
|
||||
// typedef alias from the `declarator` field on the
|
||||
// `type_definition` node and emits a synthetic unit with that
|
||||
// name. parser_version bumped `code-c-v1` → `code-c-v2`.
|
||||
let src = "typedef struct { int x; int y; } Point;\n";
|
||||
let doc = tests_support::extract_c(src, "x/typedef.c");
|
||||
let s = syms(&doc);
|
||||
// The typedef alias surfaces as a Code symbol.
|
||||
assert!(
|
||||
s.iter().any(|x| x == "Point"),
|
||||
"expected 'Point' unit from typedef alias: {s:?}"
|
||||
);
|
||||
// No `<module>` (the file has exactly one semantic unit now,
|
||||
// the typedef alias — no glue-only fallback needed).
|
||||
assert!(
|
||||
!s.iter().any(|x| x == "<module>"),
|
||||
"no <module> fallback expected when typedef emits a unit: {s:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn c_extractor_typedef_enum_emits_unit() {
|
||||
// Parallel coverage for enum_specifier — same typedef-alias
|
||||
// synthesis path. `typedef enum { A, B } Color;` → unit `Color`.
|
||||
let src = "typedef enum { A, B } Color;\n";
|
||||
let doc = tests_support::extract_c(src, "x/typedef_enum.c");
|
||||
let s = syms(&doc);
|
||||
assert!(
|
||||
s.iter().any(|x| x == "Color"),
|
||||
"expected 'Color' unit from typedef enum alias: {s:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn c_extractor_typedef_union_emits_unit() {
|
||||
// Parallel coverage for union_specifier.
|
||||
let src = "typedef union { int i; float f; } IntOrFloat;\n";
|
||||
let doc = tests_support::extract_c(src, "x/typedef_union.c");
|
||||
let s = syms(&doc);
|
||||
assert!(
|
||||
s.iter().any(|x| x == "IntOrFloat"),
|
||||
"expected 'IntOrFloat' unit from typedef union alias: {s:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn c_extractor_typedef_to_existing_type_stays_glue() {
|
||||
// Negative case: `typedef int MyInt;` has no inner struct/enum/
|
||||
// union — there's no struct body to attach the alias to, so the
|
||||
// construct falls into glue (becomes `<module>` when alone).
|
||||
// Confirms the new arm only fires for anonymous-struct typedef.
|
||||
let src = "typedef int MyInt;\n";
|
||||
let doc = tests_support::extract_c(src, "x/typedef_alias.c");
|
||||
let s = syms(&doc);
|
||||
assert!(
|
||||
s.iter().any(|x| x == "<module>"),
|
||||
"expected <module> for plain typedef alias: {s:?}"
|
||||
);
|
||||
assert!(
|
||||
!s.iter().any(|x| x == "MyInt"),
|
||||
"plain typedef alias must not emit a unit: {s:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn c_extractor_preprocessor_directives_are_glue() {
|
||||
let src = "#include <stdio.h>\n#define MAX 100\n#ifdef DEBUG\n#endif\n";
|
||||
let doc = tests_support::extract_c(src, "x/macros.c");
|
||||
let s = syms(&doc);
|
||||
// Only preprocessor → no real unit → "<module>"
|
||||
assert!(
|
||||
s.iter().any(|x| x == "<module>"),
|
||||
"expected <module> for preproc-only file: {s:?}"
|
||||
);
|
||||
assert_eq!(s.len(), 1, "expected exactly 1 block: {s:?}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn c_extractor_multiple_functions_correct_count() {
|
||||
let src = "int foo(void) { return 1; }\nint bar(void) { return 2; }\nint baz(void) { return 3; }\n";
|
||||
let doc = tests_support::extract_c(src, "x/multi.c");
|
||||
let s = syms(&doc);
|
||||
assert!(s.iter().any(|x| x == "foo"), "foo missing: {s:?}");
|
||||
assert!(s.iter().any(|x| x == "bar"), "bar missing: {s:?}");
|
||||
assert!(s.iter().any(|x| x == "baz"), "baz missing: {s:?}");
|
||||
assert_eq!(s.len(), 3, "expected 3 units: {s:?}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn c_extractor_empty_file_produces_module() {
|
||||
let src = "";
|
||||
let doc = tests_support::extract_c(src, "x/empty.c");
|
||||
let s = syms(&doc);
|
||||
assert_eq!(s, vec!["<module>"], "expected <module>: got {s:?}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn c_extractor_preprocessor_only_produces_module() {
|
||||
let src = "#include <stdlib.h>\n#define VERSION \"1.0\"\n";
|
||||
let doc = tests_support::extract_c(src, "x/header.c");
|
||||
let s = syms(&doc);
|
||||
assert!(
|
||||
s.iter().any(|x| x == "<module>"),
|
||||
"expected <module> for preproc-only file: {s:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn c_extractor_mixed_functions_and_glue() {
|
||||
let src = r#"#include <stdio.h>
|
||||
|
||||
int compute(int x) {
|
||||
return x * 2;
|
||||
}
|
||||
|
||||
extern int lookup(int key);
|
||||
|
||||
void print_result(int v) {
|
||||
printf("%d\n", v);
|
||||
}
|
||||
"#;
|
||||
let doc = tests_support::extract_c(src, "x/mixed.c");
|
||||
let s = syms(&doc);
|
||||
// Two real functions + one glue block
|
||||
assert!(s.iter().any(|x| x == "compute"), "compute missing: {s:?}");
|
||||
assert!(s.iter().any(|x| x == "print_result"), "print_result missing: {s:?}");
|
||||
assert!(
|
||||
s.iter().any(|x| x == "<top-level>"),
|
||||
"<top-level> glue missing: {s:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn c_extractor_deterministic_across_runs() {
|
||||
let src = r#"
|
||||
struct Node { int val; };
|
||||
int sum(int a, int b) { return a + b; }
|
||||
void noop(void) {}
|
||||
"#;
|
||||
let a = tests_support::extract_c(src, "x/det.c");
|
||||
for _ in 0..20 {
|
||||
assert_eq!(
|
||||
tests_support::extract_c(src, "x/det.c").blocks,
|
||||
a.blocks
|
||||
);
|
||||
}
|
||||
}
|
||||
}
|
||||
883
crates/kebab-parse-code/src/cpp.rs
Normal file
883
crates/kebab-parse-code/src/cpp.rs
Normal file
@@ -0,0 +1,883 @@
|
||||
//! `kebab-parse-code::cpp` — tree-sitter C++ AST extractor (P10-1D Task C).
|
||||
//!
|
||||
//! Implements [`kebab_core::Extractor`] for [`MediaType::Code("cpp")`].
|
||||
//! Walks the tree-sitter parse tree and emits one [`Block::Code`] per
|
||||
//! top-level AST semantic unit, each carrying [`SourceSpan::Code`] with
|
||||
//! the unit's `::` separated symbol path (design §3.4 C++ row).
|
||||
//!
|
||||
//! ## Symbol formation
|
||||
//!
|
||||
//! Symbol = `namespace::Class::method` via recursive `build_blocks`:
|
||||
//!
|
||||
//! - `namespace_definition` (named) → push namespace name, recurse into body.
|
||||
//! - Anonymous namespace (`namespace { ... }`) → push `<anonymous>`, recurse.
|
||||
//! - `nested_namespace_specifier` (`outer::inner`) → push all segments, recurse.
|
||||
//! - `class_specifier` / `struct_specifier` (named) → emit class unit + recurse
|
||||
//! into body with class name pushed.
|
||||
//! - `function_definition` → emit method/function unit. Symbol is built from
|
||||
//! the prefix chain + the extracted declarator name component.
|
||||
//! - Out-of-class method def (`void Foo::bar() {}`) — the declarator's inner
|
||||
//! node is a `qualified_identifier`; its scope chain is prepended to the
|
||||
//! current prefix to form the full symbol.
|
||||
//! - `template_declaration` → recurse into named children with same prefix;
|
||||
//! the inner function/class body is matched by its own arm. Template params
|
||||
//! are NOT included in the symbol.
|
||||
//! - `enum_specifier` (named) → emit type unit.
|
||||
//! - `concept_definition` (C++20) → emit type unit.
|
||||
//! - `linkage_specification` (extern "C") → recurse into body with same prefix.
|
||||
//!
|
||||
//! ## Constructor / destructor / operator overload
|
||||
//!
|
||||
//! - Constructor: `function_declarator > identifier` matching the class name.
|
||||
//! Symbol = `Class::Class` (name duplicated, same convention as Java).
|
||||
//! - Destructor: `function_declarator > destructor_name`. Symbol = `Class::~Foo`.
|
||||
//! - Operator overload: `function_declarator > operator_name`. Symbol = `Class::operator+`.
|
||||
//! - Conversion operator: `function_definition.declarator` is `operator_cast`.
|
||||
//! Symbol = `Class::operator <type>` (e.g. `Class::operator bool`).
|
||||
//!
|
||||
//! ## Glue
|
||||
//!
|
||||
//! Everything not in the unit list collapses into a single `<top-level>` glue
|
||||
//! chunk (preproc, declarations, using, typedef, etc.). If the file produces
|
||||
//! zero units AND zero glue, the `<module>` post-pass emits one unit covering
|
||||
//! the whole file.
|
||||
//!
|
||||
//! Per design §3.4 / §9.1 / §9 versioning.
|
||||
|
||||
use anyhow::Result;
|
||||
use kebab_core::{
|
||||
Block, CanonicalDocument, CodeBlock, CommonBlock, Extractor, Lang, MediaType, Metadata,
|
||||
ParserVersion, Provenance, ProvenanceEvent, ProvenanceKind, SourceSpan, SourceType, TrustLevel,
|
||||
id_for_block, id_for_doc,
|
||||
};
|
||||
use serde_json::Map;
|
||||
use time::OffsetDateTime;
|
||||
|
||||
use crate::scaffold::{filename_from_workspace_path, strip_extension};
|
||||
|
||||
pub const PARSER_VERSION: &str = "code-cpp-v1";
|
||||
|
||||
/// C++ AST extractor. Per-unit blocks via tree-sitter-cpp 0.23.4
|
||||
/// (`LANGUAGE: LanguageFn`) parsed by tree-sitter 0.26.
|
||||
pub struct CppAstExtractor;
|
||||
|
||||
impl CppAstExtractor {
|
||||
pub fn new() -> Self {
|
||||
Self
|
||||
}
|
||||
}
|
||||
|
||||
impl Default for CppAstExtractor {
|
||||
fn default() -> Self {
|
||||
Self::new()
|
||||
}
|
||||
}
|
||||
|
||||
impl Extractor for CppAstExtractor {
|
||||
fn supports(&self, m: &MediaType) -> bool {
|
||||
matches!(m, MediaType::Code(l) if l == "cpp")
|
||||
}
|
||||
|
||||
fn parser_version(&self) -> ParserVersion {
|
||||
ParserVersion(PARSER_VERSION.to_string())
|
||||
}
|
||||
|
||||
fn extract(
|
||||
&self,
|
||||
ctx: &kebab_core::ExtractContext<'_>,
|
||||
bytes: &[u8],
|
||||
) -> Result<CanonicalDocument> {
|
||||
let asset = ctx.asset;
|
||||
if !self.supports(&asset.media_type) {
|
||||
anyhow::bail!(
|
||||
"kebab-parse-code: unsupported media_type for CppAstExtractor: {:?}",
|
||||
asset.media_type
|
||||
);
|
||||
}
|
||||
|
||||
let parser_version = self.parser_version();
|
||||
let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, &parser_version);
|
||||
|
||||
let source = String::from_utf8(bytes.to_vec()).map_err(|e| {
|
||||
anyhow::anyhow!("kebab-parse-code: C++ source is not valid UTF-8: {e}")
|
||||
})?;
|
||||
|
||||
let blocks = build_blocks_top(&source, &doc_id)?;
|
||||
let unit_count = blocks.len() as u32;
|
||||
|
||||
let now = OffsetDateTime::now_utc();
|
||||
let mut events: Vec<ProvenanceEvent> = Vec::with_capacity(2);
|
||||
events.push(ProvenanceEvent {
|
||||
at: asset.discovered_at,
|
||||
agent: "kb-source-fs".to_string(),
|
||||
kind: ProvenanceKind::Discovered,
|
||||
note: None,
|
||||
});
|
||||
events.push(ProvenanceEvent {
|
||||
at: now,
|
||||
agent: "kb-parse-code".to_string(),
|
||||
kind: ProvenanceKind::Parsed,
|
||||
note: Some(format!(
|
||||
"parser_version={}; unit_count={}",
|
||||
parser_version.0, unit_count
|
||||
)),
|
||||
});
|
||||
|
||||
let title = {
|
||||
let fname = filename_from_workspace_path(&asset.workspace_path.0);
|
||||
strip_extension(&fname)
|
||||
};
|
||||
|
||||
let abs_path = match &asset.source_uri {
|
||||
kebab_core::SourceUri::File(p) => {
|
||||
if p.is_absolute() {
|
||||
p.clone()
|
||||
} else {
|
||||
ctx.workspace_root.join(p)
|
||||
}
|
||||
}
|
||||
kebab_core::SourceUri::Kb(_) => ctx.workspace_root.to_path_buf(),
|
||||
};
|
||||
let (repo, git_branch, git_commit) = match crate::repo::detect_repo(&abs_path) {
|
||||
Some(r) => (Some(r.name), r.branch, r.commit),
|
||||
None => (None, None, None),
|
||||
};
|
||||
|
||||
let metadata = Metadata {
|
||||
aliases: Vec::new(),
|
||||
tags: Vec::new(),
|
||||
created_at: asset.discovered_at,
|
||||
updated_at: asset.discovered_at,
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Map::new(),
|
||||
repo,
|
||||
git_branch,
|
||||
git_commit,
|
||||
code_lang: Some("cpp".to_string()),
|
||||
};
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-parse-code",
|
||||
"extracted C++ doc_id={} workspace_path={} units={}",
|
||||
doc_id.0,
|
||||
asset.workspace_path.0,
|
||||
unit_count
|
||||
);
|
||||
|
||||
Ok(CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: asset.asset_id.clone(),
|
||||
workspace_path: asset.workspace_path.clone(),
|
||||
title,
|
||||
lang: Lang("und".to_string()),
|
||||
blocks,
|
||||
metadata,
|
||||
provenance: Provenance { events },
|
||||
parser_version,
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Core block-building logic
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
/// Top-level entry: parse source, walk the `translation_unit` root, assemble
|
||||
/// units + glue, apply the `<module>` post-pass, and emit `Block::Code`s.
|
||||
fn build_blocks_top(
|
||||
source: &str,
|
||||
doc_id: &kebab_core::DocumentId,
|
||||
) -> anyhow::Result<Vec<kebab_core::Block>> {
|
||||
let mut parser = tree_sitter::Parser::new();
|
||||
parser
|
||||
.set_language(&tree_sitter_cpp::LANGUAGE.into())
|
||||
.map_err(|e| anyhow::anyhow!("set tree-sitter-cpp language: {e}"))?;
|
||||
let tree = parser
|
||||
.parse(source.as_bytes(), None)
|
||||
.ok_or_else(|| anyhow::anyhow!("tree-sitter failed to parse C++ source"))?;
|
||||
let lines: Vec<&str> = source.split('\n').collect();
|
||||
let root = tree.root_node();
|
||||
|
||||
// units: (symbol, line_start, line_end, is_real_semantic_unit).
|
||||
// Glue is accumulated as (start, end) pairs and flushed into one
|
||||
// "<top-level>" block (or "<module>" if no real unit exists).
|
||||
let mut units: Vec<(String, u32, u32, bool)> = Vec::new();
|
||||
let mut glue: Vec<(u32, u32)> = Vec::new();
|
||||
|
||||
build_blocks(root, source, &[], &mut units, &mut glue);
|
||||
flush_glue(&mut glue, &mut units);
|
||||
|
||||
// Post-pass: if the file has no real semantic unit (only glue, or
|
||||
// completely empty), rename the single glue unit to "<module>".
|
||||
// If there are zero units AND zero glue, synthesize a one-line
|
||||
// "<module>" covering the whole file.
|
||||
let has_real_unit = units.iter().any(|(_, _, _, is_real)| *is_real);
|
||||
|
||||
if units.is_empty() {
|
||||
let total = lines.len() as u32;
|
||||
units.push(("<module>".to_string(), 1, total.max(1), false));
|
||||
}
|
||||
if !has_real_unit {
|
||||
for (sym, _, _, _) in units.iter_mut() {
|
||||
if sym == "<top-level>" {
|
||||
*sym = "<module>".to_string();
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let total_lines = lines.len() as u32;
|
||||
let mut blocks = Vec::with_capacity(units.len());
|
||||
for (ordinal, (symbol, ls, le, _is_real)) in units.into_iter().enumerate() {
|
||||
let line_start = ls.max(1);
|
||||
let line_end = le.min(total_lines.max(1));
|
||||
let span = SourceSpan::Code {
|
||||
line_start,
|
||||
line_end,
|
||||
symbol: Some(symbol),
|
||||
lang: Some("cpp".to_string()),
|
||||
};
|
||||
let block_id = id_for_block(doc_id, "code", &[], ordinal as u32, &span);
|
||||
let code = lines[(line_start as usize - 1)..=(line_end as usize - 1)].join("\n");
|
||||
blocks.push(Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id,
|
||||
heading_path: Vec::new(),
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some("cpp".to_string()),
|
||||
code,
|
||||
}));
|
||||
}
|
||||
Ok(blocks)
|
||||
}
|
||||
|
||||
/// Walk preceding `comment` siblings to extend the unit's line range upward,
|
||||
/// folding leading doc / line comments into the unit (1B pattern).
|
||||
fn unit_start(n: &tree_sitter::Node) -> u32 {
|
||||
let mut start = n.start_position().row as u32 + 1;
|
||||
let mut prev = n.prev_sibling();
|
||||
while let Some(p) = prev {
|
||||
if p.kind() == "comment" {
|
||||
start = p.start_position().row as u32 + 1;
|
||||
prev = p.prev_sibling();
|
||||
} else {
|
||||
break;
|
||||
}
|
||||
}
|
||||
start
|
||||
}
|
||||
|
||||
fn flush_glue(glue: &mut Vec<(u32, u32)>, units: &mut Vec<(String, u32, u32, bool)>) {
|
||||
if glue.is_empty() {
|
||||
return;
|
||||
}
|
||||
let s = glue.iter().map(|(a, _)| *a).min().unwrap();
|
||||
let e = glue.iter().map(|(_, b)| *b).max().unwrap();
|
||||
units.push(("<top-level>".to_string(), s, e, false));
|
||||
glue.clear();
|
||||
}
|
||||
|
||||
/// Walk a scope node (translation_unit, declaration_list, field_declaration_list)
|
||||
/// emitting unit + glue blocks. `prefix` is the current namespace/class chain
|
||||
/// (e.g. `["kebab", "Chunk", "Foo"]`).
|
||||
///
|
||||
/// After returning, any pending glue in `glue` is NOT flushed — callers
|
||||
/// responsible for flushing at the scope boundary (top-level flush in
|
||||
/// `build_blocks_top`). Within recursive scope bodies (namespace/class) we
|
||||
/// do flush before returning so that glue doesn't leak across scopes.
|
||||
fn build_blocks(
|
||||
node: tree_sitter::Node,
|
||||
source: &str,
|
||||
prefix: &[String],
|
||||
units: &mut Vec<(String, u32, u32, bool)>,
|
||||
glue: &mut Vec<(u32, u32)>,
|
||||
) {
|
||||
let mut cur = node.walk();
|
||||
for child in node.named_children(&mut cur) {
|
||||
let s = unit_start(&child);
|
||||
let e = child.end_position().row as u32 + 1;
|
||||
|
||||
match child.kind() {
|
||||
"namespace_definition" => {
|
||||
// Flush pending glue before starting this namespace block.
|
||||
flush_glue(glue, units);
|
||||
|
||||
let name_node = child.child_by_field_name("name");
|
||||
let body = child
|
||||
.child_by_field_name("body")
|
||||
.unwrap_or(child);
|
||||
|
||||
match name_node {
|
||||
None => {
|
||||
// Anonymous namespace: push "<anonymous>", recurse.
|
||||
let mut new_prefix = prefix.to_vec();
|
||||
new_prefix.push("<anonymous>".to_string());
|
||||
build_blocks(body, source, &new_prefix, units, glue);
|
||||
flush_glue(glue, units);
|
||||
}
|
||||
Some(nn) => match nn.kind() {
|
||||
"namespace_identifier" => {
|
||||
let name = &source[nn.start_byte()..nn.end_byte()];
|
||||
let mut new_prefix = prefix.to_vec();
|
||||
new_prefix.push(name.to_string());
|
||||
build_blocks(body, source, &new_prefix, units, glue);
|
||||
flush_glue(glue, units);
|
||||
}
|
||||
"nested_namespace_specifier" => {
|
||||
// e.g. `namespace outer::inner { ... }`
|
||||
// All named children are namespace_identifier nodes.
|
||||
let mut new_prefix = prefix.to_vec();
|
||||
let mut nc = nn.walk();
|
||||
for seg in nn.named_children(&mut nc) {
|
||||
new_prefix.push(source[seg.start_byte()..seg.end_byte()].to_string());
|
||||
}
|
||||
build_blocks(body, source, &new_prefix, units, glue);
|
||||
flush_glue(glue, units);
|
||||
}
|
||||
_ => {
|
||||
// Unknown name kind — treat entire namespace as glue.
|
||||
glue.push((s, e));
|
||||
}
|
||||
},
|
||||
}
|
||||
}
|
||||
|
||||
"class_specifier" | "struct_specifier" => {
|
||||
let name_node = child.child_by_field_name("name");
|
||||
let Some(nn) = name_node else {
|
||||
// Anonymous class/struct — glue.
|
||||
glue.push((s, e));
|
||||
continue;
|
||||
};
|
||||
let name = match nn.kind() {
|
||||
"type_identifier" => &source[nn.start_byte()..nn.end_byte()],
|
||||
_ => {
|
||||
// template_type or qualified_identifier — use full text
|
||||
// as the symbol segment (includes template args).
|
||||
&source[nn.start_byte()..nn.end_byte()]
|
||||
}
|
||||
};
|
||||
|
||||
flush_glue(glue, units);
|
||||
let sym = build_symbol(prefix, &[name]);
|
||||
units.push((sym, s, e, true));
|
||||
|
||||
if let Some(body) = child.child_by_field_name("body") {
|
||||
let mut new_prefix = prefix.to_vec();
|
||||
new_prefix.push(name.to_string());
|
||||
build_blocks(body, source, &new_prefix, units, glue);
|
||||
flush_glue(glue, units);
|
||||
}
|
||||
}
|
||||
|
||||
"function_definition" => {
|
||||
let decl = child.child_by_field_name("declarator");
|
||||
let Some(decl_node) = decl else {
|
||||
glue.push((s, e));
|
||||
continue;
|
||||
};
|
||||
|
||||
match extract_fn_symbol(decl_node, source, prefix) {
|
||||
Some(sym) => {
|
||||
flush_glue(glue, units);
|
||||
units.push((sym, s, e, true));
|
||||
}
|
||||
None => {
|
||||
glue.push((s, e));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
"template_declaration" => {
|
||||
// Unwrap: recurse into named children with same prefix.
|
||||
// The inner function/class/concept will be matched by their own
|
||||
// arms. template_parameter_list is not a unit; it will fall
|
||||
// through to glue (it's not a named child of the template_declaration
|
||||
// that matches any of our arms).
|
||||
build_blocks(child, source, prefix, units, glue);
|
||||
// Do NOT flush glue here — template body may be part of a glue group.
|
||||
}
|
||||
|
||||
"enum_specifier" => {
|
||||
if let Some(nn) = child.child_by_field_name("name") {
|
||||
let name = &source[nn.start_byte()..nn.end_byte()];
|
||||
flush_glue(glue, units);
|
||||
let sym = build_symbol(prefix, &[name]);
|
||||
units.push((sym, s, e, true));
|
||||
} else {
|
||||
// Anonymous enum — glue.
|
||||
glue.push((s, e));
|
||||
}
|
||||
}
|
||||
|
||||
"concept_definition" => {
|
||||
// C++20. Has required "name" field (identifier).
|
||||
if let Some(nn) = child.child_by_field_name("name") {
|
||||
let name = &source[nn.start_byte()..nn.end_byte()];
|
||||
flush_glue(glue, units);
|
||||
let sym = build_symbol(prefix, &[name]);
|
||||
units.push((sym, s, e, true));
|
||||
} else {
|
||||
glue.push((s, e));
|
||||
}
|
||||
}
|
||||
|
||||
"linkage_specification" => {
|
||||
// extern "C" { ... } — glue-wrapper, but recurse into body
|
||||
// with same prefix so inner definitions are extracted.
|
||||
let body = child.child_by_field_name("body").unwrap_or(child);
|
||||
// The linkage_spec itself is glue; inner defs handled by recursion.
|
||||
// Don't emit the wrapper as a unit; but also don't push it as glue
|
||||
// since recursion will push its inner children individually.
|
||||
build_blocks(body, source, prefix, units, glue);
|
||||
}
|
||||
|
||||
// Everything else: preproc, declarations, using, typedef, etc.
|
||||
_ => {
|
||||
glue.push((s, e));
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Join prefix + extras into a `::` separated symbol.
|
||||
fn build_symbol(prefix: &[String], extras: &[&str]) -> String {
|
||||
let mut parts: Vec<&str> = prefix.iter().map(String::as_str).collect();
|
||||
parts.extend_from_slice(extras);
|
||||
parts.join("::")
|
||||
}
|
||||
|
||||
/// Extract the symbol for a `function_definition` given its top-level
|
||||
/// `declarator` node. Returns `None` if the name cannot be determined.
|
||||
///
|
||||
/// The declarator chain may be:
|
||||
/// - `function_declarator` (plain fn or method)
|
||||
/// - `pointer_declarator` wrapping `function_declarator` (fn returning pointer)
|
||||
/// - `reference_declarator` wrapping `function_declarator` (fn returning ref)
|
||||
/// - `operator_cast` (conversion operator — e.g. `operator bool`)
|
||||
///
|
||||
/// The inner `function_declarator.declarator` is one of:
|
||||
/// - `identifier` → free fn or constructor, symbol = `prefix::name`
|
||||
/// - `field_identifier` → method in class body, symbol = `prefix::name`
|
||||
/// - `destructor_name` → `~Foo`, symbol = `prefix::~Foo`
|
||||
/// - `operator_name` → `operator+` etc., symbol = `prefix::operator+`
|
||||
/// - `qualified_identifier` → out-of-class def `Foo::bar` or `ns::Foo::bar`;
|
||||
/// the scope chain is extracted and prepended to prefix.
|
||||
///
|
||||
/// For `qualified_identifier`, the scope hierarchy (which may itself be a
|
||||
/// `qualified_identifier`) is flattened into a list of segments. These
|
||||
/// segments REPLACE the current prefix (since out-of-class defs carry their
|
||||
/// full scope explicitly). Example: `void ns::Foo::bar() {}` at top level
|
||||
/// with prefix=[] → segments=[ns, Foo, bar] → symbol = `ns::Foo::bar`.
|
||||
fn extract_fn_symbol(
|
||||
decl_node: tree_sitter::Node,
|
||||
source: &str,
|
||||
prefix: &[String],
|
||||
) -> Option<String> {
|
||||
// Walk down pointer/reference wrapper layers to reach the
|
||||
// function_declarator (or operator_cast at definition level).
|
||||
let fn_decl = unwrap_to_fn_declarator(decl_node, source)?;
|
||||
|
||||
match fn_decl.kind() {
|
||||
"operator_cast" => {
|
||||
// e.g. `operator bool() const` — the function_definition.declarator
|
||||
// IS the operator_cast (no function_declarator wrapper).
|
||||
// Symbol = `prefix::operator <type>`.
|
||||
let type_node = fn_decl.child_by_field_name("type")?;
|
||||
let type_text = &source[type_node.start_byte()..type_node.end_byte()];
|
||||
Some(build_symbol(prefix, &[&format!("operator {type_text}")]))
|
||||
}
|
||||
"function_declarator" => {
|
||||
let inner = fn_decl.child_by_field_name("declarator")?;
|
||||
extract_name_node(inner, source, prefix)
|
||||
}
|
||||
_ => None,
|
||||
}
|
||||
}
|
||||
|
||||
/// Walk pointer_declarator / reference_declarator chains down to the
|
||||
/// first `function_declarator` or `operator_cast` node.
|
||||
///
|
||||
/// Returns `None` if no such node is found (e.g. a function definition
|
||||
/// whose declarator is malformed or unknown).
|
||||
fn unwrap_to_fn_declarator<'a>(
|
||||
mut node: tree_sitter::Node<'a>,
|
||||
_source: &str,
|
||||
) -> Option<tree_sitter::Node<'a>> {
|
||||
loop {
|
||||
match node.kind() {
|
||||
"function_declarator" | "operator_cast" => return Some(node),
|
||||
"pointer_declarator" => {
|
||||
node = node.child_by_field_name("declarator")?;
|
||||
}
|
||||
"reference_declarator" | "rvalue_reference_declarator" => {
|
||||
// reference_declarator has no `declarator` field; its child
|
||||
// is in the unnamed children list.
|
||||
let mut walker = node.walk();
|
||||
node = node.named_children(&mut walker).next()?;
|
||||
}
|
||||
_ => return None,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Given the innermost name node of a function_declarator, produce the symbol.
|
||||
fn extract_name_node(
|
||||
inner: tree_sitter::Node,
|
||||
source: &str,
|
||||
prefix: &[String],
|
||||
) -> Option<String> {
|
||||
match inner.kind() {
|
||||
"identifier" | "field_identifier" => {
|
||||
let name = &source[inner.start_byte()..inner.end_byte()];
|
||||
Some(build_symbol(prefix, &[name]))
|
||||
}
|
||||
"destructor_name" => {
|
||||
// destructor_name text includes the `~` prefix (e.g. "~Foo").
|
||||
let full = &source[inner.start_byte()..inner.end_byte()];
|
||||
Some(build_symbol(prefix, &[full]))
|
||||
}
|
||||
"operator_name" => {
|
||||
// Full text e.g. "operator+", "operator->", "operator()".
|
||||
let full = &source[inner.start_byte()..inner.end_byte()];
|
||||
Some(build_symbol(prefix, &[full]))
|
||||
}
|
||||
"template_function" | "template_method" => {
|
||||
// Template function like `foo<int>()`. Use the `name` field
|
||||
// (the identifier / field_identifier before `<`).
|
||||
let name_node = inner.child_by_field_name("name")?;
|
||||
let name = &source[name_node.start_byte()..name_node.end_byte()];
|
||||
Some(build_symbol(prefix, &[name]))
|
||||
}
|
||||
"qualified_identifier" => {
|
||||
// Out-of-class method definition. Flatten the nested
|
||||
// qualified_identifier chain into ordered segments.
|
||||
// Example: `ns::Foo::method`
|
||||
// qualified_identifier {
|
||||
// scope: namespace_identifier "ns"
|
||||
// name: qualified_identifier {
|
||||
// scope: namespace_identifier "Foo"
|
||||
// name: identifier "method"
|
||||
// }
|
||||
// }
|
||||
// → ["ns", "Foo", "method"]
|
||||
//
|
||||
// These segments are combined with the current prefix so that a
|
||||
// top-level out-of-class def `void Foo::bar() {}` inside a
|
||||
// namespace body with prefix=["ns"] produces `ns::Foo::bar`.
|
||||
let mut segments: Vec<String> = Vec::new();
|
||||
flatten_qualified_id(inner, source, &mut segments);
|
||||
if segments.is_empty() {
|
||||
return None;
|
||||
}
|
||||
// Build: prefix + all segments (scope chain + leaf).
|
||||
let mut all: Vec<&str> = prefix.iter().map(String::as_str).collect();
|
||||
for seg in &segments {
|
||||
all.push(seg.as_str());
|
||||
}
|
||||
Some(all.join("::"))
|
||||
}
|
||||
_ => None,
|
||||
}
|
||||
}
|
||||
|
||||
/// Recursively flatten a `qualified_identifier` node into ordered string
|
||||
/// segments. For `ns::Foo::method` this produces `["ns", "Foo", "method"]`.
|
||||
fn flatten_qualified_id(node: tree_sitter::Node, source: &str, out: &mut Vec<String>) {
|
||||
// A qualified_identifier has:
|
||||
// scope: namespace_identifier | (None for global-scope `::foo`)
|
||||
// name: identifier | field_identifier | destructor_name |
|
||||
// operator_name | qualified_identifier | template_function |
|
||||
// template_method | ...
|
||||
let scope_node = node.child_by_field_name("scope");
|
||||
let name_node = node.child_by_field_name("name");
|
||||
|
||||
if let Some(s) = scope_node {
|
||||
out.push(source[s.start_byte()..s.end_byte()].to_string());
|
||||
}
|
||||
|
||||
match name_node {
|
||||
Some(n) if n.kind() == "qualified_identifier" => {
|
||||
// Recurse: more nesting.
|
||||
flatten_qualified_id(n, source, out);
|
||||
}
|
||||
Some(n) => {
|
||||
// Leaf name — push its text.
|
||||
out.push(source[n.start_byte()..n.end_byte()].to_string());
|
||||
}
|
||||
None => {}
|
||||
}
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Tests
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
#[cfg(test)]
|
||||
pub(crate) mod tests_support {
|
||||
use kebab_core::*;
|
||||
use std::path::PathBuf;
|
||||
use time::OffsetDateTime;
|
||||
|
||||
pub fn fixed_code_asset(workspace_path: &str, lang: &str) -> RawAsset {
|
||||
RawAsset {
|
||||
asset_id: AssetId("a".repeat(64)),
|
||||
source_uri: SourceUri::File(PathBuf::from(workspace_path)),
|
||||
workspace_path: WorkspacePath(workspace_path.to_string()),
|
||||
media_type: MediaType::Code(lang.to_string()),
|
||||
byte_len: 0,
|
||||
checksum: Checksum("b".repeat(64)),
|
||||
discovered_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
stored: AssetStorage::Reference {
|
||||
path: PathBuf::from(workspace_path),
|
||||
sha: Checksum("b".repeat(64)),
|
||||
},
|
||||
}
|
||||
}
|
||||
|
||||
pub fn extract_cpp(src: &str, path: &str) -> kebab_core::CanonicalDocument {
|
||||
use super::CppAstExtractor;
|
||||
use kebab_core::Extractor;
|
||||
let asset = fixed_code_asset(path, "cpp");
|
||||
let cfg = ExtractConfig::default();
|
||||
let root = PathBuf::from("/tmp");
|
||||
let ctx = ExtractContext {
|
||||
asset: &asset,
|
||||
workspace_root: &root,
|
||||
config: &cfg,
|
||||
};
|
||||
CppAstExtractor::new().extract(&ctx, src.as_bytes()).unwrap()
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use kebab_core::{Block, MediaType, SourceSpan};
|
||||
|
||||
fn syms(doc: &kebab_core::CanonicalDocument) -> Vec<String> {
|
||||
let mut s: Vec<String> = doc
|
||||
.blocks
|
||||
.iter()
|
||||
.filter_map(|b| match b {
|
||||
Block::Code(c) => match &c.common.source_span {
|
||||
SourceSpan::Code { symbol, .. } => symbol.clone(),
|
||||
_ => None,
|
||||
},
|
||||
_ => None,
|
||||
})
|
||||
.collect();
|
||||
s.sort();
|
||||
s
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn extractor_supports_only_media_code_cpp() {
|
||||
let e = CppAstExtractor::new();
|
||||
assert!(e.supports(&MediaType::Code("cpp".into())));
|
||||
assert!(!e.supports(&MediaType::Code("c".into())));
|
||||
assert!(!e.supports(&MediaType::Code("rust".into())));
|
||||
assert!(!e.supports(&MediaType::Markdown));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn free_function() {
|
||||
let src = "void foo() {}\n";
|
||||
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
|
||||
let s = syms(&doc);
|
||||
assert!(s.iter().any(|x| x == "foo"), "got {s:?}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn namespace_and_class() {
|
||||
let src = r#"
|
||||
namespace ns {
|
||||
class Foo {
|
||||
public:
|
||||
void method() {}
|
||||
Foo() {}
|
||||
~Foo() {}
|
||||
int operator+(const Foo& o) { return 0; }
|
||||
};
|
||||
}
|
||||
"#;
|
||||
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
|
||||
let s = syms(&doc);
|
||||
assert!(s.iter().any(|x| x == "ns::Foo"), "ns::Foo missing: {s:?}");
|
||||
assert!(s.iter().any(|x| x == "ns::Foo::method"), "method missing: {s:?}");
|
||||
assert!(s.iter().any(|x| x == "ns::Foo::Foo"), "ctor missing: {s:?}");
|
||||
assert!(s.iter().any(|x| x == "ns::Foo::~Foo"), "dtor missing: {s:?}");
|
||||
assert!(s.iter().any(|x| x == "ns::Foo::operator+"), "op+ missing: {s:?}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn anonymous_namespace() {
|
||||
let src = r#"
|
||||
namespace {
|
||||
void hidden_fn() {}
|
||||
}
|
||||
"#;
|
||||
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
|
||||
let s = syms(&doc);
|
||||
assert!(
|
||||
s.iter().any(|x| x == "<anonymous>::hidden_fn"),
|
||||
"anon fn missing: {s:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn nested_namespace_specifier() {
|
||||
let src = r#"
|
||||
namespace outer::inner {
|
||||
void fn_in_nested() {}
|
||||
}
|
||||
"#;
|
||||
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
|
||||
let s = syms(&doc);
|
||||
assert!(
|
||||
s.iter().any(|x| x == "outer::inner::fn_in_nested"),
|
||||
"nested ns fn missing: {s:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn out_of_class_method_def() {
|
||||
let src = r#"
|
||||
void ns::Foo::method() { }
|
||||
"#;
|
||||
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
|
||||
let s = syms(&doc);
|
||||
assert!(
|
||||
s.iter().any(|x| x == "ns::Foo::method"),
|
||||
"out-of-class method missing: {s:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn template_declaration() {
|
||||
let src = r#"
|
||||
template<typename T>
|
||||
class Bar {
|
||||
void tmpl_method() {}
|
||||
};
|
||||
|
||||
template<typename T>
|
||||
void tmpl_free_fn(T x) {}
|
||||
"#;
|
||||
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
|
||||
let s = syms(&doc);
|
||||
assert!(s.iter().any(|x| x == "Bar"), "Bar class missing: {s:?}");
|
||||
assert!(
|
||||
s.iter().any(|x| x == "Bar::tmpl_method"),
|
||||
"Bar::tmpl_method missing: {s:?}"
|
||||
);
|
||||
assert!(
|
||||
s.iter().any(|x| x == "tmpl_free_fn"),
|
||||
"tmpl_free_fn missing: {s:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn enum_and_concept() {
|
||||
let src = r#"
|
||||
enum class Color { Red, Green };
|
||||
|
||||
template<typename T>
|
||||
concept Printable = requires(T t) { t.print(); };
|
||||
"#;
|
||||
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
|
||||
let s = syms(&doc);
|
||||
assert!(s.iter().any(|x| x == "Color"), "Color missing: {s:?}");
|
||||
assert!(s.iter().any(|x| x == "Printable"), "Printable missing: {s:?}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn extern_c_block() {
|
||||
let src = r#"
|
||||
extern "C" {
|
||||
void c_fn1() {}
|
||||
void c_fn2() {}
|
||||
}
|
||||
"#;
|
||||
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
|
||||
let s = syms(&doc);
|
||||
assert!(s.iter().any(|x| x == "c_fn1"), "c_fn1 missing: {s:?}");
|
||||
assert!(s.iter().any(|x| x == "c_fn2"), "c_fn2 missing: {s:?}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn conversion_operator() {
|
||||
let src = r#"
|
||||
class Foo {
|
||||
operator bool() const { return true; }
|
||||
};
|
||||
"#;
|
||||
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
|
||||
let s = syms(&doc);
|
||||
assert!(
|
||||
s.iter().any(|x| x == "Foo::operator bool"),
|
||||
"conversion op missing: {s:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn empty_file_produces_module() {
|
||||
let src = "";
|
||||
let doc = tests_support::extract_cpp(src, "x/empty.cpp");
|
||||
let s = syms(&doc);
|
||||
assert_eq!(s, vec!["<module>"], "expected <module>: got {s:?}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn glue_only_produces_module() {
|
||||
let src = "#include <vector>\nusing namespace std;\n";
|
||||
let doc = tests_support::extract_cpp(src, "x/glue.cpp");
|
||||
let s = syms(&doc);
|
||||
assert!(s.iter().any(|x| x == "<module>"), "expected <module>: got {s:?}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn ptr_returning_function() {
|
||||
let src = "int* ptr_fn(int x) { return &x; }\n";
|
||||
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
|
||||
let s = syms(&doc);
|
||||
assert!(s.iter().any(|x| x == "ptr_fn"), "ptr_fn missing: {s:?}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn ref_returning_operator() {
|
||||
let src = r#"
|
||||
class Foo {
|
||||
Foo& operator=(const Foo& o) { return *this; }
|
||||
};
|
||||
"#;
|
||||
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
|
||||
let s = syms(&doc);
|
||||
assert!(
|
||||
s.iter().any(|x| x == "Foo::operator="),
|
||||
"operator= missing: {s:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deterministic_across_runs() {
|
||||
let src = r#"
|
||||
namespace ns {
|
||||
class Foo {
|
||||
void method() {}
|
||||
};
|
||||
}
|
||||
void free_fn() {}
|
||||
"#;
|
||||
let a = tests_support::extract_cpp(src, "x/foo.cpp");
|
||||
for _ in 0..20 {
|
||||
assert_eq!(tests_support::extract_cpp(src, "x/foo.cpp").blocks, a.blocks);
|
||||
}
|
||||
}
|
||||
}
|
||||
451
crates/kebab-parse-code/src/go.rs
Normal file
451
crates/kebab-parse-code/src/go.rs
Normal file
@@ -0,0 +1,451 @@
|
||||
//! `kebab-parse-code::go` — tree-sitter Go AST extractor (P10-1C-Go Task D).
|
||||
//!
|
||||
//! Implements [`kebab_core::Extractor`] for [`MediaType::Code("go")`].
|
||||
//! Walks the tree-sitter parse tree and emits one [`Block::Code`] per
|
||||
//! top-level AST semantic unit (free fn, method, each type spec) carrying
|
||||
//! [`SourceSpan::Code`] with the unit's self-reference symbol path
|
||||
//! (design §3.4 Go row). Glue declarations (`import` / `const` / `var`)
|
||||
//! collapse into one grouped `<top-level>` (or `<module>`) unit.
|
||||
//!
|
||||
//! Unlike the Python/TS/JS extractors which path-derive their module
|
||||
//! prefix from the workspace file path, Go's package identity comes from
|
||||
//! the source itself (the leading `package` clause) — `extract_package`
|
||||
//! reads it from the AST. If the `package_clause` is missing (invalid Go
|
||||
//! in practice) the prefix falls back to `"<unknown>"`.
|
||||
//!
|
||||
//! Doc comments immediately preceding an item are folded into that
|
||||
//! item's line range via `unit_start` (1B pattern). Go has no separate
|
||||
//! attribute/decorator AST nodes.
|
||||
//!
|
||||
//! Per design §3.4 / §9.1 / §9 versioning.
|
||||
|
||||
use anyhow::Result;
|
||||
use kebab_core::{
|
||||
Block, CanonicalDocument, CodeBlock, CommonBlock, Extractor, Lang, MediaType, Metadata,
|
||||
ParserVersion, Provenance, ProvenanceEvent, ProvenanceKind, SourceSpan, SourceType, TrustLevel,
|
||||
id_for_block, id_for_doc,
|
||||
};
|
||||
use serde_json::Map;
|
||||
use time::OffsetDateTime;
|
||||
|
||||
use crate::scaffold::{filename_from_workspace_path, join_symbol, strip_extension};
|
||||
|
||||
pub const PARSER_VERSION: &str = "code-go-v1";
|
||||
|
||||
/// Go AST extractor. Per-unit blocks via tree-sitter-go 0.25
|
||||
/// (`LANGUAGE: LanguageFn`) parsed by tree-sitter 0.26.
|
||||
pub struct GoAstExtractor;
|
||||
|
||||
impl GoAstExtractor {
|
||||
pub fn new() -> Self {
|
||||
Self
|
||||
}
|
||||
}
|
||||
|
||||
impl Default for GoAstExtractor {
|
||||
fn default() -> Self {
|
||||
Self::new()
|
||||
}
|
||||
}
|
||||
|
||||
impl Extractor for GoAstExtractor {
|
||||
fn supports(&self, m: &MediaType) -> bool {
|
||||
matches!(m, MediaType::Code(l) if l == "go")
|
||||
}
|
||||
|
||||
fn parser_version(&self) -> ParserVersion {
|
||||
ParserVersion(PARSER_VERSION.to_string())
|
||||
}
|
||||
|
||||
fn extract(
|
||||
&self,
|
||||
ctx: &kebab_core::ExtractContext<'_>,
|
||||
bytes: &[u8],
|
||||
) -> Result<CanonicalDocument> {
|
||||
let asset = ctx.asset;
|
||||
if !self.supports(&asset.media_type) {
|
||||
anyhow::bail!(
|
||||
"kebab-parse-code: unsupported media_type for GoAstExtractor: {:?}",
|
||||
asset.media_type
|
||||
);
|
||||
}
|
||||
|
||||
let parser_version = self.parser_version();
|
||||
let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, &parser_version);
|
||||
|
||||
let source = String::from_utf8(bytes.to_vec())
|
||||
.map_err(|e| anyhow::anyhow!("kebab-parse-code: Go source is not valid UTF-8: {e}"))?;
|
||||
|
||||
let blocks = build_blocks(&source, &doc_id)?;
|
||||
let unit_count = blocks.len() as u32;
|
||||
|
||||
let now = OffsetDateTime::now_utc();
|
||||
let mut events: Vec<ProvenanceEvent> = Vec::with_capacity(2);
|
||||
events.push(ProvenanceEvent {
|
||||
at: asset.discovered_at,
|
||||
agent: "kb-source-fs".to_string(),
|
||||
kind: ProvenanceKind::Discovered,
|
||||
note: None,
|
||||
});
|
||||
events.push(ProvenanceEvent {
|
||||
at: now,
|
||||
agent: "kb-parse-code".to_string(),
|
||||
kind: ProvenanceKind::Parsed,
|
||||
note: Some(format!(
|
||||
"parser_version={}; unit_count={}",
|
||||
parser_version.0, unit_count
|
||||
)),
|
||||
});
|
||||
|
||||
let title = {
|
||||
let fname = filename_from_workspace_path(&asset.workspace_path.0);
|
||||
strip_extension(&fname)
|
||||
};
|
||||
|
||||
// Resolve the file's absolute path for repo detection. If the
|
||||
// source URI carries a relative path, anchor it at the workspace
|
||||
// root so the `.git/` walk-up starts from the right place.
|
||||
let abs_path = match &asset.source_uri {
|
||||
kebab_core::SourceUri::File(p) => {
|
||||
if p.is_absolute() {
|
||||
p.clone()
|
||||
} else {
|
||||
ctx.workspace_root.join(p)
|
||||
}
|
||||
}
|
||||
kebab_core::SourceUri::Kb(_) => ctx.workspace_root.to_path_buf(),
|
||||
};
|
||||
let (repo, git_branch, git_commit) = match crate::repo::detect_repo(&abs_path) {
|
||||
Some(r) => (Some(r.name), r.branch, r.commit),
|
||||
None => (None, None, None),
|
||||
};
|
||||
|
||||
let metadata = Metadata {
|
||||
aliases: Vec::new(),
|
||||
tags: Vec::new(),
|
||||
created_at: asset.discovered_at,
|
||||
updated_at: asset.discovered_at,
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Map::new(),
|
||||
repo,
|
||||
git_branch,
|
||||
git_commit,
|
||||
code_lang: Some("go".to_string()),
|
||||
};
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-parse-code",
|
||||
"extracted Go doc_id={} workspace_path={} units={}",
|
||||
doc_id.0,
|
||||
asset.workspace_path.0,
|
||||
unit_count
|
||||
);
|
||||
|
||||
Ok(CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: asset.asset_id.clone(),
|
||||
workspace_path: asset.workspace_path.clone(),
|
||||
title,
|
||||
lang: Lang("und".to_string()),
|
||||
blocks,
|
||||
metadata,
|
||||
provenance: Provenance { events },
|
||||
parser_version,
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
/// p10-1C-Go: extract `package` declaration text from a tree-sitter-go
|
||||
/// `source_file`. Returns `None` if no `package_clause` (invalid Go in
|
||||
/// practice but defense-in-depth). Per design §3.4 Go row.
|
||||
fn extract_package(root: tree_sitter::Node, src: &str) -> Option<String> {
|
||||
let mut cur = root.walk();
|
||||
for child in root.named_children(&mut cur) {
|
||||
if child.kind() == "package_clause" {
|
||||
let mut c2 = child.walk();
|
||||
for sub in child.named_children(&mut c2) {
|
||||
if sub.kind() == "package_identifier" {
|
||||
return Some(src[sub.start_byte()..sub.end_byte()].to_string());
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
None
|
||||
}
|
||||
|
||||
fn build_blocks(
|
||||
source: &str,
|
||||
doc_id: &kebab_core::DocumentId,
|
||||
) -> anyhow::Result<Vec<kebab_core::Block>> {
|
||||
let mut parser = tree_sitter::Parser::new();
|
||||
parser
|
||||
.set_language(&tree_sitter_go::LANGUAGE.into())
|
||||
.map_err(|e| anyhow::anyhow!("set tree-sitter-go language: {e}"))?;
|
||||
let tree = parser
|
||||
.parse(source.as_bytes(), None)
|
||||
.ok_or_else(|| anyhow::anyhow!("tree-sitter failed to parse Go source"))?;
|
||||
let lines: Vec<&str> = source.split('\n').collect();
|
||||
|
||||
let root = tree.root_node();
|
||||
let mod_prefix = extract_package(root, source).unwrap_or_else(|| "<unknown>".to_string());
|
||||
|
||||
// units: (symbol, line_start, line_end, is_real_semantic_unit).
|
||||
// Glue groups are pushed with a sentinel symbol + is_real=false so a
|
||||
// post-pass can decide `<module>` vs `<top-level>` (1B post-pass
|
||||
// mirror).
|
||||
let mut units: Vec<(String, u32, u32, bool)> = Vec::new();
|
||||
// (is_import 0/1, s, e). `is_import` flags `import_declaration` —
|
||||
// used by the glue flush to pick `<module>` vs `<top-level>`
|
||||
// provisional label.
|
||||
let mut glue: Vec<(usize, u32, u32)> = Vec::new();
|
||||
|
||||
fn node_name_text<'a>(n: &tree_sitter::Node, src: &'a str) -> Option<&'a str> {
|
||||
n.child_by_field_name("name")
|
||||
.map(|c| &src[c.start_byte()..c.end_byte()])
|
||||
}
|
||||
/// Walk preceding `comment` siblings to extend the unit's line range
|
||||
/// upward, folding leading doc / line comments into the unit. Go has
|
||||
/// no decorator/attribute nodes — doc comments are simply preceding
|
||||
/// `comment` siblings (the 1B pattern).
|
||||
fn unit_start(n: &tree_sitter::Node) -> u32 {
|
||||
let mut start = n.start_position().row as u32 + 1;
|
||||
let mut prev = n.prev_sibling();
|
||||
while let Some(p) = prev {
|
||||
if p.kind() == "comment" {
|
||||
start = p.start_position().row as u32 + 1;
|
||||
prev = p.prev_sibling();
|
||||
} else {
|
||||
break;
|
||||
}
|
||||
}
|
||||
start
|
||||
}
|
||||
|
||||
/// Extract the receiver type text for a `method_declaration`. The
|
||||
/// returned slice INCLUDES the leading `*` for pointer receivers
|
||||
/// (`(*Foo).Bar`) per design §3.4 Go row example. Returns `None` if
|
||||
/// the receiver is malformed (defense in depth).
|
||||
fn receiver_type_text<'a>(method_node: &tree_sitter::Node, src: &'a str) -> Option<&'a str> {
|
||||
let recv = method_node.child_by_field_name("receiver")?;
|
||||
let mut cw = recv.walk();
|
||||
for p in recv.named_children(&mut cw) {
|
||||
if p.kind() == "parameter_declaration" {
|
||||
if let Some(ty) = p.child_by_field_name("type") {
|
||||
return Some(&src[ty.start_byte()..ty.end_byte()]);
|
||||
}
|
||||
}
|
||||
}
|
||||
None
|
||||
}
|
||||
|
||||
let mut cur = root.walk();
|
||||
for child in root.named_children(&mut cur) {
|
||||
let s = unit_start(&child);
|
||||
let e = child.end_position().row as u32 + 1;
|
||||
match child.kind() {
|
||||
"function_declaration" => {
|
||||
if let Some(name) = node_name_text(&child, source) {
|
||||
glue.retain(|(_, gs, _)| *gs < s);
|
||||
flush_glue(&mut glue, &mut units, &mod_prefix);
|
||||
let sym = join_symbol(&mod_prefix, &[], name);
|
||||
units.push((sym, s, e, true));
|
||||
}
|
||||
}
|
||||
"method_declaration" => {
|
||||
if let Some(name_node) = child.child_by_field_name("name") {
|
||||
glue.retain(|(_, gs, _)| *gs < s);
|
||||
flush_glue(&mut glue, &mut units, &mod_prefix);
|
||||
let owner = receiver_type_text(&child, source).unwrap_or("<unknown>");
|
||||
let method_name = &source[name_node.start_byte()..name_node.end_byte()];
|
||||
let sym = format!("{mod_prefix}.({owner}).{method_name}");
|
||||
units.push((sym, s, e, true));
|
||||
}
|
||||
}
|
||||
"type_declaration" => {
|
||||
// One unit per inner `type_spec`. Each type_spec gets
|
||||
// the type_declaration's whole upward-folded `s` range
|
||||
// start so doc comments are attached to the first spec;
|
||||
// subsequent specs use their own start. Match 1B
|
||||
// pattern: keep the outer `s` only when there's a single
|
||||
// spec; otherwise use the spec's own start.
|
||||
let mut tcur = child.walk();
|
||||
let specs: Vec<tree_sitter::Node> = child
|
||||
.named_children(&mut tcur)
|
||||
.filter(|c| c.kind() == "type_spec")
|
||||
.collect();
|
||||
let single = specs.len() == 1;
|
||||
for spec in specs {
|
||||
let name_node = match spec.child_by_field_name("name") {
|
||||
Some(n) => n,
|
||||
None => continue,
|
||||
};
|
||||
let spec_s = if single {
|
||||
s
|
||||
} else {
|
||||
spec.start_position().row as u32 + 1
|
||||
};
|
||||
let spec_e = spec.end_position().row as u32 + 1;
|
||||
glue.retain(|(_, gs, _)| *gs < spec_s);
|
||||
flush_glue(&mut glue, &mut units, &mod_prefix);
|
||||
let name = &source[name_node.start_byte()..name_node.end_byte()];
|
||||
let sym = join_symbol(&mod_prefix, &[], name);
|
||||
units.push((sym, spec_s, spec_e, true));
|
||||
}
|
||||
}
|
||||
"import_declaration" => {
|
||||
glue.push((1, s, e));
|
||||
}
|
||||
"const_declaration" | "var_declaration" => {
|
||||
glue.push((0, s, e));
|
||||
}
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
flush_glue(&mut glue, &mut units, &mod_prefix);
|
||||
|
||||
// `<module>` is correct only when the file produced no real unit.
|
||||
// Otherwise the import/const/var-only group becomes `<top-level>`
|
||||
// (same post-pass as 1B). Match on the suffix so the demotion stays
|
||||
// mod-prefix-agnostic.
|
||||
let has_real_unit = units.iter().any(|(_, _, _, is_real)| *is_real);
|
||||
if has_real_unit {
|
||||
for (sym, _, _, is_real) in units.iter_mut() {
|
||||
if !*is_real && sym.ends_with("<module>") {
|
||||
let pre = &sym[..sym.len() - "<module>".len()];
|
||||
*sym = format!("{pre}<top-level>");
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let total_lines = lines.len() as u32;
|
||||
let mut blocks = Vec::with_capacity(units.len());
|
||||
for (ordinal, (symbol, ls, le, _is_real)) in units.into_iter().enumerate() {
|
||||
let line_start = ls.max(1);
|
||||
let line_end = le.min(total_lines.max(1));
|
||||
let span = SourceSpan::Code {
|
||||
line_start,
|
||||
line_end,
|
||||
symbol: Some(symbol),
|
||||
lang: Some("go".to_string()),
|
||||
};
|
||||
let block_id = id_for_block(doc_id, "code", &[], ordinal as u32, &span);
|
||||
let code = lines[(line_start as usize - 1)..=(line_end as usize - 1)].join("\n");
|
||||
blocks.push(Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id,
|
||||
heading_path: Vec::new(),
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some("go".to_string()),
|
||||
code,
|
||||
}));
|
||||
}
|
||||
Ok(blocks)
|
||||
}
|
||||
|
||||
fn flush_glue(
|
||||
glue: &mut Vec<(usize, u32, u32)>,
|
||||
units: &mut Vec<(String, u32, u32, bool)>,
|
||||
mod_prefix: &str,
|
||||
) {
|
||||
if glue.is_empty() {
|
||||
return;
|
||||
}
|
||||
let s = glue.iter().map(|(_, a, _)| *a).min().unwrap();
|
||||
let e = glue.iter().map(|(_, _, b)| *b).max().unwrap();
|
||||
// Provisional label: `<module>` only if the group is exclusively
|
||||
// imports (1A's `only_mod_decls` analog). The post-pass demotes any
|
||||
// `<module>` to `<top-level>` if the file produced any real unit.
|
||||
let only_imports = glue.iter().all(|(is_import, _, _)| *is_import == 1);
|
||||
let label = if only_imports { "<module>" } else { "<top-level>" };
|
||||
units.push((join_symbol(mod_prefix, &[], label), s, e, false));
|
||||
glue.clear();
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use kebab_core::{Block, MediaType, SourceSpan};
|
||||
|
||||
fn extract_fixture() -> kebab_core::CanonicalDocument {
|
||||
let bytes = std::fs::read(concat!(
|
||||
env!("CARGO_MANIFEST_DIR"),
|
||||
"/tests/fixtures/sample.go"
|
||||
))
|
||||
.unwrap();
|
||||
// Reuse the cross-language test-support helper promoted in 1B.
|
||||
let asset = crate::rust::tests_support::fixed_code_asset("crates/x/src/sample.go", "go");
|
||||
let cfg = kebab_core::ExtractConfig::default();
|
||||
let root = std::path::PathBuf::from("/tmp");
|
||||
let ctx = kebab_core::ExtractContext {
|
||||
asset: &asset,
|
||||
workspace_root: &root,
|
||||
config: &cfg,
|
||||
};
|
||||
GoAstExtractor::new().extract(&ctx, &bytes).unwrap()
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn extractor_supports_only_media_code_go() {
|
||||
let e = GoAstExtractor::new();
|
||||
assert!(e.supports(&MediaType::Code("go".into())));
|
||||
assert!(!e.supports(&MediaType::Code("rust".into())));
|
||||
assert!(!e.supports(&MediaType::Markdown));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn go_units_match_design_3_4_symbols() {
|
||||
let doc = extract_fixture();
|
||||
let mut syms: Vec<String> = doc
|
||||
.blocks
|
||||
.iter()
|
||||
.filter_map(|b| match b {
|
||||
Block::Code(c) => match &c.common.source_span {
|
||||
SourceSpan::Code { symbol, lang, .. } => {
|
||||
assert_eq!(lang.as_deref(), Some("go"));
|
||||
symbol.clone()
|
||||
}
|
||||
_ => None,
|
||||
},
|
||||
_ => None,
|
||||
})
|
||||
.collect();
|
||||
syms.sort();
|
||||
assert!(syms.iter().any(|s| s == "chunk.Free"), "got {syms:?}");
|
||||
assert!(syms.iter().any(|s| s == "chunk.init"), "got {syms:?}");
|
||||
assert!(
|
||||
syms.iter().any(|s| s == "chunk.MdHeadingV1Chunker"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
assert!(
|
||||
syms.iter()
|
||||
.any(|s| s == "chunk.(*MdHeadingV1Chunker).ChunkDoc"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
assert!(
|
||||
syms.iter()
|
||||
.any(|s| s == "chunk.(MdHeadingV1Chunker).Name2"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
assert!(syms.iter().any(|s| s == "chunk.Stringer"), "got {syms:?}");
|
||||
// import + const grouped into one glue unit (no isolated `<module>`).
|
||||
assert!(
|
||||
syms.iter().any(|s| s == "chunk.<top-level>"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deterministic_across_runs() {
|
||||
let a = extract_fixture();
|
||||
for _ in 0..50 {
|
||||
assert_eq!(extract_fixture().blocks, a.blocks);
|
||||
}
|
||||
}
|
||||
}
|
||||
543
crates/kebab-parse-code/src/java.rs
Normal file
543
crates/kebab-parse-code/src/java.rs
Normal file
@@ -0,0 +1,543 @@
|
||||
//! `kebab-parse-code::java` — tree-sitter Java AST extractor (P10-1C-JK Task D).
|
||||
//!
|
||||
//! Implements [`kebab_core::Extractor`] for [`MediaType::Code("java")`].
|
||||
//! Walks the tree-sitter parse tree and emits one [`Block::Code`] per
|
||||
//! top-level AST semantic unit (class / interface / enum / record /
|
||||
//! annotation-type at any nesting level, plus methods + constructors
|
||||
//! inside class / interface / record bodies), each carrying
|
||||
//! [`SourceSpan::Code`] with the unit's dotted self-reference symbol
|
||||
//! path (design §3.4 Java row). Glue declarations (`import`) collapse
|
||||
//! into one grouped `<top-level>` (or `<module>`) unit.
|
||||
//!
|
||||
//! Like the Go extractor, Java's package identity comes from the
|
||||
//! source itself (the `package_declaration` clause), not from the
|
||||
//! workspace file path — `extract_package` reads it from the AST. If
|
||||
//! the clause is missing the prefix falls back to `"<unknown>"`.
|
||||
//!
|
||||
//! Class/interface/record bodies are recursed (1B Python pattern):
|
||||
//! the type name is pushed onto `mod_path` so methods and nested
|
||||
//! types become `<pkg>.<Outer>.<Inner>.<method>`. Constructors use
|
||||
//! the Java convention `<pkg>.<...>.<Class>.<ClassName>` (name
|
||||
//! duplicated, per design §3.4). Enum bodies are not recursed for
|
||||
//! the 1차 cut — enum constants are not emitted as units.
|
||||
//!
|
||||
//! Javadoc (`/** ... */` → `block_comment`) and line comments
|
||||
//! immediately preceding an item are folded into that item's line
|
||||
//! range via `unit_start` (1B pattern). Annotations are children of
|
||||
//! the declaration node itself (inside `modifiers`), so they are
|
||||
//! already part of the declaration's span — no separate unwrap arm.
|
||||
//!
|
||||
//! Per design §3.4 / §9.1 / §9 versioning.
|
||||
|
||||
use anyhow::Result;
|
||||
use kebab_core::{
|
||||
Block, CanonicalDocument, CodeBlock, CommonBlock, Extractor, Lang, MediaType, Metadata,
|
||||
ParserVersion, Provenance, ProvenanceEvent, ProvenanceKind, SourceSpan, SourceType, TrustLevel,
|
||||
id_for_block, id_for_doc,
|
||||
};
|
||||
use serde_json::Map;
|
||||
use time::OffsetDateTime;
|
||||
|
||||
use crate::scaffold::{filename_from_workspace_path, join_symbol, strip_extension};
|
||||
|
||||
pub const PARSER_VERSION: &str = "code-java-v1";
|
||||
|
||||
/// Java AST extractor. Per-unit blocks via tree-sitter-java 0.23
|
||||
/// (`LANGUAGE: LanguageFn`) parsed by tree-sitter 0.26.
|
||||
pub struct JavaAstExtractor;
|
||||
|
||||
impl JavaAstExtractor {
|
||||
pub fn new() -> Self {
|
||||
Self
|
||||
}
|
||||
}
|
||||
|
||||
impl Default for JavaAstExtractor {
|
||||
fn default() -> Self {
|
||||
Self::new()
|
||||
}
|
||||
}
|
||||
|
||||
impl Extractor for JavaAstExtractor {
|
||||
fn supports(&self, m: &MediaType) -> bool {
|
||||
matches!(m, MediaType::Code(l) if l == "java")
|
||||
}
|
||||
|
||||
fn parser_version(&self) -> ParserVersion {
|
||||
ParserVersion(PARSER_VERSION.to_string())
|
||||
}
|
||||
|
||||
fn extract(
|
||||
&self,
|
||||
ctx: &kebab_core::ExtractContext<'_>,
|
||||
bytes: &[u8],
|
||||
) -> Result<CanonicalDocument> {
|
||||
let asset = ctx.asset;
|
||||
if !self.supports(&asset.media_type) {
|
||||
anyhow::bail!(
|
||||
"kebab-parse-code: unsupported media_type for JavaAstExtractor: {:?}",
|
||||
asset.media_type
|
||||
);
|
||||
}
|
||||
|
||||
let parser_version = self.parser_version();
|
||||
let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, &parser_version);
|
||||
|
||||
let source = String::from_utf8(bytes.to_vec())
|
||||
.map_err(|e| anyhow::anyhow!("kebab-parse-code: Java source is not valid UTF-8: {e}"))?;
|
||||
|
||||
let blocks = build_blocks(&source, &doc_id)?;
|
||||
let unit_count = blocks.len() as u32;
|
||||
|
||||
let now = OffsetDateTime::now_utc();
|
||||
let mut events: Vec<ProvenanceEvent> = Vec::with_capacity(2);
|
||||
events.push(ProvenanceEvent {
|
||||
at: asset.discovered_at,
|
||||
agent: "kb-source-fs".to_string(),
|
||||
kind: ProvenanceKind::Discovered,
|
||||
note: None,
|
||||
});
|
||||
events.push(ProvenanceEvent {
|
||||
at: now,
|
||||
agent: "kb-parse-code".to_string(),
|
||||
kind: ProvenanceKind::Parsed,
|
||||
note: Some(format!(
|
||||
"parser_version={}; unit_count={}",
|
||||
parser_version.0, unit_count
|
||||
)),
|
||||
});
|
||||
|
||||
let title = {
|
||||
let fname = filename_from_workspace_path(&asset.workspace_path.0);
|
||||
strip_extension(&fname)
|
||||
};
|
||||
|
||||
// Resolve the file's absolute path for repo detection. If the
|
||||
// source URI carries a relative path, anchor it at the workspace
|
||||
// root so the `.git/` walk-up starts from the right place.
|
||||
let abs_path = match &asset.source_uri {
|
||||
kebab_core::SourceUri::File(p) => {
|
||||
if p.is_absolute() {
|
||||
p.clone()
|
||||
} else {
|
||||
ctx.workspace_root.join(p)
|
||||
}
|
||||
}
|
||||
kebab_core::SourceUri::Kb(_) => ctx.workspace_root.to_path_buf(),
|
||||
};
|
||||
let (repo, git_branch, git_commit) = match crate::repo::detect_repo(&abs_path) {
|
||||
Some(r) => (Some(r.name), r.branch, r.commit),
|
||||
None => (None, None, None),
|
||||
};
|
||||
|
||||
let metadata = Metadata {
|
||||
aliases: Vec::new(),
|
||||
tags: Vec::new(),
|
||||
created_at: asset.discovered_at,
|
||||
updated_at: asset.discovered_at,
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Map::new(),
|
||||
repo,
|
||||
git_branch,
|
||||
git_commit,
|
||||
code_lang: Some("java".to_string()),
|
||||
};
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-parse-code",
|
||||
"extracted Java doc_id={} workspace_path={} units={}",
|
||||
doc_id.0,
|
||||
asset.workspace_path.0,
|
||||
unit_count
|
||||
);
|
||||
|
||||
Ok(CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: asset.asset_id.clone(),
|
||||
workspace_path: asset.workspace_path.clone(),
|
||||
title,
|
||||
lang: Lang("und".to_string()),
|
||||
blocks,
|
||||
metadata,
|
||||
provenance: Provenance { events },
|
||||
parser_version,
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
/// p10-1C-JK: extract `package` declaration text from a tree-sitter-java
|
||||
/// `program`. Returns `None` if no `package_declaration` (default-package
|
||||
/// Java file). The package_declaration's named children are either a
|
||||
/// single `identifier` (single-segment package, rare) or a
|
||||
/// `scoped_identifier` (dotted, common). Per design §3.4 Java row.
|
||||
fn extract_package(root: tree_sitter::Node, src: &str) -> Option<String> {
|
||||
let mut cur = root.walk();
|
||||
for child in root.named_children(&mut cur) {
|
||||
if child.kind() == "package_declaration" {
|
||||
let mut c2 = child.walk();
|
||||
for sub in child.named_children(&mut c2) {
|
||||
if sub.kind() == "scoped_identifier" || sub.kind() == "identifier" {
|
||||
return Some(src[sub.start_byte()..sub.end_byte()].to_string());
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
None
|
||||
}
|
||||
|
||||
/// Walk preceding `line_comment` / `block_comment` siblings to extend
|
||||
/// the unit's line range upward, folding leading Javadoc / line
|
||||
/// comments into the unit. Annotations live INSIDE `modifiers` on the
|
||||
/// declaration node itself, so their lines are already inside
|
||||
/// `n.start_position()` — no separate unwrap arm is needed for them.
|
||||
fn unit_start(n: &tree_sitter::Node) -> u32 {
|
||||
let mut start = n.start_position().row as u32 + 1;
|
||||
let mut prev = n.prev_sibling();
|
||||
while let Some(p) = prev {
|
||||
let k = p.kind();
|
||||
if k == "line_comment" || k == "block_comment" {
|
||||
start = p.start_position().row as u32 + 1;
|
||||
prev = p.prev_sibling();
|
||||
} else {
|
||||
break;
|
||||
}
|
||||
}
|
||||
start
|
||||
}
|
||||
|
||||
fn node_name_text<'a>(n: &tree_sitter::Node, src: &'a str) -> Option<&'a str> {
|
||||
n.child_by_field_name("name")
|
||||
.map(|c| &src[c.start_byte()..c.end_byte()])
|
||||
}
|
||||
|
||||
fn build_blocks(
|
||||
source: &str,
|
||||
doc_id: &kebab_core::DocumentId,
|
||||
) -> anyhow::Result<Vec<kebab_core::Block>> {
|
||||
let mut parser = tree_sitter::Parser::new();
|
||||
parser
|
||||
.set_language(&tree_sitter_java::LANGUAGE.into())
|
||||
.map_err(|e| anyhow::anyhow!("set tree-sitter-java language: {e}"))?;
|
||||
let tree = parser
|
||||
.parse(source.as_bytes(), None)
|
||||
.ok_or_else(|| anyhow::anyhow!("tree-sitter failed to parse Java source"))?;
|
||||
let lines: Vec<&str> = source.split('\n').collect();
|
||||
|
||||
let root = tree.root_node();
|
||||
let mod_prefix = extract_package(root, source).unwrap_or_else(|| "<unknown>".to_string());
|
||||
|
||||
// units: (symbol, line_start, line_end, is_real_semantic_unit).
|
||||
// Glue groups are pushed with a sentinel symbol + is_real=false so a
|
||||
// post-pass can decide `<module>` vs `<top-level>` (1B/1C-Go pattern).
|
||||
let mut units: Vec<(String, u32, u32, bool)> = Vec::new();
|
||||
// (is_import 0/1, s, e). `is_import` flags `import_declaration` —
|
||||
// used by the glue flush to pick `<module>` vs `<top-level>`
|
||||
// provisional label.
|
||||
let mut glue: Vec<(usize, u32, u32)> = Vec::new();
|
||||
|
||||
walk_top(root, source, &mod_prefix, &mut units, &mut glue);
|
||||
|
||||
// `<module>` is correct only when the file produced no real unit.
|
||||
// Otherwise the import-only group becomes `<top-level>` (same
|
||||
// post-pass as 1B / 1C-Go).
|
||||
let has_real_unit = units.iter().any(|(_, _, _, is_real)| *is_real);
|
||||
if has_real_unit {
|
||||
for (sym, _, _, is_real) in units.iter_mut() {
|
||||
if !*is_real && sym.ends_with("<module>") {
|
||||
let pre = &sym[..sym.len() - "<module>".len()];
|
||||
*sym = format!("{pre}<top-level>");
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let total_lines = lines.len() as u32;
|
||||
let mut blocks = Vec::with_capacity(units.len());
|
||||
for (ordinal, (symbol, ls, le, _is_real)) in units.into_iter().enumerate() {
|
||||
let line_start = ls.max(1);
|
||||
let line_end = le.min(total_lines.max(1));
|
||||
let span = SourceSpan::Code {
|
||||
line_start,
|
||||
line_end,
|
||||
symbol: Some(symbol),
|
||||
lang: Some("java".to_string()),
|
||||
};
|
||||
let block_id = id_for_block(doc_id, "code", &[], ordinal as u32, &span);
|
||||
let code = lines[(line_start as usize - 1)..=(line_end as usize - 1)].join("\n");
|
||||
blocks.push(Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id,
|
||||
heading_path: Vec::new(),
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some("java".to_string()),
|
||||
code,
|
||||
}));
|
||||
}
|
||||
Ok(blocks)
|
||||
}
|
||||
|
||||
/// Walk the file's top-level children — `program` named children:
|
||||
/// `package_declaration` (handled by `extract_package`), `import_declaration`
|
||||
/// (glue), and the five type declarations (`class` / `interface` /
|
||||
/// `enum` / `record` / `annotation_type`). Type-declaration bodies
|
||||
/// are recursed via [`walk_body`] with the type name pushed onto
|
||||
/// `mod_path` (1B Python pattern). Enum bodies are NOT recursed
|
||||
/// (1차 cut — see module-level doc).
|
||||
fn walk_top(
|
||||
node: tree_sitter::Node,
|
||||
src: &str,
|
||||
mod_prefix: &str,
|
||||
units: &mut Vec<(String, u32, u32, bool)>,
|
||||
glue: &mut Vec<(usize, u32, u32)>,
|
||||
) {
|
||||
let mod_path: &[String] = &[];
|
||||
let mut cur = node.walk();
|
||||
for child in node.named_children(&mut cur) {
|
||||
let s = unit_start(&child);
|
||||
let e = child.end_position().row as u32 + 1;
|
||||
match child.kind() {
|
||||
"class_declaration"
|
||||
| "interface_declaration"
|
||||
| "record_declaration" => {
|
||||
if let Some(name) = node_name_text(&child, src) {
|
||||
glue.retain(|(_, gs, _)| *gs < s);
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, s, e, true));
|
||||
if let Some(body) = child.child_by_field_name("body") {
|
||||
let np: Vec<String> = vec![name.to_string()];
|
||||
walk_body(body, src, mod_prefix, &np, units);
|
||||
}
|
||||
}
|
||||
}
|
||||
"enum_declaration" => {
|
||||
if let Some(name) = node_name_text(&child, src) {
|
||||
glue.retain(|(_, gs, _)| *gs < s);
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, s, e, true));
|
||||
// Enum body NOT recursed for 1차 — enum constants are
|
||||
// not emitted as units, and method declarations inside
|
||||
// enum bodies (rare) live under `enum_body_declarations`
|
||||
// not `class_body`. Skip per design §3.4 1차 scope.
|
||||
}
|
||||
}
|
||||
"annotation_type_declaration" => {
|
||||
if let Some(name) = node_name_text(&child, src) {
|
||||
glue.retain(|(_, gs, _)| *gs < s);
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, s, e, true));
|
||||
}
|
||||
}
|
||||
"import_declaration" => {
|
||||
glue.push((1, s, e));
|
||||
}
|
||||
// package_declaration is handled by `extract_package`; no
|
||||
// glue entry — it's structural metadata, not a unit.
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
}
|
||||
|
||||
/// Walk a `class_body` / `interface_body` (or record's `class_body`).
|
||||
/// Emits one unit per method / constructor, and recurses into nested
|
||||
/// type declarations. Field declarations are NOT emitted (would
|
||||
/// explode unit count). `compact_constructor_declaration` (records)
|
||||
/// is handled the same as `constructor_declaration`.
|
||||
///
|
||||
/// No `glue` parameter: Java does not have imports inside type
|
||||
/// bodies — they only appear at file top level, handled by
|
||||
/// [`walk_top`].
|
||||
fn walk_body(
|
||||
body: tree_sitter::Node,
|
||||
src: &str,
|
||||
mod_prefix: &str,
|
||||
mod_path: &[String],
|
||||
units: &mut Vec<(String, u32, u32, bool)>,
|
||||
) {
|
||||
let mut cur = body.walk();
|
||||
for child in body.named_children(&mut cur) {
|
||||
let s = unit_start(&child);
|
||||
let e = child.end_position().row as u32 + 1;
|
||||
match child.kind() {
|
||||
"method_declaration"
|
||||
| "constructor_declaration"
|
||||
| "compact_constructor_declaration" => {
|
||||
// Constructor: name field equals the class name. Per
|
||||
// design §3.4 Java convention, symbol is
|
||||
// `<pkg>.<mod_path>.<ClassName>` with the constructor
|
||||
// name (== class name) as the trailing segment. This
|
||||
// means the symbol duplicates the class name (e.g.
|
||||
// `com.x.Foo.Foo`), which is the documented convention.
|
||||
if let Some(name) = node_name_text(&child, src) {
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, s, e, true));
|
||||
}
|
||||
}
|
||||
"class_declaration"
|
||||
| "interface_declaration"
|
||||
| "record_declaration"
|
||||
| "enum_declaration"
|
||||
| "annotation_type_declaration" => {
|
||||
// Nested type — emit unit, then recurse into its body
|
||||
// (skipped for enum + annotation_type per 1차 scope).
|
||||
let name = match node_name_text(&child, src) {
|
||||
Some(n) => n,
|
||||
None => continue,
|
||||
};
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, s, e, true));
|
||||
if child.kind() != "enum_declaration"
|
||||
&& child.kind() != "annotation_type_declaration"
|
||||
{
|
||||
if let Some(inner_body) = child.child_by_field_name("body") {
|
||||
let mut np = mod_path.to_vec();
|
||||
np.push(name.to_string());
|
||||
walk_body(inner_body, src, mod_prefix, &np, units);
|
||||
}
|
||||
}
|
||||
}
|
||||
// field_declaration, static_initializer, block: NOT emitted.
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
fn flush_glue(
|
||||
glue: &mut Vec<(usize, u32, u32)>,
|
||||
units: &mut Vec<(String, u32, u32, bool)>,
|
||||
mod_prefix: &str,
|
||||
mod_path: &[String],
|
||||
) {
|
||||
if glue.is_empty() {
|
||||
return;
|
||||
}
|
||||
let s = glue.iter().map(|(_, a, _)| *a).min().unwrap();
|
||||
let e = glue.iter().map(|(_, _, b)| *b).max().unwrap();
|
||||
// Provisional label: `<module>` only if the group is exclusively
|
||||
// imports (1A's `only_mod_decls` analog). The post-pass demotes any
|
||||
// `<module>` to `<top-level>` if the file produced any real unit.
|
||||
let only_imports = glue.iter().all(|(is_import, _, _)| *is_import == 1);
|
||||
let label = if only_imports { "<module>" } else { "<top-level>" };
|
||||
units.push((join_symbol(mod_prefix, mod_path, label), s, e, false));
|
||||
glue.clear();
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use kebab_core::{Block, MediaType, SourceSpan};
|
||||
|
||||
fn extract_fixture() -> kebab_core::CanonicalDocument {
|
||||
let bytes = std::fs::read(concat!(
|
||||
env!("CARGO_MANIFEST_DIR"),
|
||||
"/tests/fixtures/sample.java"
|
||||
))
|
||||
.unwrap();
|
||||
let asset =
|
||||
crate::rust::tests_support::fixed_code_asset("crates/x/src/sample.java", "java");
|
||||
let cfg = kebab_core::ExtractConfig::default();
|
||||
let root = std::path::PathBuf::from("/tmp");
|
||||
let ctx = kebab_core::ExtractContext {
|
||||
asset: &asset,
|
||||
workspace_root: &root,
|
||||
config: &cfg,
|
||||
};
|
||||
JavaAstExtractor::new().extract(&ctx, &bytes).unwrap()
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn extractor_supports_only_media_code_java() {
|
||||
let e = JavaAstExtractor::new();
|
||||
assert!(e.supports(&MediaType::Code("java".into())));
|
||||
assert!(!e.supports(&MediaType::Code("rust".into())));
|
||||
assert!(!e.supports(&MediaType::Markdown));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn java_units_match_design_3_4_symbols() {
|
||||
let doc = extract_fixture();
|
||||
let mut syms: Vec<String> = doc
|
||||
.blocks
|
||||
.iter()
|
||||
.filter_map(|b| match b {
|
||||
Block::Code(c) => match &c.common.source_span {
|
||||
SourceSpan::Code { symbol, lang, .. } => {
|
||||
assert_eq!(lang.as_deref(), Some("java"));
|
||||
symbol.clone()
|
||||
}
|
||||
_ => None,
|
||||
},
|
||||
_ => None,
|
||||
})
|
||||
.collect();
|
||||
syms.sort();
|
||||
// package extracted from source = com.kebab.chunk
|
||||
assert!(
|
||||
syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
// constructor — Java convention is class-name-as-method-name
|
||||
assert!(
|
||||
syms.iter()
|
||||
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.MdHeadingV1Chunker"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
assert!(
|
||||
syms.iter()
|
||||
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.chunkDoc"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
assert!(
|
||||
syms.iter()
|
||||
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.getName"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
// static nested class
|
||||
assert!(
|
||||
syms.iter()
|
||||
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Builder"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
assert!(
|
||||
syms.iter()
|
||||
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Builder.withName"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
assert!(
|
||||
syms.iter()
|
||||
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Builder.build"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
// package-private interface + enum
|
||||
assert!(
|
||||
syms.iter().any(|s| s == "com.kebab.chunk.Stringer"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
assert!(
|
||||
syms.iter().any(|s| s == "com.kebab.chunk.Mode"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
// import grouped as <top-level>
|
||||
assert!(
|
||||
syms.iter().any(|s| s == "com.kebab.chunk.<top-level>"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deterministic_across_runs() {
|
||||
let a = extract_fixture();
|
||||
for _ in 0..50 {
|
||||
assert_eq!(extract_fixture().blocks, a.blocks);
|
||||
}
|
||||
}
|
||||
}
|
||||
627
crates/kebab-parse-code/src/kotlin.rs
Normal file
627
crates/kebab-parse-code/src/kotlin.rs
Normal file
@@ -0,0 +1,627 @@
|
||||
//! `kebab-parse-code::kotlin` — tree-sitter Kotlin AST extractor (P10-1C-JK Task G).
|
||||
//!
|
||||
//! Implements [`kebab_core::Extractor`] for [`MediaType::Code("kotlin")`].
|
||||
//! Mirrors the Java extractor (JVM family, source-side `package` extraction +
|
||||
//! class-nesting) with Kotlin-specific adjustments:
|
||||
//!
|
||||
//! * Root is `source_file` (not `program`).
|
||||
//! * `package_header` carries a single `qualified_identifier` child whose
|
||||
//! slice text IS the dotted package path — never a bare `identifier`
|
||||
//! sub-form for the package (the grammar always wraps a single segment
|
||||
//! in `qualified_identifier` too).
|
||||
//! * `class_declaration` covers `class`, `data class`, `sealed class`,
|
||||
//! `enum class`, AND `interface` — Kotlin uses ONE node kind with a
|
||||
//! `modifiers` child rather than separate `interface_declaration` /
|
||||
//! `enum_declaration` nodes (verified via tree-sitter-kotlin-ng
|
||||
//! `node-types.json`).
|
||||
//! * The body child of `class_declaration` is either `class_body` (normal
|
||||
//! classes / interfaces) OR `enum_class_body` (enum class). Neither
|
||||
//! carries a `body` field name, so it is matched by kind, not by
|
||||
//! `child_by_field_name("body")`.
|
||||
//! * `companion_object` is a SEPARATE node kind (not `object_declaration`
|
||||
//! with a modifier). Its `name` field is OPTIONAL — when omitted (the
|
||||
//! common case `companion object { ... }`) the symbol uses the
|
||||
//! implicit Kotlin convention name `Companion`.
|
||||
//! * `object_declaration` (named singleton) carries a `name` field and a
|
||||
//! `class_body` child.
|
||||
//! * `function_declaration` may appear at top level (Kotlin top-level
|
||||
//! function) AND inside `class_body` — same node kind, the
|
||||
//! `mod_path` state distinguishes the two emit forms.
|
||||
//!
|
||||
//! Enum bodies (`enum_class_body`) are NOT recursed for the 1차 cut —
|
||||
//! `enum_entry` declarations are not emitted as units, matching the
|
||||
//! Java extractor's enum policy (design §3.4 1차 scope).
|
||||
//!
|
||||
//! Per design §3.4 / §9.1 / §9 versioning.
|
||||
|
||||
use anyhow::Result;
|
||||
use kebab_core::{
|
||||
Block, CanonicalDocument, CodeBlock, CommonBlock, Extractor, Lang, MediaType, Metadata,
|
||||
ParserVersion, Provenance, ProvenanceEvent, ProvenanceKind, SourceSpan, SourceType, TrustLevel,
|
||||
id_for_block, id_for_doc,
|
||||
};
|
||||
use serde_json::Map;
|
||||
use time::OffsetDateTime;
|
||||
|
||||
use crate::scaffold::{filename_from_workspace_path, join_symbol, strip_extension};
|
||||
|
||||
pub const PARSER_VERSION: &str = "code-kotlin-v1";
|
||||
|
||||
/// Kotlin AST extractor. Per-unit blocks via tree-sitter-kotlin-ng 1.1
|
||||
/// (`LANGUAGE: LanguageFn`) parsed by tree-sitter 0.26.
|
||||
pub struct KotlinAstExtractor;
|
||||
|
||||
impl KotlinAstExtractor {
|
||||
pub fn new() -> Self {
|
||||
Self
|
||||
}
|
||||
}
|
||||
|
||||
impl Default for KotlinAstExtractor {
|
||||
fn default() -> Self {
|
||||
Self::new()
|
||||
}
|
||||
}
|
||||
|
||||
impl Extractor for KotlinAstExtractor {
|
||||
fn supports(&self, m: &MediaType) -> bool {
|
||||
matches!(m, MediaType::Code(l) if l == "kotlin")
|
||||
}
|
||||
|
||||
fn parser_version(&self) -> ParserVersion {
|
||||
ParserVersion(PARSER_VERSION.to_string())
|
||||
}
|
||||
|
||||
fn extract(
|
||||
&self,
|
||||
ctx: &kebab_core::ExtractContext<'_>,
|
||||
bytes: &[u8],
|
||||
) -> Result<CanonicalDocument> {
|
||||
let asset = ctx.asset;
|
||||
if !self.supports(&asset.media_type) {
|
||||
anyhow::bail!(
|
||||
"kebab-parse-code: unsupported media_type for KotlinAstExtractor: {:?}",
|
||||
asset.media_type
|
||||
);
|
||||
}
|
||||
|
||||
let parser_version = self.parser_version();
|
||||
let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, &parser_version);
|
||||
|
||||
let source = String::from_utf8(bytes.to_vec()).map_err(|e| {
|
||||
anyhow::anyhow!("kebab-parse-code: Kotlin source is not valid UTF-8: {e}")
|
||||
})?;
|
||||
|
||||
let blocks = build_blocks(&source, &doc_id)?;
|
||||
let unit_count = blocks.len() as u32;
|
||||
|
||||
let now = OffsetDateTime::now_utc();
|
||||
let mut events: Vec<ProvenanceEvent> = Vec::with_capacity(2);
|
||||
events.push(ProvenanceEvent {
|
||||
at: asset.discovered_at,
|
||||
agent: "kb-source-fs".to_string(),
|
||||
kind: ProvenanceKind::Discovered,
|
||||
note: None,
|
||||
});
|
||||
events.push(ProvenanceEvent {
|
||||
at: now,
|
||||
agent: "kb-parse-code".to_string(),
|
||||
kind: ProvenanceKind::Parsed,
|
||||
note: Some(format!(
|
||||
"parser_version={}; unit_count={}",
|
||||
parser_version.0, unit_count
|
||||
)),
|
||||
});
|
||||
|
||||
let title = {
|
||||
let fname = filename_from_workspace_path(&asset.workspace_path.0);
|
||||
strip_extension(&fname)
|
||||
};
|
||||
|
||||
// Resolve the file's absolute path for repo detection. If the
|
||||
// source URI carries a relative path, anchor it at the workspace
|
||||
// root so the `.git/` walk-up starts from the right place.
|
||||
let abs_path = match &asset.source_uri {
|
||||
kebab_core::SourceUri::File(p) => {
|
||||
if p.is_absolute() {
|
||||
p.clone()
|
||||
} else {
|
||||
ctx.workspace_root.join(p)
|
||||
}
|
||||
}
|
||||
kebab_core::SourceUri::Kb(_) => ctx.workspace_root.to_path_buf(),
|
||||
};
|
||||
let (repo, git_branch, git_commit) = match crate::repo::detect_repo(&abs_path) {
|
||||
Some(r) => (Some(r.name), r.branch, r.commit),
|
||||
None => (None, None, None),
|
||||
};
|
||||
|
||||
let metadata = Metadata {
|
||||
aliases: Vec::new(),
|
||||
tags: Vec::new(),
|
||||
created_at: asset.discovered_at,
|
||||
updated_at: asset.discovered_at,
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Map::new(),
|
||||
repo,
|
||||
git_branch,
|
||||
git_commit,
|
||||
code_lang: Some("kotlin".to_string()),
|
||||
};
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-parse-code",
|
||||
"extracted Kotlin doc_id={} workspace_path={} units={}",
|
||||
doc_id.0,
|
||||
asset.workspace_path.0,
|
||||
unit_count
|
||||
);
|
||||
|
||||
Ok(CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: asset.asset_id.clone(),
|
||||
workspace_path: asset.workspace_path.clone(),
|
||||
title,
|
||||
lang: Lang("und".to_string()),
|
||||
blocks,
|
||||
metadata,
|
||||
provenance: Provenance { events },
|
||||
parser_version,
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
/// p10-1C-JK: extract `package` declaration text from a tree-sitter-kotlin
|
||||
/// `source_file`. Returns `None` if no `package_header` (default-package
|
||||
/// Kotlin file). The package_header's single named child is a
|
||||
/// `qualified_identifier`; its slice text is the dotted path. Per design
|
||||
/// §3.4 Kotlin row.
|
||||
fn extract_package(root: tree_sitter::Node, src: &str) -> Option<String> {
|
||||
let mut cur = root.walk();
|
||||
for child in root.named_children(&mut cur) {
|
||||
if child.kind() == "package_header" {
|
||||
let mut c2 = child.walk();
|
||||
for sub in child.named_children(&mut c2) {
|
||||
let k = sub.kind();
|
||||
if k == "qualified_identifier" || k == "identifier" {
|
||||
return Some(src[sub.start_byte()..sub.end_byte()].to_string());
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
None
|
||||
}
|
||||
|
||||
/// Walk preceding `line_comment` / `block_comment` siblings to extend
|
||||
/// the unit's line range upward, folding leading KDoc / line comments
|
||||
/// into the unit. Modifiers / annotations live INSIDE the declaration
|
||||
/// node itself, so their lines are already inside `n.start_position()`.
|
||||
fn unit_start(n: &tree_sitter::Node) -> u32 {
|
||||
let mut start = n.start_position().row as u32 + 1;
|
||||
let mut prev = n.prev_sibling();
|
||||
while let Some(p) = prev {
|
||||
let k = p.kind();
|
||||
if k == "line_comment" || k == "block_comment" {
|
||||
start = p.start_position().row as u32 + 1;
|
||||
prev = p.prev_sibling();
|
||||
} else {
|
||||
break;
|
||||
}
|
||||
}
|
||||
start
|
||||
}
|
||||
|
||||
fn node_name_text<'a>(n: &tree_sitter::Node, src: &'a str) -> Option<&'a str> {
|
||||
n.child_by_field_name("name")
|
||||
.map(|c| &src[c.start_byte()..c.end_byte()])
|
||||
}
|
||||
|
||||
/// Find the first child of a node with one of the given kinds. Used to
|
||||
/// locate `class_body` / `enum_class_body` on `class_declaration` since
|
||||
/// the kotlin grammar attaches them without a `body` field name.
|
||||
fn first_child_of_kinds<'a>(
|
||||
n: &tree_sitter::Node<'a>,
|
||||
kinds: &[&str],
|
||||
) -> Option<tree_sitter::Node<'a>> {
|
||||
let mut cur = n.walk();
|
||||
n.named_children(&mut cur)
|
||||
.find(|child| kinds.contains(&child.kind()))
|
||||
}
|
||||
|
||||
/// `true` iff a `class_declaration` carries the `enum` class modifier.
|
||||
/// Detected by walking `modifiers` → `class_modifier` and checking the
|
||||
/// child text. The grammar exposes "enum" / "sealed" / "data" /
|
||||
/// "annotation" / "inner" as named `class_modifier` children of
|
||||
/// `modifiers`. We only need to know about "enum" to decide whether to
|
||||
/// look for `class_body` or `enum_class_body` and whether to skip body
|
||||
/// recursion.
|
||||
fn class_decl_is_enum(n: &tree_sitter::Node, src: &str) -> bool {
|
||||
let mut cur = n.walk();
|
||||
for child in n.named_children(&mut cur) {
|
||||
if child.kind() == "modifiers" {
|
||||
let mut c2 = child.walk();
|
||||
for sub in child.named_children(&mut c2) {
|
||||
if sub.kind() == "class_modifier" {
|
||||
let text = &src[sub.start_byte()..sub.end_byte()];
|
||||
if text == "enum" {
|
||||
return true;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
false
|
||||
}
|
||||
|
||||
fn build_blocks(
|
||||
source: &str,
|
||||
doc_id: &kebab_core::DocumentId,
|
||||
) -> anyhow::Result<Vec<kebab_core::Block>> {
|
||||
let mut parser = tree_sitter::Parser::new();
|
||||
parser
|
||||
.set_language(&tree_sitter_kotlin_ng::LANGUAGE.into())
|
||||
.map_err(|e| anyhow::anyhow!("set tree-sitter-kotlin-ng language: {e}"))?;
|
||||
let tree = parser
|
||||
.parse(source.as_bytes(), None)
|
||||
.ok_or_else(|| anyhow::anyhow!("tree-sitter failed to parse Kotlin source"))?;
|
||||
let lines: Vec<&str> = source.split('\n').collect();
|
||||
|
||||
let root = tree.root_node();
|
||||
let mod_prefix = extract_package(root, source).unwrap_or_else(|| "<unknown>".to_string());
|
||||
|
||||
// units: (symbol, line_start, line_end, is_real_semantic_unit).
|
||||
// Glue groups are pushed with a sentinel symbol + is_real=false so a
|
||||
// post-pass can decide `<module>` vs `<top-level>` (JVM family pattern).
|
||||
let mut units: Vec<(String, u32, u32, bool)> = Vec::new();
|
||||
// (is_import 0/1, s, e). `is_import` flags `import` — used by the
|
||||
// glue flush to pick `<module>` vs `<top-level>` provisional label.
|
||||
let mut glue: Vec<(usize, u32, u32)> = Vec::new();
|
||||
|
||||
walk_top(root, source, &mod_prefix, &mut units, &mut glue);
|
||||
|
||||
// `<module>` is correct only when the file produced no real unit.
|
||||
// Otherwise the import-only group becomes `<top-level>` (same
|
||||
// post-pass as 1B / 1C-Go / Java).
|
||||
let has_real_unit = units.iter().any(|(_, _, _, is_real)| *is_real);
|
||||
if has_real_unit {
|
||||
for (sym, _, _, is_real) in units.iter_mut() {
|
||||
if !*is_real && sym.ends_with("<module>") {
|
||||
let pre = &sym[..sym.len() - "<module>".len()];
|
||||
*sym = format!("{pre}<top-level>");
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let total_lines = lines.len() as u32;
|
||||
let mut blocks = Vec::with_capacity(units.len());
|
||||
for (ordinal, (symbol, ls, le, _is_real)) in units.into_iter().enumerate() {
|
||||
let line_start = ls.max(1);
|
||||
let line_end = le.min(total_lines.max(1));
|
||||
let span = SourceSpan::Code {
|
||||
line_start,
|
||||
line_end,
|
||||
symbol: Some(symbol),
|
||||
lang: Some("kotlin".to_string()),
|
||||
};
|
||||
let block_id = id_for_block(doc_id, "code", &[], ordinal as u32, &span);
|
||||
let code = lines[(line_start as usize - 1)..=(line_end as usize - 1)].join("\n");
|
||||
blocks.push(Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id,
|
||||
heading_path: Vec::new(),
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some("kotlin".to_string()),
|
||||
code,
|
||||
}));
|
||||
}
|
||||
Ok(blocks)
|
||||
}
|
||||
|
||||
/// Walk the file's top-level children — `source_file` named children:
|
||||
/// `package_header` (handled by `extract_package`), `import` (glue),
|
||||
/// `class_declaration` (class / interface / enum class), `object_declaration`,
|
||||
/// `function_declaration` (top-level), `property_declaration` (top-level),
|
||||
/// `type_alias` (currently treated as glue). Class / object bodies are
|
||||
/// recursed via [`walk_body`] with the type name pushed onto `mod_path`
|
||||
/// (JVM family pattern). Enum bodies are NOT recursed (1차 cut).
|
||||
fn walk_top(
|
||||
node: tree_sitter::Node,
|
||||
src: &str,
|
||||
mod_prefix: &str,
|
||||
units: &mut Vec<(String, u32, u32, bool)>,
|
||||
glue: &mut Vec<(usize, u32, u32)>,
|
||||
) {
|
||||
let mod_path: &[String] = &[];
|
||||
let mut cur = node.walk();
|
||||
for child in node.named_children(&mut cur) {
|
||||
let s = unit_start(&child);
|
||||
let e = child.end_position().row as u32 + 1;
|
||||
match child.kind() {
|
||||
"class_declaration" => {
|
||||
// Covers class / data class / sealed class / interface /
|
||||
// enum class — single grammar node, the modifiers child
|
||||
// distinguishes them. The body is `class_body` for
|
||||
// non-enum and `enum_class_body` for enum class; both
|
||||
// attach without a `body` field name.
|
||||
if let Some(name) = node_name_text(&child, src) {
|
||||
glue.retain(|(_, gs, _)| *gs < s);
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, s, e, true));
|
||||
let is_enum = class_decl_is_enum(&child, src);
|
||||
if !is_enum {
|
||||
if let Some(body) = first_child_of_kinds(&child, &["class_body"]) {
|
||||
let np: Vec<String> = vec![name.to_string()];
|
||||
walk_body(body, src, mod_prefix, &np, units);
|
||||
}
|
||||
}
|
||||
// enum_class_body NOT recursed — enum constants are
|
||||
// not emitted as units (1차 scope, matches Java).
|
||||
}
|
||||
}
|
||||
"object_declaration" => {
|
||||
// Singleton object — name field is required by the grammar.
|
||||
if let Some(name) = node_name_text(&child, src) {
|
||||
glue.retain(|(_, gs, _)| *gs < s);
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, s, e, true));
|
||||
if let Some(body) = first_child_of_kinds(&child, &["class_body"]) {
|
||||
let np: Vec<String> = vec![name.to_string()];
|
||||
walk_body(body, src, mod_prefix, &np, units);
|
||||
}
|
||||
}
|
||||
}
|
||||
"function_declaration" => {
|
||||
// Top-level Kotlin function (unlike Java).
|
||||
if let Some(name) = node_name_text(&child, src) {
|
||||
glue.retain(|(_, gs, _)| *gs < s);
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, s, e, true));
|
||||
}
|
||||
}
|
||||
"import" => {
|
||||
glue.push((1, s, e));
|
||||
}
|
||||
// `property_declaration` (top-level val/var) and `type_alias`
|
||||
// are not emitted as standalone units in the 1차 cut — they
|
||||
// glue into the import group instead. `package_header` is
|
||||
// handled by `extract_package` (structural metadata, not a
|
||||
// unit).
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
}
|
||||
|
||||
/// Walk a `class_body` (or object's `class_body`). Emits one unit per
|
||||
/// method / secondary constructor and recurses into nested type
|
||||
/// declarations + companion objects. Property declarations are NOT
|
||||
/// emitted (would explode unit count, parallel to Java field policy).
|
||||
///
|
||||
/// `companion_object` carries an optional `name` field — when omitted
|
||||
/// (the common case `companion object { ... }`) the implicit Kotlin
|
||||
/// convention name `Companion` is used.
|
||||
///
|
||||
/// No `glue` parameter: Kotlin imports are file-level only.
|
||||
fn walk_body(
|
||||
body: tree_sitter::Node,
|
||||
src: &str,
|
||||
mod_prefix: &str,
|
||||
mod_path: &[String],
|
||||
units: &mut Vec<(String, u32, u32, bool)>,
|
||||
) {
|
||||
let mut cur = body.walk();
|
||||
for child in body.named_children(&mut cur) {
|
||||
let s = unit_start(&child);
|
||||
let e = child.end_position().row as u32 + 1;
|
||||
match child.kind() {
|
||||
"function_declaration" => {
|
||||
if let Some(name) = node_name_text(&child, src) {
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, s, e, true));
|
||||
}
|
||||
}
|
||||
"secondary_constructor" => {
|
||||
// Kotlin secondary constructor — no `name` field on the
|
||||
// grammar node. Per design §3.4 (Java JVM convention) the
|
||||
// symbol uses the enclosing class name as the trailing
|
||||
// segment (matches the Java `<pkg>.<...>.<Class>.<Class>`
|
||||
// duplication for constructors).
|
||||
if let Some(class_name) = mod_path.last() {
|
||||
let sym = join_symbol(mod_prefix, mod_path, class_name);
|
||||
units.push((sym, s, e, true));
|
||||
}
|
||||
}
|
||||
"companion_object" => {
|
||||
// Companion's name field is OPTIONAL — fall back to the
|
||||
// Kotlin implicit name `Companion`.
|
||||
let name: &str = node_name_text(&child, src).unwrap_or("Companion");
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, s, e, true));
|
||||
if let Some(inner_body) = first_child_of_kinds(&child, &["class_body"]) {
|
||||
let mut np = mod_path.to_vec();
|
||||
np.push(name.to_string());
|
||||
walk_body(inner_body, src, mod_prefix, &np, units);
|
||||
}
|
||||
}
|
||||
"class_declaration" => {
|
||||
let name = match node_name_text(&child, src) {
|
||||
Some(n) => n,
|
||||
None => continue,
|
||||
};
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, s, e, true));
|
||||
let is_enum = class_decl_is_enum(&child, src);
|
||||
if !is_enum {
|
||||
if let Some(inner_body) = first_child_of_kinds(&child, &["class_body"]) {
|
||||
let mut np = mod_path.to_vec();
|
||||
np.push(name.to_string());
|
||||
walk_body(inner_body, src, mod_prefix, &np, units);
|
||||
}
|
||||
}
|
||||
}
|
||||
"object_declaration" => {
|
||||
let name = match node_name_text(&child, src) {
|
||||
Some(n) => n,
|
||||
None => continue,
|
||||
};
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, s, e, true));
|
||||
if let Some(inner_body) = first_child_of_kinds(&child, &["class_body"]) {
|
||||
let mut np = mod_path.to_vec();
|
||||
np.push(name.to_string());
|
||||
walk_body(inner_body, src, mod_prefix, &np, units);
|
||||
}
|
||||
}
|
||||
// property_declaration, anonymous_initializer: NOT emitted.
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
fn flush_glue(
|
||||
glue: &mut Vec<(usize, u32, u32)>,
|
||||
units: &mut Vec<(String, u32, u32, bool)>,
|
||||
mod_prefix: &str,
|
||||
mod_path: &[String],
|
||||
) {
|
||||
if glue.is_empty() {
|
||||
return;
|
||||
}
|
||||
let s = glue.iter().map(|(_, a, _)| *a).min().unwrap();
|
||||
let e = glue.iter().map(|(_, _, b)| *b).max().unwrap();
|
||||
// Provisional label: `<module>` only if the group is exclusively
|
||||
// imports. The post-pass demotes any `<module>` to `<top-level>` if
|
||||
// the file produced any real unit.
|
||||
let only_imports = glue.iter().all(|(is_import, _, _)| *is_import == 1);
|
||||
let label = if only_imports { "<module>" } else { "<top-level>" };
|
||||
units.push((join_symbol(mod_prefix, mod_path, label), s, e, false));
|
||||
glue.clear();
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use kebab_core::{Block, MediaType, SourceSpan};
|
||||
|
||||
fn extract_fixture() -> kebab_core::CanonicalDocument {
|
||||
let bytes = std::fs::read(concat!(
|
||||
env!("CARGO_MANIFEST_DIR"),
|
||||
"/tests/fixtures/sample.kt"
|
||||
))
|
||||
.unwrap();
|
||||
let asset =
|
||||
crate::rust::tests_support::fixed_code_asset("crates/x/src/sample.kt", "kotlin");
|
||||
let cfg = kebab_core::ExtractConfig::default();
|
||||
let root = std::path::PathBuf::from("/tmp");
|
||||
let ctx = kebab_core::ExtractContext {
|
||||
asset: &asset,
|
||||
workspace_root: &root,
|
||||
config: &cfg,
|
||||
};
|
||||
KotlinAstExtractor::new().extract(&ctx, &bytes).unwrap()
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn extractor_supports_only_media_code_kotlin() {
|
||||
let e = KotlinAstExtractor::new();
|
||||
assert!(e.supports(&MediaType::Code("kotlin".into())));
|
||||
assert!(!e.supports(&MediaType::Code("java".into())));
|
||||
assert!(!e.supports(&MediaType::Code("rust".into())));
|
||||
assert!(!e.supports(&MediaType::Markdown));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn kotlin_units_match_design_3_4_symbols() {
|
||||
let doc = extract_fixture();
|
||||
let mut syms: Vec<String> = doc
|
||||
.blocks
|
||||
.iter()
|
||||
.filter_map(|b| match b {
|
||||
Block::Code(c) => match &c.common.source_span {
|
||||
SourceSpan::Code { symbol, lang, .. } => {
|
||||
assert_eq!(lang.as_deref(), Some("kotlin"));
|
||||
symbol.clone()
|
||||
}
|
||||
_ => None,
|
||||
},
|
||||
_ => None,
|
||||
})
|
||||
.collect();
|
||||
syms.sort();
|
||||
// package extracted from source = com.kebab.chunk
|
||||
assert!(
|
||||
syms.iter()
|
||||
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
assert!(
|
||||
syms.iter()
|
||||
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.chunkDoc"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
assert!(
|
||||
syms.iter()
|
||||
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.getName"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
// Implicit companion object name = Companion (grammar leaves the
|
||||
// name field unset; the extractor fills it in).
|
||||
assert!(
|
||||
syms.iter()
|
||||
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Companion"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
assert!(
|
||||
syms.iter()
|
||||
.any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Companion.withName"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
// interface — also via class_declaration in the grammar
|
||||
assert!(
|
||||
syms.iter().any(|s| s == "com.kebab.chunk.Stringer"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
// enum class — also via class_declaration; body NOT recursed
|
||||
assert!(
|
||||
syms.iter().any(|s| s == "com.kebab.chunk.Mode"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
// Kotlin top-level fn — unlike Java
|
||||
assert!(
|
||||
syms.iter().any(|s| s == "com.kebab.chunk.freeFunction"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
// Singleton object + its method
|
||||
assert!(
|
||||
syms.iter().any(|s| s == "com.kebab.chunk.Singleton"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
assert!(
|
||||
syms.iter().any(|s| s == "com.kebab.chunk.Singleton.ping"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
// import grouped as <top-level>
|
||||
assert!(
|
||||
syms.iter().any(|s| s == "com.kebab.chunk.<top-level>"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deterministic_across_runs() {
|
||||
let a = extract_fixture();
|
||||
for _ in 0..50 {
|
||||
assert_eq!(extract_fixture().blocks, a.blocks);
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -10,33 +10,57 @@ use std::path::Path;
|
||||
/// `None` if the extension / filename is not recognized.
|
||||
///
|
||||
/// Matching priority:
|
||||
/// 1. exact filename match (e.g. `Dockerfile`, `Makefile`)
|
||||
/// 2. lowercase extension match
|
||||
/// 1. Tier 1 basename exact match (e.g. `Dockerfile`, `Makefile`)
|
||||
/// 2. Tier 2 basename match (e.g. `Cargo.toml`, `package.json`, `build.gradle`)
|
||||
/// 3. Tier 2 `Dockerfile.*` prefix variant
|
||||
/// 4. Tier 1 + Tier 2 extension fallback (lowercase)
|
||||
pub fn code_lang_for_path(path: &Path) -> Option<&'static str> {
|
||||
if let Some(name) = path.file_name().and_then(|n| n.to_str()) {
|
||||
// Tier 1 basename exact match
|
||||
match name {
|
||||
"Dockerfile" => return Some("dockerfile"),
|
||||
"Makefile" | "GNUmakefile" => return Some("make"),
|
||||
_ => {}
|
||||
}
|
||||
|
||||
// Tier 2 basename match (configuration / manifest files)
|
||||
match name {
|
||||
"Cargo.toml" | "pyproject.toml" => return Some("toml"),
|
||||
"package.json" | "tsconfig.json" => return Some("json"),
|
||||
"go.mod" => return Some("go-mod"),
|
||||
"pom.xml" => return Some("xml"),
|
||||
"build.gradle" => return Some("groovy"),
|
||||
_ => {}
|
||||
}
|
||||
|
||||
// Tier 2: `Dockerfile.*` prefix variant (e.g. `Dockerfile.dev`, `Dockerfile.prod`)
|
||||
if name.starts_with("Dockerfile.") && name.len() > "Dockerfile.".len() {
|
||||
return Some("dockerfile");
|
||||
}
|
||||
}
|
||||
|
||||
// Extension fallback (Tier 1 + Tier 2)
|
||||
let ext = path.extension()?.to_str()?.to_ascii_lowercase();
|
||||
match ext.as_str() {
|
||||
// Tier 1 extensions
|
||||
"rs" => Some("rust"),
|
||||
"py" | "pyi" => Some("python"),
|
||||
"ts" | "tsx" => Some("typescript"),
|
||||
"ts" | "tsx" | "mts" | "cts" => Some("typescript"),
|
||||
"js" | "mjs" | "cjs" | "jsx" => Some("javascript"),
|
||||
"go" => Some("go"),
|
||||
"java" => Some("java"),
|
||||
"kt" | "kts" => Some("kotlin"),
|
||||
"c" | "h" => Some("c"),
|
||||
"cpp" | "cc" | "cxx" | "hpp" | "hh" | "hxx" => Some("cpp"),
|
||||
"sh" | "bash" | "zsh" => Some("shell"),
|
||||
"mk" => Some("make"),
|
||||
// Tier 2 extensions
|
||||
"yaml" | "yml" => Some("yaml"),
|
||||
"toml" => Some("toml"),
|
||||
"json" => Some("json"),
|
||||
"sh" | "bash" | "zsh" => Some("shell"),
|
||||
"mk" => Some("make"),
|
||||
"xml" => Some("xml"),
|
||||
"dockerfile" => Some("dockerfile"),
|
||||
"gradle" => Some("groovy"),
|
||||
_ => None,
|
||||
}
|
||||
}
|
||||
@@ -82,7 +106,7 @@ pub fn module_path_for_python(workspace_path: &str) -> String {
|
||||
/// (no slash replacement, no source-root strip). See plan §Task C.
|
||||
pub fn module_path_for_tsjs(workspace_path: &str) -> String {
|
||||
let p = workspace_path;
|
||||
for ext in [".tsx", ".ts", ".jsx", ".mjs", ".cjs", ".js"] {
|
||||
for ext in [".tsx", ".mts", ".cts", ".ts", ".jsx", ".mjs", ".cjs", ".js"] {
|
||||
if let Some(stripped) = p.strip_suffix(ext) {
|
||||
return stripped.to_string();
|
||||
}
|
||||
@@ -110,7 +134,7 @@ mod tests {
|
||||
|
||||
#[test]
|
||||
fn module_path_for_tsjs_keeps_slashes_and_strips_ext() {
|
||||
for ext in ["ts", "tsx", "js", "jsx", "mjs", "cjs"] {
|
||||
for ext in ["ts", "tsx", "mts", "cts", "js", "jsx", "mjs", "cjs"] {
|
||||
let p = format!("src/search/retriever/Retriever.{ext}");
|
||||
assert_eq!(module_path_for_tsjs(&p), "src/search/retriever/Retriever");
|
||||
}
|
||||
@@ -118,4 +142,28 @@ mod tests {
|
||||
assert_eq!(module_path_for_tsjs("a/b/c.ts"), "a/b/c");
|
||||
assert_eq!(module_path_for_tsjs("packages/x/src/Foo.ts"), "packages/x/src/Foo");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn tier2_basename_takes_precedence_over_extension() {
|
||||
assert_eq!(code_lang_for_path(Path::new("Dockerfile")), Some("dockerfile"));
|
||||
assert_eq!(code_lang_for_path(Path::new("foo/Dockerfile.dev")), Some("dockerfile"));
|
||||
assert_eq!(code_lang_for_path(Path::new("myapp.dockerfile")), Some("dockerfile"));
|
||||
assert_eq!(code_lang_for_path(Path::new("repo/Cargo.toml")), Some("toml"));
|
||||
assert_eq!(code_lang_for_path(Path::new("pyproject.toml")), Some("toml"));
|
||||
assert_eq!(code_lang_for_path(Path::new("repo/package.json")), Some("json"));
|
||||
assert_eq!(code_lang_for_path(Path::new("tsconfig.json")), Some("json"));
|
||||
assert_eq!(code_lang_for_path(Path::new("go.mod")), Some("go-mod"));
|
||||
assert_eq!(code_lang_for_path(Path::new("pom.xml")), Some("xml"));
|
||||
assert_eq!(code_lang_for_path(Path::new("build.gradle")), Some("groovy"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn tier2_extension_fallback() {
|
||||
assert_eq!(code_lang_for_path(Path::new("k8s/deploy.yaml")), Some("yaml"));
|
||||
assert_eq!(code_lang_for_path(Path::new("k8s/deploy.yml")), Some("yaml"));
|
||||
assert_eq!(code_lang_for_path(Path::new("foo/bar.toml")), Some("toml"));
|
||||
assert_eq!(code_lang_for_path(Path::new("foo/bar.json")), Some("json"));
|
||||
assert_eq!(code_lang_for_path(Path::new("foo/bar.xml")), Some("xml"));
|
||||
assert_eq!(code_lang_for_path(Path::new("foo/bar.gradle")), Some("groovy"));
|
||||
}
|
||||
}
|
||||
|
||||
@@ -13,7 +13,12 @@
|
||||
//! `kebab-parse-*` crates per design §8: must NOT depend on store / embed
|
||||
//! / llm / rag.
|
||||
|
||||
pub mod c;
|
||||
pub mod cpp;
|
||||
pub mod go;
|
||||
pub mod java;
|
||||
pub mod javascript;
|
||||
pub mod kotlin;
|
||||
pub mod lang;
|
||||
pub mod python;
|
||||
pub mod repo;
|
||||
@@ -22,7 +27,12 @@ pub(crate) mod scaffold;
|
||||
pub mod skip;
|
||||
pub mod typescript;
|
||||
|
||||
pub use c::{PARSER_VERSION as C_PARSER_VERSION, CAstExtractor};
|
||||
pub use cpp::{PARSER_VERSION as CPP_PARSER_VERSION, CppAstExtractor};
|
||||
pub use go::{PARSER_VERSION as GO_PARSER_VERSION, GoAstExtractor};
|
||||
pub use java::{PARSER_VERSION as JAVA_PARSER_VERSION, JavaAstExtractor};
|
||||
pub use javascript::{PARSER_VERSION as JS_PARSER_VERSION, JavascriptAstExtractor};
|
||||
pub use kotlin::{PARSER_VERSION as KOTLIN_PARSER_VERSION, KotlinAstExtractor};
|
||||
pub use lang::{code_lang_for_path, module_path_for_python, module_path_for_tsjs};
|
||||
pub use python::{PARSER_VERSION as PYTHON_PARSER_VERSION, PythonAstExtractor};
|
||||
pub use repo::{RepoMeta, detect_repo};
|
||||
|
||||
@@ -173,8 +173,9 @@ impl Extractor for TypescriptAstExtractor {
|
||||
}
|
||||
|
||||
/// Select the tree-sitter grammar based on the workspace path's
|
||||
/// extension. `.tsx` → TSX grammar; everything else (`.ts`, `.d.ts`,
|
||||
/// missing extension) → TypeScript grammar.
|
||||
/// extension. `.tsx` → TSX grammar; everything else (`.ts`, `.mts`,
|
||||
/// `.cts`, `.d.ts`, missing extension) → TypeScript grammar (the JSX-
|
||||
/// agnostic variants all share one grammar in tree-sitter-typescript 0.23).
|
||||
fn select_grammar(workspace_path: &str) -> tree_sitter::Language {
|
||||
if workspace_path.ends_with(".tsx") {
|
||||
tree_sitter_typescript::LANGUAGE_TSX.into()
|
||||
|
||||
34
crates/kebab-parse-code/tests/fixtures/sample.go
vendored
Normal file
34
crates/kebab-parse-code/tests/fixtures/sample.go
vendored
Normal file
@@ -0,0 +1,34 @@
|
||||
// sample.go
|
||||
package chunk
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"strings"
|
||||
)
|
||||
|
||||
const Version = "v1"
|
||||
|
||||
type MdHeadingV1Chunker struct {
|
||||
Name string
|
||||
}
|
||||
|
||||
// ChunkDoc returns a stub list of strings.
|
||||
func (m *MdHeadingV1Chunker) ChunkDoc(input string) []string {
|
||||
return []string{m.Name}
|
||||
}
|
||||
|
||||
func (m MdHeadingV1Chunker) Name2() string {
|
||||
return m.Name
|
||||
}
|
||||
|
||||
type Stringer interface {
|
||||
String() string
|
||||
}
|
||||
|
||||
func Free(x int) int {
|
||||
return x + 1
|
||||
}
|
||||
|
||||
func init() {
|
||||
fmt.Println(strings.ToUpper("init"))
|
||||
}
|
||||
36
crates/kebab-parse-code/tests/fixtures/sample.java
vendored
Normal file
36
crates/kebab-parse-code/tests/fixtures/sample.java
vendored
Normal file
@@ -0,0 +1,36 @@
|
||||
// sample.java
|
||||
package com.kebab.chunk;
|
||||
|
||||
import java.util.List;
|
||||
import java.util.stream.Collectors;
|
||||
|
||||
/**
|
||||
* Heading-aware Markdown chunker.
|
||||
*/
|
||||
public class MdHeadingV1Chunker {
|
||||
private final String name;
|
||||
|
||||
public MdHeadingV1Chunker(String name) {
|
||||
this.name = name;
|
||||
}
|
||||
|
||||
public List<String> chunkDoc(String input) {
|
||||
return List.of(name, input);
|
||||
}
|
||||
|
||||
public String getName() {
|
||||
return name;
|
||||
}
|
||||
|
||||
public static class Builder {
|
||||
private String name;
|
||||
public Builder withName(String n) { this.name = n; return this; }
|
||||
public MdHeadingV1Chunker build() { return new MdHeadingV1Chunker(name); }
|
||||
}
|
||||
}
|
||||
|
||||
interface Stringer {
|
||||
String asString();
|
||||
}
|
||||
|
||||
enum Mode { DEFAULT, FAST }
|
||||
29
crates/kebab-parse-code/tests/fixtures/sample.kt
vendored
Normal file
29
crates/kebab-parse-code/tests/fixtures/sample.kt
vendored
Normal file
@@ -0,0 +1,29 @@
|
||||
// sample.kt
|
||||
package com.kebab.chunk
|
||||
|
||||
import java.util.List
|
||||
|
||||
/**
|
||||
* Heading-aware Markdown chunker.
|
||||
*/
|
||||
class MdHeadingV1Chunker(val name: String) {
|
||||
fun chunkDoc(input: String): List<String> = listOf(name, input)
|
||||
|
||||
fun getName(): String = name
|
||||
|
||||
companion object {
|
||||
fun withName(n: String): MdHeadingV1Chunker = MdHeadingV1Chunker(n)
|
||||
}
|
||||
}
|
||||
|
||||
interface Stringer {
|
||||
fun asString(): String
|
||||
}
|
||||
|
||||
enum class Mode { DEFAULT, FAST }
|
||||
|
||||
fun freeFunction(x: Int): Int = x + 1
|
||||
|
||||
object Singleton {
|
||||
fun ping(): String = "pong"
|
||||
}
|
||||
@@ -9,6 +9,8 @@ fn known_extensions_map_to_canonical_identifiers() {
|
||||
("foo.pyi", Some("python")),
|
||||
("foo.ts", Some("typescript")),
|
||||
("foo.tsx", Some("typescript")),
|
||||
("foo.mts", Some("typescript")), // ESM TS — same grammar
|
||||
("foo.cts", Some("typescript")), // CommonJS TS — same grammar
|
||||
("foo.js", Some("javascript")),
|
||||
("foo.mjs", Some("javascript")),
|
||||
("foo.cjs", Some("javascript")),
|
||||
|
||||
@@ -39,10 +39,6 @@ use crate::image_prep;
|
||||
/// Engine name written into `OcrText.engine` for the Ollama-vision adapter.
|
||||
pub const OLLAMA_VISION_ENGINE: &str = "ollama-vision";
|
||||
|
||||
/// Hard ceiling on the OCR HTTP exchange. Cold-loading a vision model on
|
||||
/// first call can take ~30s; 5 minutes is generous without being open-ended.
|
||||
const REQUEST_TIMEOUT: Duration = Duration::from_secs(300);
|
||||
|
||||
/// Lower bound on `config.image.ocr.max_pixels`. Anything below this is
|
||||
/// silently bumped to keep the model from receiving an unreadable thumbnail.
|
||||
const MIN_LONG_EDGE: u32 = 256;
|
||||
@@ -139,7 +135,13 @@ impl OllamaVisionOcr {
|
||||
Some(s) if !s.is_empty() => s.to_string(),
|
||||
_ => config.models.llm.endpoint.clone(),
|
||||
};
|
||||
Self::build(endpoint, ocr.model.clone(), ocr.languages.clone(), ocr.max_pixels)
|
||||
Self::build(
|
||||
endpoint,
|
||||
ocr.model.clone(),
|
||||
ocr.languages.clone(),
|
||||
ocr.max_pixels,
|
||||
ocr.request_timeout_secs,
|
||||
)
|
||||
}
|
||||
|
||||
/// Build directly from explicit fields. Useful for tests that need
|
||||
@@ -153,8 +155,15 @@ impl OllamaVisionOcr {
|
||||
model: impl Into<String>,
|
||||
languages: Vec<String>,
|
||||
max_pixels: u32,
|
||||
request_timeout_secs: u64,
|
||||
) -> Result<Self> {
|
||||
Self::build(endpoint.into(), model.into(), languages, max_pixels)
|
||||
Self::build(
|
||||
endpoint.into(),
|
||||
model.into(),
|
||||
languages,
|
||||
max_pixels,
|
||||
request_timeout_secs,
|
||||
)
|
||||
}
|
||||
|
||||
/// Shared validation + construction. Centralised so `new` and
|
||||
@@ -164,6 +173,7 @@ impl OllamaVisionOcr {
|
||||
model: String,
|
||||
languages: Vec<String>,
|
||||
requested_max_pixels: u32,
|
||||
request_timeout_secs: u64,
|
||||
) -> Result<Self> {
|
||||
if endpoint.is_empty() {
|
||||
anyhow::bail!(
|
||||
@@ -183,7 +193,7 @@ impl OllamaVisionOcr {
|
||||
);
|
||||
}
|
||||
let client = reqwest::blocking::Client::builder()
|
||||
.timeout(REQUEST_TIMEOUT)
|
||||
.timeout(Duration::from_secs(request_timeout_secs))
|
||||
.build()
|
||||
.context("building OCR HTTP client")?;
|
||||
Ok(Self {
|
||||
@@ -375,6 +385,7 @@ mod tests {
|
||||
"m",
|
||||
vec!["eng".into(), "kor".into()],
|
||||
1024,
|
||||
300,
|
||||
)
|
||||
.unwrap();
|
||||
let p = engine.build_prompt(Some(&Lang("ko".into())));
|
||||
@@ -389,6 +400,7 @@ mod tests {
|
||||
"m",
|
||||
vec!["eng".into()],
|
||||
1024,
|
||||
300,
|
||||
)
|
||||
.unwrap();
|
||||
let p = engine.build_prompt(Some(&Lang("und".into())));
|
||||
@@ -400,7 +412,7 @@ mod tests {
|
||||
/// the constructor cannot drift to "silently accept a bad config".
|
||||
#[test]
|
||||
fn build_rejects_empty_endpoint() {
|
||||
let r = OllamaVisionOcr::from_parts("", "m", vec![], 1024);
|
||||
let r = OllamaVisionOcr::from_parts("", "m", vec![], 1024, 300);
|
||||
let err = r.expect_err("empty endpoint must bail").to_string();
|
||||
assert!(
|
||||
err.contains("endpoint is empty"),
|
||||
@@ -413,7 +425,7 @@ mod tests {
|
||||
/// so testing `from_parts` covers both.
|
||||
#[test]
|
||||
fn build_rejects_empty_model_after_trim() {
|
||||
let r = OllamaVisionOcr::from_parts("http://x", " ", vec![], 1024);
|
||||
let r = OllamaVisionOcr::from_parts("http://x", " ", vec![], 1024, 300);
|
||||
let err = r.expect_err("empty model must bail").to_string();
|
||||
assert!(
|
||||
err.contains("model is empty"),
|
||||
@@ -428,10 +440,10 @@ mod tests {
|
||||
#[test]
|
||||
fn build_clamps_max_pixels_outside_legal_range() {
|
||||
let too_small =
|
||||
OllamaVisionOcr::from_parts("http://x", "m", vec![], 1).unwrap();
|
||||
OllamaVisionOcr::from_parts("http://x", "m", vec![], 1, 300).unwrap();
|
||||
assert_eq!(too_small.max_pixels(), MIN_LONG_EDGE);
|
||||
let too_big =
|
||||
OllamaVisionOcr::from_parts("http://x", "m", vec![], u32::MAX).unwrap();
|
||||
OllamaVisionOcr::from_parts("http://x", "m", vec![], u32::MAX, 300).unwrap();
|
||||
assert_eq!(too_big.max_pixels(), MAX_LONG_EDGE);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -322,7 +322,8 @@ async fn ocr_downscales_large_image_before_sending() {
|
||||
#[test]
|
||||
fn from_parts_clamps_max_pixels_into_legal_range() {
|
||||
// Below MIN_LONG_EDGE — bumped up to the floor.
|
||||
let too_small = OllamaVisionOcr::from_parts("http://x", "m", vec![], 10).unwrap();
|
||||
let too_small =
|
||||
OllamaVisionOcr::from_parts("http://x", "m", vec![], 10, 300).unwrap();
|
||||
assert_eq!(
|
||||
too_small.max_pixels(),
|
||||
256,
|
||||
@@ -331,7 +332,7 @@ fn from_parts_clamps_max_pixels_into_legal_range() {
|
||||
|
||||
// Above MAX_LONG_EDGE — capped at the ceiling.
|
||||
let too_big =
|
||||
OllamaVisionOcr::from_parts("http://x", "m", vec![], 99_999).unwrap();
|
||||
OllamaVisionOcr::from_parts("http://x", "m", vec![], 99_999, 300).unwrap();
|
||||
assert_eq!(
|
||||
too_big.max_pixels(),
|
||||
4096,
|
||||
@@ -339,7 +340,8 @@ fn from_parts_clamps_max_pixels_into_legal_range() {
|
||||
);
|
||||
|
||||
// Inside the legal range — pass through untouched.
|
||||
let in_range = OllamaVisionOcr::from_parts("http://x", "m", vec![], 1024).unwrap();
|
||||
let in_range =
|
||||
OllamaVisionOcr::from_parts("http://x", "m", vec![], 1024, 300).unwrap();
|
||||
assert_eq!(in_range.max_pixels(), 1024);
|
||||
}
|
||||
|
||||
|
||||
@@ -162,18 +162,53 @@ impl Retriever for LexicalRetriever {
|
||||
|
||||
/// Translate a user-typed query into an FTS5 match string.
|
||||
///
|
||||
/// Rules (from the task spec):
|
||||
/// v0.17.0 — trigram-aware redesign (see design §5.5 + plan
|
||||
/// `docs/superpowers/plans/2026-05-22-korean-trigram-tokenizer.md`
|
||||
/// Task A5). The FTS5 tokenizer is `trigram` so any term shorter than
|
||||
/// three Unicode chars has no index entry and would zero out an AND
|
||||
/// branch. Korean compounds typically split into 2-char eojeols (e.g.
|
||||
/// `해시 충돌`), so a naive token AND drops the dominant usage pattern.
|
||||
///
|
||||
/// - The query is wrapped in a single pair of `'...'` → strip the quotes
|
||||
/// and pass the inner text through verbatim. The user has explicitly
|
||||
/// opted into FTS5 syntax (e.g. `'rust AND cargo'`, `'foo*'`).
|
||||
/// post-v0.17.1 dogfood — `text` column filter (closure of HOTFIXES
|
||||
/// 2026-05-24 `heading_path_json` 노이즈). The `chunks_fts` virtual
|
||||
/// table indexes both `heading_path` (the JSON-serialized
|
||||
/// `chunks.heading_path_json` per V002/V007 triggers) and `text`. Under
|
||||
/// the trigram tokenizer the JSON punctuation (`[`, `"`, `,`) plus the
|
||||
/// path segments (`app`, `src`, …) become indexable 3-grams, so a
|
||||
/// query can hit a chunk purely because its file's heading JSON shares
|
||||
/// a path segment with the query — false positives that have no body
|
||||
/// relevance. The default match expression therefore scopes to the
|
||||
/// `text` column. The `heading_path` column stays indexed (V007 / §5.5
|
||||
/// verbatim block is preserved) so a user who *wants* heading matching
|
||||
/// can opt in via raw mode (`'heading_path : foo'`).
|
||||
///
|
||||
/// - Otherwise: split on whitespace, escape every token by wrapping it
|
||||
/// in `"..."` (FTS5 string literal), with any inner `"` doubled. Join
|
||||
/// with spaces — FTS5 default operator is implicit AND.
|
||||
/// Rules:
|
||||
///
|
||||
/// - An empty / whitespace-only token list → return `None` (caller
|
||||
/// short-circuits to `Ok(vec![])`).
|
||||
/// - Raw mode (unchanged): the query is wrapped in a single pair of
|
||||
/// `'...'` → strip the quotes and pass the inner text through verbatim.
|
||||
/// The user has explicitly opted into FTS5 syntax (e.g.
|
||||
/// `'rust AND cargo'`, `'foo*'`, `'heading_path : agent'`). No column
|
||||
/// scoping is applied — the raw expression is honored as-is.
|
||||
///
|
||||
/// - Otherwise build up to two MATCH candidates:
|
||||
/// 1. **whole-phrase**: the entire trimmed input wrapped as one FTS5
|
||||
/// string literal, *only* if it has ≥3 Unicode chars. FTS5 treats
|
||||
/// a quoted string with spaces as a phrase match.
|
||||
/// 2. **token AND**: whitespace-split tokens, kept only when each has
|
||||
/// ≥3 Unicode chars (shorter ones are dropped — they would zero
|
||||
/// out the AND under trigram).
|
||||
///
|
||||
/// - Combine: `(whole) OR (token_and)` when both exist *and differ*;
|
||||
/// either alone when only one exists; `None` when neither exists
|
||||
/// (caller short-circuits to `Ok(vec![])`, avoiding an FTS5 syntax
|
||||
/// error from an empty MATCH).
|
||||
///
|
||||
/// - A single-token long query (`러스트`, `foo`) yields `whole == token_and`
|
||||
/// → return the bare quoted form so the OR doesn't duplicate.
|
||||
///
|
||||
/// - Finally wrap the combined expression in `text : (<expr>)` so the
|
||||
/// match is scoped to the body column. FTS5's column-filter syntax
|
||||
/// accepts an arbitrary OR/AND sub-expression inside the parens.
|
||||
fn build_match_string(text: &str) -> Option<String> {
|
||||
let trimmed = text.trim();
|
||||
if trimmed.is_empty() {
|
||||
@@ -186,15 +221,29 @@ fn build_match_string(text: &str) -> Option<String> {
|
||||
}
|
||||
return Some(inner_trim.to_string());
|
||||
}
|
||||
let tokens: Vec<String> = trimmed
|
||||
.split_whitespace()
|
||||
.map(escape_fts5_token)
|
||||
.collect();
|
||||
if tokens.is_empty() {
|
||||
None
|
||||
} else {
|
||||
Some(tokens.join(" "))
|
||||
}
|
||||
|
||||
const MIN_TRIGRAM_CHARS: usize = 3;
|
||||
|
||||
let whole_candidate: Option<String> = (trimmed.chars().count() >= MIN_TRIGRAM_CHARS)
|
||||
.then(|| escape_fts5_token(trimmed));
|
||||
|
||||
let token_and_candidate: Option<String> = {
|
||||
let toks: Vec<String> = trimmed
|
||||
.split_whitespace()
|
||||
.filter(|t| t.chars().count() >= MIN_TRIGRAM_CHARS)
|
||||
.map(escape_fts5_token)
|
||||
.collect();
|
||||
(!toks.is_empty()).then(|| toks.join(" "))
|
||||
};
|
||||
|
||||
let expression = match (whole_candidate, token_and_candidate) {
|
||||
(None, None) => return None,
|
||||
(Some(w), None) => w,
|
||||
(None, Some(a)) => a,
|
||||
(Some(w), Some(a)) if w == a => w,
|
||||
(Some(w), Some(a)) => format!("({w}) OR ({a})"),
|
||||
};
|
||||
Some(format!("text : ({expression})"))
|
||||
}
|
||||
|
||||
/// Return `Some(inner)` if `s` is wrapped in a matching pair of single
|
||||
@@ -555,39 +604,109 @@ mod tests {
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn build_match_string_default_is_quoted_and_anded() {
|
||||
fn build_match_string_default_emits_or_of_phrase_and_and() {
|
||||
// Two long tokens: both whole-phrase and token-AND candidates
|
||||
// exist and differ, so the builder combines them with OR
|
||||
// inside a `text : (...)` column filter (post-v0.17.1 dogfood:
|
||||
// text-only scoping to avoid heading_path_json false positives).
|
||||
let s = build_match_string("rust cargo").unwrap();
|
||||
// Two tokens, each quoted, joined by a space (implicit AND).
|
||||
assert_eq!(s, r#""rust" "cargo""#);
|
||||
assert_eq!(s, r#"text : (("rust cargo") OR ("rust" "cargo"))"#);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn build_match_string_escapes_special_chars() {
|
||||
// `*`, `(`, `)`, `:`, `^`, `"` should all be wrapped inside
|
||||
// FTS5 string-literal quotes so they're treated as literal
|
||||
// text rather than FTS5 operators.
|
||||
// text rather than FTS5 operators. Every token is ≥3 chars,
|
||||
// so both the whole-phrase and token-AND candidates exist,
|
||||
// wrapped in the `text : (...)` column filter.
|
||||
let s = build_match_string(r#"foo* (bar) baz:qux ^head he"llo"#).unwrap();
|
||||
assert_eq!(
|
||||
s,
|
||||
r#""foo*" "(bar)" "baz:qux" "^head" "he""llo""#
|
||||
r#"text : (("foo* (bar) baz:qux ^head he""llo") OR ("foo*" "(bar)" "baz:qux" "^head" "he""llo"))"#
|
||||
);
|
||||
// The doubled `""` is FTS5's way of embedding a literal quote
|
||||
// inside a string literal.
|
||||
// inside a string literal. Appears in both whole-phrase and
|
||||
// token-AND halves.
|
||||
assert!(s.contains(r#"he""llo"#));
|
||||
// Sanity: every special character lives between matching `"`
|
||||
// delimiters — there is no bare-token (unquoted) span anywhere.
|
||||
// We check this by confirming the string starts and ends with `"`
|
||||
// and the count of unescaped `"` is even (each token is wrapped).
|
||||
assert!(s.starts_with('"') && s.ends_with('"'));
|
||||
// Sanity: outermost wrapper is the column filter.
|
||||
assert!(s.starts_with("text : ("));
|
||||
assert!(s.ends_with(')'));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn build_match_string_passthrough_when_single_quoted() {
|
||||
// The FTS5 expression is preserved verbatim.
|
||||
// Raw mode bypasses column scoping — the FTS5 expression is
|
||||
// preserved verbatim, including any explicit column filter
|
||||
// (e.g. `'heading_path : foo'`) the user opts into.
|
||||
let s = build_match_string("'foo OR bar*'").unwrap();
|
||||
assert_eq!(s, "foo OR bar*");
|
||||
}
|
||||
|
||||
/// Raw mode preserves an explicit `heading_path :` column filter
|
||||
/// — opt-in path for users who deliberately want heading matching
|
||||
/// (post-v0.17.1 dogfood default scopes to `text` only).
|
||||
#[test]
|
||||
fn build_match_string_raw_mode_preserves_heading_filter() {
|
||||
let s = build_match_string("'heading_path : agent'").unwrap();
|
||||
assert_eq!(s, "heading_path : agent");
|
||||
assert!(!s.starts_with("text : "));
|
||||
}
|
||||
|
||||
// ── v0.17.0 trigram-aware redesign coverage ──────────────────────────
|
||||
|
||||
/// 2-char Korean query (`충돌`) yields neither a whole-phrase nor a
|
||||
/// token-AND candidate → `None`. Caller short-circuits to an empty
|
||||
/// hit list rather than executing an FTS5 syntax error on `""` MATCH.
|
||||
#[test]
|
||||
fn build_match_string_short_korean_returns_none() {
|
||||
assert!(build_match_string("충돌").is_none());
|
||||
assert!(build_match_string("키").is_none());
|
||||
assert!(build_match_string(" 충돌 ").is_none());
|
||||
}
|
||||
|
||||
/// `해시 충돌` — both tokens are 2 chars (dropped from the AND), but
|
||||
/// the whole-phrase candidate (`"해시 충돌"`, 5 chars total) survives.
|
||||
/// This is the dominant Korean usage pattern targeted by A5.
|
||||
/// The whole-phrase candidate is then wrapped in the `text : (...)`
|
||||
/// column filter.
|
||||
#[test]
|
||||
fn build_match_string_whole_phrase_only_when_all_tokens_short() {
|
||||
let s = build_match_string("해시 충돌").unwrap();
|
||||
assert_eq!(s, r#"text : ("해시 충돌")"#);
|
||||
}
|
||||
|
||||
/// Single long token: whole-phrase and token-AND candidates collapse
|
||||
/// to the same string. The builder returns the bare quoted form so
|
||||
/// the MATCH expression doesn't carry a redundant `(x) OR (x)`,
|
||||
/// wrapped in `text : (...)`.
|
||||
#[test]
|
||||
fn build_match_string_single_long_token_no_duplicate_or() {
|
||||
assert_eq!(build_match_string("러스트").unwrap(), r#"text : ("러스트")"#);
|
||||
assert_eq!(build_match_string("rust").unwrap(), r#"text : ("rust")"#);
|
||||
}
|
||||
|
||||
/// Mixed Korean+English multi-token query where every token is ≥3
|
||||
/// chars: both candidates exist and differ, OR-combined inside
|
||||
/// `text : (...)`.
|
||||
#[test]
|
||||
fn build_match_string_mixed_lang_emits_or_of_phrase_and_and() {
|
||||
let s = build_match_string("Rust 충돌은").unwrap();
|
||||
assert_eq!(s, r#"text : (("Rust 충돌은") OR ("Rust" "충돌은"))"#);
|
||||
}
|
||||
|
||||
/// One ≥3 token + one <3 token: short token is dropped from the
|
||||
/// AND, leaving a single long token there; whole-phrase exists
|
||||
/// independently. Both candidates differ → OR-combined inside
|
||||
/// `text : (...)`.
|
||||
#[test]
|
||||
fn build_match_string_drops_short_token_in_and_keeps_whole() {
|
||||
// "키" (1 char) dropped from AND; "해시테이블" (5 chars) kept.
|
||||
// Whole phrase "키 해시테이블" (7 chars) keeps the short token.
|
||||
let s = build_match_string("키 해시테이블").unwrap();
|
||||
assert_eq!(s, r#"text : (("키 해시테이블") OR ("해시테이블"))"#);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn normalize_bm25_top_score_in_unit_interval() {
|
||||
// A "perfect" hit is bm25 = -1.0 → normalized 0.5.
|
||||
|
||||
@@ -19,9 +19,9 @@
|
||||
"indexed_at": "2024-01-01T00:00:00Z",
|
||||
"rank": 1,
|
||||
"retrieval": {
|
||||
"fusion_score": 1.4490997273242101e-6,
|
||||
"fusion_score": 1.4615362715630908e-6,
|
||||
"lexical_rank": 1,
|
||||
"lexical_score": 1.4490997273242101e-6,
|
||||
"lexical_score": 1.4615362715630908e-6,
|
||||
"method": "lexical",
|
||||
"vector_rank": null,
|
||||
"vector_score": null
|
||||
@@ -51,9 +51,9 @@
|
||||
"indexed_at": "2024-01-01T00:00:00Z",
|
||||
"rank": 2,
|
||||
"retrieval": {
|
||||
"fusion_score": 9.641424867368187e-7,
|
||||
"fusion_score": 9.207039965986041e-7,
|
||||
"lexical_rank": 2,
|
||||
"lexical_score": 9.641424867368187e-7,
|
||||
"lexical_score": 9.207039965986041e-7,
|
||||
"method": "lexical",
|
||||
"vector_rank": null,
|
||||
"vector_score": null
|
||||
|
||||
@@ -1060,3 +1060,99 @@ fn lexical_snapshot_run_1() {
|
||||
let expected: serde_json::Value = serde_json::from_str(&baseline_text).unwrap();
|
||||
assert_eq!(actual, expected, "lexical run-1 snapshot drift");
|
||||
}
|
||||
|
||||
// ── post-v0.17.1 dogfood — `text` column filter ──────────────────────────
|
||||
|
||||
/// Heading-only token (unique to `chunks.heading_path_json`, absent
|
||||
/// from `chunks.text`) must NOT hit in default mode after the column
|
||||
/// filter clamp. Pins HOTFIXES 2026-05-24 closure — the JSON
|
||||
/// punctuation + path segments in `heading_path_json` are no longer
|
||||
/// matchable from a plain query.
|
||||
#[test]
|
||||
fn lexical_heading_only_token_does_not_hit_default_mode() {
|
||||
let env = Env::new();
|
||||
let conn = env.raw_conn();
|
||||
insert_document(
|
||||
&conn,
|
||||
&id32("d"),
|
||||
"notes/heading-only.md",
|
||||
"Heading-only fixture",
|
||||
"en",
|
||||
"primary",
|
||||
&[],
|
||||
);
|
||||
insert_chunk(
|
||||
&conn,
|
||||
&id32("c1"),
|
||||
&id32("d"),
|
||||
"bravo charlie delta echo",
|
||||
&["kubernetes-agent-controller"],
|
||||
Some("Heading"),
|
||||
r#"[{"kind":"line","start":1,"end":2}]"#,
|
||||
"v1",
|
||||
);
|
||||
drop(conn);
|
||||
|
||||
let r = env.retriever();
|
||||
let hits = r
|
||||
.search(&SearchQuery {
|
||||
// "kubernetes-agent-controller" is in heading_path only.
|
||||
text: "kubernetes-agent-controller".to_string(),
|
||||
mode: SearchMode::Lexical,
|
||||
k: 10,
|
||||
filters: SearchFilters::default(),
|
||||
})
|
||||
.unwrap();
|
||||
assert!(
|
||||
hits.is_empty(),
|
||||
"heading-only token must not hit text column; got {} hits",
|
||||
hits.len()
|
||||
);
|
||||
}
|
||||
|
||||
/// Raw mode (`'heading_path : <token>'`) is the opt-in escape hatch
|
||||
/// for users who deliberately want heading-column matching after the
|
||||
/// default text-only clamp. The same fixture that 0-hits in default
|
||||
/// mode must hit when the user explicitly scopes to `heading_path`.
|
||||
#[test]
|
||||
fn lexical_raw_mode_can_opt_into_heading_path_filter() {
|
||||
let env = Env::new();
|
||||
let conn = env.raw_conn();
|
||||
insert_document(
|
||||
&conn,
|
||||
&id32("d"),
|
||||
"notes/heading-only.md",
|
||||
"Heading-only fixture",
|
||||
"en",
|
||||
"primary",
|
||||
&[],
|
||||
);
|
||||
insert_chunk(
|
||||
&conn,
|
||||
&id32("c1"),
|
||||
&id32("d"),
|
||||
"bravo charlie delta echo",
|
||||
&["kubernetes-agent-controller"],
|
||||
Some("Heading"),
|
||||
r#"[{"kind":"line","start":1,"end":2}]"#,
|
||||
"v1",
|
||||
);
|
||||
drop(conn);
|
||||
|
||||
let r = env.retriever();
|
||||
let hits = r
|
||||
.search(&SearchQuery {
|
||||
// Raw mode: outer single quotes opt out of column-filter
|
||||
// wrapping and pass the FTS5 expression through verbatim.
|
||||
text: "'heading_path : \"kubernetes-agent-controller\"'".to_string(),
|
||||
mode: SearchMode::Lexical,
|
||||
k: 10,
|
||||
filters: SearchFilters::default(),
|
||||
})
|
||||
.unwrap();
|
||||
assert_eq!(
|
||||
hits.len(),
|
||||
1,
|
||||
"raw-mode heading_path filter must hit the seeded chunk"
|
||||
);
|
||||
}
|
||||
|
||||
@@ -18,6 +18,7 @@ blake3 = { workspace = true }
|
||||
tracing = { workspace = true }
|
||||
walkdir = "2"
|
||||
ignore = "0.4"
|
||||
globset = "0.4"
|
||||
|
||||
[dev-dependencies]
|
||||
serde_json = { workspace = true }
|
||||
|
||||
@@ -86,7 +86,7 @@ impl FsSourceConnector {
|
||||
excludes.extend(scope.exclude.iter().cloned());
|
||||
let kbignore = read_kbignore(&root)?;
|
||||
|
||||
let overrides = build_overrides(&root, &excludes, &kbignore)?;
|
||||
let overrides = build_overrides(&root, &excludes, &kbignore, &scope.include)?;
|
||||
Ok((root, overrides))
|
||||
}
|
||||
|
||||
@@ -103,8 +103,6 @@ impl FsSourceConnector {
|
||||
) -> Result<(Vec<RawAsset>, FsScanSkips)> {
|
||||
let (root, overrides) = self.resolve_scan_params(scope)?;
|
||||
|
||||
log_scope_include_warning(scope);
|
||||
|
||||
let (files, skipped_entries) = walk_files_with_skips(&root, &overrides)?;
|
||||
|
||||
// Accumulate per-category skip counts and sample paths.
|
||||
@@ -284,14 +282,6 @@ fn build_assets(
|
||||
Ok(assets)
|
||||
}
|
||||
|
||||
fn log_scope_include_warning(scope: &SourceScope) {
|
||||
if !scope.include.is_empty() {
|
||||
tracing::debug!(
|
||||
count = scope.include.len(),
|
||||
"FsSourceConnector ignores scope.include — handled by extractor router"
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
impl SourceConnector for FsSourceConnector {
|
||||
fn scan(&self, scope: &SourceScope) -> Result<Vec<RawAsset>> {
|
||||
|
||||
@@ -12,6 +12,12 @@ use kebab_core::{AudioType, ImageType, MediaType};
|
||||
/// `MediaType::Image(_)` / `MediaType::Audio(_)`. Anything else (including
|
||||
/// missing extension) → `MediaType::Other(ext)`.
|
||||
pub(crate) fn media_type_for(path: &Path) -> MediaType {
|
||||
// p10-2: code_lang_for_path is the single source of truth for code lang
|
||||
// (design §3.5). Delegate before falling back to extension branches.
|
||||
if let Some(lang) = kebab_parse_code::code_lang_for_path(path) {
|
||||
return MediaType::Code(lang.to_string());
|
||||
}
|
||||
|
||||
let ext = path
|
||||
.extension()
|
||||
.and_then(|s| s.to_str())
|
||||
@@ -19,7 +25,9 @@ pub(crate) fn media_type_for(path: &Path) -> MediaType {
|
||||
.unwrap_or_default();
|
||||
|
||||
match ext.as_str() {
|
||||
"md" => MediaType::Markdown,
|
||||
// Markdown + MDX (markdown + JSX, treated as plain markdown — the
|
||||
// JSX islands are folded into raw passthrough by the md parser).
|
||||
"md" | "mdx" => MediaType::Markdown,
|
||||
"pdf" => MediaType::Pdf,
|
||||
|
||||
"png" => MediaType::Image(ImageType::Png),
|
||||
@@ -34,15 +42,6 @@ pub(crate) fn media_type_for(path: &Path) -> MediaType {
|
||||
"flac" => MediaType::Audio(AudioType::Flac),
|
||||
"ogg" => MediaType::Audio(AudioType::Ogg),
|
||||
|
||||
// p10-1A-2: Rust is the only code lang activated in 1A. Other
|
||||
// recognized code langs stay Other until their phase (1B+).
|
||||
"rs" => MediaType::Code("rust".to_string()),
|
||||
|
||||
// p10-1B: Python / TS / JS AST chunkers active.
|
||||
"py" | "pyi" => MediaType::Code("python".into()),
|
||||
"ts" | "tsx" => MediaType::Code("typescript".into()),
|
||||
"js" | "mjs" | "cjs" | "jsx" => MediaType::Code("javascript".into()),
|
||||
|
||||
// Empty string (no extension) and any other extension: bucket as
|
||||
// Other and let downstream extractors decide if they support it.
|
||||
_ => MediaType::Other(ext),
|
||||
@@ -86,7 +85,8 @@ mod tests {
|
||||
media_type_for(Path::new("crates/kebab-core/src/lib.rs")),
|
||||
MediaType::Code("rust".to_string())
|
||||
);
|
||||
assert_eq!(media_type_for(Path::new("Cargo.toml")), MediaType::Other("toml".to_string()));
|
||||
// Cargo.toml is a Tier 2 code manifest (p10-2), handled by code_lang_for_path
|
||||
assert_eq!(media_type_for(Path::new("Cargo.toml")), MediaType::Code("toml".to_string()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
@@ -102,6 +102,32 @@ mod tests {
|
||||
assert_eq!(media_type_for(Path::new("a/b.rs")), MediaType::Code("rust".into()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn ts_variants_mts_cts() {
|
||||
// .mts / .cts are TypeScript ESM / CommonJS — same grammar as .ts.
|
||||
assert_eq!(media_type_for(Path::new("a/b.mts")), MediaType::Code("typescript".into()));
|
||||
assert_eq!(media_type_for(Path::new("a/b.cts")), MediaType::Code("typescript".into()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn mdx_routes_to_markdown() {
|
||||
// MDX is markdown with JSX islands; the md parser folds the JSX
|
||||
// through as raw passthrough.
|
||||
assert_eq!(media_type_for(Path::new("docs/page.mdx")), MediaType::Markdown);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn go_files_map_to_media_code_go() {
|
||||
assert_eq!(media_type_for(Path::new("a/b.go")), MediaType::Code("go".into()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn java_kotlin_files_map_to_media_code() {
|
||||
assert_eq!(media_type_for(Path::new("a/b.java")), MediaType::Code("java".into()));
|
||||
assert_eq!(media_type_for(Path::new("a/b.kt")), MediaType::Code("kotlin".into()));
|
||||
assert_eq!(media_type_for(Path::new("a/b.kts")), MediaType::Code("kotlin".into()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn unknown_and_missing_extension() {
|
||||
assert_eq!(
|
||||
@@ -113,4 +139,14 @@ mod tests {
|
||||
MediaType::Other(String::new())
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn tier2_files_map_to_media_code() {
|
||||
assert_eq!(media_type_for(Path::new("a/deploy.yaml")), MediaType::Code("yaml".into()));
|
||||
assert_eq!(media_type_for(Path::new("a/Dockerfile")), MediaType::Code("dockerfile".into()));
|
||||
assert_eq!(media_type_for(Path::new("a/Cargo.toml")), MediaType::Code("toml".into()));
|
||||
assert_eq!(media_type_for(Path::new("a/pom.xml")), MediaType::Code("xml".into()));
|
||||
assert_eq!(media_type_for(Path::new("a/build.gradle")), MediaType::Code("groovy".into()));
|
||||
assert_eq!(media_type_for(Path::new("a/go.mod")), MediaType::Code("go-mod".into()));
|
||||
}
|
||||
}
|
||||
|
||||
@@ -44,6 +44,7 @@ use std::collections::HashSet;
|
||||
use std::path::{Path, PathBuf};
|
||||
|
||||
use anyhow::{Context, Result};
|
||||
use globset::{GlobBuilder, GlobSet, GlobSetBuilder};
|
||||
use ignore::overrides::{Override, OverrideBuilder};
|
||||
use walkdir::WalkDir;
|
||||
|
||||
@@ -69,6 +70,11 @@ const DEFAULT_EXCLUDES: &[&str] = &[
|
||||
///
|
||||
/// `default_and_config` covers DEFAULT_EXCLUDES + `config.workspace.exclude`
|
||||
/// — these do NOT map to any of the three named `IngestReport` counters.
|
||||
///
|
||||
/// `include` is the compiled `scope.include` allow-list. When the set is
|
||||
/// empty (no patterns) every file passes; when non-empty a file must match
|
||||
/// at least one pattern to be accepted (directories always pass, so the
|
||||
/// walker can still descend into them).
|
||||
pub(crate) struct WalkOverrides {
|
||||
/// Merged matcher — same as today's `Override`; used for the walk decision.
|
||||
pub combined: Override,
|
||||
@@ -78,6 +84,8 @@ pub(crate) struct WalkOverrides {
|
||||
pub kebabignore: Override,
|
||||
/// Matcher built from `kebab_parse_code::BUILTIN_BLACKLIST` only.
|
||||
pub builtin: Override,
|
||||
/// Compiled allow-list from `scope.include`. Empty set = pass all.
|
||||
pub include: GlobSet,
|
||||
}
|
||||
|
||||
/// Skip attribution category. Used by the connector when counting per-source
|
||||
@@ -161,10 +169,15 @@ fn build_single_matcher_owned(root: &Path, patterns: &[String]) -> Result<Overri
|
||||
/// The three per-source matchers (`gitignore`, `kebabignore`, `builtin`) are
|
||||
/// built in addition to the combined one so the connector can attribute skips
|
||||
/// to the correct `IngestReport` counter without a second walker pass.
|
||||
///
|
||||
/// `include_patterns` (from `scope.include`) are compiled into an allow-list
|
||||
/// `GlobSet`. Empty slice → pass-all (backward-compat); non-empty → file
|
||||
/// must match at least one pattern to be accepted.
|
||||
pub(crate) fn build_overrides(
|
||||
root: &Path,
|
||||
config_exclude: &[String],
|
||||
kbignore_patterns: &[String],
|
||||
include_patterns: &[String],
|
||||
) -> Result<WalkOverrides> {
|
||||
let gitignore_patterns = read_gitignore(root)?;
|
||||
|
||||
@@ -209,14 +222,41 @@ pub(crate) fn build_overrides(
|
||||
.build()
|
||||
.context("failed to compile combined override set")?;
|
||||
|
||||
// Allow-list GlobSet: empty Vec → matches nothing (= pass all); non-empty
|
||||
// → file must match at least one glob to be accepted. We compile with
|
||||
// `case_insensitive=false` to keep the semantics consistent with the
|
||||
// OverrideBuilder exclude patterns above.
|
||||
let include = build_include_globset(include_patterns)?;
|
||||
|
||||
Ok(WalkOverrides {
|
||||
combined,
|
||||
gitignore,
|
||||
kebabignore,
|
||||
builtin,
|
||||
include,
|
||||
})
|
||||
}
|
||||
|
||||
/// Compile `scope.include` patterns into a `GlobSet` allow-list.
|
||||
///
|
||||
/// Each pattern uses `GlobBuilder` with `literal_separator = true` so that
|
||||
/// `**` can cross directory boundaries while `*` stops at `/`, matching the
|
||||
/// gitignore convention used throughout the rest of the walker.
|
||||
///
|
||||
/// An empty slice produces an empty `GlobSet` — callers interpret that as
|
||||
/// "pass all files" (no allow-list constraint).
|
||||
fn build_include_globset(patterns: &[String]) -> Result<GlobSet> {
|
||||
let mut builder = GlobSetBuilder::new();
|
||||
for pat in patterns {
|
||||
let glob = GlobBuilder::new(pat)
|
||||
.literal_separator(true)
|
||||
.build()
|
||||
.with_context(|| format!("invalid include pattern: {pat}"))?;
|
||||
builder.add(glob);
|
||||
}
|
||||
builder.build().context("failed to compile include globset")
|
||||
}
|
||||
|
||||
/// Classify why a path was excluded, using per-source matchers in spec §5.2
|
||||
/// priority order: built-in > gitignore > kebabignore > other.
|
||||
///
|
||||
@@ -391,6 +431,13 @@ pub(crate) fn walk_files_with_skips(
|
||||
}
|
||||
|
||||
if entry.file_type().is_file() {
|
||||
// Apply include allow-list: if non-empty, the file's path
|
||||
// relative to root must match at least one pattern.
|
||||
if !overrides.include.is_empty() && !overrides.include.is_match(rel) {
|
||||
// Not in the allow-list — silently drop (no skip counter;
|
||||
// the include filter is not a "skip" source in IngestReport).
|
||||
continue;
|
||||
}
|
||||
accepted.push(path.to_path_buf());
|
||||
}
|
||||
}
|
||||
@@ -406,7 +453,7 @@ mod tests {
|
||||
#[test]
|
||||
fn empty_inputs_compile_into_an_override() {
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let ov = build_overrides(dir.path(), &[], &[]).unwrap();
|
||||
let ov = build_overrides(dir.path(), &[], &[], &[]).unwrap();
|
||||
// Default-excludes only; non-special files should not match.
|
||||
let m = ov.combined.matched(Path::new("notes/alpha.md"), false);
|
||||
assert!(!m.is_ignore());
|
||||
@@ -415,7 +462,7 @@ mod tests {
|
||||
#[test]
|
||||
fn default_excludes_ds_store_and_resource_forks() {
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let ov = build_overrides(dir.path(), &[], &[]).unwrap();
|
||||
let ov = build_overrides(dir.path(), &[], &[], &[]).unwrap();
|
||||
assert!(ov.combined.matched(Path::new(".DS_Store"), false).is_ignore());
|
||||
assert!(
|
||||
ov.combined.matched(Path::new("notes/.DS_Store"), false).is_ignore()
|
||||
@@ -433,6 +480,7 @@ mod tests {
|
||||
dir.path(),
|
||||
&["*.tmp".to_string(), "node_modules/**".to_string()],
|
||||
&[],
|
||||
&[],
|
||||
)
|
||||
.unwrap();
|
||||
assert!(ov.combined.matched(Path::new("a.tmp"), false).is_ignore());
|
||||
@@ -452,6 +500,7 @@ mod tests {
|
||||
dir.path(),
|
||||
&["*.tmp".to_string()],
|
||||
&["secret/**".to_string()],
|
||||
&[],
|
||||
)
|
||||
.unwrap();
|
||||
assert!(ov.combined.matched(Path::new("a.tmp"), false).is_ignore());
|
||||
@@ -491,7 +540,7 @@ mod tests {
|
||||
fs::write(root.join("src/main.rs"), "x").unwrap();
|
||||
fs::write(root.join("node_modules/foo/bar.js"), "x").unwrap();
|
||||
|
||||
let overrides = build_overrides(root, &[], &[]).unwrap();
|
||||
let overrides = build_overrides(root, &[], &[], &[]).unwrap();
|
||||
// Override::matched expects paths relative to the builder's root.
|
||||
let m_in = overrides.combined.matched(Path::new("src/main.rs"), false);
|
||||
let m_out = overrides.combined.matched(Path::new("node_modules/foo/bar.js"), false);
|
||||
@@ -514,7 +563,7 @@ mod tests {
|
||||
fs::create_dir_all(root.join("ok")).unwrap();
|
||||
fs::write(root.join("ok/z.txt"), "z").unwrap();
|
||||
|
||||
let overrides = build_overrides(root, &[], &[]).unwrap();
|
||||
let overrides = build_overrides(root, &[], &[], &[]).unwrap();
|
||||
// Override::matched expects paths relative to the builder's root.
|
||||
for blacklisted in [
|
||||
"target/x/y.txt",
|
||||
@@ -544,7 +593,7 @@ mod tests {
|
||||
fs::create_dir_all(root.join("dist")).unwrap();
|
||||
fs::write(root.join("dist/bundle.js"), "x").unwrap();
|
||||
|
||||
let overrides = build_overrides(root, &[], &[]).unwrap();
|
||||
let overrides = build_overrides(root, &[], &[], &[]).unwrap();
|
||||
assert!(overrides.combined.matched(Path::new("a.log"), false).is_ignore());
|
||||
assert!(overrides.combined.matched(Path::new("dist/bundle.js"), false).is_ignore());
|
||||
assert!(!overrides.combined.matched(Path::new("src/main.rs"), false).is_ignore());
|
||||
@@ -562,7 +611,7 @@ mod tests {
|
||||
fs::write(root.join("src/main.rs"), "x").unwrap();
|
||||
|
||||
// No .gitignore present — patterns from .gitignore should not affect overrides.
|
||||
let overrides = build_overrides(root, &[], &[]).unwrap();
|
||||
let overrides = build_overrides(root, &[], &[], &[]).unwrap();
|
||||
assert!(!overrides.combined.matched(Path::new("a.log"), false).is_ignore());
|
||||
assert!(!overrides.combined.matched(Path::new("src/main.rs"), false).is_ignore());
|
||||
}
|
||||
@@ -577,7 +626,7 @@ mod tests {
|
||||
// semantics, but at minimum it must not produce double-`!` corruption.
|
||||
fs::write(root.join(".gitignore"), "!keep/\n").unwrap();
|
||||
// Just verify build_overrides doesn't error.
|
||||
let result = build_overrides(root, &[], &[]);
|
||||
let result = build_overrides(root, &[], &[], &[]);
|
||||
assert!(result.is_ok(), "should not error on negation pattern: {:?}", result.err());
|
||||
}
|
||||
|
||||
@@ -594,7 +643,7 @@ mod tests {
|
||||
// .gitignore entry. Builtin must win (priority order §5.2).
|
||||
fs::write(root.join(".gitignore"), "node_modules/\n").unwrap();
|
||||
|
||||
let ov = build_overrides(root, &[], &[]).unwrap();
|
||||
let ov = build_overrides(root, &[], &[], &[]).unwrap();
|
||||
// node_modules/ dir itself
|
||||
let cat = classify_skip(Path::new("node_modules"), true, &ov);
|
||||
assert_eq!(cat, SkipCategory::BuiltinBlacklist, "builtin must have priority");
|
||||
@@ -609,7 +658,7 @@ mod tests {
|
||||
let root = tmp.path();
|
||||
fs::write(root.join(".gitignore"), "*.log\n").unwrap();
|
||||
|
||||
let ov = build_overrides(root, &[], &[]).unwrap();
|
||||
let ov = build_overrides(root, &[], &[], &[]).unwrap();
|
||||
let cat = classify_skip(Path::new("app.log"), false, &ov);
|
||||
assert_eq!(cat, SkipCategory::Gitignore);
|
||||
}
|
||||
@@ -621,7 +670,7 @@ mod tests {
|
||||
let tmp = TempDir::new().unwrap();
|
||||
let root = tmp.path();
|
||||
|
||||
let ov = build_overrides(root, &[], &["*.secret".to_string()]).unwrap();
|
||||
let ov = build_overrides(root, &[], &["*.secret".to_string()], &[]).unwrap();
|
||||
let cat = classify_skip(Path::new("creds.secret"), false, &ov);
|
||||
assert_eq!(cat, SkipCategory::Kebabignore);
|
||||
}
|
||||
@@ -637,7 +686,7 @@ mod tests {
|
||||
fs::write(root.join("ok.md"), "# ok").unwrap();
|
||||
fs::write(root.join("skipme.log"), "x").unwrap();
|
||||
|
||||
let ov = build_overrides(root, &[], &[]).unwrap();
|
||||
let ov = build_overrides(root, &[], &[], &[]).unwrap();
|
||||
let (accepted, skipped_entries) = walk_files_with_skips(root, &ov).unwrap();
|
||||
|
||||
let accepted_names: Vec<_> = accepted
|
||||
@@ -677,7 +726,7 @@ mod tests {
|
||||
fs::write(root.join("node_modules/foo/bar.js"), "x").unwrap();
|
||||
fs::write(root.join("ok.md"), "# ok").unwrap();
|
||||
|
||||
let ov = build_overrides(root, &[], &[]).unwrap();
|
||||
let ov = build_overrides(root, &[], &[], &[]).unwrap();
|
||||
let (accepted, skipped_entries) = walk_files_with_skips(root, &ov).unwrap();
|
||||
|
||||
let accepted_names: Vec<_> = accepted
|
||||
|
||||
111
crates/kebab-source-fs/tests/include_allowlist.rs
Normal file
111
crates/kebab-source-fs/tests/include_allowlist.rs
Normal file
@@ -0,0 +1,111 @@
|
||||
//! Integration test: `scope.include` enforces an allow-list.
|
||||
//!
|
||||
//! Semantics (gitignore convention):
|
||||
//! - `include` is empty Vec → all files pass through (backward-compat).
|
||||
//! - `include` is non-empty → only files matching at least one pattern
|
||||
//! are accepted. `exclude` rules still apply after include.
|
||||
//!
|
||||
//! Layout (built per-test in a TempDir):
|
||||
//! root/
|
||||
//! ├── a.md
|
||||
//! ├── b.py
|
||||
//! ├── c.png
|
||||
//! └── d.pdf
|
||||
|
||||
use std::fs;
|
||||
|
||||
use kebab_config::Config;
|
||||
use kebab_core::{SourceConnector, SourceScope};
|
||||
use kebab_source_fs::FsSourceConnector;
|
||||
|
||||
fn cfg_with_root(root: &str) -> Config {
|
||||
let mut c = Config::defaults();
|
||||
c.workspace.root = root.to_string();
|
||||
c.workspace.exclude.clear();
|
||||
// Disable size / generated caps so small test files always pass.
|
||||
c.ingest.code.max_file_bytes = u64::MAX;
|
||||
c.ingest.code.max_file_lines = u32::MAX;
|
||||
c.ingest.code.skip_generated_header = false;
|
||||
c
|
||||
}
|
||||
|
||||
fn setup_mixed_dir() -> tempfile::TempDir {
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let root = dir.path();
|
||||
fs::write(root.join("a.md"), b"md").unwrap();
|
||||
fs::write(root.join("b.py"), b"py").unwrap();
|
||||
fs::write(root.join("c.png"), b"\x89PNG").unwrap();
|
||||
fs::write(root.join("d.pdf"), b"%PDF").unwrap();
|
||||
dir
|
||||
}
|
||||
|
||||
/// Empty include → all 4 files pass (backward-compat).
|
||||
#[test]
|
||||
fn include_empty_accepts_all_files() {
|
||||
let dir = setup_mixed_dir();
|
||||
let conn = FsSourceConnector::new(&cfg_with_root(dir.path().to_str().unwrap())).unwrap();
|
||||
let scope = SourceScope {
|
||||
include: vec![],
|
||||
..SourceScope::default()
|
||||
};
|
||||
let assets = conn.scan(&scope).unwrap();
|
||||
let names: Vec<_> = assets.iter().map(|a| a.workspace_path.0.clone()).collect();
|
||||
assert!(names.contains(&"a.md".to_string()), "a.md missing; got: {names:?}");
|
||||
assert!(names.contains(&"b.py".to_string()), "b.py missing; got: {names:?}");
|
||||
assert!(names.contains(&"c.png".to_string()), "c.png missing; got: {names:?}");
|
||||
assert!(names.contains(&"d.pdf".to_string()), "d.pdf missing; got: {names:?}");
|
||||
assert_eq!(names.len(), 4, "expected exactly 4 files; got: {names:?}");
|
||||
}
|
||||
|
||||
/// Non-empty include → only md + py come back; png + pdf are excluded.
|
||||
#[test]
|
||||
fn include_nonempty_is_allowlist() {
|
||||
let dir = setup_mixed_dir();
|
||||
let conn = FsSourceConnector::new(&cfg_with_root(dir.path().to_str().unwrap())).unwrap();
|
||||
let scope = SourceScope {
|
||||
include: vec!["**/*.md".to_string(), "**/*.py".to_string()],
|
||||
..SourceScope::default()
|
||||
};
|
||||
let assets = conn.scan(&scope).unwrap();
|
||||
let names: Vec<_> = assets.iter().map(|a| a.workspace_path.0.clone()).collect();
|
||||
assert!(names.contains(&"a.md".to_string()), "a.md should be accepted; got: {names:?}");
|
||||
assert!(names.contains(&"b.py".to_string()), "b.py should be accepted; got: {names:?}");
|
||||
assert!(
|
||||
!names.contains(&"c.png".to_string()),
|
||||
"c.png must be rejected by include allowlist; got: {names:?}"
|
||||
);
|
||||
assert!(
|
||||
!names.contains(&"d.pdf".to_string()),
|
||||
"d.pdf must be rejected by include allowlist; got: {names:?}"
|
||||
);
|
||||
assert_eq!(names.len(), 2, "expected exactly 2 files; got: {names:?}");
|
||||
}
|
||||
|
||||
/// include + exclude are ANDed: a file matching include but also matching
|
||||
/// exclude must be rejected.
|
||||
#[test]
|
||||
fn include_and_exclude_are_anded() {
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let root = dir.path();
|
||||
fs::write(root.join("keep.md"), b"keep").unwrap();
|
||||
fs::write(root.join("drop.md"), b"drop").unwrap();
|
||||
fs::write(root.join("other.py"), b"py").unwrap();
|
||||
|
||||
let conn = FsSourceConnector::new(&cfg_with_root(root.to_str().unwrap())).unwrap();
|
||||
let scope = SourceScope {
|
||||
include: vec!["**/*.md".to_string()],
|
||||
exclude: vec!["drop.md".to_string()],
|
||||
..SourceScope::default()
|
||||
};
|
||||
let assets = conn.scan(&scope).unwrap();
|
||||
let names: Vec<_> = assets.iter().map(|a| a.workspace_path.0.clone()).collect();
|
||||
assert!(names.contains(&"keep.md".to_string()), "keep.md should be accepted; got: {names:?}");
|
||||
assert!(
|
||||
!names.contains(&"drop.md".to_string()),
|
||||
"drop.md should be excluded (matched exclude); got: {names:?}"
|
||||
);
|
||||
assert!(
|
||||
!names.contains(&"other.py".to_string()),
|
||||
"other.py should be excluded (not in include); got: {names:?}"
|
||||
);
|
||||
}
|
||||
@@ -56,5 +56,6 @@
|
||||
"skipped_kebabignore": 0,
|
||||
"skipped_size_exceeded": 0,
|
||||
"unchanged": 0,
|
||||
"purged_deleted_files": 0,
|
||||
"updated": 1
|
||||
}
|
||||
|
||||
@@ -264,6 +264,28 @@ impl kebab_core::DocumentStore for SqliteStore {
|
||||
}))
|
||||
}
|
||||
|
||||
fn get_asset(
|
||||
&self,
|
||||
id: &kebab_core::AssetId,
|
||||
) -> Result<Option<kebab_core::RawAsset>> {
|
||||
let conn = self.lock_conn();
|
||||
let result = conn.query_row(
|
||||
r#"SELECT
|
||||
asset_id, source_uri, workspace_path, media_type,
|
||||
byte_len, checksum, storage_kind, storage_path,
|
||||
discovered_at
|
||||
FROM assets
|
||||
WHERE asset_id = ?"#,
|
||||
rusqlite::params![id.0.as_str()],
|
||||
asset_from_row,
|
||||
);
|
||||
match result {
|
||||
Ok(asset) => Ok(Some(asset)),
|
||||
Err(rusqlite::Error::QueryReturnedNoRows) => Ok(None),
|
||||
Err(e) => Err(e.into()),
|
||||
}
|
||||
}
|
||||
|
||||
fn get_asset_by_workspace_path(
|
||||
&self,
|
||||
path: &kebab_core::WorkspacePath,
|
||||
@@ -286,6 +308,88 @@ impl kebab_core::DocumentStore for SqliteStore {
|
||||
}
|
||||
}
|
||||
|
||||
fn get_document_by_workspace_path(
|
||||
&self,
|
||||
path: &kebab_core::WorkspacePath,
|
||||
) -> Result<Option<kebab_core::CanonicalDocument>> {
|
||||
let conn = self.lock_conn();
|
||||
let row: Option<DocumentRow> = conn
|
||||
.query_row(
|
||||
"SELECT
|
||||
doc_id, asset_id, workspace_path, title, lang,
|
||||
source_type, trust_level, parser_version,
|
||||
doc_version, schema_version, metadata_json,
|
||||
provenance_json, created_at, updated_at,
|
||||
last_chunker_version, last_embedding_version
|
||||
FROM documents WHERE workspace_path = ?",
|
||||
params![path.0],
|
||||
document_row_from_sql,
|
||||
)
|
||||
.map(Some)
|
||||
.or_else(rows_optional)
|
||||
.map_err(StoreError::from)?;
|
||||
let Some(row) = row else { return Ok(None) };
|
||||
|
||||
let doc_id = kebab_core::DocumentId(row.doc_id.clone());
|
||||
let mut blocks_stmt = conn
|
||||
.prepare(
|
||||
"SELECT payload_json FROM blocks
|
||||
WHERE doc_id = ? ORDER BY ordinal ASC",
|
||||
)
|
||||
.map_err(StoreError::from)?;
|
||||
let block_rows = blocks_stmt
|
||||
.query_map(params![row.doc_id], |r| {
|
||||
let payload_json: String = r.get(0)?;
|
||||
Ok(payload_json)
|
||||
})
|
||||
.map_err(StoreError::from)?;
|
||||
let mut blocks: Vec<kebab_core::Block> = Vec::new();
|
||||
for block_row in block_rows {
|
||||
let payload_json = block_row.map_err(StoreError::from)?;
|
||||
let block: kebab_core::Block = serde_json::from_str(&payload_json)
|
||||
.context("deserialize block payload_json")?;
|
||||
blocks.push(block);
|
||||
}
|
||||
|
||||
let metadata: kebab_core::Metadata = serde_json::from_str(&row.metadata_json)
|
||||
.context("deserialize metadata_json")?;
|
||||
let provenance: kebab_core::Provenance =
|
||||
serde_json::from_str(&row.provenance_json)
|
||||
.context("deserialize provenance_json")?;
|
||||
|
||||
Ok(Some(kebab_core::CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: kebab_core::AssetId(row.asset_id),
|
||||
workspace_path: kebab_core::WorkspacePath(row.workspace_path),
|
||||
title: row.title.unwrap_or_default(),
|
||||
lang: kebab_core::Lang(row.lang.unwrap_or_default()),
|
||||
blocks,
|
||||
metadata,
|
||||
provenance,
|
||||
parser_version: kebab_core::ParserVersion(row.parser_version),
|
||||
schema_version: row.schema_version as u32,
|
||||
doc_version: row.doc_version as u32,
|
||||
last_chunker_version: row.last_chunker_version.map(kebab_core::ChunkerVersion),
|
||||
last_embedding_version: row.last_embedding_version.map(kebab_core::EmbeddingVersion),
|
||||
}))
|
||||
}
|
||||
|
||||
fn all_workspace_paths(&self) -> Result<Vec<kebab_core::WorkspacePath>> {
|
||||
let conn = self.lock_conn();
|
||||
let mut stmt = conn
|
||||
.prepare("SELECT workspace_path FROM documents")
|
||||
.map_err(StoreError::from)?;
|
||||
let rows = stmt
|
||||
.query_map([], |r| r.get::<_, String>(0))
|
||||
.map_err(StoreError::from)?;
|
||||
let mut out = Vec::new();
|
||||
for row in rows {
|
||||
let path = row.map_err(StoreError::from)?;
|
||||
out.push(kebab_core::WorkspacePath(path));
|
||||
}
|
||||
Ok(out)
|
||||
}
|
||||
|
||||
fn list_documents(
|
||||
&self,
|
||||
filter: &kebab_core::DocFilter,
|
||||
@@ -550,7 +654,8 @@ fn rows_optional<T>(err: rusqlite::Error) -> rusqlite::Result<Option<T>> {
|
||||
|
||||
/// Reconstruct a [`kebab_core::RawAsset`] from one `assets` row.
|
||||
/// Row mapper for `RawAsset`. Column names are self-documenting; the
|
||||
/// SELECT in [`DocumentStore::get_asset_by_workspace_path`] must include
|
||||
/// SELECTs in [`DocumentStore::get_asset`] and
|
||||
/// [`DocumentStore::get_asset_by_workspace_path`] must both include
|
||||
/// all nine columns by their schema names.
|
||||
fn asset_from_row(row: &rusqlite::Row<'_>) -> rusqlite::Result<kebab_core::RawAsset> {
|
||||
use std::path::PathBuf;
|
||||
|
||||
@@ -35,4 +35,4 @@ pub use error::StoreError;
|
||||
pub use eval::{EvalQueryResultRecord, EvalRunRecord, EvalRunRow};
|
||||
pub use fts::rebuild_chunks_fts;
|
||||
pub use jobs::IngestRunRow;
|
||||
pub use store::{CountSummary, NotIndexed, SqliteStore};
|
||||
pub use store::{CountSummary, NotIndexed, SqliteStore, purge_deleted_workspace_path};
|
||||
|
||||
@@ -464,6 +464,74 @@ impl SqliteStore {
|
||||
}
|
||||
Ok(out)
|
||||
}
|
||||
|
||||
/// v0.17.0 PR-B: sister of [`Self::stale_chunk_ids_at`] for the
|
||||
/// `parser_version` bump cascade. When `doc_id` depends on
|
||||
/// `parser_version` (design §9) and an extractor ships a new
|
||||
/// `PARSER_VERSION`, the next ingest computes a fresh `doc_id` for
|
||||
/// the *same* `(workspace_path, asset_id)` pair. The existing
|
||||
/// asset_id-keyed [`Self::stale_chunk_ids_at`] does NOT fire (same
|
||||
/// asset), so the legacy `chunks` rows and their LanceDB shadows
|
||||
/// would orphan. This helper queries by `workspace_path` instead,
|
||||
/// excluding the freshly-computed `keep_doc_id` so a re-entry
|
||||
/// during the same ingest doesn't re-sweep the new row.
|
||||
///
|
||||
/// Caller usage: pass the *new* `doc_id` if known; pass an empty
|
||||
/// string when called before the new INSERT (the case in
|
||||
/// `try_skip_unchanged`) — all existing docs at `workspace_path`
|
||||
/// are then collected as stale.
|
||||
pub fn stale_chunk_ids_for_workspace_path_except_doc_id(
|
||||
&self,
|
||||
workspace_path: &str,
|
||||
keep_doc_id: &str,
|
||||
) -> Result<Vec<kebab_core::ChunkId>> {
|
||||
let conn = self.lock_conn();
|
||||
let mut stmt = conn
|
||||
.prepare(
|
||||
"SELECT c.chunk_id
|
||||
FROM chunks c
|
||||
INNER JOIN documents d ON c.doc_id = d.doc_id
|
||||
WHERE d.workspace_path = ?1 AND d.doc_id != ?2",
|
||||
)
|
||||
.map_err(StoreError::from)?;
|
||||
let rows = stmt
|
||||
.query_map(params![workspace_path, keep_doc_id], |row| {
|
||||
row.get::<_, String>(0)
|
||||
})
|
||||
.map_err(StoreError::from)?;
|
||||
let mut out: Vec<kebab_core::ChunkId> = Vec::new();
|
||||
for row in rows {
|
||||
let id = row.map_err(StoreError::from)?;
|
||||
out.push(kebab_core::ChunkId(id));
|
||||
}
|
||||
Ok(out)
|
||||
}
|
||||
|
||||
/// v0.17.0 PR-B: sweep the SQLite document chain (`documents` →
|
||||
/// `blocks` / `chunks` / `embedding_records` via CASCADE) for every
|
||||
/// row at `workspace_path` whose `doc_id` differs from `keep_doc_id`.
|
||||
/// Pair with [`Self::stale_chunk_ids_for_workspace_path_except_doc_id`]
|
||||
/// — caller fetches the chunk_ids first, hands them to
|
||||
/// `VectorStore::delete_by_chunk_ids`, then calls this sweep.
|
||||
/// `assets` row is preserved (same bytes, same asset_id — only the
|
||||
/// derived `doc_id` changed).
|
||||
///
|
||||
/// `keep_doc_id = ""` deletes every doc at `workspace_path`
|
||||
/// (semantics mirror the sister helper above — used by
|
||||
/// `try_skip_unchanged` before the new INSERT exists).
|
||||
pub fn purge_document_at_workspace_path_except_doc_id(
|
||||
&self,
|
||||
workspace_path: &str,
|
||||
keep_doc_id: &str,
|
||||
) -> Result<()> {
|
||||
let conn = self.lock_conn();
|
||||
conn.execute(
|
||||
"DELETE FROM documents WHERE workspace_path = ?1 AND doc_id != ?2",
|
||||
params![workspace_path, keep_doc_id],
|
||||
)
|
||||
.map_err(StoreError::from)?;
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
/// Sweep stale `assets` + `documents` + downstream rows when the file
|
||||
@@ -540,10 +608,132 @@ pub(crate) fn purge_orphan_at_workspace_path(
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Purge all stored data for a document whose on-disk file has been
|
||||
/// deleted (as opposed to content-changed, which is handled by
|
||||
/// `purge_orphan_at_workspace_path`).
|
||||
///
|
||||
/// Returns the `chunk_id`s that were associated with the document so
|
||||
/// the caller can issue a matching `VectorStore::delete_by_chunk_ids`
|
||||
/// on the LanceDB side.
|
||||
///
|
||||
/// Deletion order:
|
||||
/// 1. Collect chunk_ids (before cascade removes them).
|
||||
/// 2. DELETE the `documents` row → CASCADE clears `blocks`, `chunks`,
|
||||
/// `embedding_records`.
|
||||
/// 3. DELETE the `assets` row **only if no other document still
|
||||
/// references it** (twin-file protection — `assets` can be shared
|
||||
/// across identical-content files via the blake3 PK).
|
||||
/// 4. If the asset was `storage_kind = 'copied'`, best-effort delete
|
||||
/// the on-disk byte file at `storage_path`.
|
||||
///
|
||||
/// Returns `Ok(vec![])` when no document exists at `workspace_path`
|
||||
/// (idempotent — caller doesn't need to pre-check).
|
||||
pub fn purge_deleted_workspace_path(
|
||||
store: &SqliteStore,
|
||||
workspace_path: &kebab_core::WorkspacePath,
|
||||
) -> anyhow::Result<Vec<kebab_core::ChunkId>> {
|
||||
let conn = store.lock_conn();
|
||||
|
||||
// Look up the document + its asset_id.
|
||||
let doc_row: Option<(String, String)> = conn
|
||||
.query_row(
|
||||
"SELECT doc_id, asset_id FROM documents WHERE workspace_path = ?",
|
||||
rusqlite::params![workspace_path.0],
|
||||
|r| Ok((r.get(0)?, r.get(1)?)),
|
||||
)
|
||||
.optional()
|
||||
.map_err(StoreError::from)?;
|
||||
|
||||
let Some((doc_id, asset_id)) = doc_row else {
|
||||
return Ok(Vec::new());
|
||||
};
|
||||
|
||||
// 1. Collect chunk_ids before CASCADE removes them.
|
||||
let mut stmt = conn
|
||||
.prepare("SELECT chunk_id FROM chunks WHERE doc_id = ?")
|
||||
.map_err(StoreError::from)?;
|
||||
let rows = stmt
|
||||
.query_map(rusqlite::params![doc_id], |r| r.get::<_, String>(0))
|
||||
.map_err(StoreError::from)?;
|
||||
let chunk_ids: Vec<kebab_core::ChunkId> = rows
|
||||
.map(|r| r.map(kebab_core::ChunkId))
|
||||
.collect::<rusqlite::Result<Vec<_>>>()
|
||||
.map_err(StoreError::from)?;
|
||||
drop(stmt);
|
||||
|
||||
// 2. DELETE the document row (CASCADE clears blocks / chunks /
|
||||
// embedding_records via the FK constraints in V001).
|
||||
conn.execute(
|
||||
"DELETE FROM documents WHERE doc_id = ?",
|
||||
rusqlite::params![doc_id],
|
||||
)
|
||||
.map_err(StoreError::from)?;
|
||||
|
||||
// 3. Delete the asset row only when no other document still
|
||||
// references it (twin-file safety: two files with identical
|
||||
// bytes share a single asset row via the blake3 PK).
|
||||
let remaining_refs: i64 = conn
|
||||
.query_row(
|
||||
"SELECT COUNT(*) FROM documents WHERE asset_id = ?",
|
||||
rusqlite::params![asset_id],
|
||||
|r| r.get(0),
|
||||
)
|
||||
.map_err(StoreError::from)?;
|
||||
|
||||
if remaining_refs == 0 {
|
||||
// 4. Capture storage details before deleting the row.
|
||||
let asset_storage: Option<(String, String)> = conn
|
||||
.query_row(
|
||||
"SELECT storage_kind, storage_path FROM assets WHERE asset_id = ?",
|
||||
rusqlite::params![asset_id],
|
||||
|r| Ok((r.get(0)?, r.get(1)?)),
|
||||
)
|
||||
.optional()
|
||||
.map_err(StoreError::from)?;
|
||||
|
||||
conn.execute(
|
||||
"DELETE FROM assets WHERE asset_id = ?",
|
||||
rusqlite::params![asset_id],
|
||||
)
|
||||
.map_err(StoreError::from)?;
|
||||
|
||||
// 5. Best-effort: remove the on-disk copied asset file.
|
||||
if let Some((storage_kind, storage_path)) = asset_storage {
|
||||
if storage_kind == "copied" {
|
||||
let _ = std::fs::remove_file(&storage_path);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-store-sqlite",
|
||||
workspace_path = %workspace_path.0,
|
||||
doc_id = %doc_id,
|
||||
chunk_count = chunk_ids.len(),
|
||||
"purged deleted-file document from store"
|
||||
);
|
||||
|
||||
Ok(chunk_ids)
|
||||
}
|
||||
|
||||
/// UPSERT a row into `assets`. Used by both the `put_asset_with_bytes`
|
||||
/// path (which has bytes + computed `storage_kind/path`) and the
|
||||
/// `DocumentStore::put_asset` path (which only has the `RawAsset` and
|
||||
/// reads `storage_kind/path` from `asset.stored`).
|
||||
///
|
||||
/// **`assets.workspace_path` is "last-registered path" semantics for
|
||||
/// twin files** (two source files with identical content share one
|
||||
/// `assets` row keyed on `asset_id = blake3(content)`). Each ingest
|
||||
/// of either twin overwrites `workspace_path` with whichever path was
|
||||
/// seen most recently — this is intentional and correct after PR #146
|
||||
/// made `try_skip_unchanged` document-centric (uses
|
||||
/// `get_document_by_workspace_path`, not `get_asset_by_workspace_path`)
|
||||
/// and PR #149 made `reset --orphans-only` document-centric too.
|
||||
/// Do NOT "fix" the flip-flop by adding a UNIQUE constraint on
|
||||
/// `workspace_path` in the `assets` table — twin de-dup is load-bearing.
|
||||
/// When you need media_type for a known document, use the 2-step lookup
|
||||
/// `get_document_by_workspace_path` → `doc.source_asset_id` →
|
||||
/// `get_asset(asset_id)` so the result is twin-safe.
|
||||
pub(crate) fn upsert_asset_row(
|
||||
conn: &Connection,
|
||||
asset: &kebab_core::RawAsset,
|
||||
@@ -701,6 +891,78 @@ impl SqliteStore {
|
||||
}
|
||||
Ok(out)
|
||||
}
|
||||
|
||||
/// v0.17.0 PR-C: per-code-language **chunk** count for
|
||||
/// `schema.v1.stats`. Companion to [`Self::code_lang_breakdown`] —
|
||||
/// that one returns *document* counts. Stats observers wanting
|
||||
/// indexing-pressure granularity (a single PDF spec → 200 chunks,
|
||||
/// vs a single Rust file → 5 chunks) need the chunk-level view.
|
||||
///
|
||||
/// SQL joins `chunks → documents`, reads
|
||||
/// `metadata_json->'$.code_lang'` on the doc side, groups by the
|
||||
/// language, and skips rows where `code_lang IS NULL`. Returns
|
||||
/// `BTreeMap<String, u32>` mirroring the doc-count helper above
|
||||
/// so callers can serialize both with the same shape.
|
||||
pub fn code_lang_chunk_breakdown(
|
||||
&self,
|
||||
) -> anyhow::Result<std::collections::BTreeMap<String, u32>> {
|
||||
use anyhow::Context;
|
||||
let conn = self.read_conn();
|
||||
let mut stmt = conn
|
||||
.prepare(
|
||||
"SELECT json_extract(d.metadata_json, '$.code_lang') AS cl, \
|
||||
COUNT(c.chunk_id) \
|
||||
FROM chunks c \
|
||||
INNER JOIN documents d ON c.doc_id = d.doc_id \
|
||||
WHERE cl IS NOT NULL \
|
||||
GROUP BY cl",
|
||||
)
|
||||
.context("prepare code_lang_chunk_breakdown")?;
|
||||
let rows = stmt
|
||||
.query_map([], |r| {
|
||||
Ok((r.get::<_, String>(0)?, r.get::<_, i64>(1)? as u32))
|
||||
})
|
||||
.context("query code_lang_chunk_breakdown")?;
|
||||
let mut out = std::collections::BTreeMap::new();
|
||||
for row in rows {
|
||||
let (k, v) = row.context("read code_lang_chunk_breakdown row")?;
|
||||
out.insert(k, v);
|
||||
}
|
||||
Ok(out)
|
||||
}
|
||||
|
||||
/// p10-1A-2 follow-up (dogfooding 2026-05-20): per-repo doc count for
|
||||
/// `schema.v1`.
|
||||
///
|
||||
/// Reads `metadata_json->'$.repo'`, groups by the value, and skips rows
|
||||
/// where `repo` is NULL (documents without an explicit repo tag).
|
||||
/// Returns `BTreeMap<String, u32>` — key is the repo name as stored in
|
||||
/// frontmatter, value is the doc count.
|
||||
pub fn repo_breakdown(
|
||||
&self,
|
||||
) -> anyhow::Result<std::collections::BTreeMap<String, u32>> {
|
||||
use anyhow::Context;
|
||||
let conn = self.read_conn();
|
||||
let mut stmt = conn
|
||||
.prepare(
|
||||
"SELECT json_extract(metadata_json, '$.repo') AS rp, COUNT(*) \
|
||||
FROM documents \
|
||||
WHERE rp IS NOT NULL \
|
||||
GROUP BY rp",
|
||||
)
|
||||
.context("prepare repo_breakdown")?;
|
||||
let rows = stmt
|
||||
.query_map([], |r| {
|
||||
Ok((r.get::<_, String>(0)?, r.get::<_, i64>(1)? as u32))
|
||||
})
|
||||
.context("query repo_breakdown")?;
|
||||
let mut out = std::collections::BTreeMap::new();
|
||||
for row in rows {
|
||||
let (k, v) = row.context("read repo_breakdown row")?;
|
||||
out.insert(k, v);
|
||||
}
|
||||
Ok(out)
|
||||
}
|
||||
}
|
||||
|
||||
/// Apply the design §5 / task-spec pragmas. Called once per connection.
|
||||
@@ -817,5 +1079,181 @@ mod tests {
|
||||
// only one key total
|
||||
assert_eq!(bd.len(), 1, "expected exactly 1 entry, got: {bd:?}");
|
||||
}
|
||||
|
||||
/// v0.17.0 PR-C: `code_lang_chunk_breakdown` counts *chunks* (not
|
||||
/// docs) grouped by `documents.metadata_json.code_lang`. Differs
|
||||
/// from `code_lang_breakdown` (doc count) by joining `chunks` and
|
||||
/// summing chunk rows so one Rust file with 3 chunks reports
|
||||
/// `rust=3` here vs `rust=1` in the doc-count helper.
|
||||
///
|
||||
/// Uses a side rusqlite connection (FK enforcement off) so a single
|
||||
/// doc + multiple chunks fixture can be inserted without standing
|
||||
/// up `assets` companions.
|
||||
#[test]
|
||||
fn code_lang_chunk_breakdown_counts_chunks_not_docs() {
|
||||
let (dir, store) = open_fresh_store();
|
||||
let db_path = dir.path().join("kebab.sqlite");
|
||||
let conn = rusqlite::Connection::open(&db_path).unwrap();
|
||||
conn.pragma_update(None, "foreign_keys", "OFF").unwrap();
|
||||
|
||||
// 1 Rust doc + 3 chunks → chunk_breakdown rust=3 / doc_breakdown rust=1.
|
||||
conn.execute(
|
||||
"INSERT INTO documents (
|
||||
doc_id, asset_id, workspace_path,
|
||||
source_type, trust_level, parser_version,
|
||||
doc_version, schema_version,
|
||||
metadata_json, provenance_json,
|
||||
created_at, updated_at
|
||||
) VALUES (
|
||||
'doc-rust-1', 'asset-1', 'src/main.rs',
|
||||
'reference', 'primary', 'test-v1',
|
||||
1, 1,
|
||||
'{\"code_lang\":\"rust\"}', '{}',
|
||||
'2024-01-01T00:00:00Z', '2024-01-01T00:00:00Z'
|
||||
)",
|
||||
[],
|
||||
)
|
||||
.unwrap();
|
||||
for i in 0..3u32 {
|
||||
conn.execute(
|
||||
"INSERT INTO chunks (
|
||||
chunk_id, doc_id, text, heading_path_json, section_label,
|
||||
source_spans_json, token_estimate, chunker_version,
|
||||
policy_hash, block_ids_json, created_at
|
||||
) VALUES (?, 'doc-rust-1', ?, '[]', NULL, '[]', 0, 'cv1', 'h', '[]', '2024-01-01T00:00:00Z')",
|
||||
rusqlite::params![format!("rust-chunk-{i:0>26}"), format!("body {i}")],
|
||||
)
|
||||
.unwrap();
|
||||
}
|
||||
|
||||
// 1 markdown doc + 1 chunk → code_lang = null → must be skipped.
|
||||
conn.execute(
|
||||
"INSERT INTO documents (
|
||||
doc_id, asset_id, workspace_path,
|
||||
source_type, trust_level, parser_version,
|
||||
doc_version, schema_version,
|
||||
metadata_json, provenance_json,
|
||||
created_at, updated_at
|
||||
) VALUES (
|
||||
'doc-md-1', 'asset-2', 'notes/readme.md',
|
||||
'markdown', 'primary', 'test-v1',
|
||||
1, 1,
|
||||
'{\"code_lang\":null}', '{}',
|
||||
'2024-01-01T00:00:00Z', '2024-01-01T00:00:00Z'
|
||||
)",
|
||||
[],
|
||||
)
|
||||
.unwrap();
|
||||
conn.execute(
|
||||
"INSERT INTO chunks (
|
||||
chunk_id, doc_id, text, heading_path_json, section_label,
|
||||
source_spans_json, token_estimate, chunker_version,
|
||||
policy_hash, block_ids_json, created_at
|
||||
) VALUES ('md-chunk-00000000000000000000000', 'doc-md-1', 'm', '[]', NULL, '[]', 0, 'cv1', 'h', '[]', '2024-01-01T00:00:00Z')",
|
||||
[],
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
drop(conn);
|
||||
|
||||
let chunk_bd = store.code_lang_chunk_breakdown().unwrap();
|
||||
assert_eq!(
|
||||
chunk_bd.get("rust"),
|
||||
Some(&3u32),
|
||||
"expected rust=3 chunks (1 doc × 3 chunks): {chunk_bd:?}"
|
||||
);
|
||||
assert!(
|
||||
!chunk_bd.contains_key("null"),
|
||||
"null code_lang must be skipped: {chunk_bd:?}"
|
||||
);
|
||||
assert_eq!(
|
||||
chunk_bd.len(),
|
||||
1,
|
||||
"expected exactly 1 language entry: {chunk_bd:?}"
|
||||
);
|
||||
|
||||
// Sanity: the existing doc-count helper still returns 1 for rust,
|
||||
// proving the two metrics differ as intended.
|
||||
let doc_bd = store.code_lang_breakdown().unwrap();
|
||||
assert_eq!(
|
||||
doc_bd.get("rust"),
|
||||
Some(&1u32),
|
||||
"doc-count helper unchanged: {doc_bd:?}"
|
||||
);
|
||||
}
|
||||
|
||||
/// p10-1A-2 follow-up: `repo_breakdown` counts docs by
|
||||
/// `metadata_json.repo`.
|
||||
///
|
||||
/// Inserts:
|
||||
/// - one doc with `repo = "my-repo"` → must appear with count 1
|
||||
/// - one doc with `repo = null` → must NOT appear (NULL skipped)
|
||||
///
|
||||
/// Uses a side rusqlite connection that bypasses the `assets` FK via
|
||||
/// `PRAGMA foreign_keys = OFF` so the test is self-contained.
|
||||
#[test]
|
||||
fn repo_breakdown_counts_by_repo() {
|
||||
let (dir, store) = open_fresh_store();
|
||||
|
||||
let db_path = dir.path().join("kebab.sqlite");
|
||||
let conn = rusqlite::Connection::open(&db_path).unwrap();
|
||||
conn.pragma_update(None, "foreign_keys", "OFF").unwrap();
|
||||
|
||||
// Doc 1: doc with repo = "my-repo"
|
||||
conn.execute(
|
||||
"INSERT INTO documents (
|
||||
doc_id, asset_id, workspace_path,
|
||||
source_type, trust_level, parser_version,
|
||||
doc_version, schema_version,
|
||||
metadata_json, provenance_json,
|
||||
created_at, updated_at
|
||||
) VALUES (
|
||||
'doc-repo-1', 'asset-r1', 'my-repo/README.md',
|
||||
'markdown', 'primary', 'test-v1',
|
||||
1, 1,
|
||||
'{\"repo\":\"my-repo\"}', '{}',
|
||||
'2024-01-01T00:00:00Z', '2024-01-01T00:00:00Z'
|
||||
)",
|
||||
[],
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
// Doc 2: doc with repo absent (null in JSON)
|
||||
conn.execute(
|
||||
"INSERT INTO documents (
|
||||
doc_id, asset_id, workspace_path,
|
||||
source_type, trust_level, parser_version,
|
||||
doc_version, schema_version,
|
||||
metadata_json, provenance_json,
|
||||
created_at, updated_at
|
||||
) VALUES (
|
||||
'doc-norepo-1', 'asset-r2', 'standalone/notes.md',
|
||||
'markdown', 'primary', 'test-v1',
|
||||
1, 1,
|
||||
'{\"repo\":null}', '{}',
|
||||
'2024-01-01T00:00:00Z', '2024-01-01T00:00:00Z'
|
||||
)",
|
||||
[],
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
drop(conn); // release side connection before querying via store
|
||||
|
||||
let bd = store.repo_breakdown().unwrap();
|
||||
|
||||
// "my-repo" must appear with count 1
|
||||
assert_eq!(
|
||||
bd.get("my-repo"),
|
||||
Some(&1u32),
|
||||
"expected my-repo=1 in repo_breakdown, got: {bd:?}"
|
||||
);
|
||||
// null repo must NOT appear as any key
|
||||
assert!(
|
||||
!bd.contains_key("null"),
|
||||
"null repo must not appear in breakdown, got: {bd:?}"
|
||||
);
|
||||
// only one key total
|
||||
assert_eq!(bd.len(), 1, "expected exactly 1 entry, got: {bd:?}");
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -370,17 +370,19 @@ fn extract_design_5_5_fts_block() -> String {
|
||||
fts_slice[..last_end + "END;".len()].to_string()
|
||||
}
|
||||
|
||||
/// Extract the §5.5 verbatim block from the V002 migration, between the
|
||||
/// `── §5.5 verbatim block ──` anchor markers the file already carries.
|
||||
/// Extract the §5.5 verbatim block from the V007 migration (replaced V002
|
||||
/// 's unicode61 tokenizer with trigram — V002 stays in place for
|
||||
/// historical cold-upgrade replay but V007 is now the source of truth),
|
||||
/// between the `── §5.5 verbatim block ──` anchor markers V007 carries.
|
||||
fn extract_migration_5_5_verbatim_block() -> String {
|
||||
let migration = include_str!("../../../migrations/V002__fts.sql");
|
||||
let migration = include_str!("../../../migrations/V007__fts_trigram.sql");
|
||||
// The opening anchor line ends with `── §5.5 verbatim block ─...`.
|
||||
let open_marker = "§5.5 verbatim block";
|
||||
let close_marker = "End §5.5 verbatim block";
|
||||
|
||||
let open_idx = migration
|
||||
.find(open_marker)
|
||||
.expect("V002 must carry the `§5.5 verbatim block` opening anchor");
|
||||
.expect("V007 must carry the `§5.5 verbatim block` opening anchor");
|
||||
let after_open_line = open_idx
|
||||
+ migration[open_idx..]
|
||||
.find('\n')
|
||||
@@ -389,7 +391,7 @@ fn extract_migration_5_5_verbatim_block() -> String {
|
||||
|
||||
let close_idx = migration[after_open_line..]
|
||||
.find(close_marker)
|
||||
.expect("V002 must carry the `End §5.5 verbatim block` closing anchor")
|
||||
.expect("V007 must carry the `End §5.5 verbatim block` closing anchor")
|
||||
+ after_open_line;
|
||||
// Walk back from the close marker to the start of its comment line.
|
||||
let close_line_start = migration[..close_idx]
|
||||
@@ -400,12 +402,14 @@ fn extract_migration_5_5_verbatim_block() -> String {
|
||||
migration[after_open_line..close_line_start].to_string()
|
||||
}
|
||||
|
||||
/// CI diff guard: the §5.5 block in `migrations/V002__fts.sql` must
|
||||
/// match the design doc verbatim (whitespace-normalized). If the
|
||||
/// design doc moves the section, renames the heading, or edits the
|
||||
/// SQL, this test fails first. Same for migration drift.
|
||||
/// CI diff guard: the §5.5 block in `migrations/V007__fts_trigram.sql`
|
||||
/// must match the design doc verbatim (whitespace-normalized). V007
|
||||
/// replaced V002 's unicode61 tokenizer with trigram (2026-05-23).
|
||||
/// V002 stays in place for historical replay of cold-upgrade paths
|
||||
/// but is no longer compared against the design doc — V007 is now
|
||||
/// the source of truth.
|
||||
#[test]
|
||||
fn fts_v002_matches_design_section_5_5_verbatim() {
|
||||
fn fts_v007_matches_design_section_5_5_verbatim() {
|
||||
let design = extract_design_5_5_fts_block();
|
||||
let migration_block = extract_migration_5_5_verbatim_block();
|
||||
|
||||
@@ -428,7 +432,7 @@ fn fts_v002_matches_design_section_5_5_verbatim() {
|
||||
let migration_n = normalize_ws(&migration_block);
|
||||
assert_eq!(
|
||||
design_n, migration_n,
|
||||
"V002__fts.sql §5.5 block must match design doc §5.5 verbatim \
|
||||
"V007__fts_trigram.sql §5.5 block must match design doc §5.5 verbatim \
|
||||
(whitespace-normalized). If you intentionally changed one, \
|
||||
update the other in the same commit."
|
||||
);
|
||||
@@ -477,3 +481,115 @@ fn fts_store_drop_releases_wal_files() {
|
||||
.expect("main DB file should be removable after store drop");
|
||||
}
|
||||
}
|
||||
|
||||
// ── 7. Trigram tokenizer behavior (V007) — Korean + English ──────────
|
||||
|
||||
/// V007 의 trigram tokenizer 가 한국어 3자 이상 연속 substring 을
|
||||
/// 매칭하는지. Codex round 1/2 가 sqlite 3.45.1 로 검증한 동작을 pin:
|
||||
/// - raw query 가 3자 이상 공백 없는 substring 인 경우 hit.
|
||||
/// - raw query 가 공백을 포함하면 FTS5 가 토큰 경계로 분리 →
|
||||
/// 양 토큰이 3자 미만이면 0-hit.
|
||||
/// - quoted phrase ("..." 안에 공백 포함) 는 통째로 substring 매칭.
|
||||
#[test]
|
||||
fn fts_trigram_korean_3char_substring_hits() {
|
||||
let env = common::TestEnv::new();
|
||||
let store = SqliteStore::open(&env.config()).unwrap();
|
||||
store.run_migrations().unwrap();
|
||||
|
||||
let conn = raw_conn_no_fk(&env);
|
||||
insert_chunk(
|
||||
&conn,
|
||||
&"k".repeat(32),
|
||||
&"d".repeat(32),
|
||||
"[]",
|
||||
"해시 충돌은 키와 값을 매핑할 때 발생한다",
|
||||
);
|
||||
|
||||
// raw 3+ chars 공백 없는 연속 substring → hit.
|
||||
assert_eq!(
|
||||
count_match(&conn, "충돌은"),
|
||||
1,
|
||||
"raw 3-char 공백 없는 substring '충돌은' must hit"
|
||||
);
|
||||
assert_eq!(
|
||||
count_match(&conn, "발생한"),
|
||||
1,
|
||||
"raw 3-char 공백 없는 substring '발생한' must hit"
|
||||
);
|
||||
|
||||
// quoted phrase (공백 포함) → substring 매칭으로 hit.
|
||||
assert_eq!(
|
||||
count_match(&conn, "\"해시 충돌\""),
|
||||
1,
|
||||
"quoted whole phrase '해시 충돌' (5 chars including space)"
|
||||
);
|
||||
assert_eq!(
|
||||
count_match(&conn, "\"시 충\""),
|
||||
1,
|
||||
"quoted phrase '시 충' across the space boundary"
|
||||
);
|
||||
|
||||
// raw with no whitespace but substring not present in source → 0-hit.
|
||||
assert_eq!(
|
||||
count_match(&conn, "해시충"),
|
||||
0,
|
||||
"원문에 공백 없는 '해시충' trigram 이 없으므로 0-hit"
|
||||
);
|
||||
}
|
||||
|
||||
/// V007 trigram 의 핵심 제약: 3 Unicode chars 미만 query 는 색인 단위가
|
||||
/// 없어 항상 0-hit. design §3.4 + 사용자 결정 (lexical core 정상 0-hit,
|
||||
/// CLI/TUI wrapper 가 안내 메시지 출력). 회귀 감지 — trigram 구조 변경
|
||||
/// 또는 다른 tokenizer 도입 시 이 test 가 먼저 fail 한다.
|
||||
#[test]
|
||||
fn fts_trigram_korean_short_query_zero_hit_pinned() {
|
||||
let env = common::TestEnv::new();
|
||||
let store = SqliteStore::open(&env.config()).unwrap();
|
||||
store.run_migrations().unwrap();
|
||||
|
||||
let conn = raw_conn_no_fk(&env);
|
||||
insert_chunk(
|
||||
&conn,
|
||||
&"k".repeat(32),
|
||||
&"d".repeat(32),
|
||||
"[]",
|
||||
"해시 충돌은 키와 값을 매핑할 때 발생한다",
|
||||
);
|
||||
|
||||
// 2자 한국어 query — 도그푸딩에서 보고된 핵심 케이스 ('충돌'/'값').
|
||||
assert_eq!(count_match(&conn, "충돌"), 0, "2-char Korean query");
|
||||
// 1자 한국어 query.
|
||||
assert_eq!(count_match(&conn, "키"), 0, "1-char Korean query");
|
||||
}
|
||||
|
||||
/// V007 trigram 은 영어에도 substring 매칭으로 동작 — recall ↑, 단어
|
||||
/// 경계 정밀도 ↓. design §3.4 의 동작 변경을 명시적으로 핀.
|
||||
#[test]
|
||||
fn fts_trigram_english_substring_hits() {
|
||||
let env = common::TestEnv::new();
|
||||
let store = SqliteStore::open(&env.config()).unwrap();
|
||||
store.run_migrations().unwrap();
|
||||
|
||||
let conn = raw_conn_no_fk(&env);
|
||||
insert_chunk(
|
||||
&conn,
|
||||
&"e".repeat(32),
|
||||
&"d".repeat(32),
|
||||
"[]",
|
||||
"the tokenizer normalizes whitespace before matching",
|
||||
);
|
||||
|
||||
// trigram substring — 'token' hits inside 'tokenizer'.
|
||||
assert_eq!(
|
||||
count_match(&conn, "token"),
|
||||
1,
|
||||
"substring of 'tokenizer' — trigram recall"
|
||||
);
|
||||
assert_eq!(
|
||||
count_match(&conn, "izer"),
|
||||
1,
|
||||
"substring of 'tokenizer'"
|
||||
);
|
||||
// 3-char-minimum applies to English too.
|
||||
assert_eq!(count_match(&conn, "to"), 0, "2-char English query");
|
||||
}
|
||||
|
||||
@@ -41,6 +41,7 @@ fn fixture_report() -> IngestReport {
|
||||
skipped_generated: 0,
|
||||
skipped_size_exceeded: 0,
|
||||
skip_examples: kebab_core::SkipExamples::default(),
|
||||
purged_deleted_files: 0,
|
||||
items: Some(vec![
|
||||
IngestItem {
|
||||
kind: IngestItemKind::New,
|
||||
|
||||
@@ -153,6 +153,12 @@ pub struct SearchState {
|
||||
/// `Ctrl-L`); the previous draft kept one for "symmetry" but
|
||||
/// it was dead code.
|
||||
pub worker_rx: Option<std::sync::mpsc::Receiver<SearchWorkerMessage>>,
|
||||
/// v0.17.0 A5 Step 5: advisory text shown when the last completed
|
||||
/// search returned no hits and the (trimmed) query is shorter than
|
||||
/// the FTS5 trigram tokenizer's 3-char minimum. `None` whenever
|
||||
/// the input changes (so a stale hint never overlaps a fresh
|
||||
/// typing session) or the next search returns ≥1 hit.
|
||||
pub short_query_hint: Option<String>,
|
||||
}
|
||||
|
||||
/// p9-fb-08: payload posted by the search worker on completion.
|
||||
@@ -179,6 +185,7 @@ impl Default for SearchState {
|
||||
preview: None,
|
||||
generation: 0,
|
||||
worker_rx: None,
|
||||
short_query_hint: None,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -393,6 +393,20 @@ fn dynamic_status(app: &App) -> String {
|
||||
if app.search.as_ref().map(|s| s.searching).unwrap_or(false) {
|
||||
return "searching…".to_string();
|
||||
}
|
||||
// v0.17.0 A5 Step 5: short-query advisory has higher priority than
|
||||
// the idle slot but lower than active operations (streaming /
|
||||
// searching / ingest progress) — the user should always see what
|
||||
// is happening *now* before reading guidance about the last
|
||||
// empty result. Slot only fires while focused on Search.
|
||||
if app.focus == Pane::Search {
|
||||
if let Some(hint) = app
|
||||
.search
|
||||
.as_ref()
|
||||
.and_then(|s| s.short_query_hint.as_deref())
|
||||
{
|
||||
return hint.to_string();
|
||||
}
|
||||
}
|
||||
if let Some(state) = app.ingest_state.as_ref() {
|
||||
return crate::ingest_progress::status_line(state);
|
||||
}
|
||||
|
||||
@@ -333,7 +333,7 @@ pub fn handle_key_search(state: &mut App, key: KeyEvent) -> KeyOutcome {
|
||||
s.mode = cycle_mode(s.mode);
|
||||
// Force re-search at the new mode if there's a query.
|
||||
if !s.input.as_str().trim().is_empty() {
|
||||
s.input_dirty_at = Some(time::OffsetDateTime::now_utc());
|
||||
mark_input_changed(s);
|
||||
}
|
||||
KeyOutcome::Continue
|
||||
}
|
||||
@@ -360,7 +360,7 @@ pub fn handle_key_search(state: &mut App, key: KeyEvent) -> KeyOutcome {
|
||||
(KeyCode::Backspace, _) => {
|
||||
if !s.input.is_empty() {
|
||||
s.input.pop_char();
|
||||
s.input_dirty_at = Some(time::OffsetDateTime::now_utc());
|
||||
mark_input_changed(s);
|
||||
}
|
||||
KeyOutcome::Continue
|
||||
}
|
||||
@@ -388,7 +388,7 @@ pub fn handle_key_search(state: &mut App, key: KeyEvent) -> KeyOutcome {
|
||||
}
|
||||
(KeyCode::Delete, _) => {
|
||||
if s.input.delete_after().is_some() {
|
||||
s.input_dirty_at = Some(time::OffsetDateTime::now_utc());
|
||||
mark_input_changed(s);
|
||||
}
|
||||
KeyOutcome::Continue
|
||||
}
|
||||
@@ -402,7 +402,7 @@ pub fn handle_key_search(state: &mut App, key: KeyEvent) -> KeyOutcome {
|
||||
s.preview = None;
|
||||
} else {
|
||||
s.input.push_char('j');
|
||||
s.input_dirty_at = Some(time::OffsetDateTime::now_utc());
|
||||
mark_input_changed(s);
|
||||
}
|
||||
KeyOutcome::Continue
|
||||
}
|
||||
@@ -412,7 +412,7 @@ pub fn handle_key_search(state: &mut App, key: KeyEvent) -> KeyOutcome {
|
||||
s.preview = None;
|
||||
} else {
|
||||
s.input.push_char('k');
|
||||
s.input_dirty_at = Some(time::OffsetDateTime::now_utc());
|
||||
mark_input_changed(s);
|
||||
}
|
||||
KeyOutcome::Continue
|
||||
}
|
||||
@@ -426,7 +426,7 @@ pub fn handle_key_search(state: &mut App, key: KeyEvent) -> KeyOutcome {
|
||||
// bindings (and don't currently match any Search
|
||||
// command, so they're a safe fall-through to Continue).
|
||||
s.input.push_char(c);
|
||||
s.input_dirty_at = Some(time::OffsetDateTime::now_utc());
|
||||
mark_input_changed(s);
|
||||
KeyOutcome::Continue
|
||||
}
|
||||
// Normal mode + un-handled Char → no-op (no typing in
|
||||
@@ -435,6 +435,16 @@ pub fn handle_key_search(state: &mut App, key: KeyEvent) -> KeyOutcome {
|
||||
}
|
||||
}
|
||||
|
||||
/// v0.17.0 A5 Step 5: every input-mutation site in `handle_key_search`
|
||||
/// funnels through this helper so the debounce stamp and the
|
||||
/// short-query advisory stay in sync. Reset is eager — the stale
|
||||
/// advisory from the previous result set must not visually overlap
|
||||
/// with a fresh typing session.
|
||||
fn mark_input_changed(s: &mut crate::app::SearchState) {
|
||||
s.input_dirty_at = Some(time::OffsetDateTime::now_utc());
|
||||
s.short_query_hint = None;
|
||||
}
|
||||
|
||||
fn cycle_mode(m: SearchMode) -> SearchMode {
|
||||
match m {
|
||||
SearchMode::Lexical => SearchMode::Vector,
|
||||
@@ -603,6 +613,11 @@ pub(crate) fn fire_search(state: &mut App) -> anyhow::Result<()> {
|
||||
s.generation = s.generation.wrapping_add(1);
|
||||
s.searching = true;
|
||||
s.input_dirty_at = None;
|
||||
// v0.17.0 A5 Step 5: hint belongs to the *prior* result set —
|
||||
// a fresh worker spawn invalidates it so the status bar
|
||||
// doesn't keep showing the old advisory while the new
|
||||
// query is in flight.
|
||||
s.short_query_hint = None;
|
||||
let q_text = s.input.as_str().to_string();
|
||||
s.last_query = Some((q_text.clone(), s.mode));
|
||||
(q_text, s.mode, s.generation)
|
||||
@@ -676,6 +691,18 @@ pub fn poll_worker(state: &mut App) {
|
||||
s.searching = false;
|
||||
match result {
|
||||
Ok(hits) => {
|
||||
// v0.17.0 A5 Step 5: stale-aware short-query hint.
|
||||
// The worker carries no copy of the query text;
|
||||
// we ground the advisory on `s.last_query` which
|
||||
// was snapshotted at `fire_search` time and (by
|
||||
// the generation guard above) still matches what
|
||||
// the user submitted for *this* result set. If
|
||||
// input has drifted since spawn, the gen-check
|
||||
// already returned early.
|
||||
let q_text =
|
||||
s.last_query.as_ref().map(|(t, _)| t.as_str()).unwrap_or("");
|
||||
s.short_query_hint =
|
||||
kebab_app::short_query_hint(q_text, hits.is_empty());
|
||||
s.hits = hits;
|
||||
s.selected_hit = 0;
|
||||
s.preview = None;
|
||||
@@ -683,6 +710,7 @@ pub fn poll_worker(state: &mut App) {
|
||||
Err(e) => {
|
||||
s.hits.clear();
|
||||
s.selected_hit = 0;
|
||||
s.short_query_hint = None;
|
||||
state.error_overlay =
|
||||
Some(crate::error_popup::ErrorOverlay::from_anyhow(&e));
|
||||
}
|
||||
|
||||
@@ -22,7 +22,7 @@ Cargo workspace, 함수 호출 기반 모듈러 모놀리스. UI binary (`kebab-
|
||||
| OCR | Ollama vision LM (default `gemma4:e4b`) — `OcrEngine` trait 으로 Tesseract / Apple Vision 등 future swap (HOTFIXES P6-2) |
|
||||
| Image caption | Ollama vision LM, runtime gate `image.caption.enabled` (default OFF) |
|
||||
| PDF parser | `lopdf` per-page 텍스트, `chunker_version = "pdf-page-v1"` 가 PDF 자산에 하드코딩 (HOTFIXES P7-3) |
|
||||
| code parser | `tree-sitter` + `tree-sitter-rust` / `tree-sitter-python` / `tree-sitter-typescript` / `tree-sitter-javascript` — **parser-side** (`kebab-parse-code`), chunker-side 아님 (design §6.3). chunker versions: Rust = `code-rust-ast-v1`, Python = `code-python-ast-v1`, TypeScript = `code-ts-ast-v1`, JavaScript = `code-js-ast-v1`. `ast_chunk_max_lines = 200` 상수 고정 (HOTFIXES 2026-05-19 — Chunker trait 이 per-medium config 미노출). |
|
||||
| code parser | `tree-sitter` + `tree-sitter-rust` / `tree-sitter-python` / `tree-sitter-typescript` / `tree-sitter-javascript` / `tree-sitter-go` / `tree-sitter-java` / `tree-sitter-kotlin-ng` — **parser-side** (`kebab-parse-code`), chunker-side 아님 (design §6.3). chunker versions: Rust = `code-rust-ast-v1`, Python = `code-python-ast-v1`, TypeScript = `code-ts-ast-v1`, JavaScript = `code-js-ast-v1`, Go = `code-go-ast-v1`, Java = `code-java-ast-v1`, Kotlin = `code-kotlin-ast-v1`. `ast_chunk_max_lines = 200` 상수 고정 (HOTFIXES 2026-05-19 — Chunker trait 이 per-medium config 미노출). Kotlin grammar 은 `tree-sitter-kotlin-ng` 사용 — bare `tree-sitter-kotlin` 은 tree-sitter 0.21–0.23 에 고착되어 있어 사용 불가. **Tier 2 (p10-2)**: YAML/k8s → `serde_yaml` + `k8s-manifest-resource-v1` (apiVersion+kind per resource), Dockerfile → `dockerfile-file-v1` (whole-file), Cargo.toml/go.mod/.json/.xml/.groovy → `manifest-file-v1` (whole-file). Tier 2 chunkers live in `kebab-chunk`; no tree-sitter grammar needed (structure from file type, not AST). **Tier 3 (p10-3)**: shell scripts (`.sh`/`.bash`/`.zsh`) direct → `code-text-paragraph-v1` (blank-line paragraph segmentation + 80-line / 20-overlap line-window for oversize). Same chunker also serves as fallback when Tier 1/2 emit 0 chunks or Err — non-k8s YAML / invalid YAML / AST extractor failures all picked up. symbol = None; lang preserved from input doc. **Tier 1 family complete (p10-1D)**: C (`tree-sitter-c`, `code-c-ast-v1`, `.c`/`.h`) + C++ (`tree-sitter-cpp`, `code-cpp-ast-v1`, `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx`). C symbol = function name only; C++ symbol = `namespace::Class::method` (recursive nesting). `.h` 가 C++ syntax 만나면 tree-sitter-c parse 실패 → Tier 3 fallback. |
|
||||
| 1B symbol path | workspace path → module path: Python = dotted prefix (`kebab_eval.metrics.compute_mrr`), TypeScript/JavaScript = slash-style prefix (`src/Foo.Foo.search`). Rust 1A-2 는 file-scope nesting 만 (workspace prefix 없음, 비일관 수용 — HOTFIXES 2026-05-20). |
|
||||
| TUI | Ratatui + crossterm — P9-1 Library 패널, P9-2/3/4 진행 예정 |
|
||||
| Desktop | Tauri 2 + `pdfjs-dist` (native PDF render backend 금지) — P9-5 |
|
||||
@@ -52,7 +52,7 @@ flowchart TB
|
||||
ppdf["kebab-parse-pdf"]
|
||||
pimg["kebab-parse-image"]
|
||||
paud["kebab-parse-audio<br/>(P8 보류)"]
|
||||
pcode["kebab-parse-code<br/>(P10-1A-2 + P10-1B)"]
|
||||
pcode["kebab-parse-code<br/>(P10-1A-2 + P10-1B + P10-1C-Go + P10-1C-JK + P10-2 + P10-3 + P10-1D)"]
|
||||
ptypes["kebab-parse-types"]
|
||||
norm["kebab-normalize"]
|
||||
chunk["kebab-chunk"]
|
||||
@@ -127,7 +127,7 @@ flowchart TB
|
||||
|
||||
UI → store/llm/parse 직접 의존 금지. 모든 user-facing 진입은 `kebab-app` facade 만 통한다 (frozen 설계 §8). `kebab-cli` 가 `--config <path>` flag 를 honor 하려면 `kebab_app::*_with_config(cfg, …)` companion 을 통해 Config 을 명시적으로 thread 하는 패턴 — 자세한 이유는 [tasks/HOTFIXES.md](../tasks/HOTFIXES.md) 의 `--config` 항목.
|
||||
|
||||
`kebab-parse-code` 의 외부 tree-sitter grammar crate 의존: P10-1A-2 에서 `tree-sitter-rust` 추가, P10-1B 에서 `tree-sitter-python` / `tree-sitter-typescript` / `tree-sitter-javascript` 추가. 모두 `kebab-parse-code` 에만 격리 (facade 룰 — UI crate / chunker 가 직접 import 금지).
|
||||
`kebab-parse-code` 의 외부 tree-sitter grammar crate 의존: P10-1A-2 에서 `tree-sitter-rust` 추가, P10-1B 에서 `tree-sitter-python` / `tree-sitter-typescript` / `tree-sitter-javascript` 추가, P10-1C-Go 에서 `tree-sitter-go` 추가, P10-1C-JK 에서 `tree-sitter-java` / `tree-sitter-kotlin-ng` 추가, P10-1D 에서 `tree-sitter-c` / `tree-sitter-cpp` 추가. 모두 `kebab-parse-code` 에만 격리 (facade 룰 — UI crate / chunker 가 직접 import 금지). Kotlin 은 `tree-sitter-kotlin-ng` 사용 (bare `tree-sitter-kotlin` 은 tree-sitter 0.21–0.23 에 고착 — 사용 불가).
|
||||
|
||||
## 디렉토리 구조
|
||||
|
||||
@@ -165,7 +165,16 @@ kebab/
|
||||
│ ├── kebab-source-fs/ # 워크스페이스 walk + checksum (P1-1)
|
||||
│ ├── kebab-parse-md/ # Markdown frontmatter + blocks (P1-2/3)
|
||||
│ ├── kebab-normalize/ # ParsedBlock → CanonicalDocument (P1-4)
|
||||
│ ├── kebab-chunk/ # heading-aware + pdf-page-v1 + code-rust-ast-v1 + code-python-ast-v1 + code-ts-ast-v1 + code-js-ast-v1 chunker (P1-5, P7-2, P10-1A-2, P10-1B)
|
||||
│ ├── kebab-chunk/ # heading-aware + pdf-page-v1 + code-*-ast-v1 (Tier 1) + k8s-manifest-resource-v1 + dockerfile-file-v1 + manifest-file-v1 + tier2_shared (P10-2) + code-text-paragraph-v1 (P10-3) chunker (P1-5, P7-2, P10-1A-2, P10-1B, P10-1C-Go, P10-1C-JK, P10-2, P10-3, P10-1D)
|
||||
│ │ └── src/
|
||||
│ │ ├── code_*_ast_v1.rs # Tier 1 AST chunkers (rust/python/ts/js/go/java/kotlin/c/cpp)
|
||||
│ │ ├── code_c_ast_v1.rs # Tier 1 (p10-1D): C top-level fn / struct / enum / union
|
||||
│ │ ├── code_cpp_ast_v1.rs # Tier 1 (p10-1D): C++ namespace::Class::method (recursive nesting)
|
||||
│ │ ├── k8s_manifest_resource_v1.rs # Tier 2 (p10-2): YAML multi-doc, apiVersion+kind per resource
|
||||
│ │ ├── dockerfile_file_v1.rs # Tier 2 (p10-2): whole-file Dockerfile
|
||||
│ │ ├── manifest_file_v1.rs # Tier 2 (p10-2): whole-file Cargo.toml / go.mod / .json / .xml / .groovy
|
||||
│ │ ├── code_text_paragraph_v1.rs # Tier 3 (p10-3): blank-line paragraph + 80/20 line-window fallback
|
||||
│ │ └── tier2_shared.rs # Tier 2 (p10-2): shared oversize fallback + Chunk builder helpers
|
||||
│ ├── kebab-store-sqlite/ # SQLite + FTS5 (V001/V002/V003) (P1-6, P2-1, P3-3)
|
||||
│ ├── kebab-search/ # Lexical + Vector + Hybrid retriever (P2-2, P3-4)
|
||||
│ ├── kebab-embed/ kebab-embed-local/ # Embedder trait + fastembed adapter (P3-1, P3-2)
|
||||
@@ -175,7 +184,7 @@ kebab/
|
||||
│ ├── kebab-eval/ # golden query runner + metrics (P5-1, P5-2)
|
||||
│ ├── kebab-parse-image/ # ImageExtractor + Ollama OCR + caption (P6)
|
||||
│ ├── kebab-parse-pdf/ # lopdf per-page text extractor (P7-1)
|
||||
│ ├── kebab-parse-code/ # tree-sitter AST extractors: Rust (P10-1A-2), Python + TypeScript + JavaScript (P10-1B); chunker lives in kebab-chunk
|
||||
│ ├── kebab-parse-code/ # tree-sitter AST extractors: Rust (P10-1A-2), Python + TypeScript + JavaScript (P10-1B), Go (P10-1C-Go), Java + Kotlin (P10-1C-JK — java.rs + kotlin.rs), C + C++ (P10-1D — c.rs + cpp.rs); chunker lives in kebab-chunk
|
||||
│ ├── kebab-app/ # facade (P0 시그니처 + P3-5/P6-4/P7-3 본체)
|
||||
│ ├── kebab-tui/ # Ratatui shell + Library 패널 (P9-1)
|
||||
│ ├── kebab-mcp/ # stdio MCP server — tools: schema, doctor, search, ask (P9-FB-30)
|
||||
|
||||
255
docs/SMOKE.md
255
docs/SMOKE.md
@@ -21,6 +21,30 @@ cargo build --release -p kebab-cli # debug 도 무방. 디버그가 더 빠르
|
||||
# Mac 등 별도 호스트에서
|
||||
OLLAMA_HOST=0.0.0.0:11434 ollama serve
|
||||
ollama pull gemma4:e4b # 기본 default. 더 큰 variant 원하면 gemma4:26b
|
||||
# CPU only / RAM ≤ 16 GB 환경이면 ≤ 4B Q4 모델 권장 (gemma3:4b / qwen2.5:3b 등) —
|
||||
# 8B+ 모델은 첫 RAG 답변이 5분 (기본 [models.llm] request_timeout_secs)
|
||||
# 한도를 넘기 쉬워 `error: kb-rag: llm.generate_stream` 으로 떨어짐.
|
||||
# 노브 늘리려면 config 에 request_timeout_secs = 1200 추가
|
||||
# 또는 KEBAB_MODELS_LLM_REQUEST_TIMEOUT_SECS=1200 env. HOTFIXES 2026-05-25 참조.
|
||||
```
|
||||
|
||||
sudo / systemd 없이 격리 디렉토리에 설치하는 경로 (컨테이너 / WSL2 / 회사 머신
|
||||
유용):
|
||||
|
||||
```bash
|
||||
# tarball 만 받아 사용자 디렉토리에 풀고 OLLAMA_MODELS 로 모델 디렉토리 분리.
|
||||
mkdir -p /opt/ollama/{models,logs}
|
||||
curl -fL https://ollama.com/download/ollama-linux-amd64.tar.zst -o /tmp/ollama.tar.zst
|
||||
zstd -d /tmp/ollama.tar.zst -o /tmp/ollama.tar && tar -xf /tmp/ollama.tar -C /opt/ollama/
|
||||
OLLAMA_MODELS=/opt/ollama/models OLLAMA_HOST=127.0.0.1:11434 \
|
||||
/opt/ollama/bin/ollama serve > /opt/ollama/logs/serve.log 2>&1 &
|
||||
/opt/ollama/bin/ollama pull gemma3:4b
|
||||
# 종료: pkill -f "ollama serve"
|
||||
```
|
||||
|
||||
cold start 가 긴 모델 (8B+ 또는 첫 호출) 은 `kebab ask --stream` 으로 시도 권장
|
||||
— 토큰을 stderr 에 ndjson 으로 흘려 받아 5분 timeout 한도 안에서도 첫 토큰이
|
||||
빨리 surface 됨 (fb-33). 자세한 명령은 아래 "Streaming ask (fb-33)" 절.
|
||||
```
|
||||
|
||||
본 머신에서 reachability 검증:
|
||||
@@ -140,6 +164,37 @@ KB ask "이 KB 안에서 ..." --mode hybrid --k 5 # 9. RAG 답변 (Ollama
|
||||
KB --json ask "..." --mode hybrid # 10. 기계 친화 출력 검증
|
||||
```
|
||||
|
||||
### 한국어 trigram 검색 (v0.17.0)
|
||||
|
||||
`chunks_fts` 가 FTS5 `trigram` tokenizer 로 동작 — 한국어 query 는 3자 이상 substring 매칭. V007 자동 backfill 이라 기존 KB 의 binary 만 v0.17.0+ 로 교체하면 즉시 적용 (re-ingest 불필요). `kebab.sqlite` 파일 크기가 trigram index 비대화로 ~2-5배 또는 수백 MB 증가.
|
||||
|
||||
`fixtures/search/korean/hash-table.md` (또는 등가) 를 워크스페이스에 두고 ingest 한 후:
|
||||
|
||||
```bash
|
||||
# 3자 연속 substring (raw, 원문에 "해시 충돌은" / "충돌은 발생" 가 있음)
|
||||
KB search --mode lexical "충돌은"
|
||||
|
||||
# multi-token Korean — builder 가 ("해시 충돌") OR ("해시" "충돌") 으로
|
||||
# 변환 (각 토큰 2자라 token-AND 후보는 trigram 비호환, whole-phrase 가 hit)
|
||||
KB search --mode lexical "해시 충돌"
|
||||
|
||||
# 한영 혼합 — 둘 다 3자 이상이라 whole-phrase + token-AND 모두 후보
|
||||
KB search --mode lexical "Rust 충돌은"
|
||||
|
||||
# 2자 query — 정상 0 hit + stderr `[hint] 3자 이상 키워드 권장`
|
||||
KB search --mode lexical "충돌"
|
||||
|
||||
# 동일 케이스의 --json 출력에는 search_response.v1.hint 필드 포함
|
||||
KB search --mode lexical "충돌" --json | jq '.hits | length, .hint'
|
||||
# → 0
|
||||
# → "3자 이상 키워드 권장 (trigram tokenizer 제약)"
|
||||
|
||||
# raw FTS5 mode (single quote 로 감싼 입력) — 사용자 명시 의도, hint 미출력
|
||||
KB search --mode lexical "'충돌'"
|
||||
```
|
||||
|
||||
영어 lexical 도 substring 매칭으로 바뀜 — `KB search --mode lexical "token"` 이 `tokenizer` / `tokenize` 도 hit (recall ↑, 단어 경계 정밀도 ↓).
|
||||
|
||||
### Stale doc indicator
|
||||
|
||||
Each search hit and RAG citation carries `indexed_at` (RFC3339 of the doc's last
|
||||
@@ -401,6 +456,201 @@ KB --json schema | jq '.stats.code_lang_breakdown'
|
||||
- `const foo = () => {...}` 같은 expression-level 함수는 `<top-level>` glue 로 잡힘 (declaration-level 단위만 1B 1차 범위). 자세한 내용: `tasks/HOTFIXES.md` (2026-05-20).
|
||||
- `.gitignore` honor — `node_modules/` / `__pycache__/` / `.venv/` 등 built-in 안전망 자동 skip.
|
||||
|
||||
## P10-1C-Go Go 코드 색인
|
||||
|
||||
P10-1B 와 동일한 격리 KB 설정. `.go` 파일을 워크스페이스에 두고 ingest 하면 `code-go-ast-v1` chunker 가 package 단위 AST 로 처리한다.
|
||||
|
||||
```bash
|
||||
cat > /tmp/kebab-smoke/workspace/sample_code/hello.go <<'EOF'
|
||||
package main
|
||||
|
||||
import "fmt"
|
||||
|
||||
func Hello(name string) string {
|
||||
return fmt.Sprintf("Hello, %s!", name)
|
||||
}
|
||||
EOF
|
||||
|
||||
KB ingest
|
||||
KB search --mode hybrid "Hello" --code-lang go --json | \
|
||||
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang}]}'
|
||||
# 기대: symbol = "main.Hello", lang = "go"
|
||||
```
|
||||
|
||||
## P10-2 Tier 2 리소스 파일 색인
|
||||
|
||||
P10-1C-Go 와 동일한 격리 KB 설정. `.yaml` / `Dockerfile` / `.toml` 등 Tier 2 리소스 파일을 워크스페이스에 두고 ingest 하면 각 확장자에 맞는 chunker 로 처리된다.
|
||||
|
||||
```bash
|
||||
# 1) Kubernetes manifest (YAML multi-doc)
|
||||
cat > /tmp/kebab-smoke/workspace/deploy.yaml <<'EOF'
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: my-app
|
||||
namespace: default
|
||||
spec:
|
||||
replicas: 2
|
||||
selector:
|
||||
matchLabels:
|
||||
app: my-app
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: my-app
|
||||
spec:
|
||||
containers:
|
||||
- name: app
|
||||
image: my-app:latest
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: my-app-svc
|
||||
namespace: default
|
||||
spec:
|
||||
selector:
|
||||
app: my-app
|
||||
ports:
|
||||
- port: 80
|
||||
EOF
|
||||
|
||||
# 2) Dockerfile (전체 파일 단일 chunk)
|
||||
cat > /tmp/kebab-smoke/workspace/Dockerfile <<'EOF'
|
||||
FROM rust:1.85 AS builder
|
||||
WORKDIR /app
|
||||
COPY . .
|
||||
RUN cargo build --release
|
||||
|
||||
FROM debian:bookworm-slim
|
||||
COPY --from=builder /app/target/release/kebab /usr/local/bin/kebab
|
||||
ENTRYPOINT ["kebab"]
|
||||
EOF
|
||||
|
||||
# 3) Cargo.toml (manifest — 전체 파일 단일 chunk)
|
||||
cp Cargo.toml /tmp/kebab-smoke/workspace/Cargo.toml
|
||||
|
||||
# 4) ingest
|
||||
KB ingest
|
||||
|
||||
# 5) 언어별 검색 (citation.symbol 확인)
|
||||
KB search --mode hybrid "Deployment" --code-lang yaml --json | \
|
||||
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang}]}'
|
||||
# 기대: symbol = "Deployment/default/my-app" (kind/namespace/name), lang = "yaml"
|
||||
|
||||
KB search --mode hybrid "rust:1.85" --code-lang dockerfile --json | \
|
||||
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang}]}'
|
||||
# 기대: symbol = "<dockerfile>", lang = "dockerfile"
|
||||
|
||||
KB search --mode hybrid "kebab-cli" --code-lang toml --json | \
|
||||
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang}]}'
|
||||
# 기대: symbol = "<manifest>", lang = "toml"
|
||||
|
||||
# 6) schema stats 에 Tier 2 언어 카운트 확인
|
||||
KB --json schema | jq '.stats.code_lang_breakdown'
|
||||
# 기대: {"yaml": N, "dockerfile": N, "toml": N, ...}
|
||||
```
|
||||
|
||||
**Tier 2 citation.symbol 컨벤션**:
|
||||
|
||||
- **YAML k8s 리소스**: `<kind>/<namespace>/<name>` (예: `Deployment/default/my-app`). `namespace` 없으면 `<kind>/<name>`. multi-doc YAML 은 `---` 구분자 기준으로 resource 별 chunk.
|
||||
- **Dockerfile**: `<dockerfile>` (고정 심볼, 전체 파일이 단일 chunk).
|
||||
- **TOML / JSON / XML / Groovy / go.mod**: `<manifest>` (고정 심볼, 전체 파일이 단일 chunk). 단, 파일이 `tier2_shared` 의 oversize threshold 초과 시 줄 단위 fallback chunk.
|
||||
|
||||
## P10-3 Tier 3 paragraph fallback
|
||||
|
||||
P10-2 와 동일한 격리 KB 설정. `.sh` 파일은 direct, 비-k8s YAML 은 fallback 으로 들어간다.
|
||||
|
||||
```bash
|
||||
# 1) shell script (direct Tier 3)
|
||||
cat > /tmp/kebab-smoke/workspace/deploy.sh <<'EOF'
|
||||
#!/usr/bin/env bash
|
||||
set -e
|
||||
|
||||
echo "ingesting..."
|
||||
kebab ingest
|
||||
|
||||
echo "done"
|
||||
kebab schema --json | jq '.stats'
|
||||
EOF
|
||||
|
||||
# 2) 비-k8s YAML (Tier 2 가 0 chunk → Tier 3 fallback)
|
||||
cat > /tmp/kebab-smoke/workspace/docker-compose.yml <<'EOF'
|
||||
version: '3'
|
||||
services:
|
||||
api:
|
||||
image: nginx:latest
|
||||
ports:
|
||||
- 8080:80
|
||||
EOF
|
||||
|
||||
# 3) ingest
|
||||
KB ingest
|
||||
|
||||
# 4) 언어별 검색 (citation.symbol = None 확인)
|
||||
KB search --mode hybrid "ingest" --code-lang shell --json | \
|
||||
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang, chunker: .chunker_version}]}'
|
||||
# 기대: symbol = null, lang = "shell", chunker_version = "code-text-paragraph-v1"
|
||||
|
||||
KB search --mode hybrid "nginx" --code-lang yaml --json | \
|
||||
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang, chunker: .chunker_version}]}'
|
||||
# 기대: symbol = null, lang = "yaml", chunker_version = "code-text-paragraph-v1"
|
||||
|
||||
# 5) schema stats 에 shell 카운트 확인
|
||||
KB --json schema | jq '.stats.code_lang_breakdown'
|
||||
# 기대: {"shell": N, "yaml": M, ...}
|
||||
```
|
||||
|
||||
**Tier 3 citation.symbol 컨벤션**: 항상 `null`. 의미 단위 식별 안 함. `lang` 은 원본 lang 보존 (shell → `"shell"`, yaml → `"yaml"` 등).
|
||||
|
||||
## P10-1D C + C++ AST chunkers
|
||||
|
||||
P10-3 와 동일한 격리 KB 설정. `.c` 와 `.cpp` 파일이 각자의 AST chunker 로 처리된다.
|
||||
|
||||
```bash
|
||||
# 1) C 파일 — top-level function symbol
|
||||
cat > /tmp/kebab-smoke/workspace/parser.c <<'EOF'
|
||||
#include <stdio.h>
|
||||
|
||||
int parse_record(const char *line) {
|
||||
if (line == NULL) return -1;
|
||||
return 0;
|
||||
}
|
||||
EOF
|
||||
|
||||
# 2) C++ 파일 — namespace::Class::method symbol
|
||||
cat > /tmp/kebab-smoke/workspace/chunker.cpp <<'EOF'
|
||||
namespace kebab {
|
||||
namespace chunk {
|
||||
|
||||
class Foo {
|
||||
public:
|
||||
void bar() { /* impl */ }
|
||||
};
|
||||
|
||||
} // namespace chunk
|
||||
} // namespace kebab
|
||||
EOF
|
||||
|
||||
# 3) ingest
|
||||
KB ingest
|
||||
|
||||
# 4) 언어별 검색 (citation.symbol 확인)
|
||||
KB search --mode hybrid "parse_record" --code-lang c --json | \
|
||||
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang}]}'
|
||||
# 기대: symbol = "parse_record" (function name only), lang = "c"
|
||||
|
||||
KB search --mode hybrid "bar" --code-lang cpp --json | \
|
||||
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang}]}'
|
||||
# 기대: symbol = "kebab::chunk::Foo" 또는 "kebab::chunk::Foo::bar" (namespace::Class[::method]), lang = "cpp"
|
||||
|
||||
# 5) schema stats 에 C/C++ 카운트 확인
|
||||
KB --json schema | jq '.stats.code_lang_breakdown'
|
||||
# 기대: {"c": N, "cpp": M, ...}
|
||||
```
|
||||
|
||||
**Tier 1 (p10-1D) citation.symbol 컨벤션**: C 는 function name only (`parse_record` 같이 nesting 없음). C++ 는 `namespace::Class::method` (recursive namespace + class nesting). `.h` 파일이 C++ syntax (namespace / template / class) 만나면 tree-sitter-c parse 실패 → p10-3 Tier 3 fallback (`code-text-paragraph-v1`) 으로 자동 picked up.
|
||||
|
||||
## 검증 체크리스트
|
||||
|
||||
- `kebab doctor` 가 `--config` path 를 honor 하고 그 안의 `storage.data_dir` 를 출력 (XDG default 가 아님).
|
||||
@@ -433,6 +683,11 @@ rm -rf /tmp/kebab-smoke # 통째로 정리
|
||||
- (P7-3) 한 PDF 가 N 페이지면 `kebab ingest` 가 N 개 (또는 그 이상의, 페이지 길면 multi-chunk) 의 chunk 를 한 transaction 안에서 commit. 500 페이지 책 → 500+ chunk 한 번에 → embedding throughput 가 bottleneck. 임베딩 활성 워크스페이스에서 큰 PDF 를 처음 ingest 하면 분-단위 시간 + WAL 크기 증가 가능 — P+ 스케일 hardening task 까지 정상 동작이지만 비용은 측정 가능.
|
||||
- (P10-1A-2) `.rs` 파일을 워크스페이스에 두면 `kebab ingest` 결과에 `new` 카운터에 포함. `kebab search --mode hybrid "<함수명>" --code-lang rust --json` 가 `citation.kind = "code"`, `citation.lang = "rust"` (SearchHit top-level `code_lang` 도 동일), `citation.symbol` (함수/타입 이름), `citation.line_start` / `citation.line_end` 를 반환하면 wiring 정상. `kebab schema --json | jq .stats.code_lang_breakdown` 에 `"rust": N` 이 나오면 chunk 가 색인됨.
|
||||
- (P10-1B) `.py` / `.ts` / `.tsx` / `.js` / `.mjs` / `.cjs` / `.jsx` 파일을 워크스페이스에 두면 `kebab ingest` 결과에 `new` 카운터에 포함. `--code-lang python` / `--code-lang typescript` / `--code-lang javascript` 검색이 `citation.symbol` 에 module path prefix 를 포함한 결과를 반환하면 wiring 정상. `kebab schema --json | jq .stats.code_lang_breakdown` 에 해당 언어 카운트 등장 확인.
|
||||
- (P10-1C-Go) `.go` 파일을 워크스페이스에 두면 `kebab ingest` 가 `code-go-ast-v1` 로 처리. `--code-lang go` 검색이 `citation.symbol` 에 `<package>.<Func>` / `<package>.(*Receiver).<Method>` 형식 결과를 반환하면 wiring 정상. `kebab schema --json | jq .stats.code_lang_breakdown` 에 `"go": N` 등장 확인.
|
||||
- (P10-1C-JK) `.java` 파일은 `code-java-ast-v1`, `.kt`/`.kts` 파일은 `code-kotlin-ast-v1` 로 처리. `--code-lang java` / `--code-lang kotlin` 검색이 `citation.symbol` 에 `com.foo.Foo.bar` 형식 결과를 반환하면 wiring 정상. `kebab schema --json | jq .stats.code_lang_breakdown` 에 `"java": N` / `"kotlin": N` 등장 확인.
|
||||
- (P10-2) `.yaml`/`.yml` 파일은 apiVersion+kind 파싱으로 k8s resource 별 chunk 생성 (`k8s-manifest-resource-v1`). `Dockerfile`/`Dockerfile.*` 는 전체 파일 단일 chunk (`dockerfile-file-v1`). `.toml`/`.json`/`.xml`/`.groovy`/`go.mod` 는 전체 파일 단일 chunk (`manifest-file-v1`). `--code-lang yaml` / `--code-lang dockerfile` / `--code-lang toml` 검색이 `citation.symbol` 에 각각 `Deployment/default/my-app` / `<dockerfile>` / `<manifest>` 형식 결과를 반환하면 wiring 정상. `kebab schema --json | jq .stats.code_lang_breakdown` 에 `"yaml": N` / `"dockerfile": N` / `"toml": N` 등장 확인.
|
||||
- (P10-3) `.sh`/`.bash`/`.zsh` 파일은 direct Tier 3 (`code-text-paragraph-v1`). 비-k8s YAML (apiVersion+kind 없는 yaml) 은 k8s chunker 가 0 chunk → Tier 3 fallback 으로 picked up. `--code-lang shell` / `--code-lang yaml` 검색이 `citation.symbol = null`, `chunker_version = "code-text-paragraph-v1"` 결과를 반환하면 wiring 정상. `kebab schema --json | jq .stats.code_lang_breakdown` 에 `"shell": N` 등장 확인.
|
||||
- (P10-1D) `.c` / `.h` 파일은 `code-c-ast-v1` (function name only symbol). `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx` 는 `code-cpp-ast-v1` (`namespace::Class::method` symbol). `--code-lang c` / `--code-lang cpp` 검색 동작 + `kebab schema --json | jq .stats.code_lang_breakdown` 에 `"c": N` / `"cpp": M` 등장 확인. `.h` 파일이 C++ 내용 (namespace 등) 갖고 있으면 자동으로 Tier 3 (`code-text-paragraph-v1`) fallback 으로 picked up.
|
||||
- (P7-3 + follow-up) 동일 path 에 byte 가 다른 PDF 를 두 번째 ingest 하면 `purge_vector_orphans_for_workspace_path` 가 옛 chunk_id 를 LanceDB 에서 먼저 삭제, 이어서 `purge_orphan_at_workspace_path` 가 옛 doc / chunks / embedding_records 를 SQLite 에서 sweep. 새 byte 가 새 `doc_id` 로 색인됨. `IngestReport` 에 그 자산만 `new+=1` (다른 자산은 `updated`). 두 store 모두 정합 — 옛 본문 검색 시 옛 chunks 가 더 이상 surface 되지 않음.
|
||||
|
||||
### Embedding upgrade (fb-39b)
|
||||
|
||||
540
docs/superpowers/plans/2026-05-20-p10-1c-go-ast-chunker.md
Normal file
540
docs/superpowers/plans/2026-05-20-p10-1c-go-ast-chunker.md
Normal file
@@ -0,0 +1,540 @@
|
||||
# p10-1C-Go Implementation Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task.
|
||||
|
||||
**Goal:** Activate Go code ingest end-to-end on top of 1A-2 (Rust) + 1B (Python/TS/JS) infrastructure. Add `tree-sitter-go` grammar + `GoAstExtractor` + `code-go-ast-v1` chunker + media routing + app dispatch arm.
|
||||
|
||||
**Architecture:** Mirror 1A-2 / 1B exactly. `kebab-parse-code/src/go.rs` walks tree-sitter-go parse tree; emits one `Block::Code` per top-level AST semantic unit with `SourceSpan::Code { symbol, lang: Some("go") }`. Symbol prefix = **source-extracted package name** (from `package_clause` AST node — design §3.4 Go row). `kebab-chunk/src/code_go_ast_v1.rs` is a near-duplicate of `code-rust-ast-v1`. App dispatch's `ingest_one_code_asset` (PR #142 generalized 4-arm match) gets a 5th arm.
|
||||
|
||||
**Tech Stack:** Rust 2024 workspace, `tree-sitter` 0.26 (already in workspace), `tree-sitter-go` (NEW), 1A-2/1B infrastructure unchanged.
|
||||
|
||||
**Memory note:** Host has been OOM-killed previously. Use `cargo test -p <crate>` and `cargo check -p <crate>` only. ONE full-suite invocation reserved for Task G gate.
|
||||
|
||||
---
|
||||
|
||||
## Pre-flight
|
||||
|
||||
Branch `feat/p10-1c-go` already exists.
|
||||
|
||||
- [ ] **Disk hygiene**: `cargo clean` if previous artifacts are bloated. Skip if disk is comfortable (`df -h /`).
|
||||
|
||||
Reference files:
|
||||
- 1A-2 Rust extractor: `crates/kebab-parse-code/src/rust.rs` — closest single-language scaffold template.
|
||||
- 1B Python extractor (closest analog for "class-nesting recursion" — Go doesn't have classes but has package as the single prefix): `crates/kebab-parse-code/src/python.rs`.
|
||||
- 1A-2 chunker scaffold: `crates/kebab-chunk/src/code_rust_ast_v1.rs`.
|
||||
- 1B dispatch generalization: `crates/kebab-app/src/lib.rs::ingest_one_code_asset` (~L1645, 4-arm match).
|
||||
- 1A-2 source-fs routing: `crates/kebab-source-fs/src/media.rs` `"rs" =>` arm.
|
||||
|
||||
---
|
||||
|
||||
## Task A: Workspace dep `tree-sitter-go`
|
||||
|
||||
**Files:**
|
||||
- Modify: `Cargo.toml` (workspace `[workspace.dependencies]`, after `tree-sitter-javascript` line)
|
||||
- Modify: `crates/kebab-parse-code/Cargo.toml`
|
||||
|
||||
- [ ] **Step 1**: `cargo add tree-sitter-go -p kebab-parse-code` to resolve version.
|
||||
|
||||
- [ ] **Step 2**: Lift the resolved version into `[workspace.dependencies]` after `tree-sitter-javascript`:
|
||||
|
||||
```toml
|
||||
# Go grammar for code ingest (kebab-parse-code, p10-1C).
|
||||
tree-sitter-go = "<resolved>"
|
||||
```
|
||||
|
||||
Switch the crate's entry to `{ workspace = true }` matching existing tree-sitter-* style.
|
||||
|
||||
- [ ] **Step 3**: `cargo build -p kebab-parse-code` → clean. Unused dep warning is fine.
|
||||
|
||||
- [ ] **Step 4**: Commit:
|
||||
|
||||
```bash
|
||||
git add Cargo.toml Cargo.lock crates/kebab-parse-code/Cargo.toml
|
||||
git commit -m "build(p10-1c-go): add tree-sitter-go workspace dep
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task B: source-fs media routing `.go` → `MediaType::Code("go")`
|
||||
|
||||
**Files:**
|
||||
- Modify: `crates/kebab-source-fs/src/media.rs` (add arm after the existing JS arm at ~L44)
|
||||
- Test: same file's test module
|
||||
|
||||
- [ ] **Step 1 (failing test)** — add to existing tests near `py_ts_js_files_map_to_media_code`:
|
||||
|
||||
```rust
|
||||
#[test]
|
||||
fn go_files_map_to_media_code_go() {
|
||||
assert_eq!(media_type_for(Path::new("a/b.go")), MediaType::Code("go".into()));
|
||||
}
|
||||
```
|
||||
|
||||
- [ ] **Step 2**: Run → FAIL.
|
||||
|
||||
- [ ] **Step 3**: Add the arm before the catch-all `_ => MediaType::Other(ext)`:
|
||||
|
||||
```rust
|
||||
// p10-1C-Go: Go ingest activated.
|
||||
"go" => MediaType::Code("go".into()),
|
||||
```
|
||||
|
||||
- [ ] **Step 4**: Run → PASS. `cargo test -p kebab-source-fs` → no regression.
|
||||
|
||||
- [ ] **Step 5**: clippy clean, commit:
|
||||
|
||||
```bash
|
||||
git add crates/kebab-source-fs/
|
||||
git commit -m "feat(p10-1c-go): route .go to MediaType::Code(go)
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task C: App dispatch allowlist + bail arm for "go"
|
||||
|
||||
**Files:**
|
||||
- Modify: `crates/kebab-app/src/lib.rs` (dispatch match guard + 4 internal match arms in `ingest_one_code_asset`)
|
||||
|
||||
- [ ] **Step 1**: Find the `MediaType::Code(lang) if matches!(lang.as_str(), "rust" | "python" | "typescript" | "javascript")` arm (~L953). Add `"go"` to the allowlist:
|
||||
|
||||
```rust
|
||||
MediaType::Code(lang)
|
||||
if matches!(lang.as_str(), "rust" | "python" | "typescript" | "javascript" | "go") =>
|
||||
{
|
||||
```
|
||||
|
||||
- [ ] **Step 2**: In `ingest_one_code_asset`'s 4 `match code_lang` blocks (parser_version, chunker_version, extract, chunk), add a "go" arm that `bail!()`s for now (extractor + chunker land in Task D/E). Mirror the Python/TS/JS bail-then-activate pattern:
|
||||
|
||||
```rust
|
||||
let parser_version = match code_lang {
|
||||
// ... existing arms ...
|
||||
"go" => anyhow::bail!("go ingest not yet wired (p10-1c-go Task F)"),
|
||||
other => anyhow::bail!("unsupported code_lang: {other}"),
|
||||
};
|
||||
// similar for chunker_version / extract / chunk matches
|
||||
```
|
||||
|
||||
- [ ] **Step 3**: `cargo test -p kebab-app --lib` → existing 52 lib tests stay green. `cargo test -p kebab-app --test code_ingest_smoke` → 6 stay green (Rust path unaffected).
|
||||
|
||||
- [ ] **Step 4**: clippy clean, commit:
|
||||
|
||||
```bash
|
||||
git add crates/kebab-app/
|
||||
git commit -m "refactor(p10-1c-go): add go to ingest dispatch allowlist (bail until Task F)
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task D: `GoAstExtractor` (`kebab-parse-code/src/go.rs`)
|
||||
|
||||
**Files:**
|
||||
- Create: `crates/kebab-parse-code/src/go.rs`
|
||||
- Modify: `crates/kebab-parse-code/src/lib.rs` (`pub mod go;` + re-exports `GO_PARSER_VERSION`, `GoAstExtractor`)
|
||||
- Create: `crates/kebab-parse-code/tests/fixtures/sample.go`
|
||||
|
||||
Scaffold mirrors `crates/kebab-parse-code/src/rust.rs` line-for-line for the `CanonicalDocument` skeleton (Extractor trait impl, `id_for_doc`, ProvenanceEvent, final `CanonicalDocument` literal). The novel parts:
|
||||
|
||||
### Constants
|
||||
|
||||
```rust
|
||||
pub const PARSER_VERSION: &str = "code-go-v1";
|
||||
|
||||
pub struct GoAstExtractor;
|
||||
// new() + Default
|
||||
// supports: matches!(m, MediaType::Code(l) if l == "go")
|
||||
// agent = "kb-parse-code"
|
||||
// metadata.code_lang = Some("go")
|
||||
// SourceType::Note (no SourceType::Code variant)
|
||||
// repo/git_branch/git_commit via detect_repo
|
||||
```
|
||||
|
||||
### Package extraction
|
||||
|
||||
Unlike 1B's path-based `module_path_for_python` / `_for_tsjs`, the Go package prefix comes from the **source code's `package` declaration** (design §3.4). tree-sitter-go's grammar:
|
||||
|
||||
- Root: `source_file`
|
||||
- First named child is typically `package_clause` → contains `package_identifier` child whose text is the package name.
|
||||
|
||||
Helper (local to `go.rs`):
|
||||
|
||||
```rust
|
||||
/// Returns the package name from a tree-sitter-go `source_file`, or
|
||||
/// `None` if the file has no `package_clause` (invalid Go in practice,
|
||||
/// but be defensive).
|
||||
fn extract_package(root: tree_sitter::Node, src: &str) -> Option<String> {
|
||||
let mut cur = root.walk();
|
||||
for child in root.named_children(&mut cur) {
|
||||
if child.kind() == "package_clause" {
|
||||
// `package_clause` has a `package_identifier` named child.
|
||||
let mut c2 = child.walk();
|
||||
for sub in child.named_children(&mut c2) {
|
||||
if sub.kind() == "package_identifier" {
|
||||
return Some(src[sub.start_byte()..sub.end_byte()].to_string());
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
None
|
||||
}
|
||||
```
|
||||
|
||||
### Semantic-unit rules
|
||||
|
||||
| node kind | unit | symbol |
|
||||
|-----------|------|--------|
|
||||
| `function_declaration` (name field) | 1 | `<pkg>.<fn_name>` |
|
||||
| `method_declaration` | 1 | `<pkg>.(<TypeText>).<MethodName>` where `<TypeText>` includes a leading `*` if the receiver is `pointer_type`. Examples: `chunk.(*MdHeadingV1Chunker).ChunkDoc`, `chunk.(Foo).Bar`. |
|
||||
| `type_declaration` (struct / interface / type alias) | 1 per inner `type_spec` | `<pkg>.<TypeName>` |
|
||||
| `const_declaration`, `var_declaration`, `import_declaration` (single or block) | glue | `<pkg>.<top-level>` (or `<pkg>.<package>` if file has ZERO real units AND glue is import-only — same `<module>` post-pass pattern as 1B Python, renamed to `<package>` to avoid colliding with Go's `package` keyword? — actually use `<module>` per design §3.4 — see "module / namespace 만 있고 symbol 없는 경우" line) |
|
||||
|
||||
`unit_start` walks `comment` siblings (same as 1B). Go doesn't have separate attribute / decorator nodes.
|
||||
|
||||
Method receiver pointer detection:
|
||||
|
||||
```rust
|
||||
// In the method_declaration arm:
|
||||
let receiver = child.child_by_field_name("receiver"); // parameter_list
|
||||
let receiver_type_text = receiver.and_then(|r| {
|
||||
let mut cw = r.walk();
|
||||
for p in r.named_children(&mut cw) {
|
||||
if p.kind() == "parameter_declaration" {
|
||||
// type field is either type_identifier (value) or pointer_type (ptr)
|
||||
if let Some(ty) = p.child_by_field_name("type") {
|
||||
let s = &src[ty.start_byte()..ty.end_byte()];
|
||||
return Some(s.to_string()); // includes leading "*" if pointer_type
|
||||
}
|
||||
}
|
||||
}
|
||||
None
|
||||
});
|
||||
// Format: "(*Foo)" or "(Foo)" — wrap in parens, preserve leading "*" if any.
|
||||
let owner = receiver_type_text
|
||||
.map(|t| format!("({t})"))
|
||||
.unwrap_or_else(|| "()".to_string());
|
||||
let method_name = name_text(&child, src);
|
||||
// symbol = format!("{pkg}.{owner}.{method_name}")
|
||||
```
|
||||
|
||||
Read tree-sitter-go's grammar.json or node-types.json (in the registry source) if any field name above differs in the resolved crate version.
|
||||
|
||||
### Fixture `tests/fixtures/sample.go`:
|
||||
|
||||
```go
|
||||
// sample.go
|
||||
package chunk
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"strings"
|
||||
)
|
||||
|
||||
const Version = "v1"
|
||||
|
||||
type MdHeadingV1Chunker struct {
|
||||
Name string
|
||||
}
|
||||
|
||||
// ChunkDoc returns a stub list of strings.
|
||||
func (m *MdHeadingV1Chunker) ChunkDoc(input string) []string {
|
||||
return []string{m.Name}
|
||||
}
|
||||
|
||||
func (m MdHeadingV1Chunker) Name2() string {
|
||||
return m.Name
|
||||
}
|
||||
|
||||
type Stringer interface {
|
||||
String() string
|
||||
}
|
||||
|
||||
func Free(x int) int {
|
||||
return x + 1
|
||||
}
|
||||
|
||||
func init() {
|
||||
fmt.Println(strings.ToUpper("init"))
|
||||
}
|
||||
```
|
||||
|
||||
### Test module
|
||||
|
||||
Mirror Python's test shape (use `crate::rust::tests_support::fixed_code_asset` from 1B):
|
||||
|
||||
```rust
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use kebab_core::{Block, MediaType, SourceSpan};
|
||||
|
||||
fn extract_fixture() -> kebab_core::CanonicalDocument {
|
||||
let bytes = std::fs::read(
|
||||
concat!(env!("CARGO_MANIFEST_DIR"), "/tests/fixtures/sample.go"),
|
||||
).unwrap();
|
||||
let asset = crate::rust::tests_support::fixed_code_asset(
|
||||
"crates/x/src/sample.go", "go",
|
||||
);
|
||||
let cfg = kebab_core::ExtractConfig::default();
|
||||
let root = std::path::PathBuf::from("/tmp");
|
||||
let ctx = kebab_core::ExtractContext { asset: &asset, workspace_root: &root, config: &cfg };
|
||||
GoAstExtractor::new().extract(&ctx, &bytes).unwrap()
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn extractor_supports_only_media_code_go() {
|
||||
let e = GoAstExtractor::new();
|
||||
assert!(e.supports(&MediaType::Code("go".into())));
|
||||
assert!(!e.supports(&MediaType::Code("rust".into())));
|
||||
assert!(!e.supports(&MediaType::Markdown));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn go_units_match_design_3_4_symbols() {
|
||||
let doc = extract_fixture();
|
||||
let mut syms: Vec<String> = doc.blocks.iter().filter_map(|b| match b {
|
||||
Block::Code(c) => match &c.common.source_span {
|
||||
SourceSpan::Code { symbol, lang, .. } => {
|
||||
assert_eq!(lang.as_deref(), Some("go"));
|
||||
symbol.clone()
|
||||
}
|
||||
_ => None,
|
||||
},
|
||||
_ => None,
|
||||
}).collect();
|
||||
syms.sort();
|
||||
assert!(syms.iter().any(|s| s == "chunk.Free"), "got {syms:?}");
|
||||
assert!(syms.iter().any(|s| s == "chunk.init"));
|
||||
assert!(syms.iter().any(|s| s == "chunk.MdHeadingV1Chunker"));
|
||||
assert!(syms.iter().any(|s| s == "chunk.(*MdHeadingV1Chunker).ChunkDoc"));
|
||||
assert!(syms.iter().any(|s| s == "chunk.(MdHeadingV1Chunker).Name2"));
|
||||
assert!(syms.iter().any(|s| s == "chunk.Stringer"));
|
||||
assert!(syms.iter().any(|s| s == "chunk.<top-level>")); // import + const grouped
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deterministic_across_runs() {
|
||||
let a = extract_fixture();
|
||||
for _ in 0..50 { assert_eq!(extract_fixture().blocks, a.blocks); }
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Step list
|
||||
|
||||
- [ ] Step 1: create fixture + test module.
|
||||
- [ ] Step 2: run → FAIL (`GoAstExtractor` undefined).
|
||||
- [ ] Step 3: implement `go.rs`. Scaffold mirrors `python.rs` (Extractor impl + extract scaffold + `build_blocks` returning blocks). `build_blocks` does: extract_package → walk root's named children → branch per node kind per the table above → emit `Block::Code` with `SourceSpan::Code { symbol, lang: Some("go") }`. Use the same `flush_glue` / glue grouping / `<top-level>` vs `<module>` post-pass as Python (rename to `<package>` if user prefers, but spec §3.4 says `<module>` so keep that name for cross-language consistency).
|
||||
- [ ] Step 4: wire into `lib.rs`:
|
||||
|
||||
```rust
|
||||
pub mod go;
|
||||
pub use go::{PARSER_VERSION as GO_PARSER_VERSION, GoAstExtractor};
|
||||
```
|
||||
|
||||
- [ ] Step 5: `cargo test -p kebab-parse-code` → all pass (Rust/Python/TS/JS + new Go). `cargo clippy -p kebab-parse-code --all-targets -- -D warnings` clean.
|
||||
- [ ] Step 6: commit:
|
||||
|
||||
```bash
|
||||
git add crates/kebab-parse-code/
|
||||
git commit -m "feat(p10-1c-go): tree-sitter-go AST extractor (GoAstExtractor)
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task E: `code-go-ast-v1` chunker
|
||||
|
||||
**Files:**
|
||||
- Create: `crates/kebab-chunk/src/code_go_ast_v1.rs`
|
||||
- Modify: `crates/kebab-chunk/src/lib.rs`
|
||||
|
||||
Identical pattern to PR #142 Task I (TS) / Task L (JS) — near-duplicate of `code_rust_ast_v1.rs` with substitutions:
|
||||
- `const VERSION_LABEL: &str = "code-go-ast-v1";`
|
||||
- struct name `CodeGoAstV1Chunker`
|
||||
- error message says `"CodeGoAstV1Chunker only handles..."`
|
||||
- module doc-comment prose `Rust` → `Go`, `code-rust-ast-v1` → `code-go-ast-v1`
|
||||
|
||||
`split_oversize` / `make_chunk` / `AST_CHUNK_MAX_LINES = 200` / `BYTES_PER_TOKEN = 3` / `POLICY_HASH_HEX_LEN = 16` IDENTICAL (language-agnostic).
|
||||
|
||||
Test module: copy from `code_ts_ast_v1.rs` and substitute names. KEEP cross-chunker `policy_hash_matches_md_heading_v1`.
|
||||
|
||||
Wire into `crates/kebab-chunk/src/lib.rs`:
|
||||
|
||||
```rust
|
||||
mod code_go_ast_v1;
|
||||
pub use code_go_ast_v1::CodeGoAstV1Chunker;
|
||||
```
|
||||
|
||||
(Alphabetical placement.)
|
||||
|
||||
Verify + commit:
|
||||
- `cargo test -p kebab-chunk code_go_ast` PASS (~6 tests)
|
||||
- `cargo test -p kebab-chunk` full per-crate green
|
||||
- `cargo clippy -p kebab-chunk --all-targets -- -D warnings` clean
|
||||
|
||||
```bash
|
||||
git add crates/kebab-chunk/
|
||||
git commit -m "feat(p10-1c-go): code-go-ast-v1 chunker (1:1 + oversize split)
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task F: Activate Go in app dispatch
|
||||
|
||||
**Files:**
|
||||
- Modify: `crates/kebab-app/src/lib.rs` (replace 4 "go" bail! arms with real calls)
|
||||
- Modify: `crates/kebab-app/tests/code_ingest_smoke.rs` (add Go integration test)
|
||||
|
||||
Replace the 4 `"go" => anyhow::bail!(...)` arms in `ingest_one_code_asset` (added in Task C) with real:
|
||||
|
||||
```rust
|
||||
"go" => ParserVersion(kebab_parse_code::GO_PARSER_VERSION.to_string()),
|
||||
// ...
|
||||
"go" => CodeGoAstV1Chunker.chunker_version(),
|
||||
// ...
|
||||
"go" => kebab_parse_code::GoAstExtractor::new()
|
||||
.extract(&ctx, &bytes)
|
||||
.context("kb-parse-code::GoAstExtractor::extract (code:go)")?,
|
||||
// ...
|
||||
"go" => CodeGoAstV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::CodeGoAstV1Chunker::chunk (code:go)")?,
|
||||
```
|
||||
|
||||
Add imports at top of lib.rs:
|
||||
- `kebab_chunk::CodeGoAstV1Chunker`
|
||||
- `kebab_parse_code::GoAstExtractor`
|
||||
|
||||
Integration test (mirror PR #142's `python_file_ingests_and_searches_as_code_citation`):
|
||||
|
||||
```rust
|
||||
#[test]
|
||||
fn go_file_ingests_and_searches_as_code_citation() {
|
||||
// ... TempDir + Config harness same as Python/TS test ...
|
||||
let pkg_dir = env.workspace_root.join("chunk");
|
||||
std::fs::create_dir_all(&pkg_dir).unwrap();
|
||||
std::fs::write(
|
||||
pkg_dir.join("ast.go"),
|
||||
"package chunk\n\nfunc ParseDoc(input string) string {\n return input\n}\n",
|
||||
).unwrap();
|
||||
|
||||
let report = kebab_app::ingest_with_config(/* ... */).unwrap();
|
||||
assert!(report.new >= 1);
|
||||
let go_item = report.items.as_ref().unwrap().iter()
|
||||
.find(|i| i.doc_path.0.ends_with("ast.go")).expect("ast.go item");
|
||||
assert_eq!(go_item.parser_version.as_ref().unwrap().0, "code-go-v1");
|
||||
assert_eq!(go_item.chunker_version.as_ref().unwrap().0, "code-go-ast-v1");
|
||||
|
||||
let hits = kebab_app::search_with_config(/* search "ParseDoc" */).unwrap();
|
||||
let h = hits.iter().find(|h| matches!(h.citation, kebab_core::Citation::Code { .. }))
|
||||
.expect("Citation::Code hit");
|
||||
match &h.citation {
|
||||
kebab_core::Citation::Code { lang, symbol, line_start, .. } => {
|
||||
assert_eq!(lang.as_deref(), Some("go"));
|
||||
assert_eq!(symbol.as_deref(), Some("chunk.ParseDoc"));
|
||||
assert!(*line_start >= 1);
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
assert_eq!(h.code_lang.as_deref(), Some("go"));
|
||||
}
|
||||
```
|
||||
|
||||
Verify:
|
||||
- `cargo test -p kebab-app --test code_ingest_smoke` → 7/7 (6 existing + 1 new go)
|
||||
- `cargo test -p kebab-app --lib` → 52/52 (no regression)
|
||||
- clippy clean
|
||||
|
||||
```bash
|
||||
git add crates/kebab-app/
|
||||
git commit -m "feat(p10-1c-go): activate Go in ingest_one_code_asset dispatch
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task G: Snapshot + full-suite gate + manual SMOKE
|
||||
|
||||
**Files:**
|
||||
- Create: `crates/kebab-chunk/tests/code_go_ast_snapshot.rs` + fixture + baseline (mirror `code_python_ast_snapshot.rs` from PR #142)
|
||||
|
||||
- [ ] **Step 1**: Add snapshot integration test. In-memory `CanonicalDocument` (no kebab-parse-code dep — boundary §6.3). Generate baseline: `UPDATE_SNAPSHOTS=1 cargo test -p kebab-chunk code_go_ast_snapshot` → re-run without env → PASS.
|
||||
|
||||
- [ ] **Step 2**: Full-suite gate (the ONE invocation allowed this PR):
|
||||
|
||||
```bash
|
||||
cargo clippy --workspace --all-targets -- -D warnings
|
||||
cargo test --workspace --no-fail-fast -j 1
|
||||
```
|
||||
|
||||
Both must be CLEAN/GREEN.
|
||||
|
||||
- [ ] **Step 3**: Manual SMOKE (optional but recommended — mirror PR #142 SMOKE):
|
||||
|
||||
```bash
|
||||
cargo build --release # OR debug if RAM-tight
|
||||
rm -rf /tmp/kebab-go-smoke && mkdir -p /tmp/kebab-go-smoke/ws/chunk
|
||||
echo 'package chunk
|
||||
|
||||
func ParseDoc(input string) string { return input }
|
||||
' > /tmp/kebab-go-smoke/ws/chunk/ast.go
|
||||
# adapt isolated config from docs/SMOKE.md
|
||||
./target/release/kebab --config /tmp/kebab-go-smoke/config.toml ingest --json | jq '.items[].parser_version' | sort -u
|
||||
./target/release/kebab --config /tmp/kebab-go-smoke/config.toml search "ParseDoc" --code-lang go --json | jq '.hits[0]'
|
||||
```
|
||||
|
||||
Expected: `code-go-v1` in parser_versions; Citation::Code with symbol `chunk.ParseDoc`.
|
||||
|
||||
- [ ] **Step 4**: Commit snapshot only (full-suite + SMOKE are gates, not commit content):
|
||||
|
||||
```bash
|
||||
git add crates/kebab-chunk/tests/
|
||||
git commit -m "test(p10-1c-go): code-go-ast-v1 chunker snapshot + full-suite gate
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task H: Docs + version bump
|
||||
|
||||
- README: 지원 형식 row — add Go (`.go`, `code-go-ast-v1`).
|
||||
- HANDOFF: P10 phase row note 1C-Go merged (Go active). Java/Kotlin remain pending.
|
||||
- ARCHITECTURE: directory tree note for kebab-parse-code includes `go.rs` (Java/Kotlin coming in next PR). Decisions table — no new row (1C-Go follows the 1A-2/1B convention).
|
||||
- SMOKE: extend the P10 section with a 1-line note for Go (or compact Go example).
|
||||
- tasks/INDEX + tasks/p10/INDEX: flip the row for 1C-Go to 🟡 (PR open) → ✅ on merge. The 1C row in p10/INDEX may need a split — `p10-1C-Go ⏳ → 🟡` and `p10-1C-JavaKotlin ⏳ unchanged` (since user split into 2 PRs).
|
||||
- frozen design §10.1: add a one-liner — "p10-1C-Go 활성화 (Go)" (Java/Kotlin will get its own line in the next PR).
|
||||
- `Cargo.toml`: workspace version `0.11.1 → 0.12.0` (minor — dogfooding surface 확장, 새 chunker + extractor 활성화).
|
||||
|
||||
```bash
|
||||
git add -A
|
||||
git commit -m "docs(p10-1c-go): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX + chore: bump version 0.11.1 → 0.12.0
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Finalize: PR + review loop + release
|
||||
|
||||
Per workflow memory (gitea-pr + review loop, no single-shot):
|
||||
|
||||
- [ ] `gitea-pr` → PR title `feat(p10-1C-Go): tree-sitter-go AST extractor + chunker — Go 코드 색인 활성화`
|
||||
- [ ] Review loop until APPROVE → merge → main pull → branch cleanup → `cargo clean` → `gitea-release v0.12.0`.
|
||||
|
||||
---
|
||||
|
||||
## Self-Review (filled by plan author)
|
||||
|
||||
- **Spec coverage**: design §1C Go (extractor + chunker + activation) → Tasks D/E/F; §3.3 (`code-go-ast-v1`) → Task E; §3.4 symbol path → Task D (extract_package + method receiver pointer detection); §6.1 (`kebab-parse-code/src/go.rs`) → Task D; §6.2 (`kebab-chunk/src/code_go_ast_v1.rs`) → Task E; §6.3 dep graph (`tree-sitter-go` parser-side) → Task A; §9.1 Tier-1 + oversize fallback → Task E (1A-2 split_oversize reused identically).
|
||||
- **No placeholders**: novel logic (`extract_package`, method receiver pointer detection, fixture, test assertions, dispatch arm additions) given concretely. Mechanical mirrors (chunker, integration test, snapshot test) pinned to exact existing files with substitutions.
|
||||
- **Type consistency**: `GoAstExtractor` / `GO_PARSER_VERSION = "code-go-v1"` / `CodeGoAstV1Chunker` / `VERSION_LABEL = "code-go-ast-v1"` used consistently across Tasks A-H. `MediaType::Code("go")` in routing + dispatch. `Citation::Code` with `lang: Some("go")` in integration test.
|
||||
494
docs/superpowers/plans/2026-05-20-p10-1c-jk-ast-chunker.md
Normal file
494
docs/superpowers/plans/2026-05-20-p10-1c-jk-ast-chunker.md
Normal file
@@ -0,0 +1,494 @@
|
||||
# p10-1C-JavaKotlin Implementation Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task.
|
||||
|
||||
**Goal:** Activate Java + Kotlin code ingest end-to-end. Mirror 1C-Go (PR #151 / v0.12.0) for Java (single-language scaffold) and Kotlin (additional top-level fn variant). Both use source-side `package` extraction (design §3.4 JVM convention).
|
||||
|
||||
**Architecture:** Same shape as 1B (multi-language single PR). 2 new tree-sitter grammars + 2 extractors + 2 chunkers + media routing + app dispatch arms. 1C-Go pattern is the closest template for source-side `package` extraction.
|
||||
|
||||
**Tech Stack:** Rust 2024 workspace, `tree-sitter` 0.26 (already), `tree-sitter-java` + `tree-sitter-kotlin` (NEW). 1A-2/1B/1C-Go infrastructure unchanged.
|
||||
|
||||
**Memory note:** Host has been OOM'd previously. Per-crate cargo only. ONE full-suite + clippy invocation in Task J.
|
||||
|
||||
---
|
||||
|
||||
## Pre-flight
|
||||
|
||||
Branch `feat/p10-1c-jk` already exists.
|
||||
|
||||
- [ ] **Disk hygiene**: `cargo clean` if heavy (last cleanup recovered 34 GB).
|
||||
|
||||
Reference files:
|
||||
- 1C-Go extractor: `crates/kebab-parse-code/src/go.rs` — closest template for source-side package extraction.
|
||||
- 1B Python extractor: `crates/kebab-parse-code/src/python.rs` — class-nesting recursion model (relevant for Java/Kotlin).
|
||||
- 1A-2 chunker: `crates/kebab-chunk/src/code_rust_ast_v1.rs` — duplicate-with-substitution.
|
||||
- 1B dispatch generalization: `crates/kebab-app/src/lib.rs::ingest_one_code_asset` 4-arm match (~L1645). 1C-Go already added `"go"`; this PR adds `"java"` + `"kotlin"`.
|
||||
|
||||
---
|
||||
|
||||
## Task A: Workspace deps (tree-sitter-java + tree-sitter-kotlin)
|
||||
|
||||
**Files:**
|
||||
- Modify: `Cargo.toml` (workspace `[workspace.dependencies]`, after `tree-sitter-go` line)
|
||||
- Modify: `crates/kebab-parse-code/Cargo.toml`
|
||||
|
||||
- [ ] **Step 1**: `cargo add tree-sitter-java tree-sitter-kotlin -p kebab-parse-code`. If `tree-sitter-kotlin` resolves to a fork name, verify the actively-maintained crate (e.g. check crates.io page / GitHub stars / last update). Likely `tree-sitter-kotlin` (without fork suffix) is the default.
|
||||
|
||||
- [ ] **Step 2**: Lift the two resolved versions into `[workspace.dependencies]` after `tree-sitter-go`:
|
||||
|
||||
```toml
|
||||
# JVM family grammars for code ingest (kebab-parse-code, p10-1C-JK).
|
||||
tree-sitter-java = "<resolved>"
|
||||
tree-sitter-kotlin = "<resolved>"
|
||||
```
|
||||
|
||||
Switch crate's entries to `{ workspace = true }`.
|
||||
|
||||
- [ ] **Step 3**: `cargo build -p kebab-parse-code` → clean. Unused dep warning is fine.
|
||||
|
||||
- [ ] **Step 4**: Commit:
|
||||
|
||||
```bash
|
||||
git add Cargo.toml Cargo.lock crates/kebab-parse-code/Cargo.toml
|
||||
git commit -m "build(p10-1c-jk): add tree-sitter-java + tree-sitter-kotlin workspace deps
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
|
||||
```
|
||||
|
||||
If the kotlin crate has a different actual name (e.g. `tree-sitter-kotlin-ng` or fork suffix), document the choice in the commit body briefly.
|
||||
|
||||
---
|
||||
|
||||
## Task B: source-fs routing `.java` / `.kt` / `.kts`
|
||||
|
||||
**Files:**
|
||||
- Modify: `crates/kebab-source-fs/src/media.rs` (add arm after the existing `.go` arm)
|
||||
- Test: same file's test module
|
||||
|
||||
- [ ] **Step 1 (failing test)** — add near `go_files_map_to_media_code_go`:
|
||||
|
||||
```rust
|
||||
#[test]
|
||||
fn java_kotlin_files_map_to_media_code() {
|
||||
assert_eq!(media_type_for(Path::new("a/b.java")), MediaType::Code("java".into()));
|
||||
assert_eq!(media_type_for(Path::new("a/b.kt")), MediaType::Code("kotlin".into()));
|
||||
assert_eq!(media_type_for(Path::new("a/b.kts")), MediaType::Code("kotlin".into()));
|
||||
}
|
||||
```
|
||||
|
||||
- [ ] **Step 2**: Run → FAIL.
|
||||
|
||||
- [ ] **Step 3**: Add the arms before the `_ => MediaType::Other(ext)` fallback (after `"go" => ...`):
|
||||
|
||||
```rust
|
||||
// p10-1C-JK: JVM family (Java + Kotlin) ingest activated.
|
||||
"java" => MediaType::Code("java".into()),
|
||||
"kt" | "kts" => MediaType::Code("kotlin".into()),
|
||||
```
|
||||
|
||||
- [ ] **Step 4**: Run → PASS. `cargo test -p kebab-source-fs` → no regression.
|
||||
|
||||
- [ ] **Step 5**: clippy clean, commit.
|
||||
|
||||
```bash
|
||||
cargo clippy -p kebab-source-fs --all-targets -- -D warnings
|
||||
git add crates/kebab-source-fs/
|
||||
git commit -m "feat(p10-1c-jk): route .java/.kt/.kts to MediaType::Code
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task C: App dispatch + bail arms for "java" + "kotlin"
|
||||
|
||||
**Files:**
|
||||
- Modify: `crates/kebab-app/src/lib.rs`
|
||||
|
||||
- [ ] **Step 1**: Find the dispatch arm guard (currently `matches!(lang.as_str(), "rust" | "python" | "typescript" | "javascript" | "go")`). Add `"java"` + `"kotlin"`:
|
||||
|
||||
```rust
|
||||
MediaType::Code(lang)
|
||||
if matches!(lang.as_str(),
|
||||
"rust" | "python" | "typescript" | "javascript" | "go" | "java" | "kotlin") =>
|
||||
```
|
||||
|
||||
- [ ] **Step 2**: In `ingest_one_code_asset` the 4 `match code_lang` blocks add `"java"` and `"kotlin"` arms that `bail!()` for now:
|
||||
|
||||
```rust
|
||||
"java" => anyhow::bail!("java ingest not yet wired (p10-1c-jk Task F)"),
|
||||
"kotlin" => anyhow::bail!("kotlin ingest not yet wired (p10-1c-jk Task I)"),
|
||||
```
|
||||
|
||||
(in each of the 4 blocks before the `other =>` catch-all).
|
||||
|
||||
- [ ] **Step 3**: Verify per-crate:
|
||||
- `cargo test -p kebab-app --lib` → 52 stay green
|
||||
- `cargo test -p kebab-app --test code_ingest_smoke` → 7 stay green
|
||||
- `cargo clippy -p kebab-app --all-targets -- -D warnings` clean
|
||||
|
||||
- [ ] **Step 4**: Commit:
|
||||
|
||||
```bash
|
||||
git add crates/kebab-app/
|
||||
git commit -m "refactor(p10-1c-jk): add java + kotlin to ingest dispatch allowlist (bail until Tasks F/I)
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task D: `JavaAstExtractor`
|
||||
|
||||
**Files:**
|
||||
- Create: `crates/kebab-parse-code/src/java.rs`
|
||||
- Modify: `crates/kebab-parse-code/src/lib.rs` (`pub mod java;` + re-exports `JAVA_PARSER_VERSION`, `JavaAstExtractor`)
|
||||
- Create: `crates/kebab-parse-code/tests/fixtures/sample.java`
|
||||
|
||||
Scaffold mirrors `crates/kebab-parse-code/src/go.rs` (1C-Go) — single-language with source-side `package` extraction. Differences:
|
||||
|
||||
### Constants
|
||||
|
||||
```rust
|
||||
pub const PARSER_VERSION: &str = "code-java-v1";
|
||||
pub struct JavaAstExtractor;
|
||||
// supports: matches!(m, MediaType::Code(l) if l == "java")
|
||||
// code_lang = Some("java"), SourceType::Note, repo via detect_repo
|
||||
```
|
||||
|
||||
### Package extraction (Java)
|
||||
|
||||
tree-sitter-java grammar:
|
||||
- Root: `program`
|
||||
- `package_declaration` (top-level child) → contains `scoped_identifier` (dotted) OR `identifier` (single-segment)
|
||||
|
||||
```rust
|
||||
fn extract_package(root: tree_sitter::Node, src: &str) -> Option<String> {
|
||||
let mut cur = root.walk();
|
||||
for child in root.named_children(&mut cur) {
|
||||
if child.kind() == "package_declaration" {
|
||||
// package_declaration has scoped_identifier OR identifier as first named child
|
||||
let mut c2 = child.walk();
|
||||
for sub in child.named_children(&mut c2) {
|
||||
if sub.kind() == "scoped_identifier" || sub.kind() == "identifier" {
|
||||
return Some(src[sub.start_byte()..sub.end_byte()].to_string());
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
None
|
||||
}
|
||||
```
|
||||
|
||||
(Verify field names against tree-sitter-java's node-types.json if any field differs.)
|
||||
|
||||
### AST mapping
|
||||
|
||||
| node kind | unit | symbol |
|
||||
|-----------|------|--------|
|
||||
| `class_declaration` (name field) | 1 + recurse body | `<pkg>.<ClassName>` |
|
||||
| `interface_declaration` (name) | 1 + recurse body | `<pkg>.<InterfaceName>` |
|
||||
| `enum_declaration` (name) | 1 | `<pkg>.<EnumName>` |
|
||||
| `record_declaration` (name, Java 14+) | 1 | `<pkg>.<RecordName>` |
|
||||
| `annotation_type_declaration` (name) | 1 | `<pkg>.<AnnotationName>` |
|
||||
| Inside class body: `method_declaration` (name) | 1 | `<pkg>.<Class>.<method>` |
|
||||
| Inside class body: `constructor_declaration` (name = class name) | 1 | `<pkg>.<Class>.<ClassName>` (matches Java convention) |
|
||||
| Nested classes recurse with class name pushed onto mod_path | as above | `<pkg>.<Outer>.<Inner>` etc. |
|
||||
| `import_declaration`, `package_declaration` | glue | `<pkg>.<top-level>` |
|
||||
| `field_declaration` at top of class | NOT a unit in 1C-JK (would explode unit count for value-only fields) | n/a |
|
||||
|
||||
`unit_start` walks `comment` siblings; Java has `@interface` annotations but those are part of `annotation_type_declaration` itself, not separate sibling nodes.
|
||||
|
||||
`mod_path` = class nesting (like 1B Python). Empty at file top level.
|
||||
|
||||
### Fixture `tests/fixtures/sample.java`:
|
||||
|
||||
```java
|
||||
// sample.java
|
||||
package com.kebab.chunk;
|
||||
|
||||
import java.util.List;
|
||||
import java.util.stream.Collectors;
|
||||
|
||||
/**
|
||||
* Heading-aware Markdown chunker.
|
||||
*/
|
||||
public class MdHeadingV1Chunker {
|
||||
private final String name;
|
||||
|
||||
public MdHeadingV1Chunker(String name) {
|
||||
this.name = name;
|
||||
}
|
||||
|
||||
public List<String> chunkDoc(String input) {
|
||||
return List.of(name, input);
|
||||
}
|
||||
|
||||
public String getName() {
|
||||
return name;
|
||||
}
|
||||
|
||||
public static class Builder {
|
||||
private String name;
|
||||
public Builder withName(String n) { this.name = n; return this; }
|
||||
public MdHeadingV1Chunker build() { return new MdHeadingV1Chunker(name); }
|
||||
}
|
||||
}
|
||||
|
||||
interface Stringer {
|
||||
String asString();
|
||||
}
|
||||
|
||||
enum Mode { DEFAULT, FAST }
|
||||
```
|
||||
|
||||
### Test module (inline `#[cfg(test)] mod tests`)
|
||||
|
||||
Mirror 1C-Go shape:
|
||||
|
||||
```rust
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use kebab_core::{Block, MediaType, SourceSpan};
|
||||
|
||||
fn extract_fixture() -> kebab_core::CanonicalDocument {
|
||||
let bytes = std::fs::read(
|
||||
concat!(env!("CARGO_MANIFEST_DIR"), "/tests/fixtures/sample.java"),
|
||||
).unwrap();
|
||||
let asset = crate::rust::tests_support::fixed_code_asset(
|
||||
"crates/x/src/sample.java", "java",
|
||||
);
|
||||
let cfg = kebab_core::ExtractConfig::default();
|
||||
let root = std::path::PathBuf::from("/tmp");
|
||||
let ctx = kebab_core::ExtractContext { asset: &asset, workspace_root: &root, config: &cfg };
|
||||
JavaAstExtractor::new().extract(&ctx, &bytes).unwrap()
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn extractor_supports_only_media_code_java() { /* ... */ }
|
||||
|
||||
#[test]
|
||||
fn java_units_match_design_3_4_symbols() {
|
||||
let doc = extract_fixture();
|
||||
let mut syms: Vec<String> = doc.blocks.iter().filter_map(|b| match b {
|
||||
Block::Code(c) => match &c.common.source_span {
|
||||
SourceSpan::Code { symbol, lang, .. } => {
|
||||
assert_eq!(lang.as_deref(), Some("java"));
|
||||
symbol.clone()
|
||||
}
|
||||
_ => None,
|
||||
},
|
||||
_ => None,
|
||||
}).collect();
|
||||
syms.sort();
|
||||
// workspace path → package extracted from source = com.kebab.chunk
|
||||
assert!(syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker"), "got {syms:?}");
|
||||
assert!(syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.MdHeadingV1Chunker")); // constructor
|
||||
assert!(syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.chunkDoc"));
|
||||
assert!(syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.getName"));
|
||||
assert!(syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Builder"));
|
||||
assert!(syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Builder.withName"));
|
||||
assert!(syms.iter().any(|s| s == "com.kebab.chunk.MdHeadingV1Chunker.Builder.build"));
|
||||
assert!(syms.iter().any(|s| s == "com.kebab.chunk.Stringer"));
|
||||
assert!(syms.iter().any(|s| s == "com.kebab.chunk.Mode"));
|
||||
assert!(syms.iter().any(|s| s == "com.kebab.chunk.<top-level>"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deterministic_across_runs() {
|
||||
let a = extract_fixture();
|
||||
for _ in 0..50 { assert_eq!(extract_fixture().blocks, a.blocks); }
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Wire into lib.rs
|
||||
|
||||
```rust
|
||||
pub mod java;
|
||||
pub use java::{PARSER_VERSION as JAVA_PARSER_VERSION, JavaAstExtractor};
|
||||
```
|
||||
|
||||
### Verify + commit
|
||||
|
||||
- `cargo test -p kebab-parse-code` → all pass
|
||||
- `cargo clippy -p kebab-parse-code --all-targets -- -D warnings` clean
|
||||
- commit `feat(p10-1c-jk): tree-sitter-java AST extractor (JavaAstExtractor)`
|
||||
|
||||
---
|
||||
|
||||
## Task E: `code-java-ast-v1` chunker
|
||||
|
||||
Identical pattern to 1C-Go Task E. Duplicate `code_rust_ast_v1.rs` with substitutions:
|
||||
- `VERSION_LABEL = "code-java-ast-v1"`, struct `CodeJavaAstV1Chunker`
|
||||
- error message + module doc-comment prose
|
||||
- Test module: parser_version `"code-java-v1"`, code_lang `"java"`
|
||||
- Keep cross-chunker `policy_hash_matches_md_heading_v1`
|
||||
|
||||
Wire into `crates/kebab-chunk/src/lib.rs` (alphabetical). Verify + commit.
|
||||
|
||||
---
|
||||
|
||||
## Task F: Activate Java in app dispatch
|
||||
|
||||
Replace the `"java"` `bail!()` arms in `ingest_one_code_asset` with real calls (`JavaAstExtractor` + `CodeJavaAstV1Chunker`). Add integration test `java_file_ingests_and_searches_as_code_citation` (mirror 1C-Go test, fixture `pkg_dir/Foo.java` with `package com.foo;` and `public class Foo { public String bar() { ... } }`, assert symbol `com.foo.Foo.bar`).
|
||||
|
||||
Verify + commit.
|
||||
|
||||
---
|
||||
|
||||
## Task G: `KotlinAstExtractor`
|
||||
|
||||
**Files:**
|
||||
- Create: `crates/kebab-parse-code/src/kotlin.rs`
|
||||
- Modify: `crates/kebab-parse-code/src/lib.rs`
|
||||
- Create: `crates/kebab-parse-code/tests/fixtures/sample.kt`
|
||||
|
||||
Constants: `PARSER_VERSION = "code-kotlin-v1"`, `KotlinAstExtractor`, `code_lang = "kotlin"`.
|
||||
|
||||
### Package extraction (Kotlin)
|
||||
|
||||
tree-sitter-kotlin grammar:
|
||||
- Root: `source_file`
|
||||
- `package_header` (top-level) → contains `identifier` (dotted is single `identifier` node text; verify against node-types.json)
|
||||
|
||||
```rust
|
||||
fn extract_package(root: tree_sitter::Node, src: &str) -> Option<String> {
|
||||
let mut cur = root.walk();
|
||||
for child in root.named_children(&mut cur) {
|
||||
if child.kind() == "package_header" {
|
||||
let mut c2 = child.walk();
|
||||
for sub in child.named_children(&mut c2) {
|
||||
if sub.kind() == "identifier" {
|
||||
return Some(src[sub.start_byte()..sub.end_byte()].to_string());
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
None
|
||||
}
|
||||
```
|
||||
|
||||
(Verify against tree-sitter-kotlin's node-types.json — Kotlin grammar varies more than Java's.)
|
||||
|
||||
### AST mapping (Kotlin)
|
||||
|
||||
| node kind | unit | symbol |
|
||||
|-----------|------|--------|
|
||||
| `class_declaration` (name field) — covers `class`, `data class`, `sealed class`, `enum class`, `interface` (Kotlin's interface is a class_declaration variant) | 1 + recurse body | `<pkg>.<ClassName>` |
|
||||
| `object_declaration` (name) — singleton | 1 + recurse | `<pkg>.<ObjectName>` |
|
||||
| `function_declaration` (name) | 1 | `<pkg>.<fn_name>` (top-level) or `<pkg>.<Class>.<method>` (inside class) |
|
||||
| Inside class body: `function_declaration` → method | 1 | `<pkg>.<Class>.<method>` |
|
||||
| `property_declaration` at top-level (`val` / `var`) | glue | `<top-level>` (Kotlin top-level properties are common — keep as glue not unit) |
|
||||
| `import_header`, `package_header` | glue | `<top-level>` |
|
||||
|
||||
(Detect class-vs-interface via modifier; for 1C 1차 treat both as `class_declaration` arm — symbol differs only via name. If tree-sitter-kotlin exposes `interface` keyword via modifier list, mention in HOTFIXES if special handling needed.)
|
||||
|
||||
### Fixture `sample.kt`:
|
||||
|
||||
```kotlin
|
||||
// sample.kt
|
||||
package com.kebab.chunk
|
||||
|
||||
import java.util.List
|
||||
|
||||
/**
|
||||
* Heading-aware Markdown chunker.
|
||||
*/
|
||||
class MdHeadingV1Chunker(val name: String) {
|
||||
fun chunkDoc(input: String): List<String> = listOf(name, input)
|
||||
|
||||
fun getName(): String = name
|
||||
|
||||
companion object {
|
||||
fun withName(n: String): MdHeadingV1Chunker = MdHeadingV1Chunker(n)
|
||||
}
|
||||
}
|
||||
|
||||
interface Stringer {
|
||||
fun asString(): String
|
||||
}
|
||||
|
||||
enum class Mode { DEFAULT, FAST }
|
||||
|
||||
fun freeFunction(x: Int): Int = x + 1
|
||||
|
||||
object Singleton {
|
||||
fun ping(): String = "pong"
|
||||
}
|
||||
```
|
||||
|
||||
### Test module — assert symbols
|
||||
|
||||
```rust
|
||||
// Asserted symbols:
|
||||
"com.kebab.chunk.MdHeadingV1Chunker"
|
||||
"com.kebab.chunk.MdHeadingV1Chunker.chunkDoc"
|
||||
"com.kebab.chunk.MdHeadingV1Chunker.getName"
|
||||
"com.kebab.chunk.MdHeadingV1Chunker.Companion" // companion object (verify name)
|
||||
"com.kebab.chunk.MdHeadingV1Chunker.Companion.withName" // method on companion
|
||||
"com.kebab.chunk.Stringer"
|
||||
"com.kebab.chunk.Mode"
|
||||
"com.kebab.chunk.freeFunction" // top-level fn (Kotlin-specific!)
|
||||
"com.kebab.chunk.Singleton"
|
||||
"com.kebab.chunk.Singleton.ping"
|
||||
"com.kebab.chunk.<top-level>" // import + property glue
|
||||
```
|
||||
|
||||
(Companion object: tree-sitter-kotlin may use `companion_object` or `object_declaration` with `companion` modifier — verify and adjust the symbol if `Companion` isn't the right name.)
|
||||
|
||||
### Wire into lib.rs
|
||||
|
||||
```rust
|
||||
pub mod kotlin;
|
||||
pub use kotlin::{PARSER_VERSION as KOTLIN_PARSER_VERSION, KotlinAstExtractor};
|
||||
```
|
||||
|
||||
Verify + commit.
|
||||
|
||||
---
|
||||
|
||||
## Task H: `code-kotlin-ast-v1` chunker
|
||||
|
||||
Same pattern as Task E. Substitute kotlin labels. Verify + commit.
|
||||
|
||||
---
|
||||
|
||||
## Task I: Activate Kotlin in app dispatch
|
||||
|
||||
Replace `"kotlin"` bail arms with real calls. Add integration test `kotlin_file_ingests_and_searches_as_code_citation`. Verify + commit.
|
||||
|
||||
---
|
||||
|
||||
## Task J: Snapshots + full-suite + SMOKE
|
||||
|
||||
- Create 2 snapshot tests (`code_java_ast_snapshot.rs`, `code_kotlin_ast_snapshot.rs`) + baselines. Mirror 1C-Go Task G snapshot test.
|
||||
- ONE workspace test + clippy invocation.
|
||||
- Manual SMOKE: write a `.java` and `.kt` file in TempDir, ingest, search.
|
||||
|
||||
Verify + commit (snapshot only).
|
||||
|
||||
---
|
||||
|
||||
## Task K: Docs + version bump
|
||||
|
||||
- README + HANDOFF + ARCHITECTURE + SMOKE + 2 INDEX updates + design §10.1.
|
||||
- `Cargo.toml` version `0.12.0 → 0.13.0` (minor, surface 확장).
|
||||
|
||||
Commit `docs(p10-1c-jk): ... + chore: bump 0.12.0 → 0.13.0`.
|
||||
|
||||
---
|
||||
|
||||
## Finalize
|
||||
|
||||
`gitea-pr` → review loop → merge → main pull → branch cleanup → `cargo clean` → `gitea-release v0.13.0`.
|
||||
|
||||
---
|
||||
|
||||
## Self-Review (filled by plan author)
|
||||
|
||||
- **Spec coverage**: design §1C Java + Kotlin → Tasks D-I; §3.4 symbol path → extractor (Java D, Kotlin G); §6.1/§6.2 module structure → Tasks D/E/G/H; §6.3 dep graph → Task A; §9.1 Tier-1 + oversize fallback → chunkers E/H.
|
||||
- **No placeholders**: novel logic (Java `extract_package`, Kotlin `extract_package`, AST walk arm tables) given concretely. Chunkers (E, H) are explicit "duplicate code_rust_ast_v1.rs with substitution X/Y/Z".
|
||||
- **Type consistency**: `JavaAstExtractor` / `JAVA_PARSER_VERSION` / `CodeJavaAstV1Chunker` + `KotlinAstExtractor` / `KOTLIN_PARSER_VERSION` / `CodeKotlinAstV1Chunker` used consistently. `MediaType::Code("java")` / `("kotlin")` in routing + dispatch.
|
||||
- **Kotlin grammar risk**: noted — tree-sitter-kotlin's exact node kinds (`class_declaration` vs `object_declaration`, `companion_object` vs companion modifier, `package_header` vs `package_directive`) should be verified against the resolved crate's node-types.json. Pin contract via test fixture; HOTFIXES any deviation found during implementation.
|
||||
1343
docs/superpowers/plans/2026-05-20-p10-2-tier2-resource-aware.md
Normal file
1343
docs/superpowers/plans/2026-05-20-p10-2-tier2-resource-aware.md
Normal file
File diff suppressed because it is too large
Load Diff
930
docs/superpowers/plans/2026-05-21-p10-1d-c-cpp-ast-chunker.md
Normal file
930
docs/superpowers/plans/2026-05-21-p10-1d-c-cpp-ast-chunker.md
Normal file
@@ -0,0 +1,930 @@
|
||||
# p10-1D C + C++ AST Chunkers Implementation Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||
|
||||
**Goal:** Activate C + C++ code ingest end-to-end. P10 Tier 1 chunker family final entry.
|
||||
|
||||
**Architecture:** Same shape as 1B (multi-language single PR) and 1C-JK (JVM family). 2 new tree-sitter grammars + 2 extractors + 2 chunkers + media routing (delegated via `code_lang_for_path`, no change) + app dispatch arms. C symbol = function name only; C++ symbol = `namespace::Class::method` via recursive class/namespace nesting (Java/Kotlin + Python hybrid).
|
||||
|
||||
**Tech Stack:** Rust 2024 workspace, `tree-sitter` 0.26 (already), `tree-sitter-c` + `tree-sitter-cpp` (NEW). 1A-2/1B/1C/p10-2/p10-3 infrastructure unchanged.
|
||||
|
||||
**Memory note:** Host has been OOM'd previously (재부팅 사례). Per-crate cargo only. ONE full-suite + clippy invocation in Task J. NO `cargo test --workspace` outside that gate.
|
||||
|
||||
---
|
||||
|
||||
## Pre-flight
|
||||
|
||||
Branch `feat/p10-1d-c-cpp` already exists (spec commit `8add684`).
|
||||
|
||||
- [ ] **Disk hygiene**: `df -h /` 점검. 80% 넘으면 `cargo clean`.
|
||||
|
||||
Reference files:
|
||||
- 1C-JK extractor: `crates/kebab-parse-code/src/{java,kotlin}.rs` — closest template for source-side identifier prefix (package vs namespace).
|
||||
- 1B Python extractor: `crates/kebab-parse-code/src/python.rs` — class-nesting recursion model (relevant for C++ class nesting).
|
||||
- 1A-2 chunker: `crates/kebab-chunk/src/code_rust_ast_v1.rs` — duplicate-with-substitution pattern.
|
||||
- 1B/1C/p10-2/p10-3 dispatch generalization: `crates/kebab-app/src/lib.rs::ingest_one_code_asset` (~L1796–2116). Current allowlist + 4-arm match.
|
||||
- spec: `tasks/p10/p10-1d-c-cpp-ast-chunker.md`.
|
||||
|
||||
---
|
||||
|
||||
## Task A: Workspace deps (tree-sitter-c + tree-sitter-cpp)
|
||||
|
||||
**Files:**
|
||||
- Modify: `Cargo.toml` (`[workspace.dependencies]`, after `tree-sitter-kotlin-ng`)
|
||||
- Modify: `crates/kebab-parse-code/Cargo.toml`
|
||||
|
||||
- [ ] **Step 1**: `cargo add tree-sitter-c tree-sitter-cpp -p kebab-parse-code`. If either crate's actively-maintained name differs (e.g. `tree-sitter-cpp` vs `tree-sitter-cpp-ng`), verify on crates.io. The `tree-sitter-c` 0.24 / `tree-sitter-cpp` 0.23 line is the most common; verify compatibility with workspace `tree-sitter = "0.26"` (likely already supported via the `tree-sitter-language` shim).
|
||||
|
||||
- [ ] **Step 2**: Lift the two resolved versions into `[workspace.dependencies]` (after `tree-sitter-kotlin-ng`):
|
||||
|
||||
```toml
|
||||
# C/C++ family grammars for code ingest (kebab-parse-code, p10-1D).
|
||||
tree-sitter-c = "<resolved>"
|
||||
tree-sitter-cpp = "<resolved>"
|
||||
```
|
||||
|
||||
Switch crate's `Cargo.toml` entries to `{ workspace = true }`.
|
||||
|
||||
- [ ] **Step 3**: `cargo build -p kebab-parse-code` → clean. Unused dep warning is fine.
|
||||
|
||||
- [ ] **Step 4**: Commit:
|
||||
|
||||
```bash
|
||||
git add Cargo.toml Cargo.lock crates/kebab-parse-code/Cargo.toml
|
||||
git commit -m "$(cat <<'EOF'
|
||||
build(p10-1d): add tree-sitter-c + tree-sitter-cpp workspace deps
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
If a crate's resolved name has a non-obvious fork suffix (e.g. `tree-sitter-cpp-ng`), document it in the commit body.
|
||||
|
||||
---
|
||||
|
||||
## Task B: C AST extractor (`kebab-parse-code/src/c.rs`)
|
||||
|
||||
**Files:**
|
||||
- Create: `crates/kebab-parse-code/src/c.rs`
|
||||
- Modify: `crates/kebab-parse-code/src/lib.rs` (pub mod + `C_PARSER_VERSION` const)
|
||||
|
||||
- [ ] **Step 1**: Create `crates/kebab-parse-code/src/c.rs`. Mirror `crates/kebab-parse-code/src/go.rs` (closest template — single-language, no namespace/package nesting, top-level units). Replace tree-sitter-go with tree-sitter-c:
|
||||
|
||||
```rust
|
||||
//! p10-1D: C AST extractor.
|
||||
|
||||
use crate::traits::{Extractor, ExtractContext};
|
||||
use anyhow::{Context, Result};
|
||||
use kebab_core::{Block, BlockId, CanonicalDocument, CodeBlock, CommonBlock, /*..*/, SourceSpan, id_for_block, id_for_doc};
|
||||
use tree_sitter::Parser;
|
||||
|
||||
pub const C_PARSER_VERSION: &str = concat!("tree-sitter-c-", env!("CARGO_PKG_VERSION"));
|
||||
// Or use the tree-sitter-c crate version: better to hardcode for stability.
|
||||
// Look at how go.rs / rust.rs / etc. set their PARSER_VERSION.
|
||||
|
||||
pub struct CAstExtractor {
|
||||
parser: Parser,
|
||||
}
|
||||
|
||||
impl CAstExtractor {
|
||||
pub fn new() -> Self {
|
||||
let mut parser = Parser::new();
|
||||
parser.set_language(&tree_sitter_c::LANGUAGE.into()).expect("load tree-sitter-c");
|
||||
Self { parser }
|
||||
}
|
||||
}
|
||||
|
||||
impl Extractor for CAstExtractor {
|
||||
fn extract(&mut self, ctx: &ExtractContext, bytes: &[u8]) -> Result<CanonicalDocument> {
|
||||
// ... mirror go.rs:
|
||||
// 1. parse the tree
|
||||
// 2. iterate source_file's named_children
|
||||
// 3. for each top-level node:
|
||||
// - function_definition → emit unit (symbol = fn name)
|
||||
// - struct_specifier (named) → emit unit (symbol = struct name)
|
||||
// - enum_specifier (named) → emit unit (symbol = enum name)
|
||||
// - union_specifier (named) → emit unit (symbol = union name)
|
||||
// - declaration → glue
|
||||
// - preproc_include / preproc_def / preproc_function_def / preproc_ifdef → glue
|
||||
// - else → glue
|
||||
// 4. <top-level> glue chunk if any glue accumulated
|
||||
// 5. <module> post-pass if 0 units
|
||||
// ...
|
||||
todo!("mirror go.rs structure with C-specific node-kind names")
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**ACTION**: Read `crates/kebab-parse-code/src/go.rs` in full first. It's the closest template — single-language, no namespace prefix to thread through (C is even simpler than Go since there's no `package`). Port the structure: parse → iterate top-level → match on node-kind → emit units or accumulate glue.
|
||||
|
||||
Node-kind name reference (tree-sitter-c): `function_definition`, `struct_specifier`, `enum_specifier`, `union_specifier`, `declaration`, `preproc_*`. Confirm by checking the crate's `node-types.json` if uncertain.
|
||||
|
||||
**Function name extraction**: `function_definition` has a `declarator` field. The innermost `identifier` of that declarator is the function name. Mirror how go.rs extracts function names — it uses tree-sitter field traversal.
|
||||
|
||||
- [ ] **Step 2**: Register the module in `crates/kebab-parse-code/src/lib.rs`:
|
||||
|
||||
```rust
|
||||
pub mod c;
|
||||
pub use c::{CAstExtractor, C_PARSER_VERSION};
|
||||
```
|
||||
|
||||
- [ ] **Step 3**: Build:
|
||||
|
||||
```bash
|
||||
cargo build -p kebab-parse-code 2>&1 | tail -5
|
||||
```
|
||||
|
||||
Expected: clean.
|
||||
|
||||
- [ ] **Step 4**: Commit (no test yet — Task D adds the snapshot test):
|
||||
|
||||
```bash
|
||||
git add crates/kebab-parse-code/src/c.rs crates/kebab-parse-code/src/lib.rs
|
||||
git commit -m "$(cat <<'EOF'
|
||||
feat(p10-1d): C AST extractor (tree-sitter-c)
|
||||
|
||||
Top-level units: function_definition (symbol = fn name), struct_specifier,
|
||||
enum_specifier, union_specifier (each emits 1 unit with the symbol being
|
||||
the named identifier). Preprocessor directives + top-level declarations
|
||||
group into a <top-level> glue chunk. Empty file or zero units → <module>
|
||||
post-pass.
|
||||
|
||||
C symbol = function name only — no namespace, no class nesting (design §3.4).
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task C: C++ AST extractor (`kebab-parse-code/src/cpp.rs`)
|
||||
|
||||
**Files:**
|
||||
- Create: `crates/kebab-parse-code/src/cpp.rs`
|
||||
- Modify: `crates/kebab-parse-code/src/lib.rs`
|
||||
|
||||
- [ ] **Step 1**: Create `crates/kebab-parse-code/src/cpp.rs`. The closest template is `crates/kebab-parse-code/src/java.rs` (1C-JK) — it handles package prefix + class nesting via recursion. C++ adds namespace nesting (multiple levels possible).
|
||||
|
||||
Pseudocode:
|
||||
|
||||
```rust
|
||||
//! p10-1D: C++ AST extractor.
|
||||
|
||||
use crate::traits::{Extractor, ExtractContext};
|
||||
use anyhow::{Context, Result};
|
||||
use kebab_core::{/* ... */};
|
||||
use tree_sitter::{Node, Parser};
|
||||
|
||||
pub const CPP_PARSER_VERSION: &str = "tree-sitter-cpp-<resolved>";
|
||||
|
||||
pub struct CppAstExtractor { parser: Parser }
|
||||
|
||||
impl CppAstExtractor {
|
||||
pub fn new() -> Self {
|
||||
let mut parser = Parser::new();
|
||||
parser.set_language(&tree_sitter_cpp::LANGUAGE.into()).expect("load tree-sitter-cpp");
|
||||
Self { parser }
|
||||
}
|
||||
|
||||
fn visit(&self, node: Node, source: &[u8], prefix: &[&str], units: &mut Vec<(String, Node)>, glue: &mut Vec<Node>) {
|
||||
// prefix is the namespace/class chain so far (e.g. ["kebab", "chunk", "MdHeadingV1Chunker"]).
|
||||
for child in node.named_children(&mut node.walk()) {
|
||||
match child.kind() {
|
||||
"namespace_definition" => {
|
||||
let name = child.child_by_field_name("name")
|
||||
.and_then(|n| n.utf8_text(source).ok())
|
||||
.unwrap_or("<anonymous>");
|
||||
let mut new_prefix = prefix.to_vec();
|
||||
new_prefix.push(name);
|
||||
let body = child.child_by_field_name("body").unwrap_or(child);
|
||||
self.visit(body, source, &new_prefix, units, glue);
|
||||
}
|
||||
"class_specifier" | "struct_specifier" if child.child_by_field_name("name").is_some() => {
|
||||
let name = child.child_by_field_name("name")
|
||||
.and_then(|n| n.utf8_text(source).ok())
|
||||
.unwrap_or("<anonymous>");
|
||||
// Emit the class itself as a unit.
|
||||
let symbol = build_symbol(prefix, &[], name); // e.g. "kebab::chunk::Foo"
|
||||
units.push((symbol, child));
|
||||
// Recurse for nested classes / methods.
|
||||
let mut new_prefix = prefix.to_vec();
|
||||
new_prefix.push(name);
|
||||
let body = child.child_by_field_name("body").unwrap_or(child);
|
||||
self.visit(body, source, &new_prefix, units, glue);
|
||||
}
|
||||
"function_definition" => {
|
||||
// declarator may be qualified_identifier (out-of-class def) or plain identifier.
|
||||
let symbol = extract_fn_symbol(child, source, prefix);
|
||||
units.push((symbol, child));
|
||||
// Do NOT recurse into function body — inner classes/lambdas left to a future revision.
|
||||
}
|
||||
"template_declaration" => {
|
||||
// Recurse: unwrap to inner declarator (function_definition or class_specifier)
|
||||
// and treat it as if it were directly there. Template params NOT in symbol.
|
||||
self.visit(child, source, prefix, units, glue);
|
||||
}
|
||||
"enum_specifier" if child.child_by_field_name("name").is_some() => {
|
||||
let name = child.child_by_field_name("name").and_then(|n| n.utf8_text(source).ok()).unwrap_or("<anonymous>");
|
||||
let symbol = build_symbol(prefix, &[], name);
|
||||
units.push((symbol, child));
|
||||
}
|
||||
"concept_definition" => {
|
||||
let name = /* extract */;
|
||||
let symbol = build_symbol(prefix, &[], &name);
|
||||
units.push((symbol, child));
|
||||
}
|
||||
_ => glue.push(child),
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
fn build_symbol(prefix: &[&str], extras: &[&str], leaf: &str) -> String {
|
||||
// Join with ::
|
||||
let mut parts: Vec<&str> = prefix.iter().copied().collect();
|
||||
parts.extend_from_slice(extras);
|
||||
parts.push(leaf);
|
||||
parts.join("::")
|
||||
}
|
||||
|
||||
fn extract_fn_symbol(node: Node, source: &[u8], prefix: &[&str]) -> String {
|
||||
// function_definition.declarator may be a function_declarator wrapping a
|
||||
// qualified_identifier (out-of-class def like `void Foo::bar(){}`) or a
|
||||
// plain identifier (free fn or in-namespace fn).
|
||||
// Need to walk down to the leaf identifier and any qualifier chain.
|
||||
// For qualified_identifier "Foo::bar::baz", break into ["Foo", "bar"] qualifier + "baz" leaf.
|
||||
// ...
|
||||
todo!("walk declarator → qualified_identifier → assemble symbol with prefix")
|
||||
}
|
||||
|
||||
// Extractor impl: parse, visit(root, ...), emit chunks-of-blocks per (symbol, node) pair + <top-level> glue + <module> fallback.
|
||||
```
|
||||
|
||||
This is the most intricate extractor in p10-1D. **Action**: read `crates/kebab-parse-code/src/java.rs` for the recursion pattern, then `crates/kebab-parse-code/src/python.rs` for the class-nesting pattern, and combine. tree-sitter-cpp's node-types.json (or a quick `tree-sitter parse` against a sample file) confirms exact node-kind names.
|
||||
|
||||
- [ ] **Step 2**: Register in `crates/kebab-parse-code/src/lib.rs`:
|
||||
|
||||
```rust
|
||||
pub mod cpp;
|
||||
pub use cpp::{CppAstExtractor, CPP_PARSER_VERSION};
|
||||
```
|
||||
|
||||
- [ ] **Step 3**: Build:
|
||||
|
||||
```bash
|
||||
cargo build -p kebab-parse-code 2>&1 | tail -5
|
||||
```
|
||||
|
||||
Expected: clean.
|
||||
|
||||
- [ ] **Step 4**: Commit:
|
||||
|
||||
```bash
|
||||
git add crates/kebab-parse-code/src/cpp.rs crates/kebab-parse-code/src/lib.rs
|
||||
git commit -m "$(cat <<'EOF'
|
||||
feat(p10-1d): C++ AST extractor (tree-sitter-cpp)
|
||||
|
||||
Symbol = namespace::Class::method via recursive visit. namespace_definition
|
||||
pushes namespace name (anonymous → <anonymous>). class_specifier / struct_specifier
|
||||
(named) emit class unit + recurse with class name pushed. function_definition
|
||||
emits method unit (symbol may include qualified_identifier prefix for
|
||||
out-of-class definitions). template_declaration unwraps to inner declarator
|
||||
(template params NOT in symbol). enum_specifier + concept_definition emit
|
||||
type-level units. extern "C" block content + using/include/define → glue.
|
||||
|
||||
Constructor / destructor symbols use Class::Class / Class::~Class
|
||||
convention. Operator overloads keep operator+ form.
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task D: C chunker + snapshot test
|
||||
|
||||
**Files:**
|
||||
- Create: `crates/kebab-chunk/src/code_c_ast_v1.rs`
|
||||
- Create: `crates/kebab-chunk/tests/fixtures/sample.c`
|
||||
- Create: `crates/kebab-chunk/tests/code_c_ast_snapshot.rs`
|
||||
- Modify: `crates/kebab-chunk/src/lib.rs`
|
||||
|
||||
- [ ] **Step 1**: Create `crates/kebab-chunk/src/code_c_ast_v1.rs`. **Mirror `crates/kebab-chunk/src/code_go_ast_v1.rs`** (closest 1-extractor pattern, no nesting):
|
||||
|
||||
```rust
|
||||
//! p10-1D: C AST chunker.
|
||||
|
||||
use crate::tier2_shared::build_chunk;
|
||||
use crate::{Chunker, ChunkPolicy};
|
||||
use anyhow::Result;
|
||||
use kebab_core::{Block, Chunk, Document};
|
||||
|
||||
pub const VERSION_LABEL: &str = "code-c-ast-v1";
|
||||
|
||||
pub struct CodeCAstV1Chunker;
|
||||
|
||||
impl Chunker for CodeCAstV1Chunker {
|
||||
fn chunker_version(&self) -> &'static str { VERSION_LABEL }
|
||||
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
|
||||
crate::tier2_shared::policy_hash(policy)
|
||||
}
|
||||
fn chunk(&self, doc: &Document, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
|
||||
// Mirror code_go_ast_v1.rs's body — iterate doc.blocks, each Block::Code
|
||||
// contributes 1 chunk via build_chunk. Apply oversize fallback per block
|
||||
// via tier2_shared::push_chunks_with_oversize.
|
||||
// ...
|
||||
todo!("mirror code_go_ast_v1.rs verbatim, substituting VERSION_LABEL")
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Read `code_go_ast_v1.rs` and port verbatim — the language-agnostic body iterates `doc.blocks` and emits chunks. Only the `VERSION_LABEL` and (potentially) symbol formatting helper change.
|
||||
|
||||
- [ ] **Step 2**: Create `tests/fixtures/sample.c` (~30 lines, includes top-level fn, struct, enum, preprocessor):
|
||||
|
||||
```c
|
||||
#include <stdio.h>
|
||||
#include <stdlib.h>
|
||||
|
||||
#define MAX_BUF 4096
|
||||
|
||||
typedef enum {
|
||||
OK = 0,
|
||||
ERR_PARSE,
|
||||
ERR_IO,
|
||||
} status_t;
|
||||
|
||||
typedef struct {
|
||||
int id;
|
||||
char name[64];
|
||||
status_t status;
|
||||
} record_t;
|
||||
|
||||
static int counter = 0;
|
||||
|
||||
int parse_record(const char *line, record_t *out) {
|
||||
if (line == NULL || out == NULL) return ERR_PARSE;
|
||||
return OK;
|
||||
}
|
||||
|
||||
void print_record(const record_t *r) {
|
||||
printf("[%d] %s (status=%d)\n", r->id, r->name, r->status);
|
||||
}
|
||||
|
||||
int main(void) {
|
||||
record_t r = { .id = 1, .name = "foo", .status = OK };
|
||||
print_record(&r);
|
||||
return 0;
|
||||
}
|
||||
```
|
||||
|
||||
Expected snapshot: 3 function units (`parse_record`, `print_record`, `main`) + 1 enum unit (`status_t`) + 1 struct unit (`record_t`) + 1 `<top-level>` glue (preproc + global var). Total ~6 chunks.
|
||||
|
||||
- [ ] **Step 3**: Create `tests/code_c_ast_snapshot.rs` mirroring `tests/code_go_ast_snapshot.rs`. Assertions:
|
||||
|
||||
```rust
|
||||
// Pseudocode:
|
||||
// 1. Load fixture sample.c
|
||||
// 2. Run CAstExtractor → Document
|
||||
// 3. Run CodeCAstV1Chunker.chunk(&doc, &policy)
|
||||
// 4. Assert chunks.len() == expected (6).
|
||||
// 5. Assert symbols (from chunks[i].source_spans[0]::SourceSpan::Code.symbol) match expected list:
|
||||
// ["status_t", "record_t", "parse_record", "print_record", "main", "<top-level>"]
|
||||
// (order matches AST traversal order — verify by running once.)
|
||||
// 6. Assert all chunks have lang = Some("c").
|
||||
```
|
||||
|
||||
- [ ] **Step 4**: Register module in `crates/kebab-chunk/src/lib.rs`:
|
||||
|
||||
```rust
|
||||
pub mod code_c_ast_v1;
|
||||
pub use code_c_ast_v1::CodeCAstV1Chunker;
|
||||
```
|
||||
|
||||
- [ ] **Step 5**: Run test:
|
||||
|
||||
```bash
|
||||
cargo test -p kebab-chunk --test code_c_ast_snapshot -- --nocapture 2>&1 | tail -25
|
||||
```
|
||||
|
||||
Expected: PASS. If chunk count or symbol order differs from expectation, INSPECT the actual output and update the test's expected list to match (run once to learn, codify on second run).
|
||||
|
||||
- [ ] **Step 6**: Clippy + commit:
|
||||
|
||||
```bash
|
||||
cargo clippy -p kebab-chunk --all-targets -- -D warnings
|
||||
git add crates/kebab-chunk/src/code_c_ast_v1.rs \
|
||||
crates/kebab-chunk/src/lib.rs \
|
||||
crates/kebab-chunk/tests/fixtures/sample.c \
|
||||
crates/kebab-chunk/tests/code_c_ast_snapshot.rs
|
||||
git commit -m "$(cat <<'EOF'
|
||||
feat(p10-1d): code-c-ast-v1 chunker + snapshot test
|
||||
|
||||
Mirrors code-go-ast-v1's chunker pattern (1 chunk per AST unit + <top-level>
|
||||
glue + oversize fallback). Snapshot test against tests/fixtures/sample.c
|
||||
(function + struct + enum + preprocessor) verifies symbol order + lang=c
|
||||
stamping.
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task E: C++ chunker + snapshot test
|
||||
|
||||
**Files:**
|
||||
- Create: `crates/kebab-chunk/src/code_cpp_ast_v1.rs`
|
||||
- Create: `crates/kebab-chunk/tests/fixtures/sample.cpp`
|
||||
- Create: `crates/kebab-chunk/tests/code_cpp_ast_snapshot.rs`
|
||||
- Modify: `crates/kebab-chunk/src/lib.rs`
|
||||
|
||||
- [ ] **Step 1**: Create `code_cpp_ast_v1.rs`. **Mirror `code_c_ast_v1.rs`** verbatim, only VERSION_LABEL differs:
|
||||
|
||||
```rust
|
||||
pub const VERSION_LABEL: &str = "code-cpp-ast-v1";
|
||||
|
||||
pub struct CodeCppAstV1Chunker;
|
||||
|
||||
impl Chunker for CodeCppAstV1Chunker {
|
||||
fn chunker_version(&self) -> &'static str { VERSION_LABEL }
|
||||
// ... identical body — both languages use the same Block::Code → Chunk emission ...
|
||||
}
|
||||
```
|
||||
|
||||
The actual symbol-formatting work happens in the EXTRACTOR (Task C). The chunker's job is to iterate blocks the extractor produced and emit Chunks. Both C and C++ chunkers are essentially identical bodies.
|
||||
|
||||
- [ ] **Step 2**: Create `tests/fixtures/sample.cpp` (~50 lines, includes namespace + nested class + method + free fn + template):
|
||||
|
||||
```cpp
|
||||
#include <string>
|
||||
#include <vector>
|
||||
|
||||
namespace kebab {
|
||||
namespace chunk {
|
||||
|
||||
class MdHeadingV1Chunker {
|
||||
public:
|
||||
MdHeadingV1Chunker() = default;
|
||||
~MdHeadingV1Chunker() = default;
|
||||
|
||||
std::string chunk_doc(const std::string& doc) {
|
||||
return doc;
|
||||
}
|
||||
|
||||
int operator()(int x) const {
|
||||
return x * 2;
|
||||
}
|
||||
|
||||
private:
|
||||
int counter_ = 0;
|
||||
};
|
||||
|
||||
template <typename T>
|
||||
T identity(T value) {
|
||||
return value;
|
||||
}
|
||||
|
||||
} // namespace chunk
|
||||
|
||||
void global_helper() {
|
||||
// free function in kebab namespace
|
||||
}
|
||||
|
||||
} // namespace kebab
|
||||
|
||||
int main() {
|
||||
kebab::chunk::MdHeadingV1Chunker c;
|
||||
return 0;
|
||||
}
|
||||
```
|
||||
|
||||
Expected snapshot symbols (verify on first run, then codify):
|
||||
- `kebab::chunk::MdHeadingV1Chunker` (class unit)
|
||||
- `kebab::chunk::MdHeadingV1Chunker::MdHeadingV1Chunker` (constructor)
|
||||
- `kebab::chunk::MdHeadingV1Chunker::~MdHeadingV1Chunker` (destructor)
|
||||
- `kebab::chunk::MdHeadingV1Chunker::chunk_doc`
|
||||
- `kebab::chunk::MdHeadingV1Chunker::operator()`
|
||||
- `kebab::chunk::identity` (template fn)
|
||||
- `kebab::global_helper`
|
||||
- `main` (free fn, no namespace)
|
||||
- `<top-level>` (include + using)
|
||||
|
||||
~9 chunks total.
|
||||
|
||||
- [ ] **Step 3**: Create `tests/code_cpp_ast_snapshot.rs` mirroring `code_c_ast_snapshot.rs`. Assert symbol list matches expected (run once to learn the actual order, codify).
|
||||
|
||||
- [ ] **Step 4**: Register module in `lib.rs`:
|
||||
|
||||
```rust
|
||||
pub mod code_cpp_ast_v1;
|
||||
pub use code_cpp_ast_v1::CodeCppAstV1Chunker;
|
||||
```
|
||||
|
||||
- [ ] **Step 5**: Run test:
|
||||
|
||||
```bash
|
||||
cargo test -p kebab-chunk --test code_cpp_ast_snapshot -- --nocapture 2>&1 | tail -30
|
||||
```
|
||||
|
||||
Expected: PASS.
|
||||
|
||||
- [ ] **Step 6**: Clippy + commit:
|
||||
|
||||
```bash
|
||||
cargo clippy -p kebab-chunk --all-targets -- -D warnings
|
||||
git add crates/kebab-chunk/src/code_cpp_ast_v1.rs \
|
||||
crates/kebab-chunk/src/lib.rs \
|
||||
crates/kebab-chunk/tests/fixtures/sample.cpp \
|
||||
crates/kebab-chunk/tests/code_cpp_ast_snapshot.rs
|
||||
git commit -m "$(cat <<'EOF'
|
||||
feat(p10-1d): code-cpp-ast-v1 chunker + snapshot test
|
||||
|
||||
Identical chunker body to code-c-ast-v1; per-language work happens in the
|
||||
CppAstExtractor (Task C). Snapshot fixture covers nested namespace +
|
||||
class + ctor/dtor + method + operator overload + template fn + free fn +
|
||||
top-level main, verifying namespace::Class::method symbol convention per
|
||||
design §3.4.
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task F: ingest_one_code_asset dispatch + tier3 fallback list extension
|
||||
|
||||
**Files:**
|
||||
- Modify: `crates/kebab-app/src/lib.rs`
|
||||
|
||||
- [ ] **Step 1**: Top-of-file `use kebab_chunk::{...}` extend with `CodeCAstV1Chunker` + `CodeCppAstV1Chunker`:
|
||||
|
||||
```rust
|
||||
use kebab_chunk::{
|
||||
/* existing items */,
|
||||
CodeCAstV1Chunker,
|
||||
CodeCppAstV1Chunker,
|
||||
};
|
||||
```
|
||||
|
||||
- [ ] **Step 2**: Allowlist (around line 953) extend:
|
||||
|
||||
```rust
|
||||
if matches!(lang.as_str(),
|
||||
"rust" | "python" | "typescript" | "javascript" | "go" | "java" | "kotlin"
|
||||
| "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod"
|
||||
| "shell"
|
||||
| "c" | "cpp")
|
||||
```
|
||||
|
||||
- [ ] **Step 3**: `parser_version` match — add C/C++ arms (Tier 1, so they DO get a real parser version):
|
||||
|
||||
```rust
|
||||
let parser_version = match code_lang {
|
||||
// ... existing 7 Tier 1 + Tier 2 + shell arms ...
|
||||
"c" => ParserVersion(kebab_parse_code::C_PARSER_VERSION.to_string()),
|
||||
"cpp" => ParserVersion(kebab_parse_code::CPP_PARSER_VERSION.to_string()),
|
||||
other => anyhow::bail!("unsupported code_lang: {other}"),
|
||||
};
|
||||
```
|
||||
|
||||
- [ ] **Step 4**: `chunker_version` match — add C/C++ arms:
|
||||
|
||||
```rust
|
||||
let chunker_version = match code_lang {
|
||||
// ... existing arms ...
|
||||
"c" => CodeCAstV1Chunker.chunker_version(),
|
||||
"cpp" => CodeCppAstV1Chunker.chunker_version(),
|
||||
other => anyhow::bail!("unreachable chunker_version: {other}"),
|
||||
};
|
||||
```
|
||||
|
||||
- [ ] **Step 5**: `canonical_result` extract match — add C/C++ arms:
|
||||
|
||||
```rust
|
||||
let canonical_result: anyhow::Result<CanonicalDocument> = match code_lang {
|
||||
"rust" => RustAstExtractor::new().extract(&ctx, &bytes).context("..."),
|
||||
// ... existing ...
|
||||
"c" => CAstExtractor::new().extract(&ctx, &bytes)
|
||||
.context("kb-parse-code::CAstExtractor::extract (code:c)"),
|
||||
"cpp" => CppAstExtractor::new().extract(&ctx, &bytes)
|
||||
.context("kb-parse-code::CppAstExtractor::extract (code:cpp)"),
|
||||
// ... Tier 2 + shell ...
|
||||
other => anyhow::bail!("unreachable (extract): {other}"),
|
||||
};
|
||||
```
|
||||
|
||||
(Add `use kebab_parse_code::{CAstExtractor, CppAstExtractor};` at the top if not already wildcard-imported.)
|
||||
|
||||
- [ ] **Step 6**: `chunks_result` match — add C/C++ arms:
|
||||
|
||||
```rust
|
||||
let chunks_result: anyhow::Result<Vec<Chunk>> = if extract_fell_back {
|
||||
// ... existing ...
|
||||
} else {
|
||||
match code_lang {
|
||||
"rust" => CodeRustAstV1Chunker.chunk(&canonical, chunk_policy).context("..."),
|
||||
// ... existing ...
|
||||
"c" => CodeCAstV1Chunker.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::CodeCAstV1Chunker::chunk (code:c)"),
|
||||
"cpp" => CodeCppAstV1Chunker.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::CodeCppAstV1Chunker::chunk (code:cpp)"),
|
||||
// ... existing ...
|
||||
other => anyhow::bail!("unreachable (chunk): {other}"),
|
||||
}
|
||||
};
|
||||
```
|
||||
|
||||
- [ ] **Step 7**: `tier3_fallback_cv` (p10-3 Critical fix) — C/C++ are fallback-eligible (extract may fail on `.h` C++ headers or malformed code):
|
||||
|
||||
```rust
|
||||
let tier3_fallback_cv = match code_lang {
|
||||
"rust" | "python" | "typescript" | "javascript"
|
||||
| "go" | "java" | "kotlin"
|
||||
| "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod"
|
||||
| "c" | "cpp" // p10-1d:
|
||||
=> Some(CodeTextParagraphV1Chunker.chunker_version()),
|
||||
_ => None,
|
||||
};
|
||||
```
|
||||
|
||||
(The exact location of this match is in `ingest_one_code_asset` between ~lines 1921-1927 per the p10-3 critical fix.)
|
||||
|
||||
- [ ] **Step 8**: Build:
|
||||
|
||||
```bash
|
||||
cargo build -p kebab-app 2>&1 | tail -5
|
||||
```
|
||||
|
||||
Expected: clean.
|
||||
|
||||
- [ ] **Step 9**: Per-crate test (no regression):
|
||||
|
||||
```bash
|
||||
cargo test -p kebab-app --lib -- --nocapture 2>&1 | tail -10
|
||||
```
|
||||
|
||||
Expected: 52 PASS (existing baseline).
|
||||
|
||||
- [ ] **Step 10**: Clippy + commit:
|
||||
|
||||
```bash
|
||||
cargo clippy -p kebab-app --all-targets -- -D warnings
|
||||
git add crates/kebab-app/src/lib.rs
|
||||
git commit -m "$(cat <<'EOF'
|
||||
feat(p10-1d): activate C + C++ in ingest_one_code_asset dispatch
|
||||
|
||||
Extends 4-arm match (parser_version / chunker_version / extract / chunks)
|
||||
+ allowlist + tier3_fallback_cv list with "c" + "cpp" arms. C uses
|
||||
CAstExtractor + CodeCAstV1Chunker; C++ uses CppAstExtractor +
|
||||
CodeCppAstV1Chunker. Both langs are Tier 3-fallback-eligible (e.g. .h
|
||||
file with C++ syntax may fail tree-sitter-c parse → Tier 3 paragraph
|
||||
fallback).
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task G: code_ingest_smoke integration tests (C + C++)
|
||||
|
||||
**Files:**
|
||||
- Modify: `crates/kebab-app/tests/code_ingest_smoke.rs`
|
||||
|
||||
- [ ] **Step 1**: Append 2 tests at the end of the file (mirror the existing tier1 tests `c_ast_v1_*` if present; if not, mirror `rust_ast_v1_*` or `go_ast_v1_*`):
|
||||
|
||||
```rust
|
||||
#[test]
|
||||
fn tier1_c_ingest_searchable() {
|
||||
let env = TestEnv::lexical_only();
|
||||
let workspace = env.workspace_root();
|
||||
std::fs::write(
|
||||
workspace.join("parser.c"),
|
||||
"#include <stdio.h>\n\nint parse_record(const char *line) {\n if (line == NULL) return -1;\n return 0;\n}\n",
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let report = env.ingest().expect("ingest");
|
||||
assert!(report.new_docs >= 1, "expected at least 1 new doc");
|
||||
|
||||
let hits = env.search_code_lang("c", "parse_record").expect("search");
|
||||
assert!(!hits.is_empty(), "expected at least 1 c hit");
|
||||
|
||||
match &hits[0].citation {
|
||||
Citation::Code { symbol, lang, .. } => {
|
||||
assert_eq!(symbol.as_deref(), Some("parse_record"), "C symbol must be function name only");
|
||||
assert_eq!(lang.as_deref(), Some("c"));
|
||||
}
|
||||
other => panic!("expected Citation::Code, got {other:?}"),
|
||||
}
|
||||
assert_eq!(
|
||||
hits[0].chunker_version.as_ref().map(|c| c.0.as_str()),
|
||||
Some("code-c-ast-v1"),
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn tier1_cpp_ingest_searchable() {
|
||||
let env = TestEnv::lexical_only();
|
||||
let workspace = env.workspace_root();
|
||||
std::fs::write(
|
||||
workspace.join("chunker.cpp"),
|
||||
"namespace kebab {\nnamespace chunk {\nclass Foo {\npublic:\n void bar() { /* impl */ }\n};\n}\n}\n",
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let report = env.ingest().expect("ingest");
|
||||
assert!(report.new_docs >= 1);
|
||||
|
||||
let hits = env.search_code_lang("cpp", "bar").expect("search");
|
||||
assert!(!hits.is_empty(), "expected at least 1 cpp hit");
|
||||
|
||||
match &hits[0].citation {
|
||||
Citation::Code { symbol, lang, .. } => {
|
||||
// Symbol could be "kebab::chunk::Foo::bar" or "kebab::chunk::Foo" depending on which chunk hits first.
|
||||
assert!(
|
||||
symbol.as_deref().map_or(false, |s| s.starts_with("kebab::chunk::Foo")),
|
||||
"C++ symbol must start with namespace::Class prefix, got {:?}", symbol
|
||||
);
|
||||
assert_eq!(lang.as_deref(), Some("cpp"));
|
||||
}
|
||||
other => panic!("expected Citation::Code, got {other:?}"),
|
||||
}
|
||||
assert_eq!(
|
||||
hits[0].chunker_version.as_ref().map(|c| c.0.as_str()),
|
||||
Some("code-cpp-ast-v1"),
|
||||
);
|
||||
}
|
||||
```
|
||||
|
||||
- [ ] **Step 2**: Run tests:
|
||||
|
||||
```bash
|
||||
cargo test -p kebab-app --test code_ingest_smoke tier1_c_ingest tier1_cpp_ingest -- --nocapture 2>&1 | tail -30
|
||||
```
|
||||
|
||||
Expected: 2 PASS.
|
||||
|
||||
- [ ] **Step 3**: Full smoke regression:
|
||||
|
||||
```bash
|
||||
cargo test -p kebab-app --test code_ingest_smoke -- --nocapture 2>&1 | tail -30
|
||||
```
|
||||
|
||||
Expected: 18 PASS (16 existing + 2 new).
|
||||
|
||||
- [ ] **Step 4**: Clippy + commit:
|
||||
|
||||
```bash
|
||||
cargo clippy -p kebab-app --tests -- -D warnings
|
||||
git add crates/kebab-app/tests/code_ingest_smoke.rs
|
||||
git commit -m "$(cat <<'EOF'
|
||||
test(p10-1d): integration smoke tests for C + C++
|
||||
|
||||
Verifies end-to-end ingest + search + Citation::Code shape:
|
||||
- tier1_c_ingest_searchable: .c file → --code-lang c search → symbol
|
||||
= function name (no nesting), lang = "c", chunker_version = "code-c-ast-v1".
|
||||
- tier1_cpp_ingest_searchable: .cpp file → --code-lang cpp search →
|
||||
symbol starts with namespace::Class prefix, lang = "cpp",
|
||||
chunker_version = "code-cpp-ast-v1".
|
||||
|
||||
Brings code_ingest_smoke to 18 tests (Rust 3 + Python 1 + TS 1 + JS 1 +
|
||||
Go 1 + Java 1 + Kotlin 1 + yaml 1 + dockerfile 1 + manifest 1 + shell 1 +
|
||||
yaml-fallback 1 + 2 reingest-unchanged regression + c 1 + cpp 1).
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task H: frozen design §10 activation log
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md`
|
||||
|
||||
- [ ] **Step 1**: Find §10 activation log. Add p10-1D entry right after the p10-3 entry:
|
||||
|
||||
```
|
||||
**p10-1D 활성화 (C + C++) (2026-05-21)**: Tier 1 chunker family 완료 — C (`code-c-ast-v1`, `.c`/`.h`) + C++ (`code-cpp-ast-v1`, `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx`) AST chunker 활성화. C symbol = function name only; C++ symbol = `namespace::Class::method` (recursive namespace + class nesting). `.h` 가 C++ syntax 만나면 tree-sitter-c parse 실패 → p10-3 Tier 3 fallback 으로 자동 picked up.
|
||||
```
|
||||
|
||||
- [ ] **Step 2**: Commit:
|
||||
|
||||
```bash
|
||||
git add docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md \
|
||||
docs/superpowers/specs/2026-04-27-kebab-final-form-design.md 2>/dev/null
|
||||
git add docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
|
||||
git commit -m "$(cat <<'EOF'
|
||||
docs(p10-1d): activate C + C++ in frozen design §10
|
||||
|
||||
P10 Tier 1 chunker family complete.
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task I: README + HANDOFF + ARCHITECTURE + SMOKE + tasks/INDEX + tasks/p10/INDEX
|
||||
|
||||
**Files:**
|
||||
- Modify: `README.md` (Mermaid + ingest row), `HANDOFF.md`, `docs/ARCHITECTURE.md`, `docs/SMOKE.md`, `tasks/INDEX.md`, `tasks/p10/INDEX.md`
|
||||
|
||||
- [ ] **Step 1 — README.md**: Update the `kebab ingest` row's supported-langs list to include `.c` / `.h` → `code-c-ast-v1` and `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx` → `code-cpp-ast-v1`. Extend `--code-lang c` / `--code-lang cpp` in the enumeration. Update the Mermaid `chunker[...]` node to include `code-c-ast-v1, code-cpp-ast-v1` in the brace.
|
||||
|
||||
- [ ] **Step 2 — HANDOFF.md**: P10 row append `, **1D ✅ (C + C++ AST chunkers, code-c-ast-v1 + code-cpp-ast-v1 — v0.16.0)**`. Update 한 줄 요약 to include C/C++. Update 다음 후보 (drop p10-1D; remaining: P9-5 desktop / P8 audio).
|
||||
|
||||
- [ ] **Step 3 — docs/ARCHITECTURE.md**: code parser table row: append C + C++ row mention. Flowchart `pcode` node: append `+ P10-1D`. Directory tree chunkers list: add `code_c_ast_v1.rs` + `code_cpp_ast_v1.rs`.
|
||||
|
||||
- [ ] **Step 4 — docs/SMOKE.md**: Add a "## P10-1D C + C++ AST chunker" section after the P10-3 section. Walkthrough with sample.c + sample.cpp ingest + `--code-lang c` / `--code-lang cpp` search assertions. Append verification checklist entry.
|
||||
|
||||
- [ ] **Step 5 — tasks/INDEX.md + tasks/p10/INDEX.md**: Flip p10-1D row ⏳ → ✅ (v0.16.0).
|
||||
|
||||
- [ ] **Step 6**: Commit:
|
||||
|
||||
```bash
|
||||
git add README.md HANDOFF.md docs/ARCHITECTURE.md docs/SMOKE.md tasks/INDEX.md tasks/p10/INDEX.md
|
||||
git commit -m "$(cat <<'EOF'
|
||||
docs(p10-1d): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX sync
|
||||
|
||||
P10 Tier 1 chunker family complete (Rust + Python + TS + JS + Go + Java +
|
||||
Kotlin + C + C++). Tier 2 (k8s + dockerfile + manifest) and Tier 3
|
||||
(paragraph fallback) already active. p10-1D 활성화 + ✅ flip.
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task J: workspace test gate + clippy
|
||||
|
||||
- [ ] **Step 1**: Disk check (`df -h /`) + optional `cargo clean`.
|
||||
|
||||
- [ ] **Step 2**: `cargo test --workspace --no-fail-fast -j 1 2>&1 | tail -80`. Expected: all PASS.
|
||||
|
||||
- [ ] **Step 3**: `cargo clippy --workspace --all-targets -- -D warnings 2>&1 | tail -30`. Expected: clean.
|
||||
|
||||
---
|
||||
|
||||
## Task K: version bump + gitea PR + release
|
||||
|
||||
**Files:**
|
||||
- Modify: `Cargo.toml`
|
||||
|
||||
- [ ] **Step 1**: Workspace `version = "0.15.0"` → `"0.16.0"`.
|
||||
|
||||
- [ ] **Step 2**: `cargo build -p kebab-cli` to refresh Cargo.lock.
|
||||
|
||||
- [ ] **Step 3**: Commit:
|
||||
|
||||
```bash
|
||||
git add Cargo.toml Cargo.lock
|
||||
git commit -m "$(cat <<'EOF'
|
||||
chore: bump version 0.15.0 → 0.16.0 (p10-1d C + C++ AST chunkers)
|
||||
|
||||
Minor bump — additive new chunker_versions code-c-ast-v1 + code-cpp-ast-v1
|
||||
+ new routing langs c / cpp + new tree-sitter-c / tree-sitter-cpp workspace
|
||||
deps. P10 Tier 1 chunker family complete. No DB migration, no wire schema
|
||||
major bump.
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
- [ ] **Step 4**: Push branch + open gitea PR via REST API. Title: `feat(p10-1d): C + C++ AST chunkers — P10 Tier 1 chunker family complete`.
|
||||
|
||||
- [ ] **Step 5**: Wait for code-reviewer APPROVE → merge via gitea REST API → cut `gitea-release v0.16.0`.
|
||||
|
||||
---
|
||||
|
||||
## Verification matrix
|
||||
|
||||
| 검증 | 명령 | 기대 |
|
||||
|------|------|------|
|
||||
| C symbol | `kebab search --code-lang c --json` | `Citation::Code.symbol = "<fn_name>"` |
|
||||
| C++ symbol | `kebab search --code-lang cpp --json` | `Citation::Code.symbol = "namespace::Class::method"` |
|
||||
| .h fallback | `.h` with C++ syntax → ingest | Tier 3 fallback: `chunker_version = "code-text-paragraph-v1"`, lang = c |
|
||||
| code_lang_breakdown | `kebab schema --json` | `"c": N`, `"cpp": M` |
|
||||
|
||||
---
|
||||
|
||||
## Risks reminder (구현 중 주의)
|
||||
|
||||
- **tree-sitter grammar version resolution**: tree-sitter 0.26 호환 grammar. crates.io 최신 버전 default.
|
||||
- **tree-sitter-cpp 의 node-kind 명**: spec 의 가정 (`namespace_definition`, `class_specifier`, `function_definition`, `template_declaration`, `concept_definition`, etc.) 이 실제 grammar 와 일치하는지 fixture parse 로 검증.
|
||||
- **out-of-class method def 의 prefix 복원**: `void Foo::bar()` 의 declarator 가 `function_declarator > qualified_identifier > namespace_identifier "Foo" + identifier "bar"`. spec 의 `extract_fn_symbol` 이 이 chain 정확히 walk.
|
||||
- **Operator overload**: tree-sitter-cpp 의 `operator_name` 또는 `field_identifier` "operator+" 형태. fixture 로 검증.
|
||||
- **머지 후 deviation** 은 `tasks/HOTFIXES.md` dated 로그.
|
||||
1296
docs/superpowers/plans/2026-05-21-p10-3-tier3-paragraph-fallback.md
Normal file
1296
docs/superpowers/plans/2026-05-21-p10-3-tier3-paragraph-fallback.md
Normal file
File diff suppressed because it is too large
Load Diff
Some files were not shown because too many files have changed in this diff Show More
Reference in New Issue
Block a user