Compare commits
88 Commits
spec/fb-39
...
v0.9.0
| Author | SHA1 | Date | |
|---|---|---|---|
| acf8cf3be2 | |||
| ea5f7b22c8 | |||
| 5497c6e7b5 | |||
| 5a90940f1c | |||
| 4389b887f0 | |||
| 360f825f3a | |||
| 641b92af7d | |||
| 08fb743598 | |||
| 0a2a7ae214 | |||
| 803d02b68b | |||
| 4e8b84c4e0 | |||
| 16dc02cfa2 | |||
| 74f1b0571b | |||
| 918ee6c0be | |||
| 68ada396f3 | |||
| 23c4ad97b9 | |||
| 1f566b8bfa | |||
| 26562588e3 | |||
| 4503b5b12f | |||
| 44813df052 | |||
| d6bb6cfd3b | |||
| d53995a6d4 | |||
| c215034653 | |||
| 31245a4328 | |||
| acb61b6830 | |||
| 20feb3133e | |||
| de63f161ac | |||
| 1815091247 | |||
| 6a0b340941 | |||
| 9664e97497 | |||
| 8bdb3e8090 | |||
| dcad9ccda2 | |||
| ed0f4769b3 | |||
| 0c61758931 | |||
| 39b766ea59 | |||
| 7f287abacb | |||
| d715631928 | |||
| 73e5b359d8 | |||
| c780aca904 | |||
| b1d5047399 | |||
| 80c2d31fb3 | |||
| 97e9f558f4 | |||
| da51e59081 | |||
| 11a0fc758f | |||
| b5d1fe8c1e | |||
| 580576c2c6 | |||
| 808b92a6c5 | |||
| c74f8d269e | |||
| df85bafa7f | |||
| a93b33ffbe | |||
| 402a4506a2 | |||
| a531dc37dc | |||
| 7a6a24ad10 | |||
| 42712b50c2 | |||
| 9f3edb7e24 | |||
| 5c265bb59f | |||
| a08ed32199 | |||
| 9362cd0aae | |||
|
|
7961f8813d | ||
|
|
7bbd2c0cbf | ||
|
|
d13f58d28a | ||
|
|
298f4adc81 | ||
|
|
4e8b70a04b | ||
|
|
682f7dd3a2 | ||
|
|
40b3ea8408 | ||
|
|
9fce24b106 | ||
|
|
8bbe25dc10 | ||
|
|
abfdcbd31d | ||
|
|
69d1593bc5 | ||
|
|
2a8451c033 | ||
|
|
ff11f81f7f | ||
|
|
bf4ebf8d2a | ||
|
|
351c7a0826 | ||
|
|
7329ba96ee | ||
|
|
fa4eeb5a87 | ||
|
|
3b1e878aed | ||
|
|
005a9011ea | ||
|
|
c6d61b0b37 | ||
|
|
49487dc46b | ||
| 2c2bf9bac5 | |||
| 72798bd3ff | |||
|
|
c3177561b9 | ||
| a465b71f99 | |||
|
|
787007172a | ||
|
|
b954e9ce66 | ||
|
|
c62a8ff503 | ||
|
|
69c94b6692 | ||
|
|
7c6c2e8102 |
2
.gitignore
vendored
2
.gitignore
vendored
@@ -1,6 +1,6 @@
|
||||
.superpowers/
|
||||
.worktrees/
|
||||
.claude/
|
||||
/target/
|
||||
/target
|
||||
**/*.rs.bk
|
||||
Cargo.lock.bak
|
||||
|
||||
@@ -27,7 +27,7 @@ cargo build --release # produces target/release/kebab
|
||||
|
||||
`-j 1` for the full workspace test isn't optional: 18 integration-test binaries each link `lance` + `datafusion` + `arrow` + `tantivy` and the parallel link step exhausts memory (linker gets SIGKILL'd, build silently fails partway). Per-crate runs are fine in parallel.
|
||||
|
||||
`target/` is 6–10 GB after a fresh build (DataFusion + Lance + fastembed + 18 × test-binary debug info). The dev/test profile is already trimmed (`debug = "line-tables-only"`, `split-debuginfo = "unpacked"` — see workspace `Cargo.toml`). Run `cargo clean` after phase merges if disk pressure shows up; backtraces still resolve to function + line.
|
||||
`target/` is 6–10 GB after a fresh build but **balloons to 90+ GB after a few task cycles** (each fb-* batch adds incremental compile artifacts on top of the existing 18 × test-binary debug info). The dev/test profile is already trimmed (`debug = "line-tables-only"`, `split-debuginfo = "unpacked"` — see workspace `Cargo.toml`). Run `cargo clean` **routinely after each merged PR**, not just "if pressure shows up" — disk space is tight and recovery via `cargo clean` is cheap (one re-link per crate on next build). Verified pattern: 92 GB → 0 GB in seconds, backtraces still resolve to function + line.
|
||||
|
||||
## The facade rule
|
||||
|
||||
|
||||
782
Cargo.lock
generated
782
Cargo.lock
generated
File diff suppressed because it is too large
Load Diff
15
Cargo.toml
15
Cargo.toml
@@ -23,6 +23,7 @@ members = [
|
||||
"crates/kebab-parse-pdf",
|
||||
"crates/kebab-tui",
|
||||
"crates/kebab-mcp",
|
||||
"crates/kebab-parse-code",
|
||||
]
|
||||
|
||||
[workspace.package]
|
||||
@@ -30,7 +31,7 @@ edition = "2024"
|
||||
rust-version = "1.85"
|
||||
license = "MIT OR Apache-2.0"
|
||||
repository = "https://github.com/altair823/kebab"
|
||||
version = "0.5.0"
|
||||
version = "0.9.0"
|
||||
|
||||
[workspace.dependencies]
|
||||
anyhow = "1"
|
||||
@@ -81,6 +82,18 @@ rmcp = { version = "1.6", default-features = false, features = ["server"
|
||||
# sync via reqwest::blocking — wiremock is dev-only there).
|
||||
wiremock = "0.6"
|
||||
base64 = "0.22"
|
||||
# Pure-Rust git library for repo metadata detection (kebab-parse-code).
|
||||
# No `git` binary required. Default features include thread-safety + most
|
||||
# object-reading capabilities needed for HEAD name + commit SHA queries.
|
||||
gix = { version = "0.70", default-features = false, features = ["revision"] }
|
||||
# Rust source parsing for code ingest (kebab-parse-code, p10-1A-2). The
|
||||
# chunker stays tree-sitter-free — AST work is parser-side per design §6.3.
|
||||
tree-sitter = "0.26"
|
||||
tree-sitter-rust = "0.24"
|
||||
# Python / TS / JS grammars for code ingest (kebab-parse-code, p10-1B).
|
||||
tree-sitter-python = "0.25.0"
|
||||
tree-sitter-typescript = "0.23.2"
|
||||
tree-sitter-javascript = "0.25.0"
|
||||
|
||||
# Disk-footprint trim for dev / test builds. Codegen, opt-level, and
|
||||
# behavior are unchanged — only DWARF debug info is reduced (line
|
||||
|
||||
@@ -20,6 +20,7 @@ P0–P5 + P6 + P7 + P9-1/2/3/4 (Library / Search / Ask / Inspect) 머지 완료.
|
||||
| **P7** | PDF text + page citation | `kebab-parse-pdf` | P5 | ✅ 완료 (3/3 component, page-level chunker + ingest wiring) |
|
||||
| **P8** | 음성 transcription + timestamp citation | `kebab-parse-audio` | P5 | ⏸ 보류 (whisper-rs 시스템 dep brainstorm 필요) |
|
||||
| **P9** | TUI + desktop app | `kebab-tui`, `kebab-desktop` | P5 | 🟡 진행 (4/5 component — P9-1/2/3/4 완료 [Library / Search / Ask / Inspect], P9-5 desktop 예정 · 도그푸딩 피드백 **20/20 ✅**) |
|
||||
| **P10** | code ingest framework | `kebab-parse-code` | P5 | 🟡 진행 중 — 1A-1 ✅ (wire schema + parse-code skeleton + filter flags), 1A-2 ✅ (Rust AST chunker, tree-sitter-rust, `code-rust-ast-v1` — v0.7.0), **1B 🟡 PR 오픈** (Python `code-python-ast-v1` + TypeScript `code-ts-ast-v1` + JavaScript `code-js-ast-v1` — 3 언어 dogfooding 가능, v0.8.0 대기) |
|
||||
|
||||
P0~P5 직렬. P6~P9 P5 이후 병렬 가능.
|
||||
|
||||
@@ -31,6 +32,8 @@ P0~P5 직렬. P6~P9 P5 이후 병렬 가능.
|
||||
|
||||
머지 후 발견된 모든 deviation / hotfix 의 dated 로그는 [tasks/HOTFIXES.md](tasks/HOTFIXES.md). 본 요약은 \"누군가가 인수받을 때 알아두면 시간을 많이 절약하는\" 항목만:
|
||||
|
||||
- **2026-05-20 P10-1B (Rust 1A symbol path 비일관 + expression-level 함수 미방출)** — (a) Rust `code-rust-ast-v1` 은 file-scope nesting 만 (workspace path prefix 없음), 1B 의 Python/TypeScript/JavaScript 는 workspace 경로 → module path prefix 사용 (비일관 수용, retrofit = chunker_version bump + reindex 필요, 사용자 명시 요청까지 보류); (b) TS/JS 의 `const foo = () => {...}` 같은 expression-level 함수는 `<top-level>` glue 로 처리됨 (declaration-level 단위만 1B 1차 범위). 자세한 내용: `tasks/HOTFIXES.md` (2026-05-20) 두 항목.
|
||||
- **2026-05-19 P10-1A-2 (code_rust_ast_v1.rs + SourceType)** — `AST_CHUNK_MAX_LINES` 상수가 `IngestCodeCfg.ast_chunk_max_lines` 를 읽지 않고 모듈 상수 200 고정 (Chunker trait 이 per-medium config 미노출); `SourceType::Code` variant 부재로 code 파일이 `SourceType::Note` 로 분류됨 — 두 항목 모두 `tasks/HOTFIXES.md` (2026-05-19) 에 기록.
|
||||
- **2026-05-07 fb-26 (progress.rs)** — `Aborted` unconditional writeln (TTY duplicate) + `Completed` TTY no summary fixed; `KEBAB_PROGRESS=plain` env + quiet suppression added
|
||||
- **2026-05-07 fb-28 (main.rs)** — `--readonly` (KEBAB_READONLY) blocks Ingest/IngestFile/IngestStdin/Reset; `--quiet` suppresses progress stderr; error.v1 code: "readonly_mode"
|
||||
|
||||
|
||||
26
README.md
26
README.md
@@ -7,7 +7,7 @@
|
||||
- **Rust toolchain** ≥ 1.85 (workspace 가 edition 2024 + resolver 3 사용). [rustup](https://rustup.rs) 권장.
|
||||
- **Ollama** — `kebab ask` 와 이미지 OCR/caption 가 사용. `https://ollama.com/download` 에서 설치 후 `ollama serve` 실행. 기본 LLM 은 gemma4 계열 (`ollama pull gemma4:e4b`) — OCR / caption 도 같은 family 라 모델 하나만 pull 하면 됨. 더 큰 variant 원하면 `gemma4:26b` 등으로 config override. config 의 `[models.llm].endpoint` 에 host:port 명시.
|
||||
- **빌드 디스크** — 첫 빌드 시 `target/` 가 6–10 GB (Lance + DataFusion + fastembed). 여유 확인.
|
||||
- **fastembed 모델** — 첫 `kebab ingest` 시 `multilingual-e5-small` (~470 MB) 자동 다운로드.
|
||||
- **fastembed 모델** — 첫 `kebab ingest` 시 `multilingual-e5-large` (~1.3 GB, fb-39b) 자동 다운로드. `config.toml` 에서 `model = "multilingual-e5-small"` 로 명시하면 이전 모델 사용.
|
||||
|
||||
## 설치
|
||||
|
||||
@@ -42,7 +42,7 @@ cargo install --git https://gitea.altair823.xyz/altair823-org/kebab.git --bin ke
|
||||
# 첫 실행 — XDG 경로에 데이터 디렉토리 + config.toml 생성
|
||||
kebab init
|
||||
|
||||
# config 손보고 — workspace.root, 모델 endpoint 등 설정 (지원 형식은 md / png / jpg / pdf 로 고정)
|
||||
# config 손보고 — workspace.root, 모델 endpoint 등 설정 (지원 형식: md / png / jpg / pdf / rs / py / ts / js)
|
||||
${EDITOR:-vi} ~/.config/kebab/config.toml
|
||||
|
||||
# 색인 (Markdown / 이미지 / PDF 모두 한 번에)
|
||||
@@ -70,8 +70,8 @@ kebab doctor
|
||||
| 명령 | 동작 |
|
||||
|------|------|
|
||||
| `kebab init` | XDG 경로에 데이터 디렉토리 + config.toml 생성 |
|
||||
| `kebab ingest [<path>]` | Markdown / 이미지 / PDF 색인 (idempotent). TTY 에서는 stderr 진행 바, non-TTY (CI / pipe) 는 stderr 한 줄씩, `--json` 은 stdout 에 `ingest_progress.v1` 라인 streaming 후 마지막에 `ingest_report.v1`. Ctrl-C 한 번이면 현재 asset 마무리 후 abort (부분 commit 보존, idempotent re-run), 두 번째 Ctrl-C 는 hard exit. Markdown title 이 frontmatter 에 없어도 첫 H1 → H2 → 첫 paragraph 80 자 → 파일명 순으로 자동 채움 (parser_version `md-frontmatter-v2`) — 기존 색인된 doc 도 다음 ingest 에서 새 title 로 갱신. **Incremental** (p9-fb-23): 두 번째 이후의 ingest 는 변하지 않은 doc (blake3 + parser/chunker/embedder version 모두 동일) 의 parse/chunk/embed/vector upsert 를 자동 스킵. final summary 에 `N unchanged` 카운트 표시. `--force-reingest` 로 skip 무시 강제 재처리. **지원 형식** (extractor 자동 결정 — config 에 명시 불가): Markdown (`.md`), 이미지 (`.png` / `.jpg` / `.jpeg`, OCR + caption), PDF (`.pdf`). 다른 확장자는 자동 skip — `IngestItem.warnings` 에 사유 (`"unsupported media type: .docx"` 등), `IngestReport.skipped_by_extension` 에 카운트 분류, CLI / TUI summary 에 breakdown 표시. |
|
||||
| `kebab search --mode {lexical,vector,hybrid} "<query>" [--no-cache] [--max-tokens N] [--snippet-chars N] [--cursor <opaque>] [--tag T] [--lang L] [--path-glob G] [--trust-min LEVEL] [--media TYPE] [--ingested-after RFC3339] [--doc-id ID] [--trace] [--bulk]` | 검색. hybrid는 RRF fusion, citation 포함. 같은 process 안에서 동일 query (NFKC + trim + lowercase 정규화) 반복 시 in-process LRU 캐시 hit (capacity = `[search] cache_capacity`, default 256). `--no-cache` 로 강제 bypass — 디버깅용. ingest commit 발생 시 `kv['corpus_revision']` bump 으로 모든 entry 자동 stale. **`--max-tokens` / `--snippet-chars` / `--cursor` (p9-fb-34)** — agent budget controls. `--json` 출력은 `search_response.v1` wrapper (`{hits, next_cursor, truncated}`) — pre-fb-34 의 bare array 와 호환 안 됨. mismatched cursor → `error.v1.code = stale_cursor`. **filter flags (p9-fb-36):** `--tag` 는 반복 가능 flag (`--tag rust --tag async`) 로 OR 매칭, `--media` 는 `,` 구분 다중 값 OR 매칭, 나머지 flags 간은 AND 조합. `--trust-min` 은 `primary\|secondary\|generated` 중 하나 (해당 level 이상 포함). `--ingested-after` 는 RFC3339 UTC — 파싱 실패 시 `error.v1.code = config_invalid` (exit 2). `--media md` 는 `markdown` alias 로 정규화. 알 수 없는 `--media` 값은 무조건 empty hits (오류 아님). **`--trace` (p9-fb-37)** — `search_response.v1.trace` 에 lexical / vector pre-fusion 후보 + RRF union + per-stage timing (`lexical_ms` / `vector_ms` / `fusion_ms` / `total_ms`) 노출. trace 요청은 캐시 우회 (`--no-cache` 없이도 항상 cold). **`--bulk` (p9-fb-42)** — stdin ndjson 으로 N query 한 번에 실행. `--json` 면 stdout per-query ndjson (`bulk_search_item.v1`) + stderr summary (`bulk_summary: total=N succeeded=S failed=F`). Cap 100. agent 가 query decomposition 후 sub-query 일괄 실행 시 single round-trip — App instance 재사용으로 캐시 / embedder cold-start 비용 한 번만. Per-query failure 는 item 의 `error` (error.v1) 에 격리, 다른 query 계속 진행. |
|
||||
| `kebab ingest [<path>]` | Markdown / 이미지 / PDF / Rust 소스코드 색인 (idempotent). TTY 에서는 stderr 진행 바, non-TTY (CI / pipe) 는 stderr 한 줄씩, `--json` 은 stdout 에 `ingest_progress.v1` 라인 streaming 후 마지막에 `ingest_report.v1`. Ctrl-C 한 번이면 현재 asset 마무리 후 abort (부분 commit 보존, idempotent re-run), 두 번째 Ctrl-C 는 hard exit. Markdown title 이 frontmatter 에 없어도 첫 H1 → H2 → 첫 paragraph 80 자 → 파일명 순으로 자동 채움 (parser_version `md-frontmatter-v2`) — 기존 색인된 doc 도 다음 ingest 에서 새 title 로 갱신. **Incremental** (p9-fb-23): 두 번째 이후의 ingest 는 변하지 않은 doc (blake3 + parser/chunker/embedder version 모두 동일) 의 parse/chunk/embed/vector upsert 를 자동 스킵. final summary 에 `N unchanged` 카운트 표시. `--force-reingest` 로 skip 무시 강제 재처리. **지원 형식** (extractor 자동 결정 — config 에 명시 불가): Markdown (`.md`), 이미지 (`.png` / `.jpg` / `.jpeg`, OCR + caption), PDF (`.pdf`), **소스코드** (`.rs` → `code-rust-ast-v1`, `.py` → `code-python-ast-v1`, `.ts`/`.tsx` → `code-ts-ast-v1`, `.js`/`.mjs`/`.cjs`/`.jsx` → `code-js-ast-v1` — 모두 tree-sitter AST chunker). 다른 확장자는 자동 skip — `IngestItem.warnings` 에 사유 (`"unsupported media type: .docx"` 등), `IngestReport.skipped_by_extension` 에 카운트 분류, CLI / TUI summary 에 breakdown 표시. 코드 chunk 는 `citation.kind = "code"` 에 `citation.lang = "<lang>"` + `symbol` + line range 를 담고, SearchHit top-level 에 `code_lang` + `repo` (`.git/` walk-up 의 디렉토리 이름) 가 backfill 됨. `--code-lang rust` / `--code-lang python` / `--code-lang typescript` / `--code-lang javascript` / `--media code` filter 로 언어별·코드 전용 검색 가능 (p10-1A-1 filter flags). Python symbol 은 workspace 경로 → dotted module path prefix (예: `kebab_eval.metrics.compute_mrr`), TS/JS symbol 은 slash-style module path prefix (예: `src/Foo.Foo.search`). |
|
||||
| `kebab search --mode {lexical,vector,hybrid} "<query>" [--no-cache] [--max-tokens N] [--snippet-chars N] [--cursor <opaque>] [--tag T] [--lang L] [--path-glob G] [--trust-min LEVEL] [--media TYPE] [--ingested-after RFC3339] [--doc-id ID] [--trace] [--bulk] [--repo NAME ...] [--code-lang LIST]` | 검색. hybrid는 RRF fusion, citation 포함. 같은 process 안에서 동일 query (NFKC + trim + lowercase 정규화) 반복 시 in-process LRU 캐시 hit (capacity = `[search] cache_capacity`, default 256). `--no-cache` 로 강제 bypass — 디버깅용. ingest commit 발생 시 `kv['corpus_revision']` bump 으로 모든 entry 자동 stale. **`--max-tokens` / `--snippet-chars` / `--cursor` (p9-fb-34)** — agent budget controls. `--json` 출력은 `search_response.v1` wrapper (`{hits, next_cursor, truncated}`) — pre-fb-34 의 bare array 와 호환 안 됨. mismatched cursor → `error.v1.code = stale_cursor`. **filter flags (p9-fb-36):** `--tag` 는 반복 가능 flag (`--tag rust --tag async`) 로 OR 매칭, `--media` 는 `,` 구분 다중 값 OR 매칭, 나머지 flags 간은 AND 조합. `--trust-min` 은 `primary\|secondary\|generated` 중 하나 (해당 level 이상 포함). `--ingested-after` 는 RFC3339 UTC — 파싱 실패 시 `error.v1.code = config_invalid` (exit 2). `--media md` 는 `markdown` alias 로 정규화. 알 수 없는 `--media` 값은 무조건 empty hits (오류 아님). **`--trace` (p9-fb-37)** — `search_response.v1.trace` 에 lexical / vector pre-fusion 후보 + RRF union + per-stage timing (`lexical_ms` / `vector_ms` / `fusion_ms` / `total_ms`) 노출. trace 요청은 캐시 우회 (`--no-cache` 없이도 항상 cold). **`--bulk` (p9-fb-42)** — stdin ndjson 으로 N query 한 번에 실행. `--json` 면 stdout per-query ndjson (`bulk_search_item.v1`) + stderr summary (`bulk_summary: total=N succeeded=S failed=F`). Cap 100. agent 가 query decomposition 후 sub-query 일괄 실행 시 single round-trip — App instance 재사용으로 캐시 / embedder cold-start 비용 한 번만. Per-query failure 는 item 의 `error` (error.v1) 에 격리, 다른 query 계속 진행. **code corpus filters (p10-1A-1):** `--repo` 는 반복 가능 (`--repo kebab --repo other`) OR 매칭. `--code-lang` 는 반복 또는 comma 다중 값 (`--code-lang rust,python`), 알 수 없는 값은 빈 hits. `--media code` 는 Tier 1/2/3 모든 code chunk 포함. 1A-1 시점에서는 indexed 된 code chunk 가 없어 filter 가 항상 빈 결과 — 1A-2 (Rust AST chunker) 머지 이후 실효. |
|
||||
| `kebab list docs` | 색인된 문서 목록 |
|
||||
| `kebab inspect doc <id>` / `kebab inspect chunk <id>` | raw record 보기 |
|
||||
| `kebab fetch chunk <id> [--context N]` / `kebab fetch doc <id> [--max-tokens N]` / `kebab fetch span <doc_id> <ls> <le> [--max-tokens N]` | (p9-fb-35) verbatim text fetch from indexed corpus. wire = `fetch_result.v1` (kind discriminator). chunk: target + ±N ordinal-context chunks. doc: full normalized markdown. span: 1-based line range (PDF/audio rejected as `error.v1.code = span_not_supported`). chars/4 budget on doc/span. |
|
||||
@@ -131,9 +131,9 @@ flowchart TB
|
||||
end
|
||||
|
||||
subgraph Pipeline["도메인 + 파이프라인"]
|
||||
parse["parse-md / parse-pdf / parse-image"]
|
||||
chunker["chunker (md-heading-v1, pdf-page-v1)"]
|
||||
embedder["embedder (fastembed multilingual-e5-small)"]
|
||||
parse["parse-md / parse-pdf / parse-image / parse-code"]
|
||||
chunker["chunker (md-heading-v1, pdf-page-v1, code-rust-ast-v1, code-python-ast-v1, code-ts-ast-v1, code-js-ast-v1)"]
|
||||
embedder["embedder (fastembed multilingual-e5-large)"]
|
||||
retriever["retriever (lexical / vector / hybrid RRF)"]
|
||||
rag["RAG pipeline"]
|
||||
end
|
||||
@@ -178,7 +178,17 @@ flowchart TB
|
||||
|
||||
## Configuration
|
||||
|
||||
- `~/.config/kebab/config.toml` — `kebab init` 가 XDG 경로에 생성. `[workspace]` (root, exclude — include 필드는 제거됨, 지원 형식은 자동 결정), `[storage]`, `[chunking]`, `[models.embedding]`, `[models.llm]`, `[image.ocr]`, `[image.caption]`, `[search]`, `[rag]`, `[ui]` 절. `[ui] theme = "dark" | "light"` 로 TUI 팔레트 선택 (default `"dark"`, 알 수 없는 값은 dark fallback). `[search] stale_threshold_days = 30` (p9-fb-32) — search hit / RAG citation 의 `stale` 플래그 기준 (default 30 일, `0` 으로 비활성화). 옛 config 의 `workspace.include = [...]` 은 silently 무시 + 단발 deprecation warning (p9-fb-25).
|
||||
- `~/.config/kebab/config.toml` — `kebab init` 가 XDG 경로에 생성. `[workspace]` (root, exclude — include 필드는 제거됨, 지원 형식은 자동 결정), `[storage]`, `[chunking]`, `[models.embedding]`, `[models.llm]`, `[image.ocr]`, `[image.caption]`, `[search]`, `[rag]`, `[ui]` 절.
|
||||
- `[models.embedding]` —
|
||||
- `model` (default `"multilingual-e5-large"`, fb-39b) — 다국어 sentence embedding 모델. 1024-dim. ONNX (~1.3 GB) 첫 실행 시 fastembed cache (`config.storage.model_dir/fastembed/`) 에 자동 다운로드. `"multilingual-e5-small"` (384 dim) 는 backwards-compat 으로 사용 가능 — TOML 에 명시.
|
||||
- `dimensions` (default `1024`) — 모델의 embedding 차원. config 와 LanceDB stored dim 불일치 시 검색 결과 0 건 (orphan table). 모델 변경 시 `kebab reset --vector-only && kebab ingest` 로 vector index 재구축 권장.
|
||||
- `[ui] theme = "dark" | "light"` 로 TUI 팔레트 선택 (default `"dark"`, 알 수 없는 값은 dark fallback).
|
||||
- `[search] stale_threshold_days = 30` (p9-fb-32) — search hit / RAG citation 의 `stale` 플래그 기준 (default 30 일, `0` 으로 비활성화). 옛 config 의 `workspace.include = [...]` 은 silently 무시 + 단발 deprecation warning (p9-fb-25).
|
||||
- `[ingest.code]` (p10-1A-1) — code ingest 의 skip 정책 + chunker 기본값.
|
||||
- `skip_generated_header = true` — 첫 ~512 byte 의 generated marker (`@generated` / `DO NOT EDIT` 등) 감지 시 skip.
|
||||
- `max_file_bytes = 262144` (256 KiB) / `max_file_lines = 5000` — 파일당 cap, 초과 시 skip.
|
||||
- `extra_skip_globs = []` — 사용자 추가 skip 패턴 (`.gitignore` 문법).
|
||||
- `.gitignore` honor: 자동 적용. `.kebabignore` 는 추가 layer. 우선순위: built-in safety net (`node_modules/` / `target/` / `__pycache__/` / `.venv/` / `venv/` / `env/`) > `.gitignore` > `.kebabignore`.
|
||||
- `[rag] prompt_template_version` (default `"rag-v2"`) — RAG system prompt version. `"rag-v1"` 은 legacy backwards-compat (사용자 명시 시 유지). v2 강화 규칙: (1) fact 인용 시 [#번호] 앞에 chunk 속 원문 큰따옴표 표기, (2) 학습 지식 동원 금지, (3) 근거 모호 시 "확실하지 않다" 명시.
|
||||
- `--config <path>` flag — 임시 워크스페이스 / 격리 테스트 시 사용. CLI / TUI 모두 honor.
|
||||
- `KEBAB_*` env — 일부 키 override (`KEBAB_RAG_SCORE_GATE`, `KEBAB_EVAL_GOLDEN`, `KEBAB_COMMIT_HASH` 등).
|
||||
|
||||
@@ -32,6 +32,10 @@ kebab-parse-image = { path = "../kebab-parse-image" }
|
||||
# per-asset dispatch (see `ingest_one_asset` PDF branch) and runs the
|
||||
# resulting `CanonicalDocument` through `kebab-chunk::PdfPageV1Chunker`.
|
||||
kebab-parse-pdf = { path = "../kebab-parse-pdf" }
|
||||
# p10-1A-2: Rust AST extractor lives here. App threads it into the
|
||||
# per-asset dispatch (see `ingest_one_asset` Code branch) and runs the
|
||||
# resulting `CanonicalDocument` through `kebab-chunk::CodeRustAstV1Chunker`.
|
||||
kebab-parse-code = { path = "../kebab-parse-code" }
|
||||
anyhow = { workspace = true }
|
||||
blake3 = { workspace = true }
|
||||
serde = { workspace = true }
|
||||
|
||||
@@ -40,8 +40,8 @@ use anyhow::{Context, Result, anyhow};
|
||||
use lru::LruCache;
|
||||
|
||||
use kebab_core::{
|
||||
Answer, Embedder, IndexVersion, LanguageModel, Retriever, SearchHit, SearchMode,
|
||||
SearchOpts, SearchQuery, VectorStore,
|
||||
Answer, DocumentStore, Embedder, IndexVersion, LanguageModel, Retriever, SearchHit,
|
||||
SearchMode, SearchOpts, SearchQuery, VectorStore,
|
||||
};
|
||||
use kebab_embed_local::FastembedEmbedder;
|
||||
use kebab_llm_local::OllamaLanguageModel;
|
||||
@@ -296,6 +296,15 @@ impl App {
|
||||
now,
|
||||
self.config.search.stale_threshold_days,
|
||||
);
|
||||
// p10-1A-2: backfill `code_lang` from the Citation::Code `lang`
|
||||
// field. The search layer (kebab-search) constructs SearchHit with
|
||||
// `code_lang: None`; we own the post-processing here in kebab-app
|
||||
// and can fill it cheaply from data already present in the hit.
|
||||
backfill_code_lang(&mut hits);
|
||||
// p10-1A-2 Task 8b: backfill `repo` from the document's
|
||||
// `Metadata.repo`. Unlike `code_lang`, this cannot be derived from
|
||||
// the Citation alone — it requires a store lookup by `doc_id`.
|
||||
self.backfill_repo(&mut hits);
|
||||
Ok(hits)
|
||||
}
|
||||
|
||||
@@ -387,6 +396,10 @@ impl App {
|
||||
now,
|
||||
self.config.search.stale_threshold_days,
|
||||
);
|
||||
// p10-1A-2: backfill code_lang — same as search_uncached.
|
||||
backfill_code_lang(&mut traced_hits);
|
||||
// p10-1A-2 Task 8b: backfill repo — same as search_uncached.
|
||||
self.backfill_repo(&mut traced_hits);
|
||||
|
||||
// Apply offset + k_effective truncation (mirrors non-trace path).
|
||||
let drop_n = offset.min(traced_hits.len());
|
||||
@@ -413,6 +426,9 @@ impl App {
|
||||
});
|
||||
}
|
||||
|
||||
// backfill_code_lang + backfill_repo are applied inside `search`
|
||||
// via `search_uncached` — no explicit call needed here. Trace
|
||||
// branch above calls them directly because it bypasses `search`.
|
||||
let mut all_hits = self.search(fetch_query)?;
|
||||
|
||||
// Skip offset.
|
||||
@@ -777,6 +793,58 @@ impl App {
|
||||
}
|
||||
}
|
||||
|
||||
/// p10-1A-2 Task 8b: back-fill `SearchHit.repo` from the originating
|
||||
/// document's `Metadata.repo` for every hit whose `repo` field is
|
||||
/// currently `None`. The search layer (kebab-search) constructs hits
|
||||
/// with `repo: None` because it has no store access; we fill it here
|
||||
/// in kebab-app post-retrieval via a per-distinct-`doc_id` store lookup.
|
||||
///
|
||||
/// Deduplication: a small `HashMap` accumulates the
|
||||
/// `(doc_id → Option<String>)` mapping so each unique document is
|
||||
/// fetched at most once. Search result sets are small (default k ≤ 20),
|
||||
/// so the map overhead is negligible. A `None` entry is cached too
|
||||
/// (document not found or no repo in metadata) to avoid re-querying.
|
||||
///
|
||||
/// Non-repo documents (markdown, PDF, plain text, code files outside a
|
||||
/// git tree) correctly keep `repo: None` — `Metadata.repo` is already
|
||||
/// `None` for those, so the assignment is a no-op.
|
||||
fn backfill_repo(&self, hits: &mut [SearchHit]) {
|
||||
use std::collections::HashMap;
|
||||
use kebab_core::DocumentId;
|
||||
|
||||
// doc_id → Option<String> where None means "not found / no repo"
|
||||
let mut cache: HashMap<DocumentId, Option<String>> = HashMap::new();
|
||||
|
||||
for hit in hits.iter_mut() {
|
||||
if hit.repo.is_some() {
|
||||
continue;
|
||||
}
|
||||
let repo_val = cache
|
||||
.entry(hit.doc_id.clone())
|
||||
.or_insert_with(|| {
|
||||
// Deliberately non-aborting: a failed store lookup for
|
||||
// one hit must not abort the whole search response. Log
|
||||
// the error so it's observable rather than silently
|
||||
// dropped (review #140 round 1).
|
||||
match self.sqlite.get_document(&hit.doc_id) {
|
||||
Ok(opt) => opt.and_then(|doc| doc.metadata.repo),
|
||||
Err(e) => {
|
||||
tracing::warn!(
|
||||
target: "kebab-app",
|
||||
doc_id = %hit.doc_id,
|
||||
error = %e,
|
||||
"backfill_repo: get_document failed; leaving hit.repo = None"
|
||||
);
|
||||
None
|
||||
}
|
||||
}
|
||||
});
|
||||
if let Some(r) = repo_val {
|
||||
hit.repo = Some(r.clone());
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Resolve the embedder + vector store, surfacing the user-friendly
|
||||
/// "switch to --mode lexical" error when embeddings are disabled.
|
||||
fn require_embeddings(
|
||||
@@ -896,6 +964,21 @@ fn estimate_chars(hits: &[SearchHit]) -> usize {
|
||||
.sum()
|
||||
}
|
||||
|
||||
/// p10-1A-2: back-fill `SearchHit.code_lang` from `Citation::Code.lang`
|
||||
/// for every code hit in the list. The search layer (kebab-search)
|
||||
/// constructs hits with `code_lang: None`; we fill it here in kebab-app
|
||||
/// post-retrieval so callers see the correct language identifier without
|
||||
/// requiring a second SQL query.
|
||||
fn backfill_code_lang(hits: &mut [SearchHit]) {
|
||||
for hit in hits.iter_mut() {
|
||||
if let kebab_core::Citation::Code { lang, .. } = &hit.citation {
|
||||
if hit.code_lang.is_none() {
|
||||
hit.code_lang = lang.clone();
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
@@ -197,6 +197,8 @@ fn parse_one(raw: &Value) -> Result<(SearchQuery, SearchOpts), String> {
|
||||
media,
|
||||
ingested_after,
|
||||
doc_id,
|
||||
repo: vec![],
|
||||
code_lang: vec![],
|
||||
};
|
||||
|
||||
let opts = SearchOpts {
|
||||
|
||||
@@ -96,6 +96,7 @@ pub fn media_label(media: &kebab_core::MediaType) -> &'static str {
|
||||
kebab_core::MediaType::Pdf => "pdf",
|
||||
kebab_core::MediaType::Image(_) => "image",
|
||||
kebab_core::MediaType::Audio(_) => "audio",
|
||||
kebab_core::MediaType::Code(_) => "code",
|
||||
kebab_core::MediaType::Other(_) => "other",
|
||||
}
|
||||
}
|
||||
@@ -148,6 +149,7 @@ mod tests {
|
||||
media_label(&MediaType::Audio(kebab_core::AudioType::Wav)),
|
||||
"audio"
|
||||
);
|
||||
assert_eq!(media_label(&MediaType::Code("rust".into())), "code");
|
||||
assert_eq!(media_label(&MediaType::Other("x".into())), "other");
|
||||
}
|
||||
|
||||
|
||||
@@ -39,17 +39,18 @@ use std::sync::Arc;
|
||||
use anyhow::{Context, anyhow};
|
||||
use serde::{Deserialize, Serialize};
|
||||
|
||||
use kebab_chunk::{MdHeadingV1Chunker, PdfPageV1Chunker};
|
||||
use kebab_chunk::{CodeJsAstV1Chunker, CodePythonAstV1Chunker, CodeRustAstV1Chunker, CodeTsAstV1Chunker, MdHeadingV1Chunker, PdfPageV1Chunker};
|
||||
use kebab_core::{
|
||||
Answer, Block, CanonicalDocument, Chunk, ChunkId, ChunkPolicy, ChunkerVersion, Chunker,
|
||||
DocFilter, DocSummary, DocumentId, DocumentStore, Embedder, EmbeddingInput,
|
||||
EmbeddingKind, ExtractContext, Extractor, IngestReport, Lang, LanguageModel, MediaType,
|
||||
ParserVersion, RawAsset, SearchHit, SearchQuery, SourceConnector, SourceScope,
|
||||
ParserVersion, RawAsset, SearchHit, SearchQuery, SourceScope,
|
||||
SourceUri, VectorRecord, VectorStore,
|
||||
};
|
||||
use kebab_llm_local::OllamaLanguageModel;
|
||||
use kebab_normalize::build_canonical_document;
|
||||
use kebab_parse_image::{ImageExtractor, OllamaVisionOcr, apply_caption, apply_ocr};
|
||||
use kebab_parse_code::{JavascriptAstExtractor, PythonAstExtractor, RustAstExtractor, TypescriptAstExtractor};
|
||||
use kebab_parse_pdf::PdfTextExtractor;
|
||||
use kebab_parse_md::{BodyHints, parse_blocks, parse_frontmatter};
|
||||
use kebab_source_fs::FsSourceConnector;
|
||||
@@ -305,8 +306,8 @@ pub fn ingest_with_config_opts(
|
||||
);
|
||||
let connector = FsSourceConnector::new(&app.config)
|
||||
.context("kb-app::ingest: build FsSourceConnector")?;
|
||||
let assets = connector
|
||||
.scan(&scope)
|
||||
let (assets, fs_skips) = connector
|
||||
.scan_with_skips(&scope)
|
||||
.context("kb-app::ingest: scan workspace")?;
|
||||
crate::ingest_progress::emit(
|
||||
progress,
|
||||
@@ -675,6 +676,12 @@ pub fn ingest_with_config_opts(
|
||||
errors: error_count,
|
||||
duration_ms,
|
||||
skipped_by_extension,
|
||||
skipped_gitignore: fs_skips.skipped_gitignore,
|
||||
skipped_kebabignore: fs_skips.skipped_kebabignore,
|
||||
skipped_builtin_blacklist: fs_skips.skipped_builtin_blacklist,
|
||||
skipped_generated: fs_skips.skipped_generated,
|
||||
skipped_size_exceeded: fs_skips.skipped_size_exceeded,
|
||||
skip_examples: fs_skips.skip_examples,
|
||||
items: if summary_only { None } else { Some(items) },
|
||||
})
|
||||
}
|
||||
@@ -741,15 +748,18 @@ struct ImagePipeline<'a> {
|
||||
/// hold (per design §9 cascade rule):
|
||||
///
|
||||
/// 1. `force_reingest == false` — caller hasn't asked to bypass skip.
|
||||
/// 2. The freshly-scanned asset's blake3 checksum equals what the
|
||||
/// existing `assets` row stores at the same `workspace_path`.
|
||||
/// 3. The doc keyed on `(workspace_path, asset_id, current_parser_version)`
|
||||
/// exists. If the parser_version changed, `id_for_doc` produces a
|
||||
/// different `doc_id` so the lookup misses → no skip → re-process.
|
||||
/// 4. The existing doc's stamped `last_chunker_version` AND
|
||||
/// `last_embedding_version` match the values the caller is about
|
||||
/// to use (`Some(v) == Some(v)` and `None == None` — see design
|
||||
/// doc for the `None == None` rule when no embedder is configured).
|
||||
/// 2. A document already exists at this `workspace_path`
|
||||
/// (`get_document_by_workspace_path`). The lookup is document-side, not
|
||||
/// asset-side, so twin files (identical content at different paths) each
|
||||
/// hit their own stable doc row — `documents.workspace_path` is UNIQUE
|
||||
/// while `assets` may dedupe content into a single row with a flip-flop
|
||||
/// `workspace_path` column (dogfood bug #4, see `tasks/HOTFIXES.md`).
|
||||
/// 3. The existing doc's `source_asset_id` equals the freshly-scanned
|
||||
/// asset's blake3 checksum (content unchanged).
|
||||
/// 4. The existing doc's `parser_version` matches the current extractor's
|
||||
/// `parser_version` (extractor not upgraded). Combined with `chunker_version`
|
||||
/// and `last_embedding_version` checks immediately below — full cascade
|
||||
/// per design §9.
|
||||
///
|
||||
/// Returns `Ok(None)` (proceed with full re-process) when any check
|
||||
/// fails or any DB read errors out — the skip path is opportunistic;
|
||||
@@ -766,31 +776,19 @@ fn try_skip_unchanged(
|
||||
if force_reingest {
|
||||
return Ok(None);
|
||||
}
|
||||
let existing_asset = match app
|
||||
// Document-centric skip: look up the existing document row by
|
||||
// workspace_path directly. This avoids the twin-file flip-flop
|
||||
// that the old asset-side lookup suffers from — multiple files
|
||||
// with identical content share one `assets` row whose
|
||||
// `workspace_path` is overwritten on every UPSERT, so
|
||||
// `get_asset_by_workspace_path(path1)` could return the OTHER
|
||||
// twin's path (or None) after any ingest of the twin. The
|
||||
// `documents` table has a UNIQUE index on `workspace_path` (V001),
|
||||
// so each twin has its own stable row regardless of asset de-dup.
|
||||
let existing_doc = match app
|
||||
.sqlite
|
||||
.get_asset_by_workspace_path(&asset.workspace_path)
|
||||
.get_document_by_workspace_path(&asset.workspace_path)
|
||||
{
|
||||
Ok(Some(a)) => a,
|
||||
Ok(None) => return Ok(None),
|
||||
Err(e) => {
|
||||
tracing::debug!(
|
||||
target: "kebab-app",
|
||||
path = %asset.workspace_path.0,
|
||||
error = %e,
|
||||
"skip-check: get_asset_by_workspace_path failed; falling through to re-process"
|
||||
);
|
||||
return Ok(None);
|
||||
}
|
||||
};
|
||||
if existing_asset.checksum != asset.checksum {
|
||||
return Ok(None);
|
||||
}
|
||||
let candidate_doc_id = kebab_core::id_for_doc(
|
||||
&asset.workspace_path,
|
||||
&asset.asset_id,
|
||||
current_parser_version,
|
||||
);
|
||||
let existing_doc = match app.sqlite.get_document(&candidate_doc_id) {
|
||||
Ok(Some(d)) => d,
|
||||
Ok(None) => return Ok(None),
|
||||
Err(e) => {
|
||||
@@ -798,21 +796,37 @@ fn try_skip_unchanged(
|
||||
target: "kebab-app",
|
||||
path = %asset.workspace_path.0,
|
||||
error = %e,
|
||||
"skip-check: get_document failed; falling through to re-process"
|
||||
"skip-check: get_document_by_workspace_path failed; falling through to re-process"
|
||||
);
|
||||
return Ok(None);
|
||||
}
|
||||
};
|
||||
// 1. Content unchanged: the freshly-computed asset_id (blake3
|
||||
// content hash) must match what this document was ingested from.
|
||||
if existing_doc.source_asset_id != asset.asset_id {
|
||||
return Ok(None);
|
||||
}
|
||||
// 2. Parser unchanged: parser_version is baked into id_for_doc so
|
||||
// a version bump yields a different doc_id and the row above
|
||||
// would have been missing. Checking here explicitly keeps the
|
||||
// logic self-documenting and guards against future id_for_doc
|
||||
// changes.
|
||||
if existing_doc.parser_version != *current_parser_version {
|
||||
return Ok(None);
|
||||
}
|
||||
// 3. Chunker unchanged.
|
||||
let chunker_match = existing_doc.last_chunker_version.as_ref()
|
||||
== Some(current_chunker_version);
|
||||
if !chunker_match {
|
||||
return Ok(None);
|
||||
}
|
||||
// 4. Embedder unchanged.
|
||||
let embedder_match = existing_doc.last_embedding_version.as_ref()
|
||||
== current_embedding_version;
|
||||
if !embedder_match {
|
||||
return Ok(None);
|
||||
}
|
||||
let candidate_doc_id = existing_doc.doc_id.clone();
|
||||
tracing::debug!(
|
||||
target: "kebab-app::ingest",
|
||||
path = %asset.workspace_path.0,
|
||||
@@ -911,7 +925,24 @@ fn ingest_one_asset(
|
||||
force_reingest,
|
||||
);
|
||||
}
|
||||
_ => {
|
||||
// p10-1A-2 / 1B: code ingest dispatch.
|
||||
MediaType::Code(lang)
|
||||
if matches!(lang.as_str(), "rust" | "python" | "typescript" | "javascript") =>
|
||||
{
|
||||
return ingest_one_code_asset(
|
||||
app,
|
||||
asset,
|
||||
chunk_policy,
|
||||
embedder,
|
||||
vector_store,
|
||||
existing_doc_ids,
|
||||
force_reingest,
|
||||
lang.as_str(),
|
||||
);
|
||||
}
|
||||
// p10-1A-2: non-Rust Code, Audio, and Other are not yet wired;
|
||||
// skip until their respective phases.
|
||||
MediaType::Code(_) | MediaType::Audio(_) | MediaType::Other(_) => {
|
||||
return Ok(kebab_core::IngestItem {
|
||||
kind: kebab_core::IngestItemKind::Skipped,
|
||||
doc_id: None,
|
||||
@@ -1611,6 +1642,213 @@ fn ingest_one_pdf_asset(
|
||||
})
|
||||
}
|
||||
|
||||
/// p10-1A-2 Task 8: process one `MediaType::Code("rust")` asset end-to-end.
|
||||
///
|
||||
/// Mirrors `ingest_one_pdf_asset` line-for-line with the substitutions
|
||||
/// documented in the task spec:
|
||||
/// - parser_version → `code-rust-v1` (via `RUST_PARSER_VERSION`)
|
||||
/// - extractor → `RustAstExtractor`
|
||||
/// - chunker → `CodeRustAstV1Chunker`
|
||||
///
|
||||
/// All other steps (incremental skip, byte read, ExtractContext, put_*,
|
||||
/// embed, purge_vector_orphans) are identical to the PDF function.
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
fn ingest_one_code_asset(
|
||||
app: &App,
|
||||
asset: &RawAsset,
|
||||
chunk_policy: &ChunkPolicy,
|
||||
embedder: Option<&Arc<dyn Embedder + Send + Sync>>,
|
||||
vector_store: Option<&Arc<kebab_store_vector::LanceVectorStore>>,
|
||||
existing_doc_ids: &std::collections::HashSet<String>,
|
||||
force_reingest: bool,
|
||||
code_lang: &str, // <-- NEW (p10-1b Task D)
|
||||
) -> anyhow::Result<kebab_core::IngestItem> {
|
||||
let path = match &asset.source_uri {
|
||||
SourceUri::File(p) => p.clone(),
|
||||
SourceUri::Kb(_) => {
|
||||
return Ok(kebab_core::IngestItem {
|
||||
kind: kebab_core::IngestItemKind::Skipped,
|
||||
doc_id: None,
|
||||
doc_path: asset.workspace_path.clone(),
|
||||
asset_id: Some(asset.asset_id.clone()),
|
||||
byte_len: Some(asset.byte_len),
|
||||
block_count: None,
|
||||
chunk_count: None,
|
||||
parser_version: None,
|
||||
chunker_version: None,
|
||||
warnings: vec![
|
||||
"kb:// URI not yet supported".to_string(),
|
||||
],
|
||||
error: None,
|
||||
});
|
||||
}
|
||||
};
|
||||
|
||||
// p10-1b Task D/G/J: parser_version per-lang.
|
||||
let parser_version = match code_lang {
|
||||
"rust" => ParserVersion(kebab_parse_code::RUST_PARSER_VERSION.to_string()),
|
||||
"python" => ParserVersion(kebab_parse_code::PYTHON_PARSER_VERSION.to_string()),
|
||||
"typescript" => ParserVersion(kebab_parse_code::TS_PARSER_VERSION.to_string()),
|
||||
"javascript" => ParserVersion(kebab_parse_code::JS_PARSER_VERSION.to_string()),
|
||||
other => anyhow::bail!("unsupported code_lang: {other}"),
|
||||
};
|
||||
|
||||
// p10-1b Task D/G/J/L: chunker_version per-lang.
|
||||
let chunker_version = match code_lang {
|
||||
"rust" => CodeRustAstV1Chunker.chunker_version(),
|
||||
"python" => CodePythonAstV1Chunker.chunker_version(),
|
||||
"typescript" => CodeTsAstV1Chunker.chunker_version(),
|
||||
"javascript" => CodeJsAstV1Chunker.chunker_version(),
|
||||
other => anyhow::bail!("unreachable chunker_version: {other}"),
|
||||
};
|
||||
|
||||
if let Some(item) = try_skip_unchanged(
|
||||
app,
|
||||
asset,
|
||||
&parser_version,
|
||||
&chunker_version,
|
||||
embedder.map(|e| e.model_version()).as_ref(),
|
||||
force_reingest,
|
||||
)? {
|
||||
return Ok(item);
|
||||
}
|
||||
let bytes = std::fs::read(&path)
|
||||
.with_context(|| format!("read code asset bytes from {}", path.display()))?;
|
||||
|
||||
let extract_config = kebab_core::ExtractConfig::default();
|
||||
let workspace_root = app.config.resolve_workspace_root();
|
||||
let ctx = ExtractContext {
|
||||
asset,
|
||||
workspace_root: &workspace_root,
|
||||
config: &extract_config,
|
||||
};
|
||||
|
||||
// p10-1b Task D/G/J/L: extractor per-lang.
|
||||
let mut canonical = match code_lang {
|
||||
"rust" => RustAstExtractor::new()
|
||||
.extract(&ctx, &bytes)
|
||||
.context("kb-parse-code::RustAstExtractor::extract (code:rust)")?,
|
||||
"python" => PythonAstExtractor::new()
|
||||
.extract(&ctx, &bytes)
|
||||
.context("kb-parse-code::PythonAstExtractor::extract (code:python)")?,
|
||||
"typescript" => TypescriptAstExtractor::new()
|
||||
.extract(&ctx, &bytes)
|
||||
.context("kb-parse-code::TypescriptAstExtractor::extract (code:typescript)")?,
|
||||
"javascript" => JavascriptAstExtractor::new()
|
||||
.extract(&ctx, &bytes)
|
||||
.context("kb-parse-code::JavascriptAstExtractor::extract (code:javascript)")?,
|
||||
other => anyhow::bail!("unreachable (extract): {other}"),
|
||||
};
|
||||
|
||||
// p10-1b Task D/G/J/L: chunker per-lang.
|
||||
let chunks = match code_lang {
|
||||
"rust" => CodeRustAstV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::CodeRustAstV1Chunker::chunk (code:rust)")?,
|
||||
"python" => CodePythonAstV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::CodePythonAstV1Chunker::chunk (code:python)")?,
|
||||
"typescript" => CodeTsAstV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::CodeTsAstV1Chunker::chunk (code:typescript)")?,
|
||||
"javascript" => CodeJsAstV1Chunker
|
||||
.chunk(&canonical, chunk_policy)
|
||||
.context("kb-chunk::CodeJsAstV1Chunker::chunk (code:javascript)")?,
|
||||
other => anyhow::bail!("unreachable (chunk): {other}"),
|
||||
};
|
||||
|
||||
// Stamp chunker + embedding versions so incremental skip detection has
|
||||
// data on the second run.
|
||||
canonical.last_chunker_version = Some(chunker_version.clone());
|
||||
if let Some(emb) = embedder {
|
||||
canonical.last_embedding_version = Some(emb.model_version());
|
||||
}
|
||||
|
||||
purge_vector_orphans_for_workspace_path(app, asset, vector_store)?;
|
||||
app.sqlite
|
||||
.put_asset_with_bytes(asset, &bytes)
|
||||
.context("DocumentStore::put_asset_with_bytes (code)")?;
|
||||
app.sqlite
|
||||
.put_document(&canonical)
|
||||
.context("DocumentStore::put_document (code)")?;
|
||||
app.sqlite
|
||||
.put_blocks(&canonical.doc_id, &canonical.blocks)
|
||||
.context("DocumentStore::put_blocks (code)")?;
|
||||
app.sqlite
|
||||
.put_chunks(&canonical.doc_id, &chunks)
|
||||
.context("DocumentStore::put_chunks (code)")?;
|
||||
|
||||
if let (Some(emb), Some(vec_store)) = (embedder, vector_store)
|
||||
&& !chunks.is_empty()
|
||||
{
|
||||
let inputs: Vec<EmbeddingInput<'_>> = chunks
|
||||
.iter()
|
||||
.map(|c| EmbeddingInput {
|
||||
text: c.text.as_str(),
|
||||
kind: EmbeddingKind::Document,
|
||||
})
|
||||
.collect();
|
||||
let vectors = emb
|
||||
.embed(&inputs)
|
||||
.context("Embedder::embed (code chunks)")?;
|
||||
let model_id = emb.model_id();
|
||||
let model_version = emb.model_version();
|
||||
let dimensions = emb.dimensions();
|
||||
let records: Vec<VectorRecord> = chunks
|
||||
.iter()
|
||||
.zip(vectors)
|
||||
.map(|(c, v)| VectorRecord {
|
||||
embedding_id: kebab_core::id_for_embedding(
|
||||
&c.chunk_id,
|
||||
&model_id,
|
||||
&model_version,
|
||||
dimensions,
|
||||
),
|
||||
chunk_id: c.chunk_id.clone(),
|
||||
vector: v,
|
||||
doc_id: canonical.doc_id.clone(),
|
||||
text: c.text.clone(),
|
||||
heading_path: c.heading_path.clone(),
|
||||
model_id: model_id.clone(),
|
||||
model_version: model_version.clone(),
|
||||
dimensions,
|
||||
})
|
||||
.collect();
|
||||
vec_store
|
||||
.upsert(&records)
|
||||
.context("VectorStore::upsert (code)")?;
|
||||
}
|
||||
|
||||
let kind = if existing_doc_ids.contains(&canonical.doc_id.0) {
|
||||
kebab_core::IngestItemKind::Updated
|
||||
} else {
|
||||
kebab_core::IngestItemKind::New
|
||||
};
|
||||
|
||||
// Surface every `Provenance::Warning` note onto `IngestItem.warnings`.
|
||||
let warnings: Vec<String> = canonical
|
||||
.provenance
|
||||
.events
|
||||
.iter()
|
||||
.filter(|e| e.kind == kebab_core::ProvenanceKind::Warning)
|
||||
.filter_map(|e| e.note.clone())
|
||||
.collect();
|
||||
|
||||
Ok(kebab_core::IngestItem {
|
||||
kind,
|
||||
doc_id: Some(canonical.doc_id.clone()),
|
||||
doc_path: asset.workspace_path.clone(),
|
||||
asset_id: Some(asset.asset_id.clone()),
|
||||
byte_len: Some(asset.byte_len),
|
||||
block_count: u32::try_from(canonical.blocks.len()).ok(),
|
||||
chunk_count: u32::try_from(chunks.len()).ok(),
|
||||
parser_version: Some(canonical.parser_version.clone()),
|
||||
chunker_version: Some(chunker_version),
|
||||
warnings,
|
||||
error: None,
|
||||
})
|
||||
}
|
||||
|
||||
/// Pull the BCP-47 language hint from the canonical document. P6-1
|
||||
/// stamps `Lang("und")` by default; image-pipeline OCR / caption
|
||||
/// adapters special-case "und" so the hint is intentionally dropped
|
||||
|
||||
@@ -45,7 +45,7 @@ pub struct Models {
|
||||
pub corpus_revision: u64,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
#[derive(Debug, Clone, Default, Serialize, Deserialize)]
|
||||
pub struct Stats {
|
||||
pub doc_count: u64,
|
||||
pub chunk_count: u64,
|
||||
@@ -63,6 +63,14 @@ pub struct Stats {
|
||||
/// p9-fb-37: docs whose `updated_at` exceeds the staleness threshold.
|
||||
#[serde(default)]
|
||||
pub stale_doc_count: u64,
|
||||
/// p10-1A-1: code language breakdown (chunk counts by canonical lowercase
|
||||
/// language identifier). Empty until 1A-2 produces code chunks.
|
||||
#[serde(default)]
|
||||
pub code_lang_breakdown: std::collections::BTreeMap<String, u32>,
|
||||
/// p10-1A-1: repo breakdown (chunk counts by `metadata.repo` value).
|
||||
/// Empty until 1A-2 produces code chunks.
|
||||
#[serde(default)]
|
||||
pub repo_breakdown: std::collections::BTreeMap<String, u32>,
|
||||
}
|
||||
|
||||
const KEBAB_VERSION: &str = env!("CARGO_PKG_VERSION");
|
||||
@@ -158,6 +166,11 @@ fn collect_stats(
|
||||
lang_breakdown: counts.lang_breakdown,
|
||||
index_bytes,
|
||||
stale_doc_count: counts.stale_doc_count,
|
||||
// p10-1A-2: populated by the store query added in this task.
|
||||
code_lang_breakdown: store.code_lang_breakdown()?,
|
||||
// p10-1A-2 follow-up: dogfooding (2026-05-20) revealed this was a
|
||||
// placeholder — mirror of code_lang_breakdown for the repo field.
|
||||
repo_breakdown: store.repo_breakdown()?,
|
||||
})
|
||||
}
|
||||
|
||||
@@ -182,6 +195,32 @@ fn collect_models(cfg: &Config, store: &kebab_store_sqlite::SqliteStore) -> Mode
|
||||
mod tests_stats_ext {
|
||||
use super::*;
|
||||
|
||||
/// p10-1A-1: Stats must serialize `code_lang_breakdown` and
|
||||
/// `repo_breakdown` so downstream consumers (MCP skill, Claude Code)
|
||||
/// can branch on their presence.
|
||||
#[test]
|
||||
fn stats_includes_code_lang_and_repo_breakdown_fields() {
|
||||
let stats = Stats::default();
|
||||
let v = serde_json::to_value(&stats).unwrap();
|
||||
assert!(
|
||||
v.get("code_lang_breakdown").is_some(),
|
||||
"Stats JSON must include code_lang_breakdown: {v}"
|
||||
);
|
||||
assert!(
|
||||
v.get("repo_breakdown").is_some(),
|
||||
"Stats JSON must include repo_breakdown: {v}"
|
||||
);
|
||||
// Empty BTreeMap serializes as `{}` — confirm it's an object, not null.
|
||||
assert!(
|
||||
v["code_lang_breakdown"].is_object(),
|
||||
"code_lang_breakdown must be an object: {v}"
|
||||
);
|
||||
assert!(
|
||||
v["repo_breakdown"].is_object(),
|
||||
"repo_breakdown must be an object: {v}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn stats_includes_breakdowns_and_bytes_on_fresh_corpus() {
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
|
||||
431
crates/kebab-app/tests/code_ingest_smoke.rs
Normal file
431
crates/kebab-app/tests/code_ingest_smoke.rs
Normal file
@@ -0,0 +1,431 @@
|
||||
//! p10-1A-2 Task 8: smoke test for Rust code ingest dispatch.
|
||||
//!
|
||||
//! Writes a single `.rs` file into a TempDir workspace, ingests it via
|
||||
//! `kebab_app::ingest_with_config`, then searches for the symbol name and
|
||||
//! asserts that the resulting `SearchHit` carries a `Citation::Code`
|
||||
//! with the expected `lang`, `symbol`, and `line_start`.
|
||||
//!
|
||||
//! Mirrors the `pdf_pipeline.rs` harness: lexical-only (no AVX/fastembed),
|
||||
//! no OCR / caption adapters needed.
|
||||
|
||||
mod common;
|
||||
|
||||
use common::{TestEnv, lexical_query};
|
||||
|
||||
use kebab_core::{Citation, IngestItemKind};
|
||||
|
||||
/// A `.rs` file with a single `pub fn add` symbol is ingested, and a
|
||||
/// lexical search for "add" must return at least one `Citation::Code`
|
||||
/// hit whose `lang == "rust"`, `symbol == Some("add")`, and
|
||||
/// `line_start >= 1`.
|
||||
#[test]
|
||||
fn rust_file_ingests_and_searches_as_code_citation() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
// Write a minimal Rust file into the workspace root.
|
||||
std::fs::write(
|
||||
env.workspace_root.join("demo.rs"),
|
||||
"/// adds two integers\npub fn add(a: i32, b: i32) -> i32 {\n a + b\n}\n",
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let report =
|
||||
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("ingest must succeed");
|
||||
|
||||
assert_eq!(report.errors, 0, "no errors expected: {report:?}");
|
||||
let items = report.items.as_ref().expect("items present");
|
||||
let code_item = items
|
||||
.iter()
|
||||
.find(|i| i.doc_path.0.ends_with("demo.rs"))
|
||||
.expect("demo.rs item present");
|
||||
assert_eq!(
|
||||
code_item.kind,
|
||||
IngestItemKind::New,
|
||||
"first ingest must be New: {code_item:?}"
|
||||
);
|
||||
assert!(
|
||||
code_item.block_count.unwrap_or(0) >= 1,
|
||||
"at least one block expected: {code_item:?}"
|
||||
);
|
||||
assert!(
|
||||
code_item.chunk_count.unwrap_or(0) >= 1,
|
||||
"at least one chunk expected: {code_item:?}"
|
||||
);
|
||||
assert_eq!(
|
||||
code_item.parser_version.as_ref().map(|p| p.0.as_str()),
|
||||
Some("code-rust-v1"),
|
||||
"parser_version must be code-rust-v1"
|
||||
);
|
||||
assert_eq!(
|
||||
code_item.chunker_version.as_ref().map(|c| c.0.as_str()),
|
||||
Some("code-rust-ast-v1"),
|
||||
"chunker_version must be code-rust-ast-v1"
|
||||
);
|
||||
|
||||
// Lexical search for the symbol name "add".
|
||||
let hits = kebab_app::search_with_config(env.config.clone(), lexical_query("add"))
|
||||
.expect("search must succeed");
|
||||
|
||||
let h = hits
|
||||
.iter()
|
||||
.find(|h| matches!(&h.citation, Citation::Code { .. }))
|
||||
.expect("at least one Citation::Code hit for 'add'");
|
||||
|
||||
match &h.citation {
|
||||
Citation::Code {
|
||||
lang,
|
||||
symbol,
|
||||
line_start,
|
||||
..
|
||||
} => {
|
||||
assert_eq!(
|
||||
lang.as_deref(),
|
||||
Some("rust"),
|
||||
"citation.lang must be 'rust'"
|
||||
);
|
||||
assert_eq!(
|
||||
symbol.as_deref(),
|
||||
Some("add"),
|
||||
"citation.symbol must be 'add'"
|
||||
);
|
||||
assert!(*line_start >= 1, "line_start must be ≥1");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
|
||||
assert_eq!(
|
||||
h.code_lang.as_deref(),
|
||||
Some("rust"),
|
||||
"SearchHit.code_lang must be 'rust'"
|
||||
);
|
||||
}
|
||||
|
||||
/// p10-1A-2 Task 8b: a code search hit must carry `SearchHit.repo` filled
|
||||
/// from the document's `Metadata.repo` (which is set by `detect_repo` during
|
||||
/// ingest). `detect_repo` returns the name of the directory that contains
|
||||
/// `.git/`, so we `git init` the workspace root before ingesting and then
|
||||
/// assert that `h.repo == Some("workspace")`.
|
||||
#[test]
|
||||
fn rust_code_search_hit_has_repo() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
// `detect_repo` walks up from the file looking for `.git/`.
|
||||
// Initialise a bare git repo at the workspace root so it is
|
||||
// discoverable. We only need the `.git/` directory — no commits
|
||||
// required.
|
||||
let git_status = std::process::Command::new("git")
|
||||
.args(["init", "--quiet"])
|
||||
.arg(env.workspace_root.as_os_str())
|
||||
.status()
|
||||
.expect("git init");
|
||||
assert!(git_status.success(), "git init must succeed");
|
||||
|
||||
std::fs::write(
|
||||
env.workspace_root.join("repo_demo.rs"),
|
||||
"/// multiplies two integers\npub fn mul(a: i32, b: i32) -> i32 {\n a * b\n}\n",
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let report =
|
||||
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("ingest must succeed");
|
||||
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
|
||||
|
||||
let hits = kebab_app::search_with_config(env.config.clone(), lexical_query("mul"))
|
||||
.expect("search must succeed");
|
||||
|
||||
let h = hits
|
||||
.iter()
|
||||
.find(|h| matches!(&h.citation, Citation::Code { .. }))
|
||||
.expect("at least one Citation::Code hit for 'mul'");
|
||||
|
||||
// The workspace root directory is named "workspace" by `TestEnv`.
|
||||
let expected_repo = env
|
||||
.workspace_root
|
||||
.file_name()
|
||||
.and_then(|n| n.to_str())
|
||||
.map(str::to_owned);
|
||||
assert_eq!(
|
||||
h.repo,
|
||||
expected_repo,
|
||||
"SearchHit.repo must match the workspace dir name (detect_repo result)"
|
||||
);
|
||||
// Also sanity-check code_lang is still filled.
|
||||
assert_eq!(
|
||||
h.code_lang.as_deref(),
|
||||
Some("rust"),
|
||||
"SearchHit.code_lang must be 'rust'"
|
||||
);
|
||||
}
|
||||
|
||||
/// p10-1b Task G: a `.py` file in a sub-directory is ingested and the
|
||||
/// resulting `Citation::Code` hit must carry `lang="python"`,
|
||||
/// `symbol="kebab_eval.metrics.compute_mrr"`, and `line_start >= 1`.
|
||||
/// The sub-directory (`kebab_eval/`) ensures `module_path_for_python`
|
||||
/// produces a non-empty prefix so the fully-qualified symbol assertion
|
||||
/// exercises the prefix wiring end-to-end.
|
||||
#[test]
|
||||
fn python_file_ingests_and_searches_as_code_citation() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
let module_dir = env.workspace_root.join("kebab_eval");
|
||||
std::fs::create_dir_all(&module_dir).unwrap();
|
||||
std::fs::write(
|
||||
module_dir.join("metrics.py"),
|
||||
"\"\"\"compute metrics.\"\"\"\ndef compute_mrr(scores):\n return sum(scores) / max(len(scores), 1)\n",
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let report =
|
||||
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("ingest must succeed");
|
||||
|
||||
assert!(report.new >= 1, "python file ingested: {report:?}");
|
||||
|
||||
let items = report.items.as_ref().expect("items present");
|
||||
let py_item = items
|
||||
.iter()
|
||||
.find(|i| i.doc_path.0.ends_with("metrics.py"))
|
||||
.expect("metrics.py item");
|
||||
assert_eq!(
|
||||
py_item.parser_version.as_ref().map(|p| p.0.as_str()),
|
||||
Some("code-python-v1"),
|
||||
"parser_version must be code-python-v1"
|
||||
);
|
||||
assert_eq!(
|
||||
py_item.chunker_version.as_ref().map(|c| c.0.as_str()),
|
||||
Some("code-python-ast-v1"),
|
||||
"chunker_version must be code-python-ast-v1"
|
||||
);
|
||||
|
||||
let hits = kebab_app::search_with_config(env.config.clone(), lexical_query("compute_mrr"))
|
||||
.expect("search must succeed");
|
||||
|
||||
let h = hits
|
||||
.iter()
|
||||
.find(|h| matches!(&h.citation, Citation::Code { .. }))
|
||||
.expect("at least one Citation::Code hit for 'compute_mrr'");
|
||||
|
||||
match &h.citation {
|
||||
Citation::Code {
|
||||
lang,
|
||||
symbol,
|
||||
line_start,
|
||||
..
|
||||
} => {
|
||||
assert_eq!(
|
||||
lang.as_deref(),
|
||||
Some("python"),
|
||||
"citation.lang must be 'python'"
|
||||
);
|
||||
assert_eq!(
|
||||
symbol.as_deref(),
|
||||
Some("kebab_eval.metrics.compute_mrr"),
|
||||
"citation.symbol must be 'kebab_eval.metrics.compute_mrr'"
|
||||
);
|
||||
assert!(*line_start >= 1, "line_start must be >=1");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
|
||||
assert_eq!(
|
||||
h.code_lang.as_deref(),
|
||||
Some("python"),
|
||||
"SearchHit.code_lang must be 'python'"
|
||||
);
|
||||
}
|
||||
|
||||
/// p10-1b Task J: a `.ts` file in a sub-directory is ingested and the
|
||||
/// resulting `Citation::Code` hit must carry `lang="typescript"`,
|
||||
/// `symbol="src/Foo.Foo.bar"`, and `line_start >= 1`.
|
||||
/// The sub-directory (`src/`) ensures `module_path_for_tsjs` produces
|
||||
/// a non-empty prefix so the fully-qualified symbol assertion exercises
|
||||
/// the prefix wiring end-to-end.
|
||||
#[test]
|
||||
fn typescript_file_ingests_and_searches_as_code_citation() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
let src_dir = env.workspace_root.join("src");
|
||||
std::fs::create_dir_all(&src_dir).unwrap();
|
||||
std::fs::write(
|
||||
src_dir.join("Foo.ts"),
|
||||
"export class Foo {\n bar(): number { return 42; }\n}\n",
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let report =
|
||||
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("ingest must succeed");
|
||||
|
||||
assert!(report.new >= 1, "ts file ingested: {report:?}");
|
||||
|
||||
let items = report.items.as_ref().expect("items present");
|
||||
let ts_item = items
|
||||
.iter()
|
||||
.find(|i| i.doc_path.0.ends_with("Foo.ts"))
|
||||
.expect("Foo.ts item");
|
||||
assert_eq!(
|
||||
ts_item.parser_version.as_ref().map(|p| p.0.as_str()),
|
||||
Some("code-ts-v1"),
|
||||
"parser_version must be code-ts-v1"
|
||||
);
|
||||
assert_eq!(
|
||||
ts_item.chunker_version.as_ref().map(|c| c.0.as_str()),
|
||||
Some("code-ts-ast-v1"),
|
||||
"chunker_version must be code-ts-ast-v1"
|
||||
);
|
||||
|
||||
let hits = kebab_app::search_with_config(env.config.clone(), lexical_query("bar"))
|
||||
.expect("search must succeed");
|
||||
|
||||
let h = hits
|
||||
.iter()
|
||||
.find(|h| matches!(&h.citation, Citation::Code { .. }))
|
||||
.expect("at least one Citation::Code hit for 'bar'");
|
||||
|
||||
match &h.citation {
|
||||
Citation::Code {
|
||||
lang,
|
||||
symbol,
|
||||
line_start,
|
||||
..
|
||||
} => {
|
||||
assert_eq!(
|
||||
lang.as_deref(),
|
||||
Some("typescript"),
|
||||
"citation.lang must be 'typescript'"
|
||||
);
|
||||
assert_eq!(
|
||||
symbol.as_deref(),
|
||||
Some("src/Foo.Foo.bar"),
|
||||
"citation.symbol must be 'src/Foo.Foo.bar'"
|
||||
);
|
||||
assert!(*line_start >= 1, "line_start must be >=1");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
|
||||
assert_eq!(
|
||||
h.code_lang.as_deref(),
|
||||
Some("typescript"),
|
||||
"SearchHit.code_lang must be 'typescript'"
|
||||
);
|
||||
}
|
||||
|
||||
/// p10-1b Task L: a `.js` file in a sub-directory is ingested and the
|
||||
/// resulting `Citation::Code` hit must carry `lang="javascript"`,
|
||||
/// `symbol="src/Bar.Bar.baz"`, and `line_start >= 1`.
|
||||
/// The sub-directory (`src/`) ensures `module_path_for_tsjs` produces
|
||||
/// a non-empty prefix so the fully-qualified symbol assertion exercises
|
||||
/// the prefix wiring end-to-end.
|
||||
#[test]
|
||||
fn javascript_file_ingests_and_searches_as_code_citation() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
let src_dir = env.workspace_root.join("src");
|
||||
std::fs::create_dir_all(&src_dir).unwrap();
|
||||
std::fs::write(
|
||||
src_dir.join("Bar.js"),
|
||||
"export class Bar {\n baz() { return 7; }\n}\n",
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let report =
|
||||
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("ingest must succeed");
|
||||
|
||||
assert!(report.new >= 1, "js file ingested: {report:?}");
|
||||
|
||||
let items = report.items.as_ref().expect("items present");
|
||||
let js_item = items
|
||||
.iter()
|
||||
.find(|i| i.doc_path.0.ends_with("Bar.js"))
|
||||
.expect("Bar.js item");
|
||||
assert_eq!(
|
||||
js_item.parser_version.as_ref().map(|p| p.0.as_str()),
|
||||
Some("code-js-v1"),
|
||||
"parser_version must be code-js-v1"
|
||||
);
|
||||
assert_eq!(
|
||||
js_item.chunker_version.as_ref().map(|c| c.0.as_str()),
|
||||
Some("code-js-ast-v1"),
|
||||
"chunker_version must be code-js-ast-v1"
|
||||
);
|
||||
|
||||
let hits = kebab_app::search_with_config(env.config.clone(), lexical_query("baz"))
|
||||
.expect("search must succeed");
|
||||
|
||||
let h = hits
|
||||
.iter()
|
||||
.find(|h| matches!(&h.citation, Citation::Code { .. }))
|
||||
.expect("at least one Citation::Code hit for 'baz'");
|
||||
|
||||
match &h.citation {
|
||||
Citation::Code {
|
||||
lang,
|
||||
symbol,
|
||||
line_start,
|
||||
..
|
||||
} => {
|
||||
assert_eq!(
|
||||
lang.as_deref(),
|
||||
Some("javascript"),
|
||||
"citation.lang must be 'javascript'"
|
||||
);
|
||||
assert_eq!(
|
||||
symbol.as_deref(),
|
||||
Some("src/Bar.Bar.baz"),
|
||||
"citation.symbol must be 'src/Bar.Bar.baz'"
|
||||
);
|
||||
assert!(*line_start >= 1, "line_start must be >=1");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
|
||||
assert_eq!(
|
||||
h.code_lang.as_deref(),
|
||||
Some("javascript"),
|
||||
"SearchHit.code_lang must be 'javascript'"
|
||||
);
|
||||
}
|
||||
|
||||
/// Re-ingesting the same `.rs` file without changes must report
|
||||
/// `Unchanged` (incremental-skip path exercised).
|
||||
#[test]
|
||||
fn rust_file_re_ingest_is_unchanged() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
std::fs::write(
|
||||
env.workspace_root.join("stable.rs"),
|
||||
"pub fn noop() {}\n",
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let r1 =
|
||||
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
|
||||
let item1 = r1
|
||||
.items
|
||||
.as_ref()
|
||||
.unwrap()
|
||||
.iter()
|
||||
.find(|i| i.doc_path.0.ends_with("stable.rs"))
|
||||
.cloned()
|
||||
.unwrap();
|
||||
assert_eq!(item1.kind, IngestItemKind::New);
|
||||
|
||||
let r2 =
|
||||
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
|
||||
let item2 = r2
|
||||
.items
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.find(|i| i.doc_path.0.ends_with("stable.rs"))
|
||||
.unwrap();
|
||||
assert_eq!(
|
||||
item2.kind,
|
||||
IngestItemKind::Unchanged,
|
||||
"identical bytes → Unchanged"
|
||||
);
|
||||
assert_eq!(item2.doc_id, item1.doc_id);
|
||||
}
|
||||
90
crates/kebab-app/tests/twin_files_idempotent.rs
Normal file
90
crates/kebab-app/tests/twin_files_idempotent.rs
Normal file
@@ -0,0 +1,90 @@
|
||||
//! Regression test for the twin-file idempotency bug.
|
||||
//!
|
||||
//! Identical-content files at different workspace paths share one
|
||||
//! `assets` row (`asset_id` = blake3 content hash, PRIMARY KEY). The
|
||||
//! old UPSERT `ON CONFLICT(asset_id) DO UPDATE SET workspace_path =
|
||||
//! excluded.workspace_path` made each twin overwrite the other's path
|
||||
//! on every ingest, so `get_asset_by_workspace_path(path1)` returned
|
||||
//! None (or the wrong twin) → re-process every time.
|
||||
//!
|
||||
//! Fix: `try_skip_unchanged` now uses `get_document_by_workspace_path`
|
||||
//! instead. `documents.workspace_path` is UNIQUE (V001) so each twin
|
||||
//! has its own stable document row.
|
||||
//!
|
||||
//! Assertion contract:
|
||||
//! 1st ingest → 2 New (one per twin)
|
||||
//! 2nd ingest → 0 New, 0 Updated, 2 Unchanged
|
||||
|
||||
mod common;
|
||||
|
||||
use common::TestEnv;
|
||||
use kebab_app::ingest_with_config;
|
||||
use kebab_core::IngestItemKind;
|
||||
|
||||
#[test]
|
||||
fn twin_files_second_ingest_is_unchanged() {
|
||||
let env = TestEnv::lexical_only();
|
||||
|
||||
// Write two files with identical content at different paths.
|
||||
let pkg_a = env.workspace_root.join("pkg_a");
|
||||
let pkg_b = env.workspace_root.join("pkg_b");
|
||||
std::fs::create_dir_all(&pkg_a).unwrap();
|
||||
std::fs::create_dir_all(&pkg_b).unwrap();
|
||||
|
||||
let content = b"# shared\nThis content is identical in both files.\n";
|
||||
std::fs::write(pkg_a.join("__init__.py"), content).unwrap();
|
||||
std::fs::write(pkg_b.join("__init__.py"), content).unwrap();
|
||||
|
||||
// First ingest — both files must be New.
|
||||
let first = ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("first ingest must succeed");
|
||||
assert_eq!(first.errors, 0, "first ingest: no errors; report={first:?}");
|
||||
|
||||
let items = first.items.as_ref().expect("items must be present");
|
||||
let twin_items: Vec<_> = items
|
||||
.iter()
|
||||
.filter(|i| {
|
||||
i.doc_path.0.ends_with("__init__.py")
|
||||
})
|
||||
.collect();
|
||||
assert_eq!(
|
||||
twin_items.len(),
|
||||
2,
|
||||
"first ingest: expected exactly 2 __init__.py items; items={items:?}"
|
||||
);
|
||||
for item in &twin_items {
|
||||
assert_eq!(
|
||||
item.kind,
|
||||
IngestItemKind::New,
|
||||
"first ingest: each twin must be New; item={item:?}"
|
||||
);
|
||||
}
|
||||
|
||||
// Second ingest — same files, same content → both must be Unchanged.
|
||||
let second = ingest_with_config(env.config.clone(), env.scope(), false)
|
||||
.expect("second ingest must succeed");
|
||||
assert_eq!(second.errors, 0, "second ingest: no errors; report={second:?}");
|
||||
assert_eq!(second.new, 0, "second ingest: no new docs; report={second:?}");
|
||||
assert_eq!(
|
||||
second.updated, 0,
|
||||
"second ingest: no updated docs (twin-file bug would set this to 2); report={second:?}"
|
||||
);
|
||||
|
||||
let second_items = second.items.as_ref().expect("items must be present");
|
||||
let twin_items2: Vec<_> = second_items
|
||||
.iter()
|
||||
.filter(|i| i.doc_path.0.ends_with("__init__.py"))
|
||||
.collect();
|
||||
assert_eq!(
|
||||
twin_items2.len(),
|
||||
2,
|
||||
"second ingest: expected exactly 2 __init__.py items; items={second_items:?}"
|
||||
);
|
||||
for item in &twin_items2 {
|
||||
assert_eq!(
|
||||
item.kind,
|
||||
IngestItemKind::Unchanged,
|
||||
"second ingest: each twin must be Unchanged; item={item:?}"
|
||||
);
|
||||
}
|
||||
}
|
||||
322
crates/kebab-chunk/src/code_js_ast_v1.rs
Normal file
322
crates/kebab-chunk/src/code_js_ast_v1.rs
Normal file
@@ -0,0 +1,322 @@
|
||||
//! `code-js-ast-v1` — maps a tree-sitter-derived JavaScript AST
|
||||
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
|
||||
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
|
||||
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
|
||||
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
|
||||
//!
|
||||
//! tree-sitter is intentionally NOT a dependency here: AST work is
|
||||
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
|
||||
//! consumes the `CanonicalDocument`.
|
||||
//!
|
||||
//! `AST_CHUNK_MAX_LINES` is a constant matching
|
||||
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
|
||||
//! config threading needs a chunker registry (P+); same deviation
|
||||
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
|
||||
//! (`tasks/HOTFIXES.md`).
|
||||
|
||||
use kebab_core::{
|
||||
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
|
||||
SourceSpan, id_for_chunk,
|
||||
};
|
||||
|
||||
const VERSION_LABEL: &str = "code-js-ast-v1";
|
||||
const BYTES_PER_TOKEN: usize = 3;
|
||||
const POLICY_HASH_HEX_LEN: usize = 16;
|
||||
const AST_CHUNK_MAX_LINES: u32 = 200;
|
||||
|
||||
#[derive(Clone, Copy, Debug, Default)]
|
||||
pub struct CodeJsAstV1Chunker;
|
||||
|
||||
impl Chunker for CodeJsAstV1Chunker {
|
||||
fn chunker_version(&self) -> ChunkerVersion {
|
||||
ChunkerVersion(VERSION_LABEL.to_string())
|
||||
}
|
||||
|
||||
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
|
||||
let bytes = serde_json_canonicalizer::to_vec(policy)
|
||||
.expect("canonical JSON serialization of ChunkPolicy must not fail");
|
||||
let hex = blake3::hash(&bytes).to_hex().to_string();
|
||||
hex[..POLICY_HASH_HEX_LEN].to_string()
|
||||
}
|
||||
|
||||
fn chunk(
|
||||
&self,
|
||||
doc: &CanonicalDocument,
|
||||
policy: &ChunkPolicy,
|
||||
) -> anyhow::Result<Vec<Chunk>> {
|
||||
for b in &doc.blocks {
|
||||
let c = match b {
|
||||
Block::Code(c) => c,
|
||||
_ => anyhow::bail!(
|
||||
"CodeJsAstV1Chunker only handles code docs (got non-Code block)"
|
||||
),
|
||||
};
|
||||
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
|
||||
anyhow::bail!(
|
||||
"CodeJsAstV1Chunker only handles code docs (got non-Code source_span)"
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
let base_policy_hash = self.policy_hash(policy);
|
||||
let chunker_version = self.chunker_version();
|
||||
let mut out: Vec<Chunk> = Vec::new();
|
||||
|
||||
for b in &doc.blocks {
|
||||
let cb = match b {
|
||||
Block::Code(c) => c,
|
||||
_ => unreachable!("validated above"),
|
||||
};
|
||||
let (ls, le, symbol, lang) = match &cb.common.source_span {
|
||||
SourceSpan::Code { line_start, line_end, symbol, lang } => {
|
||||
(*line_start, *line_end, symbol.clone(), lang.clone())
|
||||
}
|
||||
_ => unreachable!("validated above"),
|
||||
};
|
||||
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
|
||||
let span_lines = le.saturating_sub(ls) + 1;
|
||||
|
||||
if span_lines <= AST_CHUNK_MAX_LINES {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: ls,
|
||||
line_end: le,
|
||||
symbol: symbol.clone(),
|
||||
lang: lang.clone(),
|
||||
};
|
||||
out.push(make_chunk(
|
||||
doc, &chunker_version, &block_ids, &base_policy_hash,
|
||||
None, span, cb.code.clone(),
|
||||
));
|
||||
} else {
|
||||
let parts = split_oversize(&cb.code);
|
||||
let n = parts.len();
|
||||
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
|
||||
let part_ls = ls + off_start;
|
||||
let part_le = ls + off_end;
|
||||
let part_sym = symbol
|
||||
.as_ref()
|
||||
.map(|s| format!("{s} [part {}/{n}]", i + 1));
|
||||
let span = SourceSpan::Code {
|
||||
line_start: part_ls,
|
||||
line_end: part_le,
|
||||
symbol: part_sym,
|
||||
lang: lang.clone(),
|
||||
};
|
||||
out.push(make_chunk(
|
||||
doc, &chunker_version, &block_ids, &base_policy_hash,
|
||||
Some(part_ls), span, text,
|
||||
));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-chunk",
|
||||
doc_id = %doc.doc_id,
|
||||
chunks = out.len(),
|
||||
"code-js-ast-v1 chunked",
|
||||
);
|
||||
Ok(out)
|
||||
}
|
||||
}
|
||||
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
fn make_chunk(
|
||||
doc: &CanonicalDocument,
|
||||
chunker_version: &ChunkerVersion,
|
||||
block_ids: &[BlockId],
|
||||
base_policy_hash: &str,
|
||||
split_key: Option<u32>,
|
||||
span: SourceSpan,
|
||||
text: String,
|
||||
) -> Chunk {
|
||||
let id_hash = match split_key {
|
||||
Some(k) => format!("{base_policy_hash}#L{k}"),
|
||||
None => base_policy_hash.to_string(),
|
||||
};
|
||||
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
|
||||
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
|
||||
Chunk {
|
||||
chunk_id,
|
||||
doc_id: DocumentId(doc.doc_id.0.clone()),
|
||||
block_ids: block_ids.to_vec(),
|
||||
text,
|
||||
heading_path: Vec::new(),
|
||||
source_spans: vec![span],
|
||||
token_estimate,
|
||||
chunker_version: chunker_version.clone(),
|
||||
policy_hash: base_policy_hash.to_string(),
|
||||
}
|
||||
}
|
||||
|
||||
/// Split an oversize unit at blank-line paragraph boundaries, greedily
|
||||
/// gluing paragraphs until ~`AST_CHUNK_MAX_LINES` lines accumulate.
|
||||
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
|
||||
/// 0-based within the unit (caller adds the unit's absolute `line_start`).
|
||||
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
|
||||
let lines: Vec<&str> = code.split('\n').collect();
|
||||
let total = lines.len() as u32;
|
||||
let mut out: Vec<(u32, u32, String)> = Vec::new();
|
||||
let mut start: u32 = 0;
|
||||
while start < total {
|
||||
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
|
||||
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
|
||||
if end < total {
|
||||
if let Some(b) = (floor.min(end)..end)
|
||||
.rev()
|
||||
.find(|&i| lines[i as usize].trim().is_empty())
|
||||
{
|
||||
end = b + 1;
|
||||
}
|
||||
}
|
||||
let text = lines[start as usize..end as usize].join("\n");
|
||||
out.push((start, end.saturating_sub(1), text));
|
||||
start = end;
|
||||
}
|
||||
if out.is_empty() {
|
||||
out.push((0, total.saturating_sub(1), code.to_string()));
|
||||
}
|
||||
out
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use kebab_core::{
|
||||
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
|
||||
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
|
||||
SourceType, TrustLevel, WorkspacePath,
|
||||
};
|
||||
use time::OffsetDateTime;
|
||||
|
||||
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
|
||||
let wp = WorkspacePath("crates/x/src/a.js".into());
|
||||
let aid = AssetId("a".repeat(64));
|
||||
let pv = ParserVersion("code-js-v1".into());
|
||||
let doc_id = id_for_doc(&wp, &aid, &pv);
|
||||
let blocks = units
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, (sym, ls, le, code))| {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: *ls,
|
||||
line_end: *le,
|
||||
symbol: Some((*sym).to_string()),
|
||||
lang: Some("javascript".into()),
|
||||
};
|
||||
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
|
||||
Block::Code(CodeBlock {
|
||||
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
|
||||
lang: Some("javascript".into()),
|
||||
code: (*code).to_string(),
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
CanonicalDocument {
|
||||
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
|
||||
lang: Lang("und".into()), blocks,
|
||||
metadata: Metadata {
|
||||
aliases: vec![], tags: vec![],
|
||||
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None, user: Default::default(),
|
||||
repo: Some("kebab".into()), git_branch: Some("main".into()),
|
||||
git_commit: Some("0".repeat(40)), code_lang: Some("javascript".into()),
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: pv, schema_version: 1, doc_version: 1,
|
||||
last_chunker_version: None, last_embedding_version: None,
|
||||
}
|
||||
}
|
||||
fn policy() -> ChunkPolicy {
|
||||
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
|
||||
respect_markdown_headings: false,
|
||||
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn chunker_version_is_code_js_ast_v1() {
|
||||
assert_eq!(CodeJsAstV1Chunker.chunker_version(),
|
||||
ChunkerVersion("code-js-ast-v1".into()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn one_chunk_per_unit_preserves_code_span() {
|
||||
let doc = code_doc(&[
|
||||
("parse", 1, 3, "function parse() {\n // x\n}"),
|
||||
("Foo.double", 5, 7, "function double() {\n //\n return 0;\n}"),
|
||||
]);
|
||||
let chunks = CodeJsAstV1Chunker.chunk(&doc, &policy()).unwrap();
|
||||
assert_eq!(chunks.len(), 2);
|
||||
for c in &chunks {
|
||||
assert_eq!(c.source_spans.len(), 1);
|
||||
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
|
||||
assert_eq!(c.heading_path, Vec::<String>::new());
|
||||
assert_eq!(c.chunker_version.0, "code-js-ast-v1");
|
||||
}
|
||||
match &chunks[0].source_spans[0] {
|
||||
SourceSpan::Code { symbol, line_start, line_end, .. } => {
|
||||
assert_eq!(symbol.as_deref(), Some("parse"));
|
||||
assert_eq!((*line_start, *line_end), (1, 3));
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn oversize_unit_splits_into_parts_with_unique_ids() {
|
||||
let body = (0..500).map(|i| format!(" const x{i} = {i};")).collect::<Vec<_>>().join("\n");
|
||||
let code = format!("function big() {{\n{body}\n}}");
|
||||
let doc = code_doc(&[("big", 1, 502, &code)]);
|
||||
let chunks = CodeJsAstV1Chunker.chunk(&doc, &policy()).unwrap();
|
||||
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
|
||||
for c in &chunks {
|
||||
match &c.source_spans[0] {
|
||||
SourceSpan::Code { symbol, .. } => {
|
||||
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
|
||||
"part-numbered symbol, got {symbol:?}");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
}
|
||||
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
|
||||
let n = ids.len(); ids.sort(); ids.dedup();
|
||||
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn non_code_doc_errors() {
|
||||
use kebab_core::TextBlock;
|
||||
let mut doc = code_doc(&[("parse", 1, 1, "function parse() {}")]);
|
||||
doc.blocks = vec![Block::Paragraph(TextBlock {
|
||||
common: CommonBlock {
|
||||
block_id: kebab_core::BlockId("b".into()),
|
||||
heading_path: vec![],
|
||||
source_span: SourceSpan::Line { start: 1, end: 1 },
|
||||
},
|
||||
text: "x".into(), inlines: vec![],
|
||||
})];
|
||||
let err = CodeJsAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
|
||||
assert!(err.to_string().contains("CodeJsAstV1Chunker"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deterministic_chunk_ids_1000() {
|
||||
let doc = code_doc(&[("parse", 1, 2, "function parse() {}\n")]);
|
||||
let base: Vec<String> = CodeJsAstV1Chunker.chunk(&doc, &policy())
|
||||
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
|
||||
for _ in 0..1000 {
|
||||
let again: Vec<String> = CodeJsAstV1Chunker.chunk(&doc, &policy())
|
||||
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
|
||||
assert_eq!(again, base);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn policy_hash_matches_md_heading_v1() {
|
||||
let p = policy();
|
||||
assert_eq!(CodeJsAstV1Chunker.policy_hash(&p),
|
||||
crate::MdHeadingV1Chunker.policy_hash(&p));
|
||||
}
|
||||
}
|
||||
322
crates/kebab-chunk/src/code_python_ast_v1.rs
Normal file
322
crates/kebab-chunk/src/code_python_ast_v1.rs
Normal file
@@ -0,0 +1,322 @@
|
||||
//! `code-python-ast-v1` — maps a tree-sitter-derived Python AST
|
||||
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
|
||||
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
|
||||
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
|
||||
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
|
||||
//!
|
||||
//! tree-sitter is intentionally NOT a dependency here: AST work is
|
||||
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
|
||||
//! consumes the `CanonicalDocument`.
|
||||
//!
|
||||
//! `AST_CHUNK_MAX_LINES` is a constant matching
|
||||
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
|
||||
//! config threading needs a chunker registry (P+); same deviation
|
||||
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
|
||||
//! (`tasks/HOTFIXES.md`).
|
||||
|
||||
use kebab_core::{
|
||||
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
|
||||
SourceSpan, id_for_chunk,
|
||||
};
|
||||
|
||||
const VERSION_LABEL: &str = "code-python-ast-v1";
|
||||
const BYTES_PER_TOKEN: usize = 3;
|
||||
const POLICY_HASH_HEX_LEN: usize = 16;
|
||||
const AST_CHUNK_MAX_LINES: u32 = 200;
|
||||
|
||||
#[derive(Clone, Copy, Debug, Default)]
|
||||
pub struct CodePythonAstV1Chunker;
|
||||
|
||||
impl Chunker for CodePythonAstV1Chunker {
|
||||
fn chunker_version(&self) -> ChunkerVersion {
|
||||
ChunkerVersion(VERSION_LABEL.to_string())
|
||||
}
|
||||
|
||||
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
|
||||
let bytes = serde_json_canonicalizer::to_vec(policy)
|
||||
.expect("canonical JSON serialization of ChunkPolicy must not fail");
|
||||
let hex = blake3::hash(&bytes).to_hex().to_string();
|
||||
hex[..POLICY_HASH_HEX_LEN].to_string()
|
||||
}
|
||||
|
||||
fn chunk(
|
||||
&self,
|
||||
doc: &CanonicalDocument,
|
||||
policy: &ChunkPolicy,
|
||||
) -> anyhow::Result<Vec<Chunk>> {
|
||||
for b in &doc.blocks {
|
||||
let c = match b {
|
||||
Block::Code(c) => c,
|
||||
_ => anyhow::bail!(
|
||||
"CodePythonAstV1Chunker only handles code docs (got non-Code block)"
|
||||
),
|
||||
};
|
||||
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
|
||||
anyhow::bail!(
|
||||
"CodePythonAstV1Chunker only handles code docs (got non-Code source_span)"
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
let base_policy_hash = self.policy_hash(policy);
|
||||
let chunker_version = self.chunker_version();
|
||||
let mut out: Vec<Chunk> = Vec::new();
|
||||
|
||||
for b in &doc.blocks {
|
||||
let cb = match b {
|
||||
Block::Code(c) => c,
|
||||
_ => unreachable!("validated above"),
|
||||
};
|
||||
let (ls, le, symbol, lang) = match &cb.common.source_span {
|
||||
SourceSpan::Code { line_start, line_end, symbol, lang } => {
|
||||
(*line_start, *line_end, symbol.clone(), lang.clone())
|
||||
}
|
||||
_ => unreachable!("validated above"),
|
||||
};
|
||||
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
|
||||
let span_lines = le.saturating_sub(ls) + 1;
|
||||
|
||||
if span_lines <= AST_CHUNK_MAX_LINES {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: ls,
|
||||
line_end: le,
|
||||
symbol: symbol.clone(),
|
||||
lang: lang.clone(),
|
||||
};
|
||||
out.push(make_chunk(
|
||||
doc, &chunker_version, &block_ids, &base_policy_hash,
|
||||
None, span, cb.code.clone(),
|
||||
));
|
||||
} else {
|
||||
let parts = split_oversize(&cb.code);
|
||||
let n = parts.len();
|
||||
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
|
||||
let part_ls = ls + off_start;
|
||||
let part_le = ls + off_end;
|
||||
let part_sym = symbol
|
||||
.as_ref()
|
||||
.map(|s| format!("{s} [part {}/{n}]", i + 1));
|
||||
let span = SourceSpan::Code {
|
||||
line_start: part_ls,
|
||||
line_end: part_le,
|
||||
symbol: part_sym,
|
||||
lang: lang.clone(),
|
||||
};
|
||||
out.push(make_chunk(
|
||||
doc, &chunker_version, &block_ids, &base_policy_hash,
|
||||
Some(part_ls), span, text,
|
||||
));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-chunk",
|
||||
doc_id = %doc.doc_id,
|
||||
chunks = out.len(),
|
||||
"code-python-ast-v1 chunked",
|
||||
);
|
||||
Ok(out)
|
||||
}
|
||||
}
|
||||
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
fn make_chunk(
|
||||
doc: &CanonicalDocument,
|
||||
chunker_version: &ChunkerVersion,
|
||||
block_ids: &[BlockId],
|
||||
base_policy_hash: &str,
|
||||
split_key: Option<u32>,
|
||||
span: SourceSpan,
|
||||
text: String,
|
||||
) -> Chunk {
|
||||
let id_hash = match split_key {
|
||||
Some(k) => format!("{base_policy_hash}#L{k}"),
|
||||
None => base_policy_hash.to_string(),
|
||||
};
|
||||
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
|
||||
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
|
||||
Chunk {
|
||||
chunk_id,
|
||||
doc_id: DocumentId(doc.doc_id.0.clone()),
|
||||
block_ids: block_ids.to_vec(),
|
||||
text,
|
||||
heading_path: Vec::new(),
|
||||
source_spans: vec![span],
|
||||
token_estimate,
|
||||
chunker_version: chunker_version.clone(),
|
||||
policy_hash: base_policy_hash.to_string(),
|
||||
}
|
||||
}
|
||||
|
||||
/// Split an oversize unit at blank-line paragraph boundaries, greedily
|
||||
/// gluing paragraphs until ~`AST_CHUNK_MAX_LINES` lines accumulate.
|
||||
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
|
||||
/// 0-based within the unit (caller adds the unit's absolute `line_start`).
|
||||
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
|
||||
let lines: Vec<&str> = code.split('\n').collect();
|
||||
let total = lines.len() as u32;
|
||||
let mut out: Vec<(u32, u32, String)> = Vec::new();
|
||||
let mut start: u32 = 0;
|
||||
while start < total {
|
||||
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
|
||||
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
|
||||
if end < total {
|
||||
if let Some(b) = (floor.min(end)..end)
|
||||
.rev()
|
||||
.find(|&i| lines[i as usize].trim().is_empty())
|
||||
{
|
||||
end = b + 1;
|
||||
}
|
||||
}
|
||||
let text = lines[start as usize..end as usize].join("\n");
|
||||
out.push((start, end.saturating_sub(1), text));
|
||||
start = end;
|
||||
}
|
||||
if out.is_empty() {
|
||||
out.push((0, total.saturating_sub(1), code.to_string()));
|
||||
}
|
||||
out
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use kebab_core::{
|
||||
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
|
||||
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
|
||||
SourceType, TrustLevel, WorkspacePath,
|
||||
};
|
||||
use time::OffsetDateTime;
|
||||
|
||||
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
|
||||
let wp = WorkspacePath("crates/x/src/a.py".into());
|
||||
let aid = AssetId("a".repeat(64));
|
||||
let pv = ParserVersion("code-python-v1".into());
|
||||
let doc_id = id_for_doc(&wp, &aid, &pv);
|
||||
let blocks = units
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, (sym, ls, le, code))| {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: *ls,
|
||||
line_end: *le,
|
||||
symbol: Some((*sym).to_string()),
|
||||
lang: Some("python".into()),
|
||||
};
|
||||
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
|
||||
Block::Code(CodeBlock {
|
||||
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
|
||||
lang: Some("python".into()),
|
||||
code: (*code).to_string(),
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
CanonicalDocument {
|
||||
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
|
||||
lang: Lang("und".into()), blocks,
|
||||
metadata: Metadata {
|
||||
aliases: vec![], tags: vec![],
|
||||
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None, user: Default::default(),
|
||||
repo: Some("kebab".into()), git_branch: Some("main".into()),
|
||||
git_commit: Some("0".repeat(40)), code_lang: Some("python".into()),
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: pv, schema_version: 1, doc_version: 1,
|
||||
last_chunker_version: None, last_embedding_version: None,
|
||||
}
|
||||
}
|
||||
fn policy() -> ChunkPolicy {
|
||||
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
|
||||
respect_markdown_headings: false,
|
||||
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn chunker_version_is_code_python_ast_v1() {
|
||||
assert_eq!(CodePythonAstV1Chunker.chunker_version(),
|
||||
ChunkerVersion("code-python-ast-v1".into()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn one_chunk_per_unit_preserves_code_span() {
|
||||
let doc = code_doc(&[
|
||||
("parse", 1, 3, "def parse():\n pass\n # x"),
|
||||
("Foo.double", 5, 7, "def double():\n #\n pass"),
|
||||
]);
|
||||
let chunks = CodePythonAstV1Chunker.chunk(&doc, &policy()).unwrap();
|
||||
assert_eq!(chunks.len(), 2);
|
||||
for c in &chunks {
|
||||
assert_eq!(c.source_spans.len(), 1);
|
||||
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
|
||||
assert_eq!(c.heading_path, Vec::<String>::new());
|
||||
assert_eq!(c.chunker_version.0, "code-python-ast-v1");
|
||||
}
|
||||
match &chunks[0].source_spans[0] {
|
||||
SourceSpan::Code { symbol, line_start, line_end, .. } => {
|
||||
assert_eq!(symbol.as_deref(), Some("parse"));
|
||||
assert_eq!((*line_start, *line_end), (1, 3));
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn oversize_unit_splits_into_parts_with_unique_ids() {
|
||||
let body = (0..500).map(|i| format!(" x{i} = {i}")).collect::<Vec<_>>().join("\n");
|
||||
let code = format!("def big():\n{body}\n");
|
||||
let doc = code_doc(&[("big", 1, 502, &code)]);
|
||||
let chunks = CodePythonAstV1Chunker.chunk(&doc, &policy()).unwrap();
|
||||
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
|
||||
for c in &chunks {
|
||||
match &c.source_spans[0] {
|
||||
SourceSpan::Code { symbol, .. } => {
|
||||
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
|
||||
"part-numbered symbol, got {symbol:?}");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
}
|
||||
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
|
||||
let n = ids.len(); ids.sort(); ids.dedup();
|
||||
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn non_code_doc_errors() {
|
||||
use kebab_core::TextBlock;
|
||||
let mut doc = code_doc(&[("parse", 1, 1, "def parse(): pass")]);
|
||||
doc.blocks = vec![Block::Paragraph(TextBlock {
|
||||
common: CommonBlock {
|
||||
block_id: kebab_core::BlockId("b".into()),
|
||||
heading_path: vec![],
|
||||
source_span: SourceSpan::Line { start: 1, end: 1 },
|
||||
},
|
||||
text: "x".into(), inlines: vec![],
|
||||
})];
|
||||
let err = CodePythonAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
|
||||
assert!(err.to_string().contains("CodePythonAstV1Chunker"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deterministic_chunk_ids_1000() {
|
||||
let doc = code_doc(&[("parse", 1, 2, "def parse(): pass\n")]);
|
||||
let base: Vec<String> = CodePythonAstV1Chunker.chunk(&doc, &policy())
|
||||
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
|
||||
for _ in 0..1000 {
|
||||
let again: Vec<String> = CodePythonAstV1Chunker.chunk(&doc, &policy())
|
||||
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
|
||||
assert_eq!(again, base);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn policy_hash_matches_md_heading_v1() {
|
||||
let p = policy();
|
||||
assert_eq!(CodePythonAstV1Chunker.policy_hash(&p),
|
||||
crate::MdHeadingV1Chunker.policy_hash(&p));
|
||||
}
|
||||
}
|
||||
322
crates/kebab-chunk/src/code_rust_ast_v1.rs
Normal file
322
crates/kebab-chunk/src/code_rust_ast_v1.rs
Normal file
@@ -0,0 +1,322 @@
|
||||
//! `code-rust-ast-v1` — maps a tree-sitter-derived Rust AST
|
||||
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
|
||||
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
|
||||
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
|
||||
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
|
||||
//!
|
||||
//! tree-sitter is intentionally NOT a dependency here: AST work is
|
||||
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
|
||||
//! consumes the `CanonicalDocument`.
|
||||
//!
|
||||
//! `AST_CHUNK_MAX_LINES` is a constant matching
|
||||
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
|
||||
//! config threading needs a chunker registry (P+); same deviation
|
||||
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
|
||||
//! (`tasks/HOTFIXES.md`).
|
||||
|
||||
use kebab_core::{
|
||||
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
|
||||
SourceSpan, id_for_chunk,
|
||||
};
|
||||
|
||||
const VERSION_LABEL: &str = "code-rust-ast-v1";
|
||||
const BYTES_PER_TOKEN: usize = 3;
|
||||
const POLICY_HASH_HEX_LEN: usize = 16;
|
||||
const AST_CHUNK_MAX_LINES: u32 = 200;
|
||||
|
||||
#[derive(Clone, Copy, Debug, Default)]
|
||||
pub struct CodeRustAstV1Chunker;
|
||||
|
||||
impl Chunker for CodeRustAstV1Chunker {
|
||||
fn chunker_version(&self) -> ChunkerVersion {
|
||||
ChunkerVersion(VERSION_LABEL.to_string())
|
||||
}
|
||||
|
||||
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
|
||||
let bytes = serde_json_canonicalizer::to_vec(policy)
|
||||
.expect("canonical JSON serialization of ChunkPolicy must not fail");
|
||||
let hex = blake3::hash(&bytes).to_hex().to_string();
|
||||
hex[..POLICY_HASH_HEX_LEN].to_string()
|
||||
}
|
||||
|
||||
fn chunk(
|
||||
&self,
|
||||
doc: &CanonicalDocument,
|
||||
policy: &ChunkPolicy,
|
||||
) -> anyhow::Result<Vec<Chunk>> {
|
||||
for b in &doc.blocks {
|
||||
let c = match b {
|
||||
Block::Code(c) => c,
|
||||
_ => anyhow::bail!(
|
||||
"CodeRustAstV1Chunker only handles code docs (got non-Code block)"
|
||||
),
|
||||
};
|
||||
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
|
||||
anyhow::bail!(
|
||||
"CodeRustAstV1Chunker only handles code docs (got non-Code source_span)"
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
let base_policy_hash = self.policy_hash(policy);
|
||||
let chunker_version = self.chunker_version();
|
||||
let mut out: Vec<Chunk> = Vec::new();
|
||||
|
||||
for b in &doc.blocks {
|
||||
let cb = match b {
|
||||
Block::Code(c) => c,
|
||||
_ => unreachable!("validated above"),
|
||||
};
|
||||
let (ls, le, symbol, lang) = match &cb.common.source_span {
|
||||
SourceSpan::Code { line_start, line_end, symbol, lang } => {
|
||||
(*line_start, *line_end, symbol.clone(), lang.clone())
|
||||
}
|
||||
_ => unreachable!("validated above"),
|
||||
};
|
||||
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
|
||||
let span_lines = le.saturating_sub(ls) + 1;
|
||||
|
||||
if span_lines <= AST_CHUNK_MAX_LINES {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: ls,
|
||||
line_end: le,
|
||||
symbol: symbol.clone(),
|
||||
lang: lang.clone(),
|
||||
};
|
||||
out.push(make_chunk(
|
||||
doc, &chunker_version, &block_ids, &base_policy_hash,
|
||||
None, span, cb.code.clone(),
|
||||
));
|
||||
} else {
|
||||
let parts = split_oversize(&cb.code);
|
||||
let n = parts.len();
|
||||
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
|
||||
let part_ls = ls + off_start;
|
||||
let part_le = ls + off_end;
|
||||
let part_sym = symbol
|
||||
.as_ref()
|
||||
.map(|s| format!("{s} [part {}/{n}]", i + 1));
|
||||
let span = SourceSpan::Code {
|
||||
line_start: part_ls,
|
||||
line_end: part_le,
|
||||
symbol: part_sym,
|
||||
lang: lang.clone(),
|
||||
};
|
||||
out.push(make_chunk(
|
||||
doc, &chunker_version, &block_ids, &base_policy_hash,
|
||||
Some(part_ls), span, text,
|
||||
));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-chunk",
|
||||
doc_id = %doc.doc_id,
|
||||
chunks = out.len(),
|
||||
"code-rust-ast-v1 chunked",
|
||||
);
|
||||
Ok(out)
|
||||
}
|
||||
}
|
||||
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
fn make_chunk(
|
||||
doc: &CanonicalDocument,
|
||||
chunker_version: &ChunkerVersion,
|
||||
block_ids: &[BlockId],
|
||||
base_policy_hash: &str,
|
||||
split_key: Option<u32>,
|
||||
span: SourceSpan,
|
||||
text: String,
|
||||
) -> Chunk {
|
||||
let id_hash = match split_key {
|
||||
Some(k) => format!("{base_policy_hash}#L{k}"),
|
||||
None => base_policy_hash.to_string(),
|
||||
};
|
||||
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
|
||||
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
|
||||
Chunk {
|
||||
chunk_id,
|
||||
doc_id: DocumentId(doc.doc_id.0.clone()),
|
||||
block_ids: block_ids.to_vec(),
|
||||
text,
|
||||
heading_path: Vec::new(),
|
||||
source_spans: vec![span],
|
||||
token_estimate,
|
||||
chunker_version: chunker_version.clone(),
|
||||
policy_hash: base_policy_hash.to_string(),
|
||||
}
|
||||
}
|
||||
|
||||
/// Split an oversize unit at blank-line paragraph boundaries, greedily
|
||||
/// gluing paragraphs until ~`AST_CHUNK_MAX_LINES` lines accumulate.
|
||||
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
|
||||
/// 0-based within the unit (caller adds the unit's absolute `line_start`).
|
||||
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
|
||||
let lines: Vec<&str> = code.split('\n').collect();
|
||||
let total = lines.len() as u32;
|
||||
let mut out: Vec<(u32, u32, String)> = Vec::new();
|
||||
let mut start: u32 = 0;
|
||||
while start < total {
|
||||
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
|
||||
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
|
||||
if end < total {
|
||||
if let Some(b) = (floor.min(end)..end)
|
||||
.rev()
|
||||
.find(|&i| lines[i as usize].trim().is_empty())
|
||||
{
|
||||
end = b + 1;
|
||||
}
|
||||
}
|
||||
let text = lines[start as usize..end as usize].join("\n");
|
||||
out.push((start, end.saturating_sub(1), text));
|
||||
start = end;
|
||||
}
|
||||
if out.is_empty() {
|
||||
out.push((0, total.saturating_sub(1), code.to_string()));
|
||||
}
|
||||
out
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use kebab_core::{
|
||||
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
|
||||
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
|
||||
SourceType, TrustLevel, WorkspacePath,
|
||||
};
|
||||
use time::OffsetDateTime;
|
||||
|
||||
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
|
||||
let wp = WorkspacePath("crates/x/src/a.rs".into());
|
||||
let aid = AssetId("a".repeat(64));
|
||||
let pv = ParserVersion("code-rust-v1".into());
|
||||
let doc_id = id_for_doc(&wp, &aid, &pv);
|
||||
let blocks = units
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, (sym, ls, le, code))| {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: *ls,
|
||||
line_end: *le,
|
||||
symbol: Some((*sym).to_string()),
|
||||
lang: Some("rust".into()),
|
||||
};
|
||||
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
|
||||
Block::Code(CodeBlock {
|
||||
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
|
||||
lang: Some("rust".into()),
|
||||
code: (*code).to_string(),
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
CanonicalDocument {
|
||||
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
|
||||
lang: Lang("und".into()), blocks,
|
||||
metadata: Metadata {
|
||||
aliases: vec![], tags: vec![],
|
||||
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None, user: Default::default(),
|
||||
repo: Some("kebab".into()), git_branch: Some("main".into()),
|
||||
git_commit: Some("0".repeat(40)), code_lang: Some("rust".into()),
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: pv, schema_version: 1, doc_version: 1,
|
||||
last_chunker_version: None, last_embedding_version: None,
|
||||
}
|
||||
}
|
||||
fn policy() -> ChunkPolicy {
|
||||
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
|
||||
respect_markdown_headings: false,
|
||||
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn chunker_version_is_code_rust_ast_v1() {
|
||||
assert_eq!(CodeRustAstV1Chunker.chunker_version(),
|
||||
ChunkerVersion("code-rust-ast-v1".into()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn one_chunk_per_unit_preserves_code_span() {
|
||||
let doc = code_doc(&[
|
||||
("parse", 1, 3, "pub fn parse() {}\n// x\n}"),
|
||||
("Foo::double", 5, 7, "fn double() {}\n//\n}"),
|
||||
]);
|
||||
let chunks = CodeRustAstV1Chunker.chunk(&doc, &policy()).unwrap();
|
||||
assert_eq!(chunks.len(), 2);
|
||||
for c in &chunks {
|
||||
assert_eq!(c.source_spans.len(), 1);
|
||||
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
|
||||
assert_eq!(c.heading_path, Vec::<String>::new());
|
||||
assert_eq!(c.chunker_version.0, "code-rust-ast-v1");
|
||||
}
|
||||
match &chunks[0].source_spans[0] {
|
||||
SourceSpan::Code { symbol, line_start, line_end, .. } => {
|
||||
assert_eq!(symbol.as_deref(), Some("parse"));
|
||||
assert_eq!((*line_start, *line_end), (1, 3));
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn oversize_unit_splits_into_parts_with_unique_ids() {
|
||||
let body = (0..500).map(|i| format!(" let x{i} = {i};")).collect::<Vec<_>>().join("\n");
|
||||
let code = format!("pub fn big() {{\n{body}\n}}");
|
||||
let doc = code_doc(&[("big", 1, 502, &code)]);
|
||||
let chunks = CodeRustAstV1Chunker.chunk(&doc, &policy()).unwrap();
|
||||
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
|
||||
for c in &chunks {
|
||||
match &c.source_spans[0] {
|
||||
SourceSpan::Code { symbol, .. } => {
|
||||
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
|
||||
"part-numbered symbol, got {symbol:?}");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
}
|
||||
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
|
||||
let n = ids.len(); ids.sort(); ids.dedup();
|
||||
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn non_code_doc_errors() {
|
||||
use kebab_core::TextBlock;
|
||||
let mut doc = code_doc(&[("parse", 1, 1, "fn parse(){}")]);
|
||||
doc.blocks = vec![Block::Paragraph(TextBlock {
|
||||
common: CommonBlock {
|
||||
block_id: kebab_core::BlockId("b".into()),
|
||||
heading_path: vec![],
|
||||
source_span: SourceSpan::Line { start: 1, end: 1 },
|
||||
},
|
||||
text: "x".into(), inlines: vec![],
|
||||
})];
|
||||
let err = CodeRustAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
|
||||
assert!(err.to_string().contains("CodeRustAstV1Chunker"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deterministic_chunk_ids_1000() {
|
||||
let doc = code_doc(&[("parse", 1, 2, "fn parse(){}\n}")]);
|
||||
let base: Vec<String> = CodeRustAstV1Chunker.chunk(&doc, &policy())
|
||||
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
|
||||
for _ in 0..1000 {
|
||||
let again: Vec<String> = CodeRustAstV1Chunker.chunk(&doc, &policy())
|
||||
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
|
||||
assert_eq!(again, base);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn policy_hash_matches_md_heading_v1() {
|
||||
let p = policy();
|
||||
assert_eq!(CodeRustAstV1Chunker.policy_hash(&p),
|
||||
crate::MdHeadingV1Chunker.policy_hash(&p));
|
||||
}
|
||||
}
|
||||
322
crates/kebab-chunk/src/code_ts_ast_v1.rs
Normal file
322
crates/kebab-chunk/src/code_ts_ast_v1.rs
Normal file
@@ -0,0 +1,322 @@
|
||||
//! `code-ts-ast-v1` — maps a tree-sitter-derived TypeScript AST
|
||||
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
|
||||
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
|
||||
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
|
||||
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
|
||||
//!
|
||||
//! tree-sitter is intentionally NOT a dependency here: AST work is
|
||||
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
|
||||
//! consumes the `CanonicalDocument`.
|
||||
//!
|
||||
//! `AST_CHUNK_MAX_LINES` is a constant matching
|
||||
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
|
||||
//! config threading needs a chunker registry (P+); same deviation
|
||||
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
|
||||
//! (`tasks/HOTFIXES.md`).
|
||||
|
||||
use kebab_core::{
|
||||
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
|
||||
SourceSpan, id_for_chunk,
|
||||
};
|
||||
|
||||
const VERSION_LABEL: &str = "code-ts-ast-v1";
|
||||
const BYTES_PER_TOKEN: usize = 3;
|
||||
const POLICY_HASH_HEX_LEN: usize = 16;
|
||||
const AST_CHUNK_MAX_LINES: u32 = 200;
|
||||
|
||||
#[derive(Clone, Copy, Debug, Default)]
|
||||
pub struct CodeTsAstV1Chunker;
|
||||
|
||||
impl Chunker for CodeTsAstV1Chunker {
|
||||
fn chunker_version(&self) -> ChunkerVersion {
|
||||
ChunkerVersion(VERSION_LABEL.to_string())
|
||||
}
|
||||
|
||||
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
|
||||
let bytes = serde_json_canonicalizer::to_vec(policy)
|
||||
.expect("canonical JSON serialization of ChunkPolicy must not fail");
|
||||
let hex = blake3::hash(&bytes).to_hex().to_string();
|
||||
hex[..POLICY_HASH_HEX_LEN].to_string()
|
||||
}
|
||||
|
||||
fn chunk(
|
||||
&self,
|
||||
doc: &CanonicalDocument,
|
||||
policy: &ChunkPolicy,
|
||||
) -> anyhow::Result<Vec<Chunk>> {
|
||||
for b in &doc.blocks {
|
||||
let c = match b {
|
||||
Block::Code(c) => c,
|
||||
_ => anyhow::bail!(
|
||||
"CodeTsAstV1Chunker only handles code docs (got non-Code block)"
|
||||
),
|
||||
};
|
||||
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
|
||||
anyhow::bail!(
|
||||
"CodeTsAstV1Chunker only handles code docs (got non-Code source_span)"
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
let base_policy_hash = self.policy_hash(policy);
|
||||
let chunker_version = self.chunker_version();
|
||||
let mut out: Vec<Chunk> = Vec::new();
|
||||
|
||||
for b in &doc.blocks {
|
||||
let cb = match b {
|
||||
Block::Code(c) => c,
|
||||
_ => unreachable!("validated above"),
|
||||
};
|
||||
let (ls, le, symbol, lang) = match &cb.common.source_span {
|
||||
SourceSpan::Code { line_start, line_end, symbol, lang } => {
|
||||
(*line_start, *line_end, symbol.clone(), lang.clone())
|
||||
}
|
||||
_ => unreachable!("validated above"),
|
||||
};
|
||||
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
|
||||
let span_lines = le.saturating_sub(ls) + 1;
|
||||
|
||||
if span_lines <= AST_CHUNK_MAX_LINES {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: ls,
|
||||
line_end: le,
|
||||
symbol: symbol.clone(),
|
||||
lang: lang.clone(),
|
||||
};
|
||||
out.push(make_chunk(
|
||||
doc, &chunker_version, &block_ids, &base_policy_hash,
|
||||
None, span, cb.code.clone(),
|
||||
));
|
||||
} else {
|
||||
let parts = split_oversize(&cb.code);
|
||||
let n = parts.len();
|
||||
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
|
||||
let part_ls = ls + off_start;
|
||||
let part_le = ls + off_end;
|
||||
let part_sym = symbol
|
||||
.as_ref()
|
||||
.map(|s| format!("{s} [part {}/{n}]", i + 1));
|
||||
let span = SourceSpan::Code {
|
||||
line_start: part_ls,
|
||||
line_end: part_le,
|
||||
symbol: part_sym,
|
||||
lang: lang.clone(),
|
||||
};
|
||||
out.push(make_chunk(
|
||||
doc, &chunker_version, &block_ids, &base_policy_hash,
|
||||
Some(part_ls), span, text,
|
||||
));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-chunk",
|
||||
doc_id = %doc.doc_id,
|
||||
chunks = out.len(),
|
||||
"code-ts-ast-v1 chunked",
|
||||
);
|
||||
Ok(out)
|
||||
}
|
||||
}
|
||||
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
fn make_chunk(
|
||||
doc: &CanonicalDocument,
|
||||
chunker_version: &ChunkerVersion,
|
||||
block_ids: &[BlockId],
|
||||
base_policy_hash: &str,
|
||||
split_key: Option<u32>,
|
||||
span: SourceSpan,
|
||||
text: String,
|
||||
) -> Chunk {
|
||||
let id_hash = match split_key {
|
||||
Some(k) => format!("{base_policy_hash}#L{k}"),
|
||||
None => base_policy_hash.to_string(),
|
||||
};
|
||||
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
|
||||
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
|
||||
Chunk {
|
||||
chunk_id,
|
||||
doc_id: DocumentId(doc.doc_id.0.clone()),
|
||||
block_ids: block_ids.to_vec(),
|
||||
text,
|
||||
heading_path: Vec::new(),
|
||||
source_spans: vec![span],
|
||||
token_estimate,
|
||||
chunker_version: chunker_version.clone(),
|
||||
policy_hash: base_policy_hash.to_string(),
|
||||
}
|
||||
}
|
||||
|
||||
/// Split an oversize unit at blank-line paragraph boundaries, greedily
|
||||
/// gluing paragraphs until ~`AST_CHUNK_MAX_LINES` lines accumulate.
|
||||
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
|
||||
/// 0-based within the unit (caller adds the unit's absolute `line_start`).
|
||||
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
|
||||
let lines: Vec<&str> = code.split('\n').collect();
|
||||
let total = lines.len() as u32;
|
||||
let mut out: Vec<(u32, u32, String)> = Vec::new();
|
||||
let mut start: u32 = 0;
|
||||
while start < total {
|
||||
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
|
||||
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
|
||||
if end < total {
|
||||
if let Some(b) = (floor.min(end)..end)
|
||||
.rev()
|
||||
.find(|&i| lines[i as usize].trim().is_empty())
|
||||
{
|
||||
end = b + 1;
|
||||
}
|
||||
}
|
||||
let text = lines[start as usize..end as usize].join("\n");
|
||||
out.push((start, end.saturating_sub(1), text));
|
||||
start = end;
|
||||
}
|
||||
if out.is_empty() {
|
||||
out.push((0, total.saturating_sub(1), code.to_string()));
|
||||
}
|
||||
out
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use kebab_core::{
|
||||
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
|
||||
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
|
||||
SourceType, TrustLevel, WorkspacePath,
|
||||
};
|
||||
use time::OffsetDateTime;
|
||||
|
||||
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
|
||||
let wp = WorkspacePath("crates/x/src/a.ts".into());
|
||||
let aid = AssetId("a".repeat(64));
|
||||
let pv = ParserVersion("code-ts-v1".into());
|
||||
let doc_id = id_for_doc(&wp, &aid, &pv);
|
||||
let blocks = units
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, (sym, ls, le, code))| {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: *ls,
|
||||
line_end: *le,
|
||||
symbol: Some((*sym).to_string()),
|
||||
lang: Some("typescript".into()),
|
||||
};
|
||||
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
|
||||
Block::Code(CodeBlock {
|
||||
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
|
||||
lang: Some("typescript".into()),
|
||||
code: (*code).to_string(),
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
CanonicalDocument {
|
||||
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
|
||||
lang: Lang("und".into()), blocks,
|
||||
metadata: Metadata {
|
||||
aliases: vec![], tags: vec![],
|
||||
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None, user: Default::default(),
|
||||
repo: Some("kebab".into()), git_branch: Some("main".into()),
|
||||
git_commit: Some("0".repeat(40)), code_lang: Some("typescript".into()),
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: pv, schema_version: 1, doc_version: 1,
|
||||
last_chunker_version: None, last_embedding_version: None,
|
||||
}
|
||||
}
|
||||
fn policy() -> ChunkPolicy {
|
||||
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
|
||||
respect_markdown_headings: false,
|
||||
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn chunker_version_is_code_ts_ast_v1() {
|
||||
assert_eq!(CodeTsAstV1Chunker.chunker_version(),
|
||||
ChunkerVersion("code-ts-ast-v1".into()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn one_chunk_per_unit_preserves_code_span() {
|
||||
let doc = code_doc(&[
|
||||
("parse", 1, 3, "function parse(): void {\n // x\n}"),
|
||||
("Foo.double", 5, 7, "function double(): number {\n //\n return 0;\n}"),
|
||||
]);
|
||||
let chunks = CodeTsAstV1Chunker.chunk(&doc, &policy()).unwrap();
|
||||
assert_eq!(chunks.len(), 2);
|
||||
for c in &chunks {
|
||||
assert_eq!(c.source_spans.len(), 1);
|
||||
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
|
||||
assert_eq!(c.heading_path, Vec::<String>::new());
|
||||
assert_eq!(c.chunker_version.0, "code-ts-ast-v1");
|
||||
}
|
||||
match &chunks[0].source_spans[0] {
|
||||
SourceSpan::Code { symbol, line_start, line_end, .. } => {
|
||||
assert_eq!(symbol.as_deref(), Some("parse"));
|
||||
assert_eq!((*line_start, *line_end), (1, 3));
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn oversize_unit_splits_into_parts_with_unique_ids() {
|
||||
let body = (0..500).map(|i| format!(" const x{i} = {i};")).collect::<Vec<_>>().join("\n");
|
||||
let code = format!("function big(): void {{\n{body}\n}}");
|
||||
let doc = code_doc(&[("big", 1, 502, &code)]);
|
||||
let chunks = CodeTsAstV1Chunker.chunk(&doc, &policy()).unwrap();
|
||||
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
|
||||
for c in &chunks {
|
||||
match &c.source_spans[0] {
|
||||
SourceSpan::Code { symbol, .. } => {
|
||||
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
|
||||
"part-numbered symbol, got {symbol:?}");
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
}
|
||||
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
|
||||
let n = ids.len(); ids.sort(); ids.dedup();
|
||||
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn non_code_doc_errors() {
|
||||
use kebab_core::TextBlock;
|
||||
let mut doc = code_doc(&[("parse", 1, 1, "function parse(): void {}")]);
|
||||
doc.blocks = vec![Block::Paragraph(TextBlock {
|
||||
common: CommonBlock {
|
||||
block_id: kebab_core::BlockId("b".into()),
|
||||
heading_path: vec![],
|
||||
source_span: SourceSpan::Line { start: 1, end: 1 },
|
||||
},
|
||||
text: "x".into(), inlines: vec![],
|
||||
})];
|
||||
let err = CodeTsAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
|
||||
assert!(err.to_string().contains("CodeTsAstV1Chunker"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deterministic_chunk_ids_1000() {
|
||||
let doc = code_doc(&[("parse", 1, 2, "function parse(): void {}\n")]);
|
||||
let base: Vec<String> = CodeTsAstV1Chunker.chunk(&doc, &policy())
|
||||
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
|
||||
for _ in 0..1000 {
|
||||
let again: Vec<String> = CodeTsAstV1Chunker.chunk(&doc, &policy())
|
||||
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
|
||||
assert_eq!(again, base);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn policy_hash_matches_md_heading_v1() {
|
||||
let p = policy();
|
||||
assert_eq!(CodeTsAstV1Chunker.policy_hash(&p),
|
||||
crate::MdHeadingV1Chunker.policy_hash(&p));
|
||||
}
|
||||
}
|
||||
@@ -15,8 +15,16 @@
|
||||
//! embedder, the retriever, the LLM, the RAG layer, or the UI layers.
|
||||
//! It consumes `CanonicalDocument` purely through `kb-core` types.
|
||||
|
||||
mod code_js_ast_v1;
|
||||
mod code_python_ast_v1;
|
||||
mod code_rust_ast_v1;
|
||||
mod code_ts_ast_v1;
|
||||
mod md_heading_v1;
|
||||
mod pdf_page_v1;
|
||||
|
||||
pub use code_js_ast_v1::CodeJsAstV1Chunker;
|
||||
pub use code_python_ast_v1::CodePythonAstV1Chunker;
|
||||
pub use code_rust_ast_v1::CodeRustAstV1Chunker;
|
||||
pub use code_ts_ast_v1::CodeTsAstV1Chunker;
|
||||
pub use md_heading_v1::MdHeadingV1Chunker;
|
||||
pub use pdf_page_v1::PdfPageV1Chunker;
|
||||
|
||||
@@ -472,6 +472,10 @@ mod tests {
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Default::default(),
|
||||
repo: None,
|
||||
git_branch: None,
|
||||
git_commit: None,
|
||||
code_lang: None,
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: kebab_core::ParserVersion("test-parser-0".into()),
|
||||
|
||||
@@ -347,6 +347,10 @@ mod tests {
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Default::default(),
|
||||
repo: None,
|
||||
git_branch: None,
|
||||
git_commit: None,
|
||||
code_lang: None,
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version,
|
||||
@@ -512,6 +516,10 @@ mod tests {
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Default::default(),
|
||||
repo: None,
|
||||
git_branch: None,
|
||||
git_commit: None,
|
||||
code_lang: None,
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version,
|
||||
|
||||
221
crates/kebab-chunk/tests/code_js_ast_snapshot.rs
Normal file
221
crates/kebab-chunk/tests/code_js_ast_snapshot.rs
Normal file
@@ -0,0 +1,221 @@
|
||||
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
|
||||
//! representative JavaScript code `CanonicalDocument`.
|
||||
//!
|
||||
//! This is an integration test. `kebab-parse-code` is intentionally NOT
|
||||
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
|
||||
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
|
||||
//! units, which is the same pattern used in `code_rust_ast_v1.rs`'s
|
||||
//! internal `code_doc` test helper.
|
||||
//!
|
||||
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
|
||||
|
||||
use std::path::PathBuf;
|
||||
|
||||
use kebab_chunk::CodeJsAstV1Chunker;
|
||||
use kebab_core::{
|
||||
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
|
||||
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
|
||||
id_for_block, id_for_doc,
|
||||
};
|
||||
use serde_json::Value;
|
||||
use time::OffsetDateTime;
|
||||
|
||||
fn fixtures_dir() -> PathBuf {
|
||||
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
|
||||
.join("tests")
|
||||
.join("fixtures")
|
||||
}
|
||||
|
||||
fn fixed_doc() -> CanonicalDocument {
|
||||
let wp = WorkspacePath("src/bar.js".into());
|
||||
let aid = AssetId("b".repeat(64));
|
||||
// Pin parser_version so doc_id / block_ids are reproducible.
|
||||
let pv = ParserVersion("code-js-v1".into());
|
||||
let doc_id = id_for_doc(&wp, &aid, &pv);
|
||||
|
||||
// Build a >200-line function body to force split_oversize.
|
||||
let big_body: String = {
|
||||
let header = "function bigTransform(items) {\n";
|
||||
let body: String = (0..210u32)
|
||||
.map(|i| format!(" const v{i} = items[{i}] !== undefined ? items[{i}] : null;\n"))
|
||||
.collect();
|
||||
let footer = " return items;\n}";
|
||||
format!("{header}{body}{footer}")
|
||||
};
|
||||
let big_line_count = big_body.lines().count() as u32;
|
||||
let big_line_end = 48 + big_line_count - 1;
|
||||
|
||||
// Representative units:
|
||||
// 0. require/import block (lines 1–5, ≤200)
|
||||
// 1. free fn `add` (lines 7–12, ≤200)
|
||||
// 2. class `EventBus` (lines 14–20, ≤200)
|
||||
// 3. class `BaseHandler` (lines 22–30, ≤200)
|
||||
// 4. method `EventBus.emit` (lines 32–38, ≤200)
|
||||
// 5. method `EventBus.on` (lines 40–46, ≤200)
|
||||
// 6. bigTransform (>200 lines) to force split_oversize
|
||||
let raw_units: Vec<(&str, u32, u32, String)> = vec![
|
||||
(
|
||||
"requires",
|
||||
1,
|
||||
5,
|
||||
"const fs = require('fs');\nconst path = require('path');\nconst { EventEmitter } = require('events');\nconst assert = require('assert');\nconst crypto = require('crypto');".to_string(),
|
||||
),
|
||||
(
|
||||
"add",
|
||||
7,
|
||||
12,
|
||||
"export function add(a, b) {\n if (typeof a !== 'number') throw new TypeError('a');\n if (typeof b !== 'number') throw new TypeError('b');\n const result = a + b;\n assert(isFinite(result));\n return result;\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"EventBus",
|
||||
14,
|
||||
20,
|
||||
"class EventBus {\n constructor() {\n this._handlers = new Map();\n this._history = [];\n this._maxHistory = 100;\n this._seq = 0;\n }\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"BaseHandler",
|
||||
22,
|
||||
30,
|
||||
"class BaseHandler {\n handle(event) {\n throw new Error('not implemented');\n }\n batchHandle(events) {\n const results = [];\n for (const ev of events) {\n results.push(this.handle(ev));\n }\n return results;\n }\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"EventBus.emit",
|
||||
32,
|
||||
38,
|
||||
"class EventBus {\n emit(name, payload) {\n const handlers = this._handlers.get(name) ?? [];\n for (const h of handlers) {\n h(payload);\n }\n return this;\n }\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"EventBus.on",
|
||||
40,
|
||||
46,
|
||||
"class EventBus {\n on(name, handler) {\n if (!this._handlers.has(name)) {\n this._handlers.set(name, []);\n }\n this._handlers.get(name).push(handler);\n return this;\n }\n}".to_string(),
|
||||
),
|
||||
("bigTransform", 48, big_line_end, big_body),
|
||||
];
|
||||
|
||||
let blocks: Vec<Block> = raw_units
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, (sym, ls, le, code))| {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: *ls,
|
||||
line_end: *le,
|
||||
symbol: Some((*sym).to_string()),
|
||||
lang: Some("javascript".into()),
|
||||
};
|
||||
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
|
||||
Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id: bid,
|
||||
heading_path: vec![],
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some("javascript".into()),
|
||||
code: code.clone(),
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
|
||||
CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: aid,
|
||||
workspace_path: wp,
|
||||
title: "bar.js".into(),
|
||||
lang: Lang("und".into()),
|
||||
blocks,
|
||||
metadata: Metadata {
|
||||
aliases: vec![],
|
||||
tags: vec![],
|
||||
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Default::default(),
|
||||
repo: Some("kebab".into()),
|
||||
git_branch: Some("main".into()),
|
||||
git_commit: Some("0".repeat(40)),
|
||||
code_lang: Some("javascript".into()),
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: pv,
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
}
|
||||
}
|
||||
|
||||
fn fixed_policy() -> ChunkPolicy {
|
||||
ChunkPolicy {
|
||||
target_tokens: 500,
|
||||
overlap_tokens: 80,
|
||||
respect_markdown_headings: false,
|
||||
chunker_version: ChunkerVersion("code-js-ast-v1".into()),
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn code_js_ast_chunks_snapshot() {
|
||||
let doc = fixed_doc();
|
||||
let policy = fixed_policy();
|
||||
|
||||
let chunks = CodeJsAstV1Chunker.chunk(&doc, &policy).expect("chunk");
|
||||
let actual = serde_json::to_value(&chunks).unwrap();
|
||||
|
||||
let dir = fixtures_dir();
|
||||
let baseline_path = dir.join("code-sample.js.chunks.snapshot.json");
|
||||
let baseline_text = match std::fs::read_to_string(&baseline_path) {
|
||||
Ok(s) => s,
|
||||
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
|
||||
std::fs::create_dir_all(&dir).unwrap();
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
|
||||
return;
|
||||
}
|
||||
Err(e) => panic!(
|
||||
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
|
||||
baseline_path.display()
|
||||
),
|
||||
};
|
||||
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
|
||||
|
||||
if actual != expected {
|
||||
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
|
||||
eprintln!("updated baseline {}", baseline_path.display());
|
||||
return;
|
||||
}
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
panic!(
|
||||
"code-js-ast-v1 chunks snapshot drift\n\
|
||||
--- expected ({}) ---\n{baseline_text}\n\
|
||||
--- actual ---\n{pretty}\n\
|
||||
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
|
||||
baseline_path.display()
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
/// Determinism cross-check: re-running the same pipeline yields the same
|
||||
/// chunk_ids byte-for-byte.
|
||||
#[test]
|
||||
fn code_js_ast_chunks_are_deterministic() {
|
||||
let policy = fixed_policy();
|
||||
let baseline: Vec<String> = CodeJsAstV1Chunker
|
||||
.chunk(&fixed_doc(), &policy)
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.map(|c| c.chunk_id.0)
|
||||
.collect();
|
||||
for _ in 0..5 {
|
||||
let again: Vec<String> = CodeJsAstV1Chunker
|
||||
.chunk(&fixed_doc(), &policy)
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.map(|c| c.chunk_id.0)
|
||||
.collect();
|
||||
assert_eq!(again, baseline);
|
||||
}
|
||||
}
|
||||
221
crates/kebab-chunk/tests/code_python_ast_snapshot.rs
Normal file
221
crates/kebab-chunk/tests/code_python_ast_snapshot.rs
Normal file
@@ -0,0 +1,221 @@
|
||||
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
|
||||
//! representative Python code `CanonicalDocument`.
|
||||
//!
|
||||
//! This is an integration test. `kebab-parse-code` is intentionally NOT
|
||||
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
|
||||
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
|
||||
//! units, which is the same pattern used in `code_rust_ast_v1.rs`'s
|
||||
//! internal `code_doc` test helper.
|
||||
//!
|
||||
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
|
||||
|
||||
use std::path::PathBuf;
|
||||
|
||||
use kebab_chunk::CodePythonAstV1Chunker;
|
||||
use kebab_core::{
|
||||
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
|
||||
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
|
||||
id_for_block, id_for_doc,
|
||||
};
|
||||
use serde_json::Value;
|
||||
use time::OffsetDateTime;
|
||||
|
||||
fn fixtures_dir() -> PathBuf {
|
||||
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
|
||||
.join("tests")
|
||||
.join("fixtures")
|
||||
}
|
||||
|
||||
fn fixed_doc() -> CanonicalDocument {
|
||||
let wp = WorkspacePath("kebab_eval/metrics.py".into());
|
||||
let aid = AssetId("b".repeat(64));
|
||||
// Pin parser_version so doc_id / block_ids are reproducible.
|
||||
let pv = ParserVersion("code-python-v1".into());
|
||||
let doc_id = id_for_doc(&wp, &aid, &pv);
|
||||
|
||||
// Build a >200-line function body to force split_oversize.
|
||||
let big_body: String = {
|
||||
let header = "def big_compute(data):\n";
|
||||
let body: String = (0..210u32)
|
||||
.map(|i| format!(" v{i} = data[{i}] if {i} < len(data) else 0\n"))
|
||||
.collect();
|
||||
let footer = " return sum(data)";
|
||||
format!("{header}{body}{footer}")
|
||||
};
|
||||
let big_line_count = big_body.lines().count() as u32;
|
||||
let big_line_end = 48 + big_line_count - 1;
|
||||
|
||||
// Representative units:
|
||||
// 0. import block (lines 1–5, ≤200)
|
||||
// 1. free fn `compute_mrr` (lines 7–12, ≤200)
|
||||
// 2. class `MetricsCollector` (lines 14–20, ≤200)
|
||||
// 3. class `BaseEvaluator` (lines 22–30, ≤200)
|
||||
// 4. method `run` (lines 32–38, ≤200)
|
||||
// 5. method `report` (lines 40–46, ≤200)
|
||||
// 6. big_compute (>200 lines) to force split_oversize
|
||||
let raw_units: Vec<(&str, u32, u32, String)> = vec![
|
||||
(
|
||||
"imports",
|
||||
1,
|
||||
5,
|
||||
"import os\nimport sys\nfrom typing import List\nfrom pathlib import Path\nfrom collections import defaultdict".to_string(),
|
||||
),
|
||||
(
|
||||
"compute_mrr",
|
||||
7,
|
||||
12,
|
||||
"def compute_mrr(scores):\n if not scores:\n return 0.0\n return sum(\n 1.0 / r for r in scores\n ) / len(scores)".to_string(),
|
||||
),
|
||||
(
|
||||
"MetricsCollector",
|
||||
14,
|
||||
20,
|
||||
"class MetricsCollector:\n def __init__(self):\n self.scores = []\n self.labels = []\n self.counts = defaultdict(int)\n self.totals = defaultdict(float)\n self.tags = []".to_string(),
|
||||
),
|
||||
(
|
||||
"BaseEvaluator",
|
||||
22,
|
||||
30,
|
||||
"class BaseEvaluator:\n def evaluate(self, data):\n raise NotImplementedError\n def batch_evaluate(self, items):\n results = []\n for item in items:\n results.append(self.evaluate(item))\n return results\n def name(self):\n return type(self).__name__".to_string(),
|
||||
),
|
||||
(
|
||||
"MetricsCollector.run",
|
||||
32,
|
||||
38,
|
||||
"class MetricsCollector:\n def run(self, inputs):\n for inp in inputs:\n score = self._score(inp)\n self.scores.append(\n score\n )".to_string(),
|
||||
),
|
||||
(
|
||||
"MetricsCollector.report",
|
||||
40,
|
||||
46,
|
||||
"class MetricsCollector:\n def report(self):\n return {\n 'mean': sum(self.scores) / max(len(self.scores), 1),\n 'count': len(self.scores),\n 'tags': self.tags,\n }".to_string(),
|
||||
),
|
||||
("big_compute", 48, big_line_end, big_body),
|
||||
];
|
||||
|
||||
let blocks: Vec<Block> = raw_units
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, (sym, ls, le, code))| {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: *ls,
|
||||
line_end: *le,
|
||||
symbol: Some((*sym).to_string()),
|
||||
lang: Some("python".into()),
|
||||
};
|
||||
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
|
||||
Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id: bid,
|
||||
heading_path: vec![],
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some("python".into()),
|
||||
code: code.clone(),
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
|
||||
CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: aid,
|
||||
workspace_path: wp,
|
||||
title: "metrics.py".into(),
|
||||
lang: Lang("und".into()),
|
||||
blocks,
|
||||
metadata: Metadata {
|
||||
aliases: vec![],
|
||||
tags: vec![],
|
||||
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Default::default(),
|
||||
repo: Some("kebab".into()),
|
||||
git_branch: Some("main".into()),
|
||||
git_commit: Some("0".repeat(40)),
|
||||
code_lang: Some("python".into()),
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: pv,
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
}
|
||||
}
|
||||
|
||||
fn fixed_policy() -> ChunkPolicy {
|
||||
ChunkPolicy {
|
||||
target_tokens: 500,
|
||||
overlap_tokens: 80,
|
||||
respect_markdown_headings: false,
|
||||
chunker_version: ChunkerVersion("code-python-ast-v1".into()),
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn code_python_ast_chunks_snapshot() {
|
||||
let doc = fixed_doc();
|
||||
let policy = fixed_policy();
|
||||
|
||||
let chunks = CodePythonAstV1Chunker.chunk(&doc, &policy).expect("chunk");
|
||||
let actual = serde_json::to_value(&chunks).unwrap();
|
||||
|
||||
let dir = fixtures_dir();
|
||||
let baseline_path = dir.join("code-sample.py.chunks.snapshot.json");
|
||||
let baseline_text = match std::fs::read_to_string(&baseline_path) {
|
||||
Ok(s) => s,
|
||||
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
|
||||
std::fs::create_dir_all(&dir).unwrap();
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
|
||||
return;
|
||||
}
|
||||
Err(e) => panic!(
|
||||
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
|
||||
baseline_path.display()
|
||||
),
|
||||
};
|
||||
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
|
||||
|
||||
if actual != expected {
|
||||
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
|
||||
eprintln!("updated baseline {}", baseline_path.display());
|
||||
return;
|
||||
}
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
panic!(
|
||||
"code-python-ast-v1 chunks snapshot drift\n\
|
||||
--- expected ({}) ---\n{baseline_text}\n\
|
||||
--- actual ---\n{pretty}\n\
|
||||
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
|
||||
baseline_path.display()
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
/// Determinism cross-check: re-running the same pipeline yields the same
|
||||
/// chunk_ids byte-for-byte.
|
||||
#[test]
|
||||
fn code_python_ast_chunks_are_deterministic() {
|
||||
let policy = fixed_policy();
|
||||
let baseline: Vec<String> = CodePythonAstV1Chunker
|
||||
.chunk(&fixed_doc(), &policy)
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.map(|c| c.chunk_id.0)
|
||||
.collect();
|
||||
for _ in 0..5 {
|
||||
let again: Vec<String> = CodePythonAstV1Chunker
|
||||
.chunk(&fixed_doc(), &policy)
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.map(|c| c.chunk_id.0)
|
||||
.collect();
|
||||
assert_eq!(again, baseline);
|
||||
}
|
||||
}
|
||||
221
crates/kebab-chunk/tests/code_rust_ast_snapshot.rs
Normal file
221
crates/kebab-chunk/tests/code_rust_ast_snapshot.rs
Normal file
@@ -0,0 +1,221 @@
|
||||
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
|
||||
//! representative Rust code `CanonicalDocument`.
|
||||
//!
|
||||
//! This is an integration test. `kebab-parse-code` is intentionally NOT
|
||||
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
|
||||
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
|
||||
//! units, which is the same pattern used in `code_rust_ast_v1.rs`'s
|
||||
//! internal `code_doc` test helper.
|
||||
//!
|
||||
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
|
||||
|
||||
use std::path::PathBuf;
|
||||
|
||||
use kebab_chunk::CodeRustAstV1Chunker;
|
||||
use kebab_core::{
|
||||
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
|
||||
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
|
||||
id_for_block, id_for_doc,
|
||||
};
|
||||
use serde_json::Value;
|
||||
use time::OffsetDateTime;
|
||||
|
||||
fn fixtures_dir() -> PathBuf {
|
||||
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
|
||||
.join("tests")
|
||||
.join("fixtures")
|
||||
}
|
||||
|
||||
fn fixed_doc() -> CanonicalDocument {
|
||||
let wp = WorkspacePath("crates/kebab-chunk/src/code_rust_ast_v1.rs".into());
|
||||
let aid = AssetId("b".repeat(64));
|
||||
// Pin parser_version so doc_id / block_ids are reproducible.
|
||||
let pv = ParserVersion("code-rust-v1".into());
|
||||
let doc_id = id_for_doc(&wp, &aid, &pv);
|
||||
|
||||
// Build a >200-line function body to force split_oversize.
|
||||
let big_body: String = {
|
||||
let header = "pub fn big_fn(input: &[u8]) -> Vec<u8> {\n";
|
||||
let body: String = (0..210u32)
|
||||
.map(|i| format!(" let v{i} = input.get({i} as usize).copied().unwrap_or(0);\n"))
|
||||
.collect();
|
||||
let footer = " vec![0u8]\n}";
|
||||
format!("{header}{body}{footer}")
|
||||
};
|
||||
let big_line_count = big_body.lines().count() as u32;
|
||||
let big_line_end = 48 + big_line_count - 1;
|
||||
|
||||
// Representative units:
|
||||
// 0. top-level use+const block (lines 1–5, ≤200)
|
||||
// 1. free fn `parse` (lines 7–12, ≤200)
|
||||
// 2. struct `Foo` (lines 14–20, ≤200)
|
||||
// 3. trait `Frobable` (lines 22–30, ≤200)
|
||||
// 4. impl Foo::double (lines 32–38, ≤200)
|
||||
// 5. impl Foo::triple (lines 40–46, ≤200)
|
||||
// 6. big_fn (>200 lines) to force split_oversize
|
||||
let raw_units: Vec<(&str, u32, u32, String)> = vec![
|
||||
(
|
||||
"use+const",
|
||||
1,
|
||||
5,
|
||||
"use std::collections::HashMap;\nuse std::fmt;\n\nconst MAX: usize = 1024;\nconst MIN: usize = 0;".to_string(),
|
||||
),
|
||||
(
|
||||
"parse",
|
||||
7,
|
||||
12,
|
||||
"pub fn parse(input: &str) -> Option<u32> {\n input\n .trim()\n .parse()\n .ok()\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"Foo",
|
||||
14,
|
||||
20,
|
||||
"pub struct Foo {\n pub name: String,\n pub value: u32,\n pub tags: Vec<String>,\n pub meta: Option<String>,\n pub count: usize,\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"Frobable",
|
||||
22,
|
||||
30,
|
||||
"pub trait Frobable {\n fn frob(&self) -> String;\n fn frob_twice(&self) -> String {\n let a = self.frob();\n let b = self.frob();\n format!(\"{a}{b}\")\n }\n fn name(&self) -> &str;\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"Foo::double",
|
||||
32,
|
||||
38,
|
||||
"impl Foo {\n pub fn double(&self) -> u32 {\n self.value\n .checked_mul(2)\n .unwrap_or(u32::MAX)\n }\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"Foo::triple",
|
||||
40,
|
||||
46,
|
||||
"impl Foo {\n pub fn triple(&self) -> u32 {\n self.value\n .checked_mul(3)\n .unwrap_or(u32::MAX)\n }\n}".to_string(),
|
||||
),
|
||||
("big_fn", 48, big_line_end, big_body),
|
||||
];
|
||||
|
||||
let blocks: Vec<Block> = raw_units
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, (sym, ls, le, code))| {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: *ls,
|
||||
line_end: *le,
|
||||
symbol: Some((*sym).to_string()),
|
||||
lang: Some("rust".into()),
|
||||
};
|
||||
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
|
||||
Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id: bid,
|
||||
heading_path: vec![],
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some("rust".into()),
|
||||
code: code.clone(),
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
|
||||
CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: aid,
|
||||
workspace_path: wp,
|
||||
title: "code_rust_ast_v1.rs".into(),
|
||||
lang: Lang("und".into()),
|
||||
blocks,
|
||||
metadata: Metadata {
|
||||
aliases: vec![],
|
||||
tags: vec![],
|
||||
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Default::default(),
|
||||
repo: Some("kebab".into()),
|
||||
git_branch: Some("main".into()),
|
||||
git_commit: Some("0".repeat(40)),
|
||||
code_lang: Some("rust".into()),
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: pv,
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
}
|
||||
}
|
||||
|
||||
fn fixed_policy() -> ChunkPolicy {
|
||||
ChunkPolicy {
|
||||
target_tokens: 500,
|
||||
overlap_tokens: 80,
|
||||
respect_markdown_headings: false,
|
||||
chunker_version: ChunkerVersion("code-rust-ast-v1".into()),
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn code_rust_ast_chunks_snapshot() {
|
||||
let doc = fixed_doc();
|
||||
let policy = fixed_policy();
|
||||
|
||||
let chunks = CodeRustAstV1Chunker.chunk(&doc, &policy).expect("chunk");
|
||||
let actual = serde_json::to_value(&chunks).unwrap();
|
||||
|
||||
let dir = fixtures_dir();
|
||||
let baseline_path = dir.join("code-sample.chunks.snapshot.json");
|
||||
let baseline_text = match std::fs::read_to_string(&baseline_path) {
|
||||
Ok(s) => s,
|
||||
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
|
||||
std::fs::create_dir_all(&dir).unwrap();
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
|
||||
return;
|
||||
}
|
||||
Err(e) => panic!(
|
||||
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
|
||||
baseline_path.display()
|
||||
),
|
||||
};
|
||||
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
|
||||
|
||||
if actual != expected {
|
||||
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
|
||||
eprintln!("updated baseline {}", baseline_path.display());
|
||||
return;
|
||||
}
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
panic!(
|
||||
"code-rust-ast-v1 chunks snapshot drift\n\
|
||||
--- expected ({}) ---\n{baseline_text}\n\
|
||||
--- actual ---\n{pretty}\n\
|
||||
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
|
||||
baseline_path.display()
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
/// Determinism cross-check: re-running the same pipeline yields the same
|
||||
/// chunk_ids byte-for-byte.
|
||||
#[test]
|
||||
fn code_rust_ast_chunks_are_deterministic() {
|
||||
let policy = fixed_policy();
|
||||
let baseline: Vec<String> = CodeRustAstV1Chunker
|
||||
.chunk(&fixed_doc(), &policy)
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.map(|c| c.chunk_id.0)
|
||||
.collect();
|
||||
for _ in 0..5 {
|
||||
let again: Vec<String> = CodeRustAstV1Chunker
|
||||
.chunk(&fixed_doc(), &policy)
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.map(|c| c.chunk_id.0)
|
||||
.collect();
|
||||
assert_eq!(again, baseline);
|
||||
}
|
||||
}
|
||||
221
crates/kebab-chunk/tests/code_ts_ast_snapshot.rs
Normal file
221
crates/kebab-chunk/tests/code_ts_ast_snapshot.rs
Normal file
@@ -0,0 +1,221 @@
|
||||
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
|
||||
//! representative TypeScript code `CanonicalDocument`.
|
||||
//!
|
||||
//! This is an integration test. `kebab-parse-code` is intentionally NOT
|
||||
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
|
||||
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
|
||||
//! units, which is the same pattern used in `code_rust_ast_v1.rs`'s
|
||||
//! internal `code_doc` test helper.
|
||||
//!
|
||||
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
|
||||
|
||||
use std::path::PathBuf;
|
||||
|
||||
use kebab_chunk::CodeTsAstV1Chunker;
|
||||
use kebab_core::{
|
||||
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
|
||||
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
|
||||
id_for_block, id_for_doc,
|
||||
};
|
||||
use serde_json::Value;
|
||||
use time::OffsetDateTime;
|
||||
|
||||
fn fixtures_dir() -> PathBuf {
|
||||
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
|
||||
.join("tests")
|
||||
.join("fixtures")
|
||||
}
|
||||
|
||||
fn fixed_doc() -> CanonicalDocument {
|
||||
let wp = WorkspacePath("src/Foo.ts".into());
|
||||
let aid = AssetId("b".repeat(64));
|
||||
// Pin parser_version so doc_id / block_ids are reproducible.
|
||||
let pv = ParserVersion("code-ts-v1".into());
|
||||
let doc_id = id_for_doc(&wp, &aid, &pv);
|
||||
|
||||
// Build a >200-line method body to force split_oversize.
|
||||
let big_body: String = {
|
||||
let header = "export class BigProcessor {\n process(items: string[]): string[] {\n";
|
||||
let body: String = (0..210u32)
|
||||
.map(|i| format!(" const v{i} = items[{i}] ?? '';\n"))
|
||||
.collect();
|
||||
let footer = " return items;\n }\n}";
|
||||
format!("{header}{body}{footer}")
|
||||
};
|
||||
let big_line_count = big_body.lines().count() as u32;
|
||||
let big_line_end = 48 + big_line_count - 1;
|
||||
|
||||
// Representative units:
|
||||
// 0. import block (lines 1–5, ≤200)
|
||||
// 1. free fn `parseInput` (lines 7–12, ≤200)
|
||||
// 2. interface `Frobable` (lines 14–20, ≤200)
|
||||
// 3. class `Foo` (lines 22–30, ≤200)
|
||||
// 4. method `Foo.double` (lines 32–38, ≤200)
|
||||
// 5. method `Foo.triple` (lines 40–46, ≤200)
|
||||
// 6. BigProcessor (>200 lines) to force split_oversize
|
||||
let raw_units: Vec<(&str, u32, u32, String)> = vec![
|
||||
(
|
||||
"imports",
|
||||
1,
|
||||
5,
|
||||
"import { readFileSync } from 'fs';\nimport { join } from 'path';\nimport type { Config } from './config';\nimport { Logger } from './logger';\nimport { EventEmitter } from 'events';".to_string(),
|
||||
),
|
||||
(
|
||||
"parseInput",
|
||||
7,
|
||||
12,
|
||||
"export function parseInput(raw: string): number | null {\n const trimmed = raw.trim();\n const n = Number(trimmed);\n if (isNaN(n)) return null;\n return n;\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"Frobable",
|
||||
14,
|
||||
20,
|
||||
"export interface Frobable {\n frob(): string;\n frobTwice(): string;\n readonly name: string;\n readonly tags: string[];\n count: number;\n reset(): void;\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"Foo",
|
||||
22,
|
||||
30,
|
||||
"export class Foo implements Frobable {\n constructor(\n public readonly name: string,\n public value: number,\n public tags: string[] = [],\n ) {}\n frob(): string { return this.name; }\n frobTwice(): string { return this.name.repeat(2); }\n reset(): void { this.value = 0; }\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"Foo.double",
|
||||
32,
|
||||
38,
|
||||
"export class Foo {\n double(): number {\n const result = this.value * 2;\n if (result > Number.MAX_SAFE_INTEGER) {\n return Number.MAX_SAFE_INTEGER;\n }\n return result;\n }\n}".to_string(),
|
||||
),
|
||||
(
|
||||
"Foo.triple",
|
||||
40,
|
||||
46,
|
||||
"export class Foo {\n triple(): number {\n const result = this.value * 3;\n if (result > Number.MAX_SAFE_INTEGER) {\n return Number.MAX_SAFE_INTEGER;\n }\n return result;\n }\n}".to_string(),
|
||||
),
|
||||
("BigProcessor", 48, big_line_end, big_body),
|
||||
];
|
||||
|
||||
let blocks: Vec<Block> = raw_units
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, (sym, ls, le, code))| {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: *ls,
|
||||
line_end: *le,
|
||||
symbol: Some((*sym).to_string()),
|
||||
lang: Some("typescript".into()),
|
||||
};
|
||||
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
|
||||
Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id: bid,
|
||||
heading_path: vec![],
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some("typescript".into()),
|
||||
code: code.clone(),
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
|
||||
CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: aid,
|
||||
workspace_path: wp,
|
||||
title: "Foo.ts".into(),
|
||||
lang: Lang("und".into()),
|
||||
blocks,
|
||||
metadata: Metadata {
|
||||
aliases: vec![],
|
||||
tags: vec![],
|
||||
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Default::default(),
|
||||
repo: Some("kebab".into()),
|
||||
git_branch: Some("main".into()),
|
||||
git_commit: Some("0".repeat(40)),
|
||||
code_lang: Some("typescript".into()),
|
||||
},
|
||||
provenance: Provenance { events: vec![] },
|
||||
parser_version: pv,
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
}
|
||||
}
|
||||
|
||||
fn fixed_policy() -> ChunkPolicy {
|
||||
ChunkPolicy {
|
||||
target_tokens: 500,
|
||||
overlap_tokens: 80,
|
||||
respect_markdown_headings: false,
|
||||
chunker_version: ChunkerVersion("code-ts-ast-v1".into()),
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn code_ts_ast_chunks_snapshot() {
|
||||
let doc = fixed_doc();
|
||||
let policy = fixed_policy();
|
||||
|
||||
let chunks = CodeTsAstV1Chunker.chunk(&doc, &policy).expect("chunk");
|
||||
let actual = serde_json::to_value(&chunks).unwrap();
|
||||
|
||||
let dir = fixtures_dir();
|
||||
let baseline_path = dir.join("code-sample.ts.chunks.snapshot.json");
|
||||
let baseline_text = match std::fs::read_to_string(&baseline_path) {
|
||||
Ok(s) => s,
|
||||
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
|
||||
std::fs::create_dir_all(&dir).unwrap();
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
|
||||
return;
|
||||
}
|
||||
Err(e) => panic!(
|
||||
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
|
||||
baseline_path.display()
|
||||
),
|
||||
};
|
||||
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
|
||||
|
||||
if actual != expected {
|
||||
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
|
||||
eprintln!("updated baseline {}", baseline_path.display());
|
||||
return;
|
||||
}
|
||||
let pretty = serde_json::to_string_pretty(&actual).unwrap();
|
||||
panic!(
|
||||
"code-ts-ast-v1 chunks snapshot drift\n\
|
||||
--- expected ({}) ---\n{baseline_text}\n\
|
||||
--- actual ---\n{pretty}\n\
|
||||
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
|
||||
baseline_path.display()
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
/// Determinism cross-check: re-running the same pipeline yields the same
|
||||
/// chunk_ids byte-for-byte.
|
||||
#[test]
|
||||
fn code_ts_ast_chunks_are_deterministic() {
|
||||
let policy = fixed_policy();
|
||||
let baseline: Vec<String> = CodeTsAstV1Chunker
|
||||
.chunk(&fixed_doc(), &policy)
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.map(|c| c.chunk_id.0)
|
||||
.collect();
|
||||
for _ in 0..5 {
|
||||
let again: Vec<String> = CodeTsAstV1Chunker
|
||||
.chunk(&fixed_doc(), &policy)
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.map(|c| c.chunk_id.0)
|
||||
.collect();
|
||||
assert_eq!(again, baseline);
|
||||
}
|
||||
}
|
||||
170
crates/kebab-chunk/tests/fixtures/code-sample.chunks.snapshot.json
vendored
Normal file
170
crates/kebab-chunk/tests/fixtures/code-sample.chunks.snapshot.json
vendored
Normal file
File diff suppressed because one or more lines are too long
170
crates/kebab-chunk/tests/fixtures/code-sample.js.chunks.snapshot.json
vendored
Normal file
170
crates/kebab-chunk/tests/fixtures/code-sample.js.chunks.snapshot.json
vendored
Normal file
File diff suppressed because one or more lines are too long
170
crates/kebab-chunk/tests/fixtures/code-sample.py.chunks.snapshot.json
vendored
Normal file
170
crates/kebab-chunk/tests/fixtures/code-sample.py.chunks.snapshot.json
vendored
Normal file
File diff suppressed because one or more lines are too long
170
crates/kebab-chunk/tests/fixtures/code-sample.ts.chunks.snapshot.json
vendored
Normal file
170
crates/kebab-chunk/tests/fixtures/code-sample.ts.chunks.snapshot.json
vendored
Normal file
File diff suppressed because one or more lines are too long
@@ -46,6 +46,11 @@ struct Cli {
|
||||
command: Cmd,
|
||||
}
|
||||
|
||||
// p10-1A-1: adding `repo` and `code_lang` Vec<String> fields pushed `Cmd`
|
||||
// over clippy's large_enum_variant threshold. The enum is short-lived
|
||||
// (parsed once at startup, never cloned in a hot path) — boxing would add
|
||||
// noise with no real benefit.
|
||||
#[allow(clippy::large_enum_variant)]
|
||||
#[derive(Subcommand, Debug)]
|
||||
enum Cmd {
|
||||
/// Initialise XDG dirs + workspace + `config.toml`.
|
||||
@@ -165,6 +170,18 @@ enum Cmd {
|
||||
#[arg(long)]
|
||||
doc_id: Option<String>,
|
||||
|
||||
/// p10-1A-1: filter by repo name (`metadata.repo`). Repeatable;
|
||||
/// multi-value = OR. Empty = no filter (all repos returned).
|
||||
#[arg(long = "repo", value_name = "NAME", num_args = 1)]
|
||||
repo: Vec<String>,
|
||||
|
||||
/// p10-1A-1: filter by code language identifier (lowercase
|
||||
/// canonical). Repeatable or comma-separated.
|
||||
/// Examples: `rust`, `python`, `typescript`.
|
||||
/// Unknown values produce empty hits.
|
||||
#[arg(long = "code-lang", value_name = "LANG", num_args = 1, value_delimiter = ',')]
|
||||
code_lang: Vec<String>,
|
||||
|
||||
/// p9-fb-37: emit pre-fusion lexical / vector / RRF candidate
|
||||
/// lists + per-stage timing in the response. Bypasses cache
|
||||
/// (debug intent — fresh run guaranteed). Requires embeddings
|
||||
@@ -688,6 +705,8 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
|
||||
media,
|
||||
ingested_after,
|
||||
doc_id,
|
||||
repo,
|
||||
code_lang,
|
||||
trace,
|
||||
bulk,
|
||||
} => {
|
||||
@@ -819,7 +838,7 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
|
||||
None => None,
|
||||
};
|
||||
|
||||
// p9-fb-36: build SearchFilters from the 7 new flags.
|
||||
// p9-fb-36 + p10-1A-1: build SearchFilters from CLI flags.
|
||||
let filters = kebab_core::SearchFilters {
|
||||
tags_any: tag.clone(),
|
||||
lang: lang.as_ref().map(|s| kebab_core::Lang(s.clone())),
|
||||
@@ -828,6 +847,8 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
|
||||
media: media_norm,
|
||||
ingested_after: ingested_after_parsed,
|
||||
doc_id: doc_id.as_ref().map(|s| kebab_core::DocumentId(s.clone())),
|
||||
repo: repo.clone(),
|
||||
code_lang: code_lang.clone(),
|
||||
};
|
||||
|
||||
let q = kebab_core::SearchQuery {
|
||||
|
||||
@@ -239,7 +239,7 @@ mod tests {
|
||||
|
||||
#[test]
|
||||
fn ingest_wrapper_tags_schema_version() {
|
||||
use kebab_core::SourceScope;
|
||||
use kebab_core::{SkipExamples, SourceScope};
|
||||
let r = IngestReport {
|
||||
scope: SourceScope {
|
||||
root: std::path::PathBuf::from("/tmp"),
|
||||
@@ -254,6 +254,12 @@ mod tests {
|
||||
errors: 0,
|
||||
duration_ms: 0,
|
||||
skipped_by_extension: std::collections::BTreeMap::new(),
|
||||
skipped_gitignore: 0,
|
||||
skipped_kebabignore: 0,
|
||||
skipped_builtin_blacklist: 0,
|
||||
skipped_generated: 0,
|
||||
skipped_size_exceeded: 0,
|
||||
skip_examples: SkipExamples::default(),
|
||||
items: None,
|
||||
};
|
||||
let v = wire_ingest(&r);
|
||||
@@ -328,6 +334,8 @@ mod tests {
|
||||
lang_breakdown: Default::default(),
|
||||
index_bytes: Default::default(),
|
||||
stale_doc_count: 0,
|
||||
// p10-1A-1: new fields added to Stats; use Default for the test fixture.
|
||||
..Default::default()
|
||||
},
|
||||
};
|
||||
let v = wire_schema(&schema);
|
||||
|
||||
100
crates/kebab-cli/tests/wire_citation_5_variants_unchanged.rs
Normal file
100
crates/kebab-cli/tests/wire_citation_5_variants_unchanged.rs
Normal file
@@ -0,0 +1,100 @@
|
||||
//! p10-1A-1 Task 13: regression — the 5 original Citation variants
|
||||
//! (Line, Page, Region, Caption, Time) serialize byte-identically to
|
||||
//! pre-Task-1 form. No spurious `code`, `line_start`, or `symbol` keys
|
||||
//! must leak into these variants.
|
||||
|
||||
use kebab_core::{Citation, WorkspacePath};
|
||||
|
||||
#[test]
|
||||
fn line_variant_serialization_unchanged() {
|
||||
let c = Citation::Line {
|
||||
path: WorkspacePath::new("a.md".into()).unwrap(),
|
||||
start: 1,
|
||||
end: 2,
|
||||
section: Some("§14".into()),
|
||||
};
|
||||
let v = serde_json::to_value(&c).unwrap();
|
||||
assert_eq!(v["kind"], "line");
|
||||
assert_eq!(v["start"], 1);
|
||||
assert_eq!(v["end"], 2);
|
||||
assert_eq!(v["section"], "§14");
|
||||
// Must not bleed Code-variant keys.
|
||||
assert!(v.get("line_start").is_none(), "line_start must be absent: {v}");
|
||||
assert!(v.get("symbol").is_none(), "symbol must be absent: {v}");
|
||||
assert!(v.get("code").is_none(), "code must be absent: {v}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn line_variant_null_section_omitted() {
|
||||
let c = Citation::Line {
|
||||
path: WorkspacePath::new("b.md".into()).unwrap(),
|
||||
start: 5,
|
||||
end: 10,
|
||||
section: None,
|
||||
};
|
||||
let v = serde_json::to_value(&c).unwrap();
|
||||
assert_eq!(v["kind"], "line");
|
||||
// `section` with None should be omitted (skip_serializing_if = is_none).
|
||||
assert!(v.get("section").is_none() || v["section"].is_null());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn page_variant_serialization_unchanged() {
|
||||
let c = Citation::Page {
|
||||
path: WorkspacePath::new("a.pdf".into()).unwrap(),
|
||||
page: 13,
|
||||
section: None,
|
||||
};
|
||||
let v = serde_json::to_value(&c).unwrap();
|
||||
assert_eq!(v["kind"], "page");
|
||||
assert_eq!(v["page"], 13);
|
||||
assert!(v.get("line_start").is_none(), "line_start must be absent: {v}");
|
||||
assert!(v.get("symbol").is_none(), "symbol must be absent: {v}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn region_variant_serialization_unchanged() {
|
||||
let c = Citation::Region {
|
||||
path: WorkspacePath::new("img.png".into()).unwrap(),
|
||||
x: 10,
|
||||
y: 20,
|
||||
w: 100,
|
||||
h: 200,
|
||||
};
|
||||
let v = serde_json::to_value(&c).unwrap();
|
||||
assert_eq!(v["kind"], "region");
|
||||
assert_eq!(v["x"], 10);
|
||||
assert_eq!(v["y"], 20);
|
||||
assert_eq!(v["w"], 100);
|
||||
assert_eq!(v["h"], 200);
|
||||
assert!(v.get("line_start").is_none(), "line_start must be absent: {v}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn caption_variant_serialization_unchanged() {
|
||||
let c = Citation::Caption {
|
||||
path: WorkspacePath::new("a.png".into()).unwrap(),
|
||||
model: "qwen2.5-vl:7b".into(),
|
||||
};
|
||||
let v = serde_json::to_value(&c).unwrap();
|
||||
assert_eq!(v["kind"], "caption");
|
||||
assert_eq!(v["model"], "qwen2.5-vl:7b");
|
||||
assert!(v.get("line_start").is_none(), "line_start must be absent: {v}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn time_variant_serialization_unchanged() {
|
||||
let c = Citation::Time {
|
||||
path: WorkspacePath::new("audio.mp3".into()).unwrap(),
|
||||
start_ms: 1000,
|
||||
end_ms: 5000,
|
||||
speaker: Some("Alice".into()),
|
||||
};
|
||||
let v = serde_json::to_value(&c).unwrap();
|
||||
assert_eq!(v["kind"], "time");
|
||||
assert_eq!(v["start_ms"], 1000);
|
||||
assert_eq!(v["end_ms"], 5000);
|
||||
assert_eq!(v["speaker"], "Alice");
|
||||
assert!(v.get("line_start").is_none(), "line_start must be absent: {v}");
|
||||
assert!(v.get("symbol").is_none(), "symbol must be absent: {v}");
|
||||
}
|
||||
72
crates/kebab-cli/tests/wire_search_filters_code.rs
Normal file
72
crates/kebab-cli/tests/wire_search_filters_code.rs
Normal file
@@ -0,0 +1,72 @@
|
||||
//! p10-1A-1 Task 15: CLI accepts --repo and --code-lang flags.
|
||||
//!
|
||||
//! These tests verify that clap parses the new flags without error.
|
||||
//! They drive `kebab search --help` (which exercises flag parsing
|
||||
//! via clap's help generation path, exiting 0) or use a minimal
|
||||
//! config + `--json` round-trip to verify the flags reach the wire.
|
||||
|
||||
use std::process::Command;
|
||||
|
||||
fn kebab() -> Command {
|
||||
Command::new(env!("CARGO_BIN_EXE_kebab"))
|
||||
}
|
||||
|
||||
/// `kebab search --help` must exit 0 and mention `--repo`.
|
||||
#[test]
|
||||
fn cli_search_help_mentions_repo_flag() {
|
||||
let out = kebab()
|
||||
.args(["search", "--help"])
|
||||
.output()
|
||||
.expect("failed to run kebab");
|
||||
// clap help exits 0.
|
||||
assert!(
|
||||
out.status.success(),
|
||||
"kebab search --help exited non-zero: {:?}",
|
||||
out.status
|
||||
);
|
||||
let stdout = String::from_utf8_lossy(&out.stdout);
|
||||
assert!(
|
||||
stdout.contains("--repo"),
|
||||
"--repo flag must appear in search help output:\n{stdout}"
|
||||
);
|
||||
}
|
||||
|
||||
/// `kebab search --help` must exit 0 and mention `--code-lang`.
|
||||
#[test]
|
||||
fn cli_search_help_mentions_code_lang_flag() {
|
||||
let out = kebab()
|
||||
.args(["search", "--help"])
|
||||
.output()
|
||||
.expect("failed to run kebab");
|
||||
assert!(
|
||||
out.status.success(),
|
||||
"kebab search --help exited non-zero: {:?}",
|
||||
out.status
|
||||
);
|
||||
let stdout = String::from_utf8_lossy(&out.stdout);
|
||||
assert!(
|
||||
stdout.contains("--code-lang"),
|
||||
"--code-lang flag must appear in search help output:\n{stdout}"
|
||||
);
|
||||
}
|
||||
|
||||
/// `kebab search --help` must exit 0 and mention `--media`.
|
||||
/// Confirms `--media code` value pathway is available (media is
|
||||
/// a free-form Vec<String> that already accepted arbitrary values).
|
||||
#[test]
|
||||
fn cli_search_help_mentions_media_flag() {
|
||||
let out = kebab()
|
||||
.args(["search", "--help"])
|
||||
.output()
|
||||
.expect("failed to run kebab");
|
||||
assert!(
|
||||
out.status.success(),
|
||||
"kebab search --help exited non-zero: {:?}",
|
||||
out.status
|
||||
);
|
||||
let stdout = String::from_utf8_lossy(&out.stdout);
|
||||
assert!(
|
||||
stdout.contains("--media"),
|
||||
"--media flag must appear in search help output:\n{stdout}"
|
||||
);
|
||||
}
|
||||
47
crates/kebab-cli/tests/wire_search_hit_no_code_fields.rs
Normal file
47
crates/kebab-cli/tests/wire_search_hit_no_code_fields.rs
Normal file
@@ -0,0 +1,47 @@
|
||||
//! p10-1A-1 Task 13: regression — markdown SearchHit omits `repo` and
|
||||
//! `code_lang` from JSON when both are `None`.
|
||||
//!
|
||||
//! Proves that adding optional fields to SearchHit does not silently
|
||||
//! inject spurious keys into the existing markdown corpus wire shape.
|
||||
|
||||
use kebab_core::{
|
||||
Citation, ChunkId, ChunkerVersion, DocumentId, IndexVersion, RetrievalDetail, ScoreKind,
|
||||
SearchHit, WorkspacePath,
|
||||
};
|
||||
|
||||
#[test]
|
||||
fn markdown_hit_omits_repo_and_code_lang() {
|
||||
let hit = SearchHit {
|
||||
rank: 1,
|
||||
chunk_id: ChunkId("c1".into()),
|
||||
doc_id: DocumentId("d1".into()),
|
||||
doc_path: WorkspacePath::new("notes/foo.md".into()).unwrap(),
|
||||
heading_path: vec!["A".into(), "B".into()],
|
||||
section_label: Some("B".into()),
|
||||
snippet: "hi".into(),
|
||||
citation: Citation::Line {
|
||||
path: WorkspacePath::new("notes/foo.md".into()).unwrap(),
|
||||
start: 1,
|
||||
end: 2,
|
||||
section: None,
|
||||
},
|
||||
retrieval: RetrievalDetail::default(),
|
||||
index_version: IndexVersion("v1".into()),
|
||||
embedding_model: None,
|
||||
chunker_version: ChunkerVersion("md-heading-v1".into()),
|
||||
indexed_at: time::OffsetDateTime::UNIX_EPOCH,
|
||||
stale: false,
|
||||
score_kind: ScoreKind::Rrf,
|
||||
repo: None,
|
||||
code_lang: None,
|
||||
};
|
||||
let s = serde_json::to_string(&hit).unwrap();
|
||||
assert!(
|
||||
!s.contains("\"repo\""),
|
||||
"repo should be absent from markdown hit JSON: {s}"
|
||||
);
|
||||
assert!(
|
||||
!s.contains("\"code_lang\""),
|
||||
"code_lang should be absent from markdown hit JSON: {s}"
|
||||
);
|
||||
}
|
||||
@@ -45,6 +45,11 @@ pub struct Config {
|
||||
/// `dark`).
|
||||
#[serde(default = "UiCfg::defaults")]
|
||||
pub ui: UiCfg,
|
||||
/// p10-1A-1: code ingest settings. `#[serde(default)]` so existing
|
||||
/// config files without an `[ingest]` / `[ingest.code]` section
|
||||
/// load cleanly with built-in defaults.
|
||||
#[serde(default)]
|
||||
pub ingest: IngestCfg,
|
||||
/// p9-fb-05: directory of the on-disk config file this `Config`
|
||||
/// was loaded from, if any. Populated by `Config::from_file` /
|
||||
/// `Config::load` — never serialized (`#[serde(skip)]`). Used by
|
||||
@@ -265,6 +270,52 @@ impl UiCfg {
|
||||
}
|
||||
}
|
||||
|
||||
/// p10-1A-1: top-level ingest configuration wrapper. Contains per-media-type
|
||||
/// sub-sections; currently only `code` is defined.
|
||||
#[derive(Clone, Debug, Default, PartialEq, Serialize, Deserialize)]
|
||||
#[serde(default)]
|
||||
pub struct IngestCfg {
|
||||
pub code: IngestCodeCfg,
|
||||
}
|
||||
|
||||
/// p10-1A-1: settings for the code ingest pipeline. All fields have
|
||||
/// reasonable defaults so the user need not set anything in `config.toml`
|
||||
/// to get working code ingest.
|
||||
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
|
||||
#[serde(default)]
|
||||
pub struct IngestCodeCfg {
|
||||
/// Generated header sniff. Reads first ~512 bytes, checks 7 markers.
|
||||
pub skip_generated_header: bool,
|
||||
/// Max byte size per file. Bigger files skipped.
|
||||
pub max_file_bytes: u64,
|
||||
/// Max line count per file. Bigger files skipped (byte cap checked first).
|
||||
pub max_file_lines: u32,
|
||||
/// User extra skip globs (gitignore syntax). Applied on top of built-in
|
||||
/// + `.gitignore` + `.kebabignore`.
|
||||
pub extra_skip_globs: Vec<String>,
|
||||
/// AST chunk size cap. Functions/classes longer than this fall back to
|
||||
/// paragraph-based split (1A-2 and later).
|
||||
pub ast_chunk_max_lines: u32,
|
||||
/// Tier 3 fallback chunker: lines per chunk.
|
||||
pub fallback_lines_per_chunk: u32,
|
||||
/// Tier 3 fallback chunker: line overlap between adjacent chunks.
|
||||
pub fallback_lines_overlap: u32,
|
||||
}
|
||||
|
||||
impl Default for IngestCodeCfg {
|
||||
fn default() -> Self {
|
||||
Self {
|
||||
skip_generated_header: true,
|
||||
max_file_bytes: 262_144,
|
||||
max_file_lines: 5_000,
|
||||
extra_skip_globs: vec![],
|
||||
ast_chunk_max_lines: 200,
|
||||
fallback_lines_per_chunk: 80,
|
||||
fallback_lines_overlap: 20,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl Config {
|
||||
/// Defaults per design §6.4.
|
||||
pub fn defaults() -> Self {
|
||||
@@ -302,9 +353,9 @@ impl Config {
|
||||
models: ModelsCfg {
|
||||
embedding: EmbeddingModelCfg {
|
||||
provider: "fastembed".to_string(),
|
||||
model: "multilingual-e5-small".to_string(),
|
||||
model: "multilingual-e5-large".to_string(),
|
||||
version: "v1".to_string(),
|
||||
dimensions: 384,
|
||||
dimensions: 1024,
|
||||
batch_size: 64,
|
||||
},
|
||||
llm: LlmCfg {
|
||||
@@ -336,6 +387,7 @@ impl Config {
|
||||
},
|
||||
image: ImageCfg::defaults(),
|
||||
ui: UiCfg::defaults(),
|
||||
ingest: IngestCfg::default(),
|
||||
// p9-fb-05: defaults are not loaded from disk, so no
|
||||
// source_dir. Relative `workspace.root` (rare with
|
||||
// defaults) falls back to caller `cwd` via the
|
||||
@@ -764,7 +816,8 @@ mod tests {
|
||||
let c = Config::defaults();
|
||||
assert_eq!(c.rag.score_gate, 0.30);
|
||||
assert_eq!(c.chunking.target_tokens, 500);
|
||||
assert_eq!(c.models.embedding.dimensions, 384);
|
||||
assert_eq!(c.models.embedding.model, "multilingual-e5-large");
|
||||
assert_eq!(c.models.embedding.dimensions, 1024);
|
||||
assert_eq!(c.search.rrf_k, 60);
|
||||
}
|
||||
|
||||
@@ -947,9 +1000,9 @@ chunker_version = "md-heading-v1"
|
||||
|
||||
[models.embedding]
|
||||
provider = "fastembed"
|
||||
model = "multilingual-e5-small"
|
||||
model = "multilingual-e5-large"
|
||||
version = "v1"
|
||||
dimensions = 384
|
||||
dimensions = 1024
|
||||
batch_size = 64
|
||||
|
||||
[models.llm]
|
||||
@@ -1059,6 +1112,49 @@ max_context_tokens = 8000
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn ingest_code_cfg_defaults() {
|
||||
let cfg: IngestCodeCfg = toml::from_str("").unwrap();
|
||||
assert_eq!(cfg.max_file_bytes, 262_144);
|
||||
assert_eq!(cfg.max_file_lines, 5_000);
|
||||
assert!(cfg.skip_generated_header);
|
||||
assert!(cfg.extra_skip_globs.is_empty());
|
||||
assert_eq!(cfg.ast_chunk_max_lines, 200);
|
||||
assert_eq!(cfg.fallback_lines_per_chunk, 80);
|
||||
assert_eq!(cfg.fallback_lines_overlap, 20);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn ingest_code_cfg_user_override() {
|
||||
let toml = r#"
|
||||
max_file_bytes = 1048576
|
||||
max_file_lines = 20000
|
||||
skip_generated_header = false
|
||||
extra_skip_globs = ["**/fixtures/**", "**/snapshots/**"]
|
||||
"#;
|
||||
let cfg: IngestCodeCfg = toml::from_str(toml).unwrap();
|
||||
assert_eq!(cfg.max_file_bytes, 1_048_576);
|
||||
assert_eq!(cfg.max_file_lines, 20_000);
|
||||
assert!(!cfg.skip_generated_header);
|
||||
assert_eq!(cfg.extra_skip_globs.len(), 2);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn config_with_ingest_code_section() {
|
||||
// Build a full valid Config serialization and patch only the
|
||||
// [ingest.code] field we care about — avoids having to enumerate
|
||||
// every required Config field in the test fixture.
|
||||
let base = Config::defaults();
|
||||
let mut toml_text = toml::to_string(&base).unwrap();
|
||||
// Inject max_file_bytes override into the [ingest.code] table.
|
||||
toml_text = toml_text.replace(
|
||||
"max_file_bytes = 262144",
|
||||
"max_file_bytes = 524288",
|
||||
);
|
||||
let cfg: Config = toml::from_str(&toml_text).unwrap();
|
||||
assert_eq!(cfg.ingest.code.max_file_bytes, 524_288);
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
|
||||
@@ -37,6 +37,13 @@ pub enum Citation {
|
||||
end_ms: u64,
|
||||
speaker: Option<String>,
|
||||
},
|
||||
Code {
|
||||
path: WorkspacePath,
|
||||
line_start: u32,
|
||||
line_end: u32,
|
||||
symbol: Option<String>,
|
||||
lang: Option<String>,
|
||||
},
|
||||
}
|
||||
|
||||
impl Citation {
|
||||
@@ -46,7 +53,8 @@ impl Citation {
|
||||
| Citation::Page { path, .. }
|
||||
| Citation::Region { path, .. }
|
||||
| Citation::Caption { path, .. }
|
||||
| Citation::Time { path, .. } => path,
|
||||
| Citation::Time { path, .. }
|
||||
| Citation::Code { path, .. } => path,
|
||||
}
|
||||
}
|
||||
|
||||
@@ -80,6 +88,18 @@ impl Citation {
|
||||
None => format!("{}#t={},{}", path.0, s, e),
|
||||
}
|
||||
}
|
||||
Citation::Code {
|
||||
path,
|
||||
line_start,
|
||||
line_end,
|
||||
..
|
||||
} => {
|
||||
if line_start == line_end {
|
||||
format!("{}#L{}", path.0, line_start)
|
||||
} else {
|
||||
format!("{}#L{}-L{}", path.0, line_start, line_end)
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -354,4 +374,64 @@ mod tests {
|
||||
let r = Citation::parse("notes/x#evil.md#L7");
|
||||
assert!(r.is_err(), "path with embedded '#' must be rejected");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn citation_code_variant_serializes_with_kind_tag() {
|
||||
let c = Citation::Code {
|
||||
path: WorkspacePath("crates/kebab-chunk/src/md_heading_v1.rs".into()),
|
||||
line_start: 142,
|
||||
line_end: 168,
|
||||
symbol: Some("MdHeadingV1Chunker::chunk_doc".into()),
|
||||
lang: Some("rust".into()),
|
||||
};
|
||||
let v = serde_json::to_value(&c).unwrap();
|
||||
assert_eq!(v["kind"], "code");
|
||||
assert_eq!(v["line_start"], 142);
|
||||
assert_eq!(v["line_end"], 168);
|
||||
assert_eq!(v["symbol"], "MdHeadingV1Chunker::chunk_doc");
|
||||
assert_eq!(v["lang"], "rust");
|
||||
// Existing 5 variants must NOT pick up these fields.
|
||||
let line = Citation::Line {
|
||||
path: WorkspacePath("notes/foo.md".into()),
|
||||
start: 1,
|
||||
end: 10,
|
||||
section: None,
|
||||
};
|
||||
let lv = serde_json::to_value(&line).unwrap();
|
||||
assert!(lv.get("line_start").is_none());
|
||||
assert!(lv.get("symbol").is_none());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn citation_code_uri_format() {
|
||||
let c = Citation::Code {
|
||||
path: WorkspacePath("a/b.rs".into()),
|
||||
line_start: 10,
|
||||
line_end: 20,
|
||||
symbol: None,
|
||||
lang: Some("rust".into()),
|
||||
};
|
||||
assert_eq!(c.to_uri(), "a/b.rs#L10-L20");
|
||||
// Single-line uses `#L10`.
|
||||
let single = Citation::Code {
|
||||
path: WorkspacePath("a/b.rs".into()),
|
||||
line_start: 5,
|
||||
line_end: 5,
|
||||
symbol: None,
|
||||
lang: None,
|
||||
};
|
||||
assert_eq!(single.to_uri(), "a/b.rs#L5");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn citation_code_path_accessor() {
|
||||
let c = Citation::Code {
|
||||
path: WorkspacePath("x.rs".into()),
|
||||
line_start: 1,
|
||||
line_end: 1,
|
||||
symbol: None,
|
||||
lang: None,
|
||||
};
|
||||
assert_eq!(c.path().0, "x.rs");
|
||||
}
|
||||
}
|
||||
|
||||
@@ -142,6 +142,18 @@ pub enum SourceSpan {
|
||||
start_ms: u64,
|
||||
end_ms: u64,
|
||||
},
|
||||
/// p10-1A-2: AST-unit span for code ingest. Internal storage shape
|
||||
/// (chunks.source_spans_json) — `citation_helper` maps this to the
|
||||
/// wire `Citation::Code` (added 1A-1). `symbol` is the per-language
|
||||
/// self-reference path (design §3.4); `<top-level>` / `<module>` for
|
||||
/// glue regions, never null for an identified unit. `lang` is the
|
||||
/// canonical code_lang.
|
||||
Code {
|
||||
line_start: u32,
|
||||
line_end: u32,
|
||||
symbol: Option<String>,
|
||||
lang: Option<String>,
|
||||
},
|
||||
}
|
||||
|
||||
// ── Forward-declared stubs (§3.7a). Bodies are final per design. ────────
|
||||
@@ -195,6 +207,24 @@ mod tests {
|
||||
/// previously failed at serde runtime because `tag = "kind"` cannot
|
||||
/// describe a newtype carrying a non-struct value. The struct-variant
|
||||
/// shape used here is the §9 schema migration.
|
||||
#[test]
|
||||
fn source_span_code_round_trips_and_tags_lowercase() {
|
||||
let s = SourceSpan::Code {
|
||||
line_start: 10,
|
||||
line_end: 42,
|
||||
symbol: Some("foo::Bar::baz".to_string()),
|
||||
lang: Some("rust".to_string()),
|
||||
};
|
||||
let v = serde_json::to_value(&s).unwrap();
|
||||
assert_eq!(v["kind"], "code");
|
||||
assert_eq!(v["line_start"], 10);
|
||||
assert_eq!(v["line_end"], 42);
|
||||
assert_eq!(v["symbol"], "foo::Bar::baz");
|
||||
assert_eq!(v["lang"], "rust");
|
||||
let back: SourceSpan = serde_json::from_value(v).unwrap();
|
||||
assert_eq!(back, s);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn inline_serde_round_trip() {
|
||||
let cases = vec![
|
||||
|
||||
@@ -25,10 +25,46 @@ pub struct IngestReport {
|
||||
/// extension key under "<no-ext>". `BTreeMap` so the wire JSON
|
||||
/// has stable key order across runs.
|
||||
pub skipped_by_extension: std::collections::BTreeMap<String, u32>,
|
||||
/// p10-1A-1: files skipped because they matched a repo-local `.gitignore`.
|
||||
#[serde(default)]
|
||||
pub skipped_gitignore: u32,
|
||||
/// p10-1A-1: files skipped because they matched a `.kebabignore` entry.
|
||||
#[serde(default)]
|
||||
pub skipped_kebabignore: u32,
|
||||
/// p10-1A-1: files skipped because they matched the built-in safety-net
|
||||
/// blacklist (`node_modules/`, `target/`, `__pycache__/`, `.venv/`,
|
||||
/// `venv/`, `env/`).
|
||||
#[serde(default)]
|
||||
pub skipped_builtin_blacklist: u32,
|
||||
/// p10-1A-1: files skipped because their first ~512 bytes contained a
|
||||
/// generated-file marker (`@generated`, `do not edit`, …).
|
||||
#[serde(default)]
|
||||
pub skipped_generated: u32,
|
||||
/// p10-1A-1: files skipped because they exceeded `max_file_bytes` or
|
||||
/// `max_file_lines` in `[ingest.code]`.
|
||||
#[serde(default)]
|
||||
pub skipped_size_exceeded: u32,
|
||||
/// p10-1A-1: sample file paths per skip category (≤ 5 each).
|
||||
#[serde(default)]
|
||||
pub skip_examples: SkipExamples,
|
||||
/// `None` ↔ wire `items: null` (`--summary-only`).
|
||||
pub items: Option<Vec<IngestItem>>,
|
||||
}
|
||||
|
||||
/// p10-1A-1: per-category sample of skipped file paths. Each category caps at
|
||||
/// 5 entries (oldest-first). Used for debugging "why was X not indexed?"
|
||||
#[derive(Clone, Debug, Default, PartialEq, Serialize, Deserialize)]
|
||||
pub struct SkipExamples {
|
||||
#[serde(default)]
|
||||
pub generated: Vec<String>,
|
||||
#[serde(default)]
|
||||
pub size_exceeded: Vec<String>,
|
||||
#[serde(default)]
|
||||
pub builtin_blacklist: Vec<String>,
|
||||
#[serde(default)]
|
||||
pub gitignore: Vec<String>,
|
||||
}
|
||||
|
||||
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
|
||||
pub struct IngestItem {
|
||||
pub kind: IngestItemKind,
|
||||
@@ -58,3 +94,55 @@ pub enum IngestItemKind {
|
||||
Unchanged,
|
||||
Error,
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use crate::traits::SourceScope;
|
||||
|
||||
#[test]
|
||||
fn skip_examples_default_is_empty() {
|
||||
let s = SkipExamples::default();
|
||||
assert!(s.generated.is_empty());
|
||||
assert!(s.size_exceeded.is_empty());
|
||||
assert!(s.builtin_blacklist.is_empty());
|
||||
assert!(s.gitignore.is_empty());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn ingest_report_skip_counters_serialize() {
|
||||
let r = IngestReport {
|
||||
scope: SourceScope {
|
||||
root: std::path::PathBuf::from("/tmp"),
|
||||
include: vec![],
|
||||
exclude: vec![],
|
||||
},
|
||||
scanned: 100,
|
||||
new: 50,
|
||||
updated: 0,
|
||||
skipped: 0,
|
||||
unchanged: 0,
|
||||
errors: 0,
|
||||
duration_ms: 1234,
|
||||
skipped_by_extension: Default::default(),
|
||||
skipped_gitignore: 30,
|
||||
skipped_kebabignore: 5,
|
||||
skipped_builtin_blacklist: 10,
|
||||
skipped_generated: 3,
|
||||
skipped_size_exceeded: 2,
|
||||
skip_examples: SkipExamples {
|
||||
generated: vec!["a/b.pb.rs".into()],
|
||||
size_exceeded: vec![],
|
||||
builtin_blacklist: vec!["node_modules/x.js".into()],
|
||||
gitignore: vec![],
|
||||
},
|
||||
items: None,
|
||||
};
|
||||
let v = serde_json::to_value(&r).unwrap();
|
||||
assert_eq!(v["skipped_gitignore"], 30);
|
||||
assert_eq!(v["skipped_builtin_blacklist"], 10);
|
||||
assert_eq!(v["skipped_generated"], 3);
|
||||
assert_eq!(v["skipped_size_exceeded"], 2);
|
||||
assert_eq!(v["skip_examples"]["generated"][0], "a/b.pb.rs");
|
||||
}
|
||||
}
|
||||
|
||||
@@ -59,7 +59,7 @@ pub use answer::{
|
||||
Answer, AnswerCitation, AnswerRetrievalSummary, ModelRef, RefusalReason, TokenUsage,
|
||||
TraceId, Turn,
|
||||
};
|
||||
pub use ingest::{IngestItem, IngestItemKind, IngestReport};
|
||||
pub use ingest::{IngestItem, IngestItemKind, IngestReport, SkipExamples};
|
||||
pub use jobs::{JobFilter, JobId, JobKind, JobRow, JobStatus};
|
||||
pub use vector::{VectorHit, VectorRecord};
|
||||
pub use errors::CoreError;
|
||||
|
||||
@@ -40,5 +40,23 @@ pub enum MediaType {
|
||||
Pdf,
|
||||
Image(ImageType),
|
||||
Audio(AudioType),
|
||||
/// p10-1A-2: a source-code file. Inner string is the canonical
|
||||
/// code_lang (design §3.5). 1A activates `"rust"` only; other
|
||||
/// recognized code langs are still routed `Other` until their phase.
|
||||
Code(String),
|
||||
Other(String),
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn media_type_code_serializes_lowercase_tagged() {
|
||||
let m = MediaType::Code("rust".to_string());
|
||||
let v = serde_json::to_value(&m).unwrap();
|
||||
assert_eq!(v, serde_json::json!({ "code": "rust" }));
|
||||
let back: MediaType = serde_json::from_value(v).unwrap();
|
||||
assert_eq!(back, m);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -17,6 +17,25 @@ pub struct Metadata {
|
||||
pub user_id_alias: Option<String>,
|
||||
/// Frontmatter keys we don't recognise are preserved here per §0 Q9.
|
||||
pub user: Map<String, Value>,
|
||||
|
||||
/// p10-1A-1: name of the source repo if the file lives inside a git
|
||||
/// working tree (`.git/` walk-up). null otherwise.
|
||||
#[serde(default, skip_serializing_if = "Option::is_none")]
|
||||
pub repo: Option<String>,
|
||||
|
||||
/// p10-1A-1: HEAD branch at ingest time. null when no repo or detached HEAD.
|
||||
/// Informational only — current-state observability, not a partition key.
|
||||
#[serde(default, skip_serializing_if = "Option::is_none")]
|
||||
pub git_branch: Option<String>,
|
||||
|
||||
/// p10-1A-1: HEAD commit (40-hex) at ingest time. null when no repo.
|
||||
#[serde(default, skip_serializing_if = "Option::is_none")]
|
||||
pub git_commit: Option<String>,
|
||||
|
||||
/// p10-1A-1: programming language identifier (lowercase canonical). null
|
||||
/// for markdown / pdf / image. Set by `kebab_parse_code::lang::code_lang_for_path`.
|
||||
#[serde(default, skip_serializing_if = "Option::is_none")]
|
||||
pub code_lang: Option<String>,
|
||||
}
|
||||
|
||||
#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq, Serialize, Deserialize)]
|
||||
@@ -66,3 +85,54 @@ pub enum ProvenanceKind {
|
||||
Warning,
|
||||
Error,
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn metadata_repo_fields_default_to_none_and_omit_when_serialized() {
|
||||
let m = Metadata {
|
||||
aliases: vec![],
|
||||
tags: vec![],
|
||||
created_at: time::OffsetDateTime::UNIX_EPOCH,
|
||||
updated_at: time::OffsetDateTime::UNIX_EPOCH,
|
||||
source_type: SourceType::Markdown,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Default::default(),
|
||||
repo: None,
|
||||
git_branch: None,
|
||||
git_commit: None,
|
||||
code_lang: None,
|
||||
};
|
||||
let v = serde_json::to_value(&m).unwrap();
|
||||
assert!(v.get("repo").is_none());
|
||||
assert!(v.get("git_branch").is_none());
|
||||
assert!(v.get("git_commit").is_none());
|
||||
assert!(v.get("code_lang").is_none());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn metadata_repo_fields_present_when_some() {
|
||||
let m = Metadata {
|
||||
aliases: vec![],
|
||||
tags: vec![],
|
||||
created_at: time::OffsetDateTime::UNIX_EPOCH,
|
||||
updated_at: time::OffsetDateTime::UNIX_EPOCH,
|
||||
source_type: SourceType::Markdown,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Default::default(),
|
||||
repo: Some("kebab".into()),
|
||||
git_branch: Some("main".into()),
|
||||
git_commit: Some("a".repeat(40)),
|
||||
code_lang: Some("rust".into()),
|
||||
};
|
||||
let v = serde_json::to_value(&m).unwrap();
|
||||
assert_eq!(v["repo"], "kebab");
|
||||
assert_eq!(v["git_branch"], "main");
|
||||
assert_eq!(v["git_commit"].as_str().unwrap().len(), 40);
|
||||
assert_eq!(v["code_lang"], "rust");
|
||||
}
|
||||
}
|
||||
|
||||
@@ -61,6 +61,14 @@ pub struct SearchFilters {
|
||||
/// p9-fb-36: restrict hits to a single document. None = no filter.
|
||||
#[serde(default)]
|
||||
pub doc_id: Option<DocumentId>,
|
||||
/// p10-1A-1: filter by `metadata.repo`. Empty = no filter; multi-value = OR.
|
||||
#[serde(default)]
|
||||
pub repo: Vec<String>,
|
||||
/// p10-1A-1: filter by `metadata.code_lang`. Empty = no filter; multi-value = OR.
|
||||
/// Identifiers are lowercase canonical names (`rust`, `python`, `typescript`, ...).
|
||||
/// Unknown values produce empty hits (consistent with `media` policy).
|
||||
#[serde(default)]
|
||||
pub code_lang: Vec<String>,
|
||||
}
|
||||
|
||||
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
|
||||
@@ -89,6 +97,15 @@ pub struct SearchHit {
|
||||
/// 옛 wire (fb-38 미만) 부재 시 `Rrf` default — hybrid 가 기본 mode.
|
||||
#[serde(default)]
|
||||
pub score_kind: ScoreKind,
|
||||
/// p10-1A-1: optional. Filled when the source file lives in a git repo
|
||||
/// (`.git/` walk-up). null for markdown / pdf / image hits and for code
|
||||
/// hits ingested via `kebab ingest-file` outside a repo boundary.
|
||||
#[serde(default, skip_serializing_if = "Option::is_none")]
|
||||
pub repo: Option<String>,
|
||||
/// p10-1A-1: optional. Programming language identifier (lowercase). Set for
|
||||
/// every code/manifest/k8s chunk; null for markdown / pdf / image hits.
|
||||
#[serde(default, skip_serializing_if = "Option::is_none")]
|
||||
pub code_lang: Option<String>,
|
||||
}
|
||||
|
||||
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
|
||||
@@ -101,6 +118,19 @@ pub struct RetrievalDetail {
|
||||
pub vector_rank: Option<u32>,
|
||||
}
|
||||
|
||||
impl Default for RetrievalDetail {
|
||||
fn default() -> Self {
|
||||
Self {
|
||||
method: SearchMode::Hybrid,
|
||||
fusion_score: 0.0,
|
||||
lexical_score: None,
|
||||
vector_score: None,
|
||||
lexical_rank: None,
|
||||
vector_rank: None,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Filter for `kb-app::list_docs` (§7.2 DocumentStore::list_documents).
|
||||
#[derive(Clone, Debug, Default, PartialEq, Serialize, Deserialize)]
|
||||
pub struct DocFilter {
|
||||
@@ -257,6 +287,8 @@ mod tests {
|
||||
indexed_at: datetime!(2026-05-09 12:00:00 UTC),
|
||||
stale: true,
|
||||
score_kind: ScoreKind::Rrf,
|
||||
repo: None,
|
||||
code_lang: None,
|
||||
};
|
||||
let v = serde_json::to_value(&hit).unwrap();
|
||||
assert_eq!(v["indexed_at"], "2026-05-09T12:00:00Z");
|
||||
@@ -429,4 +461,74 @@ mod tests {
|
||||
assert!(v["response"].is_null());
|
||||
assert_eq!(v["error"]["code"], "config_invalid");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn search_hit_repo_and_code_lang_are_optional_and_omit_when_none() {
|
||||
let hit = SearchHit {
|
||||
rank: 1,
|
||||
chunk_id: ChunkId("c1".into()),
|
||||
doc_id: DocumentId("d1".into()),
|
||||
doc_path: WorkspacePath("a.md".into()),
|
||||
heading_path: vec![],
|
||||
section_label: None,
|
||||
snippet: "".into(),
|
||||
citation: Citation::Line {
|
||||
path: WorkspacePath("a.md".into()),
|
||||
start: 1,
|
||||
end: 2,
|
||||
section: None,
|
||||
},
|
||||
retrieval: RetrievalDetail::default(),
|
||||
index_version: IndexVersion("v1".into()),
|
||||
embedding_model: None,
|
||||
chunker_version: ChunkerVersion("md-heading-v1".into()),
|
||||
indexed_at: time::OffsetDateTime::UNIX_EPOCH,
|
||||
stale: false,
|
||||
score_kind: ScoreKind::Rrf,
|
||||
repo: None,
|
||||
code_lang: None,
|
||||
};
|
||||
let v = serde_json::to_value(&hit).unwrap();
|
||||
assert!(v.get("repo").is_none(), "repo should be omitted when None");
|
||||
assert!(v.get("code_lang").is_none(), "code_lang should be omitted when None");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn search_hit_repo_and_code_lang_present_when_some() {
|
||||
let hit = SearchHit {
|
||||
rank: 1,
|
||||
chunk_id: ChunkId("c1".into()),
|
||||
doc_id: DocumentId("d1".into()),
|
||||
doc_path: WorkspacePath("a.rs".into()),
|
||||
heading_path: vec![],
|
||||
section_label: None,
|
||||
snippet: "".into(),
|
||||
citation: Citation::Code {
|
||||
path: WorkspacePath("a.rs".into()),
|
||||
line_start: 1,
|
||||
line_end: 2,
|
||||
symbol: None,
|
||||
lang: Some("rust".into()),
|
||||
},
|
||||
retrieval: RetrievalDetail::default(),
|
||||
index_version: IndexVersion("v1".into()),
|
||||
embedding_model: None,
|
||||
chunker_version: ChunkerVersion("code-rust-ast-v1".into()),
|
||||
indexed_at: time::OffsetDateTime::UNIX_EPOCH,
|
||||
stale: false,
|
||||
score_kind: ScoreKind::Rrf,
|
||||
repo: Some("kebab".into()),
|
||||
code_lang: Some("rust".into()),
|
||||
};
|
||||
let v = serde_json::to_value(&hit).unwrap();
|
||||
assert_eq!(v["repo"], "kebab");
|
||||
assert_eq!(v["code_lang"], "rust");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn search_filters_repo_and_code_lang_default_to_empty_vec() {
|
||||
let f = SearchFilters::default();
|
||||
assert!(f.repo.is_empty());
|
||||
assert!(f.code_lang.is_empty());
|
||||
}
|
||||
}
|
||||
|
||||
@@ -169,6 +169,20 @@ pub trait DocumentStore {
|
||||
&self,
|
||||
path: &WorkspacePath,
|
||||
) -> anyhow::Result<Option<RawAsset>>;
|
||||
|
||||
/// Look up a document row by its workspace path. Used by the
|
||||
/// document-centric skip path in `try_skip_unchanged` to avoid the
|
||||
/// twin-file flip-flop that the asset-side lookup suffers from
|
||||
/// (multiple files with identical content share one `assets` row
|
||||
/// whose `workspace_path` is overwritten on every UPSERT, so
|
||||
/// `get_asset_by_workspace_path` returns the wrong twin's path).
|
||||
///
|
||||
/// `documents.workspace_path` is UNIQUE (V001), so each twin has
|
||||
/// its own stable document row regardless of the asset de-dup.
|
||||
fn get_document_by_workspace_path(
|
||||
&self,
|
||||
path: &WorkspacePath,
|
||||
) -> anyhow::Result<Option<CanonicalDocument>>;
|
||||
}
|
||||
|
||||
pub trait VectorStore {
|
||||
|
||||
@@ -5,14 +5,14 @@ edition = { workspace = true }
|
||||
rust-version = { workspace = true }
|
||||
license = { workspace = true }
|
||||
repository = { workspace = true }
|
||||
description = "Local fastembed-rs adapter implementing kb_core::Embedder (multilingual-e5-small default)"
|
||||
description = "Local fastembed-rs adapter implementing kb_core::Embedder (multilingual-e5-large default, e5-small backwards-compat)"
|
||||
|
||||
[dependencies]
|
||||
kebab-config = { path = "../kebab-config" }
|
||||
kebab-embed = { path = "../kebab-embed" }
|
||||
# Default features bring `ort-download-binaries` (bundled ONNX runtime)
|
||||
# and `hf-hub-native-tls` (first-run model download). No extra features
|
||||
# needed for the multilingual-e5-small path.
|
||||
# needed for the multilingual-e5-{small,large} paths.
|
||||
fastembed = { workspace = true }
|
||||
tracing = { workspace = true }
|
||||
anyhow = { workspace = true }
|
||||
|
||||
@@ -1,8 +1,9 @@
|
||||
//! `kb-embed-local` — `FastembedEmbedder`, a local ONNX-backed
|
||||
//! [`Embedder`](kebab_embed::Embedder) implementation.
|
||||
//!
|
||||
//! Wraps [`fastembed::TextEmbedding`] for the default `multilingual-e5-small`
|
||||
//! (384-dim) model. Honors `config.models.embedding.batch_size` and applies
|
||||
//! Wraps [`fastembed::TextEmbedding`]. Default is `multilingual-e5-large`
|
||||
//! (1024-dim, p9-fb-39b); `multilingual-e5-small` (384-dim) is also supported
|
||||
//! for backwards-compat. Honors `config.models.embedding.batch_size` and applies
|
||||
//! the e5 prefix convention (§11.3 of the design report):
|
||||
//!
|
||||
//! * `EmbeddingKind::Document` → `"passage: "` prefix
|
||||
@@ -69,9 +70,9 @@ impl FastembedEmbedder {
|
||||
.with_context(|| format!("create fastembed cache dir {}", cache_dir.display()))?;
|
||||
|
||||
// 2. Resolve the fastembed enum variant from
|
||||
// `config.models.embedding.model`. Currently only the default
|
||||
// `multilingual-e5-small` is wired; other model names error
|
||||
// out with a clear message rather than silently misconfiguring.
|
||||
// `config.models.embedding.model`. Currently `multilingual-e5-large`
|
||||
// (default) and `multilingual-e5-small` are wired; other model names
|
||||
// error out with a clear message rather than silently misconfiguring.
|
||||
let model_name = resolve_model(&config.models.embedding.model)?;
|
||||
|
||||
// 3. Verify dim match BEFORE loading the model — if the config
|
||||
@@ -100,7 +101,7 @@ impl FastembedEmbedder {
|
||||
target: "kebab-embed-local",
|
||||
model = %config.models.embedding.model,
|
||||
cache_dir = %cache_dir.display(),
|
||||
"loading embedding model (first run will download ~470MB)"
|
||||
"loading embedding model (first run downloads model weights — ~470MB for e5-small, ~1.3GB for e5-large)"
|
||||
);
|
||||
let inner = TextEmbedding::try_new(opts)
|
||||
.context("fastembed: TextEmbedding::try_new")?;
|
||||
@@ -193,17 +194,18 @@ fn prefix_input(input: &EmbeddingInput<'_>) -> String {
|
||||
}
|
||||
|
||||
/// Resolve a `config.models.embedding.model` string to a fastembed
|
||||
/// `EmbeddingModel` enum variant. Only `multilingual-e5-small` is wired
|
||||
/// for p3-2; additional model names should be added (and their dims
|
||||
/// pinned in tests) as needed.
|
||||
/// `EmbeddingModel` enum variant. Currently supports `multilingual-e5-small`
|
||||
/// (384-dim) and `multilingual-e5-large` (1024-dim); additional model names
|
||||
/// should be added (and their dims pinned in tests) as needed.
|
||||
fn resolve_model(name: &str) -> Result<EmbeddingModel> {
|
||||
match name {
|
||||
"multilingual-e5-small" => Ok(EmbeddingModel::MultilingualE5Small),
|
||||
"multilingual-e5-large" => Ok(EmbeddingModel::MultilingualE5Large),
|
||||
other => anyhow::bail!(
|
||||
"kb-embed-local: unsupported embedding model {other:?}; \
|
||||
this adapter currently only ships `multilingual-e5-small`. \
|
||||
Add a new arm to `resolve_model` (and a fastembed feature \
|
||||
flag if needed) to support more."
|
||||
this adapter currently ships `multilingual-e5-small` and \
|
||||
`multilingual-e5-large`. Add a new arm to `resolve_model` \
|
||||
(and a fastembed feature flag if needed) to support more."
|
||||
),
|
||||
}
|
||||
}
|
||||
@@ -294,6 +296,12 @@ mod tests {
|
||||
resolve_model("multilingual-e5-small").expect("default model resolves");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn resolve_model_supports_e5_large() {
|
||||
let m = resolve_model("multilingual-e5-large").expect("e5-large should resolve");
|
||||
let _ = m;
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn resolve_unknown_model_errors() {
|
||||
let err = resolve_model("not-a-real-model").expect_err("unknown model errors");
|
||||
@@ -301,6 +309,21 @@ mod tests {
|
||||
assert!(msg.contains("unsupported embedding model"), "msg={msg}");
|
||||
}
|
||||
|
||||
// ── check_dim ────────────────────────────────────────────────────
|
||||
|
||||
#[test]
|
||||
fn check_dim_passes_for_1024() {
|
||||
check_dim(1024, 1024).expect("matching dims must pass");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn check_dim_rejects_384_vs_1024() {
|
||||
let err = check_dim(384, 1024).expect_err("dim mismatch must error");
|
||||
let msg = format!("{err}");
|
||||
assert!(msg.contains("384") && msg.contains("1024"),
|
||||
"error must mention both dims, got: {msg}");
|
||||
}
|
||||
|
||||
// expand_path tests live in `kb-config::paths`. The adapter imports
|
||||
// it and trusts the upstream coverage rather than duplicating it.
|
||||
}
|
||||
|
||||
@@ -3,10 +3,11 @@
|
||||
//!
|
||||
//! ## Why every test in this file is `#[ignore]`
|
||||
//!
|
||||
//! The first call to `FastembedEmbedder::new` downloads ~470 MB of
|
||||
//! weights from Hugging Face into `data_dir/models/fastembed/`. Doing
|
||||
//! that on every `cargo test` invocation is wasteful, so the bare
|
||||
//! invocation skips this file entirely.
|
||||
//! The first call to `FastembedEmbedder::new` downloads ~1.3 GB of
|
||||
//! weights (multilingual-e5-large per p9-fb-39b default) from Hugging
|
||||
//! Face into `data_dir/models/fastembed/`. Doing that on every
|
||||
//! `cargo test` invocation is wasteful, so the bare invocation skips
|
||||
//! this file entirely.
|
||||
//!
|
||||
//! Run the full suite with:
|
||||
//! ```text
|
||||
@@ -58,19 +59,20 @@ fn shared_embedder() -> &'static FastembedEmbedder {
|
||||
// ─── construction ─────────────────────────────────────────────────────
|
||||
|
||||
#[test]
|
||||
#[ignore = "downloads ~470MB ONNX model on first run; CI-only"]
|
||||
fn default_config_constructs_with_dims_384() {
|
||||
#[ignore = "downloads ~1.3GB ONNX model on first run; CI-only"]
|
||||
fn default_config_constructs_with_dims_1024() {
|
||||
// p9-fb-39b: default flipped to multilingual-e5-large (1024 dim).
|
||||
let emb = shared_embedder();
|
||||
assert_eq!(emb.dimensions(), 384);
|
||||
assert_eq!(emb.model_id().0, "multilingual-e5-small");
|
||||
assert_eq!(emb.dimensions(), 1024);
|
||||
assert_eq!(emb.model_id().0, "multilingual-e5-large");
|
||||
assert_eq!(emb.model_version().0, "v1");
|
||||
}
|
||||
|
||||
#[test]
|
||||
#[ignore = "downloads ~470MB ONNX model on first run; CI-only"]
|
||||
#[ignore = "downloads ~1.3GB ONNX model on first run; CI-only"]
|
||||
fn mismatched_dims_in_config_errors_at_construction() {
|
||||
let (mut cfg, _tmp) = test_config();
|
||||
cfg.models.embedding.dimensions = 512; // model is 384
|
||||
cfg.models.embedding.dimensions = 512; // model is 1024 (e5-large default)
|
||||
// `FastembedEmbedder` deliberately does not implement `Debug`
|
||||
// (its inner ONNX session has no useful debug shape), so we
|
||||
// can't use `expect_err`; match the Result manually.
|
||||
@@ -80,7 +82,7 @@ fn mismatched_dims_in_config_errors_at_construction() {
|
||||
};
|
||||
let msg = format!("{err}");
|
||||
assert!(msg.contains("dimension mismatch"), "msg={msg}");
|
||||
assert!(msg.contains("384"), "msg={msg}");
|
||||
assert!(msg.contains("1024"), "msg={msg}");
|
||||
assert!(msg.contains("512"), "msg={msg}");
|
||||
}
|
||||
|
||||
@@ -104,8 +106,8 @@ fn document_and_query_yield_different_vectors() {
|
||||
])
|
||||
.expect("embed two inputs");
|
||||
assert_eq!(out.len(), 2);
|
||||
assert_eq!(out[0].len(), 384);
|
||||
assert_eq!(out[1].len(), 384);
|
||||
assert_eq!(out[0].len(), 1024);
|
||||
assert_eq!(out[1].len(), 1024);
|
||||
|
||||
// Both vectors are L2-normalized → cosine similarity == dot product.
|
||||
let cos: f32 = out[0]
|
||||
@@ -142,11 +144,11 @@ fn output_vectors_are_l2_normalized() {
|
||||
];
|
||||
let out = emb.embed(&inputs).expect("embed");
|
||||
// Per `kebab_embed::assert_unit_norm` docs: `5e-4` is the safe bound at
|
||||
// 384 dims (f32::EPSILON × √384 ≈ 2.3e-6, but ONNX kernels add
|
||||
// 1024 dims (f32::EPSILON × √1024 ≈ 2.3e-6, but ONNX kernels add
|
||||
// their own per-component noise; 1e-3 is very generous and matches
|
||||
// the spec's `± 1e-3`).
|
||||
kebab_embed::assert_unit_norm(&out, 1e-3);
|
||||
kebab_embed::assert_vector_shape(&out, 384);
|
||||
kebab_embed::assert_vector_shape(&out, 1024);
|
||||
}
|
||||
|
||||
// ─── determinism ──────────────────────────────────────────────────────
|
||||
@@ -254,7 +256,7 @@ fn snapshot_aggregate_hash_is_stable() {
|
||||
// Round every component to 4 decimal places, hash deterministically.
|
||||
let mut hasher = DefaultHasher::new();
|
||||
for (i, v) in out.iter().enumerate() {
|
||||
assert_eq!(v.len(), 384, "row {i} dim mismatch");
|
||||
assert_eq!(v.len(), 1024, "row {i} dim mismatch");
|
||||
for x in v {
|
||||
let rounded: i32 = (*x * 1.0e4).round() as i32;
|
||||
rounded.hash(&mut hasher);
|
||||
|
||||
@@ -338,7 +338,8 @@ pub(crate) fn aggregate_from_rows(
|
||||
| Citation::Page { path, .. }
|
||||
| Citation::Region { path, .. }
|
||||
| Citation::Caption { path, .. }
|
||||
| Citation::Time { path, .. } => !path.0.is_empty(),
|
||||
| Citation::Time { path, .. }
|
||||
| Citation::Code { path, .. } => !path.0.is_empty(),
|
||||
});
|
||||
if covered {
|
||||
citation_num += 1;
|
||||
@@ -472,6 +473,8 @@ mod tests {
|
||||
indexed_at: OffsetDateTime::UNIX_EPOCH,
|
||||
stale: false,
|
||||
score_kind: kebab_core::ScoreKind::Rrf,
|
||||
repo: None,
|
||||
code_lang: None,
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -87,6 +87,8 @@ fn hit(rank: u32, chunk_id: &str, doc_id: &str) -> SearchHit {
|
||||
indexed_at: OffsetDateTime::UNIX_EPOCH,
|
||||
stale: false,
|
||||
score_kind: kebab_core::ScoreKind::Rrf,
|
||||
repo: None,
|
||||
code_lang: None,
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -336,21 +336,29 @@ fn runner_lexical_is_deterministic_per_query_payload() {
|
||||
"- id: q1\n query: ownership\n- id: q2\n query: heading\n",
|
||||
);
|
||||
|
||||
let run_a = run_with_golden(&yaml, || {
|
||||
let mut run_a = run_with_golden(&yaml, || {
|
||||
run_eval_with_config(&env.config, &lexical_opts()).unwrap()
|
||||
});
|
||||
let run_b = run_with_golden(&yaml, || {
|
||||
let mut run_b = run_with_golden(&yaml, || {
|
||||
run_eval_with_config(&env.config, &lexical_opts()).unwrap()
|
||||
});
|
||||
|
||||
// Run-level fields (`run_id`, `created_at`) intentionally diverge;
|
||||
// the per-query payload (which is what the snapshot fixture pins)
|
||||
// must be byte-identical.
|
||||
// must be byte-identical EXCEPT for `elapsed_ms`. Timing-sensitive
|
||||
// fields aren't determinism signals — they're µs-scale wall-clock
|
||||
// jitter and would otherwise make this assertion a flaky one (a 0
|
||||
// vs 1 ms divergence was observed under contended-CI load). Normalize
|
||||
// before comparing; see test #7 for the same exclusion done via a
|
||||
// projection.
|
||||
for qr in run_a.per_query.iter_mut().chain(run_b.per_query.iter_mut()) {
|
||||
qr.elapsed_ms = 0;
|
||||
}
|
||||
let a_json = serde_json::to_string(&run_a.per_query).unwrap();
|
||||
let b_json = serde_json::to_string(&run_b.per_query).unwrap();
|
||||
assert_eq!(
|
||||
a_json, b_json,
|
||||
"lexical-only per_query payload must be byte-identical across runs"
|
||||
"lexical-only per_query payload must be byte-identical across runs (timing normalized)"
|
||||
);
|
||||
}
|
||||
|
||||
|
||||
@@ -110,6 +110,8 @@ pub fn handle(state: &KebabAppState, input: SearchInput) -> CallToolResult {
|
||||
media,
|
||||
ingested_after,
|
||||
doc_id: input.doc_id.clone().map(kebab_core::DocumentId),
|
||||
repo: vec![],
|
||||
code_lang: vec![],
|
||||
};
|
||||
|
||||
let query = kebab_core::SearchQuery {
|
||||
|
||||
@@ -467,6 +467,10 @@ mod tests {
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user,
|
||||
repo: None,
|
||||
git_branch: None,
|
||||
git_commit: None,
|
||||
code_lang: None,
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
24
crates/kebab-parse-code/Cargo.toml
Normal file
24
crates/kebab-parse-code/Cargo.toml
Normal file
@@ -0,0 +1,24 @@
|
||||
[package]
|
||||
name = "kebab-parse-code"
|
||||
version = { workspace = true }
|
||||
edition = { workspace = true }
|
||||
rust-version = { workspace = true }
|
||||
license = { workspace = true }
|
||||
repository = { workspace = true }
|
||||
description = "Language-aware code parsing for the kebab pipeline: lang dispatch / .git detect / skip helpers (P10-1A-1) + tree-sitter Rust AST extractor (P10-1A-2)"
|
||||
|
||||
[dependencies]
|
||||
kebab-core = { path = "../kebab-core" }
|
||||
anyhow = { workspace = true }
|
||||
gix = { workspace = true }
|
||||
serde_json = { workspace = true }
|
||||
time = { workspace = true }
|
||||
tracing = { workspace = true }
|
||||
tree-sitter = { workspace = true }
|
||||
tree-sitter-rust = { workspace = true }
|
||||
tree-sitter-python = { workspace = true }
|
||||
tree-sitter-typescript = { workspace = true }
|
||||
tree-sitter-javascript = { workspace = true }
|
||||
|
||||
[dev-dependencies]
|
||||
tempfile = { workspace = true }
|
||||
574
crates/kebab-parse-code/src/javascript.rs
Normal file
574
crates/kebab-parse-code/src/javascript.rs
Normal file
@@ -0,0 +1,574 @@
|
||||
//! `kebab-parse-code::javascript` — tree-sitter JavaScript / JSX AST
|
||||
//! extractor (P10-1B Task K).
|
||||
//!
|
||||
//! Implements [`kebab_core::Extractor`] for [`MediaType::Code("javascript")`].
|
||||
//! Walks the tree-sitter parse tree (single grammar
|
||||
//! [`tree_sitter_javascript::LANGUAGE`] — the JS grammar handles `.jsx`
|
||||
//! as well, no second grammar needed) and emits one [`Block::Code`] per
|
||||
//! top-level AST semantic unit (free fn, class, each method,
|
||||
//! recursively per nested class), each carrying [`SourceSpan::Code`]
|
||||
//! with the unit's dotted symbol path prefixed by
|
||||
//! [`module_path_for_tsjs`].
|
||||
//!
|
||||
//! Glue declarations (`import_statement`, bare `export_statement`
|
||||
//! re-exports, `lexical_declaration` / `variable_declaration` at the
|
||||
//! module level, etc.) collapse into one grouped `<top-level>` (or
|
||||
//! `<module>`) unit.
|
||||
//!
|
||||
//! `export_statement` is unwrapped: an `export function|class` is
|
||||
//! treated as the inner declaration arm but the unit's line range
|
||||
//! comes from the OUTER `export_statement` so the `export ` prefix is
|
||||
//! folded in. `export default function () {}` / `export default class
|
||||
//! {}` (no `name` field) emits `default` as the symbol name.
|
||||
//!
|
||||
//! Differs from `typescript.rs` only by: single-grammar (no
|
||||
//! TS/TSX selection) and no `interface_declaration` /
|
||||
//! `type_alias_declaration` / `enum_declaration` arms (TS-only). All
|
||||
//! other walker behavior (export unwrap with `value`-field quirk for
|
||||
//! default-exported anonymous function/class, class-body method walk,
|
||||
//! glue flush, post-pass `<module>` → `<top-level>` rewrite) is
|
||||
//! identical.
|
||||
//!
|
||||
//! Scope follows 1A-2 / 1B Task K: AST unit extraction + dotted symbol
|
||||
//! paths + line ranges. Per design §3.4 / §9.1 / §9 versioning.
|
||||
|
||||
use anyhow::Result;
|
||||
use kebab_core::{
|
||||
Block, CanonicalDocument, CodeBlock, CommonBlock, Extractor, Lang, MediaType, Metadata,
|
||||
ParserVersion, Provenance, ProvenanceEvent, ProvenanceKind, SourceSpan, SourceType, TrustLevel,
|
||||
id_for_block, id_for_doc,
|
||||
};
|
||||
use serde_json::Map;
|
||||
use time::OffsetDateTime;
|
||||
|
||||
use crate::scaffold::{filename_from_workspace_path, join_symbol, strip_extension};
|
||||
|
||||
pub const PARSER_VERSION: &str = "code-js-v1";
|
||||
|
||||
/// JavaScript / JSX AST extractor. Per-unit blocks via
|
||||
/// tree-sitter-javascript 0.25 (single `LANGUAGE` `LanguageFn` — the
|
||||
/// JS grammar covers `.jsx` natively, no second grammar) parsed by
|
||||
/// tree-sitter 0.26.
|
||||
pub struct JavascriptAstExtractor;
|
||||
|
||||
impl JavascriptAstExtractor {
|
||||
pub fn new() -> Self {
|
||||
Self
|
||||
}
|
||||
}
|
||||
|
||||
impl Default for JavascriptAstExtractor {
|
||||
fn default() -> Self {
|
||||
Self::new()
|
||||
}
|
||||
}
|
||||
|
||||
impl Extractor for JavascriptAstExtractor {
|
||||
fn supports(&self, m: &MediaType) -> bool {
|
||||
matches!(m, MediaType::Code(l) if l == "javascript")
|
||||
}
|
||||
|
||||
fn parser_version(&self) -> ParserVersion {
|
||||
ParserVersion(PARSER_VERSION.to_string())
|
||||
}
|
||||
|
||||
fn extract(
|
||||
&self,
|
||||
ctx: &kebab_core::ExtractContext<'_>,
|
||||
bytes: &[u8],
|
||||
) -> Result<CanonicalDocument> {
|
||||
let asset = ctx.asset;
|
||||
if !self.supports(&asset.media_type) {
|
||||
anyhow::bail!(
|
||||
"kebab-parse-code: unsupported media_type for JavascriptAstExtractor: {:?}",
|
||||
asset.media_type
|
||||
);
|
||||
}
|
||||
|
||||
let parser_version = self.parser_version();
|
||||
let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, &parser_version);
|
||||
|
||||
let source = String::from_utf8(bytes.to_vec()).map_err(|e| {
|
||||
anyhow::anyhow!("kebab-parse-code: JavaScript source is not valid UTF-8: {e}")
|
||||
})?;
|
||||
|
||||
let mod_prefix = crate::lang::module_path_for_tsjs(&asset.workspace_path.0);
|
||||
let language: tree_sitter::Language = tree_sitter_javascript::LANGUAGE.into();
|
||||
let blocks = build_blocks(&source, &doc_id, &mod_prefix, language)?;
|
||||
let unit_count = blocks.len() as u32;
|
||||
|
||||
let now = OffsetDateTime::now_utc();
|
||||
let mut events: Vec<ProvenanceEvent> = Vec::with_capacity(2);
|
||||
events.push(ProvenanceEvent {
|
||||
at: asset.discovered_at,
|
||||
agent: "kb-source-fs".to_string(),
|
||||
kind: ProvenanceKind::Discovered,
|
||||
note: None,
|
||||
});
|
||||
events.push(ProvenanceEvent {
|
||||
at: now,
|
||||
agent: "kb-parse-code".to_string(),
|
||||
kind: ProvenanceKind::Parsed,
|
||||
note: Some(format!(
|
||||
"parser_version={}; unit_count={}",
|
||||
parser_version.0, unit_count
|
||||
)),
|
||||
});
|
||||
|
||||
let title = {
|
||||
let fname = filename_from_workspace_path(&asset.workspace_path.0);
|
||||
strip_extension(&fname)
|
||||
};
|
||||
|
||||
// Resolve the file's absolute path for repo detection. If the
|
||||
// source URI carries a relative path, anchor it at the workspace
|
||||
// root so the `.git/` walk-up starts from the right place.
|
||||
let abs_path = match &asset.source_uri {
|
||||
kebab_core::SourceUri::File(p) => {
|
||||
if p.is_absolute() {
|
||||
p.clone()
|
||||
} else {
|
||||
ctx.workspace_root.join(p)
|
||||
}
|
||||
}
|
||||
kebab_core::SourceUri::Kb(_) => ctx.workspace_root.to_path_buf(),
|
||||
};
|
||||
let (repo, git_branch, git_commit) = match crate::repo::detect_repo(&abs_path) {
|
||||
Some(r) => (Some(r.name), r.branch, r.commit),
|
||||
None => (None, None, None),
|
||||
};
|
||||
|
||||
let metadata = Metadata {
|
||||
aliases: Vec::new(),
|
||||
tags: Vec::new(),
|
||||
created_at: asset.discovered_at,
|
||||
updated_at: asset.discovered_at,
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Map::new(),
|
||||
repo,
|
||||
git_branch,
|
||||
git_commit,
|
||||
code_lang: Some("javascript".to_string()),
|
||||
};
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-parse-code",
|
||||
"extracted JavaScript doc_id={} workspace_path={} units={}",
|
||||
doc_id.0,
|
||||
asset.workspace_path.0,
|
||||
unit_count
|
||||
);
|
||||
|
||||
Ok(CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: asset.asset_id.clone(),
|
||||
workspace_path: asset.workspace_path.clone(),
|
||||
title,
|
||||
lang: Lang("und".to_string()),
|
||||
blocks,
|
||||
metadata,
|
||||
provenance: Provenance { events },
|
||||
parser_version,
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
fn build_blocks(
|
||||
source: &str,
|
||||
doc_id: &kebab_core::DocumentId,
|
||||
mod_prefix: &str,
|
||||
language: tree_sitter::Language,
|
||||
) -> anyhow::Result<Vec<kebab_core::Block>> {
|
||||
let mut parser = tree_sitter::Parser::new();
|
||||
parser
|
||||
.set_language(&language)
|
||||
.map_err(|e| anyhow::anyhow!("set tree-sitter-javascript language: {e}"))?;
|
||||
let tree = parser
|
||||
.parse(source.as_bytes(), None)
|
||||
.ok_or_else(|| anyhow::anyhow!("tree-sitter failed to parse JavaScript source"))?;
|
||||
let lines: Vec<&str> = source.split('\n').collect();
|
||||
|
||||
// units: (symbol, line_start, line_end, is_real_semantic_unit).
|
||||
// Glue groups are pushed with a sentinel symbol + is_real=false so a
|
||||
// post-pass can decide `<module>` vs `<top-level>` (same algorithm
|
||||
// as 1A Gap 1 / 1B Python / 1B TS).
|
||||
let mut units: Vec<(String, u32, u32, bool)> = Vec::new();
|
||||
// (is_module_only_kind 0/1, s, e). `is_module_only_kind` flags
|
||||
// `import_statement` and bare re-export `export_statement`s — used by
|
||||
// the glue flush to pick `<module>` vs `<top-level>` provisional
|
||||
// label (1A's `is_mod_decl` analog).
|
||||
let mut glue: Vec<(usize, u32, u32)> = Vec::new();
|
||||
|
||||
/// Walk preceding `comment` siblings to extend the unit's line range
|
||||
/// upward, folding leading doc / line comments into the unit.
|
||||
fn unit_start(n: &tree_sitter::Node) -> u32 {
|
||||
let mut start = n.start_position().row as u32 + 1;
|
||||
let mut prev = n.prev_sibling();
|
||||
while let Some(p) = prev {
|
||||
if p.kind() == "comment" {
|
||||
start = p.start_position().row as u32 + 1;
|
||||
prev = p.prev_sibling();
|
||||
} else {
|
||||
break;
|
||||
}
|
||||
}
|
||||
start
|
||||
}
|
||||
fn name_text<'a>(n: &tree_sitter::Node, src: &'a str) -> Option<&'a str> {
|
||||
n.child_by_field_name("name")
|
||||
.map(|c| &src[c.start_byte()..c.end_byte()])
|
||||
}
|
||||
/// Walk a class body, emitting one unit per `method_definition`.
|
||||
/// Class names already pushed onto `mod_path` by the caller, so
|
||||
/// method symbols come out as `<mod_prefix>.<Class>.<method>`.
|
||||
fn walk_class_body(
|
||||
body: tree_sitter::Node,
|
||||
src: &str,
|
||||
mod_prefix: &str,
|
||||
mod_path: &[String],
|
||||
units: &mut Vec<(String, u32, u32, bool)>,
|
||||
) {
|
||||
let mut cur = body.walk();
|
||||
for child in body.named_children(&mut cur) {
|
||||
if child.kind() == "method_definition" {
|
||||
if let Some(name) = name_text(&child, src) {
|
||||
let s = unit_start(&child);
|
||||
let e = child.end_position().row as u32 + 1;
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, s, e, true));
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
fn walk(
|
||||
node: tree_sitter::Node,
|
||||
src: &str,
|
||||
mod_prefix: &str,
|
||||
mod_path: &[String],
|
||||
units: &mut Vec<(String, u32, u32, bool)>,
|
||||
glue: &mut Vec<(usize, u32, u32)>,
|
||||
) {
|
||||
let mut cur = node.walk();
|
||||
for child in node.named_children(&mut cur) {
|
||||
let s = unit_start(&child);
|
||||
let e = child.end_position().row as u32 + 1;
|
||||
match child.kind() {
|
||||
"function_declaration" => {
|
||||
if let Some(name) = name_text(&child, src) {
|
||||
glue.retain(|(_, gs, _)| *gs < s);
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, s, e, true));
|
||||
}
|
||||
}
|
||||
"class_declaration" => {
|
||||
if let Some(name) = name_text(&child, src) {
|
||||
glue.retain(|(_, gs, _)| *gs < s);
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, s, e, true));
|
||||
if let Some(body) = child.child_by_field_name("body") {
|
||||
let mut np = mod_path.to_vec();
|
||||
np.push(name.to_string());
|
||||
walk_class_body(body, src, mod_prefix, &np, units);
|
||||
}
|
||||
}
|
||||
}
|
||||
"export_statement" => {
|
||||
// Try field "declaration" first (export class /
|
||||
// function). If absent, fall back to "value" —
|
||||
// `export default function () {}` / `export default
|
||||
// class {}` expose the anonymous function_expression
|
||||
// / class under the `value` field (same grammar
|
||||
// quirk as TS 0.23).
|
||||
let outer_s = s; // includes `export ` prefix line
|
||||
let outer_e = e;
|
||||
if let Some(inner) = child.child_by_field_name("declaration") {
|
||||
let inner_kind = inner.kind();
|
||||
match inner_kind {
|
||||
"function_declaration" | "class_declaration" => {
|
||||
let name_opt = name_text(&inner, src).map(|s| s.to_string());
|
||||
if let Some(name) = name_opt {
|
||||
glue.retain(|(_, gs, _)| *gs < outer_s);
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
let sym = join_symbol(mod_prefix, mod_path, &name);
|
||||
units.push((sym, outer_s, outer_e, true));
|
||||
if inner_kind == "class_declaration" {
|
||||
if let Some(body) = inner.child_by_field_name("body") {
|
||||
let mut np = mod_path.to_vec();
|
||||
np.push(name);
|
||||
walk_class_body(body, src, mod_prefix, &np, units);
|
||||
}
|
||||
}
|
||||
} else {
|
||||
// Defensive: `export default` with a
|
||||
// function_declaration that somehow
|
||||
// lacks `name`. Emit `default`.
|
||||
glue.retain(|(_, gs, _)| *gs < outer_s);
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
let sym = join_symbol(mod_prefix, mod_path, "default");
|
||||
units.push((sym, outer_s, outer_e, true));
|
||||
}
|
||||
}
|
||||
// `lexical_declaration` etc. wrapped in
|
||||
// export: treat as glue (assigned arrow
|
||||
// fns / consts don't get their own unit).
|
||||
_ => {
|
||||
glue.push((0, s, e));
|
||||
}
|
||||
}
|
||||
} else if let Some(value) = child.child_by_field_name("value") {
|
||||
// `export default <expr>`. We emit a unit only
|
||||
// for the function / class shapes (named or
|
||||
// anonymous); other value shapes are glue.
|
||||
match value.kind() {
|
||||
"function_expression"
|
||||
| "function_declaration"
|
||||
| "class"
|
||||
| "class_declaration" => {
|
||||
let name_opt = name_text(&value, src).map(|s| s.to_string());
|
||||
let leaf =
|
||||
name_opt.as_deref().unwrap_or("default").to_string();
|
||||
glue.retain(|(_, gs, _)| *gs < outer_s);
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
let sym = join_symbol(mod_prefix, mod_path, &leaf);
|
||||
units.push((sym, outer_s, outer_e, true));
|
||||
// Recurse into class body if we have one.
|
||||
if matches!(value.kind(), "class" | "class_declaration") {
|
||||
if let Some(body) = value.child_by_field_name("body") {
|
||||
let mut np = mod_path.to_vec();
|
||||
np.push(leaf);
|
||||
walk_class_body(body, src, mod_prefix, &np, units);
|
||||
}
|
||||
}
|
||||
}
|
||||
_ => {
|
||||
glue.push((0, s, e));
|
||||
}
|
||||
}
|
||||
} else {
|
||||
// Bare `export { x };` / `export * from "..."` —
|
||||
// a re-export, glue with module-only flag set
|
||||
// (we have no `declaration` / `value` field for
|
||||
// it).
|
||||
glue.push((1, s, e));
|
||||
}
|
||||
}
|
||||
"import_statement" => {
|
||||
glue.push((1, s, e));
|
||||
}
|
||||
"lexical_declaration" | "variable_declaration" => {
|
||||
glue.push((0, s, e));
|
||||
}
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
}
|
||||
fn flush_glue(
|
||||
glue: &mut Vec<(usize, u32, u32)>,
|
||||
units: &mut Vec<(String, u32, u32, bool)>,
|
||||
mod_prefix: &str,
|
||||
mod_path: &[String],
|
||||
) {
|
||||
if glue.is_empty() {
|
||||
return;
|
||||
}
|
||||
let s = glue.iter().map(|(_, a, _)| *a).min().unwrap();
|
||||
let e = glue.iter().map(|(_, _, b)| *b).max().unwrap();
|
||||
let only_module = glue.iter().all(|(is_mod, _, _)| *is_mod == 1);
|
||||
let label = if only_module { "<module>" } else { "<top-level>" };
|
||||
units.push((join_symbol(mod_prefix, mod_path, label), s, e, false));
|
||||
glue.clear();
|
||||
}
|
||||
|
||||
walk(
|
||||
tree.root_node(),
|
||||
source,
|
||||
mod_prefix,
|
||||
&[],
|
||||
&mut units,
|
||||
&mut glue,
|
||||
);
|
||||
|
||||
// `<module>` is correct only when the file produced no real unit.
|
||||
// Otherwise the import-only group becomes `<top-level>` (same
|
||||
// post-pass as 1A Gap 1 / Python / TS).
|
||||
let has_real_unit = units.iter().any(|(_, _, _, is_real)| *is_real);
|
||||
if has_real_unit {
|
||||
for (sym, _, _, is_real) in units.iter_mut() {
|
||||
if !*is_real && sym.ends_with("<module>") {
|
||||
let pre = &sym[..sym.len() - "<module>".len()];
|
||||
*sym = format!("{pre}<top-level>");
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let total_lines = lines.len() as u32;
|
||||
let mut blocks = Vec::with_capacity(units.len());
|
||||
for (ordinal, (symbol, ls, le, _is_real)) in units.into_iter().enumerate() {
|
||||
let line_start = ls.max(1);
|
||||
let line_end = le.min(total_lines.max(1));
|
||||
let span = SourceSpan::Code {
|
||||
line_start,
|
||||
line_end,
|
||||
symbol: Some(symbol),
|
||||
lang: Some("javascript".to_string()),
|
||||
};
|
||||
let block_id = id_for_block(doc_id, "code", &[], ordinal as u32, &span);
|
||||
let code = lines[(line_start as usize - 1)..=(line_end as usize - 1)].join("\n");
|
||||
blocks.push(Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id,
|
||||
heading_path: Vec::new(),
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some("javascript".to_string()),
|
||||
code,
|
||||
}));
|
||||
}
|
||||
Ok(blocks)
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use kebab_core::{Block, MediaType, SourceSpan};
|
||||
|
||||
fn extract_fixture(workspace_path: &str) -> kebab_core::CanonicalDocument {
|
||||
let bytes = std::fs::read(
|
||||
concat!(env!("CARGO_MANIFEST_DIR"), "/tests/fixtures/sample.js"),
|
||||
)
|
||||
.unwrap();
|
||||
let asset = crate::rust::tests_support::fixed_code_asset(workspace_path, "javascript");
|
||||
let cfg = kebab_core::ExtractConfig::default();
|
||||
let root = std::path::PathBuf::from("/tmp");
|
||||
let ctx = kebab_core::ExtractContext {
|
||||
asset: &asset,
|
||||
workspace_root: &root,
|
||||
config: &cfg,
|
||||
};
|
||||
JavascriptAstExtractor::new().extract(&ctx, &bytes).unwrap()
|
||||
}
|
||||
fn symbols(doc: &kebab_core::CanonicalDocument) -> Vec<String> {
|
||||
let mut s: Vec<String> = doc
|
||||
.blocks
|
||||
.iter()
|
||||
.filter_map(|b| match b {
|
||||
Block::Code(c) => match &c.common.source_span {
|
||||
SourceSpan::Code { symbol, lang, .. } => {
|
||||
assert_eq!(lang.as_deref(), Some("javascript"));
|
||||
symbol.clone()
|
||||
}
|
||||
_ => None,
|
||||
},
|
||||
_ => None,
|
||||
})
|
||||
.collect();
|
||||
s.sort();
|
||||
s
|
||||
}
|
||||
#[test]
|
||||
fn extractor_supports_only_media_code_javascript() {
|
||||
let e = JavascriptAstExtractor::new();
|
||||
assert!(e.supports(&MediaType::Code("javascript".into())));
|
||||
assert!(!e.supports(&MediaType::Code("typescript".into())));
|
||||
assert!(!e.supports(&MediaType::Markdown));
|
||||
}
|
||||
#[test]
|
||||
fn js_units_match_design_3_4_symbols() {
|
||||
let doc = extract_fixture("src/sample.js");
|
||||
let syms = symbols(&doc);
|
||||
assert!(syms.iter().any(|s| s == "src/sample.add"), "got {syms:?}");
|
||||
assert!(syms.iter().any(|s| s == "src/sample.Retriever"));
|
||||
assert!(syms.iter().any(|s| s == "src/sample.Retriever.search"));
|
||||
assert!(syms.iter().any(|s| s == "src/sample.Retriever.create"));
|
||||
assert!(syms.iter().any(|s| s == "src/sample.default"));
|
||||
assert!(syms.iter().any(|s| s == "src/sample.<top-level>"));
|
||||
}
|
||||
#[test]
|
||||
fn jsx_via_js_grammar() {
|
||||
// tree-sitter-javascript handles .jsx via the same single grammar.
|
||||
let bytes = b"export function App() { return null; }\n";
|
||||
let asset = crate::rust::tests_support::fixed_code_asset("src/App.jsx", "javascript");
|
||||
let cfg = kebab_core::ExtractConfig::default();
|
||||
let root = std::path::PathBuf::from("/tmp");
|
||||
let ctx = kebab_core::ExtractContext {
|
||||
asset: &asset,
|
||||
workspace_root: &root,
|
||||
config: &cfg,
|
||||
};
|
||||
let doc = JavascriptAstExtractor::new().extract(&ctx, bytes).unwrap();
|
||||
let syms = symbols(&doc);
|
||||
assert!(syms.iter().any(|s| s == "src/App.App"), "got {syms:?}");
|
||||
}
|
||||
#[test]
|
||||
fn deterministic_across_runs() {
|
||||
let a = extract_fixture("src/sample.js");
|
||||
for _ in 0..30 {
|
||||
assert_eq!(extract_fixture("src/sample.js").blocks, a.blocks);
|
||||
}
|
||||
}
|
||||
|
||||
/// In tree-sitter-javascript, `decorator` is a CHILD of
|
||||
/// `method_definition` (stored in the `decorator` field), so
|
||||
/// `method_definition.start_row` already covers the decorator line
|
||||
/// without any sibling walk. Verify that the emitted unit already
|
||||
/// includes the decorator line and line_start is 2 (the @Log() line).
|
||||
#[test]
|
||||
fn js_class_method_decorator_already_folded_by_grammar() {
|
||||
// Line 1 (1-indexed): "class Foo {"
|
||||
// Line 2: " @Log()" <- decorator (child of method_definition in JS grammar)
|
||||
// Line 3: " bar() { return 1; }"
|
||||
// Line 4: "}"
|
||||
let bytes = b"class Foo {\n @Log()\n bar() { return 1; }\n}\n";
|
||||
let asset = crate::rust::tests_support::fixed_code_asset("src/foo.js", "javascript");
|
||||
let cfg = kebab_core::ExtractConfig::default();
|
||||
let root = std::path::PathBuf::from("/tmp");
|
||||
let ctx = kebab_core::ExtractContext {
|
||||
asset: &asset,
|
||||
workspace_root: &root,
|
||||
config: &cfg,
|
||||
};
|
||||
let doc = JavascriptAstExtractor::new().extract(&ctx, bytes).unwrap();
|
||||
|
||||
let bar_block = doc
|
||||
.blocks
|
||||
.iter()
|
||||
.find_map(|b| match b {
|
||||
Block::Code(c) => match &c.common.source_span {
|
||||
SourceSpan::Code { symbol, .. }
|
||||
if symbol.as_deref() == Some("src/foo.Foo.bar") =>
|
||||
{
|
||||
Some(c)
|
||||
}
|
||||
_ => None,
|
||||
},
|
||||
_ => None,
|
||||
})
|
||||
.expect("src/foo.Foo.bar block should be present");
|
||||
|
||||
// JS grammar: method_definition.start_row == decorator row, so
|
||||
// no sibling walk change needed -- decorator is already included.
|
||||
assert!(
|
||||
bar_block.code.contains("@Log()"),
|
||||
"JS method unit must include decorator (grammar folds it natively); got: {:?}",
|
||||
bar_block.code
|
||||
);
|
||||
match &bar_block.common.source_span {
|
||||
SourceSpan::Code { line_start, .. } => {
|
||||
assert_eq!(
|
||||
*line_start, 2,
|
||||
"JS line_start must cover the @Log() decorator line (got {line_start})"
|
||||
);
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
}
|
||||
}
|
||||
121
crates/kebab-parse-code/src/lang.rs
Normal file
121
crates/kebab-parse-code/src/lang.rs
Normal file
@@ -0,0 +1,121 @@
|
||||
//! Canonical extension → language identifier mapping (spec §3.5).
|
||||
//!
|
||||
//! Lowercase canonical identifiers, matching tree-sitter parser conventions:
|
||||
//! `rust`, `python`, `typescript`, `javascript`, `go`, `java`, `kotlin`, `c`,
|
||||
//! `cpp`, `yaml`, `toml`, `json`, `shell`, `make`, `dockerfile`.
|
||||
|
||||
use std::path::Path;
|
||||
|
||||
/// Returns the canonical language identifier for a given file path, or
|
||||
/// `None` if the extension / filename is not recognized.
|
||||
///
|
||||
/// Matching priority:
|
||||
/// 1. exact filename match (e.g. `Dockerfile`, `Makefile`)
|
||||
/// 2. lowercase extension match
|
||||
pub fn code_lang_for_path(path: &Path) -> Option<&'static str> {
|
||||
if let Some(name) = path.file_name().and_then(|n| n.to_str()) {
|
||||
match name {
|
||||
"Dockerfile" => return Some("dockerfile"),
|
||||
"Makefile" | "GNUmakefile" => return Some("make"),
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
let ext = path.extension()?.to_str()?.to_ascii_lowercase();
|
||||
match ext.as_str() {
|
||||
"rs" => Some("rust"),
|
||||
"py" | "pyi" => Some("python"),
|
||||
"ts" | "tsx" | "mts" | "cts" => Some("typescript"),
|
||||
"js" | "mjs" | "cjs" | "jsx" => Some("javascript"),
|
||||
"go" => Some("go"),
|
||||
"java" => Some("java"),
|
||||
"kt" | "kts" => Some("kotlin"),
|
||||
"c" | "h" => Some("c"),
|
||||
"cpp" | "cc" | "cxx" | "hpp" | "hh" | "hxx" => Some("cpp"),
|
||||
"yaml" | "yml" => Some("yaml"),
|
||||
"toml" => Some("toml"),
|
||||
"json" => Some("json"),
|
||||
"sh" | "bash" | "zsh" => Some("shell"),
|
||||
"mk" => Some("make"),
|
||||
"dockerfile" => Some("dockerfile"),
|
||||
_ => None,
|
||||
}
|
||||
}
|
||||
|
||||
/// p10-1B: workspace-relative Python file path → dotted module-path prefix.
|
||||
/// See plan §Task C for the exact rules + tasks/p10/p10-1b for the §3.4
|
||||
/// design contract.
|
||||
///
|
||||
/// Stripped source-roots: `src/`, `lib/`, and `crates/<crate>/src/`.
|
||||
/// `tests/`, `examples/`, and `benches/` are intentionally NOT stripped —
|
||||
/// they appear in test/example/bench namespaces and dropping them would
|
||||
/// conflate identical symbol names across conventional Python directories
|
||||
/// (e.g. `tests/test_foo.py` → `tests.test_foo`, not `test_foo`).
|
||||
pub fn module_path_for_python(workspace_path: &str) -> String {
|
||||
let mut p: &str = workspace_path;
|
||||
if let Some(rest) = p.strip_prefix("crates/") {
|
||||
if let Some(slash) = rest.find('/') {
|
||||
let after = &rest[slash + 1..];
|
||||
if let Some(stripped) = after.strip_prefix("src/") {
|
||||
p = stripped;
|
||||
}
|
||||
}
|
||||
} else if let Some(stripped) = p.strip_prefix("src/") {
|
||||
p = stripped;
|
||||
} else if let Some(stripped) = p.strip_prefix("lib/") {
|
||||
p = stripped;
|
||||
}
|
||||
let p = match p.strip_suffix(".py") {
|
||||
Some(s) => s,
|
||||
None => p.strip_suffix(".pyi").unwrap_or(p),
|
||||
};
|
||||
let p = if let Some(parent) = p.strip_suffix("/__init__") {
|
||||
parent
|
||||
} else if p == "__init__" {
|
||||
""
|
||||
} else {
|
||||
p
|
||||
};
|
||||
p.replace('/', ".")
|
||||
}
|
||||
|
||||
/// p10-1B: workspace-relative TS/JS file path → path-style prefix
|
||||
/// (no slash replacement, no source-root strip). See plan §Task C.
|
||||
pub fn module_path_for_tsjs(workspace_path: &str) -> String {
|
||||
let p = workspace_path;
|
||||
for ext in [".tsx", ".mts", ".cts", ".ts", ".jsx", ".mjs", ".cjs", ".js"] {
|
||||
if let Some(stripped) = p.strip_suffix(ext) {
|
||||
return stripped.to_string();
|
||||
}
|
||||
}
|
||||
p.to_string()
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn module_path_for_python_strips_src_roots_and_extensions() {
|
||||
assert_eq!(module_path_for_python("kebab_eval/metrics.py"), "kebab_eval.metrics");
|
||||
assert_eq!(module_path_for_python("kebab_eval/__init__.py"), "kebab_eval");
|
||||
assert_eq!(module_path_for_python("src/foo/bar.py"), "foo.bar");
|
||||
assert_eq!(module_path_for_python("crates/x/src/foo/bar.py"), "foo.bar");
|
||||
assert_eq!(module_path_for_python("a/b/c.pyi"), "a.b.c");
|
||||
assert_eq!(module_path_for_python("standalone.py"), "standalone");
|
||||
assert_eq!(module_path_for_python("src/__init__.py"), "");
|
||||
// `tests/` is NOT a stripped source-root — it is preserved as
|
||||
// part of the module path so test symbols stay namespaced.
|
||||
assert_eq!(module_path_for_python("tests/test_foo.py"), "tests.test_foo");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn module_path_for_tsjs_keeps_slashes_and_strips_ext() {
|
||||
for ext in ["ts", "tsx", "mts", "cts", "js", "jsx", "mjs", "cjs"] {
|
||||
let p = format!("src/search/retriever/Retriever.{ext}");
|
||||
assert_eq!(module_path_for_tsjs(&p), "src/search/retriever/Retriever");
|
||||
}
|
||||
assert_eq!(module_path_for_tsjs("foo.ts"), "foo");
|
||||
assert_eq!(module_path_for_tsjs("a/b/c.ts"), "a/b/c");
|
||||
assert_eq!(module_path_for_tsjs("packages/x/src/Foo.ts"), "packages/x/src/Foo");
|
||||
}
|
||||
}
|
||||
31
crates/kebab-parse-code/src/lib.rs
Normal file
31
crates/kebab-parse-code/src/lib.rs
Normal file
@@ -0,0 +1,31 @@
|
||||
//! `kebab-parse-code` — language-aware parsing for code corpora.
|
||||
//!
|
||||
//! Phase 1A-1 ships infrastructure only:
|
||||
//!
|
||||
//! - [`lang::code_lang_for_path`] — extension → language identifier.
|
||||
//! - [`repo::detect_repo`] — `.git/` walk-up → repo / branch / commit metadata.
|
||||
//! - [`skip::is_generated_file`] / [`skip::is_oversized`] — pre-ingest skip
|
||||
//! helpers consulted by `kebab-source-fs`.
|
||||
//! - [`skip::BUILTIN_BLACKLIST`] — 6-entry safety-net pattern list.
|
||||
//!
|
||||
//! Per-language parser modules (`rust`, `python`, `typescript`, …) land in
|
||||
//! later phases (1A-2 onwards). The crate boundary follows other
|
||||
//! `kebab-parse-*` crates per design §8: must NOT depend on store / embed
|
||||
//! / llm / rag.
|
||||
|
||||
pub mod javascript;
|
||||
pub mod lang;
|
||||
pub mod python;
|
||||
pub mod repo;
|
||||
pub mod rust;
|
||||
pub(crate) mod scaffold;
|
||||
pub mod skip;
|
||||
pub mod typescript;
|
||||
|
||||
pub use javascript::{PARSER_VERSION as JS_PARSER_VERSION, JavascriptAstExtractor};
|
||||
pub use lang::{code_lang_for_path, module_path_for_python, module_path_for_tsjs};
|
||||
pub use python::{PARSER_VERSION as PYTHON_PARSER_VERSION, PythonAstExtractor};
|
||||
pub use repo::{RepoMeta, detect_repo};
|
||||
pub use rust::{PARSER_VERSION as RUST_PARSER_VERSION, RustAstExtractor};
|
||||
pub use skip::{BUILTIN_BLACKLIST, is_generated_file, is_oversized};
|
||||
pub use typescript::{PARSER_VERSION as TS_PARSER_VERSION, TypescriptAstExtractor};
|
||||
437
crates/kebab-parse-code/src/python.rs
Normal file
437
crates/kebab-parse-code/src/python.rs
Normal file
@@ -0,0 +1,437 @@
|
||||
//! `kebab-parse-code::python` — tree-sitter Python AST extractor (P10-1B Task E).
|
||||
//!
|
||||
//! Implements [`kebab_core::Extractor`] for [`MediaType::Code("python")`].
|
||||
//! Walks the tree-sitter parse tree and emits one [`Block::Code`] per
|
||||
//! top-level AST semantic unit (free fn, class, each method, recursively
|
||||
//! per nested class), each carrying [`SourceSpan::Code`] with the unit's
|
||||
//! dotted self-reference symbol path prefixed by `module_path_for_python`
|
||||
//! (design §3.4). Glue declarations (`import` / `import from` /
|
||||
//! `expression_statement` / `assignment` / `global_statement` /
|
||||
//! `future_import_statement`) collapse into one grouped `<top-level>`
|
||||
//! (or `<module>`) unit.
|
||||
//!
|
||||
//! Decorators are folded into the decorated unit's line range via the
|
||||
//! `decorated_definition` unwrap arm (analog of the Rust `attribute_item`
|
||||
//! re-absorption in 1A — see §9.1).
|
||||
//!
|
||||
//! Scope follows 1A: AST unit extraction + dotted symbol paths + line
|
||||
//! ranges. Per design §3.4 / §9.1 / §9 versioning.
|
||||
|
||||
use anyhow::Result;
|
||||
use kebab_core::{
|
||||
Block, CanonicalDocument, CodeBlock, CommonBlock, Extractor, Lang, MediaType, Metadata,
|
||||
ParserVersion, Provenance, ProvenanceEvent, ProvenanceKind, SourceSpan, SourceType, TrustLevel,
|
||||
id_for_block, id_for_doc,
|
||||
};
|
||||
use serde_json::Map;
|
||||
use time::OffsetDateTime;
|
||||
|
||||
use crate::scaffold::{filename_from_workspace_path, join_symbol, strip_extension};
|
||||
|
||||
pub const PARSER_VERSION: &str = "code-python-v1";
|
||||
|
||||
/// Python AST extractor. Per-unit blocks via tree-sitter-python 0.25
|
||||
/// (`LANGUAGE: LanguageFn`) parsed by tree-sitter 0.26.
|
||||
pub struct PythonAstExtractor;
|
||||
|
||||
impl PythonAstExtractor {
|
||||
pub fn new() -> Self {
|
||||
Self
|
||||
}
|
||||
}
|
||||
|
||||
impl Default for PythonAstExtractor {
|
||||
fn default() -> Self {
|
||||
Self::new()
|
||||
}
|
||||
}
|
||||
|
||||
impl Extractor for PythonAstExtractor {
|
||||
fn supports(&self, m: &MediaType) -> bool {
|
||||
matches!(m, MediaType::Code(l) if l == "python")
|
||||
}
|
||||
|
||||
fn parser_version(&self) -> ParserVersion {
|
||||
ParserVersion(PARSER_VERSION.to_string())
|
||||
}
|
||||
|
||||
fn extract(
|
||||
&self,
|
||||
ctx: &kebab_core::ExtractContext<'_>,
|
||||
bytes: &[u8],
|
||||
) -> Result<CanonicalDocument> {
|
||||
let asset = ctx.asset;
|
||||
if !self.supports(&asset.media_type) {
|
||||
anyhow::bail!(
|
||||
"kebab-parse-code: unsupported media_type for PythonAstExtractor: {:?}",
|
||||
asset.media_type
|
||||
);
|
||||
}
|
||||
|
||||
let parser_version = self.parser_version();
|
||||
let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, &parser_version);
|
||||
|
||||
let source = String::from_utf8(bytes.to_vec()).map_err(|e| {
|
||||
anyhow::anyhow!("kebab-parse-code: Python source is not valid UTF-8: {e}")
|
||||
})?;
|
||||
|
||||
let mod_prefix = crate::lang::module_path_for_python(&asset.workspace_path.0);
|
||||
let blocks = build_blocks(&source, &doc_id, &mod_prefix)?;
|
||||
let unit_count = blocks.len() as u32;
|
||||
|
||||
let now = OffsetDateTime::now_utc();
|
||||
let mut events: Vec<ProvenanceEvent> = Vec::with_capacity(2);
|
||||
events.push(ProvenanceEvent {
|
||||
at: asset.discovered_at,
|
||||
agent: "kb-source-fs".to_string(),
|
||||
kind: ProvenanceKind::Discovered,
|
||||
note: None,
|
||||
});
|
||||
events.push(ProvenanceEvent {
|
||||
at: now,
|
||||
agent: "kb-parse-code".to_string(),
|
||||
kind: ProvenanceKind::Parsed,
|
||||
note: Some(format!(
|
||||
"parser_version={}; unit_count={}",
|
||||
parser_version.0, unit_count
|
||||
)),
|
||||
});
|
||||
|
||||
let title = {
|
||||
let fname = filename_from_workspace_path(&asset.workspace_path.0);
|
||||
strip_extension(&fname)
|
||||
};
|
||||
|
||||
// Resolve the file's absolute path for repo detection. If the
|
||||
// source URI carries a relative path, anchor it at the workspace
|
||||
// root so the `.git/` walk-up starts from the right place.
|
||||
let abs_path = match &asset.source_uri {
|
||||
kebab_core::SourceUri::File(p) => {
|
||||
if p.is_absolute() {
|
||||
p.clone()
|
||||
} else {
|
||||
ctx.workspace_root.join(p)
|
||||
}
|
||||
}
|
||||
kebab_core::SourceUri::Kb(_) => ctx.workspace_root.to_path_buf(),
|
||||
};
|
||||
let (repo, git_branch, git_commit) = match crate::repo::detect_repo(&abs_path) {
|
||||
Some(r) => (Some(r.name), r.branch, r.commit),
|
||||
None => (None, None, None),
|
||||
};
|
||||
|
||||
let metadata = Metadata {
|
||||
aliases: Vec::new(),
|
||||
tags: Vec::new(),
|
||||
created_at: asset.discovered_at,
|
||||
updated_at: asset.discovered_at,
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Map::new(),
|
||||
repo,
|
||||
git_branch,
|
||||
git_commit,
|
||||
code_lang: Some("python".to_string()),
|
||||
};
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-parse-code",
|
||||
"extracted Python doc_id={} workspace_path={} units={}",
|
||||
doc_id.0,
|
||||
asset.workspace_path.0,
|
||||
unit_count
|
||||
);
|
||||
|
||||
Ok(CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: asset.asset_id.clone(),
|
||||
workspace_path: asset.workspace_path.clone(),
|
||||
title,
|
||||
lang: Lang("und".to_string()),
|
||||
blocks,
|
||||
metadata,
|
||||
provenance: Provenance { events },
|
||||
parser_version,
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
fn build_blocks(
|
||||
source: &str,
|
||||
doc_id: &kebab_core::DocumentId,
|
||||
mod_prefix: &str,
|
||||
) -> anyhow::Result<Vec<kebab_core::Block>> {
|
||||
let mut parser = tree_sitter::Parser::new();
|
||||
parser
|
||||
.set_language(&tree_sitter_python::LANGUAGE.into())
|
||||
.map_err(|e| anyhow::anyhow!("set tree-sitter-python language: {e}"))?;
|
||||
let tree = parser
|
||||
.parse(source.as_bytes(), None)
|
||||
.ok_or_else(|| anyhow::anyhow!("tree-sitter failed to parse Python source"))?;
|
||||
let lines: Vec<&str> = source.split('\n').collect();
|
||||
|
||||
// units: (symbol, line_start, line_end, is_real_semantic_unit).
|
||||
// Glue groups are pushed with a sentinel symbol + is_real=false so a
|
||||
// post-pass can decide `<module>` vs `<top-level>` (same algorithm
|
||||
// as 1A Gap 1).
|
||||
let mut units: Vec<(String, u32, u32, bool)> = Vec::new();
|
||||
// (is_import 0/1, s, e). `is_import` flags `import_statement` /
|
||||
// `import_from_statement` / `future_import_statement` — used by the
|
||||
// glue flush to pick `<module>` vs `<top-level>` provisional label
|
||||
// (1A's `is_mod_decl` analog).
|
||||
let mut glue: Vec<(usize, u32, u32)> = Vec::new();
|
||||
|
||||
fn node_name<'a>(n: &tree_sitter::Node, src: &'a str) -> Option<&'a str> {
|
||||
n.child_by_field_name("name")
|
||||
.map(|c| &src[c.start_byte()..c.end_byte()])
|
||||
}
|
||||
/// Walk preceding `comment` siblings to extend the unit's line range
|
||||
/// upward, folding leading doc / line comments into the unit. Note
|
||||
/// that Python decorators are NOT preceding siblings — they live
|
||||
/// INSIDE a `decorated_definition` parent — so they are handled by
|
||||
/// the unwrap arm below, not here.
|
||||
fn unit_start(n: &tree_sitter::Node) -> u32 {
|
||||
let mut start = n.start_position().row as u32 + 1;
|
||||
let mut prev = n.prev_sibling();
|
||||
while let Some(p) = prev {
|
||||
if p.kind() == "comment" {
|
||||
start = p.start_position().row as u32 + 1;
|
||||
prev = p.prev_sibling();
|
||||
} else {
|
||||
break;
|
||||
}
|
||||
}
|
||||
start
|
||||
}
|
||||
fn walk(
|
||||
node: tree_sitter::Node,
|
||||
src: &str,
|
||||
mod_prefix: &str,
|
||||
mod_path: &[String],
|
||||
units: &mut Vec<(String, u32, u32, bool)>,
|
||||
glue: &mut Vec<(usize, u32, u32)>,
|
||||
) {
|
||||
let mut cur = node.walk();
|
||||
for child in node.named_children(&mut cur) {
|
||||
// Default unit line range — overridden by the
|
||||
// `decorated_definition` unwrap arm so decorator lines are
|
||||
// included.
|
||||
let s = unit_start(&child);
|
||||
let e = child.end_position().row as u32 + 1;
|
||||
match child.kind() {
|
||||
"function_definition" => {
|
||||
if let Some(name) = node_name(&child, src) {
|
||||
glue.retain(|(_, gs, _)| *gs < s);
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, s, e, true));
|
||||
}
|
||||
}
|
||||
"class_definition" => {
|
||||
if let Some(name) = node_name(&child, src) {
|
||||
glue.retain(|(_, gs, _)| *gs < s);
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, s, e, true));
|
||||
// Recurse into the class body with the class
|
||||
// name pushed onto mod_path; methods become
|
||||
// `<...>.<ClassName>.<method>` and nested
|
||||
// classes recurse further with both names.
|
||||
if let Some(body) = child.child_by_field_name("body") {
|
||||
let mut np = mod_path.to_vec();
|
||||
np.push(name.to_string());
|
||||
walk(body, src, mod_prefix, &np, units, glue);
|
||||
debug_assert!(
|
||||
glue.is_empty(),
|
||||
"inner walk must flush its glue before returning"
|
||||
);
|
||||
}
|
||||
}
|
||||
}
|
||||
"decorated_definition" => {
|
||||
// Unwrap: the inner definition supplies the symbol
|
||||
// name, but the unit's line range comes from the
|
||||
// OUTER `decorated_definition` so decorator lines
|
||||
// are folded in (analog of `attribute_item`
|
||||
// re-absorption in 1A — see plan §Task E note (b)).
|
||||
if let Some(inner) = child.child_by_field_name("definition") {
|
||||
let outer_s = s; // already includes decorators
|
||||
let outer_e = e;
|
||||
match inner.kind() {
|
||||
"function_definition" => {
|
||||
if let Some(name) = node_name(&inner, src) {
|
||||
glue.retain(|(_, gs, _)| *gs < outer_s);
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, outer_s, outer_e, true));
|
||||
}
|
||||
}
|
||||
"class_definition" => {
|
||||
if let Some(name) = node_name(&inner, src) {
|
||||
glue.retain(|(_, gs, _)| *gs < outer_s);
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, outer_s, outer_e, true));
|
||||
if let Some(body) = inner.child_by_field_name("body") {
|
||||
let mut np = mod_path.to_vec();
|
||||
np.push(name.to_string());
|
||||
walk(body, src, mod_prefix, &np, units, glue);
|
||||
debug_assert!(
|
||||
glue.is_empty(),
|
||||
"inner walk must flush its glue before returning"
|
||||
);
|
||||
}
|
||||
}
|
||||
}
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
}
|
||||
"import_statement" | "import_from_statement" | "future_import_statement" => {
|
||||
glue.push((1, s, e));
|
||||
}
|
||||
"expression_statement" | "assignment" | "global_statement" => {
|
||||
glue.push((0, s, e));
|
||||
}
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
}
|
||||
fn flush_glue(
|
||||
glue: &mut Vec<(usize, u32, u32)>,
|
||||
units: &mut Vec<(String, u32, u32, bool)>,
|
||||
mod_prefix: &str,
|
||||
mod_path: &[String],
|
||||
) {
|
||||
if glue.is_empty() {
|
||||
return;
|
||||
}
|
||||
let s = glue.iter().map(|(_, a, _)| *a).min().unwrap();
|
||||
let e = glue.iter().map(|(_, _, b)| *b).max().unwrap();
|
||||
// Provisional label: `<module>` only if the group is exclusively
|
||||
// imports (1A's `only_mod_decls` analog). The post-pass below
|
||||
// demotes any `<module>` to `<top-level>` if the file produced
|
||||
// any real unit.
|
||||
let only_imports = glue.iter().all(|(is_import, _, _)| *is_import == 1);
|
||||
let label = if only_imports { "<module>" } else { "<top-level>" };
|
||||
units.push((join_symbol(mod_prefix, mod_path, label), s, e, false));
|
||||
glue.clear();
|
||||
}
|
||||
|
||||
walk(tree.root_node(), source, mod_prefix, &[], &mut units, &mut glue);
|
||||
|
||||
// `<module>` is correct only when the file produced no real unit.
|
||||
// Otherwise the import-only group becomes `<top-level>` (same
|
||||
// algorithm as 1A Gap 1). Match on the suffix so a class-nested
|
||||
// glue group (which doesn't exist in current Python AST but is
|
||||
// future-proofed) still demotes correctly.
|
||||
let has_real_unit = units.iter().any(|(_, _, _, is_real)| *is_real);
|
||||
if has_real_unit {
|
||||
for (sym, _, _, is_real) in units.iter_mut() {
|
||||
if !*is_real && sym.ends_with("<module>") {
|
||||
let pre = &sym[..sym.len() - "<module>".len()];
|
||||
*sym = format!("{pre}<top-level>");
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let total_lines = lines.len() as u32;
|
||||
let mut blocks = Vec::with_capacity(units.len());
|
||||
for (ordinal, (symbol, ls, le, _is_real)) in units.into_iter().enumerate() {
|
||||
let line_start = ls.max(1);
|
||||
let line_end = le.min(total_lines.max(1));
|
||||
let span = SourceSpan::Code {
|
||||
line_start,
|
||||
line_end,
|
||||
symbol: Some(symbol),
|
||||
lang: Some("python".to_string()),
|
||||
};
|
||||
let block_id = id_for_block(doc_id, "code", &[], ordinal as u32, &span);
|
||||
let code = lines[(line_start as usize - 1)..=(line_end as usize - 1)].join("\n");
|
||||
blocks.push(Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id,
|
||||
heading_path: Vec::new(),
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some("python".to_string()),
|
||||
code,
|
||||
}));
|
||||
}
|
||||
Ok(blocks)
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use kebab_core::{Block, MediaType, SourceSpan};
|
||||
|
||||
fn extract_fixture() -> kebab_core::CanonicalDocument {
|
||||
let bytes = std::fs::read(
|
||||
concat!(env!("CARGO_MANIFEST_DIR"), "/tests/fixtures/sample.py"),
|
||||
)
|
||||
.unwrap();
|
||||
let asset = crate::rust::tests_support::fixed_code_asset(
|
||||
"kebab_eval/metrics.py", "python",
|
||||
);
|
||||
let cfg = kebab_core::ExtractConfig::default();
|
||||
let root = std::path::PathBuf::from("/tmp");
|
||||
let ctx = kebab_core::ExtractContext {
|
||||
asset: &asset, workspace_root: &root, config: &cfg,
|
||||
};
|
||||
PythonAstExtractor::new().extract(&ctx, &bytes).unwrap()
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn extractor_supports_only_media_code_python() {
|
||||
let e = PythonAstExtractor::new();
|
||||
assert!(e.supports(&MediaType::Code("python".into())));
|
||||
assert!(!e.supports(&MediaType::Code("rust".into())));
|
||||
assert!(!e.supports(&MediaType::Markdown));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn python_units_carry_module_prefixed_symbols() {
|
||||
let doc = extract_fixture();
|
||||
let mut syms: Vec<String> = doc.blocks.iter().map(|b| match b {
|
||||
Block::Code(c) => match &c.common.source_span {
|
||||
SourceSpan::Code { symbol, lang, .. } => {
|
||||
assert_eq!(lang.as_deref(), Some("python"));
|
||||
symbol.clone().unwrap()
|
||||
}
|
||||
_ => panic!("expected SourceSpan::Code"),
|
||||
},
|
||||
other => panic!("expected Block::Code, got {other:?}"),
|
||||
}).collect();
|
||||
syms.sort();
|
||||
assert!(syms.iter().any(|s| s == "kebab_eval.metrics.free"));
|
||||
assert!(syms.iter().any(|s| s == "kebab_eval.metrics.Foo"));
|
||||
assert!(syms.iter().any(|s| s == "kebab_eval.metrics.Foo.double"));
|
||||
assert!(syms.iter().any(|s| s == "kebab_eval.metrics.Foo.name"));
|
||||
assert!(syms.iter().any(|s| s == "kebab_eval.metrics.Outer"));
|
||||
assert!(syms.iter().any(|s| s == "kebab_eval.metrics.Outer.Inner"));
|
||||
assert!(syms.iter().any(|s| s == "kebab_eval.metrics.Outer.Inner.helper"));
|
||||
assert!(syms.iter().any(|s| s == "kebab_eval.metrics.with_decorator"));
|
||||
assert!(syms.iter().any(|s| s == "kebab_eval.metrics.<top-level>"));
|
||||
// The `@no_type_check` decorator on `free` is folded into its
|
||||
// unit's line range (decorated_definition unwrap).
|
||||
let free_src = doc.blocks.iter().find_map(|b| match b {
|
||||
Block::Code(c) if matches!(&c.common.source_span,
|
||||
SourceSpan::Code{symbol,..} if symbol.as_deref()==Some("kebab_eval.metrics.free")) => Some(c.code.clone()),
|
||||
_ => None,
|
||||
}).unwrap();
|
||||
assert!(free_src.contains("@no_type_check"), "decorator folded in: {free_src}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deterministic_across_runs() {
|
||||
let a = extract_fixture();
|
||||
for _ in 0..50 { assert_eq!(extract_fixture().blocks, a.blocks); }
|
||||
}
|
||||
}
|
||||
61
crates/kebab-parse-code/src/repo.rs
Normal file
61
crates/kebab-parse-code/src/repo.rs
Normal file
@@ -0,0 +1,61 @@
|
||||
//! Git repo auto-detection (spec §5.1).
|
||||
//!
|
||||
//! Walks up from `path` looking for a `.git/` directory. If found, reads
|
||||
//! repo dir name, current branch, and HEAD commit using `gix` (pure Rust;
|
||||
//! no `git` binary on PATH required).
|
||||
|
||||
use std::path::Path;
|
||||
|
||||
#[derive(Clone, Debug, PartialEq, Eq)]
|
||||
pub struct RepoMeta {
|
||||
pub name: String,
|
||||
pub branch: Option<String>,
|
||||
pub commit: Option<String>,
|
||||
}
|
||||
|
||||
/// Walk up from `path` until a `.git/` directory is found. Returns repo
|
||||
/// metadata, or `None` if no repo boundary is reached before the filesystem
|
||||
/// root.
|
||||
///
|
||||
/// - `name`: directory name containing `.git/`.
|
||||
/// - `branch`: current HEAD branch, or `"detached"` if detached HEAD, or
|
||||
/// `None` if branch can't be read.
|
||||
/// - `commit`: 40-hex commit SHA at HEAD, or `None` if empty repo / read
|
||||
/// failure.
|
||||
///
|
||||
/// `.git/` as a file (worktree marker / submodule) returns `None` for
|
||||
/// `branch` and `commit` and falls back to the parent dir name for `name`.
|
||||
pub fn detect_repo(path: &Path) -> Option<RepoMeta> {
|
||||
let mut cur = if path.is_dir() { path } else { path.parent()? };
|
||||
loop {
|
||||
let dotgit = cur.join(".git");
|
||||
if dotgit.is_dir() {
|
||||
let name = cur.file_name()?.to_string_lossy().into_owned();
|
||||
let (branch, commit) = read_head(cur);
|
||||
return Some(RepoMeta { name, branch, commit });
|
||||
} else if dotgit.is_file() {
|
||||
let name = cur.file_name()?.to_string_lossy().into_owned();
|
||||
return Some(RepoMeta { name, branch: None, commit: None });
|
||||
}
|
||||
cur = cur.parent()?;
|
||||
}
|
||||
}
|
||||
|
||||
fn read_head(repo_dir: &Path) -> (Option<String>, Option<String>) {
|
||||
match gix::open(repo_dir) {
|
||||
Ok(repo) => {
|
||||
let branch = repo
|
||||
.head_name()
|
||||
.ok()
|
||||
.flatten()
|
||||
.map(|n| n.shorten().to_string())
|
||||
.or_else(|| Some("detached".to_string()));
|
||||
let commit = repo
|
||||
.head_id()
|
||||
.ok()
|
||||
.map(|id| id.to_string());
|
||||
(branch, commit)
|
||||
}
|
||||
Err(_) => (None, None),
|
||||
}
|
||||
}
|
||||
545
crates/kebab-parse-code/src/rust.rs
Normal file
545
crates/kebab-parse-code/src/rust.rs
Normal file
@@ -0,0 +1,545 @@
|
||||
//! `kebab-parse-code::rust` — tree-sitter Rust AST extractor (P10-1A-2).
|
||||
//!
|
||||
//! Implements [`kebab_core::Extractor`] for [`MediaType::Code("rust")`].
|
||||
//! Walks the tree-sitter parse tree and emits one [`Block::Code`] per
|
||||
//! top-level AST semantic unit (free fn, type, trait, macro, each impl
|
||||
//! method, recursively per module), each carrying [`SourceSpan::Code`]
|
||||
//! with the unit's self-reference symbol path (design §3.4). Glue
|
||||
//! declarations (`use` / `const` / `static` / bodyless `mod` / top-level
|
||||
//! attributes / macro invocations) collapse into one grouped
|
||||
//! `<top-level>` (or `<module>`) unit.
|
||||
//!
|
||||
//! Doc comments and attributes immediately preceding an item are folded
|
||||
//! into that item's line range (design §9.1 "선언 + doc comment").
|
||||
//!
|
||||
//! Scope is intentionally narrow: AST unit extraction + symbol paths +
|
||||
//! line ranges for Rust. The `CanonicalDocument` scaffold mirrors
|
||||
//! `kebab-parse-pdf`. Per design §3.4 / §9.1 / §9 versioning.
|
||||
//!
|
||||
//! Edge cases: a Rust file consisting solely of comments / whitespace
|
||||
//! (no fn / type / impl / mod / glue items) yields zero blocks → zero
|
||||
//! chunks → not surfaced in search. Safe (no panic) and consistent with
|
||||
//! "an empty page produces no chunks" in `pdf-page-v1`.
|
||||
|
||||
use anyhow::Result;
|
||||
use kebab_core::{
|
||||
Block, CanonicalDocument, CodeBlock, CommonBlock, Extractor, Lang, MediaType, Metadata,
|
||||
ParserVersion, Provenance, ProvenanceEvent, ProvenanceKind, SourceSpan, SourceType, TrustLevel,
|
||||
id_for_block, id_for_doc,
|
||||
};
|
||||
use serde_json::Map;
|
||||
use time::OffsetDateTime;
|
||||
|
||||
use crate::scaffold::{filename_from_workspace_path, strip_extension};
|
||||
|
||||
pub const PARSER_VERSION: &str = "code-rust-v1";
|
||||
|
||||
/// Rust AST extractor. Per-unit blocks via tree-sitter-rust 0.24
|
||||
/// (`LANGUAGE: LanguageFn`) parsed by tree-sitter 0.26.
|
||||
pub struct RustAstExtractor;
|
||||
|
||||
impl RustAstExtractor {
|
||||
pub fn new() -> Self {
|
||||
Self
|
||||
}
|
||||
}
|
||||
|
||||
impl Default for RustAstExtractor {
|
||||
fn default() -> Self {
|
||||
Self::new()
|
||||
}
|
||||
}
|
||||
|
||||
impl Extractor for RustAstExtractor {
|
||||
fn supports(&self, m: &MediaType) -> bool {
|
||||
matches!(m, MediaType::Code(l) if l == "rust")
|
||||
}
|
||||
|
||||
fn parser_version(&self) -> ParserVersion {
|
||||
ParserVersion(PARSER_VERSION.to_string())
|
||||
}
|
||||
|
||||
fn extract(
|
||||
&self,
|
||||
ctx: &kebab_core::ExtractContext<'_>,
|
||||
bytes: &[u8],
|
||||
) -> Result<CanonicalDocument> {
|
||||
let asset = ctx.asset;
|
||||
if !self.supports(&asset.media_type) {
|
||||
anyhow::bail!(
|
||||
"kebab-parse-code: unsupported media_type for RustAstExtractor: {:?}",
|
||||
asset.media_type
|
||||
);
|
||||
}
|
||||
|
||||
let parser_version = self.parser_version();
|
||||
let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, &parser_version);
|
||||
|
||||
let source = String::from_utf8(bytes.to_vec()).map_err(|e| {
|
||||
anyhow::anyhow!("kebab-parse-code: Rust source is not valid UTF-8: {e}")
|
||||
})?;
|
||||
|
||||
let blocks = build_blocks(&source, &doc_id)?;
|
||||
let unit_count = blocks.len() as u32;
|
||||
|
||||
let now = OffsetDateTime::now_utc();
|
||||
let mut events: Vec<ProvenanceEvent> = Vec::with_capacity(2);
|
||||
events.push(ProvenanceEvent {
|
||||
at: asset.discovered_at,
|
||||
agent: "kb-source-fs".to_string(),
|
||||
kind: ProvenanceKind::Discovered,
|
||||
note: None,
|
||||
});
|
||||
events.push(ProvenanceEvent {
|
||||
at: now,
|
||||
agent: "kb-parse-code".to_string(),
|
||||
kind: ProvenanceKind::Parsed,
|
||||
note: Some(format!(
|
||||
"parser_version={}; unit_count={}",
|
||||
parser_version.0, unit_count
|
||||
)),
|
||||
});
|
||||
|
||||
let title = {
|
||||
let fname = filename_from_workspace_path(&asset.workspace_path.0);
|
||||
strip_extension(&fname)
|
||||
};
|
||||
|
||||
// Resolve the file's absolute path for repo detection. If the
|
||||
// source URI carries a relative path, anchor it at the workspace
|
||||
// root so the `.git/` walk-up starts from the right place.
|
||||
let abs_path = match &asset.source_uri {
|
||||
kebab_core::SourceUri::File(p) => {
|
||||
if p.is_absolute() {
|
||||
p.clone()
|
||||
} else {
|
||||
ctx.workspace_root.join(p)
|
||||
}
|
||||
}
|
||||
kebab_core::SourceUri::Kb(_) => ctx.workspace_root.to_path_buf(),
|
||||
};
|
||||
let (repo, git_branch, git_commit) = match crate::repo::detect_repo(&abs_path) {
|
||||
Some(r) => (Some(r.name), r.branch, r.commit),
|
||||
None => (None, None, None),
|
||||
};
|
||||
|
||||
let metadata = Metadata {
|
||||
aliases: Vec::new(),
|
||||
tags: Vec::new(),
|
||||
created_at: asset.discovered_at,
|
||||
updated_at: asset.discovered_at,
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Map::new(),
|
||||
repo,
|
||||
git_branch,
|
||||
git_commit,
|
||||
code_lang: Some("rust".to_string()),
|
||||
};
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-parse-code",
|
||||
"extracted Rust doc_id={} workspace_path={} units={}",
|
||||
doc_id.0,
|
||||
asset.workspace_path.0,
|
||||
unit_count
|
||||
);
|
||||
|
||||
Ok(CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: asset.asset_id.clone(),
|
||||
workspace_path: asset.workspace_path.clone(),
|
||||
title,
|
||||
lang: Lang("und".to_string()),
|
||||
blocks,
|
||||
metadata,
|
||||
provenance: Provenance { events },
|
||||
parser_version,
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
fn build_blocks(
|
||||
source: &str,
|
||||
doc_id: &kebab_core::DocumentId,
|
||||
) -> anyhow::Result<Vec<kebab_core::Block>> {
|
||||
let mut parser = tree_sitter::Parser::new();
|
||||
parser
|
||||
.set_language(&tree_sitter_rust::LANGUAGE.into())
|
||||
.map_err(|e| anyhow::anyhow!("set tree-sitter-rust language: {e}"))?;
|
||||
let tree = parser
|
||||
.parse(source.as_bytes(), None)
|
||||
.ok_or_else(|| anyhow::anyhow!("tree-sitter failed to parse Rust source"))?;
|
||||
let lines: Vec<&str> = source.split('\n').collect();
|
||||
|
||||
// units: (symbol, line_start, line_end, is_real_semantic_unit).
|
||||
// Glue groups are pushed with a sentinel symbol + is_real=false so a
|
||||
// post-pass can decide `<module>` vs `<top-level>` (Gap 1).
|
||||
let mut units: Vec<(String, u32, u32, bool)> = Vec::new();
|
||||
let mut glue: Vec<(usize, u32, u32)> = Vec::new(); // (is_mod_decl 0/1, s, e)
|
||||
|
||||
fn node_name<'a>(n: &tree_sitter::Node, src: &'a str) -> Option<&'a str> {
|
||||
n.child_by_field_name("name")
|
||||
.map(|c| &src[c.start_byte()..c.end_byte()])
|
||||
}
|
||||
fn unit_start(n: &tree_sitter::Node) -> u32 {
|
||||
let mut start = n.start_position().row as u32 + 1;
|
||||
let mut prev = n.prev_sibling();
|
||||
while let Some(p) = prev {
|
||||
let k = p.kind();
|
||||
if k == "line_comment" || k == "block_comment" || k == "attribute_item" {
|
||||
start = p.start_position().row as u32 + 1;
|
||||
prev = p.prev_sibling();
|
||||
} else {
|
||||
break;
|
||||
}
|
||||
}
|
||||
start
|
||||
}
|
||||
fn walk(
|
||||
node: tree_sitter::Node,
|
||||
src: &str,
|
||||
mod_path: &[String],
|
||||
units: &mut Vec<(String, u32, u32, bool)>,
|
||||
glue: &mut Vec<(usize, u32, u32)>,
|
||||
) {
|
||||
// Module-path prefix for this scope. Used for both real units
|
||||
// (`format!("{prefix}{name}")`) and glue group labels
|
||||
// (`format!("{prefix}<top-level>")`) so glue from `mod inner`
|
||||
// doesn't collide on symbol with file-top-level glue and keeps
|
||||
// module context downstream. Empty at file top level -> glue
|
||||
// stays exactly `<top-level>` / `<module>`.
|
||||
let prefix = if mod_path.is_empty() {
|
||||
String::new()
|
||||
} else {
|
||||
format!("{}::", mod_path.join("::"))
|
||||
};
|
||||
let mut cur = node.walk();
|
||||
for child in node.named_children(&mut cur) {
|
||||
let s = unit_start(&child);
|
||||
let e = child.end_position().row as u32 + 1;
|
||||
match child.kind() {
|
||||
"function_item" | "struct_item" | "enum_item" | "union_item"
|
||||
| "trait_item" | "type_item" => {
|
||||
if let Some(name) = node_name(&child, src) {
|
||||
// Gap 2: a leading attribute/comment that this unit
|
||||
// re-absorbs (via `unit_start`'s upward extension to
|
||||
// `s`) must not also remain in the glue group, or it
|
||||
// would be emitted in both chunks. Drop glue entries
|
||||
// at/after the unit's extended start.
|
||||
glue.retain(|(_, gs, _)| *gs < s);
|
||||
flush_glue(glue, units, &prefix);
|
||||
units.push((format!("{prefix}{name}"), s, e, true));
|
||||
}
|
||||
}
|
||||
"macro_definition" => {
|
||||
if let Some(name) = node_name(&child, src) {
|
||||
glue.retain(|(_, gs, _)| *gs < s);
|
||||
flush_glue(glue, units, &prefix);
|
||||
units.push((format!("{prefix}{name}!"), s, e, true));
|
||||
}
|
||||
}
|
||||
// `impl` blocks: emit one unit per inner `function_item`.
|
||||
// Associated consts / types / non-fn members do not become
|
||||
// their own units in 1A (plan §1A scope; HOTFIXES will log
|
||||
// if a future need arises). See inner comment below.
|
||||
"impl_item" => {
|
||||
glue.retain(|(_, gs, _)| *gs < s);
|
||||
flush_glue(glue, units, &prefix);
|
||||
let ty = child
|
||||
.child_by_field_name("type")
|
||||
.map(|c| src[c.start_byte()..c.end_byte()].trim().to_string());
|
||||
let tr = child
|
||||
.child_by_field_name("trait")
|
||||
.map(|c| src[c.start_byte()..c.end_byte()].trim().to_string());
|
||||
let owner = tr.or(ty).unwrap_or_else(|| "<impl>".to_string());
|
||||
if let Some(body) = child.child_by_field_name("body") {
|
||||
let mut bc = body.walk();
|
||||
// 1A scope: only inner `function_item` children
|
||||
// become units. Associated consts / types and other
|
||||
// non-fn impl members are intentionally NOT emitted
|
||||
// as separate units in 1A (plan spec: "1 per inner
|
||||
// function_item").
|
||||
for m in body.named_children(&mut bc) {
|
||||
if m.kind() == "function_item" {
|
||||
if let Some(mn) = node_name(&m, src) {
|
||||
let ms = unit_start(&m);
|
||||
let me = m.end_position().row as u32 + 1;
|
||||
units.push((format!("{prefix}{owner}::{mn}"), ms, me, true));
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
"mod_item" => {
|
||||
if let Some(body) = child.child_by_field_name("body") {
|
||||
flush_glue(glue, units, &prefix);
|
||||
let name = node_name(&child, src).unwrap_or("mod").to_string();
|
||||
let mut np = mod_path.to_vec();
|
||||
np.push(name);
|
||||
walk(body, src, &np, units, glue);
|
||||
// Invariant: `glue` is shared by `&mut` across
|
||||
// recursive `walk` calls; every `walk` path ends with
|
||||
// a `flush_glue`, so inner-scope glue can never leak
|
||||
// into this outer scope's group. Assert it structurally
|
||||
// rather than relying on that being incidental.
|
||||
debug_assert!(
|
||||
glue.is_empty(),
|
||||
"inner walk must flush its glue before returning"
|
||||
);
|
||||
} else {
|
||||
glue.push((1, s, e));
|
||||
}
|
||||
}
|
||||
"use_declaration" | "extern_crate_declaration" | "const_item"
|
||||
| "static_item" | "attribute_item" | "macro_invocation" => {
|
||||
glue.push((0, s, e));
|
||||
}
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
flush_glue(glue, units, &prefix);
|
||||
}
|
||||
fn flush_glue(
|
||||
glue: &mut Vec<(usize, u32, u32)>,
|
||||
units: &mut Vec<(String, u32, u32, bool)>,
|
||||
prefix: &str,
|
||||
) {
|
||||
if glue.is_empty() {
|
||||
return;
|
||||
}
|
||||
let s = glue.iter().map(|(_, a, _)| *a).min().unwrap();
|
||||
let e = glue.iter().map(|(_, _, b)| *b).max().unwrap();
|
||||
// Provisional label: `<module>` only if this group is exclusively
|
||||
// bodyless `mod foo;` declarations. The final decision (Gap 1) also
|
||||
// requires the *whole file* to have produced zero real units; that
|
||||
// demotion to `<top-level>` happens in the post-pass below.
|
||||
let only_mod_decls = glue.iter().all(|(is_mod, _, _)| *is_mod == 1);
|
||||
let label = if only_mod_decls { "<module>" } else { "<top-level>" };
|
||||
// Module-path-prefix the label so glue from `mod inner` carries
|
||||
// module context (`inner::<top-level>`) and doesn't collide with
|
||||
// file-top-level glue. `prefix` is empty at file top level, so the
|
||||
// symbol stays exactly `<top-level>` / `<module>` there.
|
||||
units.push((format!("{prefix}{label}"), s, e, false));
|
||||
glue.clear();
|
||||
}
|
||||
|
||||
walk(tree.root_node(), source, &[], &mut units, &mut glue);
|
||||
|
||||
// Gap 1: `<module>` is correct only when the file produced no real
|
||||
// (non-glue) semantic unit at all. If any real unit exists, every glue
|
||||
// group is `<top-level>`, even a pure mod-decl group.
|
||||
let has_real_unit = units.iter().any(|(_, _, _, is_real)| *is_real);
|
||||
if has_real_unit {
|
||||
for (sym, _, _, is_real) in units.iter_mut() {
|
||||
// Match on the *suffix*: a glue group may now carry a module
|
||||
// prefix (`inner::<module>`), so demote any `…<module>` to the
|
||||
// same-prefixed `…<top-level>` rather than only the bare form.
|
||||
if !*is_real && sym.ends_with("<module>") {
|
||||
let pre = &sym[..sym.len() - "<module>".len()];
|
||||
*sym = format!("{pre}<top-level>");
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let total_lines = lines.len() as u32;
|
||||
let mut blocks = Vec::with_capacity(units.len());
|
||||
for (ordinal, (symbol, ls, le, _is_real)) in units.into_iter().enumerate() {
|
||||
let line_start = ls.max(1);
|
||||
let line_end = le.min(total_lines.max(1));
|
||||
let span = SourceSpan::Code {
|
||||
line_start,
|
||||
line_end,
|
||||
symbol: Some(symbol),
|
||||
lang: Some("rust".to_string()),
|
||||
};
|
||||
let block_id = id_for_block(doc_id, "code", &[], ordinal as u32, &span);
|
||||
let code = lines[(line_start as usize - 1)..=(line_end as usize - 1)].join("\n");
|
||||
blocks.push(Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id,
|
||||
heading_path: Vec::new(),
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some("rust".to_string()),
|
||||
code,
|
||||
}));
|
||||
}
|
||||
Ok(blocks)
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use kebab_core::{Block, MediaType, SourceSpan};
|
||||
|
||||
fn extract_fixture() -> kebab_core::CanonicalDocument {
|
||||
let bytes = std::fs::read(
|
||||
concat!(env!("CARGO_MANIFEST_DIR"), "/tests/fixtures/sample.rs"),
|
||||
)
|
||||
.unwrap();
|
||||
let asset = tests_support::fixed_code_asset("crates/x/src/sample.rs", "rust");
|
||||
let cfg = kebab_core::ExtractConfig::default();
|
||||
let root = std::path::PathBuf::from("/tmp");
|
||||
let ctx = kebab_core::ExtractContext { asset: &asset, workspace_root: &root, config: &cfg };
|
||||
RustAstExtractor::new().extract(&ctx, &bytes).unwrap()
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn extractor_supports_only_media_code_rust() {
|
||||
let e = RustAstExtractor::new();
|
||||
assert!(e.supports(&MediaType::Code("rust".into())));
|
||||
assert!(!e.supports(&MediaType::Code("python".into())));
|
||||
assert!(!e.supports(&MediaType::Markdown));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn emits_one_block_per_semantic_unit_with_symbols() {
|
||||
let doc = extract_fixture();
|
||||
let mut syms: Vec<(String, u32, u32)> = doc
|
||||
.blocks
|
||||
.iter()
|
||||
.map(|b| match b {
|
||||
Block::Code(c) => match &c.common.source_span {
|
||||
SourceSpan::Code { symbol, line_start, line_end, lang } => {
|
||||
assert_eq!(lang.as_deref(), Some("rust"));
|
||||
(symbol.clone().unwrap(), *line_start, *line_end)
|
||||
}
|
||||
_ => panic!("code block must carry SourceSpan::Code"),
|
||||
},
|
||||
other => panic!("expected Block::Code, got {other:?}"),
|
||||
})
|
||||
.collect();
|
||||
syms.sort();
|
||||
let names: Vec<&str> = syms.iter().map(|(s, _, _)| s.as_str()).collect();
|
||||
assert!(names.contains(&"parse"));
|
||||
assert!(names.contains(&"Foo"));
|
||||
assert!(names.contains(&"Foo::double"));
|
||||
assert!(names.contains(&"Foo::name"));
|
||||
assert!(names.contains(&"Greet"));
|
||||
assert!(names.contains(&"inner::helper"));
|
||||
assert!(names.contains(&"<top-level>")); // use + const grouped
|
||||
let parse_src = doc.blocks.iter().find_map(|b| match b {
|
||||
Block::Code(c) if matches!(&c.common.source_span, SourceSpan::Code{symbol,..} if symbol.as_deref()==Some("parse")) => Some(c.code.clone()),
|
||||
_ => None,
|
||||
}).unwrap();
|
||||
assert!(parse_src.contains("/// Doc comment on a free fn."), "doc comment folded in: {parse_src}");
|
||||
}
|
||||
|
||||
/// Run the extractor on an in-memory Rust source string (no fixture
|
||||
/// file) and return (symbol, code) for every emitted block.
|
||||
fn extract_inline(source: &str) -> Vec<(String, String)> {
|
||||
let asset = tests_support::fixed_code_asset("crates/x/src/inline.rs", "rust");
|
||||
let cfg = kebab_core::ExtractConfig::default();
|
||||
let root = std::path::PathBuf::from("/tmp");
|
||||
let ctx = kebab_core::ExtractContext { asset: &asset, workspace_root: &root, config: &cfg };
|
||||
let doc = RustAstExtractor::new()
|
||||
.extract(&ctx, source.as_bytes())
|
||||
.unwrap();
|
||||
doc.blocks
|
||||
.iter()
|
||||
.map(|b| match b {
|
||||
Block::Code(c) => match &c.common.source_span {
|
||||
SourceSpan::Code { symbol, .. } => {
|
||||
(symbol.clone().unwrap(), c.code.clone())
|
||||
}
|
||||
_ => panic!("code block must carry SourceSpan::Code"),
|
||||
},
|
||||
other => panic!("expected Block::Code, got {other:?}"),
|
||||
})
|
||||
.collect()
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn module_label_scope_and_attribute_dedup() {
|
||||
// Source A (Gap 2): leading attribute is re-absorbed into the unit
|
||||
// and must NOT also form a separate <top-level> glue chunk.
|
||||
let a = extract_inline("#[derive(Debug)]\npub struct Tagged { x: u32 }\n");
|
||||
assert_eq!(a.len(), 1, "Gap 2: exactly one block, got {a:?}");
|
||||
assert_eq!(a[0].0, "Tagged");
|
||||
assert!(
|
||||
a[0].1.contains("#[derive(Debug)]"),
|
||||
"attribute folded into unit: {:?}",
|
||||
a[0].1
|
||||
);
|
||||
assert!(
|
||||
!a.iter().any(|(s, _)| s == "<top-level>"),
|
||||
"attribute must not also form a glue chunk: {a:?}"
|
||||
);
|
||||
|
||||
// Source B (Gap 1): file has no real units, only bodyless mod
|
||||
// decls -> the glue group is <module>.
|
||||
let b = extract_inline("mod a;\nmod b;\n");
|
||||
assert_eq!(b.len(), 1, "one glue block, got {b:?}");
|
||||
assert_eq!(b[0].0, "<module>");
|
||||
|
||||
// Source C (Gap 1): mod decls + a real unit -> the glue group is
|
||||
// <top-level>, NOT <module>, because the file has a real unit.
|
||||
let c = extract_inline("mod a;\nmod b;\npub fn f() {}\n");
|
||||
let syms: Vec<&str> = c.iter().map(|(s, _)| s.as_str()).collect();
|
||||
assert!(syms.contains(&"f"), "real unit present: {c:?}");
|
||||
assert!(
|
||||
syms.contains(&"<top-level>"),
|
||||
"mod-decl glue demoted to <top-level>: {c:?}"
|
||||
);
|
||||
assert!(
|
||||
!syms.contains(&"<module>"),
|
||||
"must not be <module> when file has a real unit: {c:?}"
|
||||
);
|
||||
|
||||
// Source D (Fix 1): glue inside a bodied `mod inner` must carry the
|
||||
// module-path prefix so it doesn't collide with file-top-level glue
|
||||
// and keeps module context downstream.
|
||||
let d = extract_inline("mod inner {\n use std::fmt;\n pub fn helper() {}\n}\n");
|
||||
let dsyms: Vec<&str> = d.iter().map(|(s, _)| s.as_str()).collect();
|
||||
assert!(
|
||||
dsyms.contains(&"inner::helper"),
|
||||
"real unit inside mod is prefixed: {d:?}"
|
||||
);
|
||||
assert!(
|
||||
dsyms.contains(&"inner::<top-level>"),
|
||||
"glue inside mod inner is module-prefixed, not bare: {d:?}"
|
||||
);
|
||||
assert!(
|
||||
!dsyms.contains(&"<top-level>"),
|
||||
"glue inside mod inner must NOT be the bare top-level symbol: {d:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deterministic_across_runs() {
|
||||
let a = extract_fixture();
|
||||
for _ in 0..50 {
|
||||
assert_eq!(extract_fixture().blocks, a.blocks);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
pub(crate) mod tests_support {
|
||||
use kebab_core::*;
|
||||
use time::OffsetDateTime;
|
||||
/// Test-only `RawAsset` builder for any tree-sitter language. Shared
|
||||
/// across `rust.rs` / `python.rs` / future TS+JS extractor tests so all
|
||||
/// in-crate code-extractor tests use a single canonical fixture shape.
|
||||
pub fn fixed_code_asset(workspace_path: &str, code_lang: &str) -> RawAsset {
|
||||
RawAsset {
|
||||
asset_id: AssetId("a".repeat(64)),
|
||||
source_uri: SourceUri::File(std::path::PathBuf::from(workspace_path)),
|
||||
workspace_path: WorkspacePath(workspace_path.to_string()),
|
||||
media_type: MediaType::Code(code_lang.to_string()),
|
||||
byte_len: 0,
|
||||
checksum: Checksum("b".repeat(64)),
|
||||
discovered_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
|
||||
stored: AssetStorage::Reference {
|
||||
path: std::path::PathBuf::from(workspace_path),
|
||||
sha: Checksum("b".repeat(64)),
|
||||
},
|
||||
}
|
||||
}
|
||||
}
|
||||
45
crates/kebab-parse-code/src/scaffold.rs
Normal file
45
crates/kebab-parse-code/src/scaffold.rs
Normal file
@@ -0,0 +1,45 @@
|
||||
//! `kebab-parse-code::scaffold` — shared pure helpers used by all
|
||||
//! per-language extractor modules.
|
||||
//!
|
||||
//! These are `pub(crate)` utilities extracted from the four extractor
|
||||
//! modules (rust / python / typescript / javascript) where identical
|
||||
//! copies existed. Keeping them here is the single source of truth.
|
||||
|
||||
/// Extract the last path component (filename) from a `/`-separated
|
||||
/// workspace path string.
|
||||
/// For a path like `crates/x/src/foo.rs` this returns `foo.rs`.
|
||||
pub(crate) fn filename_from_workspace_path(p: &str) -> String {
|
||||
p.rsplit('/').next().unwrap_or(p).to_string()
|
||||
}
|
||||
|
||||
/// Strip the last dot-extension from a filename string.
|
||||
/// A leading dot (hidden-file convention) is preserved as-is.
|
||||
/// `foo.rs` → `foo`, `.hidden` → `.hidden`, `noext` → `noext`.
|
||||
pub(crate) fn strip_extension(filename: &str) -> String {
|
||||
match filename.rfind('.') {
|
||||
Some(0) => filename.to_string(),
|
||||
Some(idx) => filename[..idx].to_string(),
|
||||
None => filename.to_string(),
|
||||
}
|
||||
}
|
||||
|
||||
/// Join `(mod_prefix, mod_path, name)` into a dotted symbol string.
|
||||
///
|
||||
/// Used by Python / TypeScript / JavaScript extractors. Rust uses
|
||||
/// `::` separators instead and builds symbols inline; this helper
|
||||
/// covers the `.`-joined languages.
|
||||
///
|
||||
/// Empty `mod_prefix` (e.g. file is `__init__.py` at workspace root)
|
||||
/// drops the leading prefix segment; empty `mod_path` (file top-level)
|
||||
/// drops the class-nesting middle segment.
|
||||
pub(crate) fn join_symbol(mod_prefix: &str, mod_path: &[String], name: &str) -> String {
|
||||
let mut parts: Vec<&str> = Vec::with_capacity(mod_path.len() + 2);
|
||||
if !mod_prefix.is_empty() {
|
||||
parts.push(mod_prefix);
|
||||
}
|
||||
for p in mod_path {
|
||||
parts.push(p.as_str());
|
||||
}
|
||||
parts.push(name);
|
||||
parts.join(".")
|
||||
}
|
||||
65
crates/kebab-parse-code/src/skip.rs
Normal file
65
crates/kebab-parse-code/src/skip.rs
Normal file
@@ -0,0 +1,65 @@
|
||||
//! Pre-ingest skip helpers (spec §5.2 + §5.3 + §5.4).
|
||||
//!
|
||||
//! - [`BUILTIN_BLACKLIST`] — 6 gitignore-style patterns universal across
|
||||
//! ecosystems. Source of truth: spec §5.2.
|
||||
//! - [`is_generated_file`] — reads first ~512 bytes, checks for 7
|
||||
//! case-insensitive markers.
|
||||
//! - [`is_oversized`] — byte cap then line cap.
|
||||
|
||||
use anyhow::Result;
|
||||
use std::fs::File;
|
||||
use std::io::{BufRead, BufReader, Read};
|
||||
use std::path::Path;
|
||||
|
||||
/// 6 built-in gitignore-style patterns. Applied in addition to `.gitignore`
|
||||
/// + `.kebabignore`. User can override via `.kebabignore` negation
|
||||
/// (`!pattern`).
|
||||
pub const BUILTIN_BLACKLIST: &[&str] = &[
|
||||
"**/node_modules/**",
|
||||
"**/target/**",
|
||||
"**/__pycache__/**",
|
||||
"**/.venv/**",
|
||||
"**/venv/**",
|
||||
"**/env/**",
|
||||
];
|
||||
|
||||
/// Read first 512 bytes, check for any of 7 case-insensitive generated-file
|
||||
/// markers. Returns Ok(true) on match, Ok(false) otherwise.
|
||||
pub fn is_generated_file(path: &Path) -> Result<bool> {
|
||||
let mut buf = [0u8; 512];
|
||||
let mut f = File::open(path)?;
|
||||
let n = f.read(&mut buf)?;
|
||||
if n == 0 {
|
||||
return Ok(false);
|
||||
}
|
||||
let head = std::str::from_utf8(&buf[..n]).unwrap_or("");
|
||||
let lower: String = head.lines().take(10).collect::<Vec<_>>().join("\n").to_ascii_lowercase();
|
||||
Ok(
|
||||
lower.contains("@generated")
|
||||
|| lower.contains("code generated by")
|
||||
|| lower.contains("do not edit")
|
||||
|| lower.contains("do not modify")
|
||||
|| lower.contains("automatically generated")
|
||||
|| lower.contains("auto-generated")
|
||||
|| lower.contains("autogenerated"),
|
||||
)
|
||||
}
|
||||
|
||||
/// Check if `path` exceeds `max_bytes` or `max_lines`. Byte cap first
|
||||
/// (cheap), then line cap (streaming with early exit).
|
||||
pub fn is_oversized(path: &Path, max_bytes: u64, max_lines: u32) -> Result<bool> {
|
||||
let meta = std::fs::metadata(path)?;
|
||||
if meta.len() > max_bytes {
|
||||
return Ok(true);
|
||||
}
|
||||
let reader = BufReader::new(File::open(path)?);
|
||||
let mut count: u32 = 0;
|
||||
for line in reader.lines() {
|
||||
let _ = line?;
|
||||
count = count.saturating_add(1);
|
||||
if count > max_lines {
|
||||
return Ok(true);
|
||||
}
|
||||
}
|
||||
Ok(false)
|
||||
}
|
||||
691
crates/kebab-parse-code/src/typescript.rs
Normal file
691
crates/kebab-parse-code/src/typescript.rs
Normal file
@@ -0,0 +1,691 @@
|
||||
//! `kebab-parse-code::typescript` — tree-sitter TypeScript / TSX AST
|
||||
//! extractor (P10-1B Task H).
|
||||
//!
|
||||
//! Implements [`kebab_core::Extractor`] for [`MediaType::Code("typescript")`].
|
||||
//! Walks the tree-sitter parse tree (one of two grammars selected by the
|
||||
//! workspace path's extension — `.tsx` uses [`tree_sitter_typescript::LANGUAGE_TSX`],
|
||||
//! everything else uses [`tree_sitter_typescript::LANGUAGE_TYPESCRIPT`]) and
|
||||
//! emits one [`Block::Code`] per top-level AST semantic unit (free fn,
|
||||
//! class, each method, interface, type alias, enum, recursively per
|
||||
//! nested class), each carrying [`SourceSpan::Code`] with the unit's
|
||||
//! dotted symbol path prefixed by [`module_path_for_tsjs`].
|
||||
//!
|
||||
//! Glue declarations (`import_statement`, bare `export_statement`
|
||||
//! re-exports, `lexical_declaration` / `variable_declaration` at the
|
||||
//! module level, namespace / module declarations, etc.) collapse into
|
||||
//! one grouped `<top-level>` (or `<module>`) unit.
|
||||
//!
|
||||
//! `export_statement` is unwrapped: an `export function|class|interface
|
||||
//! |type|enum` is treated as the inner declaration arm but the unit's
|
||||
//! line range comes from the OUTER `export_statement` so the `export `
|
||||
//! prefix is folded in. `export default function () {}` / `export
|
||||
//! default class {}` (no `name` field) emits `default` as the symbol
|
||||
//! name.
|
||||
//!
|
||||
//! Scope follows 1A-2 / 1B Task E: AST unit extraction + dotted symbol
|
||||
//! paths + line ranges. Per design §3.4 / §9.1 / §9 versioning.
|
||||
|
||||
use anyhow::Result;
|
||||
use kebab_core::{
|
||||
Block, CanonicalDocument, CodeBlock, CommonBlock, Extractor, Lang, MediaType, Metadata,
|
||||
ParserVersion, Provenance, ProvenanceEvent, ProvenanceKind, SourceSpan, SourceType, TrustLevel,
|
||||
id_for_block, id_for_doc,
|
||||
};
|
||||
use serde_json::Map;
|
||||
use time::OffsetDateTime;
|
||||
|
||||
use crate::scaffold::{filename_from_workspace_path, join_symbol, strip_extension};
|
||||
|
||||
pub const PARSER_VERSION: &str = "code-ts-v1";
|
||||
|
||||
/// TypeScript / TSX AST extractor. Per-unit blocks via
|
||||
/// tree-sitter-typescript 0.23 (`LANGUAGE_TYPESCRIPT` / `LANGUAGE_TSX`
|
||||
/// — two `LanguageFn`s, selected by extension) parsed by tree-sitter
|
||||
/// 0.26.
|
||||
pub struct TypescriptAstExtractor;
|
||||
|
||||
impl TypescriptAstExtractor {
|
||||
pub fn new() -> Self {
|
||||
Self
|
||||
}
|
||||
}
|
||||
|
||||
impl Default for TypescriptAstExtractor {
|
||||
fn default() -> Self {
|
||||
Self::new()
|
||||
}
|
||||
}
|
||||
|
||||
impl Extractor for TypescriptAstExtractor {
|
||||
fn supports(&self, m: &MediaType) -> bool {
|
||||
matches!(m, MediaType::Code(l) if l == "typescript")
|
||||
}
|
||||
|
||||
fn parser_version(&self) -> ParserVersion {
|
||||
ParserVersion(PARSER_VERSION.to_string())
|
||||
}
|
||||
|
||||
fn extract(
|
||||
&self,
|
||||
ctx: &kebab_core::ExtractContext<'_>,
|
||||
bytes: &[u8],
|
||||
) -> Result<CanonicalDocument> {
|
||||
let asset = ctx.asset;
|
||||
if !self.supports(&asset.media_type) {
|
||||
anyhow::bail!(
|
||||
"kebab-parse-code: unsupported media_type for TypescriptAstExtractor: {:?}",
|
||||
asset.media_type
|
||||
);
|
||||
}
|
||||
|
||||
let parser_version = self.parser_version();
|
||||
let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, &parser_version);
|
||||
|
||||
let source = String::from_utf8(bytes.to_vec()).map_err(|e| {
|
||||
anyhow::anyhow!("kebab-parse-code: TypeScript source is not valid UTF-8: {e}")
|
||||
})?;
|
||||
|
||||
let mod_prefix = crate::lang::module_path_for_tsjs(&asset.workspace_path.0);
|
||||
let language = select_grammar(&asset.workspace_path.0);
|
||||
let blocks = build_blocks(&source, &doc_id, &mod_prefix, language)?;
|
||||
let unit_count = blocks.len() as u32;
|
||||
|
||||
let now = OffsetDateTime::now_utc();
|
||||
let mut events: Vec<ProvenanceEvent> = Vec::with_capacity(2);
|
||||
events.push(ProvenanceEvent {
|
||||
at: asset.discovered_at,
|
||||
agent: "kb-source-fs".to_string(),
|
||||
kind: ProvenanceKind::Discovered,
|
||||
note: None,
|
||||
});
|
||||
events.push(ProvenanceEvent {
|
||||
at: now,
|
||||
agent: "kb-parse-code".to_string(),
|
||||
kind: ProvenanceKind::Parsed,
|
||||
note: Some(format!(
|
||||
"parser_version={}; unit_count={}",
|
||||
parser_version.0, unit_count
|
||||
)),
|
||||
});
|
||||
|
||||
let title = {
|
||||
let fname = filename_from_workspace_path(&asset.workspace_path.0);
|
||||
strip_extension(&fname)
|
||||
};
|
||||
|
||||
// Resolve the file's absolute path for repo detection. If the
|
||||
// source URI carries a relative path, anchor it at the workspace
|
||||
// root so the `.git/` walk-up starts from the right place.
|
||||
let abs_path = match &asset.source_uri {
|
||||
kebab_core::SourceUri::File(p) => {
|
||||
if p.is_absolute() {
|
||||
p.clone()
|
||||
} else {
|
||||
ctx.workspace_root.join(p)
|
||||
}
|
||||
}
|
||||
kebab_core::SourceUri::Kb(_) => ctx.workspace_root.to_path_buf(),
|
||||
};
|
||||
let (repo, git_branch, git_commit) = match crate::repo::detect_repo(&abs_path) {
|
||||
Some(r) => (Some(r.name), r.branch, r.commit),
|
||||
None => (None, None, None),
|
||||
};
|
||||
|
||||
let metadata = Metadata {
|
||||
aliases: Vec::new(),
|
||||
tags: Vec::new(),
|
||||
created_at: asset.discovered_at,
|
||||
updated_at: asset.discovered_at,
|
||||
source_type: SourceType::Note,
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Map::new(),
|
||||
repo,
|
||||
git_branch,
|
||||
git_commit,
|
||||
code_lang: Some("typescript".to_string()),
|
||||
};
|
||||
|
||||
tracing::debug!(
|
||||
target: "kebab-parse-code",
|
||||
"extracted TypeScript doc_id={} workspace_path={} units={}",
|
||||
doc_id.0,
|
||||
asset.workspace_path.0,
|
||||
unit_count
|
||||
);
|
||||
|
||||
Ok(CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: asset.asset_id.clone(),
|
||||
workspace_path: asset.workspace_path.clone(),
|
||||
title,
|
||||
lang: Lang("und".to_string()),
|
||||
blocks,
|
||||
metadata,
|
||||
provenance: Provenance { events },
|
||||
parser_version,
|
||||
schema_version: 1,
|
||||
doc_version: 1,
|
||||
last_chunker_version: None,
|
||||
last_embedding_version: None,
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
/// Select the tree-sitter grammar based on the workspace path's
|
||||
/// extension. `.tsx` → TSX grammar; everything else (`.ts`, `.mts`,
|
||||
/// `.cts`, `.d.ts`, missing extension) → TypeScript grammar (the JSX-
|
||||
/// agnostic variants all share one grammar in tree-sitter-typescript 0.23).
|
||||
fn select_grammar(workspace_path: &str) -> tree_sitter::Language {
|
||||
if workspace_path.ends_with(".tsx") {
|
||||
tree_sitter_typescript::LANGUAGE_TSX.into()
|
||||
} else {
|
||||
tree_sitter_typescript::LANGUAGE_TYPESCRIPT.into()
|
||||
}
|
||||
}
|
||||
|
||||
fn build_blocks(
|
||||
source: &str,
|
||||
doc_id: &kebab_core::DocumentId,
|
||||
mod_prefix: &str,
|
||||
language: tree_sitter::Language,
|
||||
) -> anyhow::Result<Vec<kebab_core::Block>> {
|
||||
let mut parser = tree_sitter::Parser::new();
|
||||
parser
|
||||
.set_language(&language)
|
||||
.map_err(|e| anyhow::anyhow!("set tree-sitter-typescript language: {e}"))?;
|
||||
let tree = parser
|
||||
.parse(source.as_bytes(), None)
|
||||
.ok_or_else(|| anyhow::anyhow!("tree-sitter failed to parse TypeScript source"))?;
|
||||
let lines: Vec<&str> = source.split('\n').collect();
|
||||
|
||||
// units: (symbol, line_start, line_end, is_real_semantic_unit).
|
||||
// Glue groups are pushed with a sentinel symbol + is_real=false so a
|
||||
// post-pass can decide `<module>` vs `<top-level>` (same algorithm
|
||||
// as 1A Gap 1 / 1B Python).
|
||||
let mut units: Vec<(String, u32, u32, bool)> = Vec::new();
|
||||
// (is_module_only_kind 0/1, s, e). `is_module_only_kind` flags
|
||||
// `import_statement` and bare re-export `export_statement`s — used by
|
||||
// the glue flush to pick `<module>` vs `<top-level>` provisional
|
||||
// label (1A's `is_mod_decl` analog).
|
||||
let mut glue: Vec<(usize, u32, u32)> = Vec::new();
|
||||
|
||||
/// Walk preceding `comment` and `decorator` siblings to extend the
|
||||
/// unit's line range upward, folding leading doc/line comments and
|
||||
/// decorators into the unit.
|
||||
///
|
||||
/// In tree-sitter-typescript 0.23, TS class-method decorators (and
|
||||
/// class-level decorators) are **`class_body` siblings** that
|
||||
/// immediately precede the `method_definition` node — they are NOT
|
||||
/// children of `method_definition`. (Contrast with
|
||||
/// tree-sitter-javascript, where the `decorator` IS stored inside
|
||||
/// `method_definition` as a named child via the `decorator` field, so
|
||||
/// `method_definition.start_row` already covers the decorator line
|
||||
/// there — no sibling walk needed in `javascript.rs`.)
|
||||
///
|
||||
/// Extending backward over `decorator` siblings here matches Python's
|
||||
/// `decorated_definition` arm behavior: the decorator line is folded
|
||||
/// into the emitted unit's line range.
|
||||
fn unit_start(n: &tree_sitter::Node) -> u32 {
|
||||
let mut start = n.start_position().row as u32 + 1;
|
||||
let mut prev = n.prev_sibling();
|
||||
while let Some(p) = prev {
|
||||
if p.kind() == "comment" || p.kind() == "decorator" {
|
||||
start = p.start_position().row as u32 + 1;
|
||||
prev = p.prev_sibling();
|
||||
} else {
|
||||
break;
|
||||
}
|
||||
}
|
||||
start
|
||||
}
|
||||
fn name_text<'a>(n: &tree_sitter::Node, src: &'a str) -> Option<&'a str> {
|
||||
n.child_by_field_name("name")
|
||||
.map(|c| &src[c.start_byte()..c.end_byte()])
|
||||
}
|
||||
/// Walk a class body, emitting one unit per `method_definition`.
|
||||
/// Class names already pushed onto `mod_path` by the caller, so
|
||||
/// method symbols come out as `<mod_prefix>.<Class>.<method>`.
|
||||
fn walk_class_body(
|
||||
body: tree_sitter::Node,
|
||||
src: &str,
|
||||
mod_prefix: &str,
|
||||
mod_path: &[String],
|
||||
units: &mut Vec<(String, u32, u32, bool)>,
|
||||
) {
|
||||
let mut cur = body.walk();
|
||||
for child in body.named_children(&mut cur) {
|
||||
if child.kind() == "method_definition" {
|
||||
if let Some(name) = name_text(&child, src) {
|
||||
let s = unit_start(&child);
|
||||
let e = child.end_position().row as u32 + 1;
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, s, e, true));
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
fn walk(
|
||||
node: tree_sitter::Node,
|
||||
src: &str,
|
||||
mod_prefix: &str,
|
||||
mod_path: &[String],
|
||||
units: &mut Vec<(String, u32, u32, bool)>,
|
||||
glue: &mut Vec<(usize, u32, u32)>,
|
||||
) {
|
||||
let mut cur = node.walk();
|
||||
for child in node.named_children(&mut cur) {
|
||||
let s = unit_start(&child);
|
||||
let e = child.end_position().row as u32 + 1;
|
||||
match child.kind() {
|
||||
"function_declaration" => {
|
||||
if let Some(name) = name_text(&child, src) {
|
||||
glue.retain(|(_, gs, _)| *gs < s);
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, s, e, true));
|
||||
}
|
||||
}
|
||||
"class_declaration" => {
|
||||
if let Some(name) = name_text(&child, src) {
|
||||
glue.retain(|(_, gs, _)| *gs < s);
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, s, e, true));
|
||||
if let Some(body) = child.child_by_field_name("body") {
|
||||
let mut np = mod_path.to_vec();
|
||||
np.push(name.to_string());
|
||||
walk_class_body(body, src, mod_prefix, &np, units);
|
||||
}
|
||||
}
|
||||
}
|
||||
"interface_declaration"
|
||||
| "type_alias_declaration"
|
||||
| "enum_declaration" => {
|
||||
if let Some(name) = name_text(&child, src) {
|
||||
glue.retain(|(_, gs, _)| *gs < s);
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
let sym = join_symbol(mod_prefix, mod_path, name);
|
||||
units.push((sym, s, e, true));
|
||||
}
|
||||
}
|
||||
"export_statement" => {
|
||||
// Try field "declaration" first (export class /
|
||||
// function / interface / type / enum). If absent,
|
||||
// fall back to "value" — `export default function
|
||||
// () {}` / `export default class {}` expose the
|
||||
// anonymous function_expression / class under the
|
||||
// `value` field (TS grammar 0.23).
|
||||
let outer_s = s; // includes `export ` prefix line
|
||||
let outer_e = e;
|
||||
if let Some(inner) = child.child_by_field_name("declaration") {
|
||||
let inner_kind = inner.kind();
|
||||
match inner_kind {
|
||||
"function_declaration"
|
||||
| "class_declaration"
|
||||
| "interface_declaration"
|
||||
| "type_alias_declaration"
|
||||
| "enum_declaration" => {
|
||||
let name_opt = name_text(&inner, src).map(|s| s.to_string());
|
||||
if let Some(name) = name_opt {
|
||||
glue.retain(|(_, gs, _)| *gs < outer_s);
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
let sym =
|
||||
join_symbol(mod_prefix, mod_path, &name);
|
||||
units.push((sym, outer_s, outer_e, true));
|
||||
if inner_kind == "class_declaration" {
|
||||
if let Some(body) =
|
||||
inner.child_by_field_name("body")
|
||||
{
|
||||
let mut np = mod_path.to_vec();
|
||||
np.push(name);
|
||||
walk_class_body(
|
||||
body, src, mod_prefix, &np, units,
|
||||
);
|
||||
}
|
||||
}
|
||||
} else {
|
||||
// `export default function foo() {}`
|
||||
// path is covered by name_opt =
|
||||
// Some(_) above; the no-name path
|
||||
// here is `export default` with a
|
||||
// function_declaration that
|
||||
// somehow lacks `name`. Emit
|
||||
// `default` defensively.
|
||||
glue.retain(|(_, gs, _)| *gs < outer_s);
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
let sym =
|
||||
join_symbol(mod_prefix, mod_path, "default");
|
||||
units.push((sym, outer_s, outer_e, true));
|
||||
}
|
||||
}
|
||||
// `lexical_declaration` etc. wrapped in
|
||||
// export: treat as glue (assigned arrow
|
||||
// fns / consts don't get their own unit).
|
||||
_ => {
|
||||
glue.push((0, s, e));
|
||||
}
|
||||
}
|
||||
} else if let Some(value) = child.child_by_field_name("value") {
|
||||
// `export default <expr>`. We emit a unit only
|
||||
// for the function / class shapes (named or
|
||||
// anonymous); other value shapes are glue.
|
||||
match value.kind() {
|
||||
"function_expression"
|
||||
| "function_declaration"
|
||||
| "class"
|
||||
| "class_declaration" => {
|
||||
let name_opt =
|
||||
name_text(&value, src).map(|s| s.to_string());
|
||||
let leaf = name_opt
|
||||
.as_deref()
|
||||
.unwrap_or("default")
|
||||
.to_string();
|
||||
glue.retain(|(_, gs, _)| *gs < outer_s);
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
let sym = join_symbol(mod_prefix, mod_path, &leaf);
|
||||
units.push((sym, outer_s, outer_e, true));
|
||||
// Recurse into class body if we have one.
|
||||
if matches!(
|
||||
value.kind(),
|
||||
"class" | "class_declaration"
|
||||
) {
|
||||
if let Some(body) =
|
||||
value.child_by_field_name("body")
|
||||
{
|
||||
let mut np = mod_path.to_vec();
|
||||
np.push(leaf);
|
||||
walk_class_body(
|
||||
body, src, mod_prefix, &np, units,
|
||||
);
|
||||
}
|
||||
}
|
||||
}
|
||||
_ => {
|
||||
glue.push((0, s, e));
|
||||
}
|
||||
}
|
||||
} else {
|
||||
// Bare `export { x };` / `export * from "..."` —
|
||||
// a re-export, glue with module-only flag set
|
||||
// (we have no `declaration` / `value` field for
|
||||
// it).
|
||||
glue.push((1, s, e));
|
||||
}
|
||||
}
|
||||
"import_statement" => {
|
||||
glue.push((1, s, e));
|
||||
}
|
||||
"lexical_declaration" | "variable_declaration" => {
|
||||
glue.push((0, s, e));
|
||||
}
|
||||
// Namespace / module declarations (rare in app code,
|
||||
// common in `.d.ts`): treat as glue per plan §Task H
|
||||
// (1B 1차 scope; documented under spec Risks).
|
||||
"internal_module" | "module" | "ambient_declaration" => {
|
||||
glue.push((0, s, e));
|
||||
}
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
flush_glue(glue, units, mod_prefix, mod_path);
|
||||
}
|
||||
fn flush_glue(
|
||||
glue: &mut Vec<(usize, u32, u32)>,
|
||||
units: &mut Vec<(String, u32, u32, bool)>,
|
||||
mod_prefix: &str,
|
||||
mod_path: &[String],
|
||||
) {
|
||||
if glue.is_empty() {
|
||||
return;
|
||||
}
|
||||
let s = glue.iter().map(|(_, a, _)| *a).min().unwrap();
|
||||
let e = glue.iter().map(|(_, _, b)| *b).max().unwrap();
|
||||
let only_module = glue.iter().all(|(is_mod, _, _)| *is_mod == 1);
|
||||
let label = if only_module { "<module>" } else { "<top-level>" };
|
||||
units.push((join_symbol(mod_prefix, mod_path, label), s, e, false));
|
||||
glue.clear();
|
||||
}
|
||||
|
||||
walk(
|
||||
tree.root_node(),
|
||||
source,
|
||||
mod_prefix,
|
||||
&[],
|
||||
&mut units,
|
||||
&mut glue,
|
||||
);
|
||||
|
||||
// `<module>` is correct only when the file produced no real unit.
|
||||
// Otherwise the import-only group becomes `<top-level>` (same
|
||||
// post-pass as 1A Gap 1 / Python).
|
||||
let has_real_unit = units.iter().any(|(_, _, _, is_real)| *is_real);
|
||||
if has_real_unit {
|
||||
for (sym, _, _, is_real) in units.iter_mut() {
|
||||
if !*is_real && sym.ends_with("<module>") {
|
||||
let pre = &sym[..sym.len() - "<module>".len()];
|
||||
*sym = format!("{pre}<top-level>");
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let total_lines = lines.len() as u32;
|
||||
let mut blocks = Vec::with_capacity(units.len());
|
||||
for (ordinal, (symbol, ls, le, _is_real)) in units.into_iter().enumerate() {
|
||||
let line_start = ls.max(1);
|
||||
let line_end = le.min(total_lines.max(1));
|
||||
let span = SourceSpan::Code {
|
||||
line_start,
|
||||
line_end,
|
||||
symbol: Some(symbol),
|
||||
lang: Some("typescript".to_string()),
|
||||
};
|
||||
let block_id = id_for_block(doc_id, "code", &[], ordinal as u32, &span);
|
||||
let code = lines[(line_start as usize - 1)..=(line_end as usize - 1)].join("\n");
|
||||
blocks.push(Block::Code(CodeBlock {
|
||||
common: CommonBlock {
|
||||
block_id,
|
||||
heading_path: Vec::new(),
|
||||
source_span: span,
|
||||
},
|
||||
lang: Some("typescript".to_string()),
|
||||
code,
|
||||
}));
|
||||
}
|
||||
Ok(blocks)
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use kebab_core::{Block, MediaType, SourceSpan};
|
||||
|
||||
fn extract_fixture(name: &str, workspace_path: &str) -> kebab_core::CanonicalDocument {
|
||||
let bytes = std::fs::read(format!(
|
||||
concat!(env!("CARGO_MANIFEST_DIR"), "/tests/fixtures/{}"),
|
||||
name
|
||||
))
|
||||
.unwrap();
|
||||
let asset = crate::rust::tests_support::fixed_code_asset(workspace_path, "typescript");
|
||||
let cfg = kebab_core::ExtractConfig::default();
|
||||
let root = std::path::PathBuf::from("/tmp");
|
||||
let ctx = kebab_core::ExtractContext {
|
||||
asset: &asset,
|
||||
workspace_root: &root,
|
||||
config: &cfg,
|
||||
};
|
||||
TypescriptAstExtractor::new()
|
||||
.extract(&ctx, &bytes)
|
||||
.unwrap()
|
||||
}
|
||||
|
||||
fn symbols(doc: &kebab_core::CanonicalDocument) -> Vec<String> {
|
||||
let mut s: Vec<String> = doc
|
||||
.blocks
|
||||
.iter()
|
||||
.filter_map(|b| match b {
|
||||
Block::Code(c) => match &c.common.source_span {
|
||||
SourceSpan::Code { symbol, lang, .. } => {
|
||||
assert_eq!(lang.as_deref(), Some("typescript"));
|
||||
symbol.clone()
|
||||
}
|
||||
_ => None,
|
||||
},
|
||||
_ => None,
|
||||
})
|
||||
.collect();
|
||||
s.sort();
|
||||
s
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn extractor_supports_only_media_code_typescript() {
|
||||
let e = TypescriptAstExtractor::new();
|
||||
assert!(e.supports(&MediaType::Code("typescript".into())));
|
||||
assert!(!e.supports(&MediaType::Code("rust".into())));
|
||||
assert!(!e.supports(&MediaType::Markdown));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn ts_units_match_design_3_4_symbols() {
|
||||
// workspace_path `src/sample.ts` → mod_prefix `src/sample`
|
||||
let doc = extract_fixture("sample.ts", "src/sample.ts");
|
||||
let syms = symbols(&doc);
|
||||
assert!(syms.iter().any(|s| s == "src/sample.add"), "got {syms:?}");
|
||||
assert!(syms.iter().any(|s| s == "src/sample.Greet"));
|
||||
assert!(syms.iter().any(|s| s == "src/sample.Maybe"));
|
||||
assert!(syms.iter().any(|s| s == "src/sample.Retriever"));
|
||||
assert!(syms.iter().any(|s| s == "src/sample.Retriever.search"));
|
||||
assert!(syms.iter().any(|s| s == "src/sample.Retriever.create"));
|
||||
assert!(syms.iter().any(|s| s == "src/sample.default"));
|
||||
assert!(syms.iter().any(|s| s == "src/sample.<top-level>"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn tsx_uses_tsx_grammar_and_emits_units() {
|
||||
let doc = extract_fixture("sample.tsx", "src/sample.tsx");
|
||||
let syms = symbols(&doc);
|
||||
assert!(
|
||||
syms.iter().any(|s| s == "src/sample.Hello"),
|
||||
"got {syms:?}"
|
||||
);
|
||||
assert!(
|
||||
syms.iter().any(|s| s == "src/sample.<top-level>"),
|
||||
"arrow fn + import should roll into top-level glue"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deterministic_across_runs() {
|
||||
let a = extract_fixture("sample.ts", "src/sample.ts");
|
||||
for _ in 0..30 {
|
||||
assert_eq!(extract_fixture("sample.ts", "src/sample.ts").blocks, a.blocks);
|
||||
}
|
||||
}
|
||||
|
||||
/// Regression: TS class-method decorators are `class_body` preceding
|
||||
/// siblings (not children of `method_definition`). The `unit_start`
|
||||
/// backward walk must fold the decorator line into the emitted unit's
|
||||
/// line range, matching Python's `decorated_definition` behavior.
|
||||
#[test]
|
||||
fn class_method_decorator_folded_into_method_unit() {
|
||||
// Line 1 (1-indexed): "class Foo {"
|
||||
// Line 2: " @Log()" <- decorator
|
||||
// Line 3: " bar() { return 1; }"
|
||||
// Line 4: "}"
|
||||
let bytes = b"class Foo {\n @Log()\n bar() { return 1; }\n}\n";
|
||||
let asset = crate::rust::tests_support::fixed_code_asset("src/foo.ts", "typescript");
|
||||
let cfg = kebab_core::ExtractConfig::default();
|
||||
let root = std::path::PathBuf::from("/tmp");
|
||||
let ctx = kebab_core::ExtractContext {
|
||||
asset: &asset,
|
||||
workspace_root: &root,
|
||||
config: &cfg,
|
||||
};
|
||||
let doc = TypescriptAstExtractor::new().extract(&ctx, bytes).unwrap();
|
||||
|
||||
let bar_block = doc
|
||||
.blocks
|
||||
.iter()
|
||||
.find_map(|b| match b {
|
||||
Block::Code(c) => match &c.common.source_span {
|
||||
SourceSpan::Code { symbol, .. }
|
||||
if symbol.as_deref() == Some("src/foo.Foo.bar") =>
|
||||
{
|
||||
Some(c)
|
||||
}
|
||||
_ => None,
|
||||
},
|
||||
_ => None,
|
||||
})
|
||||
.expect("src/foo.Foo.bar block should be present");
|
||||
|
||||
// After the fix, the unit MUST include the @Log() decorator line.
|
||||
assert!(
|
||||
bar_block.code.contains("@Log()"),
|
||||
"decorator must be folded into class-method unit (Python parity); got code: {:?}",
|
||||
bar_block.code
|
||||
);
|
||||
|
||||
// line_start must be 2 (the @Log() line), NOT 3 (the bar() line).
|
||||
match &bar_block.common.source_span {
|
||||
SourceSpan::Code { line_start, .. } => {
|
||||
assert_eq!(
|
||||
*line_start, 2,
|
||||
"line_start must cover the @Log() decorator line (got {line_start})"
|
||||
);
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
}
|
||||
|
||||
/// Class-level decorator (preceding sibling of `class_declaration` in
|
||||
/// the module root): same `unit_start` backward walk folds it in.
|
||||
/// Line 1: "@Injectable()"
|
||||
/// Line 2: "class Service {"
|
||||
/// Line 3: "}"
|
||||
#[test]
|
||||
fn ts_class_decorator_folded_into_class_unit() {
|
||||
let bytes = b"@Injectable()\nclass Service {\n}\n";
|
||||
let asset = crate::rust::tests_support::fixed_code_asset("src/svc.ts", "typescript");
|
||||
let cfg = kebab_core::ExtractConfig::default();
|
||||
let root = std::path::PathBuf::from("/tmp");
|
||||
let ctx = kebab_core::ExtractContext {
|
||||
asset: &asset,
|
||||
workspace_root: &root,
|
||||
config: &cfg,
|
||||
};
|
||||
let doc = TypescriptAstExtractor::new().extract(&ctx, bytes).unwrap();
|
||||
|
||||
let svc_block = doc
|
||||
.blocks
|
||||
.iter()
|
||||
.find_map(|b| match b {
|
||||
Block::Code(c) => match &c.common.source_span {
|
||||
SourceSpan::Code { symbol, .. }
|
||||
if symbol.as_deref() == Some("src/svc.Service") =>
|
||||
{
|
||||
Some(c)
|
||||
}
|
||||
_ => None,
|
||||
},
|
||||
_ => None,
|
||||
})
|
||||
.expect("src/svc.Service block should be present");
|
||||
|
||||
assert!(
|
||||
svc_block.code.contains("@Injectable()"),
|
||||
"class-level decorator must be folded into the class unit; got code: {:?}",
|
||||
svc_block.code
|
||||
);
|
||||
match &svc_block.common.source_span {
|
||||
SourceSpan::Code { line_start, .. } => {
|
||||
assert_eq!(
|
||||
*line_start, 1,
|
||||
"line_start must cover the @Injectable() line (got {line_start})"
|
||||
);
|
||||
}
|
||||
_ => unreachable!(),
|
||||
}
|
||||
}
|
||||
}
|
||||
9
crates/kebab-parse-code/tests/fixtures/sample.js
vendored
Normal file
9
crates/kebab-parse-code/tests/fixtures/sample.js
vendored
Normal file
@@ -0,0 +1,9 @@
|
||||
// sample.js
|
||||
import { x } from "./other";
|
||||
const ANSWER = 42;
|
||||
export function add(a, b) { return a + b; }
|
||||
export class Retriever {
|
||||
search(q) { return []; }
|
||||
static create() { return new Retriever(); }
|
||||
}
|
||||
export default function () { return 1; }
|
||||
26
crates/kebab-parse-code/tests/fixtures/sample.py
vendored
Normal file
26
crates/kebab-parse-code/tests/fixtures/sample.py
vendored
Normal file
@@ -0,0 +1,26 @@
|
||||
"""sample fixture."""
|
||||
import os
|
||||
|
||||
ANSWER = 42
|
||||
|
||||
@no_type_check
|
||||
def free(x):
|
||||
"""free fn."""
|
||||
return x + 1
|
||||
|
||||
class Foo:
|
||||
"""doc."""
|
||||
def double(self, n):
|
||||
return n * 2
|
||||
|
||||
@classmethod
|
||||
def name(cls):
|
||||
return "foo"
|
||||
|
||||
class Outer:
|
||||
class Inner:
|
||||
def helper(self):
|
||||
return True
|
||||
|
||||
def with_decorator():
|
||||
pass
|
||||
35
crates/kebab-parse-code/tests/fixtures/sample.rs
vendored
Normal file
35
crates/kebab-parse-code/tests/fixtures/sample.rs
vendored
Normal file
@@ -0,0 +1,35 @@
|
||||
//! sample fixture
|
||||
|
||||
use std::fmt;
|
||||
|
||||
const ANSWER: u32 = 42;
|
||||
|
||||
/// Doc comment on a free fn.
|
||||
pub fn parse(input: &str) -> usize {
|
||||
input.len()
|
||||
}
|
||||
|
||||
pub struct Foo {
|
||||
pub n: u32,
|
||||
}
|
||||
|
||||
impl Foo {
|
||||
/// method doc
|
||||
pub fn double(&self) -> u32 {
|
||||
self.n * 2
|
||||
}
|
||||
|
||||
fn name() -> &'static str {
|
||||
"foo"
|
||||
}
|
||||
}
|
||||
|
||||
pub trait Greet {
|
||||
fn hello(&self) -> String;
|
||||
}
|
||||
|
||||
mod inner {
|
||||
pub fn helper() -> bool {
|
||||
true
|
||||
}
|
||||
}
|
||||
11
crates/kebab-parse-code/tests/fixtures/sample.ts
vendored
Normal file
11
crates/kebab-parse-code/tests/fixtures/sample.ts
vendored
Normal file
@@ -0,0 +1,11 @@
|
||||
// sample.ts
|
||||
import { x } from "./other";
|
||||
const ANSWER = 42;
|
||||
export interface Greet { hello(): string; }
|
||||
export type Maybe<T> = T | null;
|
||||
export function add(a: number, b: number): number { return a + b; }
|
||||
export class Retriever {
|
||||
search(q: string): string[] { return []; }
|
||||
static create(): Retriever { return new Retriever(); }
|
||||
}
|
||||
export default function () { return 1; }
|
||||
4
crates/kebab-parse-code/tests/fixtures/sample.tsx
vendored
Normal file
4
crates/kebab-parse-code/tests/fixtures/sample.tsx
vendored
Normal file
@@ -0,0 +1,4 @@
|
||||
// sample.tsx
|
||||
import React from "react";
|
||||
export function Hello({ name }: { name: string }) { return <span>{name}</span>; }
|
||||
export const App = () => <Hello name="x" />; // arrow fn assigned → glue
|
||||
67
crates/kebab-parse-code/tests/lang.rs
Normal file
67
crates/kebab-parse-code/tests/lang.rs
Normal file
@@ -0,0 +1,67 @@
|
||||
use kebab_parse_code::code_lang_for_path;
|
||||
use std::path::Path;
|
||||
|
||||
#[test]
|
||||
fn known_extensions_map_to_canonical_identifiers() {
|
||||
let cases = [
|
||||
("foo.rs", Some("rust")),
|
||||
("foo.py", Some("python")),
|
||||
("foo.pyi", Some("python")),
|
||||
("foo.ts", Some("typescript")),
|
||||
("foo.tsx", Some("typescript")),
|
||||
("foo.mts", Some("typescript")), // ESM TS — same grammar
|
||||
("foo.cts", Some("typescript")), // CommonJS TS — same grammar
|
||||
("foo.js", Some("javascript")),
|
||||
("foo.mjs", Some("javascript")),
|
||||
("foo.cjs", Some("javascript")),
|
||||
("foo.jsx", Some("javascript")),
|
||||
("foo.go", Some("go")),
|
||||
("foo.java", Some("java")),
|
||||
("foo.kt", Some("kotlin")),
|
||||
("foo.kts", Some("kotlin")),
|
||||
("foo.c", Some("c")),
|
||||
("foo.h", Some("c")),
|
||||
("foo.cpp", Some("cpp")),
|
||||
("foo.cc", Some("cpp")),
|
||||
("foo.cxx", Some("cpp")),
|
||||
("foo.hpp", Some("cpp")),
|
||||
("foo.hh", Some("cpp")),
|
||||
("foo.hxx", Some("cpp")),
|
||||
("foo.yaml", Some("yaml")),
|
||||
("foo.yml", Some("yaml")),
|
||||
("foo.toml", Some("toml")),
|
||||
("foo.json", Some("json")),
|
||||
("foo.sh", Some("shell")),
|
||||
("foo.bash", Some("shell")),
|
||||
("foo.zsh", Some("shell")),
|
||||
("foo.mk", Some("make")),
|
||||
];
|
||||
for (path, expected) in cases {
|
||||
assert_eq!(
|
||||
code_lang_for_path(Path::new(path)),
|
||||
expected,
|
||||
"path = {path}"
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn special_filenames_map_to_identifiers() {
|
||||
assert_eq!(code_lang_for_path(Path::new("Dockerfile")), Some("dockerfile"));
|
||||
assert_eq!(code_lang_for_path(Path::new("foo.dockerfile")), Some("dockerfile"));
|
||||
assert_eq!(code_lang_for_path(Path::new("Makefile")), Some("make"));
|
||||
assert_eq!(code_lang_for_path(Path::new("GNUmakefile")), Some("make"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn unknown_extension_returns_none() {
|
||||
assert_eq!(code_lang_for_path(Path::new("foo.docx")), None);
|
||||
assert_eq!(code_lang_for_path(Path::new("foo")), None);
|
||||
assert_eq!(code_lang_for_path(Path::new("foo.unknown")), None);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn case_insensitive() {
|
||||
assert_eq!(code_lang_for_path(Path::new("Foo.RS")), Some("rust"));
|
||||
assert_eq!(code_lang_for_path(Path::new("FOO.YAML")), Some("yaml"));
|
||||
}
|
||||
62
crates/kebab-parse-code/tests/repo.rs
Normal file
62
crates/kebab-parse-code/tests/repo.rs
Normal file
@@ -0,0 +1,62 @@
|
||||
use kebab_parse_code::repo::detect_repo;
|
||||
use std::fs;
|
||||
use std::process::Command;
|
||||
use tempfile::TempDir;
|
||||
|
||||
fn init_git_repo(root: &std::path::Path) {
|
||||
let run = |args: &[&str]| {
|
||||
Command::new("git")
|
||||
.args(args)
|
||||
.current_dir(root)
|
||||
.status()
|
||||
.expect("git command failed");
|
||||
};
|
||||
run(&["init", "-q"]);
|
||||
run(&["config", "user.email", "test@test"]);
|
||||
run(&["config", "user.name", "test"]);
|
||||
fs::write(root.join("README.md"), "hi").unwrap();
|
||||
run(&["add", "README.md"]);
|
||||
run(&["commit", "-q", "-m", "init"]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn detect_repo_returns_none_outside_git() {
|
||||
let tmp = TempDir::new().unwrap();
|
||||
let nested = tmp.path().join("a/b/c.txt");
|
||||
fs::create_dir_all(nested.parent().unwrap()).unwrap();
|
||||
fs::write(&nested, "x").unwrap();
|
||||
assert!(detect_repo(&nested).is_none());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn detect_repo_walks_up_to_git_dir() {
|
||||
let tmp = TempDir::new().unwrap();
|
||||
let repo_root = tmp.path().join("myrepo");
|
||||
fs::create_dir_all(&repo_root).unwrap();
|
||||
init_git_repo(&repo_root);
|
||||
let nested = repo_root.join("src/deep/file.rs");
|
||||
fs::create_dir_all(nested.parent().unwrap()).unwrap();
|
||||
fs::write(&nested, "x").unwrap();
|
||||
|
||||
let meta = detect_repo(&nested).expect("should detect repo");
|
||||
assert_eq!(meta.name, "myrepo");
|
||||
assert!(meta.branch.is_some());
|
||||
assert!(meta.commit.is_some());
|
||||
assert_eq!(meta.commit.as_ref().unwrap().len(), 40);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn detect_repo_returns_consistent_metadata_for_paths_in_same_repo() {
|
||||
let tmp = TempDir::new().unwrap();
|
||||
let repo_root = tmp.path().join("myrepo");
|
||||
fs::create_dir_all(&repo_root).unwrap();
|
||||
init_git_repo(&repo_root);
|
||||
let f1 = repo_root.join("a.rs");
|
||||
let f2 = repo_root.join("b.rs");
|
||||
fs::write(&f1, "x").unwrap();
|
||||
fs::write(&f2, "x").unwrap();
|
||||
let m1 = detect_repo(&f1).unwrap();
|
||||
let m2 = detect_repo(&f2).unwrap();
|
||||
assert_eq!(m1.name, m2.name);
|
||||
assert_eq!(m1.commit, m2.commit);
|
||||
}
|
||||
74
crates/kebab-parse-code/tests/skip.rs
Normal file
74
crates/kebab-parse-code/tests/skip.rs
Normal file
@@ -0,0 +1,74 @@
|
||||
use kebab_parse_code::skip::{BUILTIN_BLACKLIST, is_generated_file, is_oversized};
|
||||
use std::fs;
|
||||
use tempfile::NamedTempFile;
|
||||
|
||||
#[test]
|
||||
fn generated_header_markers_trigger_skip() {
|
||||
let cases = [
|
||||
"// @generated\nfn foo() {}\n",
|
||||
"// Code generated by tonic-build. DO NOT EDIT.\nfn x() {}\n",
|
||||
"/* DO NOT EDIT */\nfn x() {}\n",
|
||||
"/* do not modify */\nfn x() {}\n",
|
||||
"// AUTOMATICALLY GENERATED\nfn x() {}\n",
|
||||
"# auto-generated\ndef x(): pass\n",
|
||||
"// autogenerated\nfn x() {}\n",
|
||||
];
|
||||
for content in cases {
|
||||
let f = NamedTempFile::new().unwrap();
|
||||
fs::write(f.path(), content).unwrap();
|
||||
assert!(is_generated_file(f.path()).unwrap(), "content: {content:?}");
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn normal_code_is_not_flagged_generated() {
|
||||
let f = NamedTempFile::new().unwrap();
|
||||
fs::write(f.path(), "fn main() {\n println!(\"hi\");\n}\n").unwrap();
|
||||
assert!(!is_generated_file(f.path()).unwrap());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn is_generated_returns_false_for_empty_file() {
|
||||
let f = NamedTempFile::new().unwrap();
|
||||
fs::write(f.path(), "").unwrap();
|
||||
assert!(!is_generated_file(f.path()).unwrap());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn oversized_by_bytes_returns_true() {
|
||||
let f = NamedTempFile::new().unwrap();
|
||||
let body: String = "x".repeat(300_000);
|
||||
fs::write(f.path(), &body).unwrap();
|
||||
assert!(is_oversized(f.path(), 262_144, 5_000).unwrap());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn oversized_by_lines_returns_true() {
|
||||
let f = NamedTempFile::new().unwrap();
|
||||
let body: String = "x\n".repeat(6_000);
|
||||
fs::write(f.path(), &body).unwrap();
|
||||
assert!(is_oversized(f.path(), 262_144, 5_000).unwrap());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn small_file_returns_false_for_oversize() {
|
||||
let f = NamedTempFile::new().unwrap();
|
||||
fs::write(f.path(), "fn foo() {}\n").unwrap();
|
||||
assert!(!is_oversized(f.path(), 262_144, 5_000).unwrap());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn builtin_blacklist_has_exactly_six_entries() {
|
||||
assert_eq!(BUILTIN_BLACKLIST.len(), 6);
|
||||
let expected = [
|
||||
"**/node_modules/**",
|
||||
"**/target/**",
|
||||
"**/__pycache__/**",
|
||||
"**/.venv/**",
|
||||
"**/venv/**",
|
||||
"**/env/**",
|
||||
];
|
||||
for pat in expected {
|
||||
assert!(BUILTIN_BLACKLIST.contains(&pat), "missing pattern: {pat}");
|
||||
}
|
||||
}
|
||||
@@ -190,6 +190,10 @@ impl Extractor for ImageExtractor {
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user,
|
||||
repo: None,
|
||||
git_branch: None,
|
||||
git_commit: None,
|
||||
code_lang: None,
|
||||
};
|
||||
|
||||
tracing::debug!(
|
||||
|
||||
@@ -471,6 +471,10 @@ fn derive_metadata(
|
||||
trust_level,
|
||||
user_id_alias,
|
||||
user,
|
||||
repo: None,
|
||||
git_branch: None,
|
||||
git_commit: None,
|
||||
code_lang: None,
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -194,6 +194,10 @@ impl Extractor for PdfTextExtractor {
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user,
|
||||
repo: None,
|
||||
git_branch: None,
|
||||
git_commit: None,
|
||||
code_lang: None,
|
||||
};
|
||||
|
||||
tracing::debug!(
|
||||
|
||||
@@ -1170,6 +1170,8 @@ mod stream_event_serde_tests {
|
||||
indexed_at: datetime!(2026-05-09 12:00:00 UTC),
|
||||
stale: false,
|
||||
score_kind: kebab_core::ScoreKind::Rrf,
|
||||
repo: None,
|
||||
code_lang: None,
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -171,6 +171,8 @@ pub fn mk_hit_with_indexed_at(
|
||||
indexed_at,
|
||||
stale: false,
|
||||
score_kind: kebab_core::ScoreKind::Rrf,
|
||||
repo: None,
|
||||
code_lang: None,
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -49,6 +49,13 @@ pub(crate) fn citation_from_first_span(
|
||||
end_ms: *end_ms,
|
||||
speaker: None,
|
||||
},
|
||||
Some(SourceSpan::Code { line_start, line_end, symbol, lang }) => Citation::Code {
|
||||
path,
|
||||
line_start: *line_start,
|
||||
line_end: *line_end,
|
||||
symbol: symbol.clone(),
|
||||
lang: lang.clone(),
|
||||
},
|
||||
// Byte-spans don't have a Citation variant. Fall back to a
|
||||
// Line citation pointing at the document head — better than
|
||||
// fabricating a position. Spans-empty falls into the same
|
||||
@@ -72,3 +79,43 @@ pub(crate) fn citation_from_first_span(
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use kebab_core::{Citation, SourceSpan, WorkspacePath};
|
||||
|
||||
#[test]
|
||||
fn build_citation_code_maps_symbol_and_lang() {
|
||||
let span = SourceSpan::Code {
|
||||
line_start: 5,
|
||||
line_end: 30,
|
||||
symbol: Some("chunk::md_heading_v1::MdHeadingV1Chunker::chunk".into()),
|
||||
lang: Some("rust".into()),
|
||||
};
|
||||
let c = super::citation_from_first_span(
|
||||
"c1",
|
||||
WorkspacePath::new("crates/kebab-chunk/src/md_heading_v1.rs".to_string()).unwrap(),
|
||||
None,
|
||||
Some(&span),
|
||||
);
|
||||
match c {
|
||||
Citation::Code {
|
||||
path,
|
||||
line_start,
|
||||
line_end,
|
||||
symbol,
|
||||
lang,
|
||||
} => {
|
||||
assert_eq!(path.0, "crates/kebab-chunk/src/md_heading_v1.rs");
|
||||
assert_eq!(line_start, 5);
|
||||
assert_eq!(line_end, 30);
|
||||
assert_eq!(
|
||||
symbol.as_deref(),
|
||||
Some("chunk::md_heading_v1::MdHeadingV1Chunker::chunk")
|
||||
);
|
||||
assert_eq!(lang.as_deref(), Some("rust"));
|
||||
}
|
||||
other => panic!("expected Citation::Code, got {other:?}"),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -509,6 +509,8 @@ mod tests {
|
||||
indexed_at: time::OffsetDateTime::UNIX_EPOCH,
|
||||
stale: false,
|
||||
score_kind: kebab_core::ScoreKind::Rrf,
|
||||
repo: None,
|
||||
code_lang: None,
|
||||
}
|
||||
}
|
||||
|
||||
@@ -760,6 +762,8 @@ mod tests {
|
||||
indexed_at: time::OffsetDateTime::UNIX_EPOCH,
|
||||
stale: false,
|
||||
score_kind: kebab_core::ScoreKind::Rrf,
|
||||
repo: None,
|
||||
code_lang: None,
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -346,6 +346,34 @@ fn run_query(
|
||||
}
|
||||
}
|
||||
|
||||
// p10-1A-1 fix (dogfood-discovered 2026-05-20): code_lang filter
|
||||
// (IN-list on metadata_json.$.code_lang). Empty Vec = no filter.
|
||||
if !filters.code_lang.is_empty() {
|
||||
let placeholders = std::iter::repeat_n("?", filters.code_lang.len())
|
||||
.collect::<Vec<_>>()
|
||||
.join(",");
|
||||
sql.push_str(&format!(
|
||||
" AND json_extract(d.metadata_json, '$.code_lang') IN ({placeholders})"
|
||||
));
|
||||
for lang in &filters.code_lang {
|
||||
params.push(Box::new(lang.clone()));
|
||||
}
|
||||
}
|
||||
|
||||
// p10-1A-1 fix (dogfood-discovered 2026-05-20): repo filter
|
||||
// (IN-list on metadata_json.$.repo). Empty Vec = no filter.
|
||||
if !filters.repo.is_empty() {
|
||||
let placeholders = std::iter::repeat_n("?", filters.repo.len())
|
||||
.collect::<Vec<_>>()
|
||||
.join(",");
|
||||
sql.push_str(&format!(
|
||||
" AND json_extract(d.metadata_json, '$.repo') IN ({placeholders})"
|
||||
));
|
||||
for repo in &filters.repo {
|
||||
params.push(Box::new(repo.clone()));
|
||||
}
|
||||
}
|
||||
|
||||
// p9-fb-36: ingested_after filter.
|
||||
// `documents.updated_at` is RFC3339 stored as TEXT (always UTC `Z` per
|
||||
// fb-32 ingest path), so lexicographic >= compare is correct — but only
|
||||
@@ -470,6 +498,8 @@ fn build_hit(
|
||||
// in `RagPipeline::ask` against the configured threshold.
|
||||
stale: false,
|
||||
score_kind: ScoreKind::Bm25,
|
||||
repo: None,
|
||||
code_lang: None,
|
||||
})
|
||||
}
|
||||
|
||||
|
||||
@@ -327,6 +327,8 @@ fn build_hit(
|
||||
// in `RagPipeline::ask` against the configured threshold.
|
||||
stale: false,
|
||||
score_kind: ScoreKind::Cosine,
|
||||
repo: None,
|
||||
code_lang: None,
|
||||
})
|
||||
}
|
||||
|
||||
|
||||
@@ -785,6 +785,19 @@ impl TestEnv {
|
||||
body: &str,
|
||||
media: MediaType,
|
||||
updated_at: OffsetDateTime,
|
||||
) -> DocumentId {
|
||||
self.insert_doc_full_with_metadata(path, body, media, updated_at, "{}")
|
||||
}
|
||||
|
||||
/// Like `insert_doc_full` but accepts an explicit `metadata_json` string
|
||||
/// so p10-1A-1 filter tests can set `metadata.code_lang` / `metadata.repo`.
|
||||
fn insert_doc_full_with_metadata(
|
||||
&self,
|
||||
path: &str,
|
||||
body: &str,
|
||||
media: MediaType,
|
||||
updated_at: OffsetDateTime,
|
||||
metadata_json: &str,
|
||||
) -> DocumentId {
|
||||
use time::format_description::well_known::Rfc3339;
|
||||
let doc_id = self.next_id("doc");
|
||||
@@ -810,10 +823,10 @@ impl TestEnv {
|
||||
source_type, trust_level, parser_version,
|
||||
doc_version, schema_version, metadata_json,
|
||||
provenance_json, created_at, updated_at
|
||||
) VALUES (?, ?, ?, NULL, 'en', 'markdown', 'primary', 'pv1', 1, 1,
|
||||
'{}', '{\"events\":[]}',
|
||||
) VALUES (?, ?, ?, NULL, 'en', 'code', 'primary', 'pv1', 1, 1,
|
||||
?, '{\"events\":[]}',
|
||||
'2024-01-01T00:00:00Z', ?)",
|
||||
rusqlite::params![doc_id, asset_id, path, updated_at_str],
|
||||
rusqlite::params![doc_id, asset_id, path, metadata_json, updated_at_str],
|
||||
)
|
||||
.expect("insert document");
|
||||
|
||||
@@ -834,6 +847,21 @@ impl TestEnv {
|
||||
DocumentId(doc_id)
|
||||
}
|
||||
|
||||
/// Insert a code doc with explicit `code_lang` and optional `repo` in metadata.
|
||||
fn insert_code_doc(&self, path: &str, body: &str, code_lang: &str, repo: Option<&str>) -> DocumentId {
|
||||
let metadata_json = match repo {
|
||||
Some(r) => format!(r#"{{"code_lang":"{code_lang}","repo":"{r}"}}"#),
|
||||
None => format!(r#"{{"code_lang":"{code_lang}"}}"#),
|
||||
};
|
||||
self.insert_doc_full_with_metadata(
|
||||
path,
|
||||
body,
|
||||
MediaType::Markdown,
|
||||
OffsetDateTime::now_utc(),
|
||||
&metadata_json,
|
||||
)
|
||||
}
|
||||
|
||||
fn run_search(&self, query: &str, filters: &SearchFilters) -> Vec<SearchHit> {
|
||||
let r = self.inner.retriever();
|
||||
let q = SearchQuery {
|
||||
@@ -934,6 +962,52 @@ fn lexical_empty_filters_match_default_behavior() {
|
||||
assert!(!with_default.is_empty());
|
||||
}
|
||||
|
||||
// ── p10-1A-1 filter tests ────────────────────────────────────────────────
|
||||
|
||||
#[test]
|
||||
fn lexical_filter_by_code_lang() {
|
||||
// Three docs: python code, rust code, markdown (no code_lang).
|
||||
// Filter code_lang=["python"] → only the python doc should match.
|
||||
let env = TestEnv::new();
|
||||
env.insert_code_doc("src/main.py", "AsyncClient session", "python", None);
|
||||
env.insert_code_doc("src/lib.rs", "AsyncClient session", "rust", None);
|
||||
env.insert_doc("docs/guide.md", "AsyncClient session");
|
||||
|
||||
let filters = SearchFilters {
|
||||
code_lang: vec!["python".to_string()],
|
||||
..Default::default()
|
||||
};
|
||||
let hits = env.run_search("AsyncClient", &filters);
|
||||
assert_eq!(hits.len(), 1, "only python doc should match code_lang filter");
|
||||
assert!(
|
||||
hits[0].doc_path.0.ends_with(".py"),
|
||||
"expected python path, got: {}",
|
||||
hits[0].doc_path.0
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn lexical_filter_by_repo() {
|
||||
// Three docs: one in repo "httpx", one in repo "requests", one with no repo.
|
||||
// Filter repo=["httpx"] → only the httpx doc should match.
|
||||
let env = TestEnv::new();
|
||||
env.insert_code_doc("httpx/client.py", "session send request", "python", Some("httpx"));
|
||||
env.insert_code_doc("requests/api.py", "session send request", "python", Some("requests"));
|
||||
env.insert_code_doc("standalone.py", "session send request", "python", None);
|
||||
|
||||
let filters = SearchFilters {
|
||||
repo: vec!["httpx".to_string()],
|
||||
..Default::default()
|
||||
};
|
||||
let hits = env.run_search("session", &filters);
|
||||
assert_eq!(hits.len(), 1, "only httpx doc should match repo filter");
|
||||
assert!(
|
||||
hits[0].doc_path.0.starts_with("httpx/"),
|
||||
"expected httpx path, got: {}",
|
||||
hits[0].doc_path.0
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn lexical_snapshot_run_1() {
|
||||
// Pinned snapshot. A small, deterministic corpus; the JSON shape of
|
||||
|
||||
@@ -10,6 +10,7 @@ description = "Local filesystem SourceConnector — walks workspace.root + app
|
||||
[dependencies]
|
||||
kebab-core = { path = "../kebab-core" }
|
||||
kebab-config = { path = "../kebab-config" }
|
||||
kebab-parse-code = { path = "../kebab-parse-code" }
|
||||
anyhow = { workspace = true }
|
||||
serde = { workspace = true }
|
||||
time = { workspace = true }
|
||||
@@ -17,6 +18,7 @@ blake3 = { workspace = true }
|
||||
tracing = { workspace = true }
|
||||
walkdir = "2"
|
||||
ignore = "0.4"
|
||||
globset = "0.4"
|
||||
|
||||
[dev-dependencies]
|
||||
serde_json = { workspace = true }
|
||||
|
||||
@@ -10,20 +10,20 @@
|
||||
//! }
|
||||
//! ```
|
||||
|
||||
use std::path::PathBuf;
|
||||
use std::path::{Path, PathBuf};
|
||||
|
||||
use anyhow::{Context, Result};
|
||||
use time::OffsetDateTime;
|
||||
|
||||
use kebab_config::Config;
|
||||
use kebab_core::{
|
||||
AssetStorage, Checksum, RawAsset, SourceConnector, SourceScope, SourceUri,
|
||||
AssetStorage, Checksum, RawAsset, SkipExamples, SourceConnector, SourceScope, SourceUri,
|
||||
id_for_asset, to_posix,
|
||||
};
|
||||
|
||||
use crate::hash::hash_file;
|
||||
use crate::media::media_type_for;
|
||||
use crate::walker::{build_overrides, read_kbignore, walk_files};
|
||||
use crate::walker::{SkipCategory, WalkOverrides, build_overrides, read_kbignore, walk_files_with_skips};
|
||||
|
||||
/// Local-filesystem `SourceConnector`. Constructed once from `Config`,
|
||||
/// reused across `scan` calls.
|
||||
@@ -36,10 +36,16 @@ use crate::walker::{build_overrides, read_kbignore, walk_files};
|
||||
/// construction time.
|
||||
/// - `copy_threshold_bytes`: `config.storage.copy_threshold_mb * 1 MiB`
|
||||
/// pre-multiplied so we don't recompute per file.
|
||||
/// - `skip_generated_header`: `config.ingest.code.skip_generated_header`.
|
||||
/// - `max_file_bytes`: `config.ingest.code.max_file_bytes`.
|
||||
/// - `max_file_lines`: `config.ingest.code.max_file_lines`.
|
||||
pub struct FsSourceConnector {
|
||||
default_root: PathBuf,
|
||||
default_exclude: Vec<String>,
|
||||
copy_threshold_bytes: u64,
|
||||
skip_generated_header: bool,
|
||||
max_file_bytes: u64,
|
||||
max_file_lines: u32,
|
||||
}
|
||||
|
||||
impl FsSourceConnector {
|
||||
@@ -59,109 +65,229 @@ impl FsSourceConnector {
|
||||
default_root: root,
|
||||
default_exclude: config.workspace.exclude.clone(),
|
||||
copy_threshold_bytes,
|
||||
skip_generated_header: config.ingest.code.skip_generated_header,
|
||||
max_file_bytes: config.ingest.code.max_file_bytes,
|
||||
max_file_lines: config.ingest.code.max_file_lines,
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
impl SourceConnector for FsSourceConnector {
|
||||
fn scan(&self, scope: &SourceScope) -> Result<Vec<RawAsset>> {
|
||||
// `SourceScope::root` overrides config root when non-empty. This
|
||||
// matches the design's "scope is the per-call lens; config is the
|
||||
// default" split (§7.1).
|
||||
/// Resolve the effective root and build the merged + per-source overrides.
|
||||
fn resolve_scan_params(
|
||||
&self,
|
||||
scope: &SourceScope,
|
||||
) -> Result<(PathBuf, WalkOverrides)> {
|
||||
let root = if scope.root.as_os_str().is_empty() {
|
||||
self.default_root.clone()
|
||||
} else {
|
||||
scope.root.clone()
|
||||
};
|
||||
|
||||
// Union: config.workspace.exclude ∪ scope.exclude ∪ .kebabignore.
|
||||
// Per §6.2 the union of `.kebabignore` and `config.workspace.exclude`
|
||||
// is the filter set. `scope.exclude` is added on top so a caller
|
||||
// can layer a per-call narrowing.
|
||||
let mut excludes = self.default_exclude.clone();
|
||||
excludes.extend(scope.exclude.iter().cloned());
|
||||
// .kebabignore is re-read on every scan() so users can edit it without
|
||||
// restarting any long-running process.
|
||||
let kbignore = read_kbignore(&root)?;
|
||||
|
||||
let overrides = build_overrides(&root, &excludes, &kbignore)?;
|
||||
let overrides = build_overrides(&root, &excludes, &kbignore, &scope.include)?;
|
||||
Ok((root, overrides))
|
||||
}
|
||||
|
||||
// TODO(P1-2/P1-3 router): apply SourceScope::include glob filter at the
|
||||
// extractor router layer once that crate lands. SourceConnector emits all
|
||||
// non-excluded files; routing by include-glob is a downstream concern
|
||||
// (design §6.2 + §7.2 are silent on this split, treat it as router work).
|
||||
//
|
||||
// `scope.include` is intentionally ignored at this stage of the
|
||||
// pipeline: per §6.2 the workspace-level include lives in
|
||||
// `WorkspaceCfg` and is enforced by the asset writer / extractors.
|
||||
// Surfacing it here would double-filter Markdown vs PDF before the
|
||||
// extractor router gets to see them.
|
||||
if !scope.include.is_empty() {
|
||||
tracing::debug!(
|
||||
count = scope.include.len(),
|
||||
"FsSourceConnector ignores scope.include — handled by extractor router"
|
||||
);
|
||||
}
|
||||
/// Scan the workspace and return the accepted assets together with
|
||||
/// per-category skip counts and sample paths for `IngestReport`.
|
||||
///
|
||||
/// This is the **preferred entry point** for `kebab-app`: it provides
|
||||
/// all the information needed to populate `IngestReport.skipped_gitignore`,
|
||||
/// `skipped_kebabignore`, `skipped_builtin_blacklist`, and `skip_examples`
|
||||
/// without a second walker pass.
|
||||
pub fn scan_with_skips(
|
||||
&self,
|
||||
scope: &SourceScope,
|
||||
) -> Result<(Vec<RawAsset>, FsScanSkips)> {
|
||||
let (root, overrides) = self.resolve_scan_params(scope)?;
|
||||
|
||||
let files = walk_files(&root, &overrides)?;
|
||||
let (files, skipped_entries) = walk_files_with_skips(&root, &overrides)?;
|
||||
|
||||
let mut assets = Vec::with_capacity(files.len());
|
||||
for abs in &files {
|
||||
// `to_posix` does NFC + leading `./` strip + `#` rejection.
|
||||
// Compute the workspace-relative path before handing to it so
|
||||
// emitted `WorkspacePath` is always relative.
|
||||
let rel = abs.strip_prefix(&root).unwrap_or(abs);
|
||||
let workspace_path = match to_posix(rel) {
|
||||
Ok(p) => p,
|
||||
Err(e) => {
|
||||
// A path containing `#` is the only documented reason
|
||||
// `to_posix` fails today. Drop the file with a warning
|
||||
// rather than aborting the entire scan — a single bad
|
||||
// filename should not nuke a 10 000-file ingest.
|
||||
tracing::warn!(
|
||||
path = %abs.display(),
|
||||
error = %e,
|
||||
"skipping file: path is not a valid WorkspacePath",
|
||||
// Accumulate per-category skip counts and sample paths.
|
||||
let mut fs_skips = FsScanSkips::default();
|
||||
for entry in &skipped_entries {
|
||||
match entry.category {
|
||||
SkipCategory::BuiltinBlacklist => {
|
||||
fs_skips.skipped_builtin_blacklist =
|
||||
fs_skips.skipped_builtin_blacklist.saturating_add(1);
|
||||
push_sample(
|
||||
&mut fs_skips.skip_examples.builtin_blacklist,
|
||||
&entry.path,
|
||||
&root,
|
||||
);
|
||||
continue;
|
||||
}
|
||||
};
|
||||
|
||||
let media_type = media_type_for(abs);
|
||||
let (byte_len, full_hex) = hash_file(abs)
|
||||
.with_context(|| format!("hashing {}", abs.display()))?;
|
||||
let checksum = Checksum(full_hex.clone());
|
||||
let asset_id = id_for_asset(&full_hex);
|
||||
|
||||
// Storage variant signals *intent*, not an actual copy.
|
||||
// P1-6 (asset writer) is responsible for the on-disk copy.
|
||||
let stored = if byte_len > self.copy_threshold_bytes {
|
||||
AssetStorage::Reference {
|
||||
path: abs.clone(),
|
||||
sha: checksum.clone(),
|
||||
SkipCategory::Gitignore => {
|
||||
fs_skips.skipped_gitignore =
|
||||
fs_skips.skipped_gitignore.saturating_add(1);
|
||||
push_sample(
|
||||
&mut fs_skips.skip_examples.gitignore,
|
||||
&entry.path,
|
||||
&root,
|
||||
);
|
||||
}
|
||||
} else {
|
||||
AssetStorage::Copied { path: abs.clone() }
|
||||
};
|
||||
|
||||
assets.push(RawAsset {
|
||||
asset_id,
|
||||
source_uri: SourceUri::File(abs.clone()),
|
||||
workspace_path,
|
||||
media_type,
|
||||
byte_len,
|
||||
checksum,
|
||||
discovered_at: OffsetDateTime::now_utc(),
|
||||
stored,
|
||||
});
|
||||
SkipCategory::Kebabignore => {
|
||||
fs_skips.skipped_kebabignore =
|
||||
fs_skips.skipped_kebabignore.saturating_add(1);
|
||||
// kebabignore intentionally NOT in skip_examples per spec §5.5.
|
||||
}
|
||||
SkipCategory::Other => {
|
||||
// DEFAULT_EXCLUDES or config.workspace.exclude — no dedicated
|
||||
// IngestReport counter; these are lumped into the existing
|
||||
// `skipped` field by kebab-app.
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Determinism: sort by workspace_path. WorkspacePath is a String
|
||||
// newtype with stable lexicographic ordering. Two scans of the
|
||||
// same tree must produce identical Vec<RawAsset> modulo the
|
||||
// wall-clock `discovered_at` field.
|
||||
assets.sort_by(|a, b| a.workspace_path.0.cmp(&b.workspace_path.0));
|
||||
// p10-1A-1: apply per-file generated-header + size-cap checks on files
|
||||
// that passed the override (gitignore/builtin/kebabignore) matching.
|
||||
// These run AFTER the walk-level skip attribution, BEFORE parse dispatch.
|
||||
let mut accepted_files: Vec<PathBuf> = Vec::with_capacity(files.len());
|
||||
for abs_path in files {
|
||||
let rel_path = abs_path.strip_prefix(&root).unwrap_or(&abs_path);
|
||||
|
||||
// Generated-header sniff (config-gated).
|
||||
if self.skip_generated_header
|
||||
&& kebab_parse_code::is_generated_file(&abs_path).unwrap_or(false)
|
||||
{
|
||||
fs_skips.skipped_generated =
|
||||
fs_skips.skipped_generated.saturating_add(1);
|
||||
push_sample(
|
||||
&mut fs_skips.skip_examples.generated,
|
||||
&abs_path,
|
||||
&root,
|
||||
);
|
||||
tracing::debug!(
|
||||
path = %rel_path.display(),
|
||||
"skip: generated-file marker detected"
|
||||
);
|
||||
continue;
|
||||
}
|
||||
|
||||
// Size-cap check (byte or line limit).
|
||||
if kebab_parse_code::is_oversized(
|
||||
&abs_path,
|
||||
self.max_file_bytes,
|
||||
self.max_file_lines,
|
||||
)
|
||||
.unwrap_or(false)
|
||||
{
|
||||
fs_skips.skipped_size_exceeded =
|
||||
fs_skips.skipped_size_exceeded.saturating_add(1);
|
||||
push_sample(
|
||||
&mut fs_skips.skip_examples.size_exceeded,
|
||||
&abs_path,
|
||||
&root,
|
||||
);
|
||||
tracing::debug!(
|
||||
path = %rel_path.display(),
|
||||
max_bytes = self.max_file_bytes,
|
||||
max_lines = self.max_file_lines,
|
||||
"skip: file exceeds size cap"
|
||||
);
|
||||
continue;
|
||||
}
|
||||
|
||||
accepted_files.push(abs_path);
|
||||
}
|
||||
|
||||
let assets = build_assets(&accepted_files, &root, self.copy_threshold_bytes)?;
|
||||
Ok((assets, fs_skips))
|
||||
}
|
||||
}
|
||||
|
||||
/// Per-category skip counts and sample paths returned alongside the asset list
|
||||
/// by [`FsSourceConnector::scan_with_skips`].
|
||||
///
|
||||
/// Populated from the walker's per-source matchers without a second pass.
|
||||
#[derive(Debug, Default)]
|
||||
pub struct FsScanSkips {
|
||||
pub skipped_gitignore: u32,
|
||||
pub skipped_kebabignore: u32,
|
||||
pub skipped_builtin_blacklist: u32,
|
||||
/// p10-1A-1: files skipped because their first ~512 bytes contained a
|
||||
/// generated-file marker (`@generated`, `do not edit`, …).
|
||||
pub skipped_generated: u32,
|
||||
/// p10-1A-1: files skipped because they exceeded `max_file_bytes` or
|
||||
/// `max_file_lines` in `[ingest.code]`.
|
||||
pub skipped_size_exceeded: u32,
|
||||
/// Sample paths per spec §5.5 (≤ 5 per category). Paths are
|
||||
/// workspace-relative POSIX strings when available, absolute otherwise.
|
||||
pub skip_examples: SkipExamples,
|
||||
}
|
||||
|
||||
/// Push a path into a sample vec (cap = 5) as a workspace-relative POSIX
|
||||
/// string. Falls back to the lossy absolute path if relativisation fails.
|
||||
fn push_sample(samples: &mut Vec<String>, abs: &Path, root: &Path) {
|
||||
if samples.len() >= 5 {
|
||||
return;
|
||||
}
|
||||
let rel = abs.strip_prefix(root).unwrap_or(abs);
|
||||
// Best-effort POSIX string; any non-UTF8 char → replacement char.
|
||||
let s = rel.to_string_lossy().replace('\\', "/");
|
||||
samples.push(s);
|
||||
}
|
||||
|
||||
/// Convert a list of absolute file paths to `Vec<RawAsset>`, sorted by
|
||||
/// workspace-relative POSIX path for determinism.
|
||||
fn build_assets(
|
||||
files: &[PathBuf],
|
||||
root: &Path,
|
||||
copy_threshold_bytes: u64,
|
||||
) -> Result<Vec<RawAsset>> {
|
||||
let mut assets = Vec::with_capacity(files.len());
|
||||
for abs in files {
|
||||
let rel = abs.strip_prefix(root).unwrap_or(abs);
|
||||
let workspace_path = match to_posix(rel) {
|
||||
Ok(p) => p,
|
||||
Err(e) => {
|
||||
tracing::warn!(
|
||||
path = %abs.display(),
|
||||
error = %e,
|
||||
"skipping file: path is not a valid WorkspacePath",
|
||||
);
|
||||
continue;
|
||||
}
|
||||
};
|
||||
|
||||
let media_type = media_type_for(abs);
|
||||
let (byte_len, full_hex) = hash_file(abs)
|
||||
.with_context(|| format!("hashing {}", abs.display()))?;
|
||||
let checksum = Checksum(full_hex.clone());
|
||||
let asset_id = id_for_asset(&full_hex);
|
||||
|
||||
let stored = if byte_len > copy_threshold_bytes {
|
||||
AssetStorage::Reference {
|
||||
path: abs.clone(),
|
||||
sha: checksum.clone(),
|
||||
}
|
||||
} else {
|
||||
AssetStorage::Copied { path: abs.clone() }
|
||||
};
|
||||
|
||||
assets.push(RawAsset {
|
||||
asset_id,
|
||||
source_uri: SourceUri::File(abs.clone()),
|
||||
workspace_path,
|
||||
media_type,
|
||||
byte_len,
|
||||
checksum,
|
||||
discovered_at: OffsetDateTime::now_utc(),
|
||||
stored,
|
||||
});
|
||||
}
|
||||
|
||||
assets.sort_by(|a, b| a.workspace_path.0.cmp(&b.workspace_path.0));
|
||||
Ok(assets)
|
||||
}
|
||||
|
||||
|
||||
impl SourceConnector for FsSourceConnector {
|
||||
fn scan(&self, scope: &SourceScope) -> Result<Vec<RawAsset>> {
|
||||
// Delegate to scan_with_skips; discard the skip counts.
|
||||
// Callers that need skip attribution should call scan_with_skips directly.
|
||||
let (assets, _skips) = self.scan_with_skips(scope)?;
|
||||
Ok(assets)
|
||||
}
|
||||
}
|
||||
@@ -401,4 +527,241 @@ mod tests {
|
||||
let v2 = conn2.scan(&SourceScope::default()).unwrap();
|
||||
assert!(matches!(v2[0].stored, AssetStorage::Copied { .. }));
|
||||
}
|
||||
|
||||
// ── IngestReport skip counter wiring tests ───────────────────────────────
|
||||
|
||||
#[test]
|
||||
fn scan_with_skips_counts_gitignored_files() {
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let root = dir.path();
|
||||
std::fs::write(root.join(".gitignore"), "*.log\n").unwrap();
|
||||
std::fs::write(root.join("ok.md"), b"# ok").unwrap();
|
||||
std::fs::write(root.join("skipme.log"), b"x").unwrap();
|
||||
|
||||
let conn =
|
||||
FsSourceConnector::new(&cfg_with_root(root.to_str().unwrap()))
|
||||
.unwrap();
|
||||
let (_assets, skips) = conn.scan_with_skips(&SourceScope::default()).unwrap();
|
||||
|
||||
assert!(
|
||||
skips.skipped_gitignore >= 1,
|
||||
"skipped_gitignore should be >= 1; got {}",
|
||||
skips.skipped_gitignore
|
||||
);
|
||||
assert!(
|
||||
skips.skip_examples.gitignore.iter().any(|p| p.contains("skipme.log")),
|
||||
"skip_examples.gitignore should contain 'skipme.log'; got: {:?}",
|
||||
skips.skip_examples.gitignore
|
||||
);
|
||||
// kebabignore counter must be 0 — file matched gitignore, not kebabignore.
|
||||
assert_eq!(skips.skipped_kebabignore, 0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn scan_with_skips_counts_builtin_blacklist_dirs() {
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let root = dir.path();
|
||||
std::fs::create_dir_all(root.join("node_modules/foo")).unwrap();
|
||||
std::fs::write(root.join("node_modules/foo/bar.js"), b"x").unwrap();
|
||||
std::fs::write(root.join("ok.md"), b"# ok").unwrap();
|
||||
|
||||
let conn =
|
||||
FsSourceConnector::new(&cfg_with_root(root.to_str().unwrap()))
|
||||
.unwrap();
|
||||
let (_assets, skips) = conn.scan_with_skips(&SourceScope::default()).unwrap();
|
||||
|
||||
assert!(
|
||||
skips.skipped_builtin_blacklist >= 1,
|
||||
"skipped_builtin_blacklist should be >= 1; got {}",
|
||||
skips.skipped_builtin_blacklist
|
||||
);
|
||||
assert!(
|
||||
skips.skip_examples.builtin_blacklist.iter().any(|p| p.contains("node_modules")),
|
||||
"skip_examples.builtin_blacklist should contain a node_modules path; got: {:?}",
|
||||
skips.skip_examples.builtin_blacklist
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn scan_with_skips_kebabignore_increments_counter_no_example() {
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let root = dir.path();
|
||||
std::fs::write(root.join(".kebabignore"), "*.secret\n").unwrap();
|
||||
std::fs::write(root.join("ok.md"), b"x").unwrap();
|
||||
std::fs::write(root.join("creds.secret"), b"pw").unwrap();
|
||||
|
||||
let conn =
|
||||
FsSourceConnector::new(&cfg_with_root(root.to_str().unwrap()))
|
||||
.unwrap();
|
||||
let (_assets, skips) = conn.scan_with_skips(&SourceScope::default()).unwrap();
|
||||
|
||||
assert!(
|
||||
skips.skipped_kebabignore >= 1,
|
||||
"skipped_kebabignore should be >= 1; got {}",
|
||||
skips.skipped_kebabignore
|
||||
);
|
||||
// Per spec §5.5: kebabignore is intentionally NOT in skip_examples.
|
||||
assert!(
|
||||
skips.skip_examples.gitignore.is_empty(),
|
||||
"gitignore examples should be empty; got: {:?}",
|
||||
skips.skip_examples.gitignore
|
||||
);
|
||||
assert!(
|
||||
skips.skip_examples.builtin_blacklist.is_empty(),
|
||||
"builtin_blacklist examples should be empty; got: {:?}",
|
||||
skips.skip_examples.builtin_blacklist
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn scan_with_skips_builtin_priority_over_gitignore() {
|
||||
// node_modules/ matches both BUILTIN_BLACKLIST and a .gitignore entry.
|
||||
// It must be attributed to builtin (spec §5.2 priority order).
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let root = dir.path();
|
||||
std::fs::write(root.join(".gitignore"), "node_modules/\n").unwrap();
|
||||
std::fs::create_dir_all(root.join("node_modules/pkg")).unwrap();
|
||||
std::fs::write(root.join("node_modules/pkg/index.js"), b"x").unwrap();
|
||||
std::fs::write(root.join("ok.md"), b"x").unwrap();
|
||||
|
||||
let conn =
|
||||
FsSourceConnector::new(&cfg_with_root(root.to_str().unwrap()))
|
||||
.unwrap();
|
||||
let (_assets, skips) = conn.scan_with_skips(&SourceScope::default()).unwrap();
|
||||
|
||||
assert!(
|
||||
skips.skipped_builtin_blacklist >= 1,
|
||||
"builtin counter should be >= 1; got {}",
|
||||
skips.skipped_builtin_blacklist
|
||||
);
|
||||
assert_eq!(
|
||||
skips.skipped_gitignore, 0,
|
||||
"gitignore counter must be 0 when builtin wins; got {}",
|
||||
skips.skipped_gitignore
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn skip_examples_cap_at_five() {
|
||||
// Write 7 .log files — skip_examples.gitignore must cap at 5.
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let root = dir.path();
|
||||
std::fs::write(root.join(".gitignore"), "*.log\n").unwrap();
|
||||
for i in 0..7 {
|
||||
std::fs::write(root.join(format!("f{i}.log")), b"x").unwrap();
|
||||
}
|
||||
std::fs::write(root.join("ok.md"), b"x").unwrap();
|
||||
|
||||
let conn =
|
||||
FsSourceConnector::new(&cfg_with_root(root.to_str().unwrap()))
|
||||
.unwrap();
|
||||
let (_assets, skips) = conn.scan_with_skips(&SourceScope::default()).unwrap();
|
||||
|
||||
assert_eq!(skips.skipped_gitignore, 7, "should count all 7");
|
||||
assert_eq!(
|
||||
skips.skip_examples.gitignore.len(),
|
||||
5,
|
||||
"skip_examples.gitignore must cap at 5; got: {:?}",
|
||||
skips.skip_examples.gitignore
|
||||
);
|
||||
}
|
||||
|
||||
// ── p10-1A-1: generated-header + size-cap skip tests ────────────────────
|
||||
|
||||
/// Helper: connector with default ingest.code settings.
|
||||
fn cfg_with_root_defaults(root: &str) -> Config {
|
||||
// cfg_with_root already uses Config::defaults() which has
|
||||
// skip_generated_header=true, max_file_bytes=262144, max_file_lines=5000.
|
||||
cfg_with_root(root)
|
||||
}
|
||||
|
||||
/// Helper: connector with overridden size caps.
|
||||
fn cfg_with_size_cap(root: &str, max_bytes: u64, max_lines: u32) -> Config {
|
||||
let mut c = cfg_with_root(root);
|
||||
c.ingest.code.max_file_bytes = max_bytes;
|
||||
c.ingest.code.max_file_lines = max_lines;
|
||||
c
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn ingest_report_counts_generated_files() {
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let root = dir.path();
|
||||
std::fs::write(root.join("normal.md"), "# hi").unwrap();
|
||||
std::fs::write(root.join("autogen.rs"), "// @generated\nfn x() {}\n").unwrap();
|
||||
|
||||
let conn = FsSourceConnector::new(
|
||||
&cfg_with_root_defaults(root.to_str().unwrap()),
|
||||
)
|
||||
.unwrap();
|
||||
let (_assets, skips) = conn.scan_with_skips(&SourceScope::default()).unwrap();
|
||||
|
||||
assert!(
|
||||
skips.skipped_generated >= 1,
|
||||
"skipped_generated should be >= 1; got {}",
|
||||
skips.skipped_generated
|
||||
);
|
||||
assert!(
|
||||
skips.skip_examples.generated.iter().any(|p| p.contains("autogen")),
|
||||
"skip_examples.generated should contain 'autogen'; got: {:?}",
|
||||
skips.skip_examples.generated
|
||||
);
|
||||
// The normal.md file must NOT be skipped.
|
||||
let asset_paths: Vec<_> = _assets
|
||||
.iter()
|
||||
.map(|a| a.workspace_path.0.clone())
|
||||
.collect();
|
||||
assert!(
|
||||
asset_paths.iter().any(|p| p.contains("normal")),
|
||||
"normal.md should still be emitted; assets: {asset_paths:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn ingest_report_counts_oversized_files_by_bytes() {
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let root = dir.path();
|
||||
std::fs::write(root.join("normal.md"), "# hi").unwrap();
|
||||
// Write a file larger than the 1024-byte cap.
|
||||
let big: String = "x\n".repeat(1_000);
|
||||
std::fs::write(root.join("huge.rs"), &big).unwrap();
|
||||
|
||||
let conn = FsSourceConnector::new(
|
||||
&cfg_with_size_cap(root.to_str().unwrap(), 1024, 5_000),
|
||||
)
|
||||
.unwrap();
|
||||
let (_assets, skips) = conn.scan_with_skips(&SourceScope::default()).unwrap();
|
||||
|
||||
assert!(
|
||||
skips.skipped_size_exceeded >= 1,
|
||||
"skipped_size_exceeded should be >= 1; got {}",
|
||||
skips.skipped_size_exceeded
|
||||
);
|
||||
assert!(
|
||||
skips.skip_examples.size_exceeded.iter().any(|p| p.contains("huge")),
|
||||
"skip_examples.size_exceeded should contain 'huge'; got: {:?}",
|
||||
skips.skip_examples.size_exceeded
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn ingest_report_size_cap_by_line_count() {
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let root = dir.path();
|
||||
// 6000 lines but small per-line — line cap of 5000 should trigger.
|
||||
let body: String = "x\n".repeat(6_000);
|
||||
std::fs::write(root.join("longfile.rs"), &body).unwrap();
|
||||
|
||||
let conn = FsSourceConnector::new(
|
||||
&cfg_with_size_cap(root.to_str().unwrap(), 262_144, 5_000),
|
||||
)
|
||||
.unwrap();
|
||||
let (_assets, skips) = conn.scan_with_skips(&SourceScope::default()).unwrap();
|
||||
|
||||
assert!(
|
||||
skips.skipped_size_exceeded >= 1,
|
||||
"skipped_size_exceeded should be >= 1 (line cap); got {}",
|
||||
skips.skipped_size_exceeded
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -13,4 +13,4 @@ mod hash;
|
||||
mod media;
|
||||
mod walker;
|
||||
|
||||
pub use connector::FsSourceConnector;
|
||||
pub use connector::{FsScanSkips, FsSourceConnector};
|
||||
|
||||
@@ -19,7 +19,9 @@ pub(crate) fn media_type_for(path: &Path) -> MediaType {
|
||||
.unwrap_or_default();
|
||||
|
||||
match ext.as_str() {
|
||||
"md" => MediaType::Markdown,
|
||||
// Markdown + MDX (markdown + JSX, treated as plain markdown — the
|
||||
// JSX islands are folded into raw passthrough by the md parser).
|
||||
"md" | "mdx" => MediaType::Markdown,
|
||||
"pdf" => MediaType::Pdf,
|
||||
|
||||
"png" => MediaType::Image(ImageType::Png),
|
||||
@@ -34,6 +36,16 @@ pub(crate) fn media_type_for(path: &Path) -> MediaType {
|
||||
"flac" => MediaType::Audio(AudioType::Flac),
|
||||
"ogg" => MediaType::Audio(AudioType::Ogg),
|
||||
|
||||
// p10-1A-2: Rust is the only code lang activated in 1A. Other
|
||||
// recognized code langs stay Other until their phase (1B+).
|
||||
"rs" => MediaType::Code("rust".to_string()),
|
||||
|
||||
// p10-1B: Python / TS / JS AST chunkers active.
|
||||
"py" | "pyi" => MediaType::Code("python".into()),
|
||||
// .mts / .cts are TypeScript ESM / CommonJS variants — same grammar.
|
||||
"ts" | "tsx" | "mts" | "cts" => MediaType::Code("typescript".into()),
|
||||
"js" | "mjs" | "cjs" | "jsx" => MediaType::Code("javascript".into()),
|
||||
|
||||
// Empty string (no extension) and any other extension: bucket as
|
||||
// Other and let downstream extractors decide if they support it.
|
||||
_ => MediaType::Other(ext),
|
||||
@@ -71,6 +83,42 @@ mod tests {
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn rust_files_map_to_media_code_rust() {
|
||||
assert_eq!(
|
||||
media_type_for(Path::new("crates/kebab-core/src/lib.rs")),
|
||||
MediaType::Code("rust".to_string())
|
||||
);
|
||||
assert_eq!(media_type_for(Path::new("Cargo.toml")), MediaType::Other("toml".to_string()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn py_ts_js_files_map_to_media_code() {
|
||||
assert_eq!(media_type_for(Path::new("a/b.py")), MediaType::Code("python".into()));
|
||||
assert_eq!(media_type_for(Path::new("a/b.pyi")), MediaType::Code("python".into()));
|
||||
assert_eq!(media_type_for(Path::new("a/b.ts")), MediaType::Code("typescript".into()));
|
||||
assert_eq!(media_type_for(Path::new("a/b.tsx")), MediaType::Code("typescript".into()));
|
||||
assert_eq!(media_type_for(Path::new("a/b.js")), MediaType::Code("javascript".into()));
|
||||
assert_eq!(media_type_for(Path::new("a/b.mjs")), MediaType::Code("javascript".into()));
|
||||
assert_eq!(media_type_for(Path::new("a/b.cjs")), MediaType::Code("javascript".into()));
|
||||
assert_eq!(media_type_for(Path::new("a/b.jsx")), MediaType::Code("javascript".into()));
|
||||
assert_eq!(media_type_for(Path::new("a/b.rs")), MediaType::Code("rust".into()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn ts_variants_mts_cts() {
|
||||
// .mts / .cts are TypeScript ESM / CommonJS — same grammar as .ts.
|
||||
assert_eq!(media_type_for(Path::new("a/b.mts")), MediaType::Code("typescript".into()));
|
||||
assert_eq!(media_type_for(Path::new("a/b.cts")), MediaType::Code("typescript".into()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn mdx_routes_to_markdown() {
|
||||
// MDX is markdown with JSX islands; the md parser folds the JSX
|
||||
// through as raw passthrough.
|
||||
assert_eq!(media_type_for(Path::new("docs/page.mdx")), MediaType::Markdown);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn unknown_and_missing_extension() {
|
||||
assert_eq!(
|
||||
|
||||
@@ -1,12 +1,15 @@
|
||||
//! Directory walker with gitignore-style filtering and symlink-cycle
|
||||
//! protection.
|
||||
//!
|
||||
//! Filter set (per task spec, design §6.2):
|
||||
//! - `config.workspace.exclude` (passed in by `FsSourceConnector`)
|
||||
//! - `<root>/.kebabignore` (optional file at workspace root)
|
||||
//! - default-excludes for `.DS_Store` and macOS resource forks (`._*`)
|
||||
//! Filter set, in order of application:
|
||||
//! - DEFAULT_EXCLUDES (constants — VCS dirs, build artifacts, never-useful)
|
||||
//! - `config.workspace.exclude` (user-supplied per workspace)
|
||||
//! - `<root>/.kebabignore` (user-supplied kebab-specific exclude)
|
||||
//! - Built-in safety-net blacklist (`node_modules/`, `target/`, etc. —
|
||||
//! spec §5.2, applied via `kebab_parse_code::BUILTIN_BLACKLIST`)
|
||||
//! - `<root>/.gitignore` (repo-root only, no nested cascade — spec §5.2)
|
||||
//!
|
||||
//! All three are merged via `ignore::overrides::OverrideBuilder`, which
|
||||
//! All five are merged via `ignore::overrides::OverrideBuilder`, which
|
||||
//! gives full gitignore semantics (anchors, `!` negation, `**`, etc.). We
|
||||
//! prepend `!` to each pattern because `OverrideBuilder` treats positive
|
||||
//! patterns as "include" and negative as "exclude" — see §"Filter set"
|
||||
@@ -18,6 +21,16 @@
|
||||
//! `follow_links(true)`; we layer our own visited-set on top, keyed by the
|
||||
//! canonical path of every entry, and skip any entry we've already seen.
|
||||
//!
|
||||
//! ## Per-source skip attribution (spec §5.5)
|
||||
//!
|
||||
//! `walk_files_with_skips` returns a `WalkOverrides` struct that carries
|
||||
//! both a `combined` matcher (used for the actual walk decision) and three
|
||||
//! per-source matchers (`gitignore`, `kebabignore`, `builtin`). When an
|
||||
//! entry is excluded, `classify_skip` probes the per-source matchers in
|
||||
//! priority order (built-in > gitignore > kebabignore) to determine which
|
||||
//! `IngestReport` counter should be incremented — without requiring a
|
||||
//! second walker pass over the filesystem.
|
||||
//!
|
||||
//! ## Why `walkdir` instead of `ignore::WalkBuilder`?
|
||||
//!
|
||||
//! `ignore::WalkBuilder` bundles gitignore semantics + cycle detection in
|
||||
@@ -31,8 +44,9 @@ use std::collections::HashSet;
|
||||
use std::path::{Path, PathBuf};
|
||||
|
||||
use anyhow::{Context, Result};
|
||||
use globset::{GlobBuilder, GlobSet, GlobSetBuilder};
|
||||
use ignore::overrides::{Override, OverrideBuilder};
|
||||
use walkdir::{DirEntry, WalkDir};
|
||||
use walkdir::WalkDir;
|
||||
|
||||
/// Default-excludes baked into the connector. These are NOT configurable;
|
||||
/// they cover noise that is never useful to ingest and would otherwise need
|
||||
@@ -46,37 +60,261 @@ const DEFAULT_EXCLUDES: &[&str] = &[
|
||||
"**/._*",
|
||||
];
|
||||
|
||||
/// Build the merged `Override` from `config.workspace.exclude` ∪ `.kebabignore`
|
||||
/// ∪ baked-in default excludes.
|
||||
/// Per-source `Override` matchers for skip-counter attribution (spec §5.5).
|
||||
///
|
||||
/// `combined` is the merged union of all sources — used for the actual
|
||||
/// "is this entry excluded?" decision in the walker. The three per-source
|
||||
/// matchers (`gitignore`, `kebabignore`, `builtin`) are used ONLY when
|
||||
/// classifying an already-excluded path for `IngestReport` counter wiring;
|
||||
/// they are never consulted for every walked file.
|
||||
///
|
||||
/// `default_and_config` covers DEFAULT_EXCLUDES + `config.workspace.exclude`
|
||||
/// — these do NOT map to any of the three named `IngestReport` counters.
|
||||
///
|
||||
/// `include` is the compiled `scope.include` allow-list. When the set is
|
||||
/// empty (no patterns) every file passes; when non-empty a file must match
|
||||
/// at least one pattern to be accepted (directories always pass, so the
|
||||
/// walker can still descend into them).
|
||||
pub(crate) struct WalkOverrides {
|
||||
/// Merged matcher — same as today's `Override`; used for the walk decision.
|
||||
pub combined: Override,
|
||||
/// Matcher built from `<root>/.gitignore` patterns only.
|
||||
pub gitignore: Override,
|
||||
/// Matcher built from `<root>/.kebabignore` patterns only.
|
||||
pub kebabignore: Override,
|
||||
/// Matcher built from `kebab_parse_code::BUILTIN_BLACKLIST` only.
|
||||
pub builtin: Override,
|
||||
/// Compiled allow-list from `scope.include`. Empty set = pass all.
|
||||
pub include: GlobSet,
|
||||
}
|
||||
|
||||
/// Skip attribution category. Used by the connector when counting per-source
|
||||
/// skips for `IngestReport` (spec §5.5).
|
||||
///
|
||||
/// Priority order per spec §5.2: built-in > gitignore > kebabignore.
|
||||
/// A path matching multiple sources is attributed to the first match.
|
||||
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
|
||||
pub(crate) enum SkipCategory {
|
||||
BuiltinBlacklist,
|
||||
Gitignore,
|
||||
Kebabignore,
|
||||
/// Matched DEFAULT_EXCLUDES or `config.workspace.exclude`. No dedicated
|
||||
/// counter in `IngestReport` — lumped into the existing `skipped` field.
|
||||
Other,
|
||||
}
|
||||
|
||||
/// Build a single `Override` from a list of gitignore-style patterns, all
|
||||
/// registered as excludes (prepend `!`).
|
||||
///
|
||||
/// Empty pattern list → an `Override` that matches nothing (i.e. no
|
||||
/// exclusions). Callers must strip blanks / comments before passing.
|
||||
fn build_single_matcher(root: &Path, patterns: &[&str]) -> Result<Override> {
|
||||
let mut builder = OverrideBuilder::new(root);
|
||||
for pat in patterns {
|
||||
builder
|
||||
.add(&format!("!{pat}"))
|
||||
.with_context(|| format!("invalid pattern: {pat}"))?;
|
||||
}
|
||||
builder.build().context("failed to compile override")
|
||||
}
|
||||
|
||||
/// Build the builtin-blacklist `Override`, adding directory-level patterns in
|
||||
/// addition to the `/**`-suffix ones from `BUILTIN_BLACKLIST`.
|
||||
///
|
||||
/// BUILTIN_BLACKLIST uses `**/X/**` patterns which match files *inside* X but
|
||||
/// NOT the directory entry `X` itself (because `**/X/**` requires a path
|
||||
/// component after X). The walker prunes at the directory level (`is_dir=true`),
|
||||
/// so we need `**/X` (no trailing `/**`) to also match the directory itself
|
||||
/// for attribution purposes.
|
||||
fn build_builtin_matcher(root: &Path) -> Result<Override> {
|
||||
let mut builder = OverrideBuilder::new(root);
|
||||
for pat in kebab_parse_code::BUILTIN_BLACKLIST {
|
||||
// Register the original pattern (matches files inside the dir).
|
||||
builder
|
||||
.add(&format!("!{pat}"))
|
||||
.with_context(|| format!("builtin pattern: {pat}"))?;
|
||||
// Also derive a directory-level match by stripping trailing `/**`.
|
||||
// This makes `is_dir=true` checks on the directory itself work.
|
||||
if let Some(dir_pat) = pat.strip_suffix("/**") {
|
||||
builder
|
||||
.add(&format!("!{dir_pat}"))
|
||||
.with_context(|| format!("builtin dir pattern: {dir_pat}"))?;
|
||||
}
|
||||
}
|
||||
builder.build().context("failed to compile builtin override")
|
||||
}
|
||||
|
||||
/// Owned-string variant of `build_single_matcher` for caller-supplied
|
||||
/// `Vec<String>` sources (config.workspace.exclude, .kebabignore).
|
||||
fn build_single_matcher_owned(root: &Path, patterns: &[String]) -> Result<Override> {
|
||||
let mut builder = OverrideBuilder::new(root);
|
||||
for pat in patterns {
|
||||
builder
|
||||
.add(&format!("!{pat}"))
|
||||
.with_context(|| format!("invalid pattern: {pat}"))?;
|
||||
}
|
||||
builder.build().context("failed to compile override")
|
||||
}
|
||||
|
||||
/// Build the merged `WalkOverrides` from all five filter sources, in order:
|
||||
/// DEFAULT_EXCLUDES, `config.workspace.exclude`, `.kebabignore`,
|
||||
/// built-in safety-net blacklist (`kebab_parse_code::BUILTIN_BLACKLIST`),
|
||||
/// and `<root>/.gitignore` (root-only, no nested cascade).
|
||||
///
|
||||
/// Each input pattern is registered as an *exclude* (gitignore-style: a
|
||||
/// leading `!` flips a positive match to a negative one in the
|
||||
/// `OverrideBuilder` API). Order doesn't matter — the union is computed by
|
||||
/// the underlying gitignore engine.
|
||||
///
|
||||
/// The three per-source matchers (`gitignore`, `kebabignore`, `builtin`) are
|
||||
/// built in addition to the combined one so the connector can attribute skips
|
||||
/// to the correct `IngestReport` counter without a second walker pass.
|
||||
///
|
||||
/// `include_patterns` (from `scope.include`) are compiled into an allow-list
|
||||
/// `GlobSet`. Empty slice → pass-all (backward-compat); non-empty → file
|
||||
/// must match at least one pattern to be accepted.
|
||||
pub(crate) fn build_overrides(
|
||||
root: &Path,
|
||||
config_exclude: &[String],
|
||||
kbignore_patterns: &[String],
|
||||
) -> Result<Override> {
|
||||
let mut builder = OverrideBuilder::new(root);
|
||||
include_patterns: &[String],
|
||||
) -> Result<WalkOverrides> {
|
||||
let gitignore_patterns = read_gitignore(root)?;
|
||||
|
||||
// Per-source matchers (for attribution only).
|
||||
let gitignore =
|
||||
build_single_matcher(root, &gitignore_patterns.iter().map(|s| s.as_str()).collect::<Vec<_>>())?;
|
||||
let kebabignore = build_single_matcher_owned(root, kbignore_patterns)?;
|
||||
// Use the directory-aware builtin matcher so that `is_dir=true` checks on
|
||||
// directory entries (e.g., `node_modules/`) are attributed to builtin rather
|
||||
// than to an overlapping gitignore pattern.
|
||||
let builtin = build_builtin_matcher(root)?;
|
||||
|
||||
// Combined matcher — union of all five sources.
|
||||
let mut combined_builder = OverrideBuilder::new(root);
|
||||
|
||||
for pat in DEFAULT_EXCLUDES {
|
||||
builder
|
||||
combined_builder
|
||||
.add(&format!("!{pat}"))
|
||||
.with_context(|| format!("invalid default-exclude pattern: {pat}"))?;
|
||||
}
|
||||
for pat in config_exclude {
|
||||
builder
|
||||
combined_builder
|
||||
.add(&format!("!{pat}"))
|
||||
.with_context(|| format!("invalid workspace.exclude pattern: {pat}"))?;
|
||||
}
|
||||
for pat in kbignore_patterns {
|
||||
builder
|
||||
combined_builder
|
||||
.add(&format!("!{pat}"))
|
||||
.with_context(|| format!("invalid .kebabignore pattern: {pat}"))?;
|
||||
}
|
||||
for pat in kebab_parse_code::BUILTIN_BLACKLIST {
|
||||
combined_builder
|
||||
.add(&format!("!{pat}"))
|
||||
.with_context(|| format!("built-in blacklist pattern: {pat}"))?;
|
||||
}
|
||||
for pat in &gitignore_patterns {
|
||||
combined_builder
|
||||
.add(&format!("!{pat}"))
|
||||
.with_context(|| format!(".gitignore pattern: {pat}"))?;
|
||||
}
|
||||
let combined = combined_builder
|
||||
.build()
|
||||
.context("failed to compile combined override set")?;
|
||||
|
||||
builder.build().context("failed to compile override set")
|
||||
// Allow-list GlobSet: empty Vec → matches nothing (= pass all); non-empty
|
||||
// → file must match at least one glob to be accepted. We compile with
|
||||
// `case_insensitive=false` to keep the semantics consistent with the
|
||||
// OverrideBuilder exclude patterns above.
|
||||
let include = build_include_globset(include_patterns)?;
|
||||
|
||||
Ok(WalkOverrides {
|
||||
combined,
|
||||
gitignore,
|
||||
kebabignore,
|
||||
builtin,
|
||||
include,
|
||||
})
|
||||
}
|
||||
|
||||
/// Compile `scope.include` patterns into a `GlobSet` allow-list.
|
||||
///
|
||||
/// Each pattern uses `GlobBuilder` with `literal_separator = true` so that
|
||||
/// `**` can cross directory boundaries while `*` stops at `/`, matching the
|
||||
/// gitignore convention used throughout the rest of the walker.
|
||||
///
|
||||
/// An empty slice produces an empty `GlobSet` — callers interpret that as
|
||||
/// "pass all files" (no allow-list constraint).
|
||||
fn build_include_globset(patterns: &[String]) -> Result<GlobSet> {
|
||||
let mut builder = GlobSetBuilder::new();
|
||||
for pat in patterns {
|
||||
let glob = GlobBuilder::new(pat)
|
||||
.literal_separator(true)
|
||||
.build()
|
||||
.with_context(|| format!("invalid include pattern: {pat}"))?;
|
||||
builder.add(glob);
|
||||
}
|
||||
builder.build().context("failed to compile include globset")
|
||||
}
|
||||
|
||||
/// Classify why a path was excluded, using per-source matchers in spec §5.2
|
||||
/// priority order: built-in > gitignore > kebabignore > other.
|
||||
///
|
||||
/// `rel` must be relative to the walker root (same as `Override::matched`
|
||||
/// expects). `is_dir` should match what the original walker saw.
|
||||
pub(crate) fn classify_skip(rel: &Path, is_dir: bool, ov: &WalkOverrides) -> SkipCategory {
|
||||
if ov.builtin.matched(rel, is_dir).is_ignore() {
|
||||
return SkipCategory::BuiltinBlacklist;
|
||||
}
|
||||
if ov.gitignore.matched(rel, is_dir).is_ignore() {
|
||||
return SkipCategory::Gitignore;
|
||||
}
|
||||
if ov.kebabignore.matched(rel, is_dir).is_ignore() {
|
||||
return SkipCategory::Kebabignore;
|
||||
}
|
||||
SkipCategory::Other
|
||||
}
|
||||
|
||||
/// Read `<root>/.gitignore` (single-file, root-only — nested cascade is P+).
|
||||
/// Missing file → empty Vec. Comments / blanks stripped.
|
||||
///
|
||||
/// Trailing-slash patterns (`dist/`) in real gitignore mean "match the
|
||||
/// directory AND everything inside it". `OverrideBuilder::matched(path,
|
||||
/// is_dir=false)` only checks `is_dir` for the trailing-slash variant, so
|
||||
/// `dist/bundle.js` would not be matched. We normalize by also emitting a
|
||||
/// `<stem>/**` variant so files inside the directory are caught.
|
||||
pub(crate) fn read_gitignore(root: &Path) -> Result<Vec<String>> {
|
||||
let p = root.join(".gitignore");
|
||||
if !p.exists() {
|
||||
return Ok(vec![]);
|
||||
}
|
||||
let s = std::fs::read_to_string(&p)
|
||||
.with_context(|| format!("read .gitignore at {}", p.display()))?;
|
||||
let mut out = Vec::new();
|
||||
for line in s.lines() {
|
||||
let trimmed = line.trim();
|
||||
if trimmed.is_empty() || trimmed.starts_with('#') {
|
||||
continue;
|
||||
}
|
||||
// If the pattern starts with `!` (gitignore negation/un-ignore), pass through
|
||||
// as-is. Trailing-slash normalization is unsafe here — the `!`-prefix and `/`-
|
||||
// suffix combined confuse OverrideBuilder (would produce double-`!`).
|
||||
if trimmed.starts_with('!') {
|
||||
out.push(trimmed.to_string());
|
||||
continue;
|
||||
}
|
||||
if let Some(stem) = trimmed.strip_suffix('/') {
|
||||
// Keep the dir-only form so `is_dir=true` matches are still
|
||||
// excluded (e.g., for skip_current_dir in the walker).
|
||||
out.push(trimmed.to_string());
|
||||
// Also emit a glob that catches files inside the directory,
|
||||
// since `is_dir=false` won't satisfy the trailing-slash form.
|
||||
out.push(format!("{stem}/**"));
|
||||
} else {
|
||||
out.push(trimmed.to_string());
|
||||
}
|
||||
}
|
||||
Ok(out)
|
||||
}
|
||||
|
||||
/// Read `<root>/.kebabignore` if it exists. Each non-blank, non-comment line is
|
||||
@@ -96,51 +334,87 @@ pub(crate) fn read_kbignore(root: &Path) -> Result<Vec<String>> {
|
||||
.collect())
|
||||
}
|
||||
|
||||
/// Iterate every regular file under `root`, applying `overrides` and
|
||||
/// detecting symlink cycles. Returns absolute file paths.
|
||||
/// Skipped-path record emitted by `walk_files_with_skips`.
|
||||
///
|
||||
/// `path` is the absolute path of the excluded entry (dir or file).
|
||||
/// For excluded directories, this is the directory itself — individual
|
||||
/// files inside are not enumerated (the subtree is pruned).
|
||||
pub(crate) struct SkippedEntry {
|
||||
pub path: PathBuf,
|
||||
pub category: SkipCategory,
|
||||
}
|
||||
|
||||
/// Iterate every regular file under `root`, applying `overrides.combined` and
|
||||
/// detecting symlink cycles. Returns:
|
||||
/// - `accepted`: absolute paths of files that passed all filters.
|
||||
/// - `skipped`: entries that were excluded, with attribution.
|
||||
///
|
||||
/// For excluded *directories*, the directory path itself is returned (not the
|
||||
/// individual files inside — the subtree is pruned in one step, matching the
|
||||
/// walker's `skip_current_dir` behavior).
|
||||
///
|
||||
/// Strategy:
|
||||
/// - `walkdir::WalkDir::follow_links(true)` to traverse symlinks.
|
||||
/// - Manual per-entry check (instead of `filter_entry`) so we can capture
|
||||
/// the excluded paths for skip attribution.
|
||||
/// - Maintain `visited: HashSet<PathBuf>` of *canonical* paths. Before
|
||||
/// descending into a directory entry, canonicalize and check the set;
|
||||
/// if already present, skip. This breaks `a -> b -> a` cycles in O(n)
|
||||
/// per entry without a custom recursive walker.
|
||||
/// - For each yielded entry, ask `overrides` whether it is excluded; if
|
||||
/// so, drop it. If the entry is a directory, also short-circuit
|
||||
/// `WalkDir`'s descent via `it.skip_current_dir()`.
|
||||
pub(crate) fn walk_files(root: &Path, overrides: &Override) -> Result<Vec<PathBuf>> {
|
||||
let mut out = Vec::new();
|
||||
pub(crate) fn walk_files_with_skips(
|
||||
root: &Path,
|
||||
overrides: &WalkOverrides,
|
||||
) -> Result<(Vec<PathBuf>, Vec<SkippedEntry>)> {
|
||||
let mut accepted = Vec::new();
|
||||
let mut skipped: Vec<SkippedEntry> = Vec::new();
|
||||
let mut visited: HashSet<PathBuf> = HashSet::new();
|
||||
|
||||
// Use a non-filtering iterator so we see excluded entries too.
|
||||
let walker = WalkDir::new(root).follow_links(true).into_iter();
|
||||
let mut it = walker.filter_entry(|e| !is_excluded(e, root, overrides));
|
||||
// We still use filter_entry for the *combined* override so that walkdir
|
||||
// can short-circuit pruned directories. But we wrap it so we can capture
|
||||
// the exclusion reason before discarding the entry.
|
||||
//
|
||||
// Problem: filter_entry discards without letting us see the entry first.
|
||||
// Solution: use the raw iterator (no filter_entry) and manage skip_current_dir
|
||||
// manually, which lets us record what was excluded before pruning.
|
||||
let mut it = walker;
|
||||
|
||||
while let Some(res) = it.next() {
|
||||
let entry = match res {
|
||||
Ok(e) => e,
|
||||
Err(err) => {
|
||||
// `walkdir` surfaces I/O errors AND its own cycle detector
|
||||
// (when follow_links is on it sometimes catches them).
|
||||
// Either way: log and skip; do not abort the whole scan.
|
||||
tracing::warn!(error = %err, "walkdir entry error; skipping");
|
||||
continue;
|
||||
}
|
||||
};
|
||||
|
||||
let path = entry.path();
|
||||
let rel = match path.strip_prefix(root) {
|
||||
Ok(p) => p,
|
||||
Err(_) => path,
|
||||
};
|
||||
let is_dir = entry.file_type().is_dir();
|
||||
let excluded = overrides.combined.matched(rel, is_dir).is_ignore();
|
||||
|
||||
// Cycle guard: only canonicalize symlinks (cheap on the common case
|
||||
// of plain files/dirs) and on directories that are followed via a
|
||||
// symlink. `walkdir`'s `path_is_symlink()` is true when the entry's
|
||||
// *original* path is a symlink (it returns true for the link, not
|
||||
// for the resolved target). For non-symlinked directories we still
|
||||
// record the canonical path so a *later* symlink that points back
|
||||
// to one of them is detected.
|
||||
if entry.file_type().is_dir() {
|
||||
if excluded {
|
||||
let cat = classify_skip(rel, is_dir, overrides);
|
||||
skipped.push(SkippedEntry {
|
||||
path: path.to_path_buf(),
|
||||
category: cat,
|
||||
});
|
||||
if is_dir {
|
||||
// Prune the subtree — don't descend into excluded dirs.
|
||||
it.skip_current_dir();
|
||||
}
|
||||
continue;
|
||||
}
|
||||
|
||||
// Cycle guard for directories.
|
||||
if is_dir {
|
||||
match std::fs::canonicalize(path) {
|
||||
Ok(canon) => {
|
||||
if !visited.insert(canon) {
|
||||
// Already visited via another path → break cycle.
|
||||
it.skip_current_dir();
|
||||
continue;
|
||||
}
|
||||
@@ -157,26 +431,20 @@ pub(crate) fn walk_files(root: &Path, overrides: &Override) -> Result<Vec<PathBu
|
||||
}
|
||||
|
||||
if entry.file_type().is_file() {
|
||||
out.push(path.to_path_buf());
|
||||
// Apply include allow-list: if non-empty, the file's path
|
||||
// relative to root must match at least one pattern.
|
||||
if !overrides.include.is_empty() && !overrides.include.is_match(rel) {
|
||||
// Not in the allow-list — silently drop (no skip counter;
|
||||
// the include filter is not a "skip" source in IngestReport).
|
||||
continue;
|
||||
}
|
||||
accepted.push(path.to_path_buf());
|
||||
}
|
||||
}
|
||||
|
||||
Ok(out)
|
||||
Ok((accepted, skipped))
|
||||
}
|
||||
|
||||
fn is_excluded(entry: &DirEntry, root: &Path, overrides: &Override) -> bool {
|
||||
// `Override::matched(path, is_dir)` uses the path *relative to* the
|
||||
// override builder's root. `walkdir` gives absolute paths when
|
||||
// `WalkDir::new` was given an absolute path — strip the root prefix
|
||||
// before consulting the override.
|
||||
let rel = match entry.path().strip_prefix(root) {
|
||||
Ok(p) => p,
|
||||
Err(_) => entry.path(),
|
||||
};
|
||||
overrides
|
||||
.matched(rel, entry.file_type().is_dir())
|
||||
.is_ignore()
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
@@ -185,23 +453,23 @@ mod tests {
|
||||
#[test]
|
||||
fn empty_inputs_compile_into_an_override() {
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let ov = build_overrides(dir.path(), &[], &[]).unwrap();
|
||||
let ov = build_overrides(dir.path(), &[], &[], &[]).unwrap();
|
||||
// Default-excludes only; non-special files should not match.
|
||||
let m = ov.matched(Path::new("notes/alpha.md"), false);
|
||||
let m = ov.combined.matched(Path::new("notes/alpha.md"), false);
|
||||
assert!(!m.is_ignore());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn default_excludes_ds_store_and_resource_forks() {
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let ov = build_overrides(dir.path(), &[], &[]).unwrap();
|
||||
assert!(ov.matched(Path::new(".DS_Store"), false).is_ignore());
|
||||
let ov = build_overrides(dir.path(), &[], &[], &[]).unwrap();
|
||||
assert!(ov.combined.matched(Path::new(".DS_Store"), false).is_ignore());
|
||||
assert!(
|
||||
ov.matched(Path::new("notes/.DS_Store"), false).is_ignore()
|
||||
ov.combined.matched(Path::new("notes/.DS_Store"), false).is_ignore()
|
||||
);
|
||||
assert!(ov.matched(Path::new("._foo.md"), false).is_ignore());
|
||||
assert!(ov.combined.matched(Path::new("._foo.md"), false).is_ignore());
|
||||
assert!(
|
||||
ov.matched(Path::new("notes/._sidecar"), false).is_ignore()
|
||||
ov.combined.matched(Path::new("notes/._sidecar"), false).is_ignore()
|
||||
);
|
||||
}
|
||||
|
||||
@@ -212,15 +480,16 @@ mod tests {
|
||||
dir.path(),
|
||||
&["*.tmp".to_string(), "node_modules/**".to_string()],
|
||||
&[],
|
||||
&[],
|
||||
)
|
||||
.unwrap();
|
||||
assert!(ov.matched(Path::new("a.tmp"), false).is_ignore());
|
||||
assert!(ov.matched(Path::new("notes/x.tmp"), false).is_ignore());
|
||||
assert!(ov.combined.matched(Path::new("a.tmp"), false).is_ignore());
|
||||
assert!(ov.combined.matched(Path::new("notes/x.tmp"), false).is_ignore());
|
||||
assert!(
|
||||
ov.matched(Path::new("node_modules/foo/bar.js"), false)
|
||||
ov.combined.matched(Path::new("node_modules/foo/bar.js"), false)
|
||||
.is_ignore()
|
||||
);
|
||||
assert!(!ov.matched(Path::new("alpha.md"), false).is_ignore());
|
||||
assert!(!ov.combined.matched(Path::new("alpha.md"), false).is_ignore());
|
||||
}
|
||||
|
||||
#[test]
|
||||
@@ -231,11 +500,12 @@ mod tests {
|
||||
dir.path(),
|
||||
&["*.tmp".to_string()],
|
||||
&["secret/**".to_string()],
|
||||
&[],
|
||||
)
|
||||
.unwrap();
|
||||
assert!(ov.matched(Path::new("a.tmp"), false).is_ignore());
|
||||
assert!(ov.combined.matched(Path::new("a.tmp"), false).is_ignore());
|
||||
assert!(
|
||||
ov.matched(Path::new("secret/key.md"), false).is_ignore()
|
||||
ov.combined.matched(Path::new("secret/key.md"), false).is_ignore()
|
||||
);
|
||||
}
|
||||
|
||||
@@ -257,4 +527,230 @@ mod tests {
|
||||
let v = read_kbignore(dir.path()).unwrap();
|
||||
assert_eq!(v, vec!["*.tmp".to_string(), "ignored/**".to_string()]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn built_in_blacklist_excludes_node_modules() {
|
||||
use std::fs;
|
||||
use tempfile::TempDir;
|
||||
|
||||
let tmp = TempDir::new().unwrap();
|
||||
let root = tmp.path();
|
||||
fs::create_dir_all(root.join("src")).unwrap();
|
||||
fs::create_dir_all(root.join("node_modules/foo")).unwrap();
|
||||
fs::write(root.join("src/main.rs"), "x").unwrap();
|
||||
fs::write(root.join("node_modules/foo/bar.js"), "x").unwrap();
|
||||
|
||||
let overrides = build_overrides(root, &[], &[], &[]).unwrap();
|
||||
// Override::matched expects paths relative to the builder's root.
|
||||
let m_in = overrides.combined.matched(Path::new("src/main.rs"), false);
|
||||
let m_out = overrides.combined.matched(Path::new("node_modules/foo/bar.js"), false);
|
||||
|
||||
assert!(!m_in.is_ignore(), "src/main.rs should NOT be ignored");
|
||||
assert!(m_out.is_ignore(), "node_modules/foo/bar.js SHOULD be ignored");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn built_in_blacklist_excludes_target_pycache_venv() {
|
||||
use std::fs;
|
||||
use tempfile::TempDir;
|
||||
|
||||
let tmp = TempDir::new().unwrap();
|
||||
let root = tmp.path();
|
||||
for dir in ["target/x", "__pycache__/x", ".venv/x", "venv/x", "env/x"] {
|
||||
fs::create_dir_all(root.join(dir)).unwrap();
|
||||
fs::write(root.join(dir).join("y.txt"), "z").unwrap();
|
||||
}
|
||||
fs::create_dir_all(root.join("ok")).unwrap();
|
||||
fs::write(root.join("ok/z.txt"), "z").unwrap();
|
||||
|
||||
let overrides = build_overrides(root, &[], &[], &[]).unwrap();
|
||||
// Override::matched expects paths relative to the builder's root.
|
||||
for blacklisted in [
|
||||
"target/x/y.txt",
|
||||
"__pycache__/x/y.txt",
|
||||
".venv/x/y.txt",
|
||||
"venv/x/y.txt",
|
||||
"env/x/y.txt",
|
||||
] {
|
||||
let m = overrides.combined.matched(Path::new(blacklisted), false);
|
||||
assert!(m.is_ignore(), "{blacklisted} should be ignored");
|
||||
}
|
||||
let m_ok = overrides.combined.matched(Path::new("ok/z.txt"), false);
|
||||
assert!(!m_ok.is_ignore(), "ok/z.txt should not be ignored");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn gitignore_at_repo_root_excludes_matching_files() {
|
||||
use std::fs;
|
||||
use tempfile::TempDir;
|
||||
|
||||
let tmp = TempDir::new().unwrap();
|
||||
let root = tmp.path();
|
||||
fs::create_dir_all(root.join("src")).unwrap();
|
||||
fs::write(root.join(".gitignore"), "*.log\ndist/\n").unwrap();
|
||||
fs::write(root.join("a.log"), "x").unwrap();
|
||||
fs::write(root.join("src/main.rs"), "x").unwrap();
|
||||
fs::create_dir_all(root.join("dist")).unwrap();
|
||||
fs::write(root.join("dist/bundle.js"), "x").unwrap();
|
||||
|
||||
let overrides = build_overrides(root, &[], &[], &[]).unwrap();
|
||||
assert!(overrides.combined.matched(Path::new("a.log"), false).is_ignore());
|
||||
assert!(overrides.combined.matched(Path::new("dist/bundle.js"), false).is_ignore());
|
||||
assert!(!overrides.combined.matched(Path::new("src/main.rs"), false).is_ignore());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn gitignore_missing_is_no_op() {
|
||||
use std::fs;
|
||||
use tempfile::TempDir;
|
||||
|
||||
let tmp = TempDir::new().unwrap();
|
||||
let root = tmp.path();
|
||||
fs::write(root.join("a.log"), "x").unwrap();
|
||||
fs::create_dir_all(root.join("src")).unwrap();
|
||||
fs::write(root.join("src/main.rs"), "x").unwrap();
|
||||
|
||||
// No .gitignore present — patterns from .gitignore should not affect overrides.
|
||||
let overrides = build_overrides(root, &[], &[], &[]).unwrap();
|
||||
assert!(!overrides.combined.matched(Path::new("a.log"), false).is_ignore());
|
||||
assert!(!overrides.combined.matched(Path::new("src/main.rs"), false).is_ignore());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn gitignore_negation_with_trailing_slash_passes_through() {
|
||||
use std::fs;
|
||||
use tempfile::TempDir;
|
||||
let tmp = TempDir::new().unwrap();
|
||||
let root = tmp.path();
|
||||
// Negation pattern. We don't fully implement gitignore negation
|
||||
// semantics, but at minimum it must not produce double-`!` corruption.
|
||||
fs::write(root.join(".gitignore"), "!keep/\n").unwrap();
|
||||
// Just verify build_overrides doesn't error.
|
||||
let result = build_overrides(root, &[], &[], &[]);
|
||||
assert!(result.is_ok(), "should not error on negation pattern: {:?}", result.err());
|
||||
}
|
||||
|
||||
// ── Skip attribution tests ────────────────────────────────────────────────
|
||||
|
||||
#[test]
|
||||
fn classify_skip_attributes_builtin_over_gitignore() {
|
||||
use std::fs;
|
||||
use tempfile::TempDir;
|
||||
|
||||
let tmp = TempDir::new().unwrap();
|
||||
let root = tmp.path();
|
||||
// node_modules matches both BUILTIN_BLACKLIST and a hypothetical
|
||||
// .gitignore entry. Builtin must win (priority order §5.2).
|
||||
fs::write(root.join(".gitignore"), "node_modules/\n").unwrap();
|
||||
|
||||
let ov = build_overrides(root, &[], &[], &[]).unwrap();
|
||||
// node_modules/ dir itself
|
||||
let cat = classify_skip(Path::new("node_modules"), true, &ov);
|
||||
assert_eq!(cat, SkipCategory::BuiltinBlacklist, "builtin must have priority");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn classify_skip_attributes_gitignore_for_log_files() {
|
||||
use std::fs;
|
||||
use tempfile::TempDir;
|
||||
|
||||
let tmp = TempDir::new().unwrap();
|
||||
let root = tmp.path();
|
||||
fs::write(root.join(".gitignore"), "*.log\n").unwrap();
|
||||
|
||||
let ov = build_overrides(root, &[], &[], &[]).unwrap();
|
||||
let cat = classify_skip(Path::new("app.log"), false, &ov);
|
||||
assert_eq!(cat, SkipCategory::Gitignore);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn classify_skip_attributes_kebabignore() {
|
||||
use tempfile::TempDir;
|
||||
|
||||
let tmp = TempDir::new().unwrap();
|
||||
let root = tmp.path();
|
||||
|
||||
let ov = build_overrides(root, &[], &["*.secret".to_string()], &[]).unwrap();
|
||||
let cat = classify_skip(Path::new("creds.secret"), false, &ov);
|
||||
assert_eq!(cat, SkipCategory::Kebabignore);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn walk_files_with_skips_counts_gitignored_files() {
|
||||
use std::fs;
|
||||
use tempfile::TempDir;
|
||||
|
||||
let tmp = TempDir::new().unwrap();
|
||||
let root = tmp.path();
|
||||
fs::write(root.join(".gitignore"), "*.log\n").unwrap();
|
||||
fs::write(root.join("ok.md"), "# ok").unwrap();
|
||||
fs::write(root.join("skipme.log"), "x").unwrap();
|
||||
|
||||
let ov = build_overrides(root, &[], &[], &[]).unwrap();
|
||||
let (accepted, skipped_entries) = walk_files_with_skips(root, &ov).unwrap();
|
||||
|
||||
let accepted_names: Vec<_> = accepted
|
||||
.iter()
|
||||
.map(|p| p.file_name().unwrap().to_string_lossy().into_owned())
|
||||
.collect();
|
||||
assert!(
|
||||
accepted_names.iter().any(|n| n == "ok.md"),
|
||||
"ok.md should be accepted; got: {accepted_names:?}"
|
||||
);
|
||||
assert!(
|
||||
!accepted_names.iter().any(|n| n == "skipme.log"),
|
||||
"skipme.log should not be accepted; got: {accepted_names:?}"
|
||||
);
|
||||
|
||||
let gitignore_skipped: Vec<_> = skipped_entries
|
||||
.iter()
|
||||
.filter(|e| e.category == SkipCategory::Gitignore)
|
||||
.collect();
|
||||
assert!(
|
||||
gitignore_skipped.iter().any(|e| e.path.file_name()
|
||||
.map(|n| n == "skipme.log")
|
||||
.unwrap_or(false)),
|
||||
"skipme.log should appear in gitignore_skipped; skipped: {:?}",
|
||||
skipped_entries.iter().map(|e| &e.path).collect::<Vec<_>>()
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn walk_files_with_skips_counts_builtin_blacklist_dirs() {
|
||||
use std::fs;
|
||||
use tempfile::TempDir;
|
||||
|
||||
let tmp = TempDir::new().unwrap();
|
||||
let root = tmp.path();
|
||||
fs::create_dir_all(root.join("node_modules/foo")).unwrap();
|
||||
fs::write(root.join("node_modules/foo/bar.js"), "x").unwrap();
|
||||
fs::write(root.join("ok.md"), "# ok").unwrap();
|
||||
|
||||
let ov = build_overrides(root, &[], &[], &[]).unwrap();
|
||||
let (accepted, skipped_entries) = walk_files_with_skips(root, &ov).unwrap();
|
||||
|
||||
let accepted_names: Vec<_> = accepted
|
||||
.iter()
|
||||
.map(|p| p.file_name().unwrap().to_string_lossy().into_owned())
|
||||
.collect();
|
||||
assert!(
|
||||
accepted_names.iter().any(|n| n == "ok.md"),
|
||||
"ok.md must be accepted; got: {accepted_names:?}"
|
||||
);
|
||||
|
||||
let builtin_skipped: Vec<_> = skipped_entries
|
||||
.iter()
|
||||
.filter(|e| e.category == SkipCategory::BuiltinBlacklist)
|
||||
.collect();
|
||||
assert!(
|
||||
!builtin_skipped.is_empty(),
|
||||
"node_modules/ should produce at least one BuiltinBlacklist skip"
|
||||
);
|
||||
assert!(
|
||||
builtin_skipped.iter().any(|e| e.path.components()
|
||||
.any(|c| c.as_os_str() == "node_modules")),
|
||||
"skipped path should contain node_modules; got: {:?}",
|
||||
builtin_skipped.iter().map(|e| &e.path).collect::<Vec<_>>()
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
111
crates/kebab-source-fs/tests/include_allowlist.rs
Normal file
111
crates/kebab-source-fs/tests/include_allowlist.rs
Normal file
@@ -0,0 +1,111 @@
|
||||
//! Integration test: `scope.include` enforces an allow-list.
|
||||
//!
|
||||
//! Semantics (gitignore convention):
|
||||
//! - `include` is empty Vec → all files pass through (backward-compat).
|
||||
//! - `include` is non-empty → only files matching at least one pattern
|
||||
//! are accepted. `exclude` rules still apply after include.
|
||||
//!
|
||||
//! Layout (built per-test in a TempDir):
|
||||
//! root/
|
||||
//! ├── a.md
|
||||
//! ├── b.py
|
||||
//! ├── c.png
|
||||
//! └── d.pdf
|
||||
|
||||
use std::fs;
|
||||
|
||||
use kebab_config::Config;
|
||||
use kebab_core::{SourceConnector, SourceScope};
|
||||
use kebab_source_fs::FsSourceConnector;
|
||||
|
||||
fn cfg_with_root(root: &str) -> Config {
|
||||
let mut c = Config::defaults();
|
||||
c.workspace.root = root.to_string();
|
||||
c.workspace.exclude.clear();
|
||||
// Disable size / generated caps so small test files always pass.
|
||||
c.ingest.code.max_file_bytes = u64::MAX;
|
||||
c.ingest.code.max_file_lines = u32::MAX;
|
||||
c.ingest.code.skip_generated_header = false;
|
||||
c
|
||||
}
|
||||
|
||||
fn setup_mixed_dir() -> tempfile::TempDir {
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let root = dir.path();
|
||||
fs::write(root.join("a.md"), b"md").unwrap();
|
||||
fs::write(root.join("b.py"), b"py").unwrap();
|
||||
fs::write(root.join("c.png"), b"\x89PNG").unwrap();
|
||||
fs::write(root.join("d.pdf"), b"%PDF").unwrap();
|
||||
dir
|
||||
}
|
||||
|
||||
/// Empty include → all 4 files pass (backward-compat).
|
||||
#[test]
|
||||
fn include_empty_accepts_all_files() {
|
||||
let dir = setup_mixed_dir();
|
||||
let conn = FsSourceConnector::new(&cfg_with_root(dir.path().to_str().unwrap())).unwrap();
|
||||
let scope = SourceScope {
|
||||
include: vec![],
|
||||
..SourceScope::default()
|
||||
};
|
||||
let assets = conn.scan(&scope).unwrap();
|
||||
let names: Vec<_> = assets.iter().map(|a| a.workspace_path.0.clone()).collect();
|
||||
assert!(names.contains(&"a.md".to_string()), "a.md missing; got: {names:?}");
|
||||
assert!(names.contains(&"b.py".to_string()), "b.py missing; got: {names:?}");
|
||||
assert!(names.contains(&"c.png".to_string()), "c.png missing; got: {names:?}");
|
||||
assert!(names.contains(&"d.pdf".to_string()), "d.pdf missing; got: {names:?}");
|
||||
assert_eq!(names.len(), 4, "expected exactly 4 files; got: {names:?}");
|
||||
}
|
||||
|
||||
/// Non-empty include → only md + py come back; png + pdf are excluded.
|
||||
#[test]
|
||||
fn include_nonempty_is_allowlist() {
|
||||
let dir = setup_mixed_dir();
|
||||
let conn = FsSourceConnector::new(&cfg_with_root(dir.path().to_str().unwrap())).unwrap();
|
||||
let scope = SourceScope {
|
||||
include: vec!["**/*.md".to_string(), "**/*.py".to_string()],
|
||||
..SourceScope::default()
|
||||
};
|
||||
let assets = conn.scan(&scope).unwrap();
|
||||
let names: Vec<_> = assets.iter().map(|a| a.workspace_path.0.clone()).collect();
|
||||
assert!(names.contains(&"a.md".to_string()), "a.md should be accepted; got: {names:?}");
|
||||
assert!(names.contains(&"b.py".to_string()), "b.py should be accepted; got: {names:?}");
|
||||
assert!(
|
||||
!names.contains(&"c.png".to_string()),
|
||||
"c.png must be rejected by include allowlist; got: {names:?}"
|
||||
);
|
||||
assert!(
|
||||
!names.contains(&"d.pdf".to_string()),
|
||||
"d.pdf must be rejected by include allowlist; got: {names:?}"
|
||||
);
|
||||
assert_eq!(names.len(), 2, "expected exactly 2 files; got: {names:?}");
|
||||
}
|
||||
|
||||
/// include + exclude are ANDed: a file matching include but also matching
|
||||
/// exclude must be rejected.
|
||||
#[test]
|
||||
fn include_and_exclude_are_anded() {
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let root = dir.path();
|
||||
fs::write(root.join("keep.md"), b"keep").unwrap();
|
||||
fs::write(root.join("drop.md"), b"drop").unwrap();
|
||||
fs::write(root.join("other.py"), b"py").unwrap();
|
||||
|
||||
let conn = FsSourceConnector::new(&cfg_with_root(root.to_str().unwrap())).unwrap();
|
||||
let scope = SourceScope {
|
||||
include: vec!["**/*.md".to_string()],
|
||||
exclude: vec!["drop.md".to_string()],
|
||||
..SourceScope::default()
|
||||
};
|
||||
let assets = conn.scan(&scope).unwrap();
|
||||
let names: Vec<_> = assets.iter().map(|a| a.workspace_path.0.clone()).collect();
|
||||
assert!(names.contains(&"keep.md".to_string()), "keep.md should be accepted; got: {names:?}");
|
||||
assert!(
|
||||
!names.contains(&"drop.md".to_string()),
|
||||
"drop.md should be excluded (matched exclude); got: {names:?}"
|
||||
);
|
||||
assert!(
|
||||
!names.contains(&"other.py".to_string()),
|
||||
"other.py should be excluded (not in include); got: {names:?}"
|
||||
);
|
||||
}
|
||||
@@ -9,7 +9,7 @@
|
||||
//! Expected: `scan` returns in O(seconds), every emitted path is unique,
|
||||
//! and `alpha.md` appears at least once.
|
||||
//!
|
||||
//! The cycle guard lives in `walker::walk_files`; this test exists to
|
||||
//! The cycle guard lives in `walker::walk_files_with_skips`; this test exists to
|
||||
//! prove it catches the realistic shape (cycle through one or more
|
||||
//! symlinks) end-to-end via the public API.
|
||||
|
||||
@@ -100,7 +100,7 @@ fn two_step_directory_cycle_visited_set_breaks_loop() {
|
||||
//
|
||||
// Without the visited-set, walkdir would descend
|
||||
// a → a/loop (=b) → a/loop/loop (=a) → … forever.
|
||||
// The canonical-path visited-set in `walker::walk_files` must break
|
||||
// The canonical-path visited-set in `walker::walk_files_with_skips` must break
|
||||
// the loop and yield a finite, deterministic result.
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let root = dir.path();
|
||||
|
||||
@@ -42,8 +42,19 @@
|
||||
],
|
||||
"root": "/home/u/KB"
|
||||
},
|
||||
"skip_examples": {
|
||||
"builtin_blacklist": [],
|
||||
"generated": [],
|
||||
"gitignore": [],
|
||||
"size_exceeded": []
|
||||
},
|
||||
"skipped": 0,
|
||||
"skipped_builtin_blacklist": 0,
|
||||
"skipped_by_extension": {},
|
||||
"skipped_generated": 0,
|
||||
"skipped_gitignore": 0,
|
||||
"skipped_kebabignore": 0,
|
||||
"skipped_size_exceeded": 0,
|
||||
"unchanged": 0,
|
||||
"updated": 1
|
||||
}
|
||||
|
||||
@@ -286,6 +286,72 @@ impl kebab_core::DocumentStore for SqliteStore {
|
||||
}
|
||||
}
|
||||
|
||||
fn get_document_by_workspace_path(
|
||||
&self,
|
||||
path: &kebab_core::WorkspacePath,
|
||||
) -> Result<Option<kebab_core::CanonicalDocument>> {
|
||||
let conn = self.lock_conn();
|
||||
let row: Option<DocumentRow> = conn
|
||||
.query_row(
|
||||
"SELECT
|
||||
doc_id, asset_id, workspace_path, title, lang,
|
||||
source_type, trust_level, parser_version,
|
||||
doc_version, schema_version, metadata_json,
|
||||
provenance_json, created_at, updated_at,
|
||||
last_chunker_version, last_embedding_version
|
||||
FROM documents WHERE workspace_path = ?",
|
||||
params![path.0],
|
||||
document_row_from_sql,
|
||||
)
|
||||
.map(Some)
|
||||
.or_else(rows_optional)
|
||||
.map_err(StoreError::from)?;
|
||||
let Some(row) = row else { return Ok(None) };
|
||||
|
||||
let doc_id = kebab_core::DocumentId(row.doc_id.clone());
|
||||
let mut blocks_stmt = conn
|
||||
.prepare(
|
||||
"SELECT payload_json FROM blocks
|
||||
WHERE doc_id = ? ORDER BY ordinal ASC",
|
||||
)
|
||||
.map_err(StoreError::from)?;
|
||||
let block_rows = blocks_stmt
|
||||
.query_map(params![row.doc_id], |r| {
|
||||
let payload_json: String = r.get(0)?;
|
||||
Ok(payload_json)
|
||||
})
|
||||
.map_err(StoreError::from)?;
|
||||
let mut blocks: Vec<kebab_core::Block> = Vec::new();
|
||||
for block_row in block_rows {
|
||||
let payload_json = block_row.map_err(StoreError::from)?;
|
||||
let block: kebab_core::Block = serde_json::from_str(&payload_json)
|
||||
.context("deserialize block payload_json")?;
|
||||
blocks.push(block);
|
||||
}
|
||||
|
||||
let metadata: kebab_core::Metadata = serde_json::from_str(&row.metadata_json)
|
||||
.context("deserialize metadata_json")?;
|
||||
let provenance: kebab_core::Provenance =
|
||||
serde_json::from_str(&row.provenance_json)
|
||||
.context("deserialize provenance_json")?;
|
||||
|
||||
Ok(Some(kebab_core::CanonicalDocument {
|
||||
doc_id,
|
||||
source_asset_id: kebab_core::AssetId(row.asset_id),
|
||||
workspace_path: kebab_core::WorkspacePath(row.workspace_path),
|
||||
title: row.title.unwrap_or_default(),
|
||||
lang: kebab_core::Lang(row.lang.unwrap_or_default()),
|
||||
blocks,
|
||||
metadata,
|
||||
provenance,
|
||||
parser_version: kebab_core::ParserVersion(row.parser_version),
|
||||
schema_version: row.schema_version as u32,
|
||||
doc_version: row.doc_version as u32,
|
||||
last_chunker_version: row.last_chunker_version.map(kebab_core::ChunkerVersion),
|
||||
last_embedding_version: row.last_embedding_version.map(kebab_core::EmbeddingVersion),
|
||||
}))
|
||||
}
|
||||
|
||||
fn list_documents(
|
||||
&self,
|
||||
filter: &kebab_core::DocFilter,
|
||||
|
||||
@@ -153,6 +153,34 @@ impl SqliteStore {
|
||||
}
|
||||
}
|
||||
|
||||
// p10-1A-1 fix (dogfood-discovered 2026-05-20): code_lang filter
|
||||
// (IN-list on metadata_json.$.code_lang). Empty Vec = no filter.
|
||||
if !filters.code_lang.is_empty() {
|
||||
let placeholders = std::iter::repeat_n("?", filters.code_lang.len())
|
||||
.collect::<Vec<_>>()
|
||||
.join(",");
|
||||
sql.push_str(&format!(
|
||||
" AND json_extract(d.metadata_json, '$.code_lang') IN ({placeholders})"
|
||||
));
|
||||
for lang in &filters.code_lang {
|
||||
bind.push(Box::new(lang.clone()));
|
||||
}
|
||||
}
|
||||
|
||||
// p10-1A-1 fix (dogfood-discovered 2026-05-20): repo filter
|
||||
// (IN-list on metadata_json.$.repo). Empty Vec = no filter.
|
||||
if !filters.repo.is_empty() {
|
||||
let placeholders = std::iter::repeat_n("?", filters.repo.len())
|
||||
.collect::<Vec<_>>()
|
||||
.join(",");
|
||||
sql.push_str(&format!(
|
||||
" AND json_extract(d.metadata_json, '$.repo') IN ({placeholders})"
|
||||
));
|
||||
for repo in &filters.repo {
|
||||
bind.push(Box::new(repo.clone()));
|
||||
}
|
||||
}
|
||||
|
||||
// p9-fb-36: ingested_after filter.
|
||||
// `documents.updated_at` is RFC3339 TEXT (UTC `Z` per fb-32);
|
||||
// lexicographic >= compare is correct — but only when the filter
|
||||
@@ -408,6 +436,78 @@ mod tests {
|
||||
.unwrap();
|
||||
}
|
||||
|
||||
/// Variant of `seed_committed_full` that additionally accepts a
|
||||
/// `metadata_json` string so p10-1A-1 filter tests can set
|
||||
/// `metadata.code_lang` / `metadata.repo` without going through the
|
||||
/// full ingest pipeline.
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
fn seed_committed_with_metadata(
|
||||
store: &SqliteStore,
|
||||
chunk_id: &str,
|
||||
doc_id: &str,
|
||||
workspace_path: &str,
|
||||
media_type_json: &str,
|
||||
metadata_json: &str,
|
||||
) {
|
||||
let asset_id = format!("a{}", &doc_id[..31]);
|
||||
{
|
||||
let conn = store.lock_conn();
|
||||
conn.execute(
|
||||
"INSERT INTO assets (
|
||||
asset_id, source_uri, workspace_path, media_type, byte_len,
|
||||
checksum, storage_kind, storage_path, discovered_at
|
||||
) VALUES (?, ?, ?, ?, 0, 'deadbeefdeadbeefdeadbeefdeadbeef',
|
||||
'reference', ?, '1970-01-01T00:00:00Z')",
|
||||
params![
|
||||
asset_id,
|
||||
format!("file://{workspace_path}"),
|
||||
workspace_path,
|
||||
media_type_json,
|
||||
workspace_path,
|
||||
],
|
||||
)
|
||||
.unwrap();
|
||||
conn.execute(
|
||||
"INSERT INTO documents (
|
||||
doc_id, asset_id, workspace_path, title, lang, source_type,
|
||||
trust_level, parser_version, doc_version, schema_version,
|
||||
metadata_json, provenance_json, created_at, updated_at
|
||||
) VALUES (?, ?, ?, NULL, 'en', 'code', 'primary', 'v1', 1, 1,
|
||||
?, '{}', '1970-01-01T00:00:00Z', '1970-01-01T00:00:00Z')",
|
||||
params![doc_id, asset_id, workspace_path, metadata_json],
|
||||
)
|
||||
.unwrap();
|
||||
conn.execute(
|
||||
"INSERT INTO chunks (
|
||||
chunk_id, doc_id, text, heading_path_json, section_label,
|
||||
source_spans_json, token_estimate, chunker_version,
|
||||
policy_hash, block_ids_json, created_at
|
||||
) VALUES (?, ?, 'code snippet', '[]', NULL, '[]', 1, 'v1', 'h', '[]',
|
||||
'1970-01-01T00:00:00Z')",
|
||||
params![chunk_id, doc_id],
|
||||
)
|
||||
.unwrap();
|
||||
}
|
||||
|
||||
let embed_row = EmbeddingRecordRow {
|
||||
embedding_id: format!("e{}", &chunk_id[..31]),
|
||||
chunk_id: chunk_id.to_string(),
|
||||
model_id: "m".to_string(),
|
||||
model_version: "v1".to_string(),
|
||||
dimensions: 4,
|
||||
lance_table: "t".to_string(),
|
||||
created_at: OffsetDateTime::UNIX_EPOCH,
|
||||
};
|
||||
store
|
||||
.put_embedding_records_pending(std::slice::from_ref(&embed_row))
|
||||
.unwrap();
|
||||
store
|
||||
.mark_embedding_records_committed(std::slice::from_ref(
|
||||
&embed_row.embedding_id,
|
||||
))
|
||||
.unwrap();
|
||||
}
|
||||
|
||||
fn cid(s: &str) -> ChunkId {
|
||||
ChunkId(s.to_string())
|
||||
}
|
||||
@@ -671,6 +771,78 @@ mod tests {
|
||||
assert_eq!(out, vec![cid(c1)], "doc_id filter must scope to the target doc only");
|
||||
}
|
||||
|
||||
// ── p10-1A-1 new filter arms ─────────────────────────────────────────
|
||||
|
||||
#[test]
|
||||
fn filter_chunks_code_lang_keeps_matching_lang() {
|
||||
// c1 = python, c2 = rust, c3 = markdown (no code_lang).
|
||||
// Filter code_lang=["python"] → only c1 survives.
|
||||
let tmp = TempDir::new().unwrap();
|
||||
let store = open_store(&tmp);
|
||||
let c1 = "11111111111111111111111111111111";
|
||||
let c2 = "22222222222222222222222222222222";
|
||||
let c3 = "33333333333333333333333333333333";
|
||||
seed_committed_with_metadata(
|
||||
&store, c1, "d1d1d1d1d1d1d1d1d1d1d1d1d1d1d1d1",
|
||||
"src/main.py", r#""code""#,
|
||||
r#"{"code_lang":"python"}"#,
|
||||
);
|
||||
seed_committed_with_metadata(
|
||||
&store, c2, "d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2",
|
||||
"src/lib.rs", r#""code""#,
|
||||
r#"{"code_lang":"rust"}"#,
|
||||
);
|
||||
seed_committed_with_metadata(
|
||||
&store, c3, "d3d3d3d3d3d3d3d3d3d3d3d3d3d3d3d3",
|
||||
"README.md", r#""markdown""#,
|
||||
r#"{}"#,
|
||||
);
|
||||
|
||||
let f = SearchFilters {
|
||||
code_lang: vec!["python".to_string()],
|
||||
..Default::default()
|
||||
};
|
||||
let out = store
|
||||
.filter_chunks(&[cid(c1), cid(c2), cid(c3)], &f)
|
||||
.unwrap();
|
||||
assert_eq!(out, vec![cid(c1)], "only python chunk should survive code_lang filter");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn filter_chunks_repo_keeps_matching_repo() {
|
||||
// c1 = repo "httpx", c2 = repo "requests", c3 = no repo.
|
||||
// Filter repo=["httpx"] → only c1 survives.
|
||||
let tmp = TempDir::new().unwrap();
|
||||
let store = open_store(&tmp);
|
||||
let c1 = "11111111111111111111111111111111";
|
||||
let c2 = "22222222222222222222222222222222";
|
||||
let c3 = "33333333333333333333333333333333";
|
||||
seed_committed_with_metadata(
|
||||
&store, c1, "d1d1d1d1d1d1d1d1d1d1d1d1d1d1d1d1",
|
||||
"httpx/client.py", r#""code""#,
|
||||
r#"{"repo":"httpx","code_lang":"python"}"#,
|
||||
);
|
||||
seed_committed_with_metadata(
|
||||
&store, c2, "d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2",
|
||||
"requests/api.py", r#""code""#,
|
||||
r#"{"repo":"requests","code_lang":"python"}"#,
|
||||
);
|
||||
seed_committed_with_metadata(
|
||||
&store, c3, "d3d3d3d3d3d3d3d3d3d3d3d3d3d3d3d3",
|
||||
"standalone.py", r#""code""#,
|
||||
r#"{"code_lang":"python"}"#,
|
||||
);
|
||||
|
||||
let f = SearchFilters {
|
||||
repo: vec!["httpx".to_string()],
|
||||
..Default::default()
|
||||
};
|
||||
let out = store
|
||||
.filter_chunks(&[cid(c1), cid(c2), cid(c3)], &f)
|
||||
.unwrap();
|
||||
assert_eq!(out, vec![cid(c1)], "only httpx chunk should survive repo filter");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn filter_chunks_ingested_after_non_utc_offset_compares_as_instant() {
|
||||
// Regression test for the non-UTC offset lex-compare bug.
|
||||
|
||||
@@ -669,6 +669,71 @@ impl SqliteStore {
|
||||
) -> anyhow::Result<CountSummary> {
|
||||
self.count_summary_inner(threshold_days)
|
||||
}
|
||||
|
||||
/// p10-1A-2: per-code-language doc count for `schema.v1`.
|
||||
///
|
||||
/// Reads `metadata_json->'$.code_lang'`, groups by the value, and
|
||||
/// skips rows where `code_lang` is NULL (i.e. non-code documents).
|
||||
/// Returns `BTreeMap<String, u32>` — key is the canonical lowercase
|
||||
/// language identifier (e.g. `"rust"`), value is the doc count.
|
||||
pub fn code_lang_breakdown(
|
||||
&self,
|
||||
) -> anyhow::Result<std::collections::BTreeMap<String, u32>> {
|
||||
use anyhow::Context;
|
||||
let conn = self.read_conn();
|
||||
let mut stmt = conn
|
||||
.prepare(
|
||||
"SELECT json_extract(metadata_json, '$.code_lang') AS cl, COUNT(*) \
|
||||
FROM documents \
|
||||
WHERE cl IS NOT NULL \
|
||||
GROUP BY cl",
|
||||
)
|
||||
.context("prepare code_lang_breakdown")?;
|
||||
let rows = stmt
|
||||
.query_map([], |r| {
|
||||
Ok((r.get::<_, String>(0)?, r.get::<_, i64>(1)? as u32))
|
||||
})
|
||||
.context("query code_lang_breakdown")?;
|
||||
let mut out = std::collections::BTreeMap::new();
|
||||
for row in rows {
|
||||
let (k, v) = row.context("read code_lang_breakdown row")?;
|
||||
out.insert(k, v);
|
||||
}
|
||||
Ok(out)
|
||||
}
|
||||
|
||||
/// p10-1A-2 follow-up (dogfooding 2026-05-20): per-repo doc count for
|
||||
/// `schema.v1`.
|
||||
///
|
||||
/// Reads `metadata_json->'$.repo'`, groups by the value, and skips rows
|
||||
/// where `repo` is NULL (documents without an explicit repo tag).
|
||||
/// Returns `BTreeMap<String, u32>` — key is the repo name as stored in
|
||||
/// frontmatter, value is the doc count.
|
||||
pub fn repo_breakdown(
|
||||
&self,
|
||||
) -> anyhow::Result<std::collections::BTreeMap<String, u32>> {
|
||||
use anyhow::Context;
|
||||
let conn = self.read_conn();
|
||||
let mut stmt = conn
|
||||
.prepare(
|
||||
"SELECT json_extract(metadata_json, '$.repo') AS rp, COUNT(*) \
|
||||
FROM documents \
|
||||
WHERE rp IS NOT NULL \
|
||||
GROUP BY rp",
|
||||
)
|
||||
.context("prepare repo_breakdown")?;
|
||||
let rows = stmt
|
||||
.query_map([], |r| {
|
||||
Ok((r.get::<_, String>(0)?, r.get::<_, i64>(1)? as u32))
|
||||
})
|
||||
.context("query repo_breakdown")?;
|
||||
let mut out = std::collections::BTreeMap::new();
|
||||
for row in rows {
|
||||
let (k, v) = row.context("read repo_breakdown row")?;
|
||||
out.insert(k, v);
|
||||
}
|
||||
Ok(out)
|
||||
}
|
||||
}
|
||||
|
||||
/// Apply the design §5 / task-spec pragmas. Called once per connection.
|
||||
@@ -710,5 +775,154 @@ mod tests {
|
||||
assert!(s.lang_breakdown.is_empty());
|
||||
assert_eq!(s.stale_doc_count, 0);
|
||||
}
|
||||
|
||||
/// p10-1A-2: `code_lang_breakdown` counts docs by `metadata_json.code_lang`.
|
||||
///
|
||||
/// Inserts:
|
||||
/// - one doc with `code_lang = "rust"` → must appear with count 1
|
||||
/// - one doc with `code_lang = null` → must NOT appear (NULL skipped)
|
||||
///
|
||||
/// Uses a side rusqlite connection that bypasses the `assets` FK via
|
||||
/// `PRAGMA foreign_keys = OFF` so the test is self-contained.
|
||||
#[test]
|
||||
fn code_lang_breakdown_counts_by_code_lang() {
|
||||
let (dir, store) = open_fresh_store();
|
||||
|
||||
// Insert two document rows directly. Disabling FK enforcement lets
|
||||
// us skip the companion `assets` insert.
|
||||
let db_path = dir.path().join("kebab.sqlite");
|
||||
let conn = rusqlite::Connection::open(&db_path).unwrap();
|
||||
conn.pragma_update(None, "foreign_keys", "OFF").unwrap();
|
||||
|
||||
// Doc 1: Rust code file — code_lang = "rust"
|
||||
conn.execute(
|
||||
"INSERT INTO documents (
|
||||
doc_id, asset_id, workspace_path,
|
||||
source_type, trust_level, parser_version,
|
||||
doc_version, schema_version,
|
||||
metadata_json, provenance_json,
|
||||
created_at, updated_at
|
||||
) VALUES (
|
||||
'doc-rust-1', 'asset-1', 'src/main.rs',
|
||||
'reference', 'primary', 'test-v1',
|
||||
1, 1,
|
||||
'{\"code_lang\":\"rust\"}', '{}',
|
||||
'2024-01-01T00:00:00Z', '2024-01-01T00:00:00Z'
|
||||
)",
|
||||
[],
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
// Doc 2: Markdown doc — code_lang absent (null in JSON)
|
||||
conn.execute(
|
||||
"INSERT INTO documents (
|
||||
doc_id, asset_id, workspace_path,
|
||||
source_type, trust_level, parser_version,
|
||||
doc_version, schema_version,
|
||||
metadata_json, provenance_json,
|
||||
created_at, updated_at
|
||||
) VALUES (
|
||||
'doc-md-1', 'asset-2', 'notes/readme.md',
|
||||
'markdown', 'primary', 'test-v1',
|
||||
1, 1,
|
||||
'{\"code_lang\":null}', '{}',
|
||||
'2024-01-01T00:00:00Z', '2024-01-01T00:00:00Z'
|
||||
)",
|
||||
[],
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
drop(conn); // release side connection before querying via store
|
||||
|
||||
let bd = store.code_lang_breakdown().unwrap();
|
||||
|
||||
// rust must appear with count 1
|
||||
assert_eq!(
|
||||
bd.get("rust"),
|
||||
Some(&1u32),
|
||||
"expected rust=1 in code_lang_breakdown, got: {bd:?}"
|
||||
);
|
||||
// null code_lang must NOT appear as any key
|
||||
assert!(
|
||||
!bd.contains_key("null"),
|
||||
"null code_lang must not appear in breakdown, got: {bd:?}"
|
||||
);
|
||||
// only one key total
|
||||
assert_eq!(bd.len(), 1, "expected exactly 1 entry, got: {bd:?}");
|
||||
}
|
||||
|
||||
/// p10-1A-2 follow-up: `repo_breakdown` counts docs by
|
||||
/// `metadata_json.repo`.
|
||||
///
|
||||
/// Inserts:
|
||||
/// - one doc with `repo = "my-repo"` → must appear with count 1
|
||||
/// - one doc with `repo = null` → must NOT appear (NULL skipped)
|
||||
///
|
||||
/// Uses a side rusqlite connection that bypasses the `assets` FK via
|
||||
/// `PRAGMA foreign_keys = OFF` so the test is self-contained.
|
||||
#[test]
|
||||
fn repo_breakdown_counts_by_repo() {
|
||||
let (dir, store) = open_fresh_store();
|
||||
|
||||
let db_path = dir.path().join("kebab.sqlite");
|
||||
let conn = rusqlite::Connection::open(&db_path).unwrap();
|
||||
conn.pragma_update(None, "foreign_keys", "OFF").unwrap();
|
||||
|
||||
// Doc 1: doc with repo = "my-repo"
|
||||
conn.execute(
|
||||
"INSERT INTO documents (
|
||||
doc_id, asset_id, workspace_path,
|
||||
source_type, trust_level, parser_version,
|
||||
doc_version, schema_version,
|
||||
metadata_json, provenance_json,
|
||||
created_at, updated_at
|
||||
) VALUES (
|
||||
'doc-repo-1', 'asset-r1', 'my-repo/README.md',
|
||||
'markdown', 'primary', 'test-v1',
|
||||
1, 1,
|
||||
'{\"repo\":\"my-repo\"}', '{}',
|
||||
'2024-01-01T00:00:00Z', '2024-01-01T00:00:00Z'
|
||||
)",
|
||||
[],
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
// Doc 2: doc with repo absent (null in JSON)
|
||||
conn.execute(
|
||||
"INSERT INTO documents (
|
||||
doc_id, asset_id, workspace_path,
|
||||
source_type, trust_level, parser_version,
|
||||
doc_version, schema_version,
|
||||
metadata_json, provenance_json,
|
||||
created_at, updated_at
|
||||
) VALUES (
|
||||
'doc-norepo-1', 'asset-r2', 'standalone/notes.md',
|
||||
'markdown', 'primary', 'test-v1',
|
||||
1, 1,
|
||||
'{\"repo\":null}', '{}',
|
||||
'2024-01-01T00:00:00Z', '2024-01-01T00:00:00Z'
|
||||
)",
|
||||
[],
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
drop(conn); // release side connection before querying via store
|
||||
|
||||
let bd = store.repo_breakdown().unwrap();
|
||||
|
||||
// "my-repo" must appear with count 1
|
||||
assert_eq!(
|
||||
bd.get("my-repo"),
|
||||
Some(&1u32),
|
||||
"expected my-repo=1 in repo_breakdown, got: {bd:?}"
|
||||
);
|
||||
// null repo must NOT appear as any key
|
||||
assert!(
|
||||
!bd.contains_key("null"),
|
||||
"null repo must not appear in breakdown, got: {bd:?}"
|
||||
);
|
||||
// only one key total
|
||||
assert_eq!(bd.len(), 1, "expected exactly 1 entry, got: {bd:?}");
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -42,6 +42,10 @@ fn make_metadata() -> Metadata {
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Default::default(),
|
||||
repo: None,
|
||||
git_branch: None,
|
||||
git_commit: None,
|
||||
code_lang: None,
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -51,6 +51,10 @@ fn make_doc() -> CanonicalDocument {
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user: Default::default(),
|
||||
repo: None,
|
||||
git_branch: None,
|
||||
git_commit: None,
|
||||
code_lang: None,
|
||||
};
|
||||
CanonicalDocument {
|
||||
doc_id,
|
||||
|
||||
@@ -35,6 +35,12 @@ fn fixture_report() -> IngestReport {
|
||||
errors: 0,
|
||||
duration_ms: 187,
|
||||
skipped_by_extension: std::collections::BTreeMap::new(),
|
||||
skipped_gitignore: 0,
|
||||
skipped_kebabignore: 0,
|
||||
skipped_builtin_blacklist: 0,
|
||||
skipped_generated: 0,
|
||||
skipped_size_exceeded: 0,
|
||||
skip_examples: kebab_core::SkipExamples::default(),
|
||||
items: Some(vec![
|
||||
IngestItem {
|
||||
kind: IngestItemKind::New,
|
||||
|
||||
@@ -54,6 +54,10 @@ fn make_doc(
|
||||
trust_level: trust,
|
||||
user_id_alias: None,
|
||||
user: Default::default(),
|
||||
repo: None,
|
||||
git_branch: None,
|
||||
git_commit: None,
|
||||
code_lang: None,
|
||||
};
|
||||
let doc = CanonicalDocument {
|
||||
doc_id,
|
||||
|
||||
@@ -455,6 +455,15 @@ fn describe_span(span: &kebab_core::SourceSpan) -> String {
|
||||
SourceSpan::Time { start_ms, end_ms } => {
|
||||
format!("Time {start_ms}-{end_ms} ms")
|
||||
}
|
||||
SourceSpan::Code {
|
||||
line_start,
|
||||
line_end,
|
||||
symbol,
|
||||
..
|
||||
} => match symbol {
|
||||
Some(sym) => format!("Code {line_start}-{line_end} ({sym})"),
|
||||
None => format!("Code {line_start}-{line_end}"),
|
||||
},
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -79,6 +79,10 @@ fn make_doc() -> CanonicalDocument {
|
||||
trust_level: TrustLevel::Primary,
|
||||
user_id_alias: None,
|
||||
user,
|
||||
repo: None,
|
||||
git_branch: None,
|
||||
git_commit: None,
|
||||
code_lang: None,
|
||||
},
|
||||
provenance: Provenance {
|
||||
events: vec![ProvenanceEvent {
|
||||
|
||||
@@ -56,6 +56,8 @@ fn make_hit(rank: u32, path: &str, snippet: &str, citation: Citation) -> SearchH
|
||||
indexed_at: time::OffsetDateTime::UNIX_EPOCH,
|
||||
stale: false,
|
||||
score_kind: kebab_core::ScoreKind::Rrf,
|
||||
repo: None,
|
||||
code_lang: None,
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -22,6 +22,8 @@ Cargo workspace, 함수 호출 기반 모듈러 모놀리스. UI binary (`kebab-
|
||||
| OCR | Ollama vision LM (default `gemma4:e4b`) — `OcrEngine` trait 으로 Tesseract / Apple Vision 등 future swap (HOTFIXES P6-2) |
|
||||
| Image caption | Ollama vision LM, runtime gate `image.caption.enabled` (default OFF) |
|
||||
| PDF parser | `lopdf` per-page 텍스트, `chunker_version = "pdf-page-v1"` 가 PDF 자산에 하드코딩 (HOTFIXES P7-3) |
|
||||
| code parser | `tree-sitter` + `tree-sitter-rust` / `tree-sitter-python` / `tree-sitter-typescript` / `tree-sitter-javascript` — **parser-side** (`kebab-parse-code`), chunker-side 아님 (design §6.3). chunker versions: Rust = `code-rust-ast-v1`, Python = `code-python-ast-v1`, TypeScript = `code-ts-ast-v1`, JavaScript = `code-js-ast-v1`. `ast_chunk_max_lines = 200` 상수 고정 (HOTFIXES 2026-05-19 — Chunker trait 이 per-medium config 미노출). |
|
||||
| 1B symbol path | workspace path → module path: Python = dotted prefix (`kebab_eval.metrics.compute_mrr`), TypeScript/JavaScript = slash-style prefix (`src/Foo.Foo.search`). Rust 1A-2 는 file-scope nesting 만 (workspace prefix 없음, 비일관 수용 — HOTFIXES 2026-05-20). |
|
||||
| TUI | Ratatui + crossterm — P9-1 Library 패널, P9-2/3/4 진행 예정 |
|
||||
| Desktop | Tauri 2 + `pdfjs-dist` (native PDF render backend 금지) — P9-5 |
|
||||
| citation 형식 | URI fragment (`path#L12-L34` / `path#p=12` / `path#xywh=0,0,100,50`, W3C Media Fragments) |
|
||||
@@ -50,6 +52,7 @@ flowchart TB
|
||||
ppdf["kebab-parse-pdf"]
|
||||
pimg["kebab-parse-image"]
|
||||
paud["kebab-parse-audio<br/>(P8 보류)"]
|
||||
pcode["kebab-parse-code<br/>(P10-1A-2 + P10-1B)"]
|
||||
ptypes["kebab-parse-types"]
|
||||
norm["kebab-normalize"]
|
||||
chunk["kebab-chunk"]
|
||||
@@ -80,6 +83,7 @@ flowchart TB
|
||||
app --> ppdf
|
||||
app --> pimg
|
||||
app --> paud
|
||||
app --> pcode
|
||||
app --> norm
|
||||
app --> chunk
|
||||
app --> sqlite
|
||||
@@ -95,6 +99,7 @@ flowchart TB
|
||||
ppdf --> ptypes
|
||||
pimg --> ptypes
|
||||
paud --> ptypes
|
||||
pcode --> core
|
||||
norm --> ptypes
|
||||
embedlocal --> embed
|
||||
llmlocal --> llm
|
||||
@@ -122,6 +127,8 @@ flowchart TB
|
||||
|
||||
UI → store/llm/parse 직접 의존 금지. 모든 user-facing 진입은 `kebab-app` facade 만 통한다 (frozen 설계 §8). `kebab-cli` 가 `--config <path>` flag 를 honor 하려면 `kebab_app::*_with_config(cfg, …)` companion 을 통해 Config 을 명시적으로 thread 하는 패턴 — 자세한 이유는 [tasks/HOTFIXES.md](../tasks/HOTFIXES.md) 의 `--config` 항목.
|
||||
|
||||
`kebab-parse-code` 의 외부 tree-sitter grammar crate 의존: P10-1A-2 에서 `tree-sitter-rust` 추가, P10-1B 에서 `tree-sitter-python` / `tree-sitter-typescript` / `tree-sitter-javascript` 추가. 모두 `kebab-parse-code` 에만 격리 (facade 룰 — UI crate / chunker 가 직접 import 금지).
|
||||
|
||||
## 디렉토리 구조
|
||||
|
||||
```text
|
||||
@@ -158,7 +165,7 @@ kebab/
|
||||
│ ├── kebab-source-fs/ # 워크스페이스 walk + checksum (P1-1)
|
||||
│ ├── kebab-parse-md/ # Markdown frontmatter + blocks (P1-2/3)
|
||||
│ ├── kebab-normalize/ # ParsedBlock → CanonicalDocument (P1-4)
|
||||
│ ├── kebab-chunk/ # heading-aware + pdf-page-v1 chunker (P1-5, P7-2)
|
||||
│ ├── kebab-chunk/ # heading-aware + pdf-page-v1 + code-rust-ast-v1 + code-python-ast-v1 + code-ts-ast-v1 + code-js-ast-v1 chunker (P1-5, P7-2, P10-1A-2, P10-1B)
|
||||
│ ├── kebab-store-sqlite/ # SQLite + FTS5 (V001/V002/V003) (P1-6, P2-1, P3-3)
|
||||
│ ├── kebab-search/ # Lexical + Vector + Hybrid retriever (P2-2, P3-4)
|
||||
│ ├── kebab-embed/ kebab-embed-local/ # Embedder trait + fastembed adapter (P3-1, P3-2)
|
||||
@@ -168,6 +175,7 @@ kebab/
|
||||
│ ├── kebab-eval/ # golden query runner + metrics (P5-1, P5-2)
|
||||
│ ├── kebab-parse-image/ # ImageExtractor + Ollama OCR + caption (P6)
|
||||
│ ├── kebab-parse-pdf/ # lopdf per-page text extractor (P7-1)
|
||||
│ ├── kebab-parse-code/ # tree-sitter AST extractors: Rust (P10-1A-2), Python + TypeScript + JavaScript (P10-1B); chunker lives in kebab-chunk
|
||||
│ ├── kebab-app/ # facade (P0 시그니처 + P3-5/P6-4/P7-3 본체)
|
||||
│ ├── kebab-tui/ # Ratatui shell + Library 패널 (P9-1)
|
||||
│ ├── kebab-mcp/ # stdio MCP server — tools: schema, doctor, search, ask (P9-FB-30)
|
||||
|
||||
126
docs/SMOKE.md
126
docs/SMOKE.md
@@ -113,6 +113,12 @@ max_context_tokens = 6000
|
||||
|
||||
[ui]
|
||||
theme = "dark" # p9-fb-14 — TUI palette ("dark" / "light", default "dark")
|
||||
|
||||
[ingest.code]
|
||||
skip_generated_header = true
|
||||
max_file_bytes = 262144
|
||||
max_file_lines = 5000
|
||||
extra_skip_globs = [] # 사용자 추가 skip 패턴 (gitignore syntax)
|
||||
```
|
||||
|
||||
`KEBAB_*` 환경변수로 override 가능 (`KEBAB_MODELS_LLM_MODEL=gemma4:26b kebab …` 등). 자세한 키 목록은 `crates/kebab-config/src/lib.rs` 의 `apply_env` 매치 암. `KEBAB_READONLY=1` — write-path 비활성화 (CI 안전망). `KEBAB_PROGRESS=plain` — non-TTY 환경에서 진행 상황을 plain 한 줄씩 stderr 출력 (spinner 대신).
|
||||
@@ -297,6 +303,104 @@ kebab --config /tmp/kebab-smoke/config.toml ask "<PDF 본문에 관한 질문>"
|
||||
|
||||
각 명령은 0 종료 코드면 정상. `kebab ask` 는 거절 시 종료 코드 1 (`RefusalSignal`) — 의도된 동작.
|
||||
|
||||
## P10-1A-2 Rust 코드 색인
|
||||
|
||||
`kebab-parse-code` 의 tree-sitter Rust AST extractor + `code-rust-ast-v1` chunker 를 격리된 TempDir KB 에서 검증하는 절차.
|
||||
|
||||
```bash
|
||||
# 1) 워크스페이스에 Rust 소스 파일 추가 (crate 하나 복사 또는 단일 .rs 파일)
|
||||
cp -r crates/kebab-parse-code /tmp/kebab-smoke/workspace/kebab-parse-code
|
||||
|
||||
# 2) ingest — .rs 가 code-rust-ast-v1 로 처리됨
|
||||
KB ingest
|
||||
|
||||
# 3) 결과 검증 — IngestReport.items 에 .rs 자산이 "new" 로 분류, parser_version = "code-rust-v1" (chunker_version = "code-rust-ast-v1")
|
||||
KB --json ingest | jq '[.items[] | select(.doc_path | endswith(".rs"))]'
|
||||
|
||||
# 4) 코드 검색 — code_lang 필터 (wire: lang 은 citation.lang, code_lang 은 SearchHit top-level)
|
||||
KB search --mode hybrid "RustAstExtractor" --code-lang rust --json | jq '{hits: [.hits[] | {symbol: .citation.symbol, code_lang: .citation.lang, repo: .repo}]}'
|
||||
|
||||
# 5) citation 확인 — kind="code", symbol 이 함수명 / 타입명, line range 가 포함
|
||||
KB search --mode lexical "pub fn extract" --code-lang rust --json | jq '.hits[0].citation'
|
||||
```
|
||||
|
||||
`[ingest.code]` 설정 (config.toml 에 이미 포함됨 — 위 격리 config 블록 참조):
|
||||
|
||||
```toml
|
||||
[ingest.code]
|
||||
skip_generated_header = true # @generated / DO NOT EDIT 감지 시 skip
|
||||
max_file_bytes = 262144 # 256 KiB cap — 초과 시 skip
|
||||
max_file_lines = 5000 # 5000 줄 cap — 초과 시 skip
|
||||
extra_skip_globs = [] # 사용자 추가 skip 패턴
|
||||
```
|
||||
|
||||
**알려진 동작 (2026-05-19 기준)**:
|
||||
|
||||
- `ast_chunk_max_lines = 200` 은 config 가 아닌 chunker 모듈 상수. 현재 기본값과 동일하므로 user-visible 차이 없음. 자세한 내용: `tasks/HOTFIXES.md` (2026-05-19 `AST_CHUNK_MAX_LINES` 항목).
|
||||
- `.rs` 파일은 `SourceType::Note` 로 분류됨 (kebab-core `SourceType::Code` variant 미존재). `--media code` filter 는 정상 동작 — `MediaType::Code("rust")` 로 별도 분류됨. 자세한 내용: `tasks/HOTFIXES.md` (2026-05-19 `SourceType::Code` 항목).
|
||||
- `.gitignore` 가 honor 됨 — `target/` / `node_modules/` 등은 built-in 안전망으로 자동 skip.
|
||||
|
||||
## P10-1B Python / TypeScript / JavaScript 코드 색인
|
||||
|
||||
P10-1A-2 와 동일한 격리 KB 설정으로 Python / TypeScript / JavaScript 3 언어를 검증한다. 설정 블록은 P10-1A-2 와 동일 (`[ingest.code]` 절 포함).
|
||||
|
||||
```bash
|
||||
# 1) 워크스페이스에 Python / TS / JS 파일 추가 (소규모 샘플로 충분)
|
||||
mkdir -p /tmp/kebab-smoke/workspace/sample_code
|
||||
# Python 예시
|
||||
cat > /tmp/kebab-smoke/workspace/sample_code/metrics.py <<'EOF'
|
||||
def compute_mrr(results):
|
||||
"""Mean Reciprocal Rank."""
|
||||
total = 0.0
|
||||
for i, hit in enumerate(results, 1):
|
||||
if hit:
|
||||
total += 1.0 / i
|
||||
break
|
||||
return total
|
||||
EOF
|
||||
# TypeScript 예시
|
||||
cat > /tmp/kebab-smoke/workspace/sample_code/searcher.ts <<'EOF'
|
||||
export class Searcher {
|
||||
search(query: string): string[] {
|
||||
return [];
|
||||
}
|
||||
}
|
||||
EOF
|
||||
# JavaScript 예시
|
||||
cat > /tmp/kebab-smoke/workspace/sample_code/utils.js <<'EOF'
|
||||
function formatResult(hit) {
|
||||
return `${hit.score}: ${hit.path}`;
|
||||
}
|
||||
module.exports = { formatResult };
|
||||
EOF
|
||||
|
||||
# 2) ingest
|
||||
KB ingest
|
||||
|
||||
# 3) 언어별 검색 (symbol + module path prefix 확인)
|
||||
KB search --mode hybrid "compute_mrr" --code-lang python --json | \
|
||||
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang}]}'
|
||||
KB search --mode hybrid "search" --code-lang typescript --json | \
|
||||
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang}]}'
|
||||
KB search --mode hybrid "formatResult" --code-lang javascript --json | \
|
||||
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang}]}'
|
||||
|
||||
# 4) schema stats 에 3 언어 카운트 확인
|
||||
KB --json schema | jq '.stats.code_lang_breakdown'
|
||||
# 기대: {"python": N, "typescript": N, "javascript": N, "rust": M, ...}
|
||||
```
|
||||
|
||||
**Symbol path 컨벤션 (2026-05-20 기준)**:
|
||||
|
||||
- **Python**: workspace 경로 → dotted module path prefix. `sample_code/metrics.py` 의 `compute_mrr` → symbol `sample_code.metrics.compute_mrr`.
|
||||
- **TypeScript / JavaScript**: workspace 경로 → slash-style module path prefix. `sample_code/searcher.ts` 의 `search` → `sample_code/searcher.Searcher.search`. `.tsx` / `.mjs` / `.cjs` / `.jsx` 도 동일 처리.
|
||||
- **Rust** (1A-2): file-scope nesting 만, workspace path prefix 없음 (예: `Foo::double`). Python/TS/JS 와 비일관 — HOTFIXES 2026-05-20 참조.
|
||||
|
||||
**알려진 동작**:
|
||||
|
||||
- `const foo = () => {...}` 같은 expression-level 함수는 `<top-level>` glue 로 잡힘 (declaration-level 단위만 1B 1차 범위). 자세한 내용: `tasks/HOTFIXES.md` (2026-05-20).
|
||||
- `.gitignore` honor — `node_modules/` / `__pycache__/` / `.venv/` 등 built-in 안전망 자동 skip.
|
||||
|
||||
## 검증 체크리스트
|
||||
|
||||
- `kebab doctor` 가 `--config` path 를 honor 하고 그 안의 `storage.data_dir` 를 출력 (XDG default 가 아님).
|
||||
@@ -327,6 +431,28 @@ rm -rf /tmp/kebab-smoke # 통째로 정리
|
||||
- (P6-4) `image.ocr.enabled = true` + `image.caption.enabled = true` 인 워크스페이스에 PNG 가 N장 있으면 ingest 시간 ≈ markdown_time + N × (OCR + Caption latency). `gemma4:e4b` + 192.168.0.47 로 자산당 ~5-10초. 다수의 책 페이지를 이미지로 넣지 말 것 — 책은 P7 PDF 라인 사용 권장.
|
||||
- (P7-3) `config.chunking.chunker_version` 는 markdown 만 represent — PDF 자산은 `pdf-page-v1` 하드코딩. `config.toml` 의 `chunker_version = "md-heading-v1"` 을 봐도 PDF 는 영향 안 받음. HOTFIXES `2026-05-02 P7-3` entry 참조 (P+ chunker registry task 까지 유지).
|
||||
- (P7-3) 한 PDF 가 N 페이지면 `kebab ingest` 가 N 개 (또는 그 이상의, 페이지 길면 multi-chunk) 의 chunk 를 한 transaction 안에서 commit. 500 페이지 책 → 500+ chunk 한 번에 → embedding throughput 가 bottleneck. 임베딩 활성 워크스페이스에서 큰 PDF 를 처음 ingest 하면 분-단위 시간 + WAL 크기 증가 가능 — P+ 스케일 hardening task 까지 정상 동작이지만 비용은 측정 가능.
|
||||
- (P10-1A-2) `.rs` 파일을 워크스페이스에 두면 `kebab ingest` 결과에 `new` 카운터에 포함. `kebab search --mode hybrid "<함수명>" --code-lang rust --json` 가 `citation.kind = "code"`, `citation.lang = "rust"` (SearchHit top-level `code_lang` 도 동일), `citation.symbol` (함수/타입 이름), `citation.line_start` / `citation.line_end` 를 반환하면 wiring 정상. `kebab schema --json | jq .stats.code_lang_breakdown` 에 `"rust": N` 이 나오면 chunk 가 색인됨.
|
||||
- (P10-1B) `.py` / `.ts` / `.tsx` / `.js` / `.mjs` / `.cjs` / `.jsx` 파일을 워크스페이스에 두면 `kebab ingest` 결과에 `new` 카운터에 포함. `--code-lang python` / `--code-lang typescript` / `--code-lang javascript` 검색이 `citation.symbol` 에 module path prefix 를 포함한 결과를 반환하면 wiring 정상. `kebab schema --json | jq .stats.code_lang_breakdown` 에 해당 언어 카운트 등장 확인.
|
||||
- (P7-3 + follow-up) 동일 path 에 byte 가 다른 PDF 를 두 번째 ingest 하면 `purge_vector_orphans_for_workspace_path` 가 옛 chunk_id 를 LanceDB 에서 먼저 삭제, 이어서 `purge_orphan_at_workspace_path` 가 옛 doc / chunks / embedding_records 를 SQLite 에서 sweep. 새 byte 가 새 `doc_id` 로 색인됨. `IngestReport` 에 그 자산만 `new+=1` (다른 자산은 `updated`). 두 store 모두 정합 — 옛 본문 검색 시 옛 chunks 가 더 이상 surface 되지 않음.
|
||||
|
||||
### Embedding upgrade (fb-39b)
|
||||
|
||||
`multilingual-e5-small` 에서 `multilingual-e5-large` 로 업그레이드 시퀀스:
|
||||
|
||||
```bash
|
||||
# 기존 vector index 정리 (orphan table 회피)
|
||||
kebab --config /tmp/kebab-smoke/config.toml reset --vector-only
|
||||
|
||||
# config.toml 의 [models.embedding] 갱신:
|
||||
# model = "multilingual-e5-large"
|
||||
# dimensions = 1024
|
||||
|
||||
# 재-ingest — fastembed 가 첫 실행 시 e5-large ONNX (~1.3 GB) 자동 다운로드.
|
||||
# 다운로드 시간 + 모든 chunk re-embed 시간 (e5-small 대비 ~3-4×).
|
||||
kebab --config /tmp/kebab-smoke/config.toml ingest
|
||||
|
||||
# fb-39 의 P@k metric 으로 small vs large 비교:
|
||||
kebab --config /tmp/kebab-smoke/config.toml eval run
|
||||
```
|
||||
|
||||
자세한 history 와 발견된 버그는 [tasks/HOTFIXES.md](../tasks/HOTFIXES.md) 참조.
|
||||
|
||||
2265
docs/superpowers/plans/2026-05-15-p10-1a-1-code-ingest-framework.md
Normal file
2265
docs/superpowers/plans/2026-05-15-p10-1a-1-code-ingest-framework.md
Normal file
File diff suppressed because it is too large
Load Diff
Some files were not shown because too many files have changed in this diff Show More
Reference in New Issue
Block a user