From 49487dc46b3823486456ec738d60e606a75388ef Mon Sep 17 00:00:00 2001
From: th-kim0823
Date: Fri, 15 May 2026 14:15:59 +0900
Subject: [PATCH 01/21] =?UTF-8?q?spec(p10):=20code=20ingest=20design=20?=
 =?UTF-8?q?=E2=80=94=20Tier=201=20AST=20+=20Tier=202=20resource=20+=20Tier?=
 =?UTF-8?q?=203=20fallback?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Extend the corpus to dozens of git repos (under one parent dir).
Tier 1 (Rust/Python/TS-JS/Go/Java/Kotlin/C/C++) uses a tree-sitter AST
per-language chunker, Tier 2 (k8s manifest / Dockerfile / Cargo.toml and
similar) uses a resource-aware chunker, and Tier 3 (shell / fallback) uses
paragraph + line-window. Embedding stays on multilingual-e5-large for the
sake of cross-corpus search. Work proceeds from Phase 1A (Rust) through
1D (C/C++), then Phase 2 (Tier 2), then Phase 3 (Tier 3).

Unified ignore handling (.gitignore honor + additional .kebabignore +
minimal built-in safety net), a generated header sniff, and a size cap keep
the cost of the first dogfooding run down. The new Citation variant `code`,
the SearchHit repo/code_lang fields, and the --media code / --code-lang /
--repo filters are all additive minor.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../2026-05-15-kebab-code-ingest-design.md    | 812 ++++++++++++++++++
 1 file changed, 812 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md

diff --git a/docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md b/docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md
new file mode 100644
index 0000000..49b51ac
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md
@@ -0,0 +1,812 @@
+# kebab: Code Ingest Design
+
+Reference date: 2026-05-15
+Target: extend the kebab workspace to a **code corpus** (`Tier 1` AST per-language + `Tier 2` resource-aware + `Tier 3` paragraph fallback). This is a follow-up to the frozen design doc `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` and the first spec to **partially** break one of its §11 non-scope items, "perfect parsing of every file format". Other non-scope items such as multi-workspace and watch mode remain unchanged.
+
+Target user scenario: with *dozens of git repos* cloned under one parent directory (`workspace.root`), run semantic search + RAG over the whole corpus in one place.
+
+---
+
+## 0. Summary of frozen decisions
+
+| # | Decision | Value | Rationale |
+|---|------|-----|------|
+| C1 | Scope unit | Multiple repos under one `workspace.root` (no multi-workspace) | Same as frozen Q10 / §11; a single parent directory covers it |
+| C2 | Chunking strategy | Tier 1 = AST per-language, Tier 2 = resource-aware, Tier 3 = paragraph + line-window | Semantic units differ per language, so one uniform rule does not work |
+| C3 | Embedding model | Keep the existing `multilingual-e5-large` | Code and docs share one vector space → cross-corpus search. Avoids an embedding_version cascade |
+| C4 | Ignore integration | Honor `.gitignore` automatically + additional `.kebabignore` layer + minimal built-in safety net | Matches the user's mental model. `.gitignore` is the source of truth |
+| C5 | Repo detection | Auto-detect by `.git/` walk-up, identifier = dir name | Simple / deterministic / does not break for repos without a git remote |
+| C6 | Branch handling | Working tree only. An ingest after a branch change reprocesses incrementally based on blake3 hash differences | A git-history-aware index would shake the §3 domain model heavily, so it is P+ |
+| C7 | Citation variant | Introduce a new `code` variant (line/page/region/caption/time/**code**) | Clear semantic separation; clean branching for agents / consumers |
+| C8 | Search hit additions | `SearchHit.repo`, `SearchHit.code_lang` (optional, additive minor) | Repo isolation / stats / filter |
+| C9 | New filters | `--media code`, `--code-lang `, `--repo ` | search ergonomics |
+| C10 | chunker_version | per-language (`code-rust-ast-v1` etc.) | Language chunkers evolve independently; keeps the §9 cascade rule clean |
+| C11 | Crate structure | New crate `kebab-parse-code` (all language mods) + extension of the existing `kebab-chunk` modules | The 22-crate count grows only once. Adding a language is one pair of modules |
+| C12 | Symbol path | per-language convention (`mod::fn` / `pkg.cls.method` / `module/Class.method` …) | Each language's own self-reference convention |
+| C13 | RAG prompt | Phase 1A keeps `rag-v2`. Consider introducing `rag-v3` after measurement | YAGNI |
+| C14 | Special files | Manifest-like files (`Cargo.toml` etc.) become 1 chunk for the whole file (`manifest-file-v1`) | For small files, seeing the whole thing is more useful |
+| C15 | Phase split | 1A Rust → 1B Python+TS/JS → 1C Go+Java/Kotlin → 1D C/C++ → 2 Tier 2 → 3 Tier 3 | Incremental rollout, dogfooding possible |
+| C16 | Minimal built-in skip | 6 entries: `node_modules/` `target/` `__pycache__/` `.venv/` `venv/` `env/` | `.gitignore` is the main mechanism; built-in is a safety net |
+| C17 | Generated header sniff | 6 kinds of marker such as `@generated` / `DO NOT EDIT`; reads the first ~500 bytes | Cuts the cost of the first dogfooding run (protobuf etc.) |
+| C18 | Size cap | `max_file_bytes = 262144` (256 KiB), `max_file_lines = 5000` default | Blocks large fixtures / minified files |
+
+---
+
+## 1. Scope + non-scope
+
+### 1.1 Scope (what this spec freezes)
+
+- Code / config file ingest pipeline (parse → chunk → embed → store → retrieve → answer)
+- New Citation variant `code`
+- New SearchHit fields (`repo`, `code_lang`)
+- New search filters (`--media code`, `--code-lang`, `--repo`)
+- New chunker_version label family (`code-{lang}-ast-v1`, `k8s-manifest-resource-v1`, `dockerfile-file-v1`, `manifest-file-v1`, `code-text-paragraph-v1`)
+- New crate `kebab-parse-code`
+- Extension of the existing `kebab-chunk` modules
+- Automatic repo detection + `metadata.repo` / `git_branch` / `git_commit`
+- Ignore integration policy (`.gitignore` honor + `.kebabignore` + built-in)
+- generated / vendored / size cap skip policy
+- Extended IngestReport count categories
+- New config section `[ingest.code]`
+- Phase split (1A → 1B → 1C → 1D → 2 → 3)
+
+### 1.2 Non-scope (what this spec explicitly does *not* cover)
+
+- **Multi-workspace**: still a single `workspace.root`. Users arrange the parent directory themselves.
+- **Watch mode**: still explicit ingest only.
+- **git history aware indexing**: no per-branch / per-commit snapshot index. Only one point in time of the working tree.
+- **LSP / go-to-definition / find-references**: code *navigation* is handled well by IDE / CC. kebab does *semantic search* + *RAG* only.
+- **Code-specific embedding model**: revisit after Phase 2+ measurement. This spec keeps e5-large.
+- **`rag-v3` (code-aware prompt)**: revisit after Phase 2+ measurement.
+- **Submodules / git worktree**: only normal repos where `.git/` is a dir are recognized. For a submodule (`.git` file), only metadata.repo is affected: it is null or falls back to the parent repo name.
+- **Intentional cross-repo dedup**: only accidental dedup through the blake3 content hash exists. No explicit dedup logic.
+- **`kebab://` URL handler**: P+, as in frozen §11.
+
+---
+
+## 2. Phase split + milestones
+
+Each phase = a separate task spec (`tasks/p10/p10-1A-code-rust-ast-ingest.md` etc.) + a separate PR. Phase 1A is the heaviest phase, carrying the *entire framework* (new crate, new Citation variant, repo metadata, new filters, the full ignore policy). The rest only *add languages / chunkers*.
+
+| Phase | Content | New crate | New chunker_version | Milestone |
+|-------|------|----------|--------------------|----------|
+| **1A** | Rust AST ingest + entire framework | New `kebab-parse-code` | `code-rust-ast-v1` | Dogfooding on kebab itself |
+| **1B** | Python + TS/JS AST ingest | (modules added to the 1A crate) | `code-python-ast-v1`, `code-ts-ast-v1`, `code-js-ast-v1` | Search over in-house ML code + web code |
+| **1C** | Go + Java + Kotlin AST ingest | Modules added to the same crate | `code-go-ast-v1`, `code-java-ast-v1`, `code-kotlin-ast-v1` | In-house backend search |
+| **1D** | C + C++ AST ingest | Modules added to the same crate | `code-c-ast-v1`, `code-cpp-ast-v1` | System code search (last) |
+| **2** | Tier 2 resource-aware: k8s manifest + Dockerfile + general manifests | Modules added to the same crate | `k8s-manifest-resource-v1`, `dockerfile-file-v1`, `manifest-file-v1` | k8s operations / DevOps search |
+| **3** | Tier 3 fallback: shell + unsupported extensions | Modules added to the same crate | `code-text-paragraph-v1` | Fallback for miscellaneous text |
+
+Bump the binary version at the end of Phase 1A (e.g. `0.6` → `0.7`): adding the Citation variant to the wire schema is additive, but it changes the dogfooding surface for RAG users. Binary bumps for later phases are decided in each phase's task spec.
+
+---
+
+## 3. Domain model impact
+
+§3 (domain model) and §2 (wire schema) of the frozen design `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` are extended as *additive minor*. No breaking changes. The affected frozen design sections are summarized in §10 (change impact / cascade).
+
+### 3.1 New Citation variant: `code`
+
+Adds `code` to the 5 variants of frozen design §2.1 (`line` / `page` / `region` / `caption` / `time`) → 6 variants in total.
+
+```json
+{
+  "schema_version": "citation.v1",
+  "kind": "code",
+  "path": "kebab/crates/kebab-chunk/src/md_heading_v1.rs",
+  "uri": "kebab/crates/kebab-chunk/src/md_heading_v1.rs#L142-L168",
+
+  "code": {
+    "line_start": 142,
+    "line_end": 168,
+    "symbol": "MdHeadingV1Chunker::chunk_doc",
+    "lang": "rust"
+  }
+}
+```
+
+`code.symbol` is nullable: filled for Tier 1 AST chunks, empty (`null`) for Tier 2/3.
+`code.lang` is the same identifier (lowercase) as the `--code-lang` filter. May be null.
+
+As with the existing 5 variants, `path` + `uri` are always filled. `uri` stays `path#L-L` (W3C Media Fragments).
+
+### 3.2 New optional SearchHit fields
+
+Two fields are added to the SearchHit of frozen design §2.2, both optional / nullable, additive minor:
+
+```json
+{
+  "schema_version": "search_hit.v1",
+  "rank": 1,
+  "score": 0.78,
+  "score_kind": "rrf",
+  "chunk_id": "...",
+  "doc_id": "...",
+  "doc_path": "kebab/crates/kebab-chunk/src/md_heading_v1.rs",
+  "heading_path": ["src", "md_heading_v1"],
+  "section_label": "MdHeadingV1Chunker::chunk_doc",
+  "snippet": "...",
+  "citation": { "kind": "code", "...": "citation.v1" },
+
+  "repo": "kebab",       // ← new optional. Result of the .git/ walk-up.
+  "code_lang": "rust",   // ← new optional. Filled for Tier 1/2/3 alike (including Tier 2 yaml etc.).
+
+  "retrieval": { "...": "..." },
+  "index_version": "v1.0",
+  "embedding_model": "multilingual-e5-large",
+  "chunker_version": "code-rust-ast-v1"
+}
+```
+
+Existing consumers (Claude Code skill etc.) ignore the two fields when they do not recognize them, so this is backwards-compatible.
+
+Markdown / PDF / image hits have both fields null. Even for a code hit, `repo` may be null for a *single-file ingest outside any repo* (`kebab ingest-file`).
+
+### 3.3 chunker_version naming (per-language)
+
+Extends the `chunker_version` label family of frozen design §3.2. **Independent per language**: a bug fix in one language chunker does not invalidate chunks of other languages.
+
+```text
+Existing:
+  md-heading-v1
+  pdf-page-v1
+
+Added in Phase 1A:
+  code-rust-ast-v1
+
+Added in Phase 1B:
+  code-python-ast-v1
+  code-ts-ast-v1
+  code-js-ast-v1
+
+Added in Phase 1C:
+  code-go-ast-v1
+  code-java-ast-v1
+  code-kotlin-ast-v1
+
+Added in Phase 1D:
+  code-c-ast-v1
+  code-cpp-ast-v1
+
+Added in Phase 2:
+  k8s-manifest-resource-v1
+  dockerfile-file-v1
+  manifest-file-v1
+
+Added in Phase 3:
+  code-text-paragraph-v1
+```
+
+cascade rule (frozen design §9):
+- Bug fix in one language chunker → bump only that `code-{lang}-ast-vN` → invalidate only the matching chunks in `embedding_records` → the next ingest reprocesses only that language's files.
+- Change to shared code (e.g. the tree-sitter wrapper) → bump every affected language chunker together.
+
+### 3.4 Symbol path format (per-language convention)
+
+The value of `Citation.code.symbol`. It follows each language's own *self-reference convention*.
+
+| Language | Format | Example |
+|------|------|------|
+| Rust | `mod::sub::fn_name`, `impl Type::method`, `Trait::method` | `chunk::md_heading_v1::MdHeadingV1Chunker::chunk_doc` |
+| Python | `pkg.module.Class.method`, `pkg.module.func` | `kebab_eval.metrics.compute_mrr` |
+| TS/JS | `module/Class.method`, `module/func`, `module/default` | `src/search/retriever/Retriever.search` |
+| Go | `package.Func`, `package.(Receiver).Method` | `chunk.(*MdHeadingV1Chunker).ChunkDoc` |
+| Java/Kotlin | `package.Class.method` | `com.kebab.chunk.MdHeadingV1Chunker.chunkDoc` |
+| C | `func_name` | `parse_blocks` |
+| C++ | `namespace::Class::method`, `namespace::func` | `kebab::chunk::MdHeadingV1Chunker::chunk_doc` |
+
+**top-level scope** (code outside any top-level fn / struct / class definition, e.g. Rust `use` / Python `import` blocks) is written as ``. It is not null: it states explicitly that the chunk is a region *without* a semantic unit.
+
+**Only a module / namespace and no symbol** (e.g. a `lib.rs` that holds only Rust mod declarations): written as ``.
+
+### 3.5 Metadata extension
+
+Fields of frozen design §3.6 (Metadata / Provenance) that are filled during code ingest:
+
+```rust
+pub struct Metadata {
+    // existing fields ...
+    pub lang: Option<String>,  // BCP-47 (natural language). Usually null for code files. No dominant-language detection for comments inside code.
+    pub tags: Vec<String>,
+    // ...
+
+    // new (code ingest)
+    pub repo: Option<String>,        // Result of the .git/ walk-up. Dir name.
+    pub git_branch: Option<String>,  // HEAD branch at ingest time.
+    pub git_commit: Option<String>,  // HEAD commit SHA at ingest time (full 40 hex).
+    pub code_lang: Option<String>,   // Matches the tree-sitter parser name. lowercase.
+}
+```
+
+`code_lang` identifier normalization (the canonical definition for this spec):
+- Rust files (`.rs`) → `rust`
+- Python (`.py`, `.pyi`) → `python`
+- TypeScript (`.ts`, `.tsx`) → `typescript`
+- JavaScript (`.js`, `.jsx`, `.mjs`, `.cjs`) → `javascript`
+- Go (`.go`) → `go`
+- Java (`.java`) → `java`
+- Kotlin (`.kt`, `.kts`) → `kotlin`
+- C (`.c`, `.h`) → `c`
+- C++ (`.cpp`, `.cc`, `.cxx`, `.hpp`, `.hh`, `.hxx`) → `cpp`
+- YAML / k8s manifest (`.yaml`, `.yml`) → `yaml`
+- Dockerfile (`Dockerfile`, `*.dockerfile`) → `dockerfile`
+- TOML (`.toml`) → `toml`
+- JSON (`.json`) → `json`
+- Shell (`.sh`, `.bash`, `.zsh`) → `shell`
+- Make (`Makefile`, `*.mk`) → `make`
+- Unsupported / Tier 3 fallback → null
+
+Extension sniffing is decided in a single function of `kebab-parse-code`, `code_lang_for_path(path: &Path) -> Option<&'static str>`. This function is the *only source of truth*.
+
+---
+
+## 4. Wire schema v1 changes (all additive minor)
+
+### 4.1 Summary of changes
+
+| schema | Change | Impact |
+|--------|------|------|
+| `citation.v1` | Add `kind = "code"` variant + add `code: { line_start, line_end, symbol, lang }` key | additive minor (dropped when an existing consumer does not recognize it) |
+| `search_hit.v1` | Add two optional fields `repo`, `code_lang` | additive minor |
+| `ingest_report.v1` | Add `skipped_generated`, `skipped_size_exceeded`, `skipped_builtin_blacklist`, `skipped_gitignore` counts + `skip_examples` | additive minor |
+| `schema.v1` | Add `code` category to `media_breakdown`, add new `code_lang_breakdown` table | additive minor |
+| `doctor.v1` | (no change) | - |
+| `answer.v1` | (No change in Phase 1A. It is only implicit that the citation object may be the code variant) | - |
+| `fetch_result.v1` | (no change; kind=chunk / doc / span as before) | - |
+
+### 4.2 JSON Schema files to modify
+
+```
+docs/wire-schema/v1/
+  citation.schema.json        ← add code variant
+  search_hit.schema.json      ← add repo / code_lang
+  ingest_report.schema.json   ← add skip counts + skip_examples
+  schema.schema.json          ← add code_lang_breakdown
+```
+
+If a schema file has `"additionalProperties": false` enabled, documents that carry the new fields fail validation until the schema declares them: list each new field under `properties` and leave `required` unchanged (so the fields stay optional).
+
+### 4.3 `--json` output compatibility check
+
+When implementing Phase 1A, add unit tests that hits / answers for the existing markdown corpus produce output that is *byte-level identical to before*:
+
+- The `repo` / `code_lang` fields of `search_hit.v1` *do not appear in the output* for markdown hits (snake-case omit-null serialization).
+- The new count fields of `ingest_report.v1` are filled with `0` when code ingest does not run (or omit-zero; decided at the task spec stage).
+- The `code` key of `citation.v1` is always absent for variants with `kind != "code"`.
+
+---
+
+## 5. Ingest pipeline changes
+
+### 5.1 Automatic repo detection
+
+```text
+fn detect_repo(path: &Path) -> Option {
+    // Walk upward from the parent directory of path until a .git/ (dir) is found.
+    // Never go above workspace.root (boundary).
+    // If .git is a file (worktree marker / submodule) → metadata.repo = None,
+    //   metadata.git_branch / commit = None.
+    // If .git/ is a dir:
+    //   - repo_name = name of the parent dir of .git/
+    //   - branch = git symbolic-ref HEAD (if absent, detached HEAD → "detached")
+    //   - commit = git rev-parse HEAD (40 hex, or None if empty repo)
+}
+```
+
+Calling the `git` binary vs using the `gix` (gitoxide) library: decided in the task spec. Calling the `git` binary introduces a PATH dependency (absent elsewhere in kebab), so `gix` is preferred.
+
+Repo detection runs only *once per file* during ingest; with a per-repo cache (in-memory HashMap), the second and later files of the same repo are lookup hits.
+
+### 5.2 Ignore integration (`.gitignore` + `.kebabignore` + built-in)
+
+**Priority** (earlier is stronger):
+1. **Built-in safety net**: always applied; the user can negate it (`!pattern` in `.kebabignore`)
+2. **`.gitignore`**: the repo's `.gitignore` is honored automatically. Nested `.gitignore` files also apply (per-directory cascade).
+3. **`.kebabignore`**: kebab's own additional layer. Allowed at workspace.root and in each directory (same as current behavior).
+
+**Built-in safety net (6 entries only)**:
+```text
+**/node_modules/
+**/target/
+**/__pycache__/
+**/.venv/
+**/venv/
+**/env/
+```
+
+`env/` is ambiguous (a user's subdirectory may happen to be named "env"), but it is included because the Python virtualenv convention is strong. Users override it with `!env/` in `.kebabignore`.
+
+**Implementation**:
+- Extend the existing `.kebabignore` handling code in `kebab-source-fs`.
+- Use the `ignore` crate (gitignore syntax) as is. Add `.gitignore` + `.kebabignore` to the same `Override` builder; the `ignore` crate handles both as standard.
+- The built-in list is hardcoded, via `WalkBuilder.add_custom_ignore_filename` or an in-code `OverrideBuilder`.
+
+### 5.3 Generated / vendored skip policy
+
+**Generated header sniff**: runs in the file scan stage of `kebab-source-fs`, *before the blake3 hash is computed* (keeps the fast path of incremental ingest):
+
+```text
+fn is_generated_file(path: &Path) -> io::Result<bool> {
+    let mut buf = [0u8; 512];
+    let n = File::open(path)?.read(&mut buf)?;
+    // Lossy decode: the 512-byte cut may land inside a multi-byte character.
+    let head = String::from_utf8_lossy(&buf[..n]);
+
+    // Per-line markers, matched case-insensitively (accommodates conventions across ecosystems).
+    let found = head.lines().take(10).any(|line| {
+        let l = line.to_ascii_lowercase();
+        l.contains("@generated") ||
+        l.contains("code generated by") ||
+        l.contains("do not edit") ||
+        l.contains("do not modify") ||
+        l.contains("automatically generated") ||
+        l.contains("auto-generated") ||
+        l.contains("autogenerated")
+    });
+    Ok(found)
+}
+```
+
+Cost: 1 read syscall per file (≤512 bytes). Files already excluded by `.gitignore` / built-in never reach this stage.
+
+**Register samples in IngestReport on skip**, for debugging (an immediate answer when a user asks "why wasn't file X indexed?"):
+```json
+{
+  "skip_examples": {
+    "generated": [
+      "kebab/crates/proto/src/api.pb.rs",
+      "..."
+    ],
+    "size_exceeded": [
+      "vendor/data/large-fixture.json"
+    ],
+    "builtin_blacklist": ["..."],
+    "gitignore": ["..."]
+  }
+}
+```
+Only the first 5 entries per category. CLI text mode shows counts only; with `--json` the schema above is emitted as is.
+
+### 5.4 Size cap
+
+```text
+[ingest.code]
+max_file_bytes = 262144   # 256 KiB
+max_file_lines = 5000     # whichever is hit first
+```
+
+- The byte cap is a single `fs::metadata().len()` call, so it is very fast.
+- The line cap is checked after the byte cap passes: count up to 5000 lines with a streaming read and skip when exceeded.
+- Both are counted in `IngestReport.skipped_size_exceeded`, with samples in `skip_examples.size_exceeded`.
+
+Rationale for the defaults:
+- 256 KiB → more than 100x a typical code file (Rust fn, Python class). Blocks the usual sizes (several MB) of minified JS / large fixtures / generated clients.
+- 5000 lines → near the limit of what one person can understand in a single file. Anything beyond is usually generated.
+
+User override:
+```toml
+[ingest.code]
+max_file_bytes = 1048576   # when you want to relax it to 1 MiB
+max_file_lines = 20000
+```
+
+### 5.5 Finer-grained IngestReport
+
+Added next to the existing `skipped_by_extension`:
+
+```json
+{
+  "schema_version": "ingest_report.v1",
+  "indexed": 1234,
+  "unchanged": 5678,
+  "updated": 12,
+  "deleted": 3,
+  "skipped_by_extension": 45,
+  "skipped_gitignore": 2104,
+  "skipped_kebabignore": 8,
+  "skipped_builtin_blacklist": 567,
+  "skipped_generated": 89,
+  "skipped_size_exceeded": 4,
+  "skip_examples": {
+    "generated": ["..."],
+    "size_exceeded": ["..."],
+    "builtin_blacklist": ["..."],
+    "gitignore": ["..."]
+  },
+  "warnings": [],
+  "duration_ms": 12345
+}
+```
+
+`skipped_by_extension` counts *unsupported extensions*. Once code ingest lands, the Tier 3 fallback (`code-text-paragraph-v1`) catches a wider range, so this share should shrink. Binaries and the like that even Tier 3 cannot handle remain.
+
+Human output (TTY) shows a brief summary:
+```text
+✓ indexed 1234 chunks (unchanged 5678, updated 12, deleted 3)
+  skipped: 2104 .gitignore, 567 built-in, 89 generated, 45 unsupported, 8 .kebabignore, 4 too-large
+  duration: 12.3s
+```
+
+---
+
+## 6. Crate structure
+
+### 6.1 New crate `kebab-parse-code`
+
+```text
+crates/kebab-parse-code/
+├── Cargo.toml
+└── src/
+    ├── lib.rs              # common entry, dispatch by extension
+    ├── lang.rs             # code_lang_for_path(), identifier normalization
+    ├── repo.rs             # detect_repo(): gix wrapper
+    ├── skip.rs             # generated header sniff, size cap
+    ├── rust.rs             # tree-sitter-rust → CanonicalDocument (Phase 1A)
+    ├── python.rs           # tree-sitter-python → ... (Phase 1B)
+    ├── typescript.rs       # ... (Phase 1B)
+    ├── javascript.rs       # ... (Phase 1B)
+    ├── go.rs               # ... (Phase 1C)
+    ├── java.rs             # ... (Phase 1C)
+    ├── kotlin.rs           # ... (Phase 1C)
+    ├── c.rs                # ... (Phase 1D)
+    ├── cpp.rs              # ... (Phase 1D)
+    ├── yaml_k8s.rs         # k8s manifest resource-aware (Phase 2)
+    ├── dockerfile.rs       # ... (Phase 2)
+    ├── manifest.rs         # Cargo.toml / package.json 1-chunk (Phase 2)
+    └── text_paragraph.rs   # Tier 3 fallback (Phase 3)
+```
+
+**Dependencies**:
+- Add a `tree-sitter-*` dep per phase. Phase 1A needs only `tree-sitter-rust` + `tree-sitter` (core).
+- `gix` (gitoxide): from Phase 1A.
+- `kebab-core`, `kebab-parse-types` (CanonicalDocument / Block / SourceSpan).
+
+**Dependency constraints** (inherited from frozen design §8):
+- `kebab-parse-code` follows the same isolation rule as the other `kebab-parse-*` crates: no direct import of store / embed / llm / rag.
+- UI crates (`kebab-cli` / `kebab-tui` / `kebab-mcp`) must not import this crate directly; only through the `kebab-app` facade.
+
+### 6.2 `kebab-chunk` module extension
+
+```text
+crates/kebab-chunk/src/
+├── lib.rs                        # exports added (accumulating per phase)
+├── md_heading_v1.rs              # existing
+├── pdf_page_v1.rs                # existing
+├── code_rust_ast_v1.rs           # Phase 1A
+├── code_python_ast_v1.rs         # Phase 1B
+├── code_ts_ast_v1.rs             # ...
+├── code_js_ast_v1.rs             # ...
+├── code_go_ast_v1.rs             # Phase 1C
+├── code_java_ast_v1.rs           # ...
+├── code_kotlin_ast_v1.rs         # ...
+├── code_c_ast_v1.rs              # Phase 1D
+├── code_cpp_ast_v1.rs            # ...
+├── k8s_manifest_resource_v1.rs   # Phase 2
+├── dockerfile_file_v1.rs         # ...
+├── manifest_file_v1.rs           # ...
+└── code_text_paragraph_v1.rs     # Phase 3
+```
+
+Each module = one chunker implementation, exposed from lib via `pub use`. Same as the existing pattern (md_heading_v1 / pdf_page_v1).
+
+**No change to the Chunker trait**: the existing `Chunker` trait (frozen §7.2) has the signature `CanonicalDocument → Vec<Chunk>`, so code works through the same trait.
+
+### 6.3 Dependency graph changes
+
+```text
+Existing:
+  kebab-app → kebab-parse-md, kebab-parse-pdf, kebab-parse-image
+            → kebab-chunk
+            → ...
+
+Added (Phase 1A):
+  kebab-app → kebab-parse-code (new)
+            → kebab-chunk (modules added)
+```
+
+Added dependencies:
+- `kebab-app → kebab-parse-code`
+- `kebab-parse-code → tree-sitter`, `tree-sitter-rust`, `gix`
+- Build impact: adding `kebab-parse-code` adds one more workspace `cargo test -p` unit. The `-j 1` policy (frozen CLAUDE.md) applies as is.
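
The `code_lang_for_path` lookup that §3.5 assigns to `lang.rs` could be sketched roughly as below. The spec fixes only the signature and the identifier table; matching extensions case-insensitively and matching `Dockerfile` / `Makefile` by exact file name are assumptions made for this sketch.

```rust
use std::path::Path;

/// Map a file path to the canonical `code_lang` identifier listed in §3.5.
/// Returns `None` for files outside the table (unsupported / Tier 3 fallback).
pub fn code_lang_for_path(path: &Path) -> Option<&'static str> {
    // Some files are identified by their full name, not by an extension.
    // Assumption: the name must match exactly (case-sensitive).
    let file_name = path.file_name()?.to_str()?;
    match file_name {
        "Dockerfile" => return Some("dockerfile"),
        "Makefile" => return Some("make"),
        _ => {}
    }

    // Assumption: extensions are compared case-insensitively.
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    let lang = match ext.as_str() {
        "rs" => "rust",
        "py" | "pyi" => "python",
        "ts" | "tsx" => "typescript",
        "js" | "jsx" | "mjs" | "cjs" => "javascript",
        "go" => "go",
        "java" => "java",
        "kt" | "kts" => "kotlin",
        "c" | "h" => "c",
        "cpp" | "cc" | "cxx" | "hpp" | "hh" | "hxx" => "cpp",
        "yaml" | "yml" => "yaml",
        "dockerfile" => "dockerfile",
        "toml" => "toml",
        "json" => "json",
        "sh" | "bash" | "zsh" => "shell",
        "mk" => "make",
        // Anything else is left to the caller (skip or Tier 3 fallback).
        _ => return None,
    };
    Some(lang)
}
```

Keeping the whole table in one `match` means a new language is a one-line change, which fits the "single source of truth" requirement in §3.5.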
+
+### 6.4 `target/` disk impact
+
+The frozen CLAUDE.md warns that "target/ balloons to 90 GB+". With this spec, growing beyond the current 22 crates and adding new modules means each *integration test* links another binary, so it swells further. **Running `cargo clean` after each phase merge is strongly recommended**: the existing CLAUDE.md rule applies as is and is restated at the end of every phase.
+
+---
+
+## 7. Search / RAG surface
+
+### 7.1 New search filters
+
+Three are added to the existing `kebab search` filters (`--tag` / `--lang` / `--path-glob` / `--media` / `--ingested-after` / `--trust-min` / `--doc-id`):
+
+```text
+--media code     # umbrella: chunks from every code Tier
+--code-lang      # repeatable / comma-separated, e.g. rust,python. OR matching.
+--repo           # repeatable. OR matching.
+```
+
+Consistent with existing policy:
+- Repeatable flags use OR matching (`--repo kebab --repo other`).
+- Aliases such as `--code-lang rs` are not supported; only the *full identifier* (`rust`), for consistency.
+- An unknown `--code-lang` value → empty hits (same policy as `--media`).
+- Different filter flags combine with AND (`--media code --code-lang rust` → code and also Rust).
+
+### 7.2 `kebab schema` stats extension
+
+Adds a `code` category to `stats.media_breakdown` from frozen design §2.5 / p9-fb-37:
+
+```json
+{
+  "schema_version": "schema.v1",
+  "stats": {
+    "media_breakdown": {
+      "markdown": 1234,
+      "pdf": 56,
+      "image": 78,
+      "audio": 0,
+      "code": 4567,              // ← new
+      "other": 12
+    },
+    "lang_breakdown": {          // existing: natural language
+      "ko": 1100,
+      "en": 234,
+      "null": 134
+    },
+    "code_lang_breakdown": {     // ← new: programming language (chunk count)
+      "rust": 2345,
+      "python": 1234,
+      "typescript": 567,
+      "yaml": 89,
+      "go": 332
+    },
+    "repo_breakdown": {          // ← new: chunk count per repo
+      "kebab": 1234,
+      "internal-api": 567,
+      "...": "..."
+    },
+    "index_bytes": 1234567890,
+    "stale_doc_count": 12
+  }
+}
+```
+
+`repo_breakdown` is added as well, so users can check "which repo has the most indexed content?".
+
+### 7.3 RAG prompt (Phase 1A stays on `rag-v2`)
+
+In Phase 1A a code chunk enters the prompt as a *regular document*:
+
+```text
+[#1] (code: kebab::chunk::md_heading_v1::MdHeadingV1Chunker::chunk_doc)
+fn chunk_doc(&self, doc: &CanonicalDocument) -> Result<Vec<Chunk>> {
+    ...
+}
+
+[#2] (code: kebab::chunk::pdf_page_v1::PdfPageV1Chunker::chunk_doc)
+...
+```
+
+The source identifier in the prompt should carry both *file path + symbol*: when a symbol exists, show the *symbol* first; otherwise show the file path.
+
+The existing `rag-v2` rule ("when citing a fact, put the original text from the chunk in double quotes before [#number]") may be somewhat awkward for code (double quotes in code are string literals). If measurement shows it is awkward, introduce `rag-v3` (code-aware) in Phase 2+.
+
+### 7.4 Impact on `kebab inspect` / `kebab fetch`
+
+The existing `kebab inspect chunk ` output shows the `symbol` / `code_lang` of the `Citation::Code` variant. Text mode output changes:
+
+```text
+chunk_id: abc123...
+doc_path: kebab/crates/kebab-chunk/src/md_heading_v1.rs
+line range: L142-L168
+symbol: MdHeadingV1Chunker::chunk_doc    ← new (code variant only)
+code_lang: rust                          ← new
+repo: kebab                              ← new
+chunker_version: code-rust-ast-v1
+```
+
+`kebab fetch chunk` / `kebab fetch span` are unchanged: they return the body bytes as is.
+
+---
+
+## 8. Config changes
+
+### 8.1 New `[ingest.code]` section
+
+```toml
+[ingest.code]
+# Enable the generated header sniff. Skip when one of the 6 markers is found in the first ~500 bytes.
+skip_generated_header = true
+
+# Max bytes per file. Skip when exceeded.
+max_file_bytes = 262144        # 256 KiB
+
+# Max lines per file. Skip when exceeded. Checked after the byte cap passes.
+max_file_lines = 5000
+
+# Additional user skip patterns. gitignore syntax. Added on top of built-in / .gitignore / .kebabignore.
+extra_skip_globs = []
+
+# Threshold for applying the fallback line-window when an AST chunk is too long.
+# When a single fn / class exceeds this many lines, the paragraph fallback applies.
+ast_chunk_max_lines = 200      # max lines for a single chunk
+
+# Line-window size for the Tier 3 fallback (paragraph + line-window).
+fallback_lines_per_chunk = 80
+fallback_lines_overlap = 20
+```
+
+Rationale for the defaults:
+- `skip_generated_header = true`: safe default. The miss case (the user also wants generated files indexed) is an explicit false.
+- `max_file_bytes = 262144`: enough to block minified JS / large generated files.
+- `max_file_lines = 5000`: near the limit of code one person can understand at once.
+- `ast_chunk_max_lines = 200`: human cognitive limit + retrieval token budget.
+- `fallback_lines_per_chunk = 80`, `overlap = 20`: conservative defaults from RAG convention.
+
+### 8.2 Defaults + override policy
+
+- All keys are optional. Missing keys take the defaults above.
+- `KEBAB_*` env overrides are not supported (these are policy settings, not dev / debug).
+- Isolated testing is possible with `--config ` (no XDG dependency).
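
As a rough illustration of how the `max_file_bytes` / `max_file_lines` defaults could drive the size-cap check from §5.4, consider the sketch below. `SizeCaps` and `exceeds_size_caps` are hypothetical names introduced here, not part of the spec, and counting lines by splitting on `\n` bytes is an assumption (it avoids failing on non-UTF-8 content).

```rust
use std::io::{self, BufRead};

/// Hypothetical holder for the two size-cap keys of `[ingest.code]`.
#[derive(Debug, Clone, Copy)]
pub struct SizeCaps {
    pub max_file_bytes: u64,
    pub max_file_lines: usize,
}

impl Default for SizeCaps {
    /// Defaults copied from the spec: 256 KiB and 5000 lines.
    fn default() -> Self {
        Self { max_file_bytes: 262_144, max_file_lines: 5_000 }
    }
}

/// Returns `Ok(true)` when the file should be skipped as too large.
/// `byte_len` is expected to come from `fs::metadata().len()`.
pub fn exceeds_size_caps<R: BufRead>(
    byte_len: u64,
    reader: R,
    caps: SizeCaps,
) -> io::Result<bool> {
    // Cheap check first: no file content is read for oversized files.
    if byte_len > caps.max_file_bytes {
        return Ok(true);
    }
    // Stream the content and stop as soon as the line cap is crossed.
    let mut lines = 0usize;
    for segment in reader.split(b'\n') {
        segment?; // propagate read errors
        lines += 1;
        if lines > caps.max_file_lines {
            return Ok(true);
        }
    }
    Ok(false)
}
```

Checking the byte length first keeps the common case at a single metadata call, and the early return bounds the streaming read at `max_file_lines + 1` lines.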
+
+### 8.3 Impact on the existing `[workspace]` section of `config.toml`
+
+No change. `workspace.root` and `exclude` stay as they are. The `.gitignore` / `.kebabignore` honor policy is active as *default behavior* without a config key; users who want to turn it off override with `!pattern` in `.kebabignore`.
+
+---
+
+## 9. Chunker details per Tier
+
+### 9.1 Tier 1: AST per-language
+
+**Input**: `CanonicalDocument` with `Block::Code { lang: Some("rust"), code: "..." }`.
+**Output**: `Vec<Chunk>`, where each chunk is a *top-level semantic unit* of the AST or a fallback unit.
+
+**Rust example (Phase 1A)**:
+
+tree-sitter semantic units:
+- `function_item` → 1 chunk
+- each `function_item` of an `impl_item` → 1 chunk per method
+- `struct_item` / `enum_item` / `trait_item` → 1 chunk (declaration + doc comment)
+- *contents* of a `mod_item` → decomposed recursively
+- top-level `use` / `extern crate` / `const` / `static` block → gathered into one chunk (`` symbol)
+
+**Fallback when ast_chunk_max_lines is exceeded**:
+- When a single fn exceeds 200 lines, split it on paragraphs (blank lines).
+- The symbol of each sub-chunk is written like `function_name [part 1/N]`.
+- This behavior lives inside `kebab-chunk/src/code_rust_ast_v1.rs`.
+
+**Line range of the citation**: `start_position.row` / `end_position.row` of the tree-sitter node (0-indexed → +1 for 1-based).
+
+**Example input → output**:
+```rust
+// src/lib.rs
+pub fn parse(input: &str) -> Result {
+    // 50 lines
+}
+
+impl Chunker for Foo {
+    fn chunk_doc(&self, doc: &Doc) -> Vec<Chunk> {
+        // 80 lines
+    }
+
+    fn name(&self) -> &str { "foo-v1" }
+}
+```
+
+→ chunks:
+1. `parse`, lines 1-50, symbol = `parse`
+2. `Foo::chunk_doc`, lines 53-132, symbol = `Foo::chunk_doc` (method of the impl)
+3. `Foo::name`, lines 134-134, symbol = `Foo::name`
+
+### 9.2 Tier 2: resource-aware
+
+**k8s-manifest-resource-v1**:
+- YAML multi-document split (`---` separator).
+- 1 chunk per document.
+- chunk metadata: `kind: Deployment`, `apiVersion`, `metadata.name`, `metadata.namespace`.
+- `symbol` field of the citation: `<kind>/<namespace>/<name>` (e.g., `Deployment/prod/api-server`). Without a namespace, `<kind>/<name>`.
+- On YAML parse failure (invalid YAML, or plain yaml that is not a k8s schema): handled by falling back to `code-text-paragraph-v1`.
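
The multi-document split and the symbol format described above might look roughly like this. `split_yaml_documents` and `k8s_symbol` are hypothetical helper names, and the separator handling is deliberately naive (a line that is exactly `---`); a real chunker would presumably rely on a YAML parser to find document boundaries and to read `kind` / `metadata`.

```rust
/// Build the citation `symbol` for one Kubernetes resource:
/// kind/namespace/name, or kind/name when there is no namespace.
pub fn k8s_symbol(kind: &str, namespace: Option<&str>, name: &str) -> String {
    match namespace {
        Some(ns) if !ns.is_empty() => format!("{}/{}/{}", kind, ns, name),
        _ => format!("{}/{}", kind, name),
    }
}

/// Split a multi-document YAML string on `---` separator lines.
/// Returns (1-based line just after the preceding separator, document text).
/// Assumption: a separator is a line that equals `---` after trimming
/// trailing whitespace; empty documents are dropped.
pub fn split_yaml_documents(text: &str) -> Vec<(usize, String)> {
    let mut docs = Vec::new();
    let mut current = String::new();
    let mut start_line = 1usize;
    for (idx, line) in text.lines().enumerate() {
        if line.trim_end() == "---" {
            if !current.trim().is_empty() {
                docs.push((start_line, std::mem::take(&mut current)));
            } else {
                current.clear();
            }
            // idx is 0-based, so the line after the separator is idx + 2.
            start_line = idx + 2;
            continue;
        }
        current.push_str(line);
        current.push('\n');
    }
    if !current.trim().is_empty() {
        docs.push((start_line, current));
    }
    docs
}
```

Carrying the start line with each document lets the chunker derive the citation line range without re-scanning the file.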
+
+**dockerfile-file-v1**:
+- Whole Dockerfile = 1 chunk.
+- symbol = ``.
+- Line range of the citation = 1 ~ EOF.
+- ARG / FROM / RUN / COPY / CMD etc. are preserved as plain text inside the chunk.
+
+**manifest-file-v1** (Cargo.toml, package.json, pyproject.toml, go.mod, pom.xml, build.gradle, tsconfig.json etc.):
+- Whole file as 1 chunk.
+- symbol = ``.
+- Line range of the citation = 1 ~ EOF.
+
+### 9.3 Tier 3: paragraph + line-window fallback
+
+**code-text-paragraph-v1**: fallback for shell scripts, unsupported extensions, and AST failures:
+- Split into paragraphs on blank lines.
+- When a paragraph exceeds `fallback_lines_per_chunk` (default 80), apply a line-window split with `fallback_lines_overlap` (default 20).
+- symbol is null. The citation is `Citation::Code { symbol: None, lang: Some("shell") }` or has no lang.
+
+---
+
+## 10. Change impact / cascade
+
+### 10.1 Frozen design doc updates (simultaneous with merging this spec)
+
+The following sections of `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` are updated:
+
+| Section | Update |
+|------|-----------|
+| §0 Summary of frozen decisions | One line, "code ingest added". Cross-link to the 2026-05-15 spec |
+| §2.1 Citation | 5 → 6 variants, add `code` |
+| §2.2 SearchHit | optional fields `repo`, `code_lang` |
+| §2.4 IngestReport | 4 skip counts + skip_examples |
+| §2 schema.v1 (added in fb-37) | `code` media + `code_lang_breakdown` + `repo_breakdown` |
+| §3.2 Versions / labels | chunker_version family extension (per-language pattern) |
+| §3.6 Metadata | fields `repo`, `git_branch`, `git_commit`, `code_lang` |
+| §8 Module boundaries | add `kebab-parse-code` + inherit dependency rules |
+| §11 Frozen scope | State that "code ingest" is no longer non-scope. Multi-workspace / watch / history aware remain non-scope |
+
+### 10.2 Impact on the cascade rule (frozen §9)
+
+- `parser_version` cascade: each phase adds a new parser version (`code-rust-parse-v1` etc.). Existing markdown / pdf are unaffected.
+- `chunker_version` cascade: per-language labels → a change to one language chunker does not invalidate other languages' chunks.
+- `embedding_version` cascade: no change (`multilingual-e5-large` as is).
+- `prompt_template_version` cascade: Phase 1A stays on `rag-v2` → unaffected.
+- `index_version` cascade: unaffected as long as the SQLite DDL does not change. The metadata.repo / git_branch / git_commit fields are added inside the *Metadata* JSON blob, so no DDL change is needed (`documents.metadata_json TEXT` in frozen §5 is free-form).
+
+### 10.3 V00X migration?
+
+No SQLite DDL change → no V00X migration needed. The new keys (`repo`, `git_branch`, `git_commit`, `code_lang`) go inside the free-form `documents.metadata_json`. The metadata_json of existing markdown / pdf chunks stays as is.
+
+### 10.4 Binary version bump
+
+Minor bump after the Phase 1A merge (`0.6` → `0.7`). Reasons:
+- 3 simultaneous additive-minor wire schema changes (citation / search_hit / ingest_report).
+- New CLI flags (`--media code`, `--code-lang`, `--repo`).
+- Dogfooding surface change (users start code ingest).
+
+Bumps for later phases (1B/1C/1D/2/3) are decided in each phase's task spec; without a new wire field / flag, a patch bump (`0.7.1` → `0.7.2`).
+
+---
+
+## 11. Open questions (to be settled at the Phase 1A task spec stage)
+
+This spec freezes only up to the *framework*. The following items are decided when the Phase 1A task spec is written:
+
+1. **Minimum size of an AST chunk**: is a 5-line fn also one chunk? Or is anything below a minimum threshold (e.g. ≥ 10 lines) merged with adjacent fns? *Impact*: chunk count explosion vs retrieval misses.
+
+2. **doc_id collision risk**: the `Cargo.toml` content of two repos happens to be identical → identical blake3 hash → same doc? The ID recipe in frozen §4.2 needs checking. *Impact*: one doc shown as sourced from two repos. Resolution: check whether the doc_id recipe includes repo / path.
+
+3. **`--code-lang` identifier normalization (canonical)**: full names only (`rust` / `python` / `typescript`) vs also allowing short aliases (`rs` / `py` / `ts`)? This spec allows full names only; the task spec states any alias mapping.
+
+4. **Timing of TUI surface changes**: include in Phase 1A vs a separate Phase 4 (TUI code rendering)? *Impact*: rendering the symbol/lang/repo of a code citation in the TUI Library/Inspect panels. For now Phase 1A includes only the *minimal change* (citation display); separate interactions (e.g. LSP-style navigation with the `g` key) are P+.
+
+5. ***Depth limit* of the AST chunk symbol path**: with deeply nested Rust impl / mod, paths like `outer::inner::deepest::method` get long. A 60 char cap + middle elision (`outer::…::method`)? The cap policy is decided in the Phase 1A task spec.
+
+6. **Binary size impact of `gix`**: how much does introducing the `kebab-parse-code` → `gix` dep affect release binary size? `git2` (libgit2) is ruled out because it is a C dep; `gix` is pure Rust. Decide after measuring the binary size impact.
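+
+7. **Recognition range for k8s manifest `kind`**: how to handle *CRDs* (custom resources) beyond the standard `Deployment` / `Service` / `ConfigMap`? Decided in the Phase 2 task spec. For now, use *the `kind` field of every yaml document as is* (CRDs are handled automatically).
+
+---
+
+## 12. Next steps
+
+1. **User review of this spec**: check for missing decisions / contradictions / additional concerns.
+2. Once review passes, create the `tasks/p10/` directory + add `tasks/p10/INDEX.md` + add a phase 10 entry to `tasks/INDEX.md`.
+3. **Write the Phase 1A task spec**: `tasks/p10/p10-1a-code-rust-ast-ingest.md`. The task spec cites `[§2.1, §2.2, §2.4, §2.5, §3.2, §3.6, §8, §11]` as its `contract_sections` (which sections of the frozen design it implements).
+4. Bundle the frozen design doc (2026-04-27) update into the *Phase 1A PR* (per the §10.1 table of this spec).
+5. Write the Phase 1A implementation plan (work units) with the writing-plans skill.
+6. After the Phase 1A merge, dogfood on kebab itself → measure → decide whether to proceed to the next phase.
+
+---
+
+## Appendix A: Decision log (summary of the brainstorming with the user while writing this spec)
+
+A short record of the *why* behind the decisions in this spec (for audit / reference when revisiting later):
+
+- **Scenario**: "dozens of repos under one parent dir + semantic search + RAG"; extends kebab's cross-corpus value to code.
+- **Chunking strategy**: the user explicitly chose path B (AST per-language). The author recommended path C (start with A, measure, then upgrade), but the user's decision was respected.
+- **Language scope**: the user's initial answer (Rust/Python/TS-JS/Go-Java-Kotlin/C/C++/Shell/Dockerfile/yaml) was reclassified into Tier 1/2/3 → AST is applied only where AST is meaningful. A result of the author's push back.
+- **Embedding model**: keep e5-large. Cross-corpus value + avoiding cascade cost.
+- **Citation variant**: the user chose `(a)-2` (new `code` variant). The author recommended `(a)-1` (reuse the line variant), but clear semantic separation was the deciding factor.
+- **Built-in blacklist**: the user requested a *reduction* → final 6 entries. `.gitignore` is the source of truth; built-in is only a safety net.
+- **Phase split**: the user stated "spec as much detail as possible → implement starting from Phase 1A". This spec freezes that framework; per-phase implementation goes in separate task specs.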
-- 
2.49.1


From c6d61b0b37ab28bd854b30fc3bddaf1b83e87616 Mon Sep 17 00:00:00 2001
From: th-kim0823
Date: Fri, 15 May 2026 14:20:10 +0900
Subject: [PATCH 02/21] spec(p10): split Phase 1A into 1A-1 (framework) and 1A-2 (Rust chunker)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The *framework surface* that 1A carries (Citation `code` variant,
SearchHit repo/code_lang, --media code / --code-lang / --repo filters,
skip policy, finer-grained IngestReport, config section,
kebab-parse-code crate skeleton) can be verified independently of the
*language chunker itself*. After 1A-1 merges, a regression test checks
that the wire output for the existing markdown corpus is byte-level
identical; 1A-2 then focuses on the Rust AST chunker itself. The binary
version bump trigger also moves to 1A-2 (1A-1 is wire additive minor
with no user-facing surface change).

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../2026-05-15-kebab-code-ingest-design.md    | 37 +++++++++++--------
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md b/docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md
index 49b51ac..f074c9c 100644
--- a/docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md
+++ b/docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md
@@ -66,18 +66,24 @@
 
 ## 2. Phase split + milestones
 
-Each phase = a separate task spec (`tasks/p10/p10-1A-code-rust-ast-ingest.md`, etc.) + a separate PR. Phase 1A is the heaviest phase: it carries the *entire framework* (new crate, new Citation variant, repo metadata, new filters, the full ignore policy). The rest only *add languages / chunkers*.
+Each phase = a separate task spec (`tasks/p10/p10-1a-1-code-ingest-framework.md`, etc.) + a separate PR. **Phase 1A-1** is the heaviest phase: it carries the *entire framework* (new crate skeleton, new Citation variant, repo metadata, new filters, the full ignore policy, skip policy, finer-grained IngestReport). From 1A-2 on, phases only *add languages / chunkers*.
 
-| Phase | Content | New crate | New chunker_version | Milestone |
-|-------|------|----------|--------------------|----------|
-| **1A** | Rust AST ingest + entire framework | new `kebab-parse-code` | `code-rust-ast-v1` | dogfooding on kebab itself |
-| **1B** | Python + TS/JS AST ingest | (modules added to the 1A crate) | `code-python-ast-v1`, `code-ts-ast-v1`, `code-js-ast-v1` | in-house ML code + web code search |
+| Phase | Content | New crate / module | New chunker_version | Milestone |
+|-------|------|----------------|--------------------|----------|
+| **1A-1** | Entire framework: Citation `code` variant, SearchHit `repo`/`code_lang`, new filters (`--media code` / `--code-lang` / `--repo`), ignore integration policy, skip policy (built-in/generated/size), finer-grained IngestReport, config `[ingest.code]` section. `kebab-parse-code` crate **skeleton** (lang/repo/skip modules only, no language parser) | new `kebab-parse-code` (infrastructure only, no language parser modules) | *none* (0 chunkers added) | wire schema additive minor commit. Verify no impact on the existing markdown corpus (regression test). Code ingest not yet active |
+| **1A-2** | The Rust AST chunker itself + tree-sitter-rust. Rust file ingest activated | `rust.rs` parser module in the same crate + `kebab-chunk/code_rust_ast_v1.rs` | `code-rust-ast-v1` | dogfooding on kebab itself becomes possible |
+| **1B** | Python + TS/JS AST ingest | `python.rs` / `typescript.rs` / `javascript.rs` modules in the same crate + chunkers | `code-python-ast-v1`, `code-ts-ast-v1`, `code-js-ast-v1` | in-house ML code + web code search |
 | **1C** | Go + Java + Kotlin AST ingest | modules added to the same crate | `code-go-ast-v1`, `code-java-ast-v1`, `code-kotlin-ast-v1` | in-house backend search |
 | **1D** | C + C++ AST ingest | modules added to the same crate | `code-c-ast-v1`, `code-cpp-ast-v1` | system code search (last) |
 | **2** | Tier 2 resource-aware: k8s manifest + Dockerfile + generic manifests | modules added to the same crate | `k8s-manifest-resource-v1`, `dockerfile-file-v1`, `manifest-file-v1` | k8s operations / DevOps search |
 | **3** | Tier 3 fallback: shell + unsupported extensions | modules added to the same crate | `code-text-paragraph-v1` | misc-text fallback |
 
-Binary version bump at the end of Phase 1A (e.g. `0.6` → `0.7`): adding a Citation variant to the wire schema is additive, but it changes the dogfood surface for RAG users. Binary bumps for later phases are decided in each phase's task spec.
+**Why Phase 1A is split into 1A-1 / 1A-2**: the *framework surface* that 1A carries (Citation variant, SearchHit fields, three filters, skip policy, config section, finer-grained IngestReport, new crate) can be verified independently of the *language chunker itself*. After 1A-1 merges, a regression test verifies that the existing markdown corpus produces *byte-level identical* output, confirming separately that the wire schema change is safe while code ingest is not yet active. 1A-2 focuses only on the Rust chunker itself; dogfooding becomes possible at the 1A-2 merge.
+
+**Binary version bump triggers, summarized**:
+- **1A-1 merge**: no bump. Additive minor wire change (per CLAUDE.md: "additive minor wire changes are backward-compat and do not count as this trigger"). Code ingest is inactive, so the user dogfood surface does not change.
+- **1A-2 merge**: minor bump (e.g. `0.6` → `0.7`). User dogfooding becomes possible = bump trigger.
+- Whether later phases (1B/1C/1D/2/3) bump is decided in each phase's task spec; with no wire / flag addition it is a patch bump.
 
 ---
 
@@ -759,12 +765,10 @@ SQLite DDL unchanged → no V00X migration needed. `documents.metadata_json`
 
 ### 10.4 Binary version bump
 
-Minor bump (`0.6` → `0.7`) after Phase 1A merges. Reasons:
-- Three additive-minor wire schema changes land together (citation / search_hit / ingest_report).
-- New CLI flags (`--media code`, `--code-lang`, `--repo`).
-- The dogfood surface changes (users start ingesting code).
-
-Whether later phases (1B/1C/1D/2/3) bump is decided in each phase's task spec; with no new wire field / flag it is a patch bump (`0.7.1` → `0.7.2`).
+See ["Binary version bump triggers, summarized" below the §2 phase table](#2-phase-split--milestones). In short:
+- **1A-1 merge** → no bump (wire additive minor + no user surface change).
+- **1A-2 merge** → minor bump (`0.6` → `0.7`, user dogfooding starts).
+- Later phases decide in each task spec (patch bump if no wire / flag addition).
 
 ---
 
@@ -792,10 +796,11 @@ Minor bump (`0.6` → `0.7`) after Phase 1A merges. Reasons:
 
 1. **User review of this spec**: check for missing decisions / contradictions / additional concerns.
 2. Once review passes, create the `tasks/p10/` directory, add `tasks/p10/INDEX.md`, and add a phase 10 entry to `tasks/INDEX.md`.
-3. **Write the Phase 1A task spec**: `tasks/p10/p10-1a-code-rust-ast-ingest.md`. It cites `[§2.1, §2.2, §2.4, §2.5, §3.2, §3.6, §8, §11]` as `contract_sections` (which sections of the frozen design it implements).
-4. Bundle the frozen design doc (2026-04-27) update into the *Phase 1A PR* (exactly as in the §10.1 table of this spec).
-5. Write the Phase 1A implementation plan (units of work) with the writing-plans skill.
-6. After Phase 1A merges, dogfood kebab on itself → measure → decide whether to proceed to the next phase.
+3. **Write the Phase 1A-1 task spec** (first): `tasks/p10/p10-1a-1-code-ingest-framework.md`. `contract_sections`: `[§2.1, §2.2, §2.4, §2 schema.v1, §3.6, §8, §11]` (no chunker is added; the §3.2 chunker_version update goes with 1A-2).
+4. **Write the Phase 1A-2 task spec** (after 1A-1 merges): `tasks/p10/p10-1a-2-rust-ast-chunker.md`. `contract_sections`: `[§2.1 (actual use of the code variant), §3.2 (adds code-rust-ast-v1), §3.4 (Rust symbol path)]`.
+5. Bundle the frozen design doc (2026-04-27) update into the *Phase 1A-1 PR* (exactly as in the §10.1 table of this spec, except that the §3.2 chunker_version part lands in 1A-2).
+6. Write the Phase 1A-1 implementation plan (units of work) with the writing-plans skill.
+7. After Phase 1A-1 merges, confirm the regression test passes → write the Phase 1A-2 implementation plan → merge → dogfood kebab on itself → measure → decide whether to proceed to the next phase.
 
 ---
 
-- 
2.49.1


From 005a9011ea60920a58aad571e25fe7f4b41ced27 Mon Sep 17 00:00:00 2001
From: th-kim0823
Date: Fri, 15 May 2026 14:31:22 +0900
Subject: [PATCH 03/21] plan(p10-1a-1): code ingest framework implementation plan + spec wire-shape fix
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

21-task plan: kebab-core domain types (Citation::Code variant, SearchHit
repo/code_lang, IngestReport skip counters, Metadata extension), new
kebab-parse-code crate (lang/repo/skip modules, gix dep), kebab-source-fs
gitignore+blacklist integration, kebab-config [ingest.code] section,
kebab-cli --repo/--code-lang flags, wire schema JSON update, frozen design
doc update, README/HANDOFF/SMOKE update, task index. Each task follows a
5-step TDD cycle (test fail → impl → pass → commit). There is no code
chunker in 1A-1; it is added in 1A-2.

The spec's Citation::Code example did not match the flat wire shape of
the existing 5 variants (top-level fields, not a nested `code: {...}`),
so it is fixed in the same commit.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 ...26-05-15-p10-1a-1-code-ingest-framework.md | 2265 +++++++++++++++++
 .../2026-05-15-kebab-code-ingest-design.md    |   17 +-
 2 files changed, 2273 insertions(+), 9 deletions(-)
 create mode 100644 docs/superpowers/plans/2026-05-15-p10-1a-1-code-ingest-framework.md

diff --git a/docs/superpowers/plans/2026-05-15-p10-1a-1-code-ingest-framework.md b/docs/superpowers/plans/2026-05-15-p10-1a-1-code-ingest-framework.md
new file mode 100644
index 0000000..1171138
--- /dev/null
+++ b/docs/superpowers/plans/2026-05-15-p10-1a-1-code-ingest-framework.md
@@ -0,0 +1,2265 @@
+# p10-1A-1 Code Ingest Framework Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Land the framework surface for code ingest: wire schema (`Citation::Code` variant, `SearchHit.repo` / `code_lang` fields, `IngestReport` skip counters), new CLI filter flags (`--media code` / `--code-lang` / `--repo`), `.gitignore` honor + built-in safety-net blacklist + generated-header sniff + size cap, `kebab-parse-code` crate skeleton (no per-language parsers), and the `[ingest.code]` config section, all **without enabling any code chunker yet**. 1A-2 plugs in the Rust AST chunker on top of this framework.
+
+**Architecture:** All changes are additive minor at the wire layer (no breaking change). Domain types in `kebab-core` get new variants / optional fields. The new `kebab-parse-code` crate ships with infrastructure modules (`lang.rs`, `repo.rs`, `skip.rs`) but no per-language parser modules; those land in 1A-2. The walker (`kebab-source-fs`) integrates `.gitignore` honor + built-in blacklist + generated header sniff + size cap, surfacing new skip counters in `IngestReport`. CLI filter flags wire through `SearchFilters` to the existing retriever stack. After 1A-1 merges, ingesting the existing markdown corpus produces byte-level identical wire output (verified by regression test).
+
+**Tech Stack:** Rust 2024, serde, anyhow, `ignore` crate (already present), `gix` (new dep, for repo detect), JSON Schema 2020-12.
+
+**Spec:** `docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md`
+
+---
+
+## File map
+
+**Create:**
+- `crates/kebab-parse-code/Cargo.toml`: new crate manifest.
+- `crates/kebab-parse-code/src/lib.rs`: public surface (re-export `lang` / `repo` / `skip` items).
+- `crates/kebab-parse-code/src/lang.rs`: `code_lang_for_path()` extension dispatcher.
+- `crates/kebab-parse-code/src/repo.rs`: `detect_repo()` via `gix`.
+- `crates/kebab-parse-code/src/skip.rs`: `is_generated_file()` + `is_oversized()` helpers + built-in blacklist patterns.
+- `crates/kebab-parse-code/tests/lang.rs`: `code_lang_for_path` test fixture.
+- `crates/kebab-parse-code/tests/repo.rs`: `detect_repo` test fixture (uses `gix::init` for temp repo).
+- `crates/kebab-parse-code/tests/skip.rs`: `is_generated_file` + `is_oversized` test fixture.
+- `tasks/p10/INDEX.md`: phase 10 task index.
+- `tasks/p10/p10-1a-1-code-ingest-framework.md`: task spec for this PR.
+
+**Modify:**
+- `Cargo.toml` (workspace root): register `crates/kebab-parse-code` in `members`, register `gix` in workspace dependencies.
+- `crates/kebab-core/src/citation.rs`: add `Citation::Code { path, line_start, line_end, symbol, lang }` variant + `to_uri()` arm + `path()` arm.
+- `crates/kebab-core/src/search.rs`: add `SearchHit.repo: Option<String>` + `SearchHit.code_lang: Option<String>` (both `#[serde(default, skip_serializing_if = "Option::is_none")]`) and extend `SearchFilters` with `repo: Vec<String>` + `code_lang: Vec<String>`.
+- `crates/kebab-core/src/ingest.rs`: add `IngestReport.skipped_gitignore: u32` + `skipped_kebabignore: u32` + `skipped_builtin_blacklist: u32` + `skipped_generated: u32` + `skipped_size_exceeded: u32` + `skip_examples: SkipExamples` (new struct), and a `MediaKind::Code` arm hint (`metadata.code_lang` placeholder is on `Metadata`, not `IngestItem`, so no IngestItem field change needed).
+- `crates/kebab-core/src/metadata.rs`: add `Metadata.repo: Option<String>` + `Metadata.git_branch: Option<String>` + `Metadata.git_commit: Option<String>` + `Metadata.code_lang: Option<String>`.
+- `crates/kebab-core/src/lib.rs`: re-export new structs.
+- `crates/kebab-source-fs/src/walker.rs`: extend `build_overrides()` to also walk the repo-local `.gitignore` cascade and append built-in safety-net patterns (5 entries).
+- `crates/kebab-source-fs/src/lib.rs`: surface new skip counters via the connector return.
+- `crates/kebab-source-fs/src/connector.rs`: wire skip counters into the per-file decision (call `kebab_parse_code::skip` helpers when relevant).
+- `crates/kebab-source-fs/Cargo.toml`: add `kebab-parse-code` dep (for `skip` + `repo` helpers).
+- `crates/kebab-app/src/lib.rs`: register no new modules (1A-1 is infra only); thread new skip counters through the ingest reporter.
+- `crates/kebab-app/src/schema.rs`: extend `SchemaStats` with `code_lang_breakdown: BTreeMap` and `repo_breakdown: BTreeMap` (default-empty until 1A-2 produces code chunks).
+- `crates/kebab-config/src/lib.rs`: add `IngestCodeCfg` struct and embed it in `IngestCfg` (or in `Config` directly if `IngestCfg` doesn't exist yet; verify path).
+- `crates/kebab-cli/src/main.rs`: add `--repo` (`Vec<String>`) + `--code-lang` (`Vec<String>`) to `Cmd::Search`. `--media code` is automatically accepted since `--media` is already a free-form Vec.
+- `crates/kebab-cli/src/wire.rs`: propagate `repo` / `code_lang` fields into `wire_search_hit` output.
+- `docs/wire-schema/v1/citation.schema.json`: add `code` to the `kind` enum + add `"code": { "type": "object" }` to top-level properties.
+- `docs/wire-schema/v1/search_hit.schema.json`: add `repo` and `code_lang` to top-level properties (optional).
+- `docs/wire-schema/v1/ingest_report.schema.json`: add five new skip counters + `skip_examples` to top-level properties.
+- `docs/wire-schema/v1/schema.schema.json`: add `code_lang_breakdown` and `repo_breakdown` under `stats`.
+- `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md`: apply §10.1 of the code ingest spec (Citation 6 variants, SearchHit fields, etc.).
+- `README.md`: add `--media code` / `--code-lang` / `--repo` filter rows; mention the `[ingest.code]` config block; note `.gitignore` honor.
+- `HANDOFF.md`: add Phase 10 row (in-progress).
+- `docs/SMOKE.md`: update example config to include the `[ingest.code]` block with defaults.
+- `tasks/INDEX.md`: add phase 10 entry.
+
+**Test (regression):**
+- `crates/kebab-cli/tests/wire_search_hit_no_code_fields.rs`: confirms markdown corpus hits omit `repo` / `code_lang` from JSON output (Option::None → absent).
+- `crates/kebab-cli/tests/wire_citation_5_variants_unchanged.rs`: confirms the existing 5 Citation variants serialize byte-identical (no spurious `code` key).
+- `crates/kebab-app/tests/ingest_report_skip_counters_zero.rs`: confirms a markdown-only corpus reports `skipped_generated = 0` etc.
+
+---
+
+## Task 1: `Citation::Code` variant in `kebab-core`
+
+**Files:**
+- Modify: `crates/kebab-core/src/citation.rs`
+- Modify: `crates/kebab-core/src/lib.rs` (re-export not needed, already `pub use`)
+
+- [ ] **Step 1: Append failing test to the `mod tests` block of `crates/kebab-core/src/citation.rs`**
+
+```rust
+#[test]
+fn citation_code_variant_serializes_with_kind_tag() {
+    let c = Citation::Code {
+        path: WorkspacePath("crates/kebab-chunk/src/md_heading_v1.rs".into()),
+        line_start: 142,
+        line_end: 168,
+        symbol: Some("MdHeadingV1Chunker::chunk_doc".into()),
+        lang: Some("rust".into()),
+    };
+    let v = serde_json::to_value(&c).unwrap();
+    assert_eq!(v["kind"], "code");
+    assert_eq!(v["line_start"], 142);
+    assert_eq!(v["line_end"], 168);
+    assert_eq!(v["symbol"], "MdHeadingV1Chunker::chunk_doc");
+    assert_eq!(v["lang"], "rust");
+    // Existing 5 variants must NOT pick up these fields.
+    let line = Citation::Line {
+        path: WorkspacePath("notes/foo.md".into()),
+        start: 1,
+        end: 10,
+        section: None,
+    };
+    let lv = serde_json::to_value(&line).unwrap();
+    assert!(lv.get("line_start").is_none());
+    assert!(lv.get("symbol").is_none());
+}
+
+#[test]
+fn citation_code_uri_format() {
+    let c = Citation::Code {
+        path: WorkspacePath("a/b.rs".into()),
+        line_start: 10,
+        line_end: 20,
+        symbol: None,
+        lang: Some("rust".into()),
+    };
+    assert_eq!(c.to_uri(), "a/b.rs#L10-L20");
+    // Single-line uses `#L10`.
+    let single = Citation::Code {
+        path: WorkspacePath("a/b.rs".into()),
+        line_start: 5,
+        line_end: 5,
+        symbol: None,
+        lang: None,
+    };
+    assert_eq!(single.to_uri(), "a/b.rs#L5");
+}
+
+#[test]
+fn citation_code_path_accessor() {
+    let c = Citation::Code {
+        path: WorkspacePath("x.rs".into()),
+        line_start: 1,
+        line_end: 1,
+        symbol: None,
+        lang: None,
+    };
+    assert_eq!(c.path().0, "x.rs");
+}
+```
+
+- [ ] **Step 2: Run tests to verify they fail**
+
+Run: `cargo test -p kebab-core --lib citation_code -- --nocapture`
+Expected: FAIL, because the `Citation::Code` variant does not exist.
+
+- [ ] **Step 3: Add the `Code` variant to the `Citation` enum**
+
+Insert after the `Time` variant in `crates/kebab-core/src/citation.rs`:
+
+```rust
+    Code {
+        path: WorkspacePath,
+        line_start: u32,
+        line_end: u32,
+        symbol: Option<String>,
+        lang: Option<String>,
+    },
+```
+
+- [ ] **Step 4: Extend the `path()` arm**
+
+```rust
+            Citation::Line { path, .. }
+            | Citation::Page { path, .. }
+            | Citation::Region { path, .. }
+            | Citation::Caption { path, .. }
+            | Citation::Time { path, .. }
+            | Citation::Code { path, .. } => path,
+```
+
+- [ ] **Step 5: Extend the `to_uri()` arm**
+
+```rust
+            Citation::Code { path, line_start, line_end, .. } => {
+                if line_start == line_end {
+                    format!("{}#L{}", path.0, line_start)
+                } else {
+                    format!("{}#L{}-L{}", path.0, line_start, line_end)
+                }
+            }
+```
+
+- [ ] **Step 6: Run tests to verify they pass**
+
+Run: `cargo test -p kebab-core --lib citation_code -- --nocapture`
+Expected: PASS (3 new tests).
+
+- [ ] **Step 7: Run full `kebab-core` test suite to catch fall-out**
+
+Run: `cargo test -p kebab-core --lib`
+Expected: All tests pass. If a `match` elsewhere fails as non-exhaustive, add the missing arm (`path()` / `to_uri()` are already covered by Steps 4 and 5).
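Reviewer sketch (not part of the patch): the `#L` fragment rule that Step 5 adds can be checked in isolation, without `kebab-core` or serde. The free function `code_citation_uri` below is a hypothetical stand-in for the `Citation::Code` arm of `to_uri()`; only the URI shape (`path#L10-L20`, or `path#L5` for a single line) is taken from the tests above.

```rust
/// Hypothetical stand-in for the `Citation::Code` arm of `to_uri()`.
/// A single-line range collapses to `#L<n>`; otherwise `#L<start>-L<end>`.
pub fn code_citation_uri(path: &str, line_start: u32, line_end: u32) -> String {
    if line_start == line_end {
        format!("{path}#L{line_start}")
    } else {
        format!("{path}#L{line_start}-L{line_end}")
    }
}
```

The sketch assumes callers pass `line_start <= line_end`; the plan does not say how an inverted range should be handled, so none is attempted here.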
+
+- [ ] **Step 8: Commit**
+
+```bash
+git add crates/kebab-core/src/citation.rs
+git commit -m "feat(p10-1a-1): add Citation::Code variant"
+```
+
+---
+
+## Task 2: `SearchHit.repo` / `code_lang` + `SearchFilters.repo` / `code_lang`
+
+**Files:**
+- Modify: `crates/kebab-core/src/search.rs`
+
+- [ ] **Step 1: Append failing tests to `mod tests`**
+
+```rust
+#[test]
+fn search_hit_repo_and_code_lang_are_optional_and_omit_when_none() {
+    let hit = SearchHit {
+        rank: 1,
+        chunk_id: ChunkId("c1".into()),
+        doc_id: DocumentId("d1".into()),
+        doc_path: WorkspacePath("a.md".into()),
+        heading_path: vec![],
+        section_label: None,
+        snippet: "".into(),
+        citation: Citation::Line {
+            path: WorkspacePath("a.md".into()),
+            start: 1,
+            end: 2,
+            section: None,
+        },
+        retrieval: RetrievalDetail::default(),
+        index_version: IndexVersion("v1".into()),
+        embedding_model: None,
+        chunker_version: ChunkerVersion("md-heading-v1".into()),
+        indexed_at: time::OffsetDateTime::UNIX_EPOCH,
+        stale: false,
+        score_kind: ScoreKind::Rrf,
+        repo: None,
+        code_lang: None,
+    };
+    let v = serde_json::to_value(&hit).unwrap();
+    assert!(v.get("repo").is_none(), "repo should be omitted when None");
+    assert!(v.get("code_lang").is_none(), "code_lang should be omitted when None");
+}
+
+#[test]
+fn search_hit_repo_and_code_lang_present_when_some() {
+    let hit = SearchHit {
+        rank: 1,
+        chunk_id: ChunkId("c1".into()),
+        doc_id: DocumentId("d1".into()),
+        doc_path: WorkspacePath("a.rs".into()),
+        heading_path: vec![],
+        section_label: None,
+        snippet: "".into(),
+        citation: Citation::Code {
+            path: WorkspacePath("a.rs".into()),
+            line_start: 1,
+            line_end: 2,
+            symbol: None,
+            lang: Some("rust".into()),
+        },
+        retrieval: RetrievalDetail::default(),
+        index_version: IndexVersion("v1".into()),
+        embedding_model: None,
+        chunker_version: ChunkerVersion("code-rust-ast-v1".into()),
+        indexed_at: time::OffsetDateTime::UNIX_EPOCH,
+        stale: false,
+        score_kind: ScoreKind::Rrf,
+        repo: Some("kebab".into()),
+        code_lang: Some("rust".into()),
+    };
+    let v = serde_json::to_value(&hit).unwrap();
+    assert_eq!(v["repo"], "kebab");
+    assert_eq!(v["code_lang"], "rust");
+}
+
+#[test]
+fn search_filters_repo_and_code_lang_default_to_empty_vec() {
+    let f = SearchFilters::default();
+    assert!(f.repo.is_empty());
+    assert!(f.code_lang.is_empty());
+}
+```
+
+If `RetrievalDetail::default()` doesn't exist yet, derive it with `#[derive(Default)]` on the struct (it has only primitive Option / Vec fields, so Default is trivially derivable).
+
+- [ ] **Step 2: Run tests to verify they fail**
+
+Run: `cargo test -p kebab-core --lib search -- --nocapture`
+Expected: FAIL with "no field `repo` on type `SearchHit`".
+
+- [ ] **Step 3: Add the two fields to `SearchHit`**
+
+In `crates/kebab-core/src/search.rs`, in the `SearchHit` struct, append after `score_kind`:
+
+```rust
+    /// p10-1A-1: optional. Filled when the source file lives in a git repo
+    /// (`.git/` walk-up). null for markdown / pdf / image hits and for code
+    /// hits ingested via `kebab ingest-file` outside a repo boundary.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub repo: Option<String>,
+
+    /// p10-1A-1: optional. Programming language identifier (lowercase). Set for
+    /// every code/manifest/k8s chunk; null for markdown / pdf / image hits.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub code_lang: Option<String>,
+```
+
+- [ ] **Step 4: Extend `SearchFilters`**
+
+Append after `doc_id`:
+
+```rust
+    /// p10-1A-1: filter by `metadata.repo`. Empty = no filter; multi-value = OR.
+    #[serde(default)]
+    pub repo: Vec<String>,
+
+    /// p10-1A-1: filter by `metadata.code_lang`. Empty = no filter; multi-value = OR.
+    /// Identifiers are lowercase canonical names (`rust`, `python`, `typescript`, ...).
+    /// Unknown values produce empty hits (consistent with `media` policy).
+    #[serde(default)]
+    pub code_lang: Vec<String>,
+```
+
+- [ ] **Step 5: If `RetrievalDetail` doesn't derive Default, add it**
+
+```rust
+#[derive(Clone, Debug, Default, PartialEq, Serialize, Deserialize)]
+pub struct RetrievalDetail {
+    ...
+}
+```
+
+- [ ] **Step 6: Run tests**
+
+Run: `cargo test -p kebab-core --lib search -- --nocapture`
+Expected: PASS (3 new tests).
+
+- [ ] **Step 7: Build the whole workspace to find consumers that need to construct SearchHit**
+
+Run: `cargo build --workspace`
+Expected: A handful of test files and call sites need `repo: None, code_lang: None` appended. Patch each. Common sites:
+- `crates/kebab-search/src/...`: wherever `SearchHit` is constructed by the retriever
+- `crates/kebab-app/tests/...`: integration test fixtures
+
+When patching, only add the two `None` lines; do not alter other field values.
+
+- [ ] **Step 8: Run full workspace test (one crate at a time per CLAUDE.md)**
+
+Run: `cargo test -p kebab-core && cargo test -p kebab-search && cargo test -p kebab-app && cargo test -p kebab-cli`
+Expected: PASS across all four.
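Reviewer sketch (not part of the patch): the filter semantics documented in Step 4 ("Empty = no filter; multi-value = OR", unknown values yield no hits) can be written as a small predicate. `passes_filter` is a hypothetical helper, not an existing kebab function, and excluding hits whose metadata value is absent whenever a filter is set is an assumption of this sketch rather than something the plan states.

```rust
/// Hypothetical predicate for one multi-value filter such as
/// `SearchFilters.repo` or `SearchFilters.code_lang`.
pub fn passes_filter(filter: &[String], value: Option<&str>) -> bool {
    // Empty filter means "no filter": every hit passes.
    if filter.is_empty() {
        return true;
    }
    match value {
        // Multi-value filter is an OR over the listed identifiers.
        Some(v) => filter.iter().any(|f| f.as_str() == v),
        // Assumption: a hit without the field cannot match a non-empty filter.
        None => false,
    }
}
```

Under this reading an unknown identifier such as `--code-lang cobol` simply matches nothing, which is the "empty hits" behaviour the doc comment describes.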
+
+- [ ] **Step 9: Commit**
+
+```bash
+git add crates/kebab-core/src/search.rs
+# include any consumer files that needed the two None fields
+git commit -m "feat(p10-1a-1): add SearchHit.repo / code_lang + SearchFilters.repo / code_lang"
+```
+
+---
+
+## Task 3: `IngestReport` skip counters + `SkipExamples`
+
+**Files:**
+- Modify: `crates/kebab-core/src/ingest.rs`
+
+- [ ] **Step 1: Append failing test**
+
+```rust
+#[test]
+fn skip_examples_default_is_empty() {
+    let s = SkipExamples::default();
+    assert!(s.generated.is_empty());
+    assert!(s.size_exceeded.is_empty());
+    assert!(s.builtin_blacklist.is_empty());
+    assert!(s.gitignore.is_empty());
+}
+
+#[test]
+fn ingest_report_skip_counters_serialize() {
+    let r = IngestReport {
+        scope: SourceScope::Workspace,
+        scanned: 100,
+        new: 50,
+        updated: 0,
+        skipped: 0,
+        unchanged: 0,
+        errors: 0,
+        duration_ms: 1234,
+        skipped_by_extension: Default::default(),
+        skipped_gitignore: 30,
+        skipped_kebabignore: 5,
+        skipped_builtin_blacklist: 10,
+        skipped_generated: 3,
+        skipped_size_exceeded: 2,
+        skip_examples: SkipExamples {
+            generated: vec!["a/b.pb.rs".into()],
+            size_exceeded: vec![],
+            builtin_blacklist: vec!["node_modules/x.js".into()],
+            gitignore: vec![],
+        },
+        items: None,
+    };
+    let v = serde_json::to_value(&r).unwrap();
+    assert_eq!(v["skipped_gitignore"], 30);
+    assert_eq!(v["skipped_builtin_blacklist"], 10);
+    assert_eq!(v["skipped_generated"], 3);
+    assert_eq!(v["skipped_size_exceeded"], 2);
+    assert_eq!(v["skip_examples"]["generated"][0], "a/b.pb.rs");
+}
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+Run: `cargo test -p kebab-core --lib skip_examples -- --nocapture`
+Expected: FAIL.
+
+- [ ] **Step 3: Add `SkipExamples` struct**
+
+In `crates/kebab-core/src/ingest.rs`, after `IngestReport`:
+
+```rust
+/// p10-1A-1: per-category sample of skipped file paths. Each category caps at
+/// 5 entries (oldest-first). Used for debugging "why was X not indexed?"
+#[derive(Clone, Debug, Default, PartialEq, Serialize, Deserialize)]
+pub struct SkipExamples {
+    #[serde(default)]
+    pub generated: Vec<String>,
+    #[serde(default)]
+    pub size_exceeded: Vec<String>,
+    #[serde(default)]
+    pub builtin_blacklist: Vec<String>,
+    #[serde(default)]
+    pub gitignore: Vec<String>,
+}
+```
+
+- [ ] **Step 4: Add the five new counters + `skip_examples` field to `IngestReport`**
+
+After `skipped_by_extension`:
+
+```rust
+    /// p10-1A-1: files skipped because they matched a repo-local `.gitignore`.
+    #[serde(default)]
+    pub skipped_gitignore: u32,
+
+    /// p10-1A-1: files skipped because they matched a `.kebabignore` entry.
+    #[serde(default)]
+    pub skipped_kebabignore: u32,
+
+    /// p10-1A-1: files skipped because they matched the built-in safety-net
+    /// blacklist (`node_modules/`, `target/`, `__pycache__/`, `.venv/`,
+    /// `venv/`, `env/`).
+    #[serde(default)]
+    pub skipped_builtin_blacklist: u32,
+
+    /// p10-1A-1: files skipped because their first ~512 bytes contained a
+    /// generated-file marker (`@generated`, `do not edit`, …).
+    #[serde(default)]
+    pub skipped_generated: u32,
+
+    /// p10-1A-1: files skipped because they exceeded `max_file_bytes` or
+    /// `max_file_lines` in `[ingest.code]`.
+    #[serde(default)]
+    pub skipped_size_exceeded: u32,
+
+    /// p10-1A-1: sample file paths per skip category (≤ 5 each).
+    #[serde(default)]
+    pub skip_examples: SkipExamples,
+```
+
+- [ ] **Step 5: Run test**
+
+Run: `cargo test -p kebab-core --lib skip_examples -- --nocapture`
+Expected: PASS.
+
+- [ ] **Step 6: Build workspace to find consumers constructing IngestReport**
+
+Run: `cargo build --workspace`
+Expected: Patch sites that construct `IngestReport` to add the new fields (use `..Default::default()` style if a `Default` impl exists; otherwise spell out zeros). Typical consumers: `kebab-source-fs` connector, `kebab-app/src/lib.rs` ingest reporter.
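Reviewer sketch (not part of the patch): the "caps at 5 entries (oldest-first)" rule on `SkipExamples` could be enforced at the point where a skipped path is recorded. `record_skip_example` and `MAX_SKIP_EXAMPLES` are hypothetical names, and reading "oldest-first" as "keep the first five paths seen, drop later ones" is an assumption of this sketch.

```rust
/// Hypothetical cap, mirroring the "≤ 5 each" note on `skip_examples`.
pub const MAX_SKIP_EXAMPLES: usize = 5;

/// Records `path` as a sample for one skip category unless the cap is
/// already reached. Earlier (older) samples are kept; later ones are dropped.
pub fn record_skip_example(samples: &mut Vec<String>, path: &str) {
    if samples.len() < MAX_SKIP_EXAMPLES {
        samples.push(path.to_string());
    }
}
```

The matching counter (for example `skipped_generated`) would still be incremented for every skipped file; only the sample list is bounded.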
+
+- [ ] **Step 7: Run test suites**
+
+Run: `cargo test -p kebab-core && cargo test -p kebab-source-fs && cargo test -p kebab-app`
+Expected: PASS.
+
+- [ ] **Step 8: Commit**
+
+```bash
+git add crates/kebab-core/src/ingest.rs
+# include consumer files patched in step 6
+git commit -m "feat(p10-1a-1): add IngestReport skip counters + SkipExamples"
+```
+
+---
+
+## Task 4: `Metadata` extension: `repo` / `git_branch` / `git_commit` / `code_lang`
+
+**Files:**
+- Modify: `crates/kebab-core/src/metadata.rs`
+
+- [ ] **Step 1: Append failing test**
+
+```rust
+#[test]
+fn metadata_repo_fields_default_to_none_and_omit_when_serialized() {
+    let m = Metadata {
+        aliases: vec![],
+        tags: vec![],
+        created_at: time::OffsetDateTime::UNIX_EPOCH,
+        updated_at: time::OffsetDateTime::UNIX_EPOCH,
+        source_type: SourceType::Markdown,
+        trust_level: TrustLevel::Primary,
+        user_id_alias: None,
+        user: Default::default(),
+        repo: None,
+        git_branch: None,
+        git_commit: None,
+        code_lang: None,
+    };
+    let v = serde_json::to_value(&m).unwrap();
+    assert!(v.get("repo").is_none());
+    assert!(v.get("git_branch").is_none());
+    assert!(v.get("git_commit").is_none());
+    assert!(v.get("code_lang").is_none());
+}
+
+#[test]
+fn metadata_repo_fields_present_when_some() {
+    let m = Metadata {
+        aliases: vec![],
+        tags: vec![],
+        created_at: time::OffsetDateTime::UNIX_EPOCH,
+        updated_at: time::OffsetDateTime::UNIX_EPOCH,
+        source_type: SourceType::Markdown,
+        trust_level: TrustLevel::Primary,
+        user_id_alias: None,
+        user: Default::default(),
+        repo: Some("kebab".into()),
+        git_branch: Some("main".into()),
+        git_commit: Some("a".repeat(40)),
+        code_lang: Some("rust".into()),
+    };
+    let v = serde_json::to_value(&m).unwrap();
+    assert_eq!(v["repo"], "kebab");
+    assert_eq!(v["git_branch"], "main");
+    assert_eq!(v["git_commit"].as_str().unwrap().len(), 40);
+    assert_eq!(v["code_lang"], "rust");
+}
+```
+
+- [ ] **Step 2: Run to verify failure**
+
+Run: `cargo test -p kebab-core --lib metadata_repo_fields -- --nocapture`
+Expected: FAIL.
+
+- [ ] **Step 3: Add four fields to `Metadata`**
+
+After `user`:
+
+```rust
+    /// p10-1A-1: name of the source repo if the file lives inside a git
+    /// working tree (`.git/` walk-up). null otherwise.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub repo: Option<String>,
+
+    /// p10-1A-1: HEAD branch at ingest time. null when no repo or detached HEAD.
+    /// Informational only: current-state observability, not a partition key.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub git_branch: Option<String>,
+
+    /// p10-1A-1: HEAD commit (40-hex) at ingest time. null when no repo.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub git_commit: Option<String>,
+
+    /// p10-1A-1: programming language identifier (lowercase canonical). null
+    /// for markdown / pdf / image. Set by `kebab_parse_code::lang::code_lang_for_path`.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub code_lang: Option<String>,
+```
+
+- [ ] **Step 4: Run test + build workspace**
+
+Run: `cargo test -p kebab-core --lib metadata_repo_fields && cargo build --workspace`
+Expected: Test PASS. The build will reveal `Metadata` construction sites that need the four fields. Patch with `repo: None, git_branch: None, git_commit: None, code_lang: None`; this is additive, with no behavioral change.
+
+- [ ] **Step 5: Run full test suites for the crates touched**
+
+Run: `cargo test -p kebab-core && cargo test -p kebab-parse-md && cargo test -p kebab-parse-pdf && cargo test -p kebab-parse-image && cargo test -p kebab-app`
+Expected: PASS.
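Reviewer sketch (not part of the patch): `Metadata.repo` is described as the repo name found by a `.git/` walk-up, with the directory name as identifier. The plan implements this with `gix` in `repo.rs`; the std-only version below is only an illustration of the walk-up idea. `find_repo_root` and `repo_name` are hypothetical names, and branch / commit lookup is deliberately left out.

```rust
use std::path::{Path, PathBuf};

/// Walks up from `start` and returns the first ancestor that contains a
/// `.git` entry (a directory, or a file in the case of worktrees).
pub fn find_repo_root(start: &Path) -> Option<PathBuf> {
    start
        .ancestors()
        .find(|dir| dir.join(".git").exists())
        .map(Path::to_path_buf)
}

/// Repo identifier = name of the directory that holds `.git`.
pub fn repo_name(root: &Path) -> Option<String> {
    root.file_name().map(|n| n.to_string_lossy().into_owned())
}
```

Because the nearest ancestor wins, a repo nested inside another repo is reported under its own directory name; whether kebab wants that behaviour is not stated in the plan.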
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add crates/kebab-core/src/metadata.rs
+# any consumer patches
+git commit -m "feat(p10-1a-1): add Metadata.repo / git_branch / git_commit / code_lang"
+```
+
+---
+
+## Task 5: New crate `kebab-parse-code` skeleton
+
+**Files:**
+- Create: `crates/kebab-parse-code/Cargo.toml`
+- Create: `crates/kebab-parse-code/src/lib.rs`
+- Modify: `Cargo.toml` (workspace root)
+
+- [ ] **Step 1: Add `gix` to workspace dependencies**
+
+Edit `Cargo.toml` (workspace root). In `[workspace.dependencies]`, add:
+
+```toml
+gix = { version = "0.66", default-features = false, features = ["worktree-mutation", "blocking-network-client"] }
+```
+
+(Verify the latest stable version on crates.io before pinning; 0.66 is an approximation as of 2026-05.)
+
+In the `members` array under `[workspace]`, append:
+
+```toml
+"crates/kebab-parse-code",
+```
+
+- [ ] **Step 2: Write `crates/kebab-parse-code/Cargo.toml`**
+
+```toml
+[package]
+name = "kebab-parse-code"
+version.workspace = true
+edition.workspace = true
+license.workspace = true
+
+[dependencies]
+anyhow.workspace = true
+gix.workspace = true
+kebab-core.path = "../kebab-core"
+
+[dev-dependencies]
+tempfile.workspace = true
+```
+
+(Verify `tempfile` is in workspace.dependencies. If not, add it there too.)
+
+- [ ] **Step 3: Write `crates/kebab-parse-code/src/lib.rs`**
+
+```rust
+//! `kebab-parse-code`: language-aware parsing for code corpora.
+//!
+//! Phase 1A-1 ships infrastructure only:
+//!
+//! - [`lang::code_lang_for_path`]: extension → language identifier.
+//! - [`repo::detect_repo`]: `.git/` walk-up → repo / branch / commit metadata.
+//! - [`skip::is_generated_file`] / [`skip::is_oversized`]: pre-ingest skip
+//!   helpers consulted by `kebab-source-fs`.
+//! - [`skip::BUILTIN_BLACKLIST`]: 5-entry safety-net pattern list.
+//!
+//! Per-language parser modules (`rust`, `python`, `typescript`, …) land in
+//! later phases (1A-2 onwards). The crate boundary is otherwise identical to
+//! `kebab-parse-md` / `kebab-parse-pdf` per design §8: must NOT depend on
+//! store / embed / llm / rag.
+
+pub mod lang;
+pub mod repo;
+pub mod skip;
+
+pub use lang::code_lang_for_path;
+pub use repo::{RepoMeta, detect_repo};
+pub use skip::{BUILTIN_BLACKLIST, is_generated_file, is_oversized};
+```
+
+- [ ] **Step 4: Run `cargo build -p kebab-parse-code` (expected to fail until the three modules exist)**
+
+Run: `cargo build -p kebab-parse-code`
+Expected: FAIL, because `lang.rs` / `repo.rs` / `skip.rs` don't exist yet. That's fine; the next tasks add them.
+
+- [ ] **Step 5: Commit (intentionally broken until Task 6/7/8 land; keep this commit atomic with the next three or squash later)**
+
+```bash
+git add Cargo.toml crates/kebab-parse-code/Cargo.toml crates/kebab-parse-code/src/lib.rs
+git commit -m "feat(p10-1a-1): scaffold kebab-parse-code crate"
+```
+
+---
+
+## Task 6: `kebab-parse-code::lang`: extension → language identifier
+
+**Files:**
+- Create: `crates/kebab-parse-code/src/lang.rs`
+- Create: `crates/kebab-parse-code/tests/lang.rs`
+
+- [ ] **Step 1: Write the test fixture**
+
+`crates/kebab-parse-code/tests/lang.rs`:
+
+```rust
+use kebab_parse_code::code_lang_for_path;
+use std::path::Path;
+
+#[test]
+fn known_extensions_map_to_canonical_identifiers() {
+    let cases = [
+        ("foo.rs", Some("rust")),
+        ("foo.py", Some("python")),
+        ("foo.pyi", Some("python")),
+        ("foo.ts", Some("typescript")),
+        ("foo.tsx", Some("typescript")),
+        ("foo.js", Some("javascript")),
+        ("foo.mjs", Some("javascript")),
+        ("foo.cjs", Some("javascript")),
+        ("foo.jsx", Some("javascript")),
+        ("foo.go", Some("go")),
+        ("foo.java", Some("java")),
+        ("foo.kt", Some("kotlin")),
+        ("foo.kts", Some("kotlin")),
+        ("foo.c", Some("c")),
+        ("foo.h", Some("c")),
+        ("foo.cpp", Some("cpp")),
+        ("foo.cc", Some("cpp")),
+        ("foo.cxx", Some("cpp")),
+        ("foo.hpp", Some("cpp")),
+        ("foo.hh", Some("cpp")),
+        ("foo.hxx", Some("cpp")),
+        ("foo.yaml", Some("yaml")),
+        ("foo.yml", Some("yaml")),
+        ("foo.toml", Some("toml")),
+        ("foo.json", Some("json")),
+        ("foo.sh", Some("shell")),
+        ("foo.bash", Some("shell")),
+        ("foo.zsh", Some("shell")),
+        ("foo.mk", Some("make")),
+    ];
+    for (path, expected) in cases {
+        assert_eq!(
+            code_lang_for_path(Path::new(path)),
+            expected,
+            "path = {path}"
+        );
+    }
+}
+
+#[test]
+fn special_filenames_map_to_identifiers() {
+    assert_eq!(code_lang_for_path(Path::new("Dockerfile")), Some("dockerfile"));
+    assert_eq!(code_lang_for_path(Path::new("foo.dockerfile")), Some("dockerfile"));
+    assert_eq!(code_lang_for_path(Path::new("Makefile")), Some("make"));
+}
+
+#[test]
+fn unknown_extension_returns_none() {
+    assert_eq!(code_lang_for_path(Path::new("foo.docx")), None);
+    assert_eq!(code_lang_for_path(Path::new("foo")), None);
+    assert_eq!(code_lang_for_path(Path::new("foo.unknown")), None);
+}
+
+#[test]
+fn case_insensitive() {
+    assert_eq!(code_lang_for_path(Path::new("Foo.RS")), Some("rust"));
+    assert_eq!(code_lang_for_path(Path::new("FOO.YAML")), Some("yaml"));
+}
+```
+
+- [ ] **Step 2: Run test to verify it fails (module doesn't exist)**
+
+Run: `cargo test -p kebab-parse-code --test lang`
+Expected: FAIL, because `code_lang_for_path` is not in scope.
+
+- [ ] **Step 3: Write `crates/kebab-parse-code/src/lang.rs`**
+
+```rust
+//! Canonical extension → language identifier mapping (spec §3.5).
+//!
+//! Lowercase canonical identifiers, matching tree-sitter parser conventions:
+//! `rust`, `python`, `typescript`, `javascript`, `go`, `java`, `kotlin`, `c`,
+//! `cpp`, `yaml`, `toml`, `json`, `shell`, `make`, `dockerfile`.
+
+use std::path::Path;
+
+/// Returns the canonical language identifier for a given file path, or
+/// `None` if the extension / filename is not recognized.
+///
+/// Matching priority:
+/// 1. exact filename match (e.g. `Dockerfile`, `Makefile`)
+/// 2.
lowercase extension match +pub fn code_lang_for_path(path: &Path) -> Option<&'static str> { + if let Some(name) = path.file_name().and_then(|n| n.to_str()) { + match name { + "Dockerfile" => return Some("dockerfile"), + "Makefile" | "GNUmakefile" => return Some("make"), + _ => {} + } + } + let ext = path.extension()?.to_str()?.to_ascii_lowercase(); + match ext.as_str() { + "rs" => Some("rust"), + "py" | "pyi" => Some("python"), + "ts" | "tsx" => Some("typescript"), + "js" | "mjs" | "cjs" | "jsx" => Some("javascript"), + "go" => Some("go"), + "java" => Some("java"), + "kt" | "kts" => Some("kotlin"), + "c" | "h" => Some("c"), + "cpp" | "cc" | "cxx" | "hpp" | "hh" | "hxx" => Some("cpp"), + "yaml" | "yml" => Some("yaml"), + "toml" => Some("toml"), + "json" => Some("json"), + "sh" | "bash" | "zsh" => Some("shell"), + "mk" => Some("make"), + "dockerfile" => Some("dockerfile"), + _ => None, + } +} +``` + +- [ ] **Step 4: Run test** + +Run: `cargo test -p kebab-parse-code --test lang` +Expected: PASS (4 tests). + +- [ ] **Step 5: Commit** + +```bash +git add crates/kebab-parse-code/src/lang.rs crates/kebab-parse-code/tests/lang.rs +git commit -m "feat(p10-1a-1): kebab-parse-code::lang — extension dispatcher" +``` + +--- + +## Task 7: `kebab-parse-code::repo` — `.git/` walk-up via `gix` + +**Files:** +- Create: `crates/kebab-parse-code/src/repo.rs` +- Create: `crates/kebab-parse-code/tests/repo.rs` + +- [ ] **Step 1: Write the test fixture** + +`crates/kebab-parse-code/tests/repo.rs`: + +```rust +use kebab_parse_code::repo::{RepoMeta, detect_repo}; +use std::fs; +use std::path::PathBuf; +use std::process::Command; +use tempfile::TempDir; + +fn init_git_repo(root: &std::path::Path) { + // Use the `git` binary for fixture setup — the production code uses + // `gix`. We don't care which library set up the fixture, we only verify + // that the code reads it correctly. 
+ let run = |args: &[&str]| { + Command::new("git") + .args(args) + .current_dir(root) + .status() + .expect("git command failed"); + }; + run(&["init", "-q"]); + run(&["config", "user.email", "test@test"]); + run(&["config", "user.name", "test"]); + fs::write(root.join("README.md"), "hi").unwrap(); + run(&["add", "README.md"]); + run(&["commit", "-q", "-m", "init"]); +} + +#[test] +fn detect_repo_returns_none_outside_git() { + let tmp = TempDir::new().unwrap(); + let nested = tmp.path().join("a/b/c.txt"); + fs::create_dir_all(nested.parent().unwrap()).unwrap(); + fs::write(&nested, "x").unwrap(); + assert!(detect_repo(&nested).is_none()); +} + +#[test] +fn detect_repo_walks_up_to_git_dir() { + let tmp = TempDir::new().unwrap(); + let repo_root = tmp.path().join("myrepo"); + fs::create_dir_all(&repo_root).unwrap(); + init_git_repo(&repo_root); + let nested = repo_root.join("src/deep/file.rs"); + fs::create_dir_all(nested.parent().unwrap()).unwrap(); + fs::write(&nested, "x").unwrap(); + + let meta = detect_repo(&nested).expect("should detect repo"); + assert_eq!(meta.name, "myrepo"); + assert!(meta.branch.is_some()); // could be "main" or "master" depending on git defaults + assert!(meta.commit.is_some()); + assert_eq!(meta.commit.as_ref().unwrap().len(), 40); +} + +#[test] +fn detect_repo_caches_per_path_call_for_repeated_files_in_same_repo() { + // This is an observability check rather than a hard invariant — + // detect_repo() may or may not cache internally, but it MUST be cheap + // enough that calling it once per file in a repo doesn't blow up. + // We just verify two calls in the same repo return the same name. 
+ let tmp = TempDir::new().unwrap(); + let repo_root = tmp.path().join("myrepo"); + fs::create_dir_all(&repo_root).unwrap(); + init_git_repo(&repo_root); + let f1 = repo_root.join("a.rs"); + let f2 = repo_root.join("b.rs"); + fs::write(&f1, "x").unwrap(); + fs::write(&f2, "x").unwrap(); + let m1 = detect_repo(&f1).unwrap(); + let m2 = detect_repo(&f2).unwrap(); + assert_eq!(m1.name, m2.name); + assert_eq!(m1.commit, m2.commit); +} +``` + +- [ ] **Step 2: Run test to verify failure** + +Run: `cargo test -p kebab-parse-code --test repo` +Expected: FAIL — `detect_repo` not in scope. + +- [ ] **Step 3: Write `crates/kebab-parse-code/src/repo.rs`** + +```rust +//! Git repo auto-detection (spec §5.1). +//! +//! Walks up from `path` looking for a `.git/` directory. If found, reads +//! repo dir name, current branch, and HEAD commit using `gix` (pure Rust; +//! no `git` binary on PATH required). + +use std::path::Path; + +#[derive(Clone, Debug, PartialEq, Eq)] +pub struct RepoMeta { + pub name: String, + pub branch: Option<String>, + pub commit: Option<String>, +} + +/// Walk up from `path` until a `.git/` directory is found. Returns repo +/// metadata, or `None` if no repo boundary is reached before the filesystem +/// root. +/// +/// - `name`: directory name containing `.git/`. +/// - `branch`: current HEAD branch, or `"detached"` if detached HEAD, or +/// `None` if branch can't be read. +/// - `commit`: 40-hex commit SHA at HEAD, or `None` if empty repo / read +/// failure. +/// +/// `.git/` as a file (worktree marker / submodule) returns `None` for +/// `branch` and `commit` and falls back to the parent dir name for `name`. +pub fn detect_repo(path: &Path) -> Option<RepoMeta> { + let mut cur = if path.is_dir() { path } else { path.parent()? 
}; + loop { + let dotgit = cur.join(".git"); + if dotgit.is_dir() { + let name = cur.file_name()?.to_string_lossy().into_owned(); + let (branch, commit) = read_head(cur); + return Some(RepoMeta { name, branch, commit }); + } else if dotgit.is_file() { + // worktree marker / submodule — name only. + let name = cur.file_name()?.to_string_lossy().into_owned(); + return Some(RepoMeta { name, branch: None, commit: None }); + } + cur = cur.parent()?; + } +} + +fn read_head(repo_dir: &Path) -> (Option<String>, Option<String>) { + match gix::open(repo_dir) { + Ok(repo) => { + let branch = repo + .head_name() + .ok() + .flatten() + .map(|n| n.shorten().to_string()) + .or_else(|| Some("detached".to_string())); + let commit = repo + .head_id() + .ok() + .map(|id| id.to_string()); + (branch, commit) + } + Err(_) => (None, None), + } +} +``` + +- [ ] **Step 4: Run test** + +Run: `cargo test -p kebab-parse-code --test repo` +Expected: PASS (3 tests). + +If the `gix` API differs in the available crate version (the surface around `head_name` / `head_id` evolves between minor versions), adjust the call sites — the test fixture is the contract; the implementation can use whichever `gix` API achieves the same result. + +- [ ] **Step 5: Commit** + +```bash +git add crates/kebab-parse-code/src/repo.rs crates/kebab-parse-code/tests/repo.rs +git commit -m "feat(p10-1a-1): kebab-parse-code::repo — git walk-up via gix" +``` + +--- + +## Task 8: `kebab-parse-code::skip` — generated header + size cap + built-in blacklist + +**Files:** +- Create: `crates/kebab-parse-code/src/skip.rs` +- Create: `crates/kebab-parse-code/tests/skip.rs` + +- [ ] **Step 1: Write the test fixture** + +`crates/kebab-parse-code/tests/skip.rs`: + +```rust +use kebab_parse_code::skip::{BUILTIN_BLACKLIST, is_generated_file, is_oversized}; +use std::fs; +use tempfile::NamedTempFile; + +#[test] +fn generated_header_markers_trigger_skip() { + let cases = [ + "// @generated\nfn foo() {}\n", + "// Code generated by tonic-build. 
DO NOT EDIT.\nfn x() {}\n", + "/* DO NOT EDIT */\nfn x() {}\n", + "/* do not modify */\nfn x() {}\n", + "// AUTOMATICALLY GENERATED\nfn x() {}\n", + "# auto-generated\ndef x(): pass\n", + "// autogenerated\nfn x() {}\n", + ]; + for content in cases { + let f = NamedTempFile::new().unwrap(); + fs::write(f.path(), content).unwrap(); + assert!(is_generated_file(f.path()).unwrap(), "content: {content:?}"); + } +} + +#[test] +fn normal_code_is_not_flagged_generated() { + let f = NamedTempFile::new().unwrap(); + fs::write(f.path(), "fn main() {\n println!(\"hi\");\n}\n").unwrap(); + assert!(!is_generated_file(f.path()).unwrap()); +} + +#[test] +fn is_generated_returns_false_for_empty_file() { + let f = NamedTempFile::new().unwrap(); + fs::write(f.path(), "").unwrap(); + assert!(!is_generated_file(f.path()).unwrap()); +} + +#[test] +fn oversized_by_bytes_returns_true() { + let f = NamedTempFile::new().unwrap(); + let body: String = "x".repeat(300_000); + fs::write(f.path(), &body).unwrap(); + assert!(is_oversized(f.path(), 262_144, 5_000).unwrap()); +} + +#[test] +fn oversized_by_lines_returns_true() { + let f = NamedTempFile::new().unwrap(); + let body: String = "x\n".repeat(6_000); // 12_000 bytes, but 6_000 lines + fs::write(f.path(), &body).unwrap(); + assert!(is_oversized(f.path(), 262_144, 5_000).unwrap()); +} + +#[test] +fn small_file_returns_false_for_oversize() { + let f = NamedTempFile::new().unwrap(); + fs::write(f.path(), "fn foo() {}\n").unwrap(); + assert!(!is_oversized(f.path(), 262_144, 5_000).unwrap()); +} + +#[test] +fn builtin_blacklist_has_exactly_six_entries() { + // node_modules/, target/, __pycache__/, .venv/, venv/, env/ + assert_eq!(BUILTIN_BLACKLIST.len(), 6); + let expected = [ + "**/node_modules/**", + "**/target/**", + "**/__pycache__/**", + "**/.venv/**", + "**/venv/**", + "**/env/**", + ]; + for pat in expected { + assert!(BUILTIN_BLACKLIST.contains(&pat), "missing pattern: {pat}"); + } +} +``` + +- [ ] **Step 2: Run test to verify failure** 
+ +Run: `cargo test -p kebab-parse-code --test skip` +Expected: FAIL — `skip` module not in scope. + +- [ ] **Step 3: Write `crates/kebab-parse-code/src/skip.rs`** + +```rust +//! Pre-ingest skip helpers (spec §5.3 + §5.4 + §5.2 built-in). +//! +//! - [`BUILTIN_BLACKLIST`] — 6 gitignore-style patterns universal across +//! ecosystems. Source-of-truth list: see spec §5.2. +//! - [`is_generated_file`] — reads first ~512 bytes, checks for 7 +//! case-insensitive markers. False positives are *intentional* — we'd +//! rather skip a hand-written file with "DO NOT EDIT" in a comment than +//! index 50K lines of protobuf output. +//! - [`is_oversized`] — byte cap then line cap. Cascade is cheap because +//! most code files are well under the byte cap. + +use anyhow::Result; +use std::fs::File; +use std::io::{BufRead, BufReader, Read}; +use std::path::Path; + +/// 6 built-in gitignore-style patterns. These are applied *in addition to* +/// `.gitignore` + `.kebabignore`, and they have priority — user negation +/// (`!pattern` in `.kebabignore`) is the only way to override. +pub const BUILTIN_BLACKLIST: &[&str] = &[ + "**/node_modules/**", + "**/target/**", + "**/__pycache__/**", + "**/.venv/**", + "**/venv/**", + "**/env/**", +]; + +/// Read the first 512 bytes of `path` and check for any of the 7 +/// case-insensitive generated-file markers. Returns Ok(true) on match, +/// Ok(false) otherwise. IO errors propagate. +pub fn is_generated_file(path: &Path) -> Result<bool> { + let mut buf = [0u8; 512]; + let mut f = File::open(path)?; + let n = f.read(&mut buf)?; + if n == 0 { + return Ok(false); + } + // Only look at valid UTF-8 prefix; if the head is binary, we skip via + // size cap / extension policy elsewhere. 
+ // Lossy decode: a 512-byte cut can split a multi-byte character, and strict + // `from_utf8` would then reject the whole head and miss the marker. + let head = String::from_utf8_lossy(&buf[..n]); + let lower: String = head.lines().take(10).collect::<Vec<_>>().join("\n").to_ascii_lowercase(); + Ok( + lower.contains("@generated") + || lower.contains("code generated by") + || lower.contains("do not edit") + || lower.contains("do not modify") + || lower.contains("automatically generated") + || lower.contains("auto-generated") + || lower.contains("autogenerated"), + ) +} + +/// Check if `path` exceeds `max_bytes` or `max_lines`. Byte cap is checked +/// first (cheap stat call); line cap only if byte cap passes (streaming +/// read with early exit). +pub fn is_oversized(path: &Path, max_bytes: u64, max_lines: u32) -> Result<bool> { + let meta = std::fs::metadata(path)?; + if meta.len() > max_bytes { + return Ok(true); + } + let reader = BufReader::new(File::open(path)?); + let mut count: u32 = 0; + for line in reader.lines() { + let _ = line?; + count = count.saturating_add(1); + if count > max_lines { + return Ok(true); + } + } + Ok(false) +} +``` + +- [ ] **Step 4: Run test** + +Run: `cargo test -p kebab-parse-code --test skip` +Expected: PASS (7 tests). + +- [ ] **Step 5: Build the whole crate** + +Run: `cargo build -p kebab-parse-code` +Expected: PASS (no warnings preferable but not blocking). 
+ +- [ ] **Step 6: Commit** + +```bash +git add crates/kebab-parse-code/src/skip.rs crates/kebab-parse-code/tests/skip.rs +git commit -m "feat(p10-1a-1): kebab-parse-code::skip — generated / size / blacklist helpers" +``` + +--- + +## Task 9: `kebab-source-fs` — integrate `.gitignore` honor + built-in blacklist + +**Files:** +- Modify: `crates/kebab-source-fs/src/walker.rs` +- Modify: `crates/kebab-source-fs/Cargo.toml` (add `kebab-parse-code` dep) + +- [ ] **Step 1: Read the existing `build_overrides` to understand the integration point** + +Run: `sed -n '50,100p' crates/kebab-source-fs/src/walker.rs` +Note the function signature and where it merges `.kebabignore` patterns. 
We'll extend it to also merge the built-in blacklist patterns. The cleanest integration would be a separate `OverrideBuilder` pass for built-ins plus per-directory `.gitignore` discovery via the `ignore` crate, but for simplicity at v1 we keep the existing walker path. **Decision for 1A-1:** the simplest viable path is to add the 6 built-in patterns to the same `OverrideBuilder` that already holds the `.kebabignore` patterns. `.gitignore` honor is handled by letting `walkdir` skip `.git/` and merging `.gitignore` files into the existing `ignore::Override` mechanism at the `workspace.root` level only. + +Repo-root `.gitignore` reading lands in **Task 10** (a follow-up step in this plan); the nested `.gitignore` cascade is deferred. Task 9 lands only the built-in blacklist piece. + +- [ ] **Step 2: Add `kebab-parse-code` as a dep in `crates/kebab-source-fs/Cargo.toml`** + +In the `[dependencies]` table: + +```toml +kebab-parse-code.path = "../kebab-parse-code" +``` + +- [ ] **Step 3: Append failing test to `crates/kebab-source-fs/tests/`** (or to walker.rs's `mod tests`) + +```rust +#[test] +fn built_in_blacklist_excludes_node_modules() { + use tempfile::TempDir; + use std::fs; + + let tmp = TempDir::new().unwrap(); + let root = tmp.path(); + fs::create_dir_all(root.join("src")).unwrap(); + fs::create_dir_all(root.join("node_modules/foo")).unwrap(); + fs::write(root.join("src/main.rs"), "x").unwrap(); + fs::write(root.join("node_modules/foo/bar.js"), "x").unwrap(); + + let overrides = build_overrides(root, &[], &[]).unwrap(); + let m_in = overrides.matched(root.join("src/main.rs"), false); + let m_out = overrides.matched(root.join("node_modules/foo/bar.js"), false); + + assert!(!m_in.is_ignore(), "src/main.rs should NOT be ignored"); + assert!(m_out.is_ignore(), "node_modules/foo/bar.js SHOULD be ignored"); +} + +#[test] +fn built_in_blacklist_excludes_target_pycache_venv() { + use 
tempfile::TempDir; + use std::fs; + + let tmp = TempDir::new().unwrap(); + let root = tmp.path(); + for dir in ["target/x", "__pycache__/x", ".venv/x", "venv/x", "env/x"] { + fs::create_dir_all(root.join(dir)).unwrap(); + fs::write(root.join(dir).join("y.txt"), "z").unwrap(); + } + fs::create_dir_all(root.join("ok")).unwrap(); + fs::write(root.join("ok/z.txt"), "z").unwrap(); + + let overrides = build_overrides(root, &[], &[]).unwrap(); + for blacklisted in [ + "target/x/y.txt", + "__pycache__/x/y.txt", + ".venv/x/y.txt", + "venv/x/y.txt", + "env/x/y.txt", + ] { + let m = overrides.matched(root.join(blacklisted), false); + assert!(m.is_ignore(), "{blacklisted} should be ignored"); + } + let m_ok = overrides.matched(root.join("ok/z.txt"), false); + assert!(!m_ok.is_ignore(), "ok/z.txt should not be ignored"); +} +``` + +- [ ] **Step 4: Run test** + +Run: `cargo test -p kebab-source-fs --lib built_in_blacklist` +Expected: FAIL. + +- [ ] **Step 5: Extend `build_overrides` to include built-in patterns** + +Locate the function (around line 56 of `walker.rs`). Before the loop that adds `kbignore_patterns`, add: + +```rust + // p10-1A-1: built-in safety-net blacklist (spec §5.2). 6 patterns that + // are universal across ecosystems. User can negate via `.kebabignore`. + for pat in kebab_parse_code::BUILTIN_BLACKLIST { + builder + .add(pat) + .with_context(|| format!("built-in blacklist pattern: {pat}"))?; + } +``` + +Note that `ignore::overrides::OverrideBuilder` inverts gitignore semantics: a bare glob is a whitelist entry and a `!`-prefixed glob is an exclude. If the existing `.kebabignore` patterns are stored with a `!` prefix for that reason, apply each `BUILTIN_BLACKLIST` entry through the same convention (for example by adding `format!("!{pat}")`) rather than bare as shown above. + +- [ ] **Step 6: Run test** + +Run: `cargo test -p kebab-source-fs --lib built_in_blacklist` +Expected: PASS. + +- [ ] **Step 7: Run the full `kebab-source-fs` suite** + +Run: `cargo test -p kebab-source-fs` +Expected: PASS. 
+ +- [ ] **Step 8: Commit** + +```bash +git add crates/kebab-source-fs/Cargo.toml crates/kebab-source-fs/src/walker.rs +git commit -m "feat(p10-1a-1): integrate built-in blacklist into walker overrides" +``` + +--- + +## Task 10: `kebab-source-fs` — `.gitignore` honor (per-repo cascade) + +**Files:** +- Modify: `crates/kebab-source-fs/src/walker.rs` (or `connector.rs` — see below) + +Decision: use the `ignore::WalkBuilder` to walk and discover `.gitignore` cascade automatically rather than implementing it manually. The current walker uses `walkdir::WalkDir` for tighter control. The simplest path: read each repo's root `.gitignore` (one per repo boundary) at walk start, add its patterns to the `Override`. This handles the 80% case (most repos use a single repo-root `.gitignore`). Nested `.gitignore` cascade is deferred to a follow-up (open question in spec §11). + +- [ ] **Step 1: Append failing test** + +```rust +#[test] +fn gitignore_at_repo_root_excludes_matching_files() { + use tempfile::TempDir; + use std::fs; + + let tmp = TempDir::new().unwrap(); + let root = tmp.path(); + fs::create_dir_all(root.join("src")).unwrap(); + fs::write(root.join(".gitignore"), "*.log\ndist/\n").unwrap(); + fs::write(root.join("a.log"), "x").unwrap(); + fs::write(root.join("src/main.rs"), "x").unwrap(); + fs::create_dir_all(root.join("dist")).unwrap(); + fs::write(root.join("dist/bundle.js"), "x").unwrap(); + + let overrides = build_overrides_with_gitignore(root, &[], &[]).unwrap(); + assert!(overrides.matched(root.join("a.log"), false).is_ignore()); + assert!(overrides.matched(root.join("dist/bundle.js"), false).is_ignore()); + assert!(!overrides.matched(root.join("src/main.rs"), false).is_ignore()); +} +``` + +(Or rename existing `build_overrides` once it includes `.gitignore` reading. For minimal disruption, introduce `build_overrides_with_gitignore` and have the old wrapper call into it with `.gitignore` enabled by default.) 
+ +- [ ] **Step 2: Run to verify failure** + +Run: `cargo test -p kebab-source-fs --lib gitignore_at_repo_root` +Expected: FAIL. + +- [ ] **Step 3: Add a `read_gitignore` helper + extend `build_overrides`** + +```rust +/// Read `<root>/.gitignore` (single-file, root-only — nested cascade is P+). +/// Missing file → empty Vec. Comments / blanks stripped. +pub(crate) fn read_gitignore(root: &Path) -> Result<Vec<String>> { + let p = root.join(".gitignore"); + if !p.exists() { + return Ok(vec![]); + } + let s = std::fs::read_to_string(&p) + .with_context(|| format!("read .gitignore at {}", p.display()))?; + Ok(s.lines() + .map(str::trim) + .filter(|l| !l.is_empty() && !l.starts_with('#')) + .map(str::to_string) + .collect()) +} +``` + +Modify `build_overrides` to also accept `.gitignore` patterns and add them with the existing convention (excludes prefix). Place them *after* built-in blacklist (so `.kebabignore` can negate both) and *before* `.kebabignore`. + +- [ ] **Step 4: Run test + full suite** + +Run: `cargo test -p kebab-source-fs` +Expected: PASS. + +- [ ] **Step 5: Commit** + +```bash +git add crates/kebab-source-fs/src/walker.rs +git commit -m "feat(p10-1a-1): honor repo-root .gitignore in walker overrides" +``` + +--- + +## Task 11: wire IngestReport skip counters through `kebab-source-fs` connector + +**Files:** +- Modify: `crates/kebab-source-fs/src/connector.rs` +- Modify: `crates/kebab-source-fs/src/walker.rs` + +This task threads the new `IngestReport.skipped_gitignore` / `skipped_kebabignore` / `skipped_builtin_blacklist` counters through the connector. `skipped_generated` / `skipped_size_exceeded` come from `kebab-parse-code::skip` and are wired in Task 12 (the per-file decision point). 
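
The category split that Task 11 threads through the connector can be sketched in isolation. This is a minimal, self-contained illustration and not the connector's actual code: `SkipReason`, `SkipTally`, `classify`, `tally`, and the path-component and suffix checks that stand in for the real override matchers are all hypothetical. It only shows the two properties the counters rely on: the first matching category wins, and the stored example paths are capped (the cap of 5 mirrors the `push_sample` helper in this plan).

```rust
use std::path::Path;

/// Hypothetical skip categories, checked in this order.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SkipReason {
    BuiltinBlacklist,
    Gitignore,
    Kebabignore,
}

/// Hypothetical tally; the real `IngestReport` carries more fields.
#[derive(Debug, Default)]
struct SkipTally {
    skipped_builtin_blacklist: u32,
    skipped_gitignore: u32,
    skipped_kebabignore: u32,
    gitignore_examples: Vec<String>,
}

/// Upper bound on stored example paths per category (assumed value).
const MAX_SAMPLES: usize = 5;

/// Stand-in classifier. Real code would consult the override matchers;
/// here simple string checks play that role.
fn classify(path: &Path) -> Option<SkipReason> {
    let s = path.to_string_lossy();
    if s.split('/').any(|c| c == "node_modules" || c == "target") {
        Some(SkipReason::BuiltinBlacklist)
    } else if s.ends_with(".log") {
        Some(SkipReason::Gitignore)
    } else if s.ends_with(".draft.md") {
        Some(SkipReason::Kebabignore)
    } else {
        None
    }
}

/// Count each path under exactly one category and return the kept paths.
fn tally(paths: &[&str]) -> (SkipTally, Vec<String>) {
    let mut report = SkipTally::default();
    let mut kept = Vec::new();
    for &p in paths {
        match classify(Path::new(p)) {
            Some(SkipReason::BuiltinBlacklist) => report.skipped_builtin_blacklist += 1,
            Some(SkipReason::Gitignore) => {
                report.skipped_gitignore += 1;
                // The counter keeps growing; only the samples are capped.
                if report.gitignore_examples.len() < MAX_SAMPLES {
                    report.gitignore_examples.push(p.to_string());
                }
            }
            Some(SkipReason::Kebabignore) => report.skipped_kebabignore += 1,
            None => kept.push(p.to_string()),
        }
    }
    (report, kept)
}
```

A path such as `node_modules/a/b.log` matches both the built-in check and the `.log` check in this sketch, and is counted once under the built-in category because that check runs first. Whether the real connector should report such a file under the built-in or the `.gitignore` counter is a design choice this sketch does not settle.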
+ +- [ ] **Step 1: Read `connector.rs` to find the IngestReport assembly site** + +Run: `grep -n "IngestReport\|skipped" crates/kebab-source-fs/src/connector.rs | head -20` + +- [ ] **Step 2: Append failing test (where the connector is tested)** + +```rust +#[test] +fn ingest_report_counts_gitignored_files_under_skipped_gitignore() { + use tempfile::TempDir; + use std::fs; + + let tmp = TempDir::new().unwrap(); + let root = tmp.path(); + fs::write(root.join(".gitignore"), "*.log\n").unwrap(); + fs::write(root.join("ok.md"), "# ok").unwrap(); + fs::write(root.join("skipme.log"), "x").unwrap(); + + let report = run_scan(root); // your connector entry point + assert_eq!(report.skipped_gitignore, 1); + assert!(report.skip_examples.gitignore.contains(&"skipme.log".to_string())); +} +``` + +(Adapt `run_scan` to whatever the connector's actual entry is.) + +- [ ] **Step 3: Implement the per-category increment** + +In `connector.rs`, where the walker iterator is consumed, distinguish *why* a file was excluded by checking the override matchers in order: + +```rust +// pseudocode — match the existing code style +for entry in walker { + let path = entry?.path(); + if matches_builtin_blacklist(&path) { + report.skipped_builtin_blacklist += 1; + push_sample(&mut report.skip_examples.builtin_blacklist, &path); + continue; + } + if matches_gitignore(&path) { + report.skipped_gitignore += 1; + push_sample(&mut report.skip_examples.gitignore, &path); + continue; + } + if matches_kebabignore(&path) { + report.skipped_kebabignore += 1; + // (skip_examples.kebabignore intentionally not in SkipExamples per spec) + continue; + } + // ... proceed with ingest +} +``` + +Helper: + +```rust +fn push_sample(samples: &mut Vec<String>, path: &Path) { + if samples.len() < 5 { + samples.push(path.to_string_lossy().into_owned()); + } +} +``` + +- [ ] **Step 4: Run test + full suite** + +Run: `cargo test -p kebab-source-fs` +Expected: PASS. 
+ +- [ ] **Step 5: Commit** + +```bash +git add crates/kebab-source-fs/src/connector.rs crates/kebab-source-fs/src/walker.rs +git commit -m "feat(p10-1a-1): split skip counters by category in IngestReport" +``` + +--- + +## Task 12: wire generated / size cap skip checks per file + +**Files:** +- Modify: `crates/kebab-source-fs/src/connector.rs` +- Modify: `crates/kebab-config/src/lib.rs` (need `IngestCodeCfg` first — Task 14) + +Note: This task depends on Task 14's config struct. Reorder execution: do Task 14 first, then return to Task 12. + +- [ ] **Step 1: Read connector to find the per-file decision point** (after walker yield, before parse dispatch) + +- [ ] **Step 2: Append failing test** + +```rust +#[test] +fn ingest_report_counts_generated_files() { + use tempfile::TempDir; + use std::fs; + + let tmp = TempDir::new().unwrap(); + let root = tmp.path(); + fs::write(root.join("normal.md"), "# hi").unwrap(); + fs::write(root.join("autogen.rs"), "// @generated\nfn x() {}\n").unwrap(); + + let report = run_scan_with_code_cfg(root, &IngestCodeCfg { + skip_generated_header: true, + ..Default::default() + }); + assert_eq!(report.skipped_generated, 1); + assert!(report.skip_examples.generated.contains(&"autogen.rs".to_string())); +} + +#[test] +fn ingest_report_counts_oversized_files() { + use tempfile::TempDir; + use std::fs; + + let tmp = TempDir::new().unwrap(); + let root = tmp.path(); + fs::write(root.join("normal.md"), "# hi").unwrap(); + let big: String = "x\n".repeat(100_000); + fs::write(root.join("huge.rs"), &big).unwrap(); + + let report = run_scan_with_code_cfg(root, &IngestCodeCfg { + max_file_bytes: 1024, + max_file_lines: 5_000, + ..Default::default() + }); + assert_eq!(report.skipped_size_exceeded, 1); +} +``` + +- [ ] **Step 3: Add the per-file check between walker yield and parse dispatch** + +```rust +// After: file passed gitignore / kebabignore / built-in checks. +// Before: parse dispatch by media type. 
+if cfg.code.skip_generated_header + && kebab_parse_code::is_generated_file(&path).unwrap_or(false) +{ + report.skipped_generated += 1; + push_sample(&mut report.skip_examples.generated, &path); + continue; +} +if kebab_parse_code::is_oversized(&path, cfg.code.max_file_bytes, cfg.code.max_file_lines) + .unwrap_or(false) +{ + report.skipped_size_exceeded += 1; + push_sample(&mut report.skip_examples.size_exceeded, &path); + continue; +} +``` + +- [ ] **Step 4: Run tests** + +Run: `cargo test -p kebab-source-fs && cargo test -p kebab-app` +Expected: PASS. + +- [ ] **Step 5: Commit** + +```bash +git add crates/kebab-source-fs/src/connector.rs +git commit -m "feat(p10-1a-1): apply generated-header + size-cap skip per file" +``` + +--- + +## Task 13: regression test — markdown corpus output unchanged + +**Files:** +- Create: `crates/kebab-cli/tests/wire_search_hit_no_code_fields.rs` +- Create: `crates/kebab-cli/tests/wire_citation_5_variants_unchanged.rs` + +These tests prove the wire is byte-identical for the existing markdown corpus after 1A-1 lands. They are the *gate* on the framework changes. 
+ +- [ ] **Step 1: Write `wire_search_hit_no_code_fields.rs`** + +```rust +use kebab_core::{Citation, SearchHit, RetrievalDetail, ScoreKind}; +use kebab_core::{ChunkId, ChunkerVersion, DocumentId, IndexVersion, WorkspacePath}; + +#[test] +fn markdown_hit_omits_repo_and_code_lang() { + let hit = SearchHit { + rank: 1, + chunk_id: ChunkId("c1".into()), + doc_id: DocumentId("d1".into()), + doc_path: WorkspacePath("notes/foo.md".into()), + heading_path: vec!["A".into(), "B".into()], + section_label: Some("B".into()), + snippet: "hi".into(), + citation: Citation::Line { + path: WorkspacePath("notes/foo.md".into()), + start: 1, + end: 2, + section: None, + }, + retrieval: RetrievalDetail::default(), + index_version: IndexVersion("v1".into()), + embedding_model: None, + chunker_version: ChunkerVersion("md-heading-v1".into()), + indexed_at: time::OffsetDateTime::UNIX_EPOCH, + stale: false, + score_kind: ScoreKind::Rrf, + repo: None, + code_lang: None, + }; + let s = serde_json::to_string(&hit).unwrap(); + assert!(!s.contains("\"repo\""), "repo should be absent: {s}"); + assert!(!s.contains("\"code_lang\""), "code_lang should be absent: {s}"); +} +``` + +- [ ] **Step 2: Write `wire_citation_5_variants_unchanged.rs`** + +```rust +use kebab_core::{Citation, WorkspacePath}; + +#[test] +fn line_variant_serialization_unchanged() { + let c = Citation::Line { + path: WorkspacePath("a.md".into()), + start: 1, + end: 2, + section: Some("§14".into()), + }; + let v = serde_json::to_value(&c).unwrap(); + assert_eq!(v["kind"], "line"); + assert_eq!(v["start"], 1); + assert_eq!(v["end"], 2); + assert_eq!(v["section"], "§14"); + assert!(v.get("line_start").is_none()); + assert!(v.get("symbol").is_none()); + assert!(v.get("code").is_none()); +} + +#[test] +fn page_variant_serialization_unchanged() { + let c = Citation::Page { + path: WorkspacePath("a.pdf".into()), + page: 13, + section: None, + }; + let v = serde_json::to_value(&c).unwrap(); + assert_eq!(v["kind"], "page"); + 
assert_eq!(v["page"], 13); + assert!(v.get("line_start").is_none()); +} + +#[test] +fn caption_variant_serialization_unchanged() { + let c = Citation::Caption { + path: WorkspacePath("a.png".into()), + model: "qwen2.5-vl:7b".into(), + }; + let v = serde_json::to_value(&c).unwrap(); + assert_eq!(v["kind"], "caption"); + assert_eq!(v["model"], "qwen2.5-vl:7b"); +} +``` + +- [ ] **Step 3: Run regression tests** + +Run: `cargo test -p kebab-cli --test wire_search_hit_no_code_fields && cargo test -p kebab-cli --test wire_citation_5_variants_unchanged` +Expected: PASS. (If you forgot `#[serde(skip_serializing_if = "Option::is_none")]` in Task 2, the regression FAILS here — fix and commit.) + +- [ ] **Step 4: Commit** + +```bash +git add crates/kebab-cli/tests/wire_search_hit_no_code_fields.rs crates/kebab-cli/tests/wire_citation_5_variants_unchanged.rs +git commit -m "test(p10-1a-1): regression — markdown wire output unchanged" +``` + +--- + +## Task 14: `kebab-config` — `[ingest.code]` section + +**Files:** +- Modify: `crates/kebab-config/src/lib.rs` + +- [ ] **Step 1: Append failing test** + +```rust +#[test] +fn ingest_code_cfg_defaults() { + let cfg: IngestCodeCfg = toml::from_str("").unwrap(); + assert_eq!(cfg.max_file_bytes, 262_144); + assert_eq!(cfg.max_file_lines, 5_000); + assert!(cfg.skip_generated_header); + assert!(cfg.extra_skip_globs.is_empty()); + assert_eq!(cfg.ast_chunk_max_lines, 200); + assert_eq!(cfg.fallback_lines_per_chunk, 80); + assert_eq!(cfg.fallback_lines_overlap, 20); +} + +#[test] +fn ingest_code_cfg_user_override() { + let toml = r#" + max_file_bytes = 1048576 + max_file_lines = 20000 + skip_generated_header = false + extra_skip_globs = ["**/fixtures/**", "**/snapshots/**"] + "#; + let cfg: IngestCodeCfg = toml::from_str(toml).unwrap(); + assert_eq!(cfg.max_file_bytes, 1_048_576); + assert_eq!(cfg.max_file_lines, 20_000); + assert!(!cfg.skip_generated_header); + assert_eq!(cfg.extra_skip_globs.len(), 2); +} + +#[test] +fn 
config_with_ingest_code_section() {
+    let toml = r#"
+        [workspace]
+        root = "~/Notes"
+
+        [ingest.code]
+        max_file_bytes = 524288
+    "#;
+    let cfg: Config = toml::from_str(toml).unwrap();
+    assert_eq!(cfg.ingest.code.max_file_bytes, 524_288);
+}
+```
+
+- [ ] **Step 2: Run failing test**
+
+Run: `cargo test -p kebab-config --lib ingest_code -- --nocapture`
+Expected: FAIL.
+
+- [ ] **Step 3: Add `IngestCodeCfg` struct and `IngestCfg` wrapper (if absent)**
+
+In `crates/kebab-config/src/lib.rs`:
+
+```rust
+// `Default` is derived here: the only field already implements `Default`,
+// and a hand-written impl is likely to trip clippy's `derivable_impls`
+// lint under `-D warnings` (Task 21).
+#[derive(Clone, Debug, Default, PartialEq, Serialize, Deserialize)]
+#[serde(default)]
+pub struct IngestCfg {
+    pub code: IngestCodeCfg,
+}
+
+#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
+#[serde(default)]
+pub struct IngestCodeCfg {
+    /// Generated header sniff. Reads first ~512 bytes, checks 7 markers.
+    pub skip_generated_header: bool,
+    /// Max byte size per file. Bigger files skipped.
+    pub max_file_bytes: u64,
+    /// Max line count per file. Bigger files skipped (byte cap checked first).
+    pub max_file_lines: u32,
+    /// User extra skip globs (gitignore syntax). Applied on top of built-in
+    /// + `.gitignore` + `.kebabignore`.
+    pub extra_skip_globs: Vec<String>,
+    /// AST chunk size cap. Functions/classes longer than this fall back to
+    /// paragraph-based split (1A-2 and later).
+    pub ast_chunk_max_lines: u32,
+    /// Tier 3 fallback chunker: lines per chunk.
+    pub fallback_lines_per_chunk: u32,
+    /// Tier 3 fallback chunker: line overlap between adjacent chunks.
+    pub fallback_lines_overlap: u32,
+}
+
+impl Default for IngestCodeCfg {
+    fn default() -> Self {
+        Self {
+            skip_generated_header: true,
+            max_file_bytes: 262_144,
+            max_file_lines: 5_000,
+            extra_skip_globs: vec![],
+            ast_chunk_max_lines: 200,
+            fallback_lines_per_chunk: 80,
+            fallback_lines_overlap: 20,
+        }
+    }
+}
+```
+
+Then add `pub ingest: IngestCfg` to the `Config` struct with `#[serde(default)]`.
+
+- [ ] **Step 4: Run test**
+
+Run: `cargo test -p kebab-config --lib ingest_code -- --nocapture`
+Expected: PASS.
+
+- [ ] **Step 5: Build the workspace to confirm config consumers still compile**
+
+Run: `cargo build --workspace`
+Expected: PASS when consumers obtain `Config` through serde or `..Default::default()` (the new field has a `Default`). Any struct literal that lists every `Config` field must add `ingest: IngestCfg::default()`.
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add crates/kebab-config/src/lib.rs
+git commit -m "feat(p10-1a-1): add [ingest.code] config section"
+```
+
+---
+
+## Task 15: `kebab-cli` — `--repo` / `--code-lang` flags
+
+**Files:**
+- Modify: `crates/kebab-cli/src/main.rs`
+
+Note: `--media code` works automatically because `--media` is already a free-form `Vec<String>`. We just document it.
+
+- [ ] **Step 1: Append failing test (CLI integration)**
+
+`crates/kebab-cli/tests/wire_search_filters_code.rs`:
+
+```rust
+use assert_cmd::Command;
+
+#[test]
+fn cli_accepts_repo_flag_repeated() {
+    let mut cmd = Command::cargo_bin("kebab").unwrap();
+    let assert = cmd
+        .args(["search", "--repo", "foo", "--repo", "bar", "--help"])
+        .assert()
+        .success();
+    // --help short-circuits — we're just verifying the flag parses.
+    let _ = assert;
+}
+
+#[test]
+fn cli_accepts_code_lang_flag_repeated() {
+    let mut cmd = Command::cargo_bin("kebab").unwrap();
+    cmd.args(["search", "--code-lang", "rust", "--code-lang", "python", "--help"])
+        .assert()
+        .success();
+}
+
+#[test]
+fn cli_accepts_media_code_value() {
+    let mut cmd = Command::cargo_bin("kebab").unwrap();
+    cmd.args(["search", "--media", "code", "--help"])
+        .assert()
+        .success();
+}
+```
+
+- [ ] **Step 2: Run failing test**
+
+Run: `cargo test -p kebab-cli --test wire_search_filters_code`
+Expected: FAIL — `--repo` and `--code-lang` not recognized.
+
+- [ ] **Step 3: Add the flags to `Cmd::Search` in `crates/kebab-cli/src/main.rs`**
+
+After the existing `media: Vec<String>` field:
+
+```rust
+        /// p10-1A-1: filter by repo name (`metadata.repo`). Repeatable;
+        /// multi-value = OR.
+        #[arg(long = "repo", value_name = "NAME", num_args = 1)]
+        repo: Vec<String>,
+
+        /// p10-1A-1: filter by code language identifier (lowercase canonical).
+        /// Repeatable or comma-separated. Examples: rust,python,typescript.
+        /// Unknown values produce empty hits.
+        #[arg(long = "code-lang", value_name = "LANG", num_args = 1, value_delimiter = ',')]
+        code_lang: Vec<String>,
+```
+
+- [ ] **Step 4: Propagate to `SearchFilters` in the dispatch site**
+
+In the `Cmd::Search` arm where `SearchFilters` is constructed (around `media: media_norm,`):
+
+```rust
+            SearchFilters {
+                // ... existing fields ...
+                media: media_norm,
+                ingested_after: ingested_after_parsed,
+                doc_id: doc_id.as_ref().map(|s| kebab_core::DocumentId(s.clone())),
+                // Assuming the flag bindings are borrowed (`run` takes `&Cli`),
+                // clone them into the owned filter.
+                repo: repo.clone(),
+                code_lang: code_lang.clone(),
+            }
+```
+
+- [ ] **Step 5: Run test + full CLI suite**
+
+Run: `cargo test -p kebab-cli --test wire_search_filters_code && cargo test -p kebab-cli`
+Expected: PASS.
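The filter semantics from Steps 3 and 4 (an empty list means no filter, several values are OR-ed, unknown values match nothing, `--code-lang` accepts comma-separated values) can be sketched in plain Rust without clap. The helper names `split_code_langs` and `passes_filter` are hypothetical and not part of kebab; in the real CLI the splitting is done by clap's `value_delimiter`, and the trimming of whitespace and empty segments shown here is an assumption of this sketch.

```rust
/// Hypothetical stand-in for what `value_delimiter = ','` achieves: turn
/// repeated and comma-separated flag values into one flat list.
/// Trimming and dropping empty segments is an assumption of this sketch.
fn split_code_langs(raw: &[&str]) -> Vec<String> {
    raw.iter()
        .flat_map(|v| v.split(','))
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(str::to_owned)
        .collect()
}

/// Empty filter = no restriction. Otherwise the chunk's value must equal
/// one of the requested values (OR). A chunk without the field, such as a
/// markdown chunk with no `code_lang`, never passes a non-empty filter.
fn passes_filter(filter: &[String], value: Option<&str>) -> bool {
    if filter.is_empty() {
        return true;
    }
    match value {
        Some(v) => filter.iter().any(|f| f == v),
        None => false,
    }
}
```

With this shape, `--code-lang rust,python --code-lang typescript` and three separate `--code-lang` flags end up as the same three-element filter.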
+ +- [ ] **Step 6: Commit** + +```bash +git add crates/kebab-cli/src/main.rs crates/kebab-cli/tests/wire_search_filters_code.rs +git commit -m "feat(p10-1a-1): add --repo / --code-lang CLI flags" +``` + +--- + +## Task 16: `kebab-app::schema` — `code_lang_breakdown` + `repo_breakdown` stats + +**Files:** +- Modify: `crates/kebab-app/src/schema.rs` + +- [ ] **Step 1: Append failing test** + +```rust +#[test] +fn schema_stats_includes_code_lang_and_repo_breakdown() { + let stats = SchemaStats { + // existing fields with sensible defaults + ..Default::default() + }; + let v = serde_json::to_value(&stats).unwrap(); + assert!(v.get("code_lang_breakdown").is_some(), "stats must include code_lang_breakdown"); + assert!(v.get("repo_breakdown").is_some(), "stats must include repo_breakdown"); +} +``` + +- [ ] **Step 2: Run failing test** + +Run: `cargo test -p kebab-app --lib schema_stats_includes -- --nocapture` +Expected: FAIL. + +- [ ] **Step 3: Add the two BTreeMaps to `SchemaStats`** + +```rust + /// p10-1A-1: code language breakdown (chunk counts by canonical lowercase + /// language identifier). Empty until 1A-2 produces code chunks. + #[serde(default)] + pub code_lang_breakdown: std::collections::BTreeMap, + + /// p10-1A-1: repo breakdown (chunk counts by `metadata.repo` value). + /// Empty until 1A-2 produces code chunks. + #[serde(default)] + pub repo_breakdown: std::collections::BTreeMap, +``` + +Also add `code` to the `media_breakdown` if it's an enumerated set (verify the existing impl — it may be free-form `BTreeMap` already). + +- [ ] **Step 4: Run test + full app suite** + +Run: `cargo test -p kebab-app` +Expected: PASS. 
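The two breakdown maps added in Step 3 can be pictured as a simple tally over chunk metadata. The sketch below is illustrative only: `ChunkMeta` and `breakdowns` are hypothetical names, and `u64` as the count type is an assumption (the plan only says the values are chunk counts).

```rust
use std::collections::BTreeMap;

/// Hypothetical minimal view of a chunk's metadata for this sketch.
struct ChunkMeta {
    repo: Option<String>,
    code_lang: Option<String>,
}

/// Count chunks per `code_lang` and per `repo`. Chunks without the field
/// (markdown, pdf, image) are not counted, so both maps stay empty for a
/// corpus that has no code chunks yet.
fn breakdowns(chunks: &[ChunkMeta]) -> (BTreeMap<String, u64>, BTreeMap<String, u64>) {
    let mut by_lang = BTreeMap::new();
    let mut by_repo = BTreeMap::new();
    for c in chunks {
        if let Some(lang) = &c.code_lang {
            *by_lang.entry(lang.clone()).or_insert(0) += 1;
        }
        if let Some(repo) = &c.repo {
            *by_repo.entry(repo.clone()).or_insert(0) += 1;
        }
    }
    (by_lang, by_repo)
}
```

A `BTreeMap` keeps the keys sorted, which gives stable JSON output for snapshot-style tests.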
+ +- [ ] **Step 5: Commit** + +```bash +git add crates/kebab-app/src/schema.rs +git commit -m "feat(p10-1a-1): SchemaStats — code_lang_breakdown + repo_breakdown" +``` + +--- + +## Task 17: wire schema JSON files + +**Files:** +- Modify: `docs/wire-schema/v1/citation.schema.json` +- Modify: `docs/wire-schema/v1/search_hit.schema.json` +- Modify: `docs/wire-schema/v1/ingest_report.schema.json` +- Modify: `docs/wire-schema/v1/schema.schema.json` + +- [ ] **Step 1: Update `citation.schema.json`** + +Add `"code"` to the `kind` enum and `"code": { "type": "object" }` to the properties: + +```json +{ + "$schema": "https://json-schema.org/draft/2020-12/schema", + "$id": "https://kb.local/wire/v1/citation.schema.json", + "title": "Citation v1", + "description": "Stub schema — declares the schema_version label and the always-present fields. Variant-discriminated property validation lands in a later phase.", + "type": "object", + "required": ["schema_version", "kind", "path", "uri", "indexed_at", "stale"], + "properties": { + "schema_version": { "const": "citation.v1" }, + "kind": { "enum": ["line", "page", "region", "caption", "time", "code"] }, + "path": { "type": "string" }, + "uri": { "type": "string" }, + "line": { "type": "object" }, + "page": { "type": "object" }, + "region": { "type": "object" }, + "caption": { "type": "object" }, + "time": { "type": "object" }, + "code": { "type": "object" }, + "indexed_at": { "type": "string", "format": "date-time" }, + "stale": { "type": "boolean" } + } +} +``` + +- [ ] **Step 2: Update `search_hit.schema.json`** + +Add to `properties`: + +```json + "repo": { "type": ["string", "null"] }, + "code_lang": { "type": ["string", "null"] } +``` + +(Verify the file's existing structure first via `cat docs/wire-schema/v1/search_hit.schema.json`.) 
+ +- [ ] **Step 3: Update `ingest_report.schema.json`** + +Add to `properties`: + +```json + "skipped_gitignore": { "type": "integer", "minimum": 0 }, + "skipped_kebabignore": { "type": "integer", "minimum": 0 }, + "skipped_builtin_blacklist": { "type": "integer", "minimum": 0 }, + "skipped_generated": { "type": "integer", "minimum": 0 }, + "skipped_size_exceeded": { "type": "integer", "minimum": 0 }, + "skip_examples": { + "type": "object", + "properties": { + "generated": { "type": "array", "items": { "type": "string" }, "maxItems": 5 }, + "size_exceeded": { "type": "array", "items": { "type": "string" }, "maxItems": 5 }, + "builtin_blacklist": { "type": "array", "items": { "type": "string" }, "maxItems": 5 }, + "gitignore": { "type": "array", "items": { "type": "string" }, "maxItems": 5 } + } + } +``` + +- [ ] **Step 4: Update `schema.schema.json`** + +Add to `stats.properties`: + +```json + "code_lang_breakdown": { + "type": "object", + "additionalProperties": { "type": "integer", "minimum": 0 } + }, + "repo_breakdown": { + "type": "object", + "additionalProperties": { "type": "integer", "minimum": 0 } + } +``` + +- [ ] **Step 5: Verify JSON validity** + +Run: `for f in docs/wire-schema/v1/*.json; do python3 -m json.tool < "$f" > /dev/null && echo "$f OK" || echo "$f BAD"; done` +Expected: All files report OK. 
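The `skip_examples` arrays in Step 3 are capped at five entries (`"maxItems": 5`) while the counters are unbounded. A minimal sketch of that pairing follows; `SkipTally` and `record` are hypothetical names, and keeping the first five paths (rather than, say, a random sample) is an assumption.

```rust
/// Upper bound taken from `"maxItems": 5` in the schema.
const MAX_SKIP_EXAMPLES: usize = 5;

/// Hypothetical per-category tally: the counter always grows, while the
/// sample list stops growing once it holds `MAX_SKIP_EXAMPLES` paths.
#[derive(Default)]
struct SkipTally {
    count: u32,
    examples: Vec<String>,
}

impl SkipTally {
    fn record(&mut self, path: &str) {
        self.count += 1;
        if self.examples.len() < MAX_SKIP_EXAMPLES {
            self.examples.push(path.to_owned());
        }
    }
}
```

Capping the samples keeps the report small even when an ingest skips thousands of files in one category.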
+ +- [ ] **Step 6: Commit** + +```bash +git add docs/wire-schema/v1/ +git commit -m "feat(p10-1a-1): wire schema v1 — code variant + repo/code_lang + skip counters" +``` + +--- + +## Task 18: Frozen design doc update + +**Files:** +- Modify: `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` + +- [ ] **Step 1: Read §10.1 of the code ingest spec** for the exact list of frozen-design sections that need updating + +Run: `sed -n '/^### 10.1/,/^### 10.2/p' docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md` + +- [ ] **Step 2: Update §0 (동결된 결정 요약)** — add one row at the bottom + +``` +| C+ | code ingest 추가 | Tier 1/2/3 fan-out, e5-large 유지, 새 Citation `code` variant | 2026-05-15 spec cross-link | +``` + +- [ ] **Step 3: Update §2.1 Citation** — change "5-variant" to "6-variant" and add the `code` example block + +Find the existing `Citation (5 variants — discriminated by `kind`)` heading and update. + +- [ ] **Step 4: Update §2.2 SearchHit** — add `repo` / `code_lang` rows to the example JSON, with a one-line note "p10-1A-1: optional, omitted when null". + +- [ ] **Step 5: Update §2.4 IngestReport** — add the 5 new skip counters and `skip_examples` to the example. + +- [ ] **Step 6: Update §3.2 Versions / labels** — note "chunker_version family extended in phase 10 (per-language pattern). See 2026-05-15 spec §3.3 for canonical list." + +- [ ] **Step 7: Update §3.6 Metadata** — add the four new fields with one-line notes. + +- [ ] **Step 8: Update §8 모듈 경계** — add `kebab-parse-code` to the crate inventory and inheritance rules (same boundary as other `kebab-parse-*`). + +- [ ] **Step 9: Update §11 동결 범위** — add one line: "코드 ingest 는 더 이상 비-스코프 아님 (2026-05-15 spec). 단 multi-workspace / watch mode / history aware 는 그대로 비-스코프." 
+ +- [ ] **Step 10: Commit** + +```bash +git add docs/superpowers/specs/2026-04-27-kebab-final-form-design.md +git commit -m "docs(p10-1a-1): apply code ingest framework to frozen design" +``` + +--- + +## Task 19: README / HANDOFF / SMOKE updates + +**Files:** +- Modify: `README.md` +- Modify: `HANDOFF.md` +- Modify: `docs/SMOKE.md` + +- [ ] **Step 1: Update README's `kebab search` command row** + +Append to the existing flag list inside the `search` row table: + +``` +[--repo NAME ...] [--code-lang LIST] [--media code] +``` + +After the existing `--media md` / `--media markdown` alias paragraph (around the "filter flags" block), add: + +```markdown +**code corpus filters (p10-1A-1):** `--repo` 는 반복 가능 (`--repo kebab --repo other`) OR 매칭. `--code-lang` 는 반복 또는 comma 다중 값 (`--code-lang rust,python`), 알 수 없는 값은 빈 hits. `--media code` 는 Tier 1/2/3 모든 code chunk 포함. 1A-1 시점에서는 indexed 된 code chunk 가 없어 filter 가 항상 빈 결과 — 1A-2 (Rust AST chunker) 머지 이후 실효. +``` + +- [ ] **Step 2: Add Configuration row about `[ingest.code]`** + +Under the existing Configuration section, add: + +```markdown +- `[ingest.code]` (p10-1A-1) — code ingest 의 skip 정책 + chunker 기본값. + - `skip_generated_header = true` — 첫 ~512 byte 의 generated marker (`@generated` / `DO NOT EDIT` 등) 감지 시 skip. + - `max_file_bytes = 262144` (256 KiB) / `max_file_lines = 5000` — 파일당 cap, 초과 시 skip. + - `extra_skip_globs = []` — 사용자 추가 skip 패턴 (`.gitignore` 문법). + - `.gitignore` honor: 자동 적용. `.kebabignore` 는 추가 layer. 우선순위: built-in safety net (`node_modules/` / `target/` / `__pycache__/` / `.venv/` / `venv/` / `env/`) > `.gitignore` > `.kebabignore`. 
+``` + +- [ ] **Step 3: Update HANDOFF.md** + +Add a row to the phase status table: + +``` +| 10 | code ingest framework | 🟡 진행 중 (1A-1) | 1A-1 머지 시점 wire schema + 새 crate skeleton 동결, code chunker 는 1A-2 부터 | +``` + +- [ ] **Step 4: Update docs/SMOKE.md config example** + +In the `/tmp/kebab-smoke/config.toml` block, append: + +```toml +[ingest.code] +skip_generated_header = true +max_file_bytes = 262144 +max_file_lines = 5000 +``` + +(Default values — same as the in-code defaults. Smoke workflow doesn't need overrides; the block exists for discoverability.) + +- [ ] **Step 5: Commit** + +```bash +git add README.md HANDOFF.md docs/SMOKE.md +git commit -m "docs(p10-1a-1): README + HANDOFF + SMOKE — code ingest framework" +``` + +--- + +## Task 20: tasks index + p10 directory + +**Files:** +- Create: `tasks/p10/INDEX.md` +- Create: `tasks/p10/p10-1a-1-code-ingest-framework.md` +- Modify: `tasks/INDEX.md` + +- [ ] **Step 1: Create `tasks/p10/INDEX.md`** + +```markdown +# Phase 10 — Code Ingest + +| ID | Subject | Status | +|----|---------|--------| +| 1A-1 | code ingest framework (wire schema, parse-code crate skeleton, filter flags, skip policy, config 절) | 🟡 진행 중 | +| 1A-2 | Rust AST chunker | ⏳ | +| 1B | Python + TS/JS AST chunkers | ⏳ | +| 1C | Go + Java + Kotlin AST chunkers | ⏳ | +| 1D | C + C++ AST chunkers | ⏳ | +| 2 | Tier 2 resource-aware (k8s / Dockerfile / manifest) | ⏳ | +| 3 | Tier 3 paragraph + line-window fallback | ⏳ | + +Design: [2026-05-15-kebab-code-ingest-design.md](../../docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md) +``` + +- [ ] **Step 2: Create `tasks/p10/p10-1a-1-code-ingest-framework.md`** + +```markdown +# p10-1A-1 — code ingest framework + +**Status:** 🟡 진행 중 +**Contract sections:** §2.1 (Citation `code` variant), §2.2 (SearchHit repo/code_lang), §2.4 (IngestReport skip counters), §2 schema.v1 (code_lang_breakdown + repo_breakdown), §3.6 (Metadata fields), §8 (kebab-parse-code crate boundary), §11 (code ingest no longer 
비-스코프). +**Design:** [2026-05-15-kebab-code-ingest-design.md](../../docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md) §1A-1. +**Plan:** [2026-05-15-p10-1a-1-code-ingest-framework.md](../../docs/superpowers/plans/2026-05-15-p10-1a-1-code-ingest-framework.md). + +## Goal + +Land the *framework surface* for code ingest — wire schema (additive minor), CLI filter flags, ignore policy, skip policy infrastructure, `kebab-parse-code` crate skeleton, `[ingest.code]` config section — without enabling any code chunker. 1A-2 plugs the Rust AST chunker on top. + +## Acceptance criteria + +- `cargo test --workspace --no-fail-fast -j 1` passes. +- Regression test (`wire_search_hit_no_code_fields`, `wire_citation_5_variants_unchanged`) passes — markdown corpus wire output unchanged. +- `cargo clippy --workspace --all-targets -- -D warnings` passes. +- `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` updated per design §10.1. +- README + HANDOFF + SMOKE updated. + +## Allowed dependencies + +- `kebab-parse-code` may depend on `kebab-core`, `anyhow`, `gix`. NOT on store / embed / llm / rag / UI. +- Source-fs may depend on `kebab-parse-code`. + +## Forbidden dependencies + +- UI crates (cli / mcp / tui) must NOT import `kebab-parse-code` directly. + +## Risks / notes + +- `.gitignore` honor changes existing behavior for markdown corpora whose files live in gitignored areas. Regression test covers the standard case (no overlap). If a user reports missing docs after 1A-1 lands, log to HOTFIXES. +``` + +- [ ] **Step 3: Update `tasks/INDEX.md`** — add a phase 10 row + +- [ ] **Step 4: Commit** + +```bash +git add tasks/p10/ tasks/INDEX.md +git commit -m "docs(p10-1a-1): task index + framework task spec" +``` + +--- + +## Task 21: final clippy + full workspace test + +- [ ] **Step 1: Run clippy across the workspace** + +Run: `cargo clippy --workspace --all-targets -- -D warnings` +Expected: PASS. Fix any new warnings the framework code introduced. 
+ +- [ ] **Step 2: Run the full workspace test with -j 1 (per CLAUDE.md)** + +Run: `cargo test --workspace --no-fail-fast -j 1` +Expected: PASS. + +- [ ] **Step 3: Manually run a smoke ingest against the temp workspace** + +```bash +mkdir -p /tmp/kebab-p10-smoke +cat > /tmp/kebab-p10-smoke/config.toml <<'EOF' +[workspace] +root = "/tmp/kebab-p10-smoke/notes" +EOF +mkdir -p /tmp/kebab-p10-smoke/notes +echo "# hello" > /tmp/kebab-p10-smoke/notes/a.md +cargo run --release -p kebab-cli -- --config /tmp/kebab-p10-smoke/config.toml init +cargo run --release -p kebab-cli -- --config /tmp/kebab-p10-smoke/config.toml ingest --json | jq '.skipped_gitignore, .skipped_generated, .skipped_size_exceeded' +``` + +Expected: All three values = `0`. Wire output includes the new fields (even when zero — verify via `--json`). + +- [ ] **Step 4: `cargo clean` to recover disk** (per CLAUDE.md routine-after-merge rule, but do it now before opening the PR — keeps the work-tree light) + +Optional but recommended after the test run. + +- [ ] **Step 5: Final commit if any clippy fixes were needed; otherwise skip** + +```bash +git commit -m "chore(p10-1a-1): final clippy pass" +``` + +--- + +## Self-Review + +After all 21 tasks land, do a final sanity check before opening the PR: + +**Spec coverage:** +- §2 Phase 1A-1 row → Tasks 1-21 cover every bullet in the table. +- §3.1 Citation::Code variant → Task 1 + 17. +- §3.2 SearchHit fields → Task 2 + 17. +- §3.5 Metadata extension → Task 4. +- §4 wire schema → Task 17. +- §5.1 repo detect → Task 7. +- §5.2 ignore integration → Task 9 + 10. +- §5.3 generated header → Task 8 + 12. +- §5.4 size cap → Task 8 + 12. +- §5.5 IngestReport skip counters → Task 3 + 11 + 12. +- §6 crate structure → Task 5. +- §7.1 CLI filter flags → Task 15. +- §7.2 schema stats → Task 16. +- §8 config section → Task 14. +- §10.1 frozen design update → Task 18. +- §10.4 no binary bump for 1A-1 → respected (no bump commit). 
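The skip-policy entries in the coverage list (§5.3 generated header, §5.4 size cap) reduce to a small decision function over the `[ingest.code]` defaults from Task 14. The sketch below is a hedged illustration, not the plan's implementation: `skip_reason` and `Skip` are hypothetical names, only the two generated markers named in the README text are included (the plan mentions seven), and running the size check before the header sniff is an assumption.

```rust
/// Defaults copied from `IngestCodeCfg::default()` in Task 14.
const MAX_FILE_BYTES: u64 = 262_144;
const MAX_FILE_LINES: u32 = 5_000;
/// Only the two markers named in the README text; the remaining markers
/// are not listed in this plan and are left out.
const GENERATED_MARKERS: [&str; 2] = ["@generated", "DO NOT EDIT"];
/// The sniff looks at roughly the first 512 bytes.
const SNIFF_BYTES: usize = 512;

#[derive(Debug, PartialEq)]
enum Skip {
    SizeExceeded,
    Generated,
}

/// Returns `Some(reason)` when the file should be skipped. The byte cap is
/// checked before the line cap, as the config doc comment states.
fn skip_reason(content: &[u8]) -> Option<Skip> {
    if content.len() as u64 > MAX_FILE_BYTES {
        return Some(Skip::SizeExceeded);
    }
    let lines = content.iter().filter(|&&b| b == b'\n').count();
    if lines as u64 > u64::from(MAX_FILE_LINES) {
        return Some(Skip::SizeExceeded);
    }
    // Only the head of the file is inspected for generated markers.
    let head = &content[..content.len().min(SNIFF_BYTES)];
    let head = String::from_utf8_lossy(head);
    if GENERATED_MARKERS.iter().any(|m| head.contains(*m)) {
        return Some(Skip::Generated);
    }
    None
}
```

A marker that appears after the sniff window does not count, which is why a generated banner has to sit near the top of the file to be detected.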
+
+**Placeholder scan:** Search for `TBD` / `TODO` / `XXX` / `FIXME` in the plan body; none should remain. (`open question` references in the spec are intentional and don't carry forward into the plan.)
+
+**Type consistency:**
+- `Citation::Code` field names match across Task 1, 13, 17, 18.
+- `SearchHit.repo` / `code_lang` match across Task 2, 13, 17.
+- `Metadata` field names match Task 4 + Task 18.
+- `IngestReport` skip counter names match Task 3, 11, 12, 17.
+- `IngestCodeCfg` field names match Task 14, 12.
+
+If any names have drifted, fix them before opening the PR.
diff --git a/docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md b/docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md
index f074c9c..3a780eb 100644
--- a/docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md
+++ b/docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md
@@ -101,18 +101,17 @@ frozen design §2.1 의 5 variant (`line` / `page` / `region` / `caption` / `tim
   "kind": "code",
   "path": "kebab/crates/kebab-chunk/src/md_heading_v1.rs",
   "uri": "kebab/crates/kebab-chunk/src/md_heading_v1.rs#L142-L168",
-
-  "code": {
-    "line_start": 142,
-    "line_end": 168,
-    "symbol": "MdHeadingV1Chunker::chunk_doc",
-    "lang": "rust"
-  }
+  "line_start": 142,
+  "line_end": 168,
+  "symbol": "MdHeadingV1Chunker::chunk_doc",
+  "lang": "rust"
 }
 ```
 
-`code.symbol` is nullable: filled for Tier 1 AST chunks, left empty (`null`) for Tier 2/3.
-`code.lang` is the same identifier as the `--code-lang` filter (lowercase). May be null.
+**Wire shape: flat.** Same pattern as the existing 5 variants (`Citation::Line` also keeps `start` / `end` / `section` at the top level, with no nesting). Because the enum is internally tagged via serde `#[serde(tag = "kind")]`, each variant's fields land at the top level next to `kind`.
+
+`symbol` is nullable: filled for Tier 1 AST chunks, `null` for Tier 2/3.
+`lang` is the same identifier as the `--code-lang` filter (lowercase). May be null.
 
 Like the existing 5 variants, `path` + `uri` are always filled. `uri` stays in the `path#L<start>-L<end>` form (W3C Media Fragments).
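The line-fragment form of `uri` can be illustrated with a small format/parse pair. This is a sketch under stated assumptions: `code_uri` and `parse_code_uri` are hypothetical helpers, not kebab's `Citation::to_uri` or `Citation::parse`; the single-line form `#L<n>` follows the `to_uri` change in the next patch of this series, and rejecting paths that contain `#` mirrors the existing parse test there.

```rust
/// Build the `uri` for a code citation: `path#L<n>` for a single line,
/// `path#L<start>-L<end>` for a range.
fn code_uri(path: &str, line_start: u32, line_end: u32) -> String {
    if line_start == line_end {
        format!("{path}#L{line_start}")
    } else {
        format!("{path}#L{line_start}-L{line_end}")
    }
}

/// Hypothetical inverse of `code_uri`. Rejects paths containing `#` as well
/// as malformed or reversed ranges.
fn parse_code_uri(uri: &str) -> Option<(&str, u32, u32)> {
    let (path, frag) = uri.split_once('#')?;
    if path.is_empty() || frag.contains('#') {
        return None;
    }
    let frag = frag.strip_prefix('L')?;
    let (start, end): (u32, u32) = match frag.split_once("-L") {
        Some((s, e)) => (s.parse().ok()?, e.parse().ok()?),
        None => {
            let n: u32 = frag.parse().ok()?;
            (n, n)
        }
    };
    if start <= end {
        Some((path, start, end))
    } else {
        None
    }
}
```

Keeping the fragment grammar this narrow means a consumer can recover `line_start` and `line_end` from `uri` alone, although the flat wire form already carries both as separate fields.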
-- 2.49.1 From 3b1e878aed3e851c9743a10518dbe2e8a1257ab0 Mon Sep 17 00:00:00 2001 From: th-kim0823 Date: Fri, 15 May 2026 14:39:18 +0900 Subject: [PATCH 04/21] feat(p10-1a-1): add Citation::Code variant Co-Authored-By: Claude Sonnet 4.6 --- crates/kebab-core/src/citation.rs | 82 ++++++++++++++++++++++++++++++- 1 file changed, 81 insertions(+), 1 deletion(-) diff --git a/crates/kebab-core/src/citation.rs b/crates/kebab-core/src/citation.rs index 80adb67..a27b28d 100644 --- a/crates/kebab-core/src/citation.rs +++ b/crates/kebab-core/src/citation.rs @@ -37,6 +37,13 @@ pub enum Citation { end_ms: u64, speaker: Option, }, + Code { + path: WorkspacePath, + line_start: u32, + line_end: u32, + symbol: Option, + lang: Option, + }, } impl Citation { @@ -46,7 +53,8 @@ impl Citation { | Citation::Page { path, .. } | Citation::Region { path, .. } | Citation::Caption { path, .. } - | Citation::Time { path, .. } => path, + | Citation::Time { path, .. } + | Citation::Code { path, .. } => path, } } @@ -80,6 +88,18 @@ impl Citation { None => format!("{}#t={},{}", path.0, s, e), } } + Citation::Code { + path, + line_start, + line_end, + .. 
+ } => { + if line_start == line_end { + format!("{}#L{}", path.0, line_start) + } else { + format!("{}#L{}-L{}", path.0, line_start, line_end) + } + } } } @@ -354,4 +374,64 @@ mod tests { let r = Citation::parse("notes/x#evil.md#L7"); assert!(r.is_err(), "path with embedded '#' must be rejected"); } + + #[test] + fn citation_code_variant_serializes_with_kind_tag() { + let c = Citation::Code { + path: WorkspacePath("crates/kebab-chunk/src/md_heading_v1.rs".into()), + line_start: 142, + line_end: 168, + symbol: Some("MdHeadingV1Chunker::chunk_doc".into()), + lang: Some("rust".into()), + }; + let v = serde_json::to_value(&c).unwrap(); + assert_eq!(v["kind"], "code"); + assert_eq!(v["line_start"], 142); + assert_eq!(v["line_end"], 168); + assert_eq!(v["symbol"], "MdHeadingV1Chunker::chunk_doc"); + assert_eq!(v["lang"], "rust"); + // Existing 5 variants must NOT pick up these fields. + let line = Citation::Line { + path: WorkspacePath("notes/foo.md".into()), + start: 1, + end: 10, + section: None, + }; + let lv = serde_json::to_value(&line).unwrap(); + assert!(lv.get("line_start").is_none()); + assert!(lv.get("symbol").is_none()); + } + + #[test] + fn citation_code_uri_format() { + let c = Citation::Code { + path: WorkspacePath("a/b.rs".into()), + line_start: 10, + line_end: 20, + symbol: None, + lang: Some("rust".into()), + }; + assert_eq!(c.to_uri(), "a/b.rs#L10-L20"); + // Single-line uses `#L10`. 
+ let single = Citation::Code { + path: WorkspacePath("a/b.rs".into()), + line_start: 5, + line_end: 5, + symbol: None, + lang: None, + }; + assert_eq!(single.to_uri(), "a/b.rs#L5"); + } + + #[test] + fn citation_code_path_accessor() { + let c = Citation::Code { + path: WorkspacePath("x.rs".into()), + line_start: 1, + line_end: 1, + symbol: None, + lang: None, + }; + assert_eq!(c.path().0, "x.rs"); + } } -- 2.49.1 From fa4eeb5a87fc3196ed0d2e2671155b064f69a236 Mon Sep 17 00:00:00 2001 From: th-kim0823 Date: Fri, 15 May 2026 15:04:23 +0900 Subject: [PATCH 05/21] feat(p10-1a-1): add SearchHit.repo / code_lang + SearchFilters.repo / code_lang Wire two new optional fields onto SearchHit (skip_serializing_if = None) and two Vec filter fields onto SearchFilters (serde default). Add RetrievalDetail::Default impl (manual, uses SearchMode::Hybrid as sentinel). Patch all downstream SearchHit / SearchFilters literal constructors with repo: None / code_lang: None / vec![] as appropriate. Also covers Citation::Code arm in kebab-eval metrics match. 
--- crates/kebab-app/src/bulk.rs | 2 + crates/kebab-cli/src/main.rs | 2 + crates/kebab-core/src/search.rs | 102 ++++++++++++++++++ crates/kebab-eval/src/metrics.rs | 5 +- .../kebab-eval/tests/metrics_and_compare.rs | 2 + crates/kebab-mcp/src/tools/search.rs | 2 + crates/kebab-search/src/hybrid.rs | 4 + crates/kebab-search/src/lexical.rs | 2 + crates/kebab-search/src/vector.rs | 2 + 9 files changed, 122 insertions(+), 1 deletion(-) diff --git a/crates/kebab-app/src/bulk.rs b/crates/kebab-app/src/bulk.rs index 50676b4..36be6c4 100644 --- a/crates/kebab-app/src/bulk.rs +++ b/crates/kebab-app/src/bulk.rs @@ -197,6 +197,8 @@ fn parse_one(raw: &Value) -> Result<(SearchQuery, SearchOpts), String> { media, ingested_after, doc_id, + repo: vec![], + code_lang: vec![], }; let opts = SearchOpts { diff --git a/crates/kebab-cli/src/main.rs b/crates/kebab-cli/src/main.rs index db257ff..3ca6e63 100644 --- a/crates/kebab-cli/src/main.rs +++ b/crates/kebab-cli/src/main.rs @@ -828,6 +828,8 @@ fn run(cli: &Cli) -> anyhow::Result<()> { media: media_norm, ingested_after: ingested_after_parsed, doc_id: doc_id.as_ref().map(|s| kebab_core::DocumentId(s.clone())), + repo: vec![], + code_lang: vec![], }; let q = kebab_core::SearchQuery { diff --git a/crates/kebab-core/src/search.rs b/crates/kebab-core/src/search.rs index 137370c..eaf8470 100644 --- a/crates/kebab-core/src/search.rs +++ b/crates/kebab-core/src/search.rs @@ -61,6 +61,14 @@ pub struct SearchFilters { /// p9-fb-36: restrict hits to a single document. None = no filter. #[serde(default)] pub doc_id: Option, + /// p10-1A-1: filter by `metadata.repo`. Empty = no filter; multi-value = OR. + #[serde(default)] + pub repo: Vec, + /// p10-1A-1: filter by `metadata.code_lang`. Empty = no filter; multi-value = OR. + /// Identifiers are lowercase canonical names (`rust`, `python`, `typescript`, ...). + /// Unknown values produce empty hits (consistent with `media` policy). 
+ #[serde(default)] + pub code_lang: Vec, } #[derive(Clone, Debug, PartialEq, Serialize, Deserialize)] @@ -89,6 +97,15 @@ pub struct SearchHit { /// 옛 wire (fb-38 미만) 부재 시 `Rrf` default — hybrid 가 기본 mode. #[serde(default)] pub score_kind: ScoreKind, + /// p10-1A-1: optional. Filled when the source file lives in a git repo + /// (`.git/` walk-up). null for markdown / pdf / image hits and for code + /// hits ingested via `kebab ingest-file` outside a repo boundary. + #[serde(default, skip_serializing_if = "Option::is_none")] + pub repo: Option, + /// p10-1A-1: optional. Programming language identifier (lowercase). Set for + /// every code/manifest/k8s chunk; null for markdown / pdf / image hits. + #[serde(default, skip_serializing_if = "Option::is_none")] + pub code_lang: Option, } #[derive(Clone, Debug, PartialEq, Serialize, Deserialize)] @@ -101,6 +118,19 @@ pub struct RetrievalDetail { pub vector_rank: Option, } +impl Default for RetrievalDetail { + fn default() -> Self { + Self { + method: SearchMode::Hybrid, + fusion_score: 0.0, + lexical_score: None, + vector_score: None, + lexical_rank: None, + vector_rank: None, + } + } +} + /// Filter for `kb-app::list_docs` (§7.2 DocumentStore::list_documents). 
#[derive(Clone, Debug, Default, PartialEq, Serialize, Deserialize)] pub struct DocFilter { @@ -257,6 +287,8 @@ mod tests { indexed_at: datetime!(2026-05-09 12:00:00 UTC), stale: true, score_kind: ScoreKind::Rrf, + repo: None, + code_lang: None, }; let v = serde_json::to_value(&hit).unwrap(); assert_eq!(v["indexed_at"], "2026-05-09T12:00:00Z"); @@ -429,4 +461,74 @@ mod tests { assert!(v["response"].is_null()); assert_eq!(v["error"]["code"], "config_invalid"); } + + #[test] + fn search_hit_repo_and_code_lang_are_optional_and_omit_when_none() { + let hit = SearchHit { + rank: 1, + chunk_id: ChunkId("c1".into()), + doc_id: DocumentId("d1".into()), + doc_path: WorkspacePath("a.md".into()), + heading_path: vec![], + section_label: None, + snippet: "".into(), + citation: Citation::Line { + path: WorkspacePath("a.md".into()), + start: 1, + end: 2, + section: None, + }, + retrieval: RetrievalDetail::default(), + index_version: IndexVersion("v1".into()), + embedding_model: None, + chunker_version: ChunkerVersion("md-heading-v1".into()), + indexed_at: time::OffsetDateTime::UNIX_EPOCH, + stale: false, + score_kind: ScoreKind::Rrf, + repo: None, + code_lang: None, + }; + let v = serde_json::to_value(&hit).unwrap(); + assert!(v.get("repo").is_none(), "repo should be omitted when None"); + assert!(v.get("code_lang").is_none(), "code_lang should be omitted when None"); + } + + #[test] + fn search_hit_repo_and_code_lang_present_when_some() { + let hit = SearchHit { + rank: 1, + chunk_id: ChunkId("c1".into()), + doc_id: DocumentId("d1".into()), + doc_path: WorkspacePath("a.rs".into()), + heading_path: vec![], + section_label: None, + snippet: "".into(), + citation: Citation::Code { + path: WorkspacePath("a.rs".into()), + line_start: 1, + line_end: 2, + symbol: None, + lang: Some("rust".into()), + }, + retrieval: RetrievalDetail::default(), + index_version: IndexVersion("v1".into()), + embedding_model: None, + chunker_version: ChunkerVersion("code-rust-ast-v1".into()), + indexed_at: 
time::OffsetDateTime::UNIX_EPOCH, + stale: false, + score_kind: ScoreKind::Rrf, + repo: Some("kebab".into()), + code_lang: Some("rust".into()), + }; + let v = serde_json::to_value(&hit).unwrap(); + assert_eq!(v["repo"], "kebab"); + assert_eq!(v["code_lang"], "rust"); + } + + #[test] + fn search_filters_repo_and_code_lang_default_to_empty_vec() { + let f = SearchFilters::default(); + assert!(f.repo.is_empty()); + assert!(f.code_lang.is_empty()); + } } diff --git a/crates/kebab-eval/src/metrics.rs b/crates/kebab-eval/src/metrics.rs index f138845..6a80ed0 100644 --- a/crates/kebab-eval/src/metrics.rs +++ b/crates/kebab-eval/src/metrics.rs @@ -338,7 +338,8 @@ pub(crate) fn aggregate_from_rows( | Citation::Page { path, .. } | Citation::Region { path, .. } | Citation::Caption { path, .. } - | Citation::Time { path, .. } => !path.0.is_empty(), + | Citation::Time { path, .. } + | Citation::Code { path, .. } => !path.0.is_empty(), }); if covered { citation_num += 1; @@ -472,6 +473,8 @@ mod tests { indexed_at: OffsetDateTime::UNIX_EPOCH, stale: false, score_kind: kebab_core::ScoreKind::Rrf, + repo: None, + code_lang: None, } } diff --git a/crates/kebab-eval/tests/metrics_and_compare.rs b/crates/kebab-eval/tests/metrics_and_compare.rs index 7cd7355..17b6e56 100644 --- a/crates/kebab-eval/tests/metrics_and_compare.rs +++ b/crates/kebab-eval/tests/metrics_and_compare.rs @@ -87,6 +87,8 @@ fn hit(rank: u32, chunk_id: &str, doc_id: &str) -> SearchHit { indexed_at: OffsetDateTime::UNIX_EPOCH, stale: false, score_kind: kebab_core::ScoreKind::Rrf, + repo: None, + code_lang: None, } } diff --git a/crates/kebab-mcp/src/tools/search.rs b/crates/kebab-mcp/src/tools/search.rs index 722dbdd..2586294 100644 --- a/crates/kebab-mcp/src/tools/search.rs +++ b/crates/kebab-mcp/src/tools/search.rs @@ -110,6 +110,8 @@ pub fn handle(state: &KebabAppState, input: SearchInput) -> CallToolResult { media, ingested_after, doc_id: input.doc_id.clone().map(kebab_core::DocumentId), + repo: vec![], + 
code_lang: vec![], }; let query = kebab_core::SearchQuery { diff --git a/crates/kebab-search/src/hybrid.rs b/crates/kebab-search/src/hybrid.rs index 6d9286b..3378f51 100644 --- a/crates/kebab-search/src/hybrid.rs +++ b/crates/kebab-search/src/hybrid.rs @@ -509,6 +509,8 @@ mod tests { indexed_at: time::OffsetDateTime::UNIX_EPOCH, stale: false, score_kind: kebab_core::ScoreKind::Rrf, + repo: None, + code_lang: None, } } @@ -760,6 +762,8 @@ mod tests { indexed_at: time::OffsetDateTime::UNIX_EPOCH, stale: false, score_kind: kebab_core::ScoreKind::Rrf, + repo: None, + code_lang: None, } } diff --git a/crates/kebab-search/src/lexical.rs b/crates/kebab-search/src/lexical.rs index 9d83b8f..43b4d26 100644 --- a/crates/kebab-search/src/lexical.rs +++ b/crates/kebab-search/src/lexical.rs @@ -470,6 +470,8 @@ fn build_hit( // in `RagPipeline::ask` against the configured threshold. stale: false, score_kind: ScoreKind::Bm25, + repo: None, + code_lang: None, }) } diff --git a/crates/kebab-search/src/vector.rs b/crates/kebab-search/src/vector.rs index 47eda97..3975c2e 100644 --- a/crates/kebab-search/src/vector.rs +++ b/crates/kebab-search/src/vector.rs @@ -327,6 +327,8 @@ fn build_hit( // in `RagPipeline::ask` against the configured threshold. stale: false, score_kind: ScoreKind::Cosine, + repo: None, + code_lang: None, }) } -- 2.49.1 From 7329ba96eeaf0a81232b4d4abc3fed72a660a32e Mon Sep 17 00:00:00 2001 From: th-kim0823 Date: Fri, 15 May 2026 15:17:10 +0900 Subject: [PATCH 06/21] fix(p10-1a-1): patch missed SearchHit test-only construction sites Add repo: None, code_lang: None to the 3 SearchHit struct literals inside #[cfg(test)] blocks that were missed by the fa4eeb5 sweep. 
--- crates/kebab-rag/src/pipeline.rs | 2 ++ crates/kebab-rag/tests/common/mod.rs | 2 ++ crates/kebab-tui/tests/search.rs | 2 ++ 3 files changed, 6 insertions(+) diff --git a/crates/kebab-rag/src/pipeline.rs b/crates/kebab-rag/src/pipeline.rs index f6ee676..3f9aaa0 100644 --- a/crates/kebab-rag/src/pipeline.rs +++ b/crates/kebab-rag/src/pipeline.rs @@ -1170,6 +1170,8 @@ mod stream_event_serde_tests { indexed_at: datetime!(2026-05-09 12:00:00 UTC), stale: false, score_kind: kebab_core::ScoreKind::Rrf, + repo: None, + code_lang: None, } } diff --git a/crates/kebab-rag/tests/common/mod.rs b/crates/kebab-rag/tests/common/mod.rs index 022176c..d7d64b1 100644 --- a/crates/kebab-rag/tests/common/mod.rs +++ b/crates/kebab-rag/tests/common/mod.rs @@ -171,6 +171,8 @@ pub fn mk_hit_with_indexed_at( indexed_at, stale: false, score_kind: kebab_core::ScoreKind::Rrf, + repo: None, + code_lang: None, } } diff --git a/crates/kebab-tui/tests/search.rs b/crates/kebab-tui/tests/search.rs index b3dd31b..2a0119f 100644 --- a/crates/kebab-tui/tests/search.rs +++ b/crates/kebab-tui/tests/search.rs @@ -56,6 +56,8 @@ fn make_hit(rank: u32, path: &str, snippet: &str, citation: Citation) -> SearchH indexed_at: time::OffsetDateTime::UNIX_EPOCH, stale: false, score_kind: kebab_core::ScoreKind::Rrf, + repo: None, + code_lang: None, } } -- 2.49.1 From 351c7a0826930fc863deb58ec15489ed870595f6 Mon Sep 17 00:00:00 2001 From: th-kim0823 Date: Fri, 15 May 2026 15:28:19 +0900 Subject: [PATCH 07/21] feat(p10-1a-1): add IngestReport skip counters + SkipExamples MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds five new u32 counters (skipped_gitignore, skipped_kebabignore, skipped_builtin_blacklist, skipped_generated, skipped_size_exceeded) and a SkipExamples struct (≤5 sample paths per category) to IngestReport. All new fields are #[serde(default)] for backward-compat deserialization. 
Downstream literal construction sites patched with zeros/empty; snapshot re-baked. Co-Authored-By: Claude Sonnet 4.6 --- crates/kebab-app/src/lib.rs | 8 +- crates/kebab-cli/src/wire.rs | 8 +- crates/kebab-core/src/ingest.rs | 88 +++++++++++++++++++ crates/kebab-core/src/lib.rs | 2 +- .../snapshots/ingest_report.snapshot.json | 11 +++ .../tests/ingest_report_snapshot.rs | 6 ++ 6 files changed, 120 insertions(+), 3 deletions(-) diff --git a/crates/kebab-app/src/lib.rs b/crates/kebab-app/src/lib.rs index ebb538d..3c8e03c 100644 --- a/crates/kebab-app/src/lib.rs +++ b/crates/kebab-app/src/lib.rs @@ -44,7 +44,7 @@ use kebab_core::{ Answer, Block, CanonicalDocument, Chunk, ChunkId, ChunkPolicy, ChunkerVersion, Chunker, DocFilter, DocSummary, DocumentId, DocumentStore, Embedder, EmbeddingInput, EmbeddingKind, ExtractContext, Extractor, IngestReport, Lang, LanguageModel, MediaType, - ParserVersion, RawAsset, SearchHit, SearchQuery, SourceConnector, SourceScope, + ParserVersion, RawAsset, SearchHit, SearchQuery, SkipExamples, SourceConnector, SourceScope, SourceUri, VectorRecord, VectorStore, }; use kebab_llm_local::OllamaLanguageModel; @@ -675,6 +675,12 @@ pub fn ingest_with_config_opts( errors: error_count, duration_ms, skipped_by_extension, + skipped_gitignore: 0, + skipped_kebabignore: 0, + skipped_builtin_blacklist: 0, + skipped_generated: 0, + skipped_size_exceeded: 0, + skip_examples: SkipExamples::default(), items: if summary_only { None } else { Some(items) }, }) } diff --git a/crates/kebab-cli/src/wire.rs b/crates/kebab-cli/src/wire.rs index a71397a..3fab435 100644 --- a/crates/kebab-cli/src/wire.rs +++ b/crates/kebab-cli/src/wire.rs @@ -239,7 +239,7 @@ mod tests { #[test] fn ingest_wrapper_tags_schema_version() { - use kebab_core::SourceScope; + use kebab_core::{SkipExamples, SourceScope}; let r = IngestReport { scope: SourceScope { root: std::path::PathBuf::from("/tmp"), @@ -254,6 +254,12 @@ mod tests { errors: 0, duration_ms: 0, skipped_by_extension: 
std::collections::BTreeMap::new(), + skipped_gitignore: 0, + skipped_kebabignore: 0, + skipped_builtin_blacklist: 0, + skipped_generated: 0, + skipped_size_exceeded: 0, + skip_examples: SkipExamples::default(), items: None, }; let v = wire_ingest(&r); diff --git a/crates/kebab-core/src/ingest.rs b/crates/kebab-core/src/ingest.rs index 8ada477..a3a9916 100644 --- a/crates/kebab-core/src/ingest.rs +++ b/crates/kebab-core/src/ingest.rs @@ -25,10 +25,46 @@ pub struct IngestReport { /// extension key under "". `BTreeMap` so the wire JSON /// has stable key order across runs. pub skipped_by_extension: std::collections::BTreeMap, + /// p10-1A-1: files skipped because they matched a repo-local `.gitignore`. + #[serde(default)] + pub skipped_gitignore: u32, + /// p10-1A-1: files skipped because they matched a `.kebabignore` entry. + #[serde(default)] + pub skipped_kebabignore: u32, + /// p10-1A-1: files skipped because they matched the built-in safety-net + /// blacklist (`node_modules/`, `target/`, `__pycache__/`, `.venv/`, + /// `venv/`, `env/`). + #[serde(default)] + pub skipped_builtin_blacklist: u32, + /// p10-1A-1: files skipped because their first ~512 bytes contained a + /// generated-file marker (`@generated`, `do not edit`, …). + #[serde(default)] + pub skipped_generated: u32, + /// p10-1A-1: files skipped because they exceeded `max_file_bytes` or + /// `max_file_lines` in `[ingest.code]`. + #[serde(default)] + pub skipped_size_exceeded: u32, + /// p10-1A-1: sample file paths per skip category (≤ 5 each). + #[serde(default)] + pub skip_examples: SkipExamples, /// `None` ↔ wire `items: null` (`--summary-only`). pub items: Option>, } +/// p10-1A-1: per-category sample of skipped file paths. Each category caps at +/// 5 entries (oldest-first). Used for debugging "why was X not indexed?" 
+#[derive(Clone, Debug, Default, PartialEq, Serialize, Deserialize)] +pub struct SkipExamples { + #[serde(default)] + pub generated: Vec, + #[serde(default)] + pub size_exceeded: Vec, + #[serde(default)] + pub builtin_blacklist: Vec, + #[serde(default)] + pub gitignore: Vec, +} + #[derive(Clone, Debug, PartialEq, Serialize, Deserialize)] pub struct IngestItem { pub kind: IngestItemKind, @@ -58,3 +94,55 @@ pub enum IngestItemKind { Unchanged, Error, } + +#[cfg(test)] +mod tests { + use super::*; + use crate::traits::SourceScope; + + #[test] + fn skip_examples_default_is_empty() { + let s = SkipExamples::default(); + assert!(s.generated.is_empty()); + assert!(s.size_exceeded.is_empty()); + assert!(s.builtin_blacklist.is_empty()); + assert!(s.gitignore.is_empty()); + } + + #[test] + fn ingest_report_skip_counters_serialize() { + let r = IngestReport { + scope: SourceScope { + root: std::path::PathBuf::from("/tmp"), + include: vec![], + exclude: vec![], + }, + scanned: 100, + new: 50, + updated: 0, + skipped: 0, + unchanged: 0, + errors: 0, + duration_ms: 1234, + skipped_by_extension: Default::default(), + skipped_gitignore: 30, + skipped_kebabignore: 5, + skipped_builtin_blacklist: 10, + skipped_generated: 3, + skipped_size_exceeded: 2, + skip_examples: SkipExamples { + generated: vec!["a/b.pb.rs".into()], + size_exceeded: vec![], + builtin_blacklist: vec!["node_modules/x.js".into()], + gitignore: vec![], + }, + items: None, + }; + let v = serde_json::to_value(&r).unwrap(); + assert_eq!(v["skipped_gitignore"], 30); + assert_eq!(v["skipped_builtin_blacklist"], 10); + assert_eq!(v["skipped_generated"], 3); + assert_eq!(v["skipped_size_exceeded"], 2); + assert_eq!(v["skip_examples"]["generated"][0], "a/b.pb.rs"); + } +} diff --git a/crates/kebab-core/src/lib.rs b/crates/kebab-core/src/lib.rs index ba0ceb0..3b3b285 100644 --- a/crates/kebab-core/src/lib.rs +++ b/crates/kebab-core/src/lib.rs @@ -59,7 +59,7 @@ pub use answer::{ Answer, AnswerCitation, 
AnswerRetrievalSummary, ModelRef, RefusalReason, TokenUsage, TraceId, Turn, }; -pub use ingest::{IngestItem, IngestItemKind, IngestReport}; +pub use ingest::{IngestItem, IngestItemKind, IngestReport, SkipExamples}; pub use jobs::{JobFilter, JobId, JobKind, JobRow, JobStatus}; pub use vector::{VectorHit, VectorRecord}; pub use errors::CoreError; diff --git a/crates/kebab-store-sqlite/snapshots/ingest_report.snapshot.json b/crates/kebab-store-sqlite/snapshots/ingest_report.snapshot.json index 133aad3..f637f94 100644 --- a/crates/kebab-store-sqlite/snapshots/ingest_report.snapshot.json +++ b/crates/kebab-store-sqlite/snapshots/ingest_report.snapshot.json @@ -42,8 +42,19 @@ ], "root": "/home/u/KB" }, + "skip_examples": { + "builtin_blacklist": [], + "generated": [], + "gitignore": [], + "size_exceeded": [] + }, "skipped": 0, + "skipped_builtin_blacklist": 0, "skipped_by_extension": {}, + "skipped_generated": 0, + "skipped_gitignore": 0, + "skipped_kebabignore": 0, + "skipped_size_exceeded": 0, "unchanged": 0, "updated": 1 } diff --git a/crates/kebab-store-sqlite/tests/ingest_report_snapshot.rs b/crates/kebab-store-sqlite/tests/ingest_report_snapshot.rs index 458c6c6..5caf9dc 100644 --- a/crates/kebab-store-sqlite/tests/ingest_report_snapshot.rs +++ b/crates/kebab-store-sqlite/tests/ingest_report_snapshot.rs @@ -35,6 +35,12 @@ fn fixture_report() -> IngestReport { errors: 0, duration_ms: 187, skipped_by_extension: std::collections::BTreeMap::new(), + skipped_gitignore: 0, + skipped_kebabignore: 0, + skipped_builtin_blacklist: 0, + skipped_generated: 0, + skipped_size_exceeded: 0, + skip_examples: kebab_core::SkipExamples::default(), items: Some(vec![ IngestItem { kind: IngestItemKind::New, -- 2.49.1 From bf4ebf8d2ac532b77dce0aa611ceb66dbe17854c Mon Sep 17 00:00:00 2001 From: th-kim0823 Date: Fri, 15 May 2026 15:44:18 +0900 Subject: [PATCH 08/21] feat(p10-1a-1): add Metadata.repo / git_branch / git_commit / code_lang Four optional, serde-skipped-when-None fields added to 
`Metadata` for code ingest context. All 11 downstream construction sites patched with `repo: None, git_branch: None, git_commit: None, code_lang: None`. Full workspace check (`--tests`) and per-crate test suite pass clean. Co-Authored-By: Claude Sonnet 4.6 --- crates/kebab-chunk/src/md_heading_v1.rs | 4 ++ crates/kebab-chunk/src/pdf_page_v1.rs | 8 +++ crates/kebab-core/src/metadata.rs | 70 +++++++++++++++++++ crates/kebab-normalize/src/lib.rs | 4 ++ crates/kebab-parse-image/src/lib.rs | 4 ++ crates/kebab-parse-md/src/frontmatter.rs | 4 ++ crates/kebab-parse-pdf/src/lib.rs | 4 ++ .../kebab-store-sqlite/tests/idempotency.rs | 4 ++ .../tests/incremental_ingest.rs | 4 ++ crates/kebab-store-sqlite/tests/list_docs.rs | 4 ++ crates/kebab-tui/tests/inspect.rs | 4 ++ 11 files changed, 114 insertions(+) diff --git a/crates/kebab-chunk/src/md_heading_v1.rs b/crates/kebab-chunk/src/md_heading_v1.rs index fa0578f..b6094bd 100644 --- a/crates/kebab-chunk/src/md_heading_v1.rs +++ b/crates/kebab-chunk/src/md_heading_v1.rs @@ -472,6 +472,10 @@ mod tests { trust_level: TrustLevel::Primary, user_id_alias: None, user: Default::default(), + repo: None, + git_branch: None, + git_commit: None, + code_lang: None, }, provenance: Provenance { events: vec![] }, parser_version: kebab_core::ParserVersion("test-parser-0".into()), diff --git a/crates/kebab-chunk/src/pdf_page_v1.rs b/crates/kebab-chunk/src/pdf_page_v1.rs index cab61aa..41dfe83 100644 --- a/crates/kebab-chunk/src/pdf_page_v1.rs +++ b/crates/kebab-chunk/src/pdf_page_v1.rs @@ -347,6 +347,10 @@ mod tests { trust_level: TrustLevel::Primary, user_id_alias: None, user: Default::default(), + repo: None, + git_branch: None, + git_commit: None, + code_lang: None, }, provenance: Provenance { events: vec![] }, parser_version, @@ -512,6 +516,10 @@ mod tests { trust_level: TrustLevel::Primary, user_id_alias: None, user: Default::default(), + repo: None, + git_branch: None, + git_commit: None, + code_lang: None, }, provenance: Provenance { 
events: vec![] }, parser_version, diff --git a/crates/kebab-core/src/metadata.rs b/crates/kebab-core/src/metadata.rs index 229ee0d..bed5cc2 100644 --- a/crates/kebab-core/src/metadata.rs +++ b/crates/kebab-core/src/metadata.rs @@ -17,6 +17,25 @@ pub struct Metadata { pub user_id_alias: Option, /// Frontmatter keys we don't recognise are preserved here per §0 Q9. pub user: Map, + + /// p10-1A-1: name of the source repo if the file lives inside a git + /// working tree (`.git/` walk-up). null otherwise. + #[serde(default, skip_serializing_if = "Option::is_none")] + pub repo: Option, + + /// p10-1A-1: HEAD branch at ingest time. null when no repo or detached HEAD. + /// Informational only — current-state observability, not a partition key. + #[serde(default, skip_serializing_if = "Option::is_none")] + pub git_branch: Option, + + /// p10-1A-1: HEAD commit (40-hex) at ingest time. null when no repo. + #[serde(default, skip_serializing_if = "Option::is_none")] + pub git_commit: Option, + + /// p10-1A-1: programming language identifier (lowercase canonical). null + /// for markdown / pdf / image. Set by `kebab_parse_code::lang::code_lang_for_path`. 
+ #[serde(default, skip_serializing_if = "Option::is_none")] + pub code_lang: Option, } #[derive(Clone, Copy, Debug, Eq, Hash, PartialEq, Serialize, Deserialize)] @@ -66,3 +85,54 @@ pub enum ProvenanceKind { Warning, Error, } + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn metadata_repo_fields_default_to_none_and_omit_when_serialized() { + let m = Metadata { + aliases: vec![], + tags: vec![], + created_at: time::OffsetDateTime::UNIX_EPOCH, + updated_at: time::OffsetDateTime::UNIX_EPOCH, + source_type: SourceType::Markdown, + trust_level: TrustLevel::Primary, + user_id_alias: None, + user: Default::default(), + repo: None, + git_branch: None, + git_commit: None, + code_lang: None, + }; + let v = serde_json::to_value(&m).unwrap(); + assert!(v.get("repo").is_none()); + assert!(v.get("git_branch").is_none()); + assert!(v.get("git_commit").is_none()); + assert!(v.get("code_lang").is_none()); + } + + #[test] + fn metadata_repo_fields_present_when_some() { + let m = Metadata { + aliases: vec![], + tags: vec![], + created_at: time::OffsetDateTime::UNIX_EPOCH, + updated_at: time::OffsetDateTime::UNIX_EPOCH, + source_type: SourceType::Markdown, + trust_level: TrustLevel::Primary, + user_id_alias: None, + user: Default::default(), + repo: Some("kebab".into()), + git_branch: Some("main".into()), + git_commit: Some("a".repeat(40)), + code_lang: Some("rust".into()), + }; + let v = serde_json::to_value(&m).unwrap(); + assert_eq!(v["repo"], "kebab"); + assert_eq!(v["git_branch"], "main"); + assert_eq!(v["git_commit"].as_str().unwrap().len(), 40); + assert_eq!(v["code_lang"], "rust"); + } +} diff --git a/crates/kebab-normalize/src/lib.rs b/crates/kebab-normalize/src/lib.rs index d3edf54..bc1e988 100644 --- a/crates/kebab-normalize/src/lib.rs +++ b/crates/kebab-normalize/src/lib.rs @@ -467,6 +467,10 @@ mod tests { trust_level: TrustLevel::Primary, user_id_alias: None, user, + repo: None, + git_branch: None, + git_commit: None, + code_lang: None, } } diff --git 
a/crates/kebab-parse-image/src/lib.rs b/crates/kebab-parse-image/src/lib.rs index 5f1fc6f..a8d1be5 100644 --- a/crates/kebab-parse-image/src/lib.rs +++ b/crates/kebab-parse-image/src/lib.rs @@ -190,6 +190,10 @@ impl Extractor for ImageExtractor { trust_level: TrustLevel::Primary, user_id_alias: None, user, + repo: None, + git_branch: None, + git_commit: None, + code_lang: None, }; tracing::debug!( diff --git a/crates/kebab-parse-md/src/frontmatter.rs b/crates/kebab-parse-md/src/frontmatter.rs index 86d3f80..92c8a3c 100644 --- a/crates/kebab-parse-md/src/frontmatter.rs +++ b/crates/kebab-parse-md/src/frontmatter.rs @@ -471,6 +471,10 @@ fn derive_metadata( trust_level, user_id_alias, user, + repo: None, + git_branch: None, + git_commit: None, + code_lang: None, } } diff --git a/crates/kebab-parse-pdf/src/lib.rs b/crates/kebab-parse-pdf/src/lib.rs index 963a0fe..5f1b90e 100644 --- a/crates/kebab-parse-pdf/src/lib.rs +++ b/crates/kebab-parse-pdf/src/lib.rs @@ -194,6 +194,10 @@ impl Extractor for PdfTextExtractor { trust_level: TrustLevel::Primary, user_id_alias: None, user, + repo: None, + git_branch: None, + git_commit: None, + code_lang: None, }; tracing::debug!( diff --git a/crates/kebab-store-sqlite/tests/idempotency.rs b/crates/kebab-store-sqlite/tests/idempotency.rs index 8ff482b..85471e7 100644 --- a/crates/kebab-store-sqlite/tests/idempotency.rs +++ b/crates/kebab-store-sqlite/tests/idempotency.rs @@ -42,6 +42,10 @@ fn make_metadata() -> Metadata { trust_level: TrustLevel::Primary, user_id_alias: None, user: Default::default(), + repo: None, + git_branch: None, + git_commit: None, + code_lang: None, } } diff --git a/crates/kebab-store-sqlite/tests/incremental_ingest.rs b/crates/kebab-store-sqlite/tests/incremental_ingest.rs index 3c544a2..ef67706 100644 --- a/crates/kebab-store-sqlite/tests/incremental_ingest.rs +++ b/crates/kebab-store-sqlite/tests/incremental_ingest.rs @@ -51,6 +51,10 @@ fn make_doc() -> CanonicalDocument { trust_level: TrustLevel::Primary, 
user_id_alias: None, user: Default::default(), + repo: None, + git_branch: None, + git_commit: None, + code_lang: None, }; CanonicalDocument { doc_id, diff --git a/crates/kebab-store-sqlite/tests/list_docs.rs b/crates/kebab-store-sqlite/tests/list_docs.rs index 01bb626..acfad1c 100644 --- a/crates/kebab-store-sqlite/tests/list_docs.rs +++ b/crates/kebab-store-sqlite/tests/list_docs.rs @@ -54,6 +54,10 @@ fn make_doc( trust_level: trust, user_id_alias: None, user: Default::default(), + repo: None, + git_branch: None, + git_commit: None, + code_lang: None, }; let doc = CanonicalDocument { doc_id, diff --git a/crates/kebab-tui/tests/inspect.rs b/crates/kebab-tui/tests/inspect.rs index 7fb3413..039fa53 100644 --- a/crates/kebab-tui/tests/inspect.rs +++ b/crates/kebab-tui/tests/inspect.rs @@ -79,6 +79,10 @@ fn make_doc() -> CanonicalDocument { trust_level: TrustLevel::Primary, user_id_alias: None, user, + repo: None, + git_branch: None, + git_commit: None, + code_lang: None, }, provenance: Provenance { events: vec![ProvenanceEvent { -- 2.49.1 From ff11f81f7f15ee66bdd073c15063f5e3fa5dd0e8 Mon Sep 17 00:00:00 2001 From: th-kim0823 Date: Fri, 15 May 2026 15:57:59 +0900 Subject: [PATCH 09/21] feat(p10-1a-1): kebab-parse-code crate (lang + repo + skip) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Tasks 5-8: new `kebab-parse-code` crate with three infrastructure modules for the code ingest framework. Ships lang.rs (extension→language identifier mapping), repo.rs (.git walk-up via gix 0.70 for RepoMeta), and skip.rs (BUILTIN_BLACKLIST, is_generated_file, is_oversized). 14 integration tests across three test files, all passing; clippy -D warnings clean. Note: gix pinned to 0.70 (not 0.83 as originally suggested) because 0.83 fails to compile against Rust 1.94.1 due to non-exhaustive match patterns in gix-hash. 0.70 resolves cleanly and has identical head_name/head_id API. 
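For reviewers, a minimal std-only sketch of the three skip checks named above. Only the names `BUILTIN_BLACKLIST`, `is_generated_file`, and `is_oversized`, the blacklisted directory names, the ~512-byte sniff window, and the two markers come from this series. The signatures, the `is_blacklisted` helper, and the case-insensitive matching are assumptions for illustration and may differ from the real `skip.rs`.

```rust
use std::path::Path;

/// Directory names from the `skipped_builtin_blacklist` doc comment.
const BUILTIN_BLACKLIST: &[&str] = &[
    "node_modules", "target", "__pycache__", ".venv", "venv", "env",
];

/// Markers from the `skipped_generated` doc comment; the real list may be longer.
const GENERATED_MARKERS: &[&str] = &["@generated", "do not edit"];

/// Only the head of a file is sniffed (roughly the first 512 bytes).
const SNIFF_BYTES: usize = 512;

/// Hypothetical helper: true when any path component is a blacklisted directory name.
fn is_blacklisted(path: &Path) -> bool {
    path.components().any(|c| {
        c.as_os_str()
            .to_str()
            .map_or(false, |s| BUILTIN_BLACKLIST.iter().any(|d| *d == s))
    })
}

/// Assumed signature: takes the file content (or just its head) as bytes.
/// Matching is case-insensitive here by assumption.
fn is_generated_file(head: &[u8]) -> bool {
    let window = &head[..head.len().min(SNIFF_BYTES)];
    let text = String::from_utf8_lossy(window).to_lowercase();
    GENERATED_MARKERS.iter().any(|m| text.contains(*m))
}

/// Assumed signature: the limits mirror `max_file_bytes` / `max_file_lines`
/// from `[ingest.code]`; lines are counted as newline bytes.
fn is_oversized(content: &[u8], max_file_bytes: usize, max_file_lines: usize) -> bool {
    let lines = content.iter().filter(|&&b| b == b'\n').count();
    content.len() > max_file_bytes || lines > max_file_lines
}
```

Matching whole path components rather than substrings keeps a file such as `src/environment.rs` from being caught by the `env` entry.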
Co-Authored-By: Claude Sonnet 4.6 --- Cargo.lock | 660 ++++++++++++++++++++++++++ Cargo.toml | 5 + crates/kebab-parse-code/Cargo.toml | 13 + crates/kebab-parse-code/src/lang.rs | 42 ++ crates/kebab-parse-code/src/lib.rs | 22 + crates/kebab-parse-code/src/repo.rs | 61 +++ crates/kebab-parse-code/src/skip.rs | 65 +++ crates/kebab-parse-code/tests/lang.rs | 64 +++ crates/kebab-parse-code/tests/repo.rs | 62 +++ crates/kebab-parse-code/tests/skip.rs | 74 +++ 10 files changed, 1068 insertions(+) create mode 100644 crates/kebab-parse-code/Cargo.toml create mode 100644 crates/kebab-parse-code/src/lang.rs create mode 100644 crates/kebab-parse-code/src/lib.rs create mode 100644 crates/kebab-parse-code/src/repo.rs create mode 100644 crates/kebab-parse-code/src/skip.rs create mode 100644 crates/kebab-parse-code/tests/lang.rs create mode 100644 crates/kebab-parse-code/tests/repo.rs create mode 100644 crates/kebab-parse-code/tests/skip.rs diff --git a/Cargo.lock b/Cargo.lock index 332d306..499f233 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -755,6 +755,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "63044e1ae8e69f3b5a92c736ca6269b8d12fa7efe39bf34ddb06d102cf0e2cab" dependencies = [ "memchr", + "regex-automata", "serde", ] @@ -931,6 +932,15 @@ version = "1.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "c8d4a3bb8b1e0c1050499d1815f5ab16d04f0959b233085fb31653fbfc9d98f9" +[[package]] +name = "clru" +version = "0.6.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "197fd99cb113a8d5d9b6376f3aa817f32c1078f2343b714fff7d2ca44fdf67d5" +dependencies = [ + "hashbrown 0.16.1", +] + [[package]] name = "color_quant" version = "1.1.0" @@ -2140,6 +2150,12 @@ version = "2.0.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "117240f60069e65410b3ae1bb213295bd828f707b5bec6596a1afc8793ce0cbc" +[[package]] +name = "dunce" +version = "1.0.5" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "92773504d58c093f6de2459af4af33faa518c13451eb8f2b5698ed3d36e7c813" + [[package]] name = "dyn-clone" version = "1.0.20" @@ -2302,6 +2318,15 @@ dependencies = [ "tokenizers", ] +[[package]] +name = "faster-hex" +version = "0.9.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a2a2b11eda1d40935b26cf18f6833c526845ae8c41e58d09af6adeb6f0269183" +dependencies = [ + "serde", +] + [[package]] name = "fastrand" version = "2.4.1" @@ -2738,6 +2763,583 @@ dependencies = [ "weezl", ] +[[package]] +name = "gix" +version = "0.70.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "736f14636705f3a56ea52b553e67282519418d9a35bb1e90b3a9637a00296b68" +dependencies = [ + "gix-actor", + "gix-commitgraph", + "gix-config", + "gix-date", + "gix-diff", + "gix-discover", + "gix-features", + "gix-fs", + "gix-glob", + "gix-hash", + "gix-hashtable", + "gix-index", + "gix-lock", + "gix-object", + "gix-odb", + "gix-pack", + "gix-path", + "gix-protocol", + "gix-ref", + "gix-refspec", + "gix-revision", + "gix-revwalk", + "gix-sec", + "gix-shallow", + "gix-tempfile", + "gix-trace", + "gix-traverse", + "gix-url", + "gix-utils", + "gix-validate 0.9.4", + "once_cell", + "smallvec", + "thiserror 2.0.18", +] + +[[package]] +name = "gix-actor" +version = "0.33.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "20018a1a6332e065f1fcc8305c1c932c6b8c9985edea2284b3c79dc6fa3ee4b2" +dependencies = [ + "bstr", + "gix-date", + "gix-utils", + "itoa", + "thiserror 2.0.18", + "winnow 0.6.26", +] + +[[package]] +name = "gix-bitmap" +version = "0.2.16" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d982fc7ef0608e669851d0d2a6141dae74c60d5a27e8daa451f2a4857bbf41e2" +dependencies = [ + "thiserror 2.0.18", +] + +[[package]] +name = "gix-chunk" +version = "0.4.12" +source = "registry+https://github.com/rust-lang/crates.io-index" 
+checksum = "5c356b3825677cb6ff579551bb8311a81821e184453cbd105e2fc5311b288eeb" +dependencies = [ + "thiserror 2.0.18", +] + +[[package]] +name = "gix-command" +version = "0.4.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "cb410b84d6575db45e62025a9118bdbf4d4b099ce7575a76161e898d9ca98df1" +dependencies = [ + "bstr", + "gix-path", + "gix-trace", + "shell-words", +] + +[[package]] +name = "gix-commitgraph" +version = "0.26.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e23a8ec2d8a16026a10dafdb6ed51bcfd08f5d97f20fa52e200bc50cb72e4877" +dependencies = [ + "bstr", + "gix-chunk", + "gix-features", + "gix-hash", + "memmap2", + "thiserror 2.0.18", +] + +[[package]] +name = "gix-config" +version = "0.43.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "377c1efd2014d5d469e0b3cd2952c8097bce9828f634e04d5665383249f1d9e9" +dependencies = [ + "bstr", + "gix-config-value", + "gix-features", + "gix-glob", + "gix-path", + "gix-ref", + "gix-sec", + "memchr", + "once_cell", + "smallvec", + "thiserror 2.0.18", + "unicode-bom", + "winnow 0.6.26", +] + +[[package]] +name = "gix-config-value" +version = "0.14.12" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8dc2c844c4cf141884678cabef736fd91dd73068b9146e6f004ba1a0457944b6" +dependencies = [ + "bitflags", + "bstr", + "gix-path", + "libc", + "thiserror 2.0.18", +] + +[[package]] +name = "gix-date" +version = "0.9.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "daa30058ec7d3511fbc229e4f9e696a35abd07ec5b82e635eff864a2726217e4" +dependencies = [ + "bstr", + "itoa", + "jiff", + "thiserror 2.0.18", +] + +[[package]] +name = "gix-diff" +version = "0.50.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "62afb7f4ca0acdf4e9dad92065b2eb1bf2993bcc5014b57bc796e3a365b17c4d" +dependencies = [ + "bstr", + "gix-hash", + "gix-object", + "thiserror 2.0.18", +] + 
+[[package]] +name = "gix-discover" +version = "0.38.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d0c2414bdf04064e0f5a5aa029dfda1e663cf9a6c4bfc8759f2d369299bb65d8" +dependencies = [ + "bstr", + "dunce", + "gix-fs", + "gix-hash", + "gix-path", + "gix-ref", + "gix-sec", + "thiserror 2.0.18", +] + +[[package]] +name = "gix-features" +version = "0.40.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8bfdd4838a8d42bd482c9f0cb526411d003ee94cc7c7b08afe5007329c71d554" +dependencies = [ + "crc32fast", + "flate2", + "gix-hash", + "gix-trace", + "gix-utils", + "libc", + "once_cell", + "prodash", + "sha1_smol", + "thiserror 2.0.18", + "walkdir", +] + +[[package]] +name = "gix-fs" +version = "0.13.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "182e7fa7bfdf44ffb7cfe7451b373cdf1e00870ac9a488a49587a110c562063d" +dependencies = [ + "fastrand", + "gix-features", + "gix-utils", +] + +[[package]] +name = "gix-glob" +version = "0.18.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "4e9c7249fa0a78f9b363aa58323db71e0a6161fd69860ed6f48dedf0ef3a314e" +dependencies = [ + "bitflags", + "bstr", + "gix-features", + "gix-path", +] + +[[package]] +name = "gix-hash" +version = "0.16.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e81c5ec48649b1821b3ed066a44efb95f1a268b35c1d91295e61252539fbe9f8" +dependencies = [ + "faster-hex", + "thiserror 2.0.18", +] + +[[package]] +name = "gix-hashtable" +version = "0.7.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "189130bc372accd02e0520dc5ab1cef318dcc2bc829b76ab8d84bbe90ac212d1" +dependencies = [ + "gix-hash", + "hashbrown 0.14.5", + "parking_lot", +] + +[[package]] +name = "gix-index" +version = "0.38.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "acd12e3626879369310fffe2ac61acc828613ef656b50c4ea984dd59d7dc85d8" 
+dependencies = [ + "bitflags", + "bstr", + "filetime", + "fnv", + "gix-bitmap", + "gix-features", + "gix-fs", + "gix-hash", + "gix-lock", + "gix-object", + "gix-traverse", + "gix-utils", + "gix-validate 0.9.4", + "hashbrown 0.14.5", + "itoa", + "libc", + "memmap2", + "rustix 0.38.44", + "smallvec", + "thiserror 2.0.18", +] + +[[package]] +name = "gix-lock" +version = "16.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9739815270ff6940968441824d162df9433db19211ca9ba8c3fc1b50b849c642" +dependencies = [ + "gix-tempfile", + "gix-utils", + "thiserror 2.0.18", +] + +[[package]] +name = "gix-object" +version = "0.47.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ddc4b3a0044244f0fe22347fb7a79cca165e37829d668b41b85ff46a43e5fd68" +dependencies = [ + "bstr", + "gix-actor", + "gix-date", + "gix-features", + "gix-hash", + "gix-hashtable", + "gix-path", + "gix-utils", + "gix-validate 0.9.4", + "itoa", + "smallvec", + "thiserror 2.0.18", + "winnow 0.6.26", +] + +[[package]] +name = "gix-odb" +version = "0.67.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3e93457df69cd09573608ce9fa4f443fbd84bc8d15d8d83adecd471058459c1b" +dependencies = [ + "arc-swap", + "gix-date", + "gix-features", + "gix-fs", + "gix-hash", + "gix-hashtable", + "gix-object", + "gix-pack", + "gix-path", + "gix-quote", + "parking_lot", + "tempfile", + "thiserror 2.0.18", +] + +[[package]] +name = "gix-pack" +version = "0.57.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "fc13a475b3db735617017fb35f816079bf503765312d4b1913b18cf96f3fa515" +dependencies = [ + "clru", + "gix-chunk", + "gix-features", + "gix-hash", + "gix-hashtable", + "gix-object", + "gix-path", + "memmap2", + "smallvec", + "thiserror 2.0.18", +] + +[[package]] +name = "gix-packetline" +version = "0.18.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"123844a70cf4d5352441dc06bab0da8aef61be94ec239cb631e0ba01dc6d3a04" +dependencies = [ + "bstr", + "faster-hex", + "gix-trace", + "thiserror 2.0.18", +] + +[[package]] +name = "gix-path" +version = "0.10.22" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7cb06c3e4f8eed6e24fd915fa93145e28a511f4ea0e768bae16673e05ed3f366" +dependencies = [ + "bstr", + "gix-trace", + "gix-validate 0.10.1", + "thiserror 2.0.18", +] + +[[package]] +name = "gix-protocol" +version = "0.48.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6c61bd61afc6b67d213241e2100394c164be421e3f7228d3521b04f48ca5ba90" +dependencies = [ + "bstr", + "gix-date", + "gix-features", + "gix-hash", + "gix-ref", + "gix-shallow", + "gix-transport", + "gix-utils", + "maybe-async", + "thiserror 2.0.18", + "winnow 0.6.26", +] + +[[package]] +name = "gix-quote" +version = "0.4.15" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e49357fccdb0c85c0d3a3292a9f6db32d9b3535959b5471bb9624908f4a066c6" +dependencies = [ + "bstr", + "gix-utils", + "thiserror 2.0.18", +] + +[[package]] +name = "gix-ref" +version = "0.50.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "47adf4c5f933429f8554e95d0d92eee583cfe4b95d2bf665cd6fd4a1531ee20c" +dependencies = [ + "gix-actor", + "gix-features", + "gix-fs", + "gix-hash", + "gix-lock", + "gix-object", + "gix-path", + "gix-tempfile", + "gix-utils", + "gix-validate 0.9.4", + "memmap2", + "thiserror 2.0.18", + "winnow 0.6.26", +] + +[[package]] +name = "gix-refspec" +version = "0.28.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "59650228d8f612f68e7f7a25f517fcf386c5d0d39826085492e94766858b0a90" +dependencies = [ + "bstr", + "gix-hash", + "gix-revision", + "gix-validate 0.9.4", + "smallvec", + "thiserror 2.0.18", +] + +[[package]] +name = "gix-revision" +version = "0.32.0" +source = "registry+https://github.com/rust-lang/crates.io-index" 
+checksum = "3fe28bbccca55da6d66e6c6efc6bb4003c29d407afd8178380293729733e6b53" +dependencies = [ + "bitflags", + "bstr", + "gix-commitgraph", + "gix-date", + "gix-hash", + "gix-hashtable", + "gix-object", + "gix-revwalk", + "gix-trace", + "thiserror 2.0.18", +] + +[[package]] +name = "gix-revwalk" +version = "0.18.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d4ecb80c235b1e9ef2b99b23a81ea50dd569a88a9eb767179793269e0e616247" +dependencies = [ + "gix-commitgraph", + "gix-date", + "gix-hash", + "gix-hashtable", + "gix-object", + "smallvec", + "thiserror 2.0.18", +] + +[[package]] +name = "gix-sec" +version = "0.10.12" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "47aeb0f13de9ef2f3033f5ff218de30f44db827ac9f1286f9ef050aacddd5888" +dependencies = [ + "bitflags", + "gix-path", + "libc", + "windows-sys 0.52.0", +] + +[[package]] +name = "gix-shallow" +version = "0.2.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ab72543011e303e52733c85bef784603ef39632ddf47f69723def52825e35066" +dependencies = [ + "bstr", + "gix-hash", + "gix-lock", + "thiserror 2.0.18", +] + +[[package]] +name = "gix-tempfile" +version = "16.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2558f423945ef24a8328c55d1fd6db06b8376b0e7013b1bb476cc4ffdf678501" +dependencies = [ + "gix-fs", + "libc", + "once_cell", + "parking_lot", + "tempfile", +] + +[[package]] +name = "gix-trace" +version = "0.1.19" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6f23569e55f2ffaf958617353b9734a7d52a7c19c439eeaa5e3efc217fd2270e" + +[[package]] +name = "gix-transport" +version = "0.45.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "11187418489477b1b5b862ae1aedbbac77e582f2c4b0ef54280f20cfe5b964d9" +dependencies = [ + "bstr", + "gix-command", + "gix-features", + "gix-packetline", + "gix-quote", + "gix-sec", + "gix-url", + 
"thiserror 2.0.18", +] + +[[package]] +name = "gix-traverse" +version = "0.44.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2bec70e53896586ef32a3efa7e4427b67308531ed186bb6120fb3eca0f0d61b4" +dependencies = [ + "bitflags", + "gix-commitgraph", + "gix-date", + "gix-hash", + "gix-hashtable", + "gix-object", + "gix-revwalk", + "smallvec", + "thiserror 2.0.18", +] + +[[package]] +name = "gix-url" +version = "0.29.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "29218c768b53dd8f116045d87fec05b294c731a4b2bdd257eeca2084cc150b13" +dependencies = [ + "bstr", + "gix-features", + "gix-path", + "percent-encoding", + "thiserror 2.0.18", + "url", +] + +[[package]] +name = "gix-utils" +version = "0.1.14" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ff08f24e03ac8916c478c8419d7d3c33393da9bb41fa4c24455d5406aeefd35f" +dependencies = [ + "fastrand", + "unicode-normalization", +] + +[[package]] +name = "gix-validate" +version = "0.9.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "34b5f1253109da6c79ed7cf6e1e38437080bb6d704c76af14c93e2f255234084" +dependencies = [ + "bstr", + "thiserror 2.0.18", +] + +[[package]] +name = "gix-validate" +version = "0.10.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5b1e63a5b516e970a594f870ed4571a8fdcb8a344e7bd407a20db8bd61dbfde4" +dependencies = [ + "bstr", + "thiserror 2.0.18", +] + [[package]] name = "glob" version = "0.3.3" @@ -3737,6 +4339,16 @@ dependencies = [ "unicode-normalization", ] +[[package]] +name = "kebab-parse-code" +version = "0.6.0" +dependencies = [ + "anyhow", + "gix", + "kebab-core", + "tempfile", +] + [[package]] name = "kebab-parse-image" version = "0.6.0" @@ -4846,6 +5458,17 @@ dependencies = [ "thread-tree", ] +[[package]] +name = "maybe-async" +version = "0.2.11" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"746873a384ad60adc5db74471dfaba74bd278afbdcfd81db93fafcdfc8b5ca0c" +dependencies = [ + "proc-macro2", + "quote", + "syn 2.0.117", +] + [[package]] name = "maybe-rayon" version = "0.1.1" @@ -5702,6 +6325,16 @@ dependencies = [ "unicode-ident", ] +[[package]] +name = "prodash" +version = "29.0.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f04bb108f648884c23b98a0e940ebc2c93c0c3b89f04dbaf7eb8256ce617d1bc" +dependencies = [ + "log", + "parking_lot", +] + [[package]] name = "profiling" version = "1.0.17" @@ -6841,6 +7474,12 @@ dependencies = [ "unsafe-libyaml", ] +[[package]] +name = "sha1_smol" +version = "1.0.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "bbfa15b3dddfee50a0fff136974b3e1bde555604ba463834a7eb7deb6417705d" + [[package]] name = "sha2" version = "0.10.9" @@ -6861,6 +7500,12 @@ dependencies = [ "lazy_static", ] +[[package]] +name = "shell-words" +version = "1.1.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "dc6fe69c597f9c37bfeeeeeb33da3530379845f10be461a66d16d03eca2ded77" + [[package]] name = "shellexpand" version = "3.1.2" @@ -7889,6 +8534,12 @@ version = "2.9.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "dbc4bc3a9f746d862c45cb89d705aa10f187bb96c76001afab07a0d35ce60142" +[[package]] +name = "unicode-bom" +version = "2.0.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7eec5d1121208364f6793f7d2e222bf75a915c19557537745b195b253dd64217" + [[package]] name = "unicode-ident" version = "1.0.24" @@ -8587,6 +9238,15 @@ version = "0.53.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "d6bbff5f0aada427a1e5a6da5f1f98158182f26556f345ac9e04d36d0ebed650" +[[package]] +name = "winnow" +version = "0.6.26" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1e90edd2ac1aa278a5c4599b1d89cf03074b610800f866d4026dc199d7929a28" +dependencies = [ + 
"memchr", +] + [[package]] name = "winnow" version = "0.7.15" diff --git a/Cargo.toml b/Cargo.toml index 3731f22..7b2527c 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -23,6 +23,7 @@ members = [ "crates/kebab-parse-pdf", "crates/kebab-tui", "crates/kebab-mcp", + "crates/kebab-parse-code", ] [workspace.package] @@ -81,6 +82,10 @@ rmcp = { version = "1.6", default-features = false, features = ["server" # sync via reqwest::blocking — wiremock is dev-only there). wiremock = "0.6" base64 = "0.22" +# Pure-Rust git library for repo metadata detection (kebab-parse-code). +# No `git` binary required. Default features include thread-safety + most +# object-reading capabilities needed for HEAD name + commit SHA queries. +gix = { version = "0.70", default-features = false, features = ["revision"] } # Disk-footprint trim for dev / test builds. Codegen, opt-level, and # behavior are unchanged — only DWARF debug info is reduced (line diff --git a/crates/kebab-parse-code/Cargo.toml b/crates/kebab-parse-code/Cargo.toml new file mode 100644 index 0000000..ac76da0 --- /dev/null +++ b/crates/kebab-parse-code/Cargo.toml @@ -0,0 +1,13 @@ +[package] +name = "kebab-parse-code" +version = { workspace = true } +edition = { workspace = true } +license = { workspace = true } + +[dependencies] +anyhow = { workspace = true } +gix = { workspace = true } +kebab-core = { path = "../kebab-core" } + +[dev-dependencies] +tempfile = { workspace = true } diff --git a/crates/kebab-parse-code/src/lang.rs b/crates/kebab-parse-code/src/lang.rs new file mode 100644 index 0000000..bd850f6 --- /dev/null +++ b/crates/kebab-parse-code/src/lang.rs @@ -0,0 +1,42 @@ +//! Canonical extension → language identifier mapping (spec §3.5). +//! +//! Lowercase canonical identifiers, matching tree-sitter parser conventions: +//! `rust`, `python`, `typescript`, `javascript`, `go`, `java`, `kotlin`, `c`, +//! `cpp`, `yaml`, `toml`, `json`, `shell`, `make`, `dockerfile`. 
+ +use std::path::Path; + +/// Returns the canonical language identifier for a given file path, or +/// `None` if the extension / filename is not recognized. +/// +/// Matching priority: +/// 1. exact filename match (e.g. `Dockerfile`, `Makefile`) +/// 2. lowercase extension match +pub fn code_lang_for_path(path: &Path) -> Option<&'static str> { + if let Some(name) = path.file_name().and_then(|n| n.to_str()) { + match name { + "Dockerfile" => return Some("dockerfile"), + "Makefile" | "GNUmakefile" => return Some("make"), + _ => {} + } + } + let ext = path.extension()?.to_str()?.to_ascii_lowercase(); + match ext.as_str() { + "rs" => Some("rust"), + "py" | "pyi" => Some("python"), + "ts" | "tsx" => Some("typescript"), + "js" | "mjs" | "cjs" | "jsx" => Some("javascript"), + "go" => Some("go"), + "java" => Some("java"), + "kt" | "kts" => Some("kotlin"), + "c" | "h" => Some("c"), + "cpp" | "cc" | "cxx" | "hpp" | "hh" | "hxx" => Some("cpp"), + "yaml" | "yml" => Some("yaml"), + "toml" => Some("toml"), + "json" => Some("json"), + "sh" | "bash" | "zsh" => Some("shell"), + "mk" => Some("make"), + "dockerfile" => Some("dockerfile"), + _ => None, + } +} diff --git a/crates/kebab-parse-code/src/lib.rs b/crates/kebab-parse-code/src/lib.rs new file mode 100644 index 0000000..3b699c1 --- /dev/null +++ b/crates/kebab-parse-code/src/lib.rs @@ -0,0 +1,22 @@ +//! `kebab-parse-code` — language-aware parsing for code corpora. +//! +//! Phase 1A-1 ships infrastructure only: +//! +//! - [`lang::code_lang_for_path`] — extension → language identifier. +//! - [`repo::detect_repo`] — `.git/` walk-up → repo / branch / commit metadata. +//! - [`skip::is_generated_file`] / [`skip::is_oversized`] — pre-ingest skip +//! helpers consulted by `kebab-source-fs`. +//! - [`skip::BUILTIN_BLACKLIST`] — 6-entry safety-net pattern list. +//! +//! Per-language parser modules (`rust`, `python`, `typescript`, …) land in +//! later phases (1A-2 onwards). The crate boundary follows other +//! 
`kebab-parse-*` crates per design §8: must NOT depend on store / embed +//! / llm / rag. + +pub mod lang; +pub mod repo; +pub mod skip; + +pub use lang::code_lang_for_path; +pub use repo::{RepoMeta, detect_repo}; +pub use skip::{BUILTIN_BLACKLIST, is_generated_file, is_oversized}; diff --git a/crates/kebab-parse-code/src/repo.rs b/crates/kebab-parse-code/src/repo.rs new file mode 100644 index 0000000..6798fbe --- /dev/null +++ b/crates/kebab-parse-code/src/repo.rs @@ -0,0 +1,61 @@ +//! Git repo auto-detection (spec §5.1). +//! +//! Walks up from `path` looking for a `.git/` directory. If found, reads +//! repo dir name, current branch, and HEAD commit using `gix` (pure Rust; +//! no `git` binary on PATH required). + +use std::path::Path; + +#[derive(Clone, Debug, PartialEq, Eq)] +pub struct RepoMeta { + pub name: String, + pub branch: Option<String>, + pub commit: Option<String>, +} + +/// Walk up from `path` until a `.git/` directory is found. Returns repo +/// metadata, or `None` if no repo boundary is reached before the filesystem +/// root. +/// +/// - `name`: directory name containing `.git/`. +/// - `branch`: current HEAD branch, or `"detached"` if detached HEAD, or +/// `None` if branch can't be read. +/// - `commit`: 40-hex commit SHA at HEAD, or `None` if empty repo / read +/// failure. +/// +/// `.git/` as a file (worktree marker / submodule) returns `None` for +/// `branch` and `commit` and falls back to the parent dir name for `name`. +pub fn detect_repo(path: &Path) -> Option<RepoMeta> { + let mut cur = if path.is_dir() { path } else { path.parent()?
}; + loop { + let dotgit = cur.join(".git"); + if dotgit.is_dir() { + let name = cur.file_name()?.to_string_lossy().into_owned(); + let (branch, commit) = read_head(cur); + return Some(RepoMeta { name, branch, commit }); + } else if dotgit.is_file() { + let name = cur.file_name()?.to_string_lossy().into_owned(); + return Some(RepoMeta { name, branch: None, commit: None }); + } + cur = cur.parent()?; + } +} + +fn read_head(repo_dir: &Path) -> (Option<String>, Option<String>) { + match gix::open(repo_dir) { + Ok(repo) => { + let branch = repo + .head_name() + .ok() + .flatten() + .map(|n| n.shorten().to_string()) + .or_else(|| Some("detached".to_string())); + let commit = repo + .head_id() + .ok() + .map(|id| id.to_string()); + (branch, commit) + } + Err(_) => (None, None), + } +} diff --git a/crates/kebab-parse-code/src/skip.rs b/crates/kebab-parse-code/src/skip.rs new file mode 100644 index 0000000..eafecf8 --- /dev/null +++ b/crates/kebab-parse-code/src/skip.rs @@ -0,0 +1,65 @@ +//! Pre-ingest skip helpers (spec §5.2 + §5.3 + §5.4). +//! +//! - [`BUILTIN_BLACKLIST`] — 6 gitignore-style patterns universal across +//! ecosystems. Source of truth: spec §5.2. +//! - [`is_generated_file`] — reads first ~512 bytes, checks for 7 +//! case-insensitive markers. +//! - [`is_oversized`] — byte cap then line cap. + +use anyhow::Result; +use std::fs::File; +use std::io::{BufRead, BufReader, Read}; +use std::path::Path; + +/// 6 built-in gitignore-style patterns. Applied in addition to `.gitignore` +/// + `.kebabignore`. User can override via `.kebabignore` negation +/// (`!pattern`). +pub const BUILTIN_BLACKLIST: &[&str] = &[ + "**/node_modules/**", + "**/target/**", + "**/__pycache__/**", + "**/.venv/**", + "**/venv/**", + "**/env/**", +]; + +/// Read first 512 bytes, check for any of 7 case-insensitive generated-file +/// markers. Returns Ok(true) on match, Ok(false) otherwise.
+pub fn is_generated_file(path: &Path) -> Result<bool> { + let mut buf = [0u8; 512]; + let mut f = File::open(path)?; + let n = f.read(&mut buf)?; + if n == 0 { + return Ok(false); + } + let head = std::str::from_utf8(&buf[..n]).unwrap_or(""); + let lower: String = head.lines().take(10).collect::<Vec<_>>().join("\n").to_ascii_lowercase(); + Ok( + lower.contains("@generated") + || lower.contains("code generated by") + || lower.contains("do not edit") + || lower.contains("do not modify") + || lower.contains("automatically generated") + || lower.contains("auto-generated") + || lower.contains("autogenerated"), + ) +} + +/// Check if `path` exceeds `max_bytes` or `max_lines`. Byte cap first +/// (cheap), then line cap (streaming with early exit). +pub fn is_oversized(path: &Path, max_bytes: u64, max_lines: u32) -> Result<bool> { + let meta = std::fs::metadata(path)?; + if meta.len() > max_bytes { + return Ok(true); + } + let reader = BufReader::new(File::open(path)?); + let mut count: u32 = 0; + for line in reader.lines() { + let _ = line?; + count = count.saturating_add(1); + if count > max_lines { + return Ok(true); + } + } + Ok(false) +} diff --git a/crates/kebab-parse-code/tests/lang.rs b/crates/kebab-parse-code/tests/lang.rs new file mode 100644 index 0000000..73a1551 --- /dev/null +++ b/crates/kebab-parse-code/tests/lang.rs @@ -0,0 +1,64 @@ +use kebab_parse_code::code_lang_for_path; +use std::path::Path; + +#[test] +fn known_extensions_map_to_canonical_identifiers() { + let cases = [ + ("foo.rs", Some("rust")), + ("foo.py", Some("python")), + ("foo.pyi", Some("python")), + ("foo.ts", Some("typescript")), + ("foo.tsx", Some("typescript")), + ("foo.js", Some("javascript")), + ("foo.mjs", Some("javascript")), + ("foo.cjs", Some("javascript")), + ("foo.jsx", Some("javascript")), + ("foo.go", Some("go")), + ("foo.java", Some("java")), + ("foo.kt", Some("kotlin")), + ("foo.kts", Some("kotlin")), + ("foo.c", Some("c")), + ("foo.h", Some("c")), + ("foo.cpp", Some("cpp")), + ("foo.cc",
Some("cpp")), + ("foo.cxx", Some("cpp")), + ("foo.hpp", Some("cpp")), + ("foo.hh", Some("cpp")), + ("foo.hxx", Some("cpp")), + ("foo.yaml", Some("yaml")), + ("foo.yml", Some("yaml")), + ("foo.toml", Some("toml")), + ("foo.json", Some("json")), + ("foo.sh", Some("shell")), + ("foo.bash", Some("shell")), + ("foo.zsh", Some("shell")), + ("foo.mk", Some("make")), + ]; + for (path, expected) in cases { + assert_eq!( + code_lang_for_path(Path::new(path)), + expected, + "path = {path}" + ); + } +} + +#[test] +fn special_filenames_map_to_identifiers() { + assert_eq!(code_lang_for_path(Path::new("Dockerfile")), Some("dockerfile")); + assert_eq!(code_lang_for_path(Path::new("foo.dockerfile")), Some("dockerfile")); + assert_eq!(code_lang_for_path(Path::new("Makefile")), Some("make")); +} + +#[test] +fn unknown_extension_returns_none() { + assert_eq!(code_lang_for_path(Path::new("foo.docx")), None); + assert_eq!(code_lang_for_path(Path::new("foo")), None); + assert_eq!(code_lang_for_path(Path::new("foo.unknown")), None); +} + +#[test] +fn case_insensitive() { + assert_eq!(code_lang_for_path(Path::new("Foo.RS")), Some("rust")); + assert_eq!(code_lang_for_path(Path::new("FOO.YAML")), Some("yaml")); +} diff --git a/crates/kebab-parse-code/tests/repo.rs b/crates/kebab-parse-code/tests/repo.rs new file mode 100644 index 0000000..68365a1 --- /dev/null +++ b/crates/kebab-parse-code/tests/repo.rs @@ -0,0 +1,62 @@ +use kebab_parse_code::repo::detect_repo; +use std::fs; +use std::process::Command; +use tempfile::TempDir; + +fn init_git_repo(root: &std::path::Path) { + let run = |args: &[&str]| { + Command::new("git") + .args(args) + .current_dir(root) + .status() + .expect("git command failed"); + }; + run(&["init", "-q"]); + run(&["config", "user.email", "test@test"]); + run(&["config", "user.name", "test"]); + fs::write(root.join("README.md"), "hi").unwrap(); + run(&["add", "README.md"]); + run(&["commit", "-q", "-m", "init"]); +} + +#[test] +fn detect_repo_returns_none_outside_git() 
{ + let tmp = TempDir::new().unwrap(); + let nested = tmp.path().join("a/b/c.txt"); + fs::create_dir_all(nested.parent().unwrap()).unwrap(); + fs::write(&nested, "x").unwrap(); + assert!(detect_repo(&nested).is_none()); +} + +#[test] +fn detect_repo_walks_up_to_git_dir() { + let tmp = TempDir::new().unwrap(); + let repo_root = tmp.path().join("myrepo"); + fs::create_dir_all(&repo_root).unwrap(); + init_git_repo(&repo_root); + let nested = repo_root.join("src/deep/file.rs"); + fs::create_dir_all(nested.parent().unwrap()).unwrap(); + fs::write(&nested, "x").unwrap(); + + let meta = detect_repo(&nested).expect("should detect repo"); + assert_eq!(meta.name, "myrepo"); + assert!(meta.branch.is_some()); + assert!(meta.commit.is_some()); + assert_eq!(meta.commit.as_ref().unwrap().len(), 40); +} + +#[test] +fn detect_repo_caches_per_path_call_for_repeated_files_in_same_repo() { + let tmp = TempDir::new().unwrap(); + let repo_root = tmp.path().join("myrepo"); + fs::create_dir_all(&repo_root).unwrap(); + init_git_repo(&repo_root); + let f1 = repo_root.join("a.rs"); + let f2 = repo_root.join("b.rs"); + fs::write(&f1, "x").unwrap(); + fs::write(&f2, "x").unwrap(); + let m1 = detect_repo(&f1).unwrap(); + let m2 = detect_repo(&f2).unwrap(); + assert_eq!(m1.name, m2.name); + assert_eq!(m1.commit, m2.commit); +} diff --git a/crates/kebab-parse-code/tests/skip.rs b/crates/kebab-parse-code/tests/skip.rs new file mode 100644 index 0000000..b85dafe --- /dev/null +++ b/crates/kebab-parse-code/tests/skip.rs @@ -0,0 +1,74 @@ +use kebab_parse_code::skip::{BUILTIN_BLACKLIST, is_generated_file, is_oversized}; +use std::fs; +use tempfile::NamedTempFile; + +#[test] +fn generated_header_markers_trigger_skip() { + let cases = [ + "// @generated\nfn foo() {}\n", + "// Code generated by tonic-build. 
DO NOT EDIT.\nfn x() {}\n", + "/* DO NOT EDIT */\nfn x() {}\n", + "/* do not modify */\nfn x() {}\n", + "// AUTOMATICALLY GENERATED\nfn x() {}\n", + "# auto-generated\ndef x(): pass\n", + "// autogenerated\nfn x() {}\n", + ]; + for content in cases { + let f = NamedTempFile::new().unwrap(); + fs::write(f.path(), content).unwrap(); + assert!(is_generated_file(f.path()).unwrap(), "content: {content:?}"); + } +} + +#[test] +fn normal_code_is_not_flagged_generated() { + let f = NamedTempFile::new().unwrap(); + fs::write(f.path(), "fn main() {\n println!(\"hi\");\n}\n").unwrap(); + assert!(!is_generated_file(f.path()).unwrap()); +} + +#[test] +fn is_generated_returns_false_for_empty_file() { + let f = NamedTempFile::new().unwrap(); + fs::write(f.path(), "").unwrap(); + assert!(!is_generated_file(f.path()).unwrap()); +} + +#[test] +fn oversized_by_bytes_returns_true() { + let f = NamedTempFile::new().unwrap(); + let body: String = "x".repeat(300_000); + fs::write(f.path(), &body).unwrap(); + assert!(is_oversized(f.path(), 262_144, 5_000).unwrap()); +} + +#[test] +fn oversized_by_lines_returns_true() { + let f = NamedTempFile::new().unwrap(); + let body: String = "x\n".repeat(6_000); + fs::write(f.path(), &body).unwrap(); + assert!(is_oversized(f.path(), 262_144, 5_000).unwrap()); +} + +#[test] +fn small_file_returns_false_for_oversize() { + let f = NamedTempFile::new().unwrap(); + fs::write(f.path(), "fn foo() {}\n").unwrap(); + assert!(!is_oversized(f.path(), 262_144, 5_000).unwrap()); +} + +#[test] +fn builtin_blacklist_has_exactly_six_entries() { + assert_eq!(BUILTIN_BLACKLIST.len(), 6); + let expected = [ + "**/node_modules/**", + "**/target/**", + "**/__pycache__/**", + "**/.venv/**", + "**/venv/**", + "**/env/**", + ]; + for pat in expected { + assert!(BUILTIN_BLACKLIST.contains(&pat), "missing pattern: {pat}"); + } +} -- 2.49.1 From 2a8451c033dfc97bc8f0fdaf9630f5828b75bf9a Mon Sep 17 00:00:00 2001 From: th-kim0823 Date: Fri, 15 May 2026 16:05:34 +0900 Subject: 
[PATCH 10/21] fix(p10-1a-1): tighten kebab-parse-code manifest + tests Co-Authored-By: Claude Sonnet 4.6 --- crates/kebab-parse-code/Cargo.toml | 4 +++- crates/kebab-parse-code/tests/lang.rs | 1 + crates/kebab-parse-code/tests/repo.rs | 2 +- 3 files changed, 5 insertions(+), 2 deletions(-) diff --git a/crates/kebab-parse-code/Cargo.toml b/crates/kebab-parse-code/Cargo.toml index ac76da0..ad6aef5 100644 --- a/crates/kebab-parse-code/Cargo.toml +++ b/crates/kebab-parse-code/Cargo.toml @@ -2,12 +2,14 @@ name = "kebab-parse-code" version = { workspace = true } edition = { workspace = true } +rust-version = { workspace = true } license = { workspace = true } +repository = { workspace = true } +description = "Language-aware code parsing infrastructure (lang dispatch, .git/ detect, skip helpers) for the kebab pipeline (P10-1A-1)" [dependencies] anyhow = { workspace = true } gix = { workspace = true } -kebab-core = { path = "../kebab-core" } [dev-dependencies] tempfile = { workspace = true } diff --git a/crates/kebab-parse-code/tests/lang.rs b/crates/kebab-parse-code/tests/lang.rs index 73a1551..f7db0a9 100644 --- a/crates/kebab-parse-code/tests/lang.rs +++ b/crates/kebab-parse-code/tests/lang.rs @@ -48,6 +48,7 @@ fn special_filenames_map_to_identifiers() { assert_eq!(code_lang_for_path(Path::new("Dockerfile")), Some("dockerfile")); assert_eq!(code_lang_for_path(Path::new("foo.dockerfile")), Some("dockerfile")); assert_eq!(code_lang_for_path(Path::new("Makefile")), Some("make")); + assert_eq!(code_lang_for_path(Path::new("GNUmakefile")), Some("make")); } #[test] diff --git a/crates/kebab-parse-code/tests/repo.rs b/crates/kebab-parse-code/tests/repo.rs index 68365a1..0c11133 100644 --- a/crates/kebab-parse-code/tests/repo.rs +++ b/crates/kebab-parse-code/tests/repo.rs @@ -46,7 +46,7 @@ fn detect_repo_walks_up_to_git_dir() { } #[test] -fn detect_repo_caches_per_path_call_for_repeated_files_in_same_repo() { +fn detect_repo_returns_consistent_metadata_for_paths_in_same_repo() 
{ let tmp = TempDir::new().unwrap(); let repo_root = tmp.path().join("myrepo"); fs::create_dir_all(&repo_root).unwrap(); -- 2.49.1 From 69d1593bc53d681f8399de1d26dda2121e7a7b05 Mon Sep 17 00:00:00 2001 From: th-kim0823 Date: Fri, 15 May 2026 16:13:39 +0900 Subject: [PATCH 11/21] feat(p10-1a-1): integrate built-in blacklist into walker overrides Wires `kebab_parse_code::BUILTIN_BLACKLIST` (6 patterns: node_modules, target, __pycache__, .venv, venv, env) into `build_overrides()` so the walker automatically excludes these directories even when the user has no `.kebabignore`. TDD cycle: 2 failing tests added first, then the pattern-add loop inserted after the existing kbignore block. Co-Authored-By: Claude Sonnet 4.6 --- crates/kebab-source-fs/Cargo.toml | 1 + crates/kebab-source-fs/src/walker.rs | 59 ++++++++++++++++++++++++++++ 2 files changed, 60 insertions(+) diff --git a/crates/kebab-source-fs/Cargo.toml b/crates/kebab-source-fs/Cargo.toml index 7429315..5c976cd 100644 --- a/crates/kebab-source-fs/Cargo.toml +++ b/crates/kebab-source-fs/Cargo.toml @@ -10,6 +10,7 @@ description = "Local filesystem SourceConnector — walks workspace.root + app [dependencies] kebab-core = { path = "../kebab-core" } kebab-config = { path = "../kebab-config" } +kebab-parse-code = { path = "../kebab-parse-code" } anyhow = { workspace = true } serde = { workspace = true } time = { workspace = true } diff --git a/crates/kebab-source-fs/src/walker.rs b/crates/kebab-source-fs/src/walker.rs index 1ba14fa..529963d 100644 --- a/crates/kebab-source-fs/src/walker.rs +++ b/crates/kebab-source-fs/src/walker.rs @@ -76,6 +76,14 @@ pub(crate) fn build_overrides( .with_context(|| format!("invalid .kebabignore pattern: {pat}"))?; } + // p10-1A-1: built-in safety-net blacklist (spec §5.2). 6 patterns + // universal across ecosystems. User can negate via `.kebabignore`. 
+ for pat in kebab_parse_code::BUILTIN_BLACKLIST { + builder + .add(&format!("!{pat}")) + .with_context(|| format!("built-in blacklist pattern: {pat}"))?; + } + builder.build().context("failed to compile override set") } @@ -257,4 +265,55 @@ mod tests { let v = read_kbignore(dir.path()).unwrap(); assert_eq!(v, vec!["*.tmp".to_string(), "ignored/**".to_string()]); } + + #[test] + fn built_in_blacklist_excludes_node_modules() { + use std::fs; + use tempfile::TempDir; + + let tmp = TempDir::new().unwrap(); + let root = tmp.path(); + fs::create_dir_all(root.join("src")).unwrap(); + fs::create_dir_all(root.join("node_modules/foo")).unwrap(); + fs::write(root.join("src/main.rs"), "x").unwrap(); + fs::write(root.join("node_modules/foo/bar.js"), "x").unwrap(); + + let overrides = build_overrides(root, &[], &[]).unwrap(); + // Override::matched expects paths relative to the builder's root. + let m_in = overrides.matched(Path::new("src/main.rs"), false); + let m_out = overrides.matched(Path::new("node_modules/foo/bar.js"), false); + + assert!(!m_in.is_ignore(), "src/main.rs should NOT be ignored"); + assert!(m_out.is_ignore(), "node_modules/foo/bar.js SHOULD be ignored"); + } + + #[test] + fn built_in_blacklist_excludes_target_pycache_venv() { + use std::fs; + use tempfile::TempDir; + + let tmp = TempDir::new().unwrap(); + let root = tmp.path(); + for dir in ["target/x", "__pycache__/x", ".venv/x", "venv/x", "env/x"] { + fs::create_dir_all(root.join(dir)).unwrap(); + fs::write(root.join(dir).join("y.txt"), "z").unwrap(); + } + fs::create_dir_all(root.join("ok")).unwrap(); + fs::write(root.join("ok/z.txt"), "z").unwrap(); + + let overrides = build_overrides(root, &[], &[]).unwrap(); + // Override::matched expects paths relative to the builder's root. 
+ for blacklisted in [ + "target/x/y.txt", + "__pycache__/x/y.txt", + ".venv/x/y.txt", + "venv/x/y.txt", + "env/x/y.txt", + ] { + let m = overrides.matched(Path::new(blacklisted), false); + assert!(m.is_ignore(), "{blacklisted} should be ignored"); + } + let m_ok = overrides.matched(Path::new("ok/z.txt"), false); + assert!(!m_ok.is_ignore(), "ok/z.txt should not be ignored"); + } } -- 2.49.1 From abfdcbd31dc3e9c17febeb82a466c573c214a4a6 Mon Sep 17 00:00:00 2001 From: th-kim0823 Date: Fri, 15 May 2026 16:25:19 +0900 Subject: [PATCH 12/21] feat(p10-1a-1): honor repo-root .gitignore in walker overrides Adds read_gitignore() (pub(crate), root-only, nested cascade deferred) and merges its patterns as a 5th group in build_overrides(). Trailing-slash patterns (dist/) are normalized to also emit a stem/** glob so files inside the directory are matched when is_dir=false. Two new tests cover both the happy path and the missing-file no-op. Co-Authored-By: Claude Sonnet 4.6 --- crates/kebab-source-fs/src/walker.rs | 82 ++++++++++++++++++++++++++++ 1 file changed, 82 insertions(+) diff --git a/crates/kebab-source-fs/src/walker.rs b/crates/kebab-source-fs/src/walker.rs index 529963d..bdda967 100644 --- a/crates/kebab-source-fs/src/walker.rs +++ b/crates/kebab-source-fs/src/walker.rs @@ -84,9 +84,54 @@ pub(crate) fn build_overrides( .with_context(|| format!("built-in blacklist pattern: {pat}"))?; } + // p10-1A-1: honor repo-root `.gitignore` (spec §5.2). Read once, + // merge with same convention as user `.kebabignore`. Nested + // cascade deferred to P+. + let gitignore_patterns = read_gitignore(root)?; + for pat in &gitignore_patterns { + builder + .add(&format!("!{pat}")) + .with_context(|| format!(".gitignore pattern: {pat}"))?; + } + builder.build().context("failed to compile override set") } +/// Read `<root>/.gitignore` (single-file, root-only — nested cascade is P+). +/// Missing file → empty Vec. Comments / blanks stripped.
+/// +/// Trailing-slash patterns (`dist/`) in real gitignore mean "match the +/// directory AND everything inside it". `OverrideBuilder::matched(path, +/// is_dir=false)` only checks `is_dir` for the trailing-slash variant, so +/// `dist/bundle.js` would not be matched. We normalize by also emitting a +/// `<stem>/**` variant so files inside the directory are caught. +pub(crate) fn read_gitignore(root: &Path) -> Result<Vec<String>> { + let p = root.join(".gitignore"); + if !p.exists() { + return Ok(vec![]); + } + let s = std::fs::read_to_string(&p) + .with_context(|| format!("read .gitignore at {}", p.display()))?; + let mut out = Vec::new(); + for line in s.lines() { + let trimmed = line.trim(); + if trimmed.is_empty() || trimmed.starts_with('#') { + continue; + } + if let Some(stem) = trimmed.strip_suffix('/') { + // Keep the dir-only form so `is_dir=true` matches are still + // excluded (e.g., for skip_current_dir in the walker). + out.push(trimmed.to_string()); + // Also emit a glob that catches files inside the directory, + // since `is_dir=false` won't satisfy the trailing-slash form. + out.push(format!("{stem}/**")); + } else { + out.push(trimmed.to_string()); + } + } + Ok(out) +} + /// Read `<root>/.kebabignore` if it exists. Each non-blank, non-comment line is /// a gitignore pattern. Missing file → empty Vec (not an error).
pub(crate) fn read_kbignore(root: &Path) -> Result<Vec<String>> { @@ -316,4 +361,41 @@ mod tests { let m_ok = overrides.matched(Path::new("ok/z.txt"), false); assert!(!m_ok.is_ignore(), "ok/z.txt should not be ignored"); } + + #[test] + fn gitignore_at_repo_root_excludes_matching_files() { + use std::fs; + use tempfile::TempDir; + + let tmp = TempDir::new().unwrap(); + let root = tmp.path(); + fs::create_dir_all(root.join("src")).unwrap(); + fs::write(root.join(".gitignore"), "*.log\ndist/\n").unwrap(); + fs::write(root.join("a.log"), "x").unwrap(); + fs::write(root.join("src/main.rs"), "x").unwrap(); + fs::create_dir_all(root.join("dist")).unwrap(); + fs::write(root.join("dist/bundle.js"), "x").unwrap(); + + let overrides = build_overrides(root, &[], &[]).unwrap(); + assert!(overrides.matched(Path::new("a.log"), false).is_ignore()); + assert!(overrides.matched(Path::new("dist/bundle.js"), false).is_ignore()); + assert!(!overrides.matched(Path::new("src/main.rs"), false).is_ignore()); + } + + #[test] + fn gitignore_missing_is_no_op() { + use std::fs; + use tempfile::TempDir; + + let tmp = TempDir::new().unwrap(); + let root = tmp.path(); + fs::write(root.join("a.log"), "x").unwrap(); + fs::create_dir_all(root.join("src")).unwrap(); + fs::write(root.join("src/main.rs"), "x").unwrap(); + + // No .gitignore present — patterns from .gitignore should not affect overrides. + let overrides = build_overrides(root, &[], &[]).unwrap(); + assert!(!overrides.matched(Path::new("a.log"), false).is_ignore()); + assert!(!overrides.matched(Path::new("src/main.rs"), false).is_ignore()); + } } -- 2.49.1 From 8bbe25dc1030f4ac74e1c38688109968a1e0e1bd Mon Sep 17 00:00:00 2001 From: th-kim0823 Date: Fri, 15 May 2026 16:30:00 +0900 Subject: [PATCH 13/21] fix(p10-1a-1): guard .gitignore negation + sync doc comments Prevent double-`!` corruption when a `.gitignore` negation pattern (e.g. `!keep/`) hits the trailing-slash normalizer in `read_gitignore`.
Also updates module-level and `build_overrides` doc to list all five filter sources in application order, and adds a regression test. Co-Authored-By: Claude Sonnet 4.6 --- crates/kebab-source-fs/src/walker.rs | 40 +++++++++++++++++++++++----- 1 file changed, 33 insertions(+), 7 deletions(-) diff --git a/crates/kebab-source-fs/src/walker.rs b/crates/kebab-source-fs/src/walker.rs index bdda967..0d18baa 100644 --- a/crates/kebab-source-fs/src/walker.rs +++ b/crates/kebab-source-fs/src/walker.rs @@ -1,12 +1,15 @@ //! Directory walker with gitignore-style filtering and symlink-cycle //! protection. //! -//! Filter set (per task spec, design §6.2): -//! - `config.workspace.exclude` (passed in by `FsSourceConnector`) -//! - `<root>/.kebabignore` (optional file at workspace root) -//! - default-excludes for `.DS_Store` and macOS resource forks (`._*`) +//! Filter set, in order of application: +//! - DEFAULT_EXCLUDES (constants — VCS dirs, build artifacts, never-useful) +//! - `config.workspace.exclude` (user-supplied per workspace) +//! - `<root>/.kebabignore` (user-supplied kebab-specific exclude) +//! - Built-in safety-net blacklist (`node_modules/`, `target/`, etc. — +//! spec §5.2, applied via `kebab_parse_code::BUILTIN_BLACKLIST`) +//! - `<root>/.gitignore` (repo-root only, no nested cascade — spec §5.2) //! -//! All three are merged via `ignore::overrides::OverrideBuilder`, which +//! All five are merged via `ignore::overrides::OverrideBuilder`, which //! gives full gitignore semantics (anchors, `!` negation, `**`, etc.). We //! prepend `!` to each pattern because `OverrideBuilder` treats positive //! patterns as "include" and negative as "exclude" — see §"Filter set" @@ -46,8 +49,10 @@ const DEFAULT_EXCLUDES: &[&str] = &[ "**/._*", ]; -/// Build the merged `Override` from `config.workspace.exclude` ∪ `.kebabignore` -/// ∪ baked-in default excludes.
+/// Build the merged `Override` from all five filter sources, in order: +/// DEFAULT_EXCLUDES, `config.workspace.exclude`, `.kebabignore`, +/// built-in safety-net blacklist (`kebab_parse_code::BUILTIN_BLACKLIST`), +/// and `<root>/.gitignore` (root-only, no nested cascade). /// /// Each input pattern is registered as an *exclude* (gitignore-style: a /// leading `!` flips a positive match to a negative one in the @@ -118,6 +123,13 @@ pub(crate) fn read_gitignore(root: &Path) -> Result<Vec<String>> { if trimmed.is_empty() || trimmed.starts_with('#') { continue; } + // If the pattern starts with `!` (gitignore negation/un-ignore), pass through + // as-is. Trailing-slash normalization is unsafe here — the `!`-prefix and `/`- + // suffix combined confuse OverrideBuilder (would produce double-`!`). + if trimmed.starts_with('!') { + out.push(trimmed.to_string()); + continue; + } if let Some(stem) = trimmed.strip_suffix('/') { // Keep the dir-only form so `is_dir=true` matches are still // excluded (e.g., for skip_current_dir in the walker). @@ -398,4 +410,18 @@ mod tests { assert!(!overrides.matched(Path::new("a.log"), false).is_ignore()); assert!(!overrides.matched(Path::new("src/main.rs"), false).is_ignore()); } + + #[test] + fn gitignore_negation_with_trailing_slash_passes_through() { + use std::fs; + use tempfile::TempDir; + let tmp = TempDir::new().unwrap(); + let root = tmp.path(); + // Negation pattern. We don't fully implement gitignore negation + // semantics, but at minimum it must not produce double-`!` corruption. + fs::write(root.join(".gitignore"), "!keep/\n").unwrap(); + // Just verify build_overrides doesn't error.
+ let result = build_overrides(root, &[], &[]); + assert!(result.is_ok(), "should not error on negation pattern: {:?}", result.err()); + } } -- 2.49.1 From 9fce24b1061c8556879ce80c81959a9e0c5c98a1 Mon Sep 17 00:00:00 2001 From: th-kim0823 Date: Fri, 15 May 2026 16:42:28 +0900 Subject: [PATCH 14/21] feat(p10-1a-1): wire IngestReport skip counters by category (gitignore/builtin/kebabignore) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Refactor walker to expose WalkOverrides (combined + per-source matchers), add walk_files_with_skips that returns accepted files alongside skip attribution, wire FsSourceConnector::scan_with_skips into kebab-app so IngestReport.skipped_gitignore, skipped_kebabignore, skipped_builtin_blacklist, and skip_examples are populated instead of left at zero. Priority order per spec §5.2 (builtin > gitignore > kebabignore) enforced in classify_skip, with a directory-aware builtin matcher so pruned directory entries are correctly attributed to builtin rather than a coincident gitignore entry. 
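
The attribution rule above can be sketched in isolation. This is a minimal illustration, not the code in this patch: it assumes plain string predicates as hypothetical stand-ins for the compiled `ignore::overrides::Override` matchers, and `Sources`, `classify`, and `sample_sources` are names invented for this sketch. Only the priority order (builtin > gitignore > kebabignore, otherwise "other") is taken from the commit message.

```rust
/// Attribution categories, mirroring the `SkipCategory` enum added in this patch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SkipCategory {
    BuiltinBlacklist,
    Gitignore,
    Kebabignore,
    Other,
}

/// Hypothetical stand-in for the per-source matchers: each source is a plain
/// predicate over a workspace-relative path rather than a compiled `Override`.
struct Sources {
    builtin: fn(&str) -> bool,
    gitignore: fn(&str) -> bool,
    kebabignore: fn(&str) -> bool,
}

/// First match wins, checked in priority order
/// builtin > gitignore > kebabignore; anything else is `Other`.
fn classify(path: &str, s: &Sources) -> SkipCategory {
    if (s.builtin)(path) {
        SkipCategory::BuiltinBlacklist
    } else if (s.gitignore)(path) {
        SkipCategory::Gitignore
    } else if (s.kebabignore)(path) {
        SkipCategory::Kebabignore
    } else {
        SkipCategory::Other
    }
}

/// Sample predicates (assumed for illustration). `node_modules` deliberately
/// matches both the builtin and the gitignore source to exercise the priority.
fn sample_sources() -> Sources {
    Sources {
        builtin: |p: &str| p == "node_modules" || p.starts_with("node_modules/"),
        gitignore: |p: &str| p.starts_with("node_modules") || p.ends_with(".log"),
        kebabignore: |p: &str| p.ends_with(".secret"),
    }
}

fn main() {
    let s = sample_sources();
    for p in ["node_modules", "app.log", "creds.secret", ".DS_Store"] {
        println!("{p} -> {:?}", classify(p, &s));
    }
}
```

Because the check is first-match, a directory listed in both the built-in blacklist and `.gitignore` increments only the builtin counter, which is the behavior the `scan_with_skips_builtin_priority_over_gitignore` test in this patch asserts.
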
Co-Authored-By: Claude Sonnet 4.6 --- crates/kebab-app/src/lib.rs | 14 +- crates/kebab-source-fs/src/connector.rs | 376 ++++++++++++++++----- crates/kebab-source-fs/src/lib.rs | 2 +- crates/kebab-source-fs/src/walker.rs | 416 +++++++++++++++++++----- 4 files changed, 644 insertions(+), 164 deletions(-) diff --git a/crates/kebab-app/src/lib.rs b/crates/kebab-app/src/lib.rs index 3c8e03c..99f5597 100644 --- a/crates/kebab-app/src/lib.rs +++ b/crates/kebab-app/src/lib.rs @@ -44,7 +44,7 @@ use kebab_core::{ Answer, Block, CanonicalDocument, Chunk, ChunkId, ChunkPolicy, ChunkerVersion, Chunker, DocFilter, DocSummary, DocumentId, DocumentStore, Embedder, EmbeddingInput, EmbeddingKind, ExtractContext, Extractor, IngestReport, Lang, LanguageModel, MediaType, - ParserVersion, RawAsset, SearchHit, SearchQuery, SkipExamples, SourceConnector, SourceScope, + ParserVersion, RawAsset, SearchHit, SearchQuery, SourceScope, SourceUri, VectorRecord, VectorStore, }; use kebab_llm_local::OllamaLanguageModel; @@ -305,8 +305,8 @@ pub fn ingest_with_config_opts( ); let connector = FsSourceConnector::new(&app.config) .context("kb-app::ingest: build FsSourceConnector")?; - let assets = connector - .scan(&scope) + let (assets, fs_skips) = connector + .scan_with_skips(&scope) .context("kb-app::ingest: scan workspace")?; crate::ingest_progress::emit( progress, @@ -675,12 +675,12 @@ pub fn ingest_with_config_opts( errors: error_count, duration_ms, skipped_by_extension, - skipped_gitignore: 0, - skipped_kebabignore: 0, - skipped_builtin_blacklist: 0, + skipped_gitignore: fs_skips.skipped_gitignore, + skipped_kebabignore: fs_skips.skipped_kebabignore, + skipped_builtin_blacklist: fs_skips.skipped_builtin_blacklist, skipped_generated: 0, skipped_size_exceeded: 0, - skip_examples: SkipExamples::default(), + skip_examples: fs_skips.skip_examples, items: if summary_only { None } else { Some(items) }, }) } diff --git a/crates/kebab-source-fs/src/connector.rs b/crates/kebab-source-fs/src/connector.rs 
index b01c14a..d3cdb1f 100644 --- a/crates/kebab-source-fs/src/connector.rs +++ b/crates/kebab-source-fs/src/connector.rs @@ -10,20 +10,20 @@ //! } //! ``` -use std::path::PathBuf; +use std::path::{Path, PathBuf}; use anyhow::{Context, Result}; use time::OffsetDateTime; use kebab_config::Config; use kebab_core::{ - AssetStorage, Checksum, RawAsset, SourceConnector, SourceScope, SourceUri, + AssetStorage, Checksum, RawAsset, SkipExamples, SourceConnector, SourceScope, SourceUri, id_for_asset, to_posix, }; use crate::hash::hash_file; use crate::media::media_type_for; -use crate::walker::{build_overrides, read_kbignore, walk_files}; +use crate::walker::{SkipCategory, WalkOverrides, build_overrides, read_kbignore, walk_files_with_skips}; /// Local-filesystem `SourceConnector`. Constructed once from `Config`, /// reused across `scan` calls. @@ -61,107 +61,179 @@ impl FsSourceConnector { copy_threshold_bytes, }) } -} -impl SourceConnector for FsSourceConnector { - fn scan(&self, scope: &SourceScope) -> Result> { - // `SourceScope::root` overrides config root when non-empty. This - // matches the design's "scope is the per-call lens; config is the - // default" split (§7.1). + /// Resolve the effective root and build the merged + per-source overrides. + fn resolve_scan_params( + &self, + scope: &SourceScope, + ) -> Result<(PathBuf, Vec, WalkOverrides)> { let root = if scope.root.as_os_str().is_empty() { self.default_root.clone() } else { scope.root.clone() }; - // Union: config.workspace.exclude ∪ scope.exclude ∪ .kebabignore. - // Per §6.2 the union of `.kebabignore` and `config.workspace.exclude` - // is the filter set. `scope.exclude` is added on top so a caller - // can layer a per-call narrowing. let mut excludes = self.default_exclude.clone(); excludes.extend(scope.exclude.iter().cloned()); - // .kebabignore is re-read on every scan() so users can edit it without - // restarting any long-running process. 
let kbignore = read_kbignore(&root)?; let overrides = build_overrides(&root, &excludes, &kbignore)?; + Ok((root, kbignore, overrides)) + } - // TODO(P1-2/P1-3 router): apply SourceScope::include glob filter at the - // extractor router layer once that crate lands. SourceConnector emits all - // non-excluded files; routing by include-glob is a downstream concern - // (design §6.2 + §7.2 are silent on this split, treat it as router work). - // - // `scope.include` is intentionally ignored at this stage of the - // pipeline: per §6.2 the workspace-level include lives in - // `WorkspaceCfg` and is enforced by the asset writer / extractors. - // Surfacing it here would double-filter Markdown vs PDF before the - // extractor router gets to see them. - if !scope.include.is_empty() { - tracing::debug!( - count = scope.include.len(), - "FsSourceConnector ignores scope.include — handled by extractor router" - ); - } + /// Scan the workspace and return the accepted assets together with + /// per-category skip counts and sample paths for `IngestReport`. + /// + /// This is the **preferred entry point** for `kebab-app`: it provides + /// all the information needed to populate `IngestReport.skipped_gitignore`, + /// `skipped_kebabignore`, `skipped_builtin_blacklist`, and `skip_examples` + /// without a second walker pass. + pub fn scan_with_skips( + &self, + scope: &SourceScope, + ) -> Result<(Vec, FsScanSkips)> { + let (root, _kbignore, overrides) = self.resolve_scan_params(scope)?; - let files = walk_files(&root, &overrides)?; + // Suppress unused-variable warning — kbignore patterns are already + // baked into `overrides`; we don't need them again here. - let mut assets = Vec::with_capacity(files.len()); - for abs in &files { - // `to_posix` does NFC + leading `./` strip + `#` rejection. - // Compute the workspace-relative path before handing to it so - // emitted `WorkspacePath` is always relative. 
- let rel = abs.strip_prefix(&root).unwrap_or(abs); - let workspace_path = match to_posix(rel) { - Ok(p) => p, - Err(e) => { - // A path containing `#` is the only documented reason - // `to_posix` fails today. Drop the file with a warning - // rather than aborting the entire scan — a single bad - // filename should not nuke a 10 000-file ingest. - tracing::warn!( - path = %abs.display(), - error = %e, - "skipping file: path is not a valid WorkspacePath", + log_scope_include_warning(scope); + + let (files, skipped_entries) = walk_files_with_skips(&root, &overrides)?; + + // Accumulate per-category skip counts and sample paths. + let mut fs_skips = FsScanSkips::default(); + for entry in &skipped_entries { + match entry.category { + SkipCategory::BuiltinBlacklist => { + fs_skips.skipped_builtin_blacklist = + fs_skips.skipped_builtin_blacklist.saturating_add(1); + push_sample( + &mut fs_skips.skip_examples.builtin_blacklist, + &entry.path, + &root, ); - continue; } - }; - - let media_type = media_type_for(abs); - let (byte_len, full_hex) = hash_file(abs) - .with_context(|| format!("hashing {}", abs.display()))?; - let checksum = Checksum(full_hex.clone()); - let asset_id = id_for_asset(&full_hex); - - // Storage variant signals *intent*, not an actual copy. - // P1-6 (asset writer) is responsible for the on-disk copy. 
- let stored = if byte_len > self.copy_threshold_bytes { - AssetStorage::Reference { - path: abs.clone(), - sha: checksum.clone(), + SkipCategory::Gitignore => { + fs_skips.skipped_gitignore = + fs_skips.skipped_gitignore.saturating_add(1); + push_sample( + &mut fs_skips.skip_examples.gitignore, + &entry.path, + &root, + ); } - } else { - AssetStorage::Copied { path: abs.clone() } - }; - - assets.push(RawAsset { - asset_id, - source_uri: SourceUri::File(abs.clone()), - workspace_path, - media_type, - byte_len, - checksum, - discovered_at: OffsetDateTime::now_utc(), - stored, - }); + SkipCategory::Kebabignore => { + fs_skips.skipped_kebabignore = + fs_skips.skipped_kebabignore.saturating_add(1); + // kebabignore intentionally NOT in skip_examples per spec §5.5. + } + SkipCategory::Other => { + // DEFAULT_EXCLUDES or config.workspace.exclude — no dedicated + // IngestReport counter; these are lumped into the existing + // `skipped` field by kebab-app. + } + } } - // Determinism: sort by workspace_path. WorkspacePath is a String - // newtype with stable lexicographic ordering. Two scans of the - // same tree must produce identical Vec modulo the - // wall-clock `discovered_at` field. - assets.sort_by(|a, b| a.workspace_path.0.cmp(&b.workspace_path.0)); + let assets = build_assets(&files, &root, self.copy_threshold_bytes)?; + Ok((assets, fs_skips)) + } +} +/// Per-category skip counts and sample paths returned alongside the asset list +/// by [`FsSourceConnector::scan_with_skips`]. +/// +/// Populated from the walker's per-source matchers without a second pass. +#[derive(Debug, Default)] +pub struct FsScanSkips { + pub skipped_gitignore: u32, + pub skipped_kebabignore: u32, + pub skipped_builtin_blacklist: u32, + /// Sample paths per spec §5.5 (≤ 5 per category). Paths are + /// workspace-relative POSIX strings when available, absolute otherwise. 
+ pub skip_examples: SkipExamples, +} + +/// Push a path into a sample vec (cap = 5) as a workspace-relative POSIX +/// string. Falls back to the lossy absolute path if relativisation fails. +fn push_sample(samples: &mut Vec, abs: &Path, root: &Path) { + if samples.len() >= 5 { + return; + } + let rel = abs.strip_prefix(root).unwrap_or(abs); + // Best-effort POSIX string; any non-UTF8 char → replacement char. + let s = rel.to_string_lossy().replace('\\', "/"); + samples.push(s); +} + +/// Convert a list of absolute file paths to `Vec`, sorted by +/// workspace-relative POSIX path for determinism. +fn build_assets( + files: &[PathBuf], + root: &Path, + copy_threshold_bytes: u64, +) -> Result> { + let mut assets = Vec::with_capacity(files.len()); + for abs in files { + let rel = abs.strip_prefix(root).unwrap_or(abs); + let workspace_path = match to_posix(rel) { + Ok(p) => p, + Err(e) => { + tracing::warn!( + path = %abs.display(), + error = %e, + "skipping file: path is not a valid WorkspacePath", + ); + continue; + } + }; + + let media_type = media_type_for(abs); + let (byte_len, full_hex) = hash_file(abs) + .with_context(|| format!("hashing {}", abs.display()))?; + let checksum = Checksum(full_hex.clone()); + let asset_id = id_for_asset(&full_hex); + + let stored = if byte_len > copy_threshold_bytes { + AssetStorage::Reference { + path: abs.clone(), + sha: checksum.clone(), + } + } else { + AssetStorage::Copied { path: abs.clone() } + }; + + assets.push(RawAsset { + asset_id, + source_uri: SourceUri::File(abs.clone()), + workspace_path, + media_type, + byte_len, + checksum, + discovered_at: OffsetDateTime::now_utc(), + stored, + }); + } + + assets.sort_by(|a, b| a.workspace_path.0.cmp(&b.workspace_path.0)); + Ok(assets) +} + +fn log_scope_include_warning(scope: &SourceScope) { + if !scope.include.is_empty() { + tracing::debug!( + count = scope.include.len(), + "FsSourceConnector ignores scope.include — handled by extractor router" + ); + } +} + +impl 
SourceConnector for FsSourceConnector { + fn scan(&self, scope: &SourceScope) -> Result> { + // Delegate to scan_with_skips; discard the skip counts. + // Callers that need skip attribution should call scan_with_skips directly. + let (assets, _skips) = self.scan_with_skips(scope)?; Ok(assets) } } @@ -401,4 +473,142 @@ mod tests { let v2 = conn2.scan(&SourceScope::default()).unwrap(); assert!(matches!(v2[0].stored, AssetStorage::Copied { .. })); } + + // ── IngestReport skip counter wiring tests ─────────────────────────────── + + #[test] + fn scan_with_skips_counts_gitignored_files() { + let dir = tempfile::tempdir().unwrap(); + let root = dir.path(); + std::fs::write(root.join(".gitignore"), "*.log\n").unwrap(); + std::fs::write(root.join("ok.md"), b"# ok").unwrap(); + std::fs::write(root.join("skipme.log"), b"x").unwrap(); + + let conn = + FsSourceConnector::new(&cfg_with_root(root.to_str().unwrap())) + .unwrap(); + let (_assets, skips) = conn.scan_with_skips(&SourceScope::default()).unwrap(); + + assert!( + skips.skipped_gitignore >= 1, + "skipped_gitignore should be >= 1; got {}", + skips.skipped_gitignore + ); + assert!( + skips.skip_examples.gitignore.iter().any(|p| p.contains("skipme.log")), + "skip_examples.gitignore should contain 'skipme.log'; got: {:?}", + skips.skip_examples.gitignore + ); + // kebabignore counter must be 0 — file matched gitignore, not kebabignore. 
+ assert_eq!(skips.skipped_kebabignore, 0); + } + + #[test] + fn scan_with_skips_counts_builtin_blacklist_dirs() { + let dir = tempfile::tempdir().unwrap(); + let root = dir.path(); + std::fs::create_dir_all(root.join("node_modules/foo")).unwrap(); + std::fs::write(root.join("node_modules/foo/bar.js"), b"x").unwrap(); + std::fs::write(root.join("ok.md"), b"# ok").unwrap(); + + let conn = + FsSourceConnector::new(&cfg_with_root(root.to_str().unwrap())) + .unwrap(); + let (_assets, skips) = conn.scan_with_skips(&SourceScope::default()).unwrap(); + + assert!( + skips.skipped_builtin_blacklist >= 1, + "skipped_builtin_blacklist should be >= 1; got {}", + skips.skipped_builtin_blacklist + ); + assert!( + skips.skip_examples.builtin_blacklist.iter().any(|p| p.contains("node_modules")), + "skip_examples.builtin_blacklist should contain a node_modules path; got: {:?}", + skips.skip_examples.builtin_blacklist + ); + } + + #[test] + fn scan_with_skips_kebabignore_increments_counter_no_example() { + let dir = tempfile::tempdir().unwrap(); + let root = dir.path(); + std::fs::write(root.join(".kebabignore"), "*.secret\n").unwrap(); + std::fs::write(root.join("ok.md"), b"x").unwrap(); + std::fs::write(root.join("creds.secret"), b"pw").unwrap(); + + let conn = + FsSourceConnector::new(&cfg_with_root(root.to_str().unwrap())) + .unwrap(); + let (_assets, skips) = conn.scan_with_skips(&SourceScope::default()).unwrap(); + + assert!( + skips.skipped_kebabignore >= 1, + "skipped_kebabignore should be >= 1; got {}", + skips.skipped_kebabignore + ); + // Per spec §5.5: kebabignore is intentionally NOT in skip_examples. 
+ assert!( + skips.skip_examples.gitignore.is_empty(), + "gitignore examples should be empty; got: {:?}", + skips.skip_examples.gitignore + ); + assert!( + skips.skip_examples.builtin_blacklist.is_empty(), + "builtin_blacklist examples should be empty; got: {:?}", + skips.skip_examples.builtin_blacklist + ); + } + + #[test] + fn scan_with_skips_builtin_priority_over_gitignore() { + // node_modules/ matches both BUILTIN_BLACKLIST and a .gitignore entry. + // It must be attributed to builtin (spec §5.2 priority order). + let dir = tempfile::tempdir().unwrap(); + let root = dir.path(); + std::fs::write(root.join(".gitignore"), "node_modules/\n").unwrap(); + std::fs::create_dir_all(root.join("node_modules/pkg")).unwrap(); + std::fs::write(root.join("node_modules/pkg/index.js"), b"x").unwrap(); + std::fs::write(root.join("ok.md"), b"x").unwrap(); + + let conn = + FsSourceConnector::new(&cfg_with_root(root.to_str().unwrap())) + .unwrap(); + let (_assets, skips) = conn.scan_with_skips(&SourceScope::default()).unwrap(); + + assert!( + skips.skipped_builtin_blacklist >= 1, + "builtin counter should be >= 1; got {}", + skips.skipped_builtin_blacklist + ); + assert_eq!( + skips.skipped_gitignore, 0, + "gitignore counter must be 0 when builtin wins; got {}", + skips.skipped_gitignore + ); + } + + #[test] + fn skip_examples_cap_at_five() { + // Write 7 .log files — skip_examples.gitignore must cap at 5. 
+ let dir = tempfile::tempdir().unwrap(); + let root = dir.path(); + std::fs::write(root.join(".gitignore"), "*.log\n").unwrap(); + for i in 0..7 { + std::fs::write(root.join(format!("f{i}.log")), b"x").unwrap(); + } + std::fs::write(root.join("ok.md"), b"x").unwrap(); + + let conn = + FsSourceConnector::new(&cfg_with_root(root.to_str().unwrap())) + .unwrap(); + let (_assets, skips) = conn.scan_with_skips(&SourceScope::default()).unwrap(); + + assert_eq!(skips.skipped_gitignore, 7, "should count all 7"); + assert_eq!( + skips.skip_examples.gitignore.len(), + 5, + "skip_examples.gitignore must cap at 5; got: {:?}", + skips.skip_examples.gitignore + ); + } } diff --git a/crates/kebab-source-fs/src/lib.rs b/crates/kebab-source-fs/src/lib.rs index ea7f294..6975271 100644 --- a/crates/kebab-source-fs/src/lib.rs +++ b/crates/kebab-source-fs/src/lib.rs @@ -13,4 +13,4 @@ mod hash; mod media; mod walker; -pub use connector::FsSourceConnector; +pub use connector::{FsScanSkips, FsSourceConnector}; diff --git a/crates/kebab-source-fs/src/walker.rs b/crates/kebab-source-fs/src/walker.rs index 0d18baa..e963c0b 100644 --- a/crates/kebab-source-fs/src/walker.rs +++ b/crates/kebab-source-fs/src/walker.rs @@ -35,7 +35,7 @@ use std::path::{Path, PathBuf}; use anyhow::{Context, Result}; use ignore::overrides::{Override, OverrideBuilder}; -use walkdir::{DirEntry, WalkDir}; +use walkdir::WalkDir; /// Default-excludes baked into the connector. These are NOT configurable; /// they cover noise that is never useful to ingest and would otherwise need @@ -49,7 +49,96 @@ const DEFAULT_EXCLUDES: &[&str] = &[ "**/._*", ]; -/// Build the merged `Override` from all five filter sources, in order: +/// Per-source `Override` matchers for skip-counter attribution (spec §5.5). +/// +/// `combined` is the merged union of all sources — used for the actual +/// "is this entry excluded?" decision in the walker. 
The three per-source +/// matchers (`gitignore`, `kebabignore`, `builtin`) are used ONLY when +/// classifying an already-excluded path for `IngestReport` counter wiring; +/// they are never consulted for every walked file. +/// +/// `default_and_config` covers DEFAULT_EXCLUDES + `config.workspace.exclude` +/// — these do NOT map to any of the three named `IngestReport` counters. +pub(crate) struct WalkOverrides { + /// Merged matcher — same as today's `Override`; used for the walk decision. + pub combined: Override, + /// Matcher built from `/.gitignore` patterns only. + pub gitignore: Override, + /// Matcher built from `/.kebabignore` patterns only. + pub kebabignore: Override, + /// Matcher built from `kebab_parse_code::BUILTIN_BLACKLIST` only. + pub builtin: Override, +} + +/// Skip attribution category. Used by the connector when counting per-source +/// skips for `IngestReport` (spec §5.5). +/// +/// Priority order per spec §5.2: built-in > gitignore > kebabignore. +/// A path matching multiple sources is attributed to the first match. +#[derive(Debug, Clone, Copy, PartialEq, Eq)] +pub(crate) enum SkipCategory { + BuiltinBlacklist, + Gitignore, + Kebabignore, + /// Matched DEFAULT_EXCLUDES or `config.workspace.exclude`. No dedicated + /// counter in `IngestReport` — lumped into the existing `skipped` field. + Other, +} + +/// Build a single `Override` from a list of gitignore-style patterns, all +/// registered as excludes (prepend `!`). +/// +/// Empty pattern list → an `Override` that matches nothing (i.e. no +/// exclusions). Callers must strip blanks / comments before passing. 
+fn build_single_matcher(root: &Path, patterns: &[&str]) -> Result { + let mut builder = OverrideBuilder::new(root); + for pat in patterns { + builder + .add(&format!("!{pat}")) + .with_context(|| format!("invalid pattern: {pat}"))?; + } + builder.build().context("failed to compile override") +} + +/// Build the builtin-blacklist `Override`, adding directory-level patterns in +/// addition to the `/**`-suffix ones from `BUILTIN_BLACKLIST`. +/// +/// BUILTIN_BLACKLIST uses `**/X/**` patterns which match files *inside* X but +/// NOT the directory entry `X` itself (because `**/X/**` requires a path +/// component after X). The walker prunes at the directory level (`is_dir=true`), +/// so we need `**/X` (no trailing `/**`) to also match the directory itself +/// for attribution purposes. +fn build_builtin_matcher(root: &Path) -> Result { + let mut builder = OverrideBuilder::new(root); + for pat in kebab_parse_code::BUILTIN_BLACKLIST { + // Register the original pattern (matches files inside the dir). + builder + .add(&format!("!{pat}")) + .with_context(|| format!("builtin pattern: {pat}"))?; + // Also derive a directory-level match by stripping trailing `/**`. + // This makes `is_dir=true` checks on the directory itself work. + if let Some(dir_pat) = pat.strip_suffix("/**") { + builder + .add(&format!("!{dir_pat}")) + .with_context(|| format!("builtin dir pattern: {dir_pat}"))?; + } + } + builder.build().context("failed to compile builtin override") +} + +/// Owned-string variant of `build_single_matcher` for caller-supplied +/// `Vec` sources (config.workspace.exclude, .kebabignore). 
+fn build_single_matcher_owned(root: &Path, patterns: &[String]) -> Result { + let mut builder = OverrideBuilder::new(root); + for pat in patterns { + builder + .add(&format!("!{pat}")) + .with_context(|| format!("invalid pattern: {pat}"))?; + } + builder.build().context("failed to compile override") +} + +/// Build the merged `WalkOverrides` from all five filter sources, in order: /// DEFAULT_EXCLUDES, `config.workspace.exclude`, `.kebabignore`, /// built-in safety-net blacklist (`kebab_parse_code::BUILTIN_BLACKLIST`), /// and `/.gitignore` (root-only, no nested cascade). @@ -58,48 +147,82 @@ const DEFAULT_EXCLUDES: &[&str] = &[ /// leading `!` flips a positive match to a negative one in the /// `OverrideBuilder` API). Order doesn't matter — the union is computed by /// the underlying gitignore engine. +/// +/// The three per-source matchers (`gitignore`, `kebabignore`, `builtin`) are +/// built in addition to the combined one so the connector can attribute skips +/// to the correct `IngestReport` counter without a second walker pass. pub(crate) fn build_overrides( root: &Path, config_exclude: &[String], kbignore_patterns: &[String], -) -> Result { - let mut builder = OverrideBuilder::new(root); +) -> Result { + let gitignore_patterns = read_gitignore(root)?; + + // Per-source matchers (for attribution only). + let gitignore = + build_single_matcher(root, &gitignore_patterns.iter().map(|s| s.as_str()).collect::>())?; + let kebabignore = build_single_matcher_owned(root, kbignore_patterns)?; + // Use the directory-aware builtin matcher so that `is_dir=true` checks on + // directory entries (e.g., `node_modules/`) are attributed to builtin rather + // than to an overlapping gitignore pattern. + let builtin = build_builtin_matcher(root)?; + + // Combined matcher — union of all five sources. 
+ let mut combined_builder = OverrideBuilder::new(root); for pat in DEFAULT_EXCLUDES { - builder + combined_builder .add(&format!("!{pat}")) .with_context(|| format!("invalid default-exclude pattern: {pat}"))?; } for pat in config_exclude { - builder + combined_builder .add(&format!("!{pat}")) .with_context(|| format!("invalid workspace.exclude pattern: {pat}"))?; } for pat in kbignore_patterns { - builder + combined_builder .add(&format!("!{pat}")) .with_context(|| format!("invalid .kebabignore pattern: {pat}"))?; } - - // p10-1A-1: built-in safety-net blacklist (spec §5.2). 6 patterns - // universal across ecosystems. User can negate via `.kebabignore`. for pat in kebab_parse_code::BUILTIN_BLACKLIST { - builder + combined_builder .add(&format!("!{pat}")) .with_context(|| format!("built-in blacklist pattern: {pat}"))?; } - - // p10-1A-1: honor repo-root `.gitignore` (spec §5.2). Read once, - // merge with same convention as user `.kebabignore`. Nested - // cascade deferred to P+. - let gitignore_patterns = read_gitignore(root)?; for pat in &gitignore_patterns { - builder + combined_builder .add(&format!("!{pat}")) .with_context(|| format!(".gitignore pattern: {pat}"))?; } + let combined = combined_builder + .build() + .context("failed to compile combined override set")?; - builder.build().context("failed to compile override set") + Ok(WalkOverrides { + combined, + gitignore, + kebabignore, + builtin, + }) +} + +/// Classify why a path was excluded, using per-source matchers in spec §5.2 +/// priority order: built-in > gitignore > kebabignore > other. +/// +/// `rel` must be relative to the walker root (same as `Override::matched` +/// expects). `is_dir` should match what the original walker saw. 
+pub(crate) fn classify_skip(rel: &Path, is_dir: bool, ov: &WalkOverrides) -> SkipCategory { + if ov.builtin.matched(rel, is_dir).is_ignore() { + return SkipCategory::BuiltinBlacklist; + } + if ov.gitignore.matched(rel, is_dir).is_ignore() { + return SkipCategory::Gitignore; + } + if ov.kebabignore.matched(rel, is_dir).is_ignore() { + return SkipCategory::Kebabignore; + } + SkipCategory::Other } /// Read `/.gitignore` (single-file, root-only — nested cascade is P+). @@ -161,51 +284,87 @@ pub(crate) fn read_kbignore(root: &Path) -> Result> { .collect()) } -/// Iterate every regular file under `root`, applying `overrides` and -/// detecting symlink cycles. Returns absolute file paths. +/// Skipped-path record emitted by `walk_files_with_skips`. +/// +/// `path` is the absolute path of the excluded entry (dir or file). +/// For excluded directories, this is the directory itself — individual +/// files inside are not enumerated (the subtree is pruned). +pub(crate) struct SkippedEntry { + pub path: PathBuf, + pub category: SkipCategory, +} + +/// Iterate every regular file under `root`, applying `overrides.combined` and +/// detecting symlink cycles. Returns: +/// - `accepted`: absolute paths of files that passed all filters. +/// - `skipped`: entries that were excluded, with attribution. +/// +/// For excluded *directories*, the directory path itself is returned (not the +/// individual files inside — the subtree is pruned in one step, matching the +/// walker's `skip_current_dir` behavior). /// /// Strategy: /// - `walkdir::WalkDir::follow_links(true)` to traverse symlinks. +/// - Manual per-entry check (instead of `filter_entry`) so we can capture +/// the excluded paths for skip attribution. /// - Maintain `visited: HashSet` of *canonical* paths. Before /// descending into a directory entry, canonicalize and check the set; /// if already present, skip. This breaks `a -> b -> a` cycles in O(n) /// per entry without a custom recursive walker. 
-/// - For each yielded entry, ask `overrides` whether it is excluded; if -/// so, drop it. If the entry is a directory, also short-circuit -/// `WalkDir`'s descent via `it.skip_current_dir()`. -pub(crate) fn walk_files(root: &Path, overrides: &Override) -> Result> { - let mut out = Vec::new(); +pub(crate) fn walk_files_with_skips( + root: &Path, + overrides: &WalkOverrides, +) -> Result<(Vec, Vec)> { + let mut accepted = Vec::new(); + let mut skipped: Vec = Vec::new(); let mut visited: HashSet = HashSet::new(); + // Use a non-filtering iterator so we see excluded entries too. let walker = WalkDir::new(root).follow_links(true).into_iter(); - let mut it = walker.filter_entry(|e| !is_excluded(e, root, overrides)); + // We still use filter_entry for the *combined* override so that walkdir + // can short-circuit pruned directories. But we wrap it so we can capture + // the exclusion reason before discarding the entry. + // + // Problem: filter_entry discards without letting us see the entry first. + // Solution: use the raw iterator (no filter_entry) and manage skip_current_dir + // manually, which lets us record what was excluded before pruning. + let mut it = walker; while let Some(res) = it.next() { let entry = match res { Ok(e) => e, Err(err) => { - // `walkdir` surfaces I/O errors AND its own cycle detector - // (when follow_links is on it sometimes catches them). - // Either way: log and skip; do not abort the whole scan. tracing::warn!(error = %err, "walkdir entry error; skipping"); continue; } }; let path = entry.path(); + let rel = match path.strip_prefix(root) { + Ok(p) => p, + Err(_) => path, + }; + let is_dir = entry.file_type().is_dir(); + let excluded = overrides.combined.matched(rel, is_dir).is_ignore(); - // Cycle guard: only canonicalize symlinks (cheap on the common case - // of plain files/dirs) and on directories that are followed via a - // symlink. 
`walkdir`'s `path_is_symlink()` is true when the entry's - // *original* path is a symlink (it returns true for the link, not - // for the resolved target). For non-symlinked directories we still - // record the canonical path so a *later* symlink that points back - // to one of them is detected. - if entry.file_type().is_dir() { + if excluded { + let cat = classify_skip(rel, is_dir, overrides); + skipped.push(SkippedEntry { + path: path.to_path_buf(), + category: cat, + }); + if is_dir { + // Prune the subtree — don't descend into excluded dirs. + it.skip_current_dir(); + } + continue; + } + + // Cycle guard for directories. + if is_dir { match std::fs::canonicalize(path) { Ok(canon) => { if !visited.insert(canon) { - // Already visited via another path → break cycle. it.skip_current_dir(); continue; } @@ -222,26 +381,13 @@ pub(crate) fn walk_files(root: &Path, overrides: &Override) -> Result bool { - // `Override::matched(path, is_dir)` uses the path *relative to* the - // override builder's root. `walkdir` gives absolute paths when - // `WalkDir::new` was given an absolute path — strip the root prefix - // before consulting the override. - let rel = match entry.path().strip_prefix(root) { - Ok(p) => p, - Err(_) => entry.path(), - }; - overrides - .matched(rel, entry.file_type().is_dir()) - .is_ignore() -} #[cfg(test)] mod tests { @@ -252,7 +398,7 @@ mod tests { let dir = tempfile::tempdir().unwrap(); let ov = build_overrides(dir.path(), &[], &[]).unwrap(); // Default-excludes only; non-special files should not match. 
- let m = ov.matched(Path::new("notes/alpha.md"), false); + let m = ov.combined.matched(Path::new("notes/alpha.md"), false); assert!(!m.is_ignore()); } @@ -260,13 +406,13 @@ mod tests { fn default_excludes_ds_store_and_resource_forks() { let dir = tempfile::tempdir().unwrap(); let ov = build_overrides(dir.path(), &[], &[]).unwrap(); - assert!(ov.matched(Path::new(".DS_Store"), false).is_ignore()); + assert!(ov.combined.matched(Path::new(".DS_Store"), false).is_ignore()); assert!( - ov.matched(Path::new("notes/.DS_Store"), false).is_ignore() + ov.combined.matched(Path::new("notes/.DS_Store"), false).is_ignore() ); - assert!(ov.matched(Path::new("._foo.md"), false).is_ignore()); + assert!(ov.combined.matched(Path::new("._foo.md"), false).is_ignore()); assert!( - ov.matched(Path::new("notes/._sidecar"), false).is_ignore() + ov.combined.matched(Path::new("notes/._sidecar"), false).is_ignore() ); } @@ -279,13 +425,13 @@ mod tests { &[], ) .unwrap(); - assert!(ov.matched(Path::new("a.tmp"), false).is_ignore()); - assert!(ov.matched(Path::new("notes/x.tmp"), false).is_ignore()); + assert!(ov.combined.matched(Path::new("a.tmp"), false).is_ignore()); + assert!(ov.combined.matched(Path::new("notes/x.tmp"), false).is_ignore()); assert!( - ov.matched(Path::new("node_modules/foo/bar.js"), false) + ov.combined.matched(Path::new("node_modules/foo/bar.js"), false) .is_ignore() ); - assert!(!ov.matched(Path::new("alpha.md"), false).is_ignore()); + assert!(!ov.combined.matched(Path::new("alpha.md"), false).is_ignore()); } #[test] @@ -298,9 +444,9 @@ mod tests { &["secret/**".to_string()], ) .unwrap(); - assert!(ov.matched(Path::new("a.tmp"), false).is_ignore()); + assert!(ov.combined.matched(Path::new("a.tmp"), false).is_ignore()); assert!( - ov.matched(Path::new("secret/key.md"), false).is_ignore() + ov.combined.matched(Path::new("secret/key.md"), false).is_ignore() ); } @@ -337,8 +483,8 @@ mod tests { let overrides = build_overrides(root, &[], &[]).unwrap(); // Override::matched 
expects paths relative to the builder's root. - let m_in = overrides.matched(Path::new("src/main.rs"), false); - let m_out = overrides.matched(Path::new("node_modules/foo/bar.js"), false); + let m_in = overrides.combined.matched(Path::new("src/main.rs"), false); + let m_out = overrides.combined.matched(Path::new("node_modules/foo/bar.js"), false); assert!(!m_in.is_ignore(), "src/main.rs should NOT be ignored"); assert!(m_out.is_ignore(), "node_modules/foo/bar.js SHOULD be ignored"); @@ -367,10 +513,10 @@ mod tests { "venv/x/y.txt", "env/x/y.txt", ] { - let m = overrides.matched(Path::new(blacklisted), false); + let m = overrides.combined.matched(Path::new(blacklisted), false); assert!(m.is_ignore(), "{blacklisted} should be ignored"); } - let m_ok = overrides.matched(Path::new("ok/z.txt"), false); + let m_ok = overrides.combined.matched(Path::new("ok/z.txt"), false); assert!(!m_ok.is_ignore(), "ok/z.txt should not be ignored"); } @@ -389,9 +535,9 @@ mod tests { fs::write(root.join("dist/bundle.js"), "x").unwrap(); let overrides = build_overrides(root, &[], &[]).unwrap(); - assert!(overrides.matched(Path::new("a.log"), false).is_ignore()); - assert!(overrides.matched(Path::new("dist/bundle.js"), false).is_ignore()); - assert!(!overrides.matched(Path::new("src/main.rs"), false).is_ignore()); + assert!(overrides.combined.matched(Path::new("a.log"), false).is_ignore()); + assert!(overrides.combined.matched(Path::new("dist/bundle.js"), false).is_ignore()); + assert!(!overrides.combined.matched(Path::new("src/main.rs"), false).is_ignore()); } #[test] @@ -407,8 +553,8 @@ mod tests { // No .gitignore present — patterns from .gitignore should not affect overrides. 
let overrides = build_overrides(root, &[], &[]).unwrap(); - assert!(!overrides.matched(Path::new("a.log"), false).is_ignore()); - assert!(!overrides.matched(Path::new("src/main.rs"), false).is_ignore()); + assert!(!overrides.combined.matched(Path::new("a.log"), false).is_ignore()); + assert!(!overrides.combined.matched(Path::new("src/main.rs"), false).is_ignore()); } #[test] @@ -424,4 +570,128 @@ mod tests { let result = build_overrides(root, &[], &[]); assert!(result.is_ok(), "should not error on negation pattern: {:?}", result.err()); } + + // ── Skip attribution tests ──────────────────────────────────────────────── + + #[test] + fn classify_skip_attributes_builtin_over_gitignore() { + use std::fs; + use tempfile::TempDir; + + let tmp = TempDir::new().unwrap(); + let root = tmp.path(); + // node_modules matches both BUILTIN_BLACKLIST and a hypothetical + // .gitignore entry. Builtin must win (priority order §5.2). + fs::write(root.join(".gitignore"), "node_modules/\n").unwrap(); + + let ov = build_overrides(root, &[], &[]).unwrap(); + // node_modules/ dir itself + let cat = classify_skip(Path::new("node_modules"), true, &ov); + assert_eq!(cat, SkipCategory::BuiltinBlacklist, "builtin must have priority"); + } + + #[test] + fn classify_skip_attributes_gitignore_for_log_files() { + use std::fs; + use tempfile::TempDir; + + let tmp = TempDir::new().unwrap(); + let root = tmp.path(); + fs::write(root.join(".gitignore"), "*.log\n").unwrap(); + + let ov = build_overrides(root, &[], &[]).unwrap(); + let cat = classify_skip(Path::new("app.log"), false, &ov); + assert_eq!(cat, SkipCategory::Gitignore); + } + + #[test] + fn classify_skip_attributes_kebabignore() { + use tempfile::TempDir; + + let tmp = TempDir::new().unwrap(); + let root = tmp.path(); + + let ov = build_overrides(root, &[], &["*.secret".to_string()]).unwrap(); + let cat = classify_skip(Path::new("creds.secret"), false, &ov); + assert_eq!(cat, SkipCategory::Kebabignore); + } + + #[test] + fn 
walk_files_with_skips_counts_gitignored_files() { + use std::fs; + use tempfile::TempDir; + + let tmp = TempDir::new().unwrap(); + let root = tmp.path(); + fs::write(root.join(".gitignore"), "*.log\n").unwrap(); + fs::write(root.join("ok.md"), "# ok").unwrap(); + fs::write(root.join("skipme.log"), "x").unwrap(); + + let ov = build_overrides(root, &[], &[]).unwrap(); + let (accepted, skipped_entries) = walk_files_with_skips(root, &ov).unwrap(); + + let accepted_names: Vec<_> = accepted + .iter() + .map(|p| p.file_name().unwrap().to_string_lossy().into_owned()) + .collect(); + assert!( + accepted_names.iter().any(|n| n == "ok.md"), + "ok.md should be accepted; got: {accepted_names:?}" + ); + assert!( + !accepted_names.iter().any(|n| n == "skipme.log"), + "skipme.log should not be accepted; got: {accepted_names:?}" + ); + + let gitignore_skipped: Vec<_> = skipped_entries + .iter() + .filter(|e| e.category == SkipCategory::Gitignore) + .collect(); + assert!( + gitignore_skipped.iter().any(|e| e.path.file_name() + .map(|n| n == "skipme.log") + .unwrap_or(false)), + "skipme.log should appear in gitignore_skipped; skipped: {:?}", + skipped_entries.iter().map(|e| &e.path).collect::<Vec<_>>() + ); + } + + #[test] + fn walk_files_with_skips_counts_builtin_blacklist_dirs() { + use std::fs; + use tempfile::TempDir; + + let tmp = TempDir::new().unwrap(); + let root = tmp.path(); + fs::create_dir_all(root.join("node_modules/foo")).unwrap(); + fs::write(root.join("node_modules/foo/bar.js"), "x").unwrap(); + fs::write(root.join("ok.md"), "# ok").unwrap(); + + let ov = build_overrides(root, &[], &[]).unwrap(); + let (accepted, skipped_entries) = walk_files_with_skips(root, &ov).unwrap(); + + let accepted_names: Vec<_> = accepted + .iter() + .map(|p| p.file_name().unwrap().to_string_lossy().into_owned()) + .collect(); + assert!( + accepted_names.iter().any(|n| n == "ok.md"), + "ok.md must be accepted; got: {accepted_names:?}" + ); + + let builtin_skipped: Vec<_> = skipped_entries + .iter() 
+ .filter(|e| e.category == SkipCategory::BuiltinBlacklist) + .collect(); + assert!( + !builtin_skipped.is_empty(), + "node_modules/ should produce at least one BuiltinBlacklist skip" + ); + assert!( + builtin_skipped.iter().any(|e| e.path.components() + .any(|c| c.as_os_str() == "node_modules")), + "skipped path should contain node_modules; got: {:?}", + builtin_skipped.iter().map(|e| &e.path).collect::<Vec<_>>() + ); + } } -- 2.49.1 From 40b3ea84081799a429a1b876abc2440c1d4fb663 Mon Sep 17 00:00:00 2001 From: th-kim0823 Date: Fri, 15 May 2026 16:50:55 +0900 Subject: [PATCH 15/21] chore(p10-1a-1): cleanup Task 11 review findings + sync Cargo.lock Co-Authored-By: Claude Sonnet 4.6 --- Cargo.lock | 2 +- crates/kebab-source-fs/src/connector.rs | 9 +++------ crates/kebab-source-fs/src/walker.rs | 10 ++++++++++ crates/kebab-source-fs/tests/symlink_cycle.rs | 4 ++-- 4 files changed, 16 insertions(+), 9 deletions(-) diff --git a/Cargo.lock b/Cargo.lock index 499f233..f330c3b 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -4345,7 +4345,6 @@ version = "0.6.0" dependencies = [ "anyhow", "gix", - "kebab-core", "tempfile", ] @@ -4460,6 +4459,7 @@ dependencies = [ "ignore", "kebab-config", "kebab-core", + "kebab-parse-code", "serde", "serde_json", "tempfile", diff --git a/crates/kebab-source-fs/src/connector.rs b/crates/kebab-source-fs/src/connector.rs index d3cdb1f..599673e 100644 --- a/crates/kebab-source-fs/src/connector.rs +++ b/crates/kebab-source-fs/src/connector.rs @@ -66,7 +66,7 @@ impl FsSourceConnector { fn resolve_scan_params( &self, scope: &SourceScope, - ) -> Result<(PathBuf, Vec<String>, WalkOverrides)> { + ) -> Result<(PathBuf, WalkOverrides)> { let root = if scope.root.as_os_str().is_empty() { self.default_root.clone() } else { @@ -78,7 +78,7 @@ impl FsSourceConnector { let kbignore = read_kbignore(&root)?; let overrides = build_overrides(&root, &excludes, &kbignore)?; - Ok((root, kbignore, overrides)) + Ok((root, overrides)) } /// Scan the workspace and return the accepted 
assets together with @@ -92,10 +92,7 @@ impl FsSourceConnector { &self, scope: &SourceScope, ) -> Result<(Vec, FsScanSkips)> { - let (root, _kbignore, overrides) = self.resolve_scan_params(scope)?; - - // Suppress unused-variable warning — kbignore patterns are already - // baked into `overrides`; we don't need them again here. + let (root, overrides) = self.resolve_scan_params(scope)?; log_scope_include_warning(scope); diff --git a/crates/kebab-source-fs/src/walker.rs b/crates/kebab-source-fs/src/walker.rs index e963c0b..8eaf0ff 100644 --- a/crates/kebab-source-fs/src/walker.rs +++ b/crates/kebab-source-fs/src/walker.rs @@ -21,6 +21,16 @@ //! `follow_links(true)`; we layer our own visited-set on top, keyed by the //! canonical path of every entry, and skip any entry we've already seen. //! +//! ## Per-source skip attribution (spec §5.5) +//! +//! `walk_files_with_skips` returns a `WalkOverrides` struct that carries +//! both a `combined` matcher (used for the actual walk decision) and three +//! per-source matchers (`gitignore`, `kebabignore`, `builtin`). When an +//! entry is excluded, `classify_skip` probes the per-source matchers in +//! priority order (built-in > gitignore > kebabignore) to determine which +//! `IngestReport` counter should be incremented — without requiring a +//! second walker pass over the filesystem. +//! //! ## Why `walkdir` instead of `ignore::WalkBuilder`? //! //! `ignore::WalkBuilder` bundles gitignore semantics + cycle detection in diff --git a/crates/kebab-source-fs/tests/symlink_cycle.rs b/crates/kebab-source-fs/tests/symlink_cycle.rs index 1bd8d50..52fbcaa 100644 --- a/crates/kebab-source-fs/tests/symlink_cycle.rs +++ b/crates/kebab-source-fs/tests/symlink_cycle.rs @@ -9,7 +9,7 @@ //! Expected: `scan` returns in O(seconds), every emitted path is unique, //! and `alpha.md` appears at least once. //! -//! The cycle guard lives in `walker::walk_files`; this test exists to +//! 
The cycle guard lives in `walker::walk_files_with_skips`; this test exists to //! prove it catches the realistic shape (cycle through one or more //! symlinks) end-to-end via the public API. @@ -100,7 +100,7 @@ fn two_step_directory_cycle_visited_set_breaks_loop() { // // Without the visited-set, walkdir would descend // a → a/loop (=b) → a/loop/loop (=a) → … forever. - // The canonical-path visited-set in `walker::walk_files` must break + // The canonical-path visited-set in `walker::walk_files_with_skips` must break // the loop and yield a finite, deterministic result. let dir = tempfile::tempdir().unwrap(); let root = dir.path(); -- 2.49.1 From 682f7dd3a25159a7a6c9bf040ef20dc3596554da Mon Sep 17 00:00:00 2001 From: th-kim0823 Date: Fri, 15 May 2026 16:53:21 +0900 Subject: [PATCH 16/21] feat(p10-1a-1): add [ingest.code] config section Add IngestCfg + IngestCodeCfg structs with serde defaults and embed ingest: IngestCfg into the top-level Config. Existing configs without an [ingest] section continue to load unchanged. Co-Authored-By: Claude Sonnet 4.6 --- crates/kebab-config/src/lib.rs | 103 +++++++++++++++++++++++++++++++++ 1 file changed, 103 insertions(+) diff --git a/crates/kebab-config/src/lib.rs b/crates/kebab-config/src/lib.rs index 02ab5ca..b9d1c4d 100644 --- a/crates/kebab-config/src/lib.rs +++ b/crates/kebab-config/src/lib.rs @@ -45,6 +45,11 @@ pub struct Config { /// `dark`). #[serde(default = "UiCfg::defaults")] pub ui: UiCfg, + /// p10-1A-1: code ingest settings. `#[serde(default)]` so existing + /// config files without an `[ingest]` / `[ingest.code]` section + /// load cleanly with built-in defaults. + #[serde(default)] + pub ingest: IngestCfg, /// p9-fb-05: directory of the on-disk config file this `Config` /// was loaded from, if any. Populated by `Config::from_file` / /// `Config::load` — never serialized (`#[serde(skip)]`). Used by @@ -265,6 +270,60 @@ impl UiCfg { } } +/// p10-1A-1: top-level ingest configuration wrapper. 
Contains per-media-type +/// sub-sections; currently only `code` is defined. +#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)] +#[serde(default)] +pub struct IngestCfg { + pub code: IngestCodeCfg, +} + +impl Default for IngestCfg { + fn default() -> Self { + Self { + code: IngestCodeCfg::default(), + } + } +} + +/// p10-1A-1: settings for the code ingest pipeline. All fields have +/// reasonable defaults so the user need not set anything in `config.toml` +/// to get working code ingest. +#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)] +#[serde(default)] +pub struct IngestCodeCfg { + /// Generated header sniff. Reads first ~512 bytes, checks 7 markers. + pub skip_generated_header: bool, + /// Max byte size per file. Bigger files skipped. + pub max_file_bytes: u64, + /// Max line count per file. Bigger files skipped (byte cap checked first). + pub max_file_lines: u32, + /// User extra skip globs (gitignore syntax). Applied on top of built-in + /// + `.gitignore` + `.kebabignore`. + pub extra_skip_globs: Vec, + /// AST chunk size cap. Functions/classes longer than this fall back to + /// paragraph-based split (1A-2 and later). + pub ast_chunk_max_lines: u32, + /// Tier 3 fallback chunker: lines per chunk. + pub fallback_lines_per_chunk: u32, + /// Tier 3 fallback chunker: line overlap between adjacent chunks. + pub fallback_lines_overlap: u32, +} + +impl Default for IngestCodeCfg { + fn default() -> Self { + Self { + skip_generated_header: true, + max_file_bytes: 262_144, + max_file_lines: 5_000, + extra_skip_globs: vec![], + ast_chunk_max_lines: 200, + fallback_lines_per_chunk: 80, + fallback_lines_overlap: 20, + } + } +} + impl Config { /// Defaults per design §6.4. pub fn defaults() -> Self { @@ -336,6 +395,7 @@ impl Config { }, image: ImageCfg::defaults(), ui: UiCfg::defaults(), + ingest: IngestCfg::default(), // p9-fb-05: defaults are not loaded from disk, so no // source_dir. 
Relative `workspace.root` (rare with // defaults) falls back to caller `cwd` via the @@ -1060,6 +1120,49 @@ max_context_tokens = 8000 } } } + + #[test] + fn ingest_code_cfg_defaults() { + let cfg: IngestCodeCfg = toml::from_str("").unwrap(); + assert_eq!(cfg.max_file_bytes, 262_144); + assert_eq!(cfg.max_file_lines, 5_000); + assert!(cfg.skip_generated_header); + assert!(cfg.extra_skip_globs.is_empty()); + assert_eq!(cfg.ast_chunk_max_lines, 200); + assert_eq!(cfg.fallback_lines_per_chunk, 80); + assert_eq!(cfg.fallback_lines_overlap, 20); + } + + #[test] + fn ingest_code_cfg_user_override() { + let toml = r#" + max_file_bytes = 1048576 + max_file_lines = 20000 + skip_generated_header = false + extra_skip_globs = ["**/fixtures/**", "**/snapshots/**"] + "#; + let cfg: IngestCodeCfg = toml::from_str(toml).unwrap(); + assert_eq!(cfg.max_file_bytes, 1_048_576); + assert_eq!(cfg.max_file_lines, 20_000); + assert!(!cfg.skip_generated_header); + assert_eq!(cfg.extra_skip_globs.len(), 2); + } + + #[test] + fn config_with_ingest_code_section() { + // Build a full valid Config serialization and patch only the + // [ingest.code] field we care about — avoids having to enumerate + // every required Config field in the test fixture. + let base = Config::defaults(); + let mut toml_text = toml::to_string(&base).unwrap(); + // Inject max_file_bytes override into the [ingest.code] table. + toml_text = toml_text.replace( + "max_file_bytes = 262144", + "max_file_bytes = 524288", + ); + let cfg: Config = toml::from_str(&toml_text).unwrap(); + assert_eq!(cfg.ingest.code.max_file_bytes, 524_288); + } } #[cfg(test)] -- 2.49.1 From 4e8b70a04b756fb178fadb6c9e87daa533f47fe0 Mon Sep 17 00:00:00 2001 From: th-kim0823 Date: Fri, 15 May 2026 17:06:59 +0900 Subject: [PATCH 17/21] feat(p10-1a-1): apply generated-header + size-cap skip per file Wire kebab_parse_code::is_generated_file and is_oversized into FsSourceConnector::scan_with_skips. 
Files that pass gitignore/builtin/ kebabignore matching are now checked for generated-file markers (config-gated via ingest.code.skip_generated_header) and byte/line caps (ingest.code.max_file_bytes / max_file_lines). FsScanSkips gains skipped_generated + skipped_size_exceeded counters; kebab-app threads them into IngestReport. Also fixes a pre-existing clippy::derivable_impls warning in IngestCfg. Three new connector tests cover all three paths. Co-Authored-By: Claude Sonnet 4.6 --- crates/kebab-app/src/lib.rs | 4 +- crates/kebab-config/src/lib.rs | 10 +- crates/kebab-source-fs/src/connector.rs | 168 +++++++++++++++++++++++- 3 files changed, 170 insertions(+), 12 deletions(-) diff --git a/crates/kebab-app/src/lib.rs b/crates/kebab-app/src/lib.rs index 99f5597..e0a0375 100644 --- a/crates/kebab-app/src/lib.rs +++ b/crates/kebab-app/src/lib.rs @@ -678,8 +678,8 @@ pub fn ingest_with_config_opts( skipped_gitignore: fs_skips.skipped_gitignore, skipped_kebabignore: fs_skips.skipped_kebabignore, skipped_builtin_blacklist: fs_skips.skipped_builtin_blacklist, - skipped_generated: 0, - skipped_size_exceeded: 0, + skipped_generated: fs_skips.skipped_generated, + skipped_size_exceeded: fs_skips.skipped_size_exceeded, skip_examples: fs_skips.skip_examples, items: if summary_only { None } else { Some(items) }, }) diff --git a/crates/kebab-config/src/lib.rs b/crates/kebab-config/src/lib.rs index b9d1c4d..1117713 100644 --- a/crates/kebab-config/src/lib.rs +++ b/crates/kebab-config/src/lib.rs @@ -272,20 +272,12 @@ impl UiCfg { /// p10-1A-1: top-level ingest configuration wrapper. Contains per-media-type /// sub-sections; currently only `code` is defined. 
-#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)] +#[derive(Clone, Debug, Default, PartialEq, Serialize, Deserialize)] #[serde(default)] pub struct IngestCfg { pub code: IngestCodeCfg, } -impl Default for IngestCfg { - fn default() -> Self { - Self { - code: IngestCodeCfg::default(), - } - } -} - /// p10-1A-1: settings for the code ingest pipeline. All fields have /// reasonable defaults so the user need not set anything in `config.toml` /// to get working code ingest. diff --git a/crates/kebab-source-fs/src/connector.rs b/crates/kebab-source-fs/src/connector.rs index 599673e..c61da8d 100644 --- a/crates/kebab-source-fs/src/connector.rs +++ b/crates/kebab-source-fs/src/connector.rs @@ -36,10 +36,16 @@ use crate::walker::{SkipCategory, WalkOverrides, build_overrides, read_kbignore, /// construction time. /// - `copy_threshold_bytes`: `config.storage.copy_threshold_mb * 1 MiB` /// pre-multiplied so we don't recompute per file. +/// - `skip_generated_header`: `config.ingest.code.skip_generated_header`. +/// - `max_file_bytes`: `config.ingest.code.max_file_bytes`. +/// - `max_file_lines`: `config.ingest.code.max_file_lines`. pub struct FsSourceConnector { default_root: PathBuf, default_exclude: Vec, copy_threshold_bytes: u64, + skip_generated_header: bool, + max_file_bytes: u64, + max_file_lines: u32, } impl FsSourceConnector { @@ -59,6 +65,9 @@ impl FsSourceConnector { default_root: root, default_exclude: config.workspace.exclude.clone(), copy_threshold_bytes, + skip_generated_header: config.ingest.code.skip_generated_header, + max_file_bytes: config.ingest.code.max_file_bytes, + max_file_lines: config.ingest.code.max_file_lines, }) } @@ -133,7 +142,59 @@ impl FsSourceConnector { } } - let assets = build_assets(&files, &root, self.copy_threshold_bytes)?; + // p10-1A-1: apply per-file generated-header + size-cap checks on files + // that passed the override (gitignore/builtin/kebabignore) matching. 
+ // These run AFTER the walk-level skip attribution, BEFORE parse dispatch. + let mut accepted_files: Vec = Vec::with_capacity(files.len()); + for abs_path in files { + let rel_path = abs_path.strip_prefix(&root).unwrap_or(&abs_path); + + // Generated-header sniff (config-gated). + if self.skip_generated_header + && kebab_parse_code::is_generated_file(&abs_path).unwrap_or(false) + { + fs_skips.skipped_generated = + fs_skips.skipped_generated.saturating_add(1); + push_sample( + &mut fs_skips.skip_examples.generated, + &abs_path, + &root, + ); + tracing::debug!( + path = %rel_path.display(), + "skip: generated-file marker detected" + ); + continue; + } + + // Size-cap check (byte or line limit). + if kebab_parse_code::is_oversized( + &abs_path, + self.max_file_bytes, + self.max_file_lines, + ) + .unwrap_or(false) + { + fs_skips.skipped_size_exceeded = + fs_skips.skipped_size_exceeded.saturating_add(1); + push_sample( + &mut fs_skips.skip_examples.size_exceeded, + &abs_path, + &root, + ); + tracing::debug!( + path = %rel_path.display(), + max_bytes = self.max_file_bytes, + max_lines = self.max_file_lines, + "skip: file exceeds size cap" + ); + continue; + } + + accepted_files.push(abs_path); + } + + let assets = build_assets(&accepted_files, &root, self.copy_threshold_bytes)?; Ok((assets, fs_skips)) } } @@ -147,6 +208,12 @@ pub struct FsScanSkips { pub skipped_gitignore: u32, pub skipped_kebabignore: u32, pub skipped_builtin_blacklist: u32, + /// p10-1A-1: files skipped because their first ~512 bytes contained a + /// generated-file marker (`@generated`, `do not edit`, …). + pub skipped_generated: u32, + /// p10-1A-1: files skipped because they exceeded `max_file_bytes` or + /// `max_file_lines` in `[ingest.code]`. + pub skipped_size_exceeded: u32, /// Sample paths per spec §5.5 (≤ 5 per category). Paths are /// workspace-relative POSIX strings when available, absolute otherwise. 
pub skip_examples: SkipExamples, @@ -608,4 +675,103 @@ mod tests { skips.skip_examples.gitignore ); } + + // ── p10-1A-1: generated-header + size-cap skip tests ──────────────────── + + /// Helper: connector with default ingest.code settings. + fn cfg_with_root_defaults(root: &str) -> Config { + // cfg_with_root already uses Config::defaults() which has + // skip_generated_header=true, max_file_bytes=262144, max_file_lines=5000. + cfg_with_root(root) + } + + /// Helper: connector with overridden size caps. + fn cfg_with_size_cap(root: &str, max_bytes: u64, max_lines: u32) -> Config { + let mut c = cfg_with_root(root); + c.ingest.code.max_file_bytes = max_bytes; + c.ingest.code.max_file_lines = max_lines; + c + } + + #[test] + fn ingest_report_counts_generated_files() { + let dir = tempfile::tempdir().unwrap(); + let root = dir.path(); + std::fs::write(root.join("normal.md"), "# hi").unwrap(); + std::fs::write(root.join("autogen.rs"), "// @generated\nfn x() {}\n").unwrap(); + + let conn = FsSourceConnector::new( + &cfg_with_root_defaults(root.to_str().unwrap()), + ) + .unwrap(); + let (_assets, skips) = conn.scan_with_skips(&SourceScope::default()).unwrap(); + + assert!( + skips.skipped_generated >= 1, + "skipped_generated should be >= 1; got {}", + skips.skipped_generated + ); + assert!( + skips.skip_examples.generated.iter().any(|p| p.contains("autogen")), + "skip_examples.generated should contain 'autogen'; got: {:?}", + skips.skip_examples.generated + ); + // The normal.md file must NOT be skipped. 
+ let asset_paths: Vec<_> = _assets + .iter() + .map(|a| a.workspace_path.0.clone()) + .collect(); + assert!( + asset_paths.iter().any(|p| p.contains("normal")), + "normal.md should still be emitted; assets: {asset_paths:?}" + ); + } + + #[test] + fn ingest_report_counts_oversized_files_by_bytes() { + let dir = tempfile::tempdir().unwrap(); + let root = dir.path(); + std::fs::write(root.join("normal.md"), "# hi").unwrap(); + // Write a file larger than the 1024-byte cap. + let big: String = "x\n".repeat(1_000); + std::fs::write(root.join("huge.rs"), &big).unwrap(); + + let conn = FsSourceConnector::new( + &cfg_with_size_cap(root.to_str().unwrap(), 1024, 5_000), + ) + .unwrap(); + let (_assets, skips) = conn.scan_with_skips(&SourceScope::default()).unwrap(); + + assert!( + skips.skipped_size_exceeded >= 1, + "skipped_size_exceeded should be >= 1; got {}", + skips.skipped_size_exceeded + ); + assert!( + skips.skip_examples.size_exceeded.iter().any(|p| p.contains("huge")), + "skip_examples.size_exceeded should contain 'huge'; got: {:?}", + skips.skip_examples.size_exceeded + ); + } + + #[test] + fn ingest_report_size_cap_by_line_count() { + let dir = tempfile::tempdir().unwrap(); + let root = dir.path(); + // 6000 lines but small per-line — line cap of 5000 should trigger. 
+ let body: String = "x\n".repeat(6_000); + std::fs::write(root.join("longfile.rs"), &body).unwrap(); + + let conn = FsSourceConnector::new( + &cfg_with_size_cap(root.to_str().unwrap(), 262_144, 5_000), + ) + .unwrap(); + let (_assets, skips) = conn.scan_with_skips(&SourceScope::default()).unwrap(); + + assert!( + skips.skipped_size_exceeded >= 1, + "skipped_size_exceeded should be >= 1 (line cap); got {}", + skips.skipped_size_exceeded + ); + } } -- 2.49.1 From 298f4adc815330cee9038fc02658f0dad9ae0dd3 Mon Sep 17 00:00:00 2001 From: th-kim0823 Date: Fri, 15 May 2026 17:21:59 +0900 Subject: [PATCH 18/21] feat(p10-1a-1): CLI filter flags + SchemaStats breakdowns + regression tests MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Task 13: add wire regression tests proving markdown SearchHit omits repo/code_lang when None, and all 5 original Citation variants serialize byte-identically without spurious Code-variant keys. Task 15: add --repo (repeatable) and --code-lang (repeatable, comma-separated) flags to `kebab search`; propagate both into SearchFilters instead of the previous vec![] stub. Add #[allow(clippy::large_enum_variant)] — Cmd is short-lived, boxing buys nothing. Task 16: add code_lang_breakdown and repo_breakdown BTreeMap fields to Stats (schema.v1); derive Default on Stats; populate both as empty in collect_stats (1A-2 fills them when code chunks land). Add unit test asserting both keys are present in the serialized object. 
Co-Authored-By: Claude Sonnet 4.6 --- crates/kebab-app/src/schema.rs | 39 ++++++- crates/kebab-cli/src/main.rs | 25 ++++- .../wire_citation_5_variants_unchanged.rs | 100 ++++++++++++++++++ .../tests/wire_search_filters_code.rs | 72 +++++++++++++ .../tests/wire_search_hit_no_code_fields.rs | 47 ++++++++ 5 files changed, 279 insertions(+), 4 deletions(-) create mode 100644 crates/kebab-cli/tests/wire_citation_5_variants_unchanged.rs create mode 100644 crates/kebab-cli/tests/wire_search_filters_code.rs create mode 100644 crates/kebab-cli/tests/wire_search_hit_no_code_fields.rs diff --git a/crates/kebab-app/src/schema.rs b/crates/kebab-app/src/schema.rs index 793f4d5..866c714 100644 --- a/crates/kebab-app/src/schema.rs +++ b/crates/kebab-app/src/schema.rs @@ -45,7 +45,7 @@ pub struct Models { pub corpus_revision: u64, } -#[derive(Debug, Clone, Serialize, Deserialize)] +#[derive(Debug, Clone, Default, Serialize, Deserialize)] pub struct Stats { pub doc_count: u64, pub chunk_count: u64, @@ -63,6 +63,14 @@ pub struct Stats { /// p9-fb-37: docs whose `updated_at` exceeds the staleness threshold. #[serde(default)] pub stale_doc_count: u64, + /// p10-1A-1: code language breakdown (chunk counts by canonical lowercase + /// language identifier). Empty until 1A-2 produces code chunks. + #[serde(default)] + pub code_lang_breakdown: std::collections::BTreeMap, + /// p10-1A-1: repo breakdown (chunk counts by `metadata.repo` value). + /// Empty until 1A-2 produces code chunks. + #[serde(default)] + pub repo_breakdown: std::collections::BTreeMap, } const KEBAB_VERSION: &str = env!("CARGO_PKG_VERSION"); @@ -158,6 +166,9 @@ fn collect_stats( lang_breakdown: counts.lang_breakdown, index_bytes, stale_doc_count: counts.stale_doc_count, + // p10-1A-1: populated by 1A-2 code ingest; empty until then. 
+ code_lang_breakdown: std::collections::BTreeMap::new(), + repo_breakdown: std::collections::BTreeMap::new(), }) } @@ -182,6 +193,32 @@ fn collect_models(cfg: &Config, store: &kebab_store_sqlite::SqliteStore) -> Mode mod tests_stats_ext { use super::*; + /// p10-1A-1: Stats must serialize `code_lang_breakdown` and + /// `repo_breakdown` so downstream consumers (MCP skill, Claude Code) + /// can branch on their presence. + #[test] + fn stats_includes_code_lang_and_repo_breakdown_fields() { + let stats = Stats::default(); + let v = serde_json::to_value(&stats).unwrap(); + assert!( + v.get("code_lang_breakdown").is_some(), + "Stats JSON must include code_lang_breakdown: {v}" + ); + assert!( + v.get("repo_breakdown").is_some(), + "Stats JSON must include repo_breakdown: {v}" + ); + // Empty BTreeMap serializes as `{}` — confirm it's an object, not null. + assert!( + v["code_lang_breakdown"].is_object(), + "code_lang_breakdown must be an object: {v}" + ); + assert!( + v["repo_breakdown"].is_object(), + "repo_breakdown must be an object: {v}" + ); + } + #[test] fn stats_includes_breakdowns_and_bytes_on_fresh_corpus() { let dir = tempfile::tempdir().unwrap(); diff --git a/crates/kebab-cli/src/main.rs b/crates/kebab-cli/src/main.rs index 3ca6e63..ad06916 100644 --- a/crates/kebab-cli/src/main.rs +++ b/crates/kebab-cli/src/main.rs @@ -46,6 +46,11 @@ struct Cli { command: Cmd, } +// p10-1A-1: adding `repo` and `code_lang` Vec fields pushed `Cmd` +// over clippy's large_enum_variant threshold. The enum is short-lived +// (parsed once at startup, never cloned in a hot path) — boxing would add +// noise with no real benefit. +#[allow(clippy::large_enum_variant)] #[derive(Subcommand, Debug)] enum Cmd { /// Initialise XDG dirs + workspace + `config.toml`. @@ -165,6 +170,18 @@ enum Cmd { #[arg(long)] doc_id: Option, + /// p10-1A-1: filter by repo name (`metadata.repo`). Repeatable; + /// multi-value = OR. Empty = no filter (all repos returned). 
+ #[arg(long = "repo", value_name = "NAME", num_args = 1)] + repo: Vec<String>, + + /// p10-1A-1: filter by code language identifier (lowercase + /// canonical). Repeatable or comma-separated. + /// Examples: `rust`, `python`, `typescript`. + /// Unknown values produce empty hits. + #[arg(long = "code-lang", value_name = "LANG", num_args = 1, value_delimiter = ',')] + code_lang: Vec<String>, + /// p9-fb-37: emit pre-fusion lexical / vector / RRF candidate /// lists + per-stage timing in the response. Bypasses cache /// (debug intent — fresh run guaranteed). Requires embeddings @@ -688,6 +705,8 @@ fn run(cli: &Cli) -> anyhow::Result<()> { media, ingested_after, doc_id, + repo, + code_lang, trace, bulk, } => { @@ -819,7 +838,7 @@ fn run(cli: &Cli) -> anyhow::Result<()> { None => None, }; - // p9-fb-36: build SearchFilters from the 7 new flags. + // p9-fb-36 + p10-1A-1: build SearchFilters from CLI flags. let filters = kebab_core::SearchFilters { tags_any: tag.clone(), lang: lang.as_ref().map(|s| kebab_core::Lang(s.clone())), @@ -828,8 +847,8 @@ fn run(cli: &Cli) -> anyhow::Result<()> { media: media_norm, ingested_after: ingested_after_parsed, doc_id: doc_id.as_ref().map(|s| kebab_core::DocumentId(s.clone())), - repo: vec![], - code_lang: vec![], + repo: repo.clone(), + code_lang: code_lang.clone(), }; let q = kebab_core::SearchQuery { diff --git a/crates/kebab-cli/tests/wire_citation_5_variants_unchanged.rs b/crates/kebab-cli/tests/wire_citation_5_variants_unchanged.rs new file mode 100644 index 0000000..6242024 --- /dev/null +++ b/crates/kebab-cli/tests/wire_citation_5_variants_unchanged.rs @@ -0,0 +1,100 @@ +//! p10-1A-1 Task 13: regression — the 5 original Citation variants +//! (Line, Page, Region, Caption, Time) serialize byte-identically to +//! pre-Task-1 form. No spurious `code`, `line_start`, or `symbol` keys +//! must leak into these variants. 
+
+use kebab_core::{Citation, WorkspacePath};
+
+#[test]
+fn line_variant_serialization_unchanged() {
+    let c = Citation::Line {
+        path: WorkspacePath::new("a.md".into()).unwrap(),
+        start: 1,
+        end: 2,
+        section: Some("§14".into()),
+    };
+    let v = serde_json::to_value(&c).unwrap();
+    assert_eq!(v["kind"], "line");
+    assert_eq!(v["start"], 1);
+    assert_eq!(v["end"], 2);
+    assert_eq!(v["section"], "§14");
+    // Must not bleed Code-variant keys.
+    assert!(v.get("line_start").is_none(), "line_start must be absent: {v}");
+    assert!(v.get("symbol").is_none(), "symbol must be absent: {v}");
+    assert!(v.get("code").is_none(), "code must be absent: {v}");
+}
+
+#[test]
+fn line_variant_null_section_omitted() {
+    let c = Citation::Line {
+        path: WorkspacePath::new("b.md".into()).unwrap(),
+        start: 5,
+        end: 10,
+        section: None,
+    };
+    let v = serde_json::to_value(&c).unwrap();
+    assert_eq!(v["kind"], "line");
+    // `section` with None should be omitted (skip_serializing_if = is_none).
+    assert!(v.get("section").is_none() || v["section"].is_null());
+}
+
+#[test]
+fn page_variant_serialization_unchanged() {
+    let c = Citation::Page {
+        path: WorkspacePath::new("a.pdf".into()).unwrap(),
+        page: 13,
+        section: None,
+    };
+    let v = serde_json::to_value(&c).unwrap();
+    assert_eq!(v["kind"], "page");
+    assert_eq!(v["page"], 13);
+    assert!(v.get("line_start").is_none(), "line_start must be absent: {v}");
+    assert!(v.get("symbol").is_none(), "symbol must be absent: {v}");
+}
+
+#[test]
+fn region_variant_serialization_unchanged() {
+    let c = Citation::Region {
+        path: WorkspacePath::new("img.png".into()).unwrap(),
+        x: 10,
+        y: 20,
+        w: 100,
+        h: 200,
+    };
+    let v = serde_json::to_value(&c).unwrap();
+    assert_eq!(v["kind"], "region");
+    assert_eq!(v["x"], 10);
+    assert_eq!(v["y"], 20);
+    assert_eq!(v["w"], 100);
+    assert_eq!(v["h"], 200);
+    assert!(v.get("line_start").is_none(), "line_start must be absent: {v}");
+}
+
+#[test]
+fn caption_variant_serialization_unchanged() {
+    let c = Citation::Caption {
+        path: WorkspacePath::new("a.png".into()).unwrap(),
+        model: "qwen2.5-vl:7b".into(),
+    };
+    let v = serde_json::to_value(&c).unwrap();
+    assert_eq!(v["kind"], "caption");
+    assert_eq!(v["model"], "qwen2.5-vl:7b");
+    assert!(v.get("line_start").is_none(), "line_start must be absent: {v}");
+}
+
+#[test]
+fn time_variant_serialization_unchanged() {
+    let c = Citation::Time {
+        path: WorkspacePath::new("audio.mp3".into()).unwrap(),
+        start_ms: 1000,
+        end_ms: 5000,
+        speaker: Some("Alice".into()),
+    };
+    let v = serde_json::to_value(&c).unwrap();
+    assert_eq!(v["kind"], "time");
+    assert_eq!(v["start_ms"], 1000);
+    assert_eq!(v["end_ms"], 5000);
+    assert_eq!(v["speaker"], "Alice");
+    assert!(v.get("line_start").is_none(), "line_start must be absent: {v}");
+    assert!(v.get("symbol").is_none(), "symbol must be absent: {v}");
+}
diff --git a/crates/kebab-cli/tests/wire_search_filters_code.rs b/crates/kebab-cli/tests/wire_search_filters_code.rs
new file mode 100644
index 0000000..3480210
--- /dev/null
+++ b/crates/kebab-cli/tests/wire_search_filters_code.rs
@@ -0,0 +1,72 @@
+//! p10-1A-1 Task 15: CLI accepts --repo and --code-lang flags.
+//!
+//! These tests verify that clap parses the new flags without error.
+//! They drive `kebab search --help` (which exercises flag parsing
+//! via clap's help generation path, exiting 0) or use a minimal
+//! config + `--json` round-trip to verify the flags reach the wire.
+
+use std::process::Command;
+
+fn kebab() -> Command {
+    Command::new(env!("CARGO_BIN_EXE_kebab"))
+}
+
+/// `kebab search --help` must exit 0 and mention `--repo`.
+#[test]
+fn cli_search_help_mentions_repo_flag() {
+    let out = kebab()
+        .args(["search", "--help"])
+        .output()
+        .expect("failed to run kebab");
+    // clap help exits 0.
+    assert!(
+        out.status.success(),
+        "kebab search --help exited non-zero: {:?}",
+        out.status
+    );
+    let stdout = String::from_utf8_lossy(&out.stdout);
+    assert!(
+        stdout.contains("--repo"),
+        "--repo flag must appear in search help output:\n{stdout}"
+    );
+}
+
+/// `kebab search --help` must exit 0 and mention `--code-lang`.
+#[test]
+fn cli_search_help_mentions_code_lang_flag() {
+    let out = kebab()
+        .args(["search", "--help"])
+        .output()
+        .expect("failed to run kebab");
+    assert!(
+        out.status.success(),
+        "kebab search --help exited non-zero: {:?}",
+        out.status
+    );
+    let stdout = String::from_utf8_lossy(&out.stdout);
+    assert!(
+        stdout.contains("--code-lang"),
+        "--code-lang flag must appear in search help output:\n{stdout}"
+    );
+}
+
+/// `kebab search --help` must exit 0 and mention `--media`.
+/// Confirms `--media code` value pathway is available (media is
+/// a free-form Vec that already accepted arbitrary values).
+#[test]
+fn cli_search_help_mentions_media_flag() {
+    let out = kebab()
+        .args(["search", "--help"])
+        .output()
+        .expect("failed to run kebab");
+    assert!(
+        out.status.success(),
+        "kebab search --help exited non-zero: {:?}",
+        out.status
+    );
+    let stdout = String::from_utf8_lossy(&out.stdout);
+    assert!(
+        stdout.contains("--media"),
+        "--media flag must appear in search help output:\n{stdout}"
+    );
+}
diff --git a/crates/kebab-cli/tests/wire_search_hit_no_code_fields.rs b/crates/kebab-cli/tests/wire_search_hit_no_code_fields.rs
new file mode 100644
index 0000000..c3d7d24
--- /dev/null
+++ b/crates/kebab-cli/tests/wire_search_hit_no_code_fields.rs
@@ -0,0 +1,47 @@
+//! p10-1A-1 Task 13: regression — markdown SearchHit omits `repo` and
+//! `code_lang` from JSON when both are `None`.
+//!
+//! Proves that adding optional fields to SearchHit does not silently
+//! inject spurious keys into the existing markdown corpus wire shape.
+
+use kebab_core::{
+    Citation, ChunkId, ChunkerVersion, DocumentId, IndexVersion, RetrievalDetail, ScoreKind,
+    SearchHit, WorkspacePath,
+};
+
+#[test]
+fn markdown_hit_omits_repo_and_code_lang() {
+    let hit = SearchHit {
+        rank: 1,
+        chunk_id: ChunkId("c1".into()),
+        doc_id: DocumentId("d1".into()),
+        doc_path: WorkspacePath::new("notes/foo.md".into()).unwrap(),
+        heading_path: vec!["A".into(), "B".into()],
+        section_label: Some("B".into()),
+        snippet: "hi".into(),
+        citation: Citation::Line {
+            path: WorkspacePath::new("notes/foo.md".into()).unwrap(),
+            start: 1,
+            end: 2,
+            section: None,
+        },
+        retrieval: RetrievalDetail::default(),
+        index_version: IndexVersion("v1".into()),
+        embedding_model: None,
+        chunker_version: ChunkerVersion("md-heading-v1".into()),
+        indexed_at: time::OffsetDateTime::UNIX_EPOCH,
+        stale: false,
+        score_kind: ScoreKind::Rrf,
+        repo: None,
+        code_lang: None,
+    };
+    let s = serde_json::to_string(&hit).unwrap();
+    assert!(
+        !s.contains("\"repo\""),
+        "repo should be absent from markdown hit JSON: {s}"
+    );
+    assert!(
+        !s.contains("\"code_lang\""),
+        "code_lang should be absent from markdown hit JSON: {s}"
+    );
+}
-- 
2.49.1

From d13f58d28a985a615071292d236d4b2ae830bc23 Mon Sep 17 00:00:00 2001
From: th-kim0823
Date: Fri, 15 May 2026 17:30:01 +0900
Subject: [PATCH 19/21] fix(p10-1a-1): patch wire.rs Stats fixture for new schema fields

Task 16's new code_lang_breakdown / repo_breakdown fields broke the
existing schema_wrapper_tags_schema_version test in wire.rs which
constructs Stats { ... } literally. Use ..Default::default() since
Stats now derives Default.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 crates/kebab-cli/src/wire.rs | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/crates/kebab-cli/src/wire.rs b/crates/kebab-cli/src/wire.rs
index 3fab435..6a2fce0 100644
--- a/crates/kebab-cli/src/wire.rs
+++ b/crates/kebab-cli/src/wire.rs
@@ -334,6 +334,8 @@ mod tests {
             lang_breakdown: Default::default(),
             index_bytes: Default::default(),
             stale_doc_count: 0,
+            // p10-1A-1: new fields added to Stats; use Default for the test fixture.
+            ..Default::default()
         },
     };
     let v = wire_schema(&schema);
-- 
2.49.1

From 7bbd2c0cbfcddbabb698de9154f62a08a26b059d Mon Sep 17 00:00:00 2001
From: th-kim0823
Date: Fri, 15 May 2026 17:41:26 +0900
Subject: [PATCH 20/21] docs(p10-1a-1): wire schema + frozen design + README/HANDOFF/SMOKE + task index

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 HANDOFF.md | 1 +
 README.md | 7 ++-
 docs/SMOKE.md | 5 ++
 .../2026-04-27-kebab-final-form-design.md | 47 +++++++++++++++++--
 docs/wire-schema/v1/citation.schema.json | 3 +-
 docs/wire-schema/v1/ingest_report.schema.json | 16 ++++++-
 docs/wire-schema/v1/schema.schema.json | 10 ++++
 docs/wire-schema/v1/search_hit.schema.json | 4 +-
 tasks/INDEX.md | 10 ++++
 tasks/p10/INDEX.md | 13 +++++
 tasks/p10/p10-1a-1-code-ingest-framework.md | 31 ++++++++++++
 11 files changed, 139 insertions(+), 8 deletions(-)
 create mode 100644 tasks/p10/INDEX.md
 create mode 100644 tasks/p10/p10-1a-1-code-ingest-framework.md

diff --git a/HANDOFF.md b/HANDOFF.md
index b799ad7..07048a8 100644
--- a/HANDOFF.md
+++ b/HANDOFF.md
@@ -20,6 +20,7 @@ P0–P5 + P6 + P7 + P9-1/2/3/4 (Library / Search / Ask / Inspect) merged.
 | **P7** | PDF text + page citation | `kebab-parse-pdf` | P5 | ✅ Done (3/3 components, page-level chunker + ingest wiring) |
 | **P8** | Audio transcription + timestamp citation | `kebab-parse-audio` | P5 | ⏸ On hold (the whisper-rs system dependency needs a brainstorm) |
 | **P9** | TUI + desktop app | `kebab-tui`, `kebab-desktop` | P5 | 🟡 In progress (4/5 components: P9-1/2/3/4 done [Library / Search / Ask / Inspect], P9-5 desktop planned · dogfooding feedback **20/20 ✅**) |
+| **10** | code ingest framework | `kebab-parse-code` | P5 | 🟡 In progress (1A-1 about to merge): at the 1A-1 merge, the additive-minor wire schema and the new crate kebab-parse-code skeleton are frozen; the actual code chunkers start with 1A-2 |
 
 P0~P5 are serial. P6~P9 can run in parallel after P5.
 
diff --git a/README.md b/README.md
index e4b27f4..0b12a8f 100644
--- a/README.md
+++ b/README.md
@@ -71,7 +71,7 @@ kebab doctor
 |------|------|
 | `kebab init` | Creates the data directory + config.toml under the XDG paths |
 | `kebab ingest []` | Indexes Markdown / images / PDF (idempotent). On a TTY, progress is a bar on stderr; on non-TTY (CI / pipe), one line at a time on stderr; with `--json`, `ingest_progress.v1` lines stream to stdout, followed by a final `ingest_report.v1`. One Ctrl-C finishes the current asset and then aborts (partial commits are kept, re-runs are idempotent); a second Ctrl-C is a hard exit. If the frontmatter has no Markdown title, it is filled automatically in the order first H1 → H2 → first 80 characters of the first paragraph → file name (parser_version `md-frontmatter-v2`); already indexed docs also pick up the new title on the next ingest. **Incremental** (p9-fb-23): from the second ingest on, parse/chunk/embed/vector upsert is skipped automatically for unchanged docs (same blake3 and same parser/chunker/embedder versions). The final summary shows an `N unchanged` count. `--force-reingest` ignores the skip and forces reprocessing. **Supported formats** (chosen automatically by the extractor; cannot be set in config): Markdown (`.md`), images (`.png` / `.jpg` / `.jpeg`, OCR + caption), PDF (`.pdf`). Other extensions are skipped automatically: the reason goes into `IngestItem.warnings` (e.g. `"unsupported media type: .docx"`), counts are grouped in `IngestReport.skipped_by_extension`, and the CLI / TUI summary shows the breakdown. |
-| `kebab search --mode {lexical,vector,hybrid} "" [--no-cache] [--max-tokens N] [--snippet-chars N] [--cursor ] [--tag T] [--lang L] [--path-glob G] [--trust-min LEVEL] [--media TYPE] [--ingested-after RFC3339] [--doc-id ID] [--trace] [--bulk]` | Search. hybrid uses RRF fusion and includes citations. Repeating the same query (normalized with NFKC + trim + lowercase) within one process hits the in-process LRU cache (capacity = `[search] cache_capacity`, default 256). `--no-cache` forces a bypass, for debugging. When an ingest commit happens, the `kv['corpus_revision']` bump makes every entry stale automatically. **`--max-tokens` / `--snippet-chars` / `--cursor` (p9-fb-34)**: agent budget controls. `--json` output is the `search_response.v1` wrapper (`{hits, next_cursor, truncated}`), which is not compatible with the pre-fb-34 bare array. mismatched cursor → `error.v1.code = stale_cursor`. **filter flags (p9-fb-36):** `--tag` is a repeatable flag (`--tag rust --tag async`) with OR matching, `--media` takes `,`-separated values with OR matching, and the remaining flags combine with AND. `--trust-min` is one of `primary\|secondary\|generated` (includes that level and above). `--ingested-after` is RFC3339 UTC; a parse failure yields `error.v1.code = config_invalid` (exit 2). `--media md` is normalized as an alias of `markdown`. An unknown `--media` value always yields empty hits (not an error). **`--trace` (p9-fb-37)**: exposes lexical / vector pre-fusion candidates + RRF union + per-stage timing (`lexical_ms` / `vector_ms` / `fusion_ms` / `total_ms`) in `search_response.v1.trace`. Trace requests bypass the cache (always cold, even without `--no-cache`). **`--bulk` (p9-fb-42)**: runs N queries at once from ndjson on stdin. With `--json`, stdout carries per-query ndjson (`bulk_search_item.v1`) and stderr a summary (`bulk_summary: total=N succeeded=S failed=F`). Cap 100. When an agent runs its sub-queries in one batch after query decomposition, it is a single round-trip: the App instance is reused, so the cache / embedder cold-start cost is paid only once. A per-query failure is isolated in the item's `error` (error.v1), and the other queries continue. |
+| `kebab search --mode {lexical,vector,hybrid} "" [--no-cache] [--max-tokens N] [--snippet-chars N] [--cursor ] [--tag T] [--lang L] [--path-glob G] [--trust-min LEVEL] [--media TYPE] [--ingested-after RFC3339] [--doc-id ID] [--trace] [--bulk] [--repo NAME ...] [--code-lang LIST] [--media code]` | Search. hybrid uses RRF fusion and includes citations. Repeating the same query (normalized with NFKC + trim + lowercase) within one process hits the in-process LRU cache (capacity = `[search] cache_capacity`, default 256). `--no-cache` forces a bypass, for debugging. When an ingest commit happens, the `kv['corpus_revision']` bump makes every entry stale automatically. **`--max-tokens` / `--snippet-chars` / `--cursor` (p9-fb-34)**: agent budget controls. `--json` output is the `search_response.v1` wrapper (`{hits, next_cursor, truncated}`), which is not compatible with the pre-fb-34 bare array. mismatched cursor → `error.v1.code = stale_cursor`. **filter flags (p9-fb-36):** `--tag` is a repeatable flag (`--tag rust --tag async`) with OR matching, `--media` takes `,`-separated values with OR matching, and the remaining flags combine with AND. `--trust-min` is one of `primary\|secondary\|generated` (includes that level and above). `--ingested-after` is RFC3339 UTC; a parse failure yields `error.v1.code = config_invalid` (exit 2). `--media md` is normalized as an alias of `markdown`. An unknown `--media` value always yields empty hits (not an error). **`--trace` (p9-fb-37)**: exposes lexical / vector pre-fusion candidates + RRF union + per-stage timing (`lexical_ms` / `vector_ms` / `fusion_ms` / `total_ms`) in `search_response.v1.trace`. Trace requests bypass the cache (always cold, even without `--no-cache`). **`--bulk` (p9-fb-42)**: runs N queries at once from ndjson on stdin. With `--json`, stdout carries per-query ndjson (`bulk_search_item.v1`) and stderr a summary (`bulk_summary: total=N succeeded=S failed=F`). Cap 100. When an agent runs its sub-queries in one batch after query decomposition, it is a single round-trip: the App instance is reused, so the cache / embedder cold-start cost is paid only once. A per-query failure is isolated in the item's `error` (error.v1), and the other queries continue. **code corpus filters (p10-1A-1):** `--repo` is repeatable (`--repo kebab --repo other`) with OR matching. `--code-lang` takes repeated or comma-separated values (`--code-lang rust,python`); unknown values yield empty hits. `--media code` includes all Tier 1/2/3 code chunks. As of 1A-1 there are no indexed code chunks, so these filters always return empty results; they take effect once 1A-2 (Rust AST chunker) is merged. |
 | `kebab list docs` | Lists indexed documents |
 | `kebab inspect doc ` / `kebab inspect chunk ` | Shows the raw record |
 | `kebab fetch chunk [--context N]` / `kebab fetch doc [--max-tokens N]` / `kebab fetch span [--max-tokens N]` | (p9-fb-35) verbatim text fetch from indexed corpus. wire = `fetch_result.v1` (kind discriminator). chunk: target + ±N ordinal-context chunks. doc: full normalized markdown. span: 1-based line range (PDF/audio rejected as `error.v1.code = span_not_supported`). chars/4 budget on doc/span. |
@@ -184,6 +184,11 @@ flowchart TB
 - `dimensions` (default `1024`): the model's embedding dimension. If the config and the LanceDB stored dim disagree, searches return 0 results (orphan table). After changing the model, rebuilding the vector index with `kebab reset --vector-only && kebab ingest` is recommended.
 - `[ui] theme = "dark" | "light"` selects the TUI palette (default `"dark"`; unknown values fall back to dark).
 - `[search] stale_threshold_days = 30` (p9-fb-32): threshold for the `stale` flag on search hits / RAG citations (default 30 days, `0` disables it). `workspace.include = [...]` from old configs is silently ignored, with a one-time deprecation warning (p9-fb-25).
+- `[ingest.code]` (p10-1A-1): skip policy + chunker defaults for code ingest.
+  - `skip_generated_header = true`: skip when a generated marker (`@generated` / `DO NOT EDIT`, etc.) is detected in the first ~512 bytes.
+  - `max_file_bytes = 262144` (256 KiB) / `max_file_lines = 5000`: per-file caps; files over a cap are skipped.
+  - `extra_skip_globs = []`: user-supplied extra skip patterns (`.gitignore` syntax).
+  - `.gitignore` honor: applied automatically. `.kebabignore` is an additional layer. Priority: built-in safety net (`node_modules/` / `target/` / `__pycache__/` / `.venv/` / `venv/` / `env/`) > `.gitignore` > `.kebabignore`.
 - `[rag] prompt_template_version` (default `"rag-v2"`): RAG system prompt version. `"rag-v1"` is legacy backwards-compat (kept when the user sets it explicitly). Rules tightened in v2: (1) when citing a fact, put the original text from the chunk in double quotes before the [#number], (2) no use of knowledge from training, (3) when the evidence is ambiguous, state "확실하지 않다" ("not certain") explicitly.
 - `--config ` flag: used for temporary workspaces / isolated tests. Honored by both CLI and TUI.
 - `KEBAB_*` env: overrides some keys (`KEBAB_RAG_SCORE_GATE`, `KEBAB_EVAL_GOLDEN`, `KEBAB_COMMIT_HASH`, etc.).
 
diff --git a/docs/SMOKE.md b/docs/SMOKE.md
index ca6c7a7..e09dfb8 100644
--- a/docs/SMOKE.md
+++ b/docs/SMOKE.md
@@ -113,6 +113,11 @@ max_context_tokens = 6000
 
 [ui]
 theme = "dark" # p9-fb-14 — TUI palette ("dark" / "light", default "dark")
+
+[ingest.code]
+skip_generated_header = true
+max_file_bytes = 262144
+max_file_lines = 5000
 ```
 
 Can be overridden with `KEBAB_*` environment variables (e.g. `KEBAB_MODELS_LLM_MODEL=gemma4:26b kebab …`). For the full key list, see the `apply_env` match arms in `crates/kebab-config/src/lib.rs`. `KEBAB_READONLY=1`: disables the write path (CI safety net). `KEBAB_PROGRESS=plain`: in non-TTY environments, prints progress to stderr as plain lines, one at a time (instead of a spinner).
diff --git a/docs/superpowers/specs/2026-04-27-kebab-final-form-design.md b/docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
index 666e79f..0ee1a41 100644
--- a/docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
+++ b/docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
@@ -37,6 +37,7 @@ related_tasks: ../../../tasks/INDEX.md
 | – | ignore | gitignore syntax + `.kebabignore` | Familiar |
 | – | errors | thiserror per crate, anyhow at boundary | Traceability + UX |
 | – | sync | watch=false default | v1 explicit ingest |
+| C+ | add code ingest | Tier 1/2/3 fan-out, keep e5-large, new Citation `code` variant | 2026-05-15 spec |
 
 ---
 
@@ -168,12 +169,12 @@ $ kebab search "Markdown chunking 규칙"
 Frozen as `docs/wire-schema/v1/*.schema.json`. Conversion between internal Rust structs ↔ wire goes through `From`/`TryFrom`. Every wire object must carry a `schema_version` field.
 
-### 2.1 Citation (5 variants — discriminated by `kind`)
+### 2.1 Citation (6 variants — discriminated by `kind`)
 
 ```json
 {
   "schema_version": "citation.v1",
-  "kind": "line|page|region|caption|time",
+  "kind": "line|page|region|caption|time|code",
   "path": "notes/rust/kebab.md",
   "uri": "notes/rust/kebab.md#L12-L34",
 
@@ -181,7 +182,20 @@ $ kebab search "Markdown chunking 규칙"
   "page": { "page": 13, "section": "Experiment Setup" },
   "region": { "x": 120, "y": 40, "w": 520, "h": 180 },
   "caption": { "model": "qwen2.5-vl:7b" },
-  "time": { "start_ms": 822000, "end_ms": 850000, "speaker": "S1" }
+  "time": { "start_ms": 822000, "end_ms": 850000, "speaker": "S1" },
+  "code": { "start": 10, "end": 42, "lang": "rust", "repo": "kebab", "symbol": "fn ingest" }
+}
+```
+
+code variant example (p10-1A-1):
+
+```json
+{
+  "schema_version": "citation.v1",
+  "kind": "code",
+  "path": "crates/kebab-app/src/ingest.rs",
+  "uri": "crates/kebab-app/src/ingest.rs#L10-L42",
+  "code": { "start": 10, "end": 42, "lang": "rust", "repo": "kebab", "symbol": "fn ingest" }
 }
 ```
 
@@ -213,7 +227,9 @@ Only the keys for the given variant are filled. `path` and `uri` are always filled (`uri` is
   },
   "index_version": "v1.0",
   "embedding_model": "multilingual-e5-large",
-  "chunker_version": "md-heading-v1"
+  "chunker_version": "md-heading-v1",
+  "repo": null, // p10-1A-1: optional, omitted when null (code corpus only)
+  "code_lang": null // p10-1A-1: optional, omitted when null (code corpus only)
 }
 ```
 
@@ -297,6 +313,17 @@ Per-query failures are isolated in `bulk_search_item.v1.error` (error.v1); other
   "scope": { "root": "/home/altair/KnowledgeBase", "include": ["**/*.md"], "exclude": [".git/**"] },
   "scanned": 142, "new": 12, "updated": 3, "skipped": 127, "errors": 0,
   "duration_ms": 4231,
+  "skipped_gitignore": 40,
+  "skipped_kebabignore": 5,
+  "skipped_builtin_blacklist": 80,
+  "skipped_generated": 2,
+  "skipped_size_exceeded": 1,
+  "skip_examples": {
+    "generated": ["crates/kebab-app/src/generated.rs"],
+    "size_exceeded": ["crates/kebab-app/fixtures/huge.rs"],
+    "builtin_blacklist": ["target/release/kebab"],
+    "gitignore": ["node_modules/lodash/index.js"]
+  },
   "items": [
     {
       "kind": "new|updated|skipped|error",
@@ -434,6 +461,8 @@ pub struct PromptTemplateVersion(pub String);
 pub struct SchemaVersion(pub &'static str);
 ```
 
+Note: `chunker_version` family extended in phase 10 (per-language pattern, see 2026-05-15 spec §3.3 for canonical list). Each new language AST chunker registers its own `ChunkerVersion` label (e.g. `code-rust-ast-v1`, `code-python-ast-v1`). The existing `md-heading-v1` / `pdf-page-v1` labels are unaffected.
+
 ### 3.3 RawAsset
 
 ```rust
@@ -577,6 +606,11 @@ pub struct Metadata {
     pub trust_level: TrustLevel,
     pub user_id_alias: Option,
     pub user: serde_json::Map,
+    // p10-1A-1: code corpus fields — None for non-code assets.
+    pub repo: Option, // git repo name (top-level dir or remote basename)
+    pub git_branch: Option, // HEAD branch name at ingest time
+    pub git_commit: Option, // HEAD commit SHA (short, 12 chars) at ingest time
+    pub code_lang: Option, // lowercase language name (e.g. "rust", "python")
 }
 
 pub enum SourceType { Markdown, Note, Paper, Reference, Inbox }
@@ -1370,8 +1404,11 @@ pub trait JobRepo {
 kebab-cli, kebab-tui, kebab-desktop
   └─> kebab-app
         ├─> kebab-source-fs
+        │     └─> kebab-parse-code (p10-1A-1: lang detect / repo detect / skip policy)
         ├─> kebab-parse-md / kebab-parse-pdf / kebab-parse-image / kebab-parse-audio
         │     └─> kebab-parse-types (parser intermediate)
+        ├─> kebab-parse-code
+        │     └─> kebab-core (domain types only — NO store/embed/llm/rag/UI)
         ├─> kebab-normalize
         │     └─> kebab-parse-types
         ├─> kebab-chunk
@@ -1555,6 +1592,8 @@ agent branches). HTTP-SSE transport is P+ per the fb-29 deferral. classify
 - real-time collab
 - enterprise auth
 
+Code ingest is no longer out of scope (2026-05-15 spec). However, multi-workspace / watch mode / history-aware indexing (git-blame-based citations, diff-aware re-chunking) remain out of scope.
+
 ---
 
 ## 12. Next steps
diff --git a/docs/wire-schema/v1/citation.schema.json b/docs/wire-schema/v1/citation.schema.json
index 30ef875..404e77e 100644
--- a/docs/wire-schema/v1/citation.schema.json
+++ b/docs/wire-schema/v1/citation.schema.json
@@ -7,7 +7,7 @@
   "required": ["schema_version", "kind", "path", "uri", "indexed_at", "stale"],
   "properties": {
     "schema_version": { "const": "citation.v1" },
-    "kind": { "enum": ["line", "page", "region", "caption", "time"] },
+    "kind": { "enum": ["line", "page", "region", "caption", "time", "code"] },
     "path": { "type": "string" },
     "uri": { "type": "string" },
     "line": { "type": "object" },
@@ -15,6 +15,7 @@
     "region": { "type": "object" },
     "caption": { "type": "object" },
     "time": { "type": "object" },
+    "code": { "type": "object" },
     "indexed_at": { "type": "string", "format": "date-time" },
     "stale": { "type": "boolean" }
   }
diff --git a/docs/wire-schema/v1/ingest_report.schema.json b/docs/wire-schema/v1/ingest_report.schema.json
index aeb2e67..92ed1f1 100644
--- a/docs/wire-schema/v1/ingest_report.schema.json
+++ b/docs/wire-schema/v1/ingest_report.schema.json
@@ -38,6 +38,20 @@
       },
       "description": "p9-fb-25: per-extension skip count. Key = lowercase extension without leading dot (e.g. 'docx'). Files without extension key under ''."
     },
-    "items": { "type": ["array", "null"] }
+    "items": { "type": ["array", "null"] },
+    "skipped_gitignore": { "type": "integer", "minimum": 0 },
+    "skipped_kebabignore": { "type": "integer", "minimum": 0 },
+    "skipped_builtin_blacklist": { "type": "integer", "minimum": 0 },
+    "skipped_generated": { "type": "integer", "minimum": 0 },
+    "skipped_size_exceeded": { "type": "integer", "minimum": 0 },
+    "skip_examples": {
+      "type": "object",
+      "properties": {
+        "generated": { "type": "array", "items": { "type": "string" }, "maxItems": 5 },
+        "size_exceeded": { "type": "array", "items": { "type": "string" }, "maxItems": 5 },
+        "builtin_blacklist": { "type": "array", "items": { "type": "string" }, "maxItems": 5 },
+        "gitignore": { "type": "array", "items": { "type": "string" }, "maxItems": 5 }
+      }
+    }
   }
 }
diff --git a/docs/wire-schema/v1/schema.schema.json b/docs/wire-schema/v1/schema.schema.json
index 5b46e7e..6e610b1 100644
--- a/docs/wire-schema/v1/schema.schema.json
+++ b/docs/wire-schema/v1/schema.schema.json
@@ -78,6 +78,16 @@
           "type": "integer",
           "minimum": 0,
           "description": "p9-fb-37: docs whose updated_at exceeds config.search.stale_threshold_days. 0 when threshold=0."
+        },
+        "code_lang_breakdown": {
+          "type": "object",
+          "description": "p10-1A-1: per-language code chunk count. Key = lowercase language name (e.g. 'rust', 'python'). Populated after 1A-2 lands; empty on markdown-only corpora.",
+          "additionalProperties": { "type": "integer", "minimum": 0 }
+        },
+        "repo_breakdown": {
+          "type": "object",
+          "description": "p10-1A-1: per-repo code chunk count. Key = repo name as detected by kebab-parse-code::repo. Empty on markdown-only corpora.",
+          "additionalProperties": { "type": "integer", "minimum": 0 }
         }
       }
     }
diff --git a/docs/wire-schema/v1/search_hit.schema.json b/docs/wire-schema/v1/search_hit.schema.json
index 88256e1..db4fa97 100644
--- a/docs/wire-schema/v1/search_hit.schema.json
+++ b/docs/wire-schema/v1/search_hit.schema.json
@@ -42,6 +42,8 @@
     "embedding_model": { "type": ["string", "null"] },
     "chunker_version": { "type": "string" },
     "indexed_at": { "type": "string", "format": "date-time" },
-    "stale": { "type": "boolean" }
+    "stale": { "type": "boolean" },
+    "repo": { "type": ["string", "null"] },
+    "code_lang": { "type": ["string", "null"] }
   }
 }
diff --git a/tasks/INDEX.md b/tasks/INDEX.md
index 7a874b3..00969e0 100644
--- a/tasks/INDEX.md
+++ b/tasks/INDEX.md
@@ -35,6 +35,7 @@ P0~P5 are serial. P6~P9 can run in parallel after P5.
 | P7 | [phase-7-pdf.md](phase-7-pdf.md) | PDF text + page citation | kebab-parse-pdf | P5 |
 | P8 | [phase-8-audio.md](phase-8-audio.md) | Audio transcription + timestamp citation | kebab-parse-audio | P5 |
 | P9 | [phase-9-ui.md](phase-9-ui.md) | TUI + desktop app | kebab-tui, kebab-desktop | P5 |
+| P10 | [p10/INDEX.md](p10/INDEX.md) | Code ingest framework + AST chunkers | kebab-parse-code, kebab-source-fs (code walk) | P5 |
 
 ## Component task decomposition (per phase)
 
@@ -137,6 +138,15 @@ P0~P5 are serial. P6~P9 can run in parallel after P5.
 - [p9-fb-41 multi-hop reasoning](p9/p9-fb-41-multi-hop-reasoning.md) — ⏳ not implemented, needs a brainstorm (XL, eval infrastructure must come first)
 - [p9-fb-42 bulk multi-query + re-rank hint](p9/p9-fb-42-bulk-multi-query-rerank.md) — ✅ merged (2026-05-10) — bulk only, rerank hint deferred
 
+- P10 — [p10/](p10/) — code ingest (multi-task, sub-indexed in [p10/INDEX.md](p10/INDEX.md))
+  - [p10-1A-1 code ingest framework](p10/p10-1a-1-code-ingest-framework.md) — 🟡 in progress
+  - p10-1A-2 Rust AST chunker — ⏳
+  - p10-1B Python + TS/JS AST chunkers — ⏳
+  - p10-1C Go + Java + Kotlin AST chunkers — ⏳
+  - p10-1D C + C++ AST chunkers — ⏳
+  - p10-2 Tier 2 resource-aware — ⏳
+  - p10-3 Tier 3 paragraph + line-window fallback — ⏳
+
 ## Post-merge hotfixes
 
 Bugs found after merge and their follow-up PRs are recorded as a dated log in [HOTFIXES.md](HOTFIXES.md). The original task specs stay frozen, and HOTFIXES.md is the source of truth for post-merge behavior changes.
diff --git a/tasks/p10/INDEX.md b/tasks/p10/INDEX.md
new file mode 100644
index 0000000..8d0017a
--- /dev/null
+++ b/tasks/p10/INDEX.md
@@ -0,0 +1,13 @@
+# Phase 10 — Code Ingest
+
+| ID | Subject | Status |
+|----|---------|--------|
+| 1A-1 | code ingest framework (wire schema, parse-code crate skeleton, filter flags, skip policy, config section) | 🟡 In progress |
+| 1A-2 | Rust AST chunker | ⏳ |
+| 1B | Python + TS/JS AST chunkers | ⏳ |
+| 1C | Go + Java + Kotlin AST chunkers | ⏳ |
+| 1D | C + C++ AST chunkers | ⏳ |
+| 2 | Tier 2 resource-aware (k8s / Dockerfile / manifest) | ⏳ |
+| 3 | Tier 3 paragraph + line-window fallback | ⏳ |
+
+Design: [2026-05-15-kebab-code-ingest-design.md](../../docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md)
diff --git a/tasks/p10/p10-1a-1-code-ingest-framework.md b/tasks/p10/p10-1a-1-code-ingest-framework.md
new file mode 100644
index 0000000..188c9e0
--- /dev/null
+++ b/tasks/p10/p10-1a-1-code-ingest-framework.md
@@ -0,0 +1,31 @@
+# p10-1A-1 — code ingest framework
+
+**Status:** 🟡 In progress
+**Contract sections:** §2.1 (Citation `code` variant), §2.2 (SearchHit repo/code_lang), §2.4 (IngestReport skip counters), §2 schema.v1 (code_lang_breakdown + repo_breakdown), §3.6 (Metadata fields), §8 (kebab-parse-code crate boundary), §11 (code ingest no longer out of scope).
+**Design:** [2026-05-15-kebab-code-ingest-design.md](../../docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md) §1A-1.
+**Plan:** [2026-05-15-p10-1a-1-code-ingest-framework.md](../../docs/superpowers/plans/2026-05-15-p10-1a-1-code-ingest-framework.md).
+
+## Goal
+
+Land the *framework surface* for code ingest — wire schema (additive minor), CLI filter flags, ignore policy, skip policy infrastructure, `kebab-parse-code` crate skeleton, `[ingest.code]` config section — without enabling any code chunker. 1A-2 plugs in the Rust AST chunker on top.
+
+## Acceptance criteria
+
+- `cargo test --workspace --no-fail-fast -j 1` passes.
+- Regression test (`wire_search_hit_no_code_fields`, `wire_citation_5_variants_unchanged`) passes — markdown corpus wire output unchanged.
+- `cargo clippy --workspace --all-targets -- -D warnings` passes.
+- `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` updated per design §10.1.
+- README + HANDOFF + SMOKE updated.
+
+## Allowed dependencies
+
+- `kebab-parse-code` may depend on `kebab-core`, `anyhow`, `gix`. NOT on store / embed / llm / rag / UI.
+- Source-fs may depend on `kebab-parse-code`.
+
+## Forbidden dependencies
+
+- UI crates (cli / mcp / tui) must NOT import `kebab-parse-code` directly.
+
+## Risks / notes
+
+- `.gitignore` honor changes existing behavior for markdown corpora whose files live in gitignored areas. Regression test covers the standard case (no overlap). If a user reports missing docs after 1A-1 lands, log to HOTFIXES.
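
The skip policy named in the Goal can be pictured with a small sketch. It is not the `kebab-parse-code` implementation: `SkipReason`, `CodeSkipPolicy`, and `classify` are hypothetical names, and only the defaults (`skip_generated_header`, `max_file_bytes = 262144`, `max_file_lines = 5000`, the ~512-byte sniff, the `@generated` / `DO NOT EDIT` markers) come from the README diff in this patch. The order of the checks is also an assumption.

```rust
/// Why a file is skipped before chunking (hypothetical names for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum SkipReason {
    Generated,
    SizeExceeded,
}

/// Mirrors the `[ingest.code]` keys quoted in the README diff.
pub struct CodeSkipPolicy {
    pub skip_generated_header: bool,
    pub max_file_bytes: usize,
    pub max_file_lines: usize,
}

impl Default for CodeSkipPolicy {
    fn default() -> Self {
        Self {
            skip_generated_header: true,
            max_file_bytes: 262_144, // 256 KiB
            max_file_lines: 5_000,
        }
    }
}

/// Number of leading bytes inspected by the generated-header sniff ("~512").
const HEADER_SNIFF_BYTES: usize = 512;

/// Returns `Some(reason)` when the file should be skipped, `None` otherwise.
/// Assumption: size caps are checked before the generated-header sniff.
pub fn classify(policy: &CodeSkipPolicy, contents: &[u8]) -> Option<SkipReason> {
    if contents.len() > policy.max_file_bytes {
        return Some(SkipReason::SizeExceeded);
    }
    // Line count is approximated as the number of newline bytes.
    let lines = contents.iter().filter(|&&b| b == b'\n').count();
    if lines > policy.max_file_lines {
        return Some(SkipReason::SizeExceeded);
    }
    if policy.skip_generated_header {
        let head = &contents[..contents.len().min(HEADER_SNIFF_BYTES)];
        // Lossy decoding tolerates a multi-byte character cut at the boundary.
        let head = String::from_utf8_lossy(head);
        if head.contains("@generated") || head.contains("DO NOT EDIT") {
            return Some(SkipReason::Generated);
        }
    }
    None
}
```

In such a design each `SkipReason` would feed the matching `skipped_generated` / `skipped_size_exceeded` counter in `ingest_report.v1`; how the real crate wires that up is not shown in this patch.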
-- 
2.49.1

From 7961f8813df23296a0228c3f8db5483929dd70d7 Mon Sep 17 00:00:00 2001
From: th-kim0823
Date: Fri, 15 May 2026 18:24:15 +0900
Subject: [PATCH 21/21] =?UTF-8?q?fix(p10-1a-1):=20PR=20review=20round=201?= =?UTF-8?q?=20=E2=80=94=20doc=20inconsistencies?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Address all 4 actionable items from review round 1:

1. In frozen design §2.1, remove the nonexistent `repo` field from the
   code variant example and rewrite it from the nested form to the
   actual wire (flat) form. The nested-form illustrative example for the
   5 variants stays as is; only the code variant moves to a separate
   block that matches the actual wire 1:1. Also drop the 'code' row from
   the 6-variant nested-form group above (the exact contract lives in
   the separate block).
2. Remove the contradiction in the §2.2 SearchHit example between
   `repo: null, code_lang: null` and the 'omitted when null' comments:
   drop the keys themselves and explain in an inline comment that they
   are absent on markdown hits and surface only on code hits.
3. HANDOFF Phase row identifier `**10**` → `**P10**` (consistent with
   the other rows).
4. Remove the duplicate `[--media code]` from the README synopsis
   (`--media` already appears once earlier; `code` is just one of its
   values and is explained in the prose).

No code changes; all edits are to markdown documents.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 HANDOFF.md | 2 +-
 README.md | 2 +-
 .../2026-04-27-kebab-final-form-design.md | 21 +++++++++++--------
 3 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/HANDOFF.md b/HANDOFF.md
index 07048a8..9025ed8 100644
--- a/HANDOFF.md
+++ b/HANDOFF.md
@@ -20,7 +20,7 @@ P0–P5 + P6 + P7 + P9-1/2/3/4 (Library / Search / Ask / Inspect) merged.
 | **P7** | PDF text + page citation | `kebab-parse-pdf` | P5 | ✅ Done (3/3 components, page-level chunker + ingest wiring) |
 | **P8** | Audio transcription + timestamp citation | `kebab-parse-audio` | P5 | ⏸ On hold (the whisper-rs system dependency needs a brainstorm) |
 | **P9** | TUI + desktop app | `kebab-tui`, `kebab-desktop` | P5 | 🟡 In progress (4/5 components: P9-1/2/3/4 done [Library / Search / Ask / Inspect], P9-5 desktop planned · dogfooding feedback **20/20 ✅**) |
-| **10** | code ingest framework | `kebab-parse-code` | P5 | 🟡 In progress (1A-1 about to merge): at the 1A-1 merge, the additive-minor wire schema and the new crate kebab-parse-code skeleton are frozen; the actual code chunkers start with 1A-2 |
+| **P10** | code ingest framework | `kebab-parse-code` | P5 | 🟡 In progress (1A-1 about to merge): at the 1A-1 merge, the additive-minor wire schema and the new crate kebab-parse-code skeleton are frozen; the actual code chunkers start with 1A-2 |
 
 P0~P5 are serial. P6~P9 can run in parallel after P5.
 
diff --git a/README.md b/README.md
index 0b12a8f..6c18d3e 100644
--- a/README.md
+++ b/README.md
@@ -71,7 +71,7 @@ kebab doctor
 |------|------|
 | `kebab init` | Creates the data directory + config.toml under the XDG paths |
 | `kebab ingest []` | Indexes Markdown / images / PDF (idempotent). On a TTY, progress is a bar on stderr; on non-TTY (CI / pipe), one line at a time on stderr; with `--json`, `ingest_progress.v1` lines stream to stdout, followed by a final `ingest_report.v1`. One Ctrl-C finishes the current asset and then aborts (partial commits are kept, re-runs are idempotent); a second Ctrl-C is a hard exit. If the frontmatter has no Markdown title, it is filled automatically in the order first H1 → H2 → first 80 characters of the first paragraph → file name (parser_version `md-frontmatter-v2`); already indexed docs also pick up the new title on the next ingest. **Incremental** (p9-fb-23): from the second ingest on, parse/chunk/embed/vector upsert is skipped automatically for unchanged docs (same blake3 and same parser/chunker/embedder versions). The final summary shows an `N unchanged` count. `--force-reingest` ignores the skip and forces reprocessing. **Supported formats** (chosen automatically by the extractor; cannot be set in config): Markdown (`.md`), images (`.png` / `.jpg` / `.jpeg`, OCR + caption), PDF (`.pdf`). Other extensions are skipped automatically: the reason goes into `IngestItem.warnings` (e.g. `"unsupported media type: .docx"`), counts are grouped in `IngestReport.skipped_by_extension`, and the CLI / TUI summary shows the breakdown. |
-| `kebab search --mode {lexical,vector,hybrid} "" [--no-cache] [--max-tokens N] [--snippet-chars N] [--cursor ] [--tag T] [--lang L] [--path-glob G] [--trust-min LEVEL] [--media TYPE] [--ingested-after RFC3339] [--doc-id ID] [--trace] [--bulk] [--repo NAME ...] [--code-lang LIST] [--media code]` | Search. hybrid uses RRF fusion and includes citations. Repeating the same query (normalized with NFKC + trim + lowercase) within one process hits the in-process LRU cache (capacity = `[search] cache_capacity`, default 256). `--no-cache` forces a bypass, for debugging. When an ingest commit happens, the `kv['corpus_revision']` bump makes every entry stale automatically. **`--max-tokens` / `--snippet-chars` / `--cursor` (p9-fb-34)**: agent budget controls. `--json` output is the `search_response.v1` wrapper (`{hits, next_cursor, truncated}`), which is not compatible with the pre-fb-34 bare array. mismatched cursor → `error.v1.code = stale_cursor`. **filter flags (p9-fb-36):** `--tag` is a repeatable flag (`--tag rust --tag async`) with OR matching, `--media` takes `,`-separated values with OR matching, and the remaining flags combine with AND. `--trust-min` is one of `primary\|secondary\|generated` (includes that level and above). `--ingested-after` is RFC3339 UTC; a parse failure yields `error.v1.code = config_invalid` (exit 2). `--media md` is normalized as an alias of `markdown`. An unknown `--media` value always yields empty hits (not an error). **`--trace` (p9-fb-37)**: exposes lexical / vector pre-fusion candidates + RRF union + per-stage timing (`lexical_ms` / `vector_ms` / `fusion_ms` / `total_ms`) in `search_response.v1.trace`. Trace requests bypass the cache (always cold, even without `--no-cache`). **`--bulk` (p9-fb-42)**: runs N queries at once from ndjson on stdin. With `--json`, stdout carries per-query ndjson (`bulk_search_item.v1`) and stderr a summary (`bulk_summary: total=N succeeded=S failed=F`). Cap 100. When an agent runs its sub-queries in one batch after query decomposition, it is a single round-trip: the App instance is reused, so the cache / embedder cold-start cost is paid only once. A per-query failure is isolated in the item's `error` (error.v1), and the other queries continue. **code corpus filters (p10-1A-1):** `--repo` is repeatable (`--repo kebab --repo other`) with OR matching. `--code-lang` takes repeated or comma-separated values (`--code-lang rust,python`); unknown values yield empty hits. `--media code` includes all Tier 1/2/3 code chunks. As of 1A-1 there are no indexed code chunks, so these filters always return empty results; they take effect once 1A-2 (Rust AST chunker) is merged. |
+| `kebab search --mode {lexical,vector,hybrid} "" [--no-cache] [--max-tokens N] [--snippet-chars N] [--cursor ] [--tag T] [--lang L] [--path-glob G] [--trust-min LEVEL] [--media TYPE] [--ingested-after RFC3339] [--doc-id ID] [--trace] [--bulk] [--repo NAME ...] [--code-lang LIST]` | Search. hybrid uses RRF fusion and includes citations. Repeating the same query (normalized with NFKC + trim + lowercase) within one process hits the in-process LRU cache (capacity = `[search] cache_capacity`, default 256). `--no-cache` forces a bypass, for debugging. When an ingest commit happens, the `kv['corpus_revision']` bump makes every entry stale automatically. **`--max-tokens` / `--snippet-chars` / `--cursor` (p9-fb-34)**: agent budget controls. `--json` output is the `search_response.v1` wrapper (`{hits, next_cursor, truncated}`), which is not compatible with the pre-fb-34 bare array. mismatched cursor → `error.v1.code = stale_cursor`. **filter flags (p9-fb-36):** `--tag` is a repeatable flag (`--tag rust --tag async`) with OR matching, `--media` takes `,`-separated values with OR matching, and the remaining flags combine with AND. `--trust-min` is one of `primary\|secondary\|generated` (includes that level and above). `--ingested-after` is RFC3339 UTC; a parse failure yields `error.v1.code = config_invalid` (exit 2). `--media md` is normalized as an alias of `markdown`. An unknown `--media` value always yields empty hits (not an error). **`--trace` (p9-fb-37)**: exposes lexical / vector pre-fusion candidates + RRF union + per-stage timing (`lexical_ms` / `vector_ms` / `fusion_ms` / `total_ms`) in `search_response.v1.trace`. Trace requests bypass the cache (always cold, even without `--no-cache`). **`--bulk` (p9-fb-42)**: runs N queries at once from ndjson on stdin. With `--json`, stdout carries per-query ndjson (`bulk_search_item.v1`) and stderr a summary (`bulk_summary: total=N succeeded=S failed=F`). Cap 100. When an agent runs its sub-queries in one batch after query decomposition, it is a single round-trip: the App instance is reused, so the cache / embedder cold-start cost is paid only once. A per-query failure is isolated in the item's `error` (error.v1), and the other queries continue. **code corpus filters (p10-1A-1):** `--repo` is repeatable (`--repo kebab --repo other`) with OR matching. `--code-lang` takes repeated or comma-separated values (`--code-lang rust,python`); unknown values yield empty hits. `--media code` includes all Tier 1/2/3 code chunks. As of 1A-1 there are no indexed code chunks, so these filters always return empty results; they take effect once 1A-2 (Rust AST chunker) is merged. |
 | `kebab list docs` | Lists indexed documents |
 | `kebab inspect doc ` / `kebab inspect chunk ` | Shows the raw record |
 | `kebab fetch chunk [--context N]` / `kebab fetch doc [--max-tokens N]` / `kebab fetch span [--max-tokens N]` | (p9-fb-35) verbatim text fetch from indexed corpus. wire = `fetch_result.v1` (kind discriminator). chunk: target + ±N ordinal-context chunks. doc: full normalized markdown. span: 1-based line range (PDF/audio rejected as `error.v1.code = span_not_supported`). chars/4 budget on doc/span. |
diff --git a/docs/superpowers/specs/2026-04-27-kebab-final-form-design.md b/docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
index 0ee1a41..d883d38 100644
--- a/docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
+++ b/docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
@@ -182,12 +182,15 @@ $ kebab search "Markdown chunking 규칙"
   "page": { "page": 13, "section": "Experiment Setup" },
   "region": { "x": 120, "y": 40, "w": 520, "h": 180 },
   "caption": { "model": "qwen2.5-vl:7b" },
-  "time": { "start_ms": 822000, "end_ms": 850000, "speaker": "S1" },
-  "code": { "start": 10, "end": 42, "lang": "rust", "repo": "kebab", "symbol": "fn ingest" }
+  "time": { "start_ms": 822000, "end_ms": 850000, "speaker": "S1" }
 }
 ```
 
-code variant example (p10-1A-1):
+Only the keys for the given variant are filled. `path` and `uri` are always filled (`uri` is the path combined with W3C Media Fragments).
+
+**Implementation note (actual wire shape):** The nested form above is an illustrative structure. The actual wire is a `#[serde(tag = "kind")]` internally tagged enum, so each variant's fields sit at the *top level* (e.g. `Line` → `{"kind":"line", "start":12, "end":34, ...}`, not nested). The same holds for all 6 variants.
+ +**code variant (p10-1A-1, flat wire form):** 자세한 contract 은 2026-05-15 code ingest spec §3.1 참조. 5 필드 — `path`, `line_start`, `line_end`, `symbol` (Option, AST 결과면 채움), `lang` (Option, lowercase canonical). `repo` 는 Citation 이 아니라 `SearchHit` / `Metadata` 에 surface. ```json { @@ -195,12 +198,13 @@ code variant example (p10-1A-1): "kind": "code", "path": "crates/kebab-app/src/ingest.rs", "uri": "crates/kebab-app/src/ingest.rs#L10-L42", - "code": { "start": 10, "end": 42, "lang": "rust", "repo": "kebab", "symbol": "fn ingest" } + "line_start": 10, + "line_end": 42, + "symbol": "fn ingest", + "lang": "rust" } ``` -variant 별 해당 키만 채움. `path` 와 `uri` 는 항상 채움 (`uri` 는 path + W3C Media Fragments 합본). - ### 2.2 SearchHit ```json @@ -227,9 +231,8 @@ variant 별 해당 키만 채움. `path` 와 `uri` 는 항상 채움 (`uri` 는 }, "index_version": "v1.0", "embedding_model": "multilingual-e5-large", - "chunker_version": "md-heading-v1", - "repo": null, // p10-1A-1: optional, omitted when null (code corpus only) - "code_lang": null // p10-1A-1: optional, omitted when null (code corpus only) + "chunker_version": "md-heading-v1" + // p10-1A-1: 코드 hit 에만 surface — `"repo": "kebab"` / `"code_lang": "rust"` 같은 키 추가됨. markdown hit 에는 키 자체 absent (skip_serializing_if). } ``` -- 2.49.1
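
Reviewer note on the README row for code corpus filters: the documented matching rules (values of one flag are OR-ed, different flags are AND-ed, `md` is an alias of `markdown`, an unknown value matches nothing and is not an error) can be sketched as below. This is a minimal illustration under stated assumptions, not kebab's implementation: the `Hit` and `Filters` types and every function name are hypothetical, and the real `SearchHit` carries more fields.

```rust
/// Hypothetical, simplified hit. `repo` and `code_lang` are set only on code hits.
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    media: String,
    repo: Option<String>,
    code_lang: Option<String>,
}

/// Hypothetical filter set. An empty list means the flag was not given.
#[derive(Debug, Default)]
struct Filters {
    repos: Vec<String>,      // from repeated `--repo`
    code_langs: Vec<String>, // from repeated or comma-separated `--code-lang`
    media: Vec<String>,      // from comma-separated `--media`
}

/// Flatten repeated and comma-separated flag values into one list.
fn split_values(raw: &[&str]) -> Vec<String> {
    raw.iter()
        .flat_map(|v| v.split(','))
        .map(|v| v.trim().to_string())
        .filter(|v| !v.is_empty())
        .collect()
}

/// Normalize the documented `md` alias to `markdown`.
fn normalize_media(value: &str) -> String {
    match value {
        "md" => "markdown".to_string(),
        other => other.to_string(),
    }
}

/// OR within one flag: passes if the flag is unset or any wanted value equals the actual one.
fn any_match(wanted: &[String], actual: Option<&str>) -> bool {
    wanted.is_empty() || actual.map_or(false, |a| wanted.iter().any(|w| w == a))
}

/// AND across flags: a hit must pass every flag that was given.
fn matches(hit: &Hit, f: &Filters) -> bool {
    any_match(&f.repos, hit.repo.as_deref())
        && any_match(&f.code_langs, hit.code_lang.as_deref())
        && any_match(&f.media, Some(hit.media.as_str()))
}

/// Keep only the hits that satisfy all filters. Unknown values yield an empty list.
fn apply(hits: &[Hit], f: &Filters) -> Vec<Hit> {
    hits.iter().filter(|h| matches(h, f)).cloned().collect()
}
```

In this sketch, `--repo kebab --repo other --code-lang rust,python --media code` keeps code hits from either repo in either language and drops markdown hits, because those carry no `repo` or `code_lang`.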