diff --git a/docs/superpowers/plans/2026-05-19-p10-1a-2-rust-ast-chunker.md b/docs/superpowers/plans/2026-05-19-p10-1a-2-rust-ast-chunker.md new file mode 100644 index 0000000..5d55d11 --- /dev/null +++ b/docs/superpowers/plans/2026-05-19-p10-1a-2-rust-ast-chunker.md @@ -0,0 +1,1441 @@ +# p10-1A-2 Rust AST Chunker Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Activate Rust code ingest end-to-end — `.rs` files parse via tree-sitter into one `Block::Code` per AST semantic unit, chunk 1:1 (with oversize fallback split), and surface as `Citation::Code { symbol, lang, line_start, line_end }` in search. + +**Architecture:** tree-sitter lives in the **parser** (`kebab-parse-code`, per design §6.3 dependency graph; mirrors the proven PDF pattern where the parser emits structured blocks and the chunker maps them). `kebab-parse-code/src/rust.rs` is an `Extractor` producing a `CanonicalDocument` whose blocks are AST units carrying a new internal `SourceSpan::Code` variant. `kebab-chunk/src/code_rust_ast_v1.rs` maps each block to a chunk and splits oversize units. `citation_helper` gains one arm so the span flows to the existing (1A-1) `Citation::Code` wire shape. A new `MediaType::Code(String)` routes `.rs` files; non-Rust code langs stay `MediaType::Other` in 1A. + +**Tech Stack:** Rust 2024 workspace, `tree-sitter` + `tree-sitter-rust`, existing `kebab-core` domain model, `serde_json_canonicalizer` + `blake3` (chunk id / policy hash), `gix` (1A-1 repo detect). + +--- + +## Pre-flight + +- [ ] **Branch.** From clean `main`: + +```bash +cd /home/altair823/kebab +git checkout main && git pull +git checkout -b feat/p10-1a-2-rust-ast-chunker +``` + +- [ ] **Disk hygiene** (CLAUDE.md — routine after each merged PR; #139 just merged): + +```bash +cargo clean +``` + +Notes that apply throughout: +- Full suite is always `cargo test --workspace --no-fail-fast -j 1` (CLAUDE.md — parallel link OOMs). Per-crate runs (`cargo test -p `) may run normally. +- `cargo clippy --workspace --all-targets -- -D warnings` is the CI gate; run before every commit that touches code. +- Frozen task spec for this work: `tasks/p10/p10-1a-2-rust-ast-chunker.md` (already written; do not edit retroactively). + +--- + +## Task 1: Add tree-sitter dependencies (workspace-deps pattern) + +**Files:** +- Modify: `Cargo.toml` (workspace `[workspace.dependencies]`, ends ~line 90 after the `gix` entry) +- Modify: `crates/kebab-parse-code/Cargo.toml` (`[dependencies]`) + +- [ ] **Step 1: Resolve + pin versions via cargo add** + +Run (this picks the latest compatible versions and writes them into the crate): + +```bash +cargo add tree-sitter tree-sitter-rust -p kebab-parse-code +``` + +- [ ] **Step 2: Move the resolved versions into workspace deps** + +Read the two version strings `cargo add` wrote into `crates/kebab-parse-code/Cargo.toml`. Then in the workspace `Cargo.toml`, append to `[workspace.dependencies]` directly after the `gix = { ... }` line (keep the existing comment style — one explanatory comment line): + +```toml +# Rust source parsing for code ingest (kebab-parse-code, p10-1A-2). The +# chunker stays tree-sitter-free — AST work is parser-side per design §6.3. +tree-sitter = "" +tree-sitter-rust = "" +``` + +Then rewrite the crate's `[dependencies]` to use the workspace table (matching the existing `anyhow`/`gix` style): + +```toml +[dependencies] +anyhow = { workspace = true } +gix = { workspace = true } +tree-sitter = { workspace = true } +tree-sitter-rust = { workspace = true } + +[dev-dependencies] +tempfile = { workspace = true } +``` + +- [ ] **Step 3: Verify it builds and the lock updated** + +Run: `cargo build -p kebab-parse-code` +Expected: compiles clean (skeleton still has no tree-sitter use yet — deps unused is fine, no `-D warnings` on a plain build). + +- [ ] **Step 4: Commit** + +```bash +git add Cargo.toml Cargo.lock crates/kebab-parse-code/Cargo.toml +git commit -m "build(p10-1a-2): add tree-sitter + tree-sitter-rust workspace deps + +Co-Authored-By: Claude Opus 4.7 (1M context) " +``` + +--- + +## Task 2: `SourceSpan::Code` internal variant + +**Files:** +- Modify: `crates/kebab-core/src/document.rs` (`SourceSpan` enum, after the `Time { start_ms, end_ms }` variant ~line 144) +- Modify: `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` (§3.4 `SourceSpan` enum listing, ~line 562) +- Test: `crates/kebab-core/src/document.rs` (`#[cfg(test)] mod tests`, or the file's existing test module) + +`SourceSpan` is `#[serde(rename_all = "lowercase", tag = "kind")]` — the new variant serializes as `{"kind":"code", ...}`. This is the chunks-table `source_spans_json` internal shape, NOT a wire schema (wire `Citation::Code` already shipped in 1A-1), so no wire `.v2` bump. + +- [ ] **Step 1: Write the failing test** + +Add to the `kebab-core` test module that covers `SourceSpan` (search `mod tests` in `document.rs`; if none, add one at end of file): + +```rust +#[test] +fn source_span_code_round_trips_and_tags_lowercase() { + let s = SourceSpan::Code { + line_start: 10, + line_end: 42, + symbol: Some("foo::Bar::baz".to_string()), + lang: Some("rust".to_string()), + }; + let v = serde_json::to_value(&s).unwrap(); + assert_eq!(v["kind"], "code"); + assert_eq!(v["line_start"], 10); + assert_eq!(v["line_end"], 42); + assert_eq!(v["symbol"], "foo::Bar::baz"); + assert_eq!(v["lang"], "rust"); + let back: SourceSpan = serde_json::from_value(v).unwrap(); + assert_eq!(back, s); +} +``` + +- [ ] **Step 2: Run it — expect compile failure** + +Run: `cargo test -p kebab-core source_span_code_round_trips` +Expected: FAIL — `no variant named Code`. + +- [ ] **Step 3: Add the variant** + +In `crates/kebab-core/src/document.rs`, add as the last variant of `pub enum SourceSpan` (after `Time { ... }`): + +```rust + /// p10-1A-2: AST-unit span for code ingest. Internal storage shape + /// (chunks.source_spans_json) — `citation_helper` maps this to the + /// wire `Citation::Code` (added 1A-1). `symbol` is the per-language + /// self-reference path (design §3.4); `` / `` for + /// glue regions, never null for an identified unit. `lang` is the + /// canonical code_lang. + Code { + line_start: u32, + line_end: u32, + symbol: Option, + lang: Option, + }, +``` + +- [ ] **Step 4: Compile — fix every non-exhaustive match** + +Run: `cargo build --workspace 2>&1 | grep -A2 "non-exhaustive\|E0004"` + +The compiler will flag every exhaustive `match` on `SourceSpan`. Known sites (handle each minimally — a `Code` arm that does the type-correct thing, NOT a catch-all `_`): +- `crates/kebab-search/src/citation_helper.rs` — handled fully in Task 3; for now add a temporary `SourceSpan::Code { .. } => Citation::Line { path, start: 1, end: 1, section }` arm with a `// TODO(Task 3)` and replace it in Task 3. +- Any `store-sqlite` / `search` / `id` site: add a faithful arm (e.g. id recipe already serializes `SourceSpan` via serde — likely no match there; only fix real `match` statements the compiler points at). + +Run: `cargo build --workspace` +Expected: clean. + +- [ ] **Step 5: Run the test** + +Run: `cargo test -p kebab-core source_span_code_round_trips` +Expected: PASS. + +- [ ] **Step 6: Sync frozen design §3.4** + +In `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md`, find `pub enum SourceSpan` (~line 562) and add the `Code { line_start, line_end, symbol, lang }` variant to the listing, with a one-line comment `// p10-1A-2: internal code-unit span (see tasks/p10/p10-1a-2)`. Do not alter other variants. + +- [ ] **Step 7: clippy + commit** + +```bash +cargo clippy --workspace --all-targets -- -D warnings +git add crates/ docs/superpowers/specs/2026-04-27-kebab-final-form-design.md +git commit -m "feat(p10-1a-2): add internal SourceSpan::Code variant + design §3.4 sync + +Co-Authored-By: Claude Opus 4.7 (1M context) " +``` + +--- + +## Task 3: `citation_helper` Code arm + +**Files:** +- Modify: `crates/kebab-search/src/citation_helper.rs:21-74` +- Test: `crates/kebab-search/src/citation_helper.rs` (add `#[cfg(test)] mod tests` if absent; `lexical.rs` has `build_citation_*` tests — mirror style there if a test module already imports the helper) + +- [ ] **Step 1: Write the failing test** + +Add a test (in `citation_helper.rs` add a test module, or extend the existing helper tests in `crates/kebab-search/src/lexical.rs` near `build_citation_page_forwards_section`): + +```rust +#[test] +fn build_citation_code_maps_symbol_and_lang() { + use kebab_core::{Citation, SourceSpan, WorkspacePath}; + let span = SourceSpan::Code { + line_start: 5, + line_end: 30, + symbol: Some("chunk::md_heading_v1::MdHeadingV1Chunker::chunk".into()), + lang: Some("rust".into()), + }; + let c = super::citation_from_first_span( + "c1", + WorkspacePath("crates/kebab-chunk/src/md_heading_v1.rs".into()), + None, + Some(&span), + ); + match c { + Citation::Code { path, line_start, line_end, symbol, lang } => { + assert_eq!(path.0, "crates/kebab-chunk/src/md_heading_v1.rs"); + assert_eq!(line_start, 5); + assert_eq!(line_end, 30); + assert_eq!(symbol.as_deref(), Some("chunk::md_heading_v1::MdHeadingV1Chunker::chunk")); + assert_eq!(lang.as_deref(), Some("rust")); + } + other => panic!("expected Citation::Code, got {other:?}"), + } +} +``` + +- [ ] **Step 2: Run — expect fail** + +Run: `cargo test -p kebab-search build_citation_code_maps_symbol_and_lang` +Expected: FAIL (currently the Task-2 temporary arm produces `Citation::Line`). + +- [ ] **Step 3: Replace the temporary arm** + +In `citation_from_first_span`, replace the Task-2 placeholder arm with the real mapping (place it directly after the `SourceSpan::Time` arm, before the `Byte | None` fallback): + +```rust + Some(SourceSpan::Code { line_start, line_end, symbol, lang }) => Citation::Code { + path, + line_start: *line_start, + line_end: *line_end, + symbol: symbol.clone(), + lang: lang.clone(), + }, +``` + +(`section` is unused for code — `Citation::Code` has no section field; this matches the spec's code citation shape.) + +- [ ] **Step 4: Run — expect pass** + +Run: `cargo test -p kebab-search build_citation_code_maps_symbol_and_lang` +Expected: PASS. + +- [ ] **Step 5: clippy + commit** + +```bash +cargo clippy -p kebab-search --all-targets -- -D warnings +git add crates/kebab-search/ +git commit -m "feat(p10-1a-2): map SourceSpan::Code -> Citation::Code in citation_helper + +Co-Authored-By: Claude Opus 4.7 (1M context) " +``` + +--- + +## Task 4: `MediaType::Code(String)` variant + +**Files:** +- Modify: `crates/kebab-core/src/media.rs:38-44` (`MediaType` enum) +- Modify: `crates/kebab-app/src/ingest_progress.rs:99` (`media_label`) +- Modify: `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` (only if `enum MediaType` is enumerated there — `grep -n "enum MediaType" docs/superpowers/specs/2026-04-27-kebab-final-form-design.md`; if present, add the `Code(String)` variant + one-line comment, else skip) +- Test: `crates/kebab-core/src/media.rs` test module + +`MediaType` is `#[serde(rename_all = "lowercase")]`; `Code(String)` serializes as `{"code":"rust"}`. + +- [ ] **Step 1: Write the failing test** + +Add to `media.rs` tests: + +```rust +#[test] +fn media_type_code_serializes_lowercase_tagged() { + let m = MediaType::Code("rust".to_string()); + let v = serde_json::to_value(&m).unwrap(); + assert_eq!(v, serde_json::json!({ "code": "rust" })); + let back: MediaType = serde_json::from_value(v).unwrap(); + assert_eq!(back, m); +} +``` + +- [ ] **Step 2: Run — expect fail** + +Run: `cargo test -p kebab-core media_type_code_serializes` +Expected: FAIL — `no variant named Code`. + +- [ ] **Step 3: Add the variant + media_label arm** + +In `crates/kebab-core/src/media.rs`, add to `pub enum MediaType` immediately before `Other(String)`: + +```rust + /// p10-1A-2: a source-code file. Inner string is the canonical + /// code_lang (design §3.5). 1A activates `"rust"` only; other + /// recognized code langs are still routed `Other` until their phase. + Code(String), +``` + +In `crates/kebab-app/src/ingest_progress.rs`, add a match arm next to the `MediaType::Other(_) => "other"` arm (~line 99): + +```rust + kebab_core::MediaType::Code(_) => "code", +``` + +Then `cargo build --workspace 2>&1 | grep non-exhaustive` and add a faithful arm to every other `MediaType` match the compiler flags (e.g. any UI/store display) — `Code(lang)` should render analogously to `Other`. + +- [ ] **Step 4: Run — expect pass + suite green** + +Run: `cargo test -p kebab-core media_type_code_serializes` +Expected: PASS. +Run: `cargo test --workspace --no-fail-fast -j 1` +Expected: PASS (catches any golden/asset serialization that enumerates MediaType variants; fix faithfully if any fixture counts variants). + +- [ ] **Step 5: design sync (conditional) + clippy + commit** + +```bash +grep -n "enum MediaType" docs/superpowers/specs/2026-04-27-kebab-final-form-design.md +# if present, add Code(String) to that listing with a p10-1A-2 comment +cargo clippy --workspace --all-targets -- -D warnings +git add crates/ docs/ +git commit -m "feat(p10-1a-2): add MediaType::Code(lang) variant + +Co-Authored-By: Claude Opus 4.7 (1M context) " +``` + +--- + +## Task 5: Route `.rs` → `MediaType::Code("rust")` + +**Files:** +- Modify: `crates/kebab-source-fs/src/media.rs:39` (the `_ => MediaType::Other(ext)` fallthrough) +- Test: `crates/kebab-source-fs/src/media.rs` test module (`media_type_for` tests, ~line 49) + +Scope to Rust only (1A-2 = Rust). Non-Rust extensions keep their current `MediaType::Other` mapping — minimal blast radius, regression-safe. + +- [ ] **Step 1: Write the failing test** + +Add near the existing `media_type_for` asserts: + +```rust +#[test] +fn rust_files_map_to_media_code_rust() { + assert_eq!( + media_type_for(Path::new("crates/kebab-core/src/lib.rs")), + MediaType::Code("rust".to_string()) + ); + // non-Rust code extensions stay Other in 1A + assert_eq!(media_type_for(Path::new("a/b.py")), MediaType::Other("py".to_string())); + assert_eq!(media_type_for(Path::new("Cargo.toml")), MediaType::Other("toml".to_string())); +} +``` + +- [ ] **Step 2: Run — expect fail** + +Run: `cargo test -p kebab-source-fs rust_files_map_to_media_code_rust` +Expected: FAIL — `.rs` currently → `MediaType::Other("rs")`. + +- [ ] **Step 3: Add the routing arm** + +In `crates/kebab-source-fs/src/media.rs`, add an `"rs"` arm before the final `_ => MediaType::Other(ext)`: + +```rust + // p10-1A-2: Rust is the only code lang activated in 1A. Other + // recognized code langs stay Other until their phase (1B+). + "rs" => MediaType::Code("rust".to_string()), +``` + +- [ ] **Step 4: Run — expect pass** + +Run: `cargo test -p kebab-source-fs rust_files_map_to_media_code_rust` +Expected: PASS. + +- [ ] **Step 5: clippy + commit** + +```bash +cargo clippy -p kebab-source-fs --all-targets -- -D warnings +git add crates/kebab-source-fs/ +git commit -m "feat(p10-1a-2): route .rs files to MediaType::Code(rust) + +Co-Authored-By: Claude Opus 4.7 (1M context) " +``` + +--- + +## Task 6: `kebab-parse-code` Rust AST extractor + +**Files:** +- Create: `crates/kebab-parse-code/src/rust.rs` +- Modify: `crates/kebab-parse-code/src/lib.rs` (add `pub mod rust;` + re-export `PARSER_VERSION`, `RustAstExtractor`) +- Create: `crates/kebab-parse-code/tests/fixtures/sample.rs` (Rust fixture) +- Test: inline `#[cfg(test)] mod tests` in `rust.rs` + +This is the core. The extractor walks the tree-sitter parse tree and emits **one `Block::Code` per top-level AST semantic unit**, each with `SourceSpan::Code`. The `CanonicalDocument` scaffold (doc_id, provenance, metadata, return struct) **mirrors `crates/kebab-parse-pdf/src/lib.rs:51-225` exactly** — same `Extractor` trait impl shape, same `id_for_doc` / `ProvenanceEvent` / `CanonicalDocument` construction. Only the differences below change. + +### tree-sitter API note + +Use the modern API: `parser.set_language(&tree_sitter_rust::LANGUAGE.into())`. If the resolved `tree-sitter-rust` predates the `LANGUAGE: LanguageFn` const (no `LANGUAGE` symbol), use `parser.set_language(&tree_sitter_rust::language())` instead. Verify which by `cargo doc -p tree-sitter-rust --no-deps` or reading its docs; pick the one that compiles. + +### Semantic-unit rules (design §9.1 + §3.4) + +Walk `root_node()` (kind `source_file`) named children. Maintain a `mod_path: Vec` (module nesting), starting empty. + +| node kind | unit | symbol | +|-----------|------|--------| +| `function_item` | 1 | `mod_path::fn_name` | +| `struct_item` / `enum_item` / `union_item` / `trait_item` / `type_item` | 1 | `mod_path::TypeName` | +| `macro_definition` | 1 | `mod_path::name!` | +| `impl_item` | 1 per inner `function_item` | `mod_path::ImplType::method` (ImplType = text of impl `type` field; for `impl Trait for T`, use `Trait::method` per §3.4) | +| `mod_item` **with** `declaration_list` body | recurse with `mod_path` + mod name pushed | — | +| `use_declaration`, `extern_crate_declaration`, `const_item`, `static_item`, `mod_item` **without** body, top-level `attribute_item`/`macro_invocation` | accumulated into ONE grouped unit | `` (or `` if the whole file produced no fn/type/impl unit and the group is only `mod_item` declarations) | + +- **Line range:** `node.start_position().row + 1 ..= node.end_position().row + 1` (1-based inclusive). Extend `line_start` upward over contiguous immediately-preceding sibling `line_comment` / `block_comment` / `attribute_item` nodes (doc comments + attributes belong to their item — design §9.1 "선언 + doc comment"). +- **Grouped unit line range:** min start_line over the group .. max end_line over the group. +- **`code` field:** the exact source substring for those lines (split `source` by `\n`, take `[line_start-1 ..= line_end-1]`, rejoin with `\n`). +- Each unit → `Block::Code(CodeBlock { common: CommonBlock { block_id, heading_path: vec![], source_span }, lang: Some("rust".into()), code })` where `source_span = SourceSpan::Code { line_start, line_end, symbol: Some(sym), lang: Some("rust".into()) }`. `block_id = id_for_block(&doc_id, "code", &[], ordinal, &source_span)` with `ordinal` = 0-based unit index. + +- [ ] **Step 1: Create the fixture** + +Create `crates/kebab-parse-code/tests/fixtures/sample.rs`: + +```rust +//! sample fixture + +use std::fmt; + +const ANSWER: u32 = 42; + +/// Doc comment on a free fn. +pub fn parse(input: &str) -> usize { + input.len() +} + +pub struct Foo { + pub n: u32, +} + +impl Foo { + /// method doc + pub fn double(&self) -> u32 { + self.n * 2 + } + + fn name() -> &'static str { + "foo" + } +} + +pub trait Greet { + fn hello(&self) -> String; +} + +mod inner { + pub fn helper() -> bool { + true + } +} +``` + +- [ ] **Step 2: Write the failing test** + +In `rust.rs` add: + +```rust +#[cfg(test)] +mod tests { + use super::*; + use kebab_core::{Block, MediaType, SourceSpan}; + + fn extract_fixture() -> kebab_core::CanonicalDocument { + let bytes = std::fs::read( + concat!(env!("CARGO_MANIFEST_DIR"), "/tests/fixtures/sample.rs"), + ) + .unwrap(); + let asset = kebab_parse_code_test_support::fixed_rust_asset("crates/x/src/sample.rs"); + let cfg = kebab_core::ExtractConfig::default(); + let root = std::path::PathBuf::from("/tmp"); + let ctx = kebab_core::ExtractContext { asset: &asset, workspace_root: &root, config: &cfg }; + RustAstExtractor::new().extract(&ctx, &bytes).unwrap() + } + + #[test] + fn extractor_supports_only_media_code_rust() { + let e = RustAstExtractor::new(); + assert!(e.supports(&MediaType::Code("rust".into()))); + assert!(!e.supports(&MediaType::Code("python".into()))); + assert!(!e.supports(&MediaType::Markdown)); + } + + #[test] + fn emits_one_block_per_semantic_unit_with_symbols() { + let doc = extract_fixture(); + let mut syms: Vec<(String, u32, u32)> = doc + .blocks + .iter() + .map(|b| match b { + Block::Code(c) => match &c.common.source_span { + SourceSpan::Code { symbol, line_start, line_end, lang } => { + assert_eq!(lang.as_deref(), Some("rust")); + (symbol.clone().unwrap(), *line_start, *line_end) + } + _ => panic!("code block must carry SourceSpan::Code"), + }, + other => panic!("expected Block::Code, got {other:?}"), + }) + .collect(); + syms.sort(); + let names: Vec<&str> = syms.iter().map(|(s, _, _)| s.as_str()).collect(); + assert!(names.contains(&"parse")); + assert!(names.contains(&"Foo")); + assert!(names.contains(&"Foo::double")); + assert!(names.contains(&"Foo::name")); + assert!(names.contains(&"Greet")); + assert!(names.contains(&"inner::helper")); + assert!(names.contains(&"")); // use + const grouped + // doc-comment line is folded into the unit it documents: + let parse_unit = syms.iter().find(|(s, _, _)| s == "parse").unwrap(); + let parse_src = doc.blocks.iter().find_map(|b| match b { + Block::Code(c) if matches!(&c.common.source_span, SourceSpan::Code{symbol,..} if symbol.as_deref()==Some("parse")) => Some(c.code.clone()), + _ => None, + }).unwrap(); + assert!(parse_src.contains("/// Doc comment on a free fn."), "doc comment folded in: {parse_src}"); + let _ = parse_unit; + } + + #[test] + fn deterministic_across_runs() { + let a = extract_fixture(); + for _ in 0..50 { + assert_eq!(extract_fixture().blocks, a.blocks); + } + } +} + +#[cfg(test)] +mod kebab_parse_code_test_support { + use kebab_core::*; + use time::OffsetDateTime; + pub fn fixed_rust_asset(path: &str) -> RawAsset { + RawAsset { + asset_id: AssetId("a".repeat(64)), + source_uri: SourceUri::File(std::path::PathBuf::from(path)), + workspace_path: WorkspacePath(path.to_string()), + media_type: MediaType::Code("rust".to_string()), + byte_len: 0, + checksum: Checksum("b".repeat(64)), + discovered_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(), + stored: AssetStorage::InPlace, + } + } +} +``` + +> Before relying on it, verify `Checksum`/`SourceUri`/`AssetStorage` field shapes by reading `crates/kebab-core/src/asset.rs:63-95`; adjust the test-support constructor to the actual variants (e.g. `AssetStorage` may be `External`/`InPlace` — use whatever exists). This is a fixture-construction detail, not a design choice. + +- [ ] **Step 3: Run — expect fail** + +Run: `cargo test -p kebab-parse-code emits_one_block_per_semantic_unit` +Expected: FAIL — `RustAstExtractor` undefined. + +- [ ] **Step 4: Implement `rust.rs`** + +Create `crates/kebab-parse-code/src/rust.rs`. Scaffold (Extractor impl, doc_id, provenance events, final `CanonicalDocument {…}`) is a **direct adaptation of `crates/kebab-parse-pdf/src/lib.rs:51-225`** with these concrete differences: + +- `pub const PARSER_VERSION: &str = "code-rust-v1";` +- `pub struct RustAstExtractor;` + `new()`/`Default` like `PdfTextExtractor`. +- `fn supports(&self, m: &MediaType) -> bool { matches!(m, MediaType::Code(l) if l == "rust") }` +- agent strings: `"kb-parse-code"` instead of `"kb-parse-pdf"`. +- `title`: filename stem of `asset.workspace_path` (reuse the same `strip_extension(filename_from_workspace_path(...))` helpers — copy those two small fns from `kebab-parse-pdf/src/lib.rs:229+` into `rust.rs`, or inline equivalent). +- `lang: Lang("und".into())` (natural-language detection out of scope, design §3.5). +- **metadata**: same `Metadata { … }` literal as PDF but set: + - `source_type`: use `SourceType::Code` if the enum has it (`grep -n "enum SourceType" crates/kebab-core/src/metadata.rs`), else `SourceType::Note`. + - `code_lang: Some("rust".to_string())` + - `repo` / `git_branch` / `git_commit`: from `crate::repo::detect_repo`. Resolve the file's absolute path: if `asset.source_uri` is `SourceUri::File(p)` use `p`; join with `ctx.workspace_root` if relative. `match detect_repo(&abs) { Some(r) => (Some(r.name), r.branch, r.commit), None => (None, None, None) }`. +- **blocks**: replace the PDF per-page loop with the AST walk. Implementation: + +```rust +fn build_blocks( + source: &str, + doc_id: &kebab_core::DocumentId, +) -> anyhow::Result> { + use kebab_core::{Block, CodeBlock, CommonBlock, SourceSpan, id_for_block}; + + let mut parser = tree_sitter::Parser::new(); + parser + .set_language(&tree_sitter_rust::LANGUAGE.into()) + .map_err(|e| anyhow::anyhow!("set tree-sitter-rust language: {e}"))?; + let tree = parser + .parse(source.as_bytes(), None) + .ok_or_else(|| anyhow::anyhow!("tree-sitter failed to parse Rust source"))?; + let lines: Vec<&str> = source.split('\n').collect(); + + // (symbol, start_line_1based, end_line_1based) in document order. + let mut units: Vec<(String, u32, u32)> = Vec::new(); + // Pending glue (use/const/static/mod-decl/attr) accumulated into one + // (or ) unit, flushed when a real unit appears or + // at end of a scope. + let mut glue: Vec<(usize, u32, u32)> = Vec::new(); // (is_mod_decl as 0/1 via usize, s, e) + + fn node_name<'a>(n: &tree_sitter::Node, src: &'a str) -> Option<&'a str> { + n.child_by_field_name("name") + .map(|c| &src[c.start_byte()..c.end_byte()]) + } + // Extend start upward over leading doc-comments / attributes. + fn unit_start(n: &tree_sitter::Node) -> u32 { + let mut start = n.start_position().row as u32 + 1; + let mut prev = n.prev_sibling(); + while let Some(p) = prev { + let k = p.kind(); + if k == "line_comment" || k == "block_comment" || k == "attribute_item" { + start = p.start_position().row as u32 + 1; + prev = p.prev_sibling(); + } else { + break; + } + } + start + } + + fn walk( + node: tree_sitter::Node, + src: &str, + mod_path: &[String], + units: &mut Vec<(String, u32, u32)>, + glue: &mut Vec<(usize, u32, u32)>, + ) { + let mut cur = node.walk(); + for child in node.named_children(&mut cur) { + let s = unit_start(&child); + let e = child.end_position().row as u32 + 1; + let prefix = if mod_path.is_empty() { + String::new() + } else { + format!("{}::", mod_path.join("::")) + }; + match child.kind() { + "function_item" | "struct_item" | "enum_item" | "union_item" + | "trait_item" | "type_item" => { + if let Some(name) = node_name(&child, src) { + flush_glue(glue, units); + units.push((format!("{prefix}{name}"), s, e)); + } + } + "macro_definition" => { + if let Some(name) = node_name(&child, src) { + flush_glue(glue, units); + units.push((format!("{prefix}{name}!"), s, e)); + } + } + "impl_item" => { + flush_glue(glue, units); + let ty = child + .child_by_field_name("type") + .map(|c| src[c.start_byte()..c.end_byte()].trim().to_string()); + let tr = child + .child_by_field_name("trait") + .map(|c| src[c.start_byte()..c.end_byte()].trim().to_string()); + let owner = tr.or(ty).unwrap_or_else(|| "".to_string()); + if let Some(body) = child.child_by_field_name("body") { + let mut bc = body.walk(); + for m in body.named_children(&mut bc) { + if m.kind() == "function_item" { + if let Some(mn) = node_name(&m, src) { + let ms = unit_start(&m); + let me = m.end_position().row as u32 + 1; + units.push((format!("{prefix}{owner}::{mn}"), ms, me)); + } + } + } + } + } + "mod_item" => { + if let Some(body) = child.child_by_field_name("body") { + flush_glue(glue, units); + let name = node_name(&child, src).unwrap_or("mod").to_string(); + let mut np = mod_path.to_vec(); + np.push(name); + walk(body, src, &np, units, glue); + } else { + glue.push((1, s, e)); // bare `mod foo;` declaration + } + } + "use_declaration" | "extern_crate_declaration" | "const_item" + | "static_item" | "attribute_item" | "macro_invocation" => { + glue.push((0, s, e)); + } + _ => { /* line_comment / block_comment / unknown: ignore (folded via unit_start) */ } + } + } + flush_glue(glue, units); + } + + fn flush_glue(glue: &mut Vec<(usize, u32, u32)>, units: &mut Vec<(String, u32, u32)>) { + if glue.is_empty() { + return; + } + let s = glue.iter().map(|(_, a, _)| *a).min().unwrap(); + let e = glue.iter().map(|(_, _, b)| *b).max().unwrap(); + let only_mod_decls = glue.iter().all(|(is_mod, _, _)| *is_mod == 1); + let sym = if only_mod_decls { "" } else { "" }; + units.push((sym.to_string(), s, e)); + glue.clear(); + } + + walk(tree.root_node(), source, &[], &mut units, &mut glue); + + let total_lines = lines.len() as u32; + let mut blocks = Vec::with_capacity(units.len()); + for (ordinal, (symbol, ls, le)) in units.into_iter().enumerate() { + let line_start = ls.max(1); + let line_end = le.min(total_lines.max(1)); + let span = SourceSpan::Code { + line_start, + line_end, + symbol: Some(symbol), + lang: Some("rust".to_string()), + }; + let block_id = id_for_block(doc_id, "code", &[], ordinal as u32, &span); + let code = lines[(line_start as usize - 1)..=(line_end as usize - 1)].join("\n"); + blocks.push(Block::Code(CodeBlock { + common: CommonBlock { + block_id, + heading_path: Vec::new(), + source_span: span, + }, + lang: Some("rust".to_string()), + code, + })); + } + Ok(blocks) +} +``` + +Notes for the implementer: +- `flush_glue` ordering: glue flushed *before* pushing a real unit so document order is preserved (glue that precedes the first fn becomes the `` chunk spanning the `use`/`const` region; the `unit_start` doc-comment extension keeps the fn's own doc comment with the fn, not the glue). +- A `glue` flushed after a real unit between two fns is still a valid `` unit (rare; acceptable). +- If `units` is empty (e.g. an empty file) → emit zero blocks (consistent with empty-PDF-page behavior). +- The `e` of a fixture's last `mod inner { … }` etc. is end-of-block; line slicing uses inclusive 1-based. + +- [ ] **Step 5: Wire into lib.rs** + +In `crates/kebab-parse-code/src/lib.rs`: + +```rust +pub mod rust; +pub use rust::{PARSER_VERSION as RUST_PARSER_VERSION, RustAstExtractor}; +``` + +- [ ] **Step 6: Run the tests — expect pass** + +Run: `cargo test -p kebab-parse-code` +Expected: PASS (`extractor_supports_*`, `emits_one_block_per_semantic_unit_with_symbols`, `deterministic_across_runs`). +If symbol names mismatch (tree-sitter-rust grammar field-name drift, e.g. `impl_item` `type` vs `type_arguments`), inspect with a scratch `node.kind()`/`field` dump and adjust the field names; pin behavior with the test. + +- [ ] **Step 7: clippy + commit** + +```bash +cargo clippy -p kebab-parse-code --all-targets -- -D warnings +git add crates/kebab-parse-code/ +git commit -m "feat(p10-1a-2): tree-sitter-rust AST extractor (parser-side) + +Co-Authored-By: Claude Opus 4.7 (1M context) " +``` + +--- + +## Task 7: `code-rust-ast-v1` chunker + +**Files:** +- Create: `crates/kebab-chunk/src/code_rust_ast_v1.rs` +- Modify: `crates/kebab-chunk/src/lib.rs` (`mod` + `pub use`) +- Test: inline `#[cfg(test)] mod tests` in the new file + +The chunker consumes the AST `CanonicalDocument` and maps **1 `Block::Code` → 1 `Chunk`**, except a unit longer than `AST_CHUNK_MAX_LINES` is split into `[part i/N]` sub-chunks. tree-sitter is NOT imported here (forbidden — AST is parser-side). Mirror `crates/kebab-chunk/src/pdf_page_v1.rs` for: `VERSION_LABEL` const, `BYTES_PER_TOKEN = 3`, `POLICY_HASH_HEX_LEN = 16`, `policy_hash` impl (identical blake3 recipe — cross-chunker fingerprint identity is required), per-chunk `policy_hash` variant to avoid `chunk_id` collision on split units, the upfront block-shape validation that `bail!`s on a non-code doc. + +`AST_CHUNK_MAX_LINES` is a module constant (`= 200`) matching `IngestCodeCfg::default().ast_chunk_max_lines`. Threading the config value through the fixed `Chunker` trait needs a per-medium chunker registry — a P+ task; this mirrors the existing `pdf-page-v1` "chunker_version hard-coded" deviation. Record it (Task 11 HOTFIXES). + +- [ ] **Step 1: Write failing tests** + +```rust +#[cfg(test)] +mod tests { + use super::*; + use kebab_core::{ + Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock, + SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance, + SourceType, TrustLevel, WorkspacePath, + }; + use time::OffsetDateTime; + + fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument { + let wp = WorkspacePath("crates/x/src/a.rs".into()); + let aid = AssetId("a".repeat(64)); + let pv = ParserVersion("code-rust-v1".into()); + let doc_id = id_for_doc(&wp, &aid, &pv); + let blocks = units + .iter() + .enumerate() + .map(|(i, (sym, ls, le, code))| { + let span = SourceSpan::Code { + line_start: *ls, + line_end: *le, + symbol: Some((*sym).to_string()), + lang: Some("rust".into()), + }; + let bid = id_for_block(&doc_id, "code", &[], i as u32, &span); + Block::Code(CodeBlock { + common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span }, + lang: Some("rust".into()), + code: (*code).to_string(), + }) + }) + .collect(); + CanonicalDocument { + doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(), + lang: Lang("und".into()), blocks, + metadata: Metadata { + aliases: vec![], tags: vec![], + created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(), + updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(), + source_type: SourceType::Note, trust_level: TrustLevel::Primary, + user_id_alias: None, user: Default::default(), + repo: Some("kebab".into()), git_branch: Some("main".into()), + git_commit: Some("0".repeat(40)), code_lang: Some("rust".into()), + }, + provenance: Provenance { events: vec![] }, + parser_version: pv, schema_version: 1, doc_version: 1, + last_chunker_version: None, last_embedding_version: None, + } + } + fn policy() -> ChunkPolicy { + ChunkPolicy { target_tokens: 500, overlap_tokens: 80, + respect_markdown_headings: false, + chunker_version: ChunkerVersion(VERSION_LABEL.into()) } + } + + #[test] + fn chunker_version_is_code_rust_ast_v1() { + assert_eq!(CodeRustAstV1Chunker.chunker_version(), + ChunkerVersion("code-rust-ast-v1".into())); + } + + #[test] + fn one_chunk_per_unit_preserves_code_span() { + let doc = code_doc(&[ + ("parse", 1, 3, "pub fn parse() {}\n// x\n}"), + ("Foo::double", 5, 7, "fn double() {}\n//\n}"), + ]); + let chunks = CodeRustAstV1Chunker.chunk(&doc, &policy()).unwrap(); + assert_eq!(chunks.len(), 2); + for c in &chunks { + assert_eq!(c.source_spans.len(), 1); + assert!(matches!(c.source_spans[0], SourceSpan::Code { .. })); + assert_eq!(c.heading_path, Vec::::new()); + assert_eq!(c.chunker_version.0, "code-rust-ast-v1"); + } + match &chunks[0].source_spans[0] { + SourceSpan::Code { symbol, line_start, line_end, .. } => { + assert_eq!(symbol.as_deref(), Some("parse")); + assert_eq!((*line_start, *line_end), (1, 3)); + } + _ => unreachable!(), + } + } + + #[test] + fn oversize_unit_splits_into_parts_with_unique_ids() { + // 500-line fn → must split (AST_CHUNK_MAX_LINES = 200). + let body = (0..500).map(|i| format!(" let x{i} = {i};")).collect::>().join("\n"); + let code = format!("pub fn big() {{\n{body}\n}}"); + let doc = code_doc(&[("big", 1, 502, &code)]); + let chunks = CodeRustAstV1Chunker.chunk(&doc, &policy()).unwrap(); + assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len()); + for c in &chunks { + match &c.source_spans[0] { + SourceSpan::Code { symbol, .. } => { + assert!(symbol.as_deref().unwrap().starts_with("big [part "), + "part-numbered symbol, got {symbol:?}"); + } + _ => unreachable!(), + } + } + let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect(); + let n = ids.len(); ids.sort(); ids.dedup(); + assert_eq!(ids.len(), n, "chunk_ids unique across split parts"); + } + + #[test] + fn non_code_doc_errors() { + use kebab_core::TextBlock; + let mut doc = code_doc(&[("parse", 1, 1, "fn parse(){}")]); + doc.blocks = vec![Block::Paragraph(TextBlock { + common: CommonBlock { + block_id: kebab_core::BlockId("b".into()), + heading_path: vec![], + source_span: SourceSpan::Line { start: 1, end: 1 }, + }, + text: "x".into(), inlines: vec![], + })]; + let err = CodeRustAstV1Chunker.chunk(&doc, &policy()).unwrap_err(); + assert!(err.to_string().contains("CodeRustAstV1Chunker")); + } + + #[test] + fn deterministic_chunk_ids_1000() { + let doc = code_doc(&[("parse", 1, 2, "fn parse(){}\n}")]); + let base: Vec = CodeRustAstV1Chunker.chunk(&doc, &policy()) + .unwrap().into_iter().map(|c| c.chunk_id.0).collect(); + for _ in 0..1000 { + let again: Vec = CodeRustAstV1Chunker.chunk(&doc, &policy()) + .unwrap().into_iter().map(|c| c.chunk_id.0).collect(); + assert_eq!(again, base); + } + } + + #[test] + fn policy_hash_matches_md_heading_v1() { + let p = policy(); + assert_eq!(CodeRustAstV1Chunker.policy_hash(&p), + crate::MdHeadingV1Chunker.policy_hash(&p)); + } +} +``` + +- [ ] **Step 2: Run — expect fail** + +Run: `cargo test -p kebab-chunk code_rust_ast` +Expected: FAIL — `CodeRustAstV1Chunker` undefined. + +- [ ] **Step 3: Implement the chunker** + +Create `crates/kebab-chunk/src/code_rust_ast_v1.rs`: + +```rust +//! `code-rust-ast-v1` — maps a tree-sitter-derived Rust AST +//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with +//! `SourceSpan::Code`) to chunks 1:1. A unit longer than +//! `AST_CHUNK_MAX_LINES` is split into ` [part i/N]` sub-chunks +//! at blank-line paragraph boundaries (design §9.1 oversize fallback). +//! +//! tree-sitter is intentionally NOT a dependency here: AST work is +//! parser-side (`kebab-parse-code`, design §6.3). This chunker only +//! consumes the `CanonicalDocument`. +//! +//! `AST_CHUNK_MAX_LINES` is a constant matching +//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium +//! config threading needs a chunker registry (P+); same deviation +//! pattern as `pdf-page-v1`'s pinned `chunker_version` +//! (`tasks/HOTFIXES.md`). + +use kebab_core::{ + Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId, + SourceSpan, id_for_chunk, +}; + +const VERSION_LABEL: &str = "code-rust-ast-v1"; +const BYTES_PER_TOKEN: usize = 3; +const POLICY_HASH_HEX_LEN: usize = 16; +const AST_CHUNK_MAX_LINES: u32 = 200; + +#[derive(Clone, Copy, Debug, Default)] +pub struct CodeRustAstV1Chunker; + +impl Chunker for CodeRustAstV1Chunker { + fn chunker_version(&self) -> ChunkerVersion { + ChunkerVersion(VERSION_LABEL.to_string()) + } + + fn policy_hash(&self, policy: &ChunkPolicy) -> String { + let bytes = serde_json_canonicalizer::to_vec(policy) + .expect("canonical JSON serialization of ChunkPolicy must not fail"); + let hex = blake3::hash(&bytes).to_hex().to_string(); + hex[..POLICY_HASH_HEX_LEN].to_string() + } + + fn chunk( + &self, + doc: &CanonicalDocument, + policy: &ChunkPolicy, + ) -> anyhow::Result> { + for b in &doc.blocks { + let c = match b { + Block::Code(c) => c, + _ => anyhow::bail!( + "CodeRustAstV1Chunker only handles code docs (got non-Code block)" + ), + }; + if !matches!(c.common.source_span, SourceSpan::Code { .. }) { + anyhow::bail!( + "CodeRustAstV1Chunker only handles code docs (got non-Code source_span)" + ); + } + } + + let base_policy_hash = self.policy_hash(policy); + let chunker_version = self.chunker_version(); + let mut out: Vec = Vec::new(); + + for b in &doc.blocks { + let cb = match b { + Block::Code(c) => c, + _ => unreachable!("validated above"), + }; + let (ls, le, symbol, lang) = match &cb.common.source_span { + SourceSpan::Code { line_start, line_end, symbol, lang } => { + (*line_start, *line_end, symbol.clone(), lang.clone()) + } + _ => unreachable!("validated above"), + }; + let block_ids: Vec = vec![cb.common.block_id.clone()]; + let span_lines = le.saturating_sub(ls) + 1; + + if span_lines <= AST_CHUNK_MAX_LINES { + let span = SourceSpan::Code { + line_start: ls, + line_end: le, + symbol: symbol.clone(), + lang: lang.clone(), + }; + out.push(make_chunk( + doc, &chunker_version, &block_ids, &base_policy_hash, + None, span, cb.code.clone(), + )); + } else { + let parts = split_oversize(&cb.code); + let n = parts.len(); + for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() { + let part_ls = ls + off_start; + let part_le = ls + off_end; + let part_sym = symbol + .as_ref() + .map(|s| format!("{s} [part {}/{n}]", i + 1)); + let span = SourceSpan::Code { + line_start: part_ls, + line_end: part_le, + symbol: part_sym, + lang: lang.clone(), + }; + out.push(make_chunk( + doc, &chunker_version, &block_ids, &base_policy_hash, + Some(part_ls), span, text, + )); + } + } + } + + tracing::debug!( + target: "kebab-chunk", + doc_id = %doc.doc_id, + chunks = out.len(), + "code-rust-ast-v1 chunked", + ); + Ok(out) + } +} + +#[allow(clippy::too_many_arguments)] +fn make_chunk( + doc: &CanonicalDocument, + chunker_version: &ChunkerVersion, + block_ids: &[BlockId], + base_policy_hash: &str, + split_key: Option, + span: SourceSpan, + text: String, +) -> Chunk { + // Per-chunk policy_hash variant prevents chunk_id collision when one + // block splits into multiple parts (same block_ids). Mirrors + // pdf-page-v1. Single-chunk units use the base hash unchanged. + let id_hash = match split_key { + Some(k) => format!("{base_policy_hash}#L{k}"), + None => base_policy_hash.to_string(), + }; + let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash); + let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN); + Chunk { + chunk_id, + doc_id: DocumentId(doc.doc_id.0.clone()), + block_ids: block_ids.to_vec(), + text, + heading_path: Vec::new(), + source_spans: vec![span], + token_estimate, + chunker_version: chunker_version.clone(), + policy_hash: base_policy_hash.to_string(), + } +} + +/// Split an oversize unit at blank-line paragraph boundaries, greedily +/// gluing paragraphs until ~`AST_CHUNK_MAX_LINES` lines accumulate. +/// Returns `(line_offset_start, line_offset_end, text)` where offsets are +/// 0-based within the unit (caller adds the unit's absolute `line_start`). +fn split_oversize(code: &str) -> Vec<(u32, u32, String)> { + let lines: Vec<&str> = code.split('\n').collect(); + let total = lines.len() as u32; + let mut out: Vec<(u32, u32, String)> = Vec::new(); + let mut start: u32 = 0; + while start < total { + let mut end = (start + AST_CHUNK_MAX_LINES).min(total); + // Prefer ending on a blank line within the last 20% of the window + // so we don't cut mid-paragraph when a boundary is nearby. + let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5); + if end < total { + if let Some(b) = (floor.min(end)..end) + .rev() + .find(|&i| lines[i as usize].trim().is_empty()) + { + end = b + 1; + } + } + let text = lines[start as usize..end as usize].join("\n"); + out.push((start, end.saturating_sub(1), text)); + start = end; + } + if out.is_empty() { + out.push((0, total.saturating_sub(1), code.to_string())); + } + out +} +``` + +- [ ] **Step 4: Wire into lib.rs** + +In `crates/kebab-chunk/src/lib.rs` add (matching the existing `mod`/`pub use` style): + +```rust +mod code_rust_ast_v1; +pub use code_rust_ast_v1::CodeRustAstV1Chunker; +``` + +- [ ] **Step 5: Run — expect pass** + +Run: `cargo test -p kebab-chunk code_rust_ast` +Expected: PASS (all 6 tests). + +- [ ] **Step 6: clippy + commit** + +```bash +cargo clippy -p kebab-chunk --all-targets -- -D warnings +git add crates/kebab-chunk/ +git commit -m "feat(p10-1a-2): code-rust-ast-v1 chunker (1:1 + oversize split) + +Co-Authored-By: Claude Opus 4.7 (1M context) " +``` + +--- + +## Task 8: `kebab-app` dispatch — `ingest_one_code_asset` + +**Files:** +- Modify: `crates/kebab-app/Cargo.toml` (add `kebab-parse-code = { path = "../kebab-parse-code" }` if not already a dep) +- Modify: `crates/kebab-app/src/lib.rs` (`ingest_one_asset` match ~line 896; new `ingest_one_code_asset` fn modeled on `ingest_one_pdf_asset` at 1455-end) +- Test: `crates/kebab-app/tests/` — add `code_ingest_smoke.rs` (mirror an existing app integration test's harness) + +- [ ] **Step 1: Check/Add the crate dep** + +```bash +grep -n "kebab-parse-code" crates/kebab-app/Cargo.toml || \ + echo 'kebab-parse-code = { path = "../kebab-parse-code" } # p10-1A-2' \ + >> /dev/stderr +``` + +If absent, add `kebab-parse-code = { path = "../kebab-parse-code" }` under `[dependencies]` in `crates/kebab-app/Cargo.toml` (1A-1 may have added it for the skip helpers — verify). + +- [ ] **Step 2: Write the failing integration test** + +Create `crates/kebab-app/tests/code_ingest_smoke.rs`. Model the harness on an existing app integration test (read one under `crates/kebab-app/tests/` for the `App`/`Config`/TempDir setup pattern). Core assertions: + +```rust +// Build an isolated TempDir KB, drop a tiny .rs file in the workspace, +// run ingest, then assert a search returns a Citation::Code. +#[test] +fn rust_file_ingests_and_searches_as_code_citation() { + // ... TempDir + Config pointing workspace.root at it (copy the + // harness from the sibling integration test verbatim) ... + std::fs::write(workspace_root.join("demo.rs"), + "/// adds\npub fn add(a: i32, b: i32) -> i32 {\n a + b\n}\n").unwrap(); + + let report = kebab_app::ingest_with_config(&cfg, /* args per existing test */).unwrap(); + assert!(report.ingested >= 1, "rust file ingested: {report:?}"); + + let hits = kebab_app::search_with_config(&cfg, "add", /* args */).unwrap(); + let code_hit = hits.iter().find(|h| matches!( + h.citation, kebab_core::Citation::Code { .. })); + let h = code_hit.expect("a Citation::Code hit"); + match &h.citation { + kebab_core::Citation::Code { lang, symbol, line_start, .. } => { + assert_eq!(lang.as_deref(), Some("rust")); + assert_eq!(symbol.as_deref(), Some("add")); + assert!(*line_start >= 1); + } + _ => unreachable!(), + } + assert_eq!(h.code_lang.as_deref(), Some("rust")); +} +``` + +> Use the exact `*_with_config` facade signatures from a sibling test (the facade rule — CLAUDE.md — requires the `_with_config` form). Read one existing `crates/kebab-app/tests/*.rs` to copy the harness; do not invent the Config builder. + +- [ ] **Step 3: Run — expect fail** + +Run: `cargo test -p kebab-app rust_file_ingests_and_searches_as_code_citation` +Expected: FAIL — code asset currently hits the `_ => Skipped` arm in `ingest_one_asset`. + +- [ ] **Step 4: Add the dispatch + `ingest_one_code_asset`** + +In `crates/kebab-app/src/lib.rs`, in the `ingest_one_asset` match (~line 896, where `MediaType::Pdf => { return ingest_one_pdf_asset(...) }`), add before the catch-all `_`: + +```rust + MediaType::Code(lang) if lang == "rust" => { + return ingest_one_code_asset( + app, asset, chunk_policy, embedder, vector_store, + existing_doc_ids, force_reingest, + ); + } + // Non-Rust code langs activate in later phases (1B+); skip for now. +``` + +Add `fn ingest_one_code_asset(...)` modeled **line-for-line on `ingest_one_pdf_asset` (lib.rs:1455-end)** with these substitutions: +- parser_version: `ParserVersion(kebab_parse_code::RUST_PARSER_VERSION.to_string())` +- extractor: `kebab_parse_code::RustAstExtractor::new().extract(&ctx, &bytes).context("kb-parse-code::RustAstExtractor::extract")?` +- chunker: `let chunker = CodeRustAstV1Chunker;` + `chunker.chunk(&canonical, chunk_policy).context("kb-chunk::CodeRustAstV1Chunker::chunk")?` +- `try_skip_unchanged(... &CodeRustAstV1Chunker.chunker_version() ...)` +- `.context("... (code)")` strings instead of `(pdf)` +- import `CodeRustAstV1Chunker` (it's re-exported from `kebab_chunk`) at the top of `lib.rs` alongside the existing `PdfPageV1Chunker` import. + +Everything else (read bytes, ExtractContext, put_asset/document/blocks/chunks, embed branch, IngestItem construction, `kb://` skip) is identical. + +- [ ] **Step 5: Run — expect pass** + +Run: `cargo test -p kebab-app rust_file_ingests_and_searches_as_code_citation` +Expected: PASS. + +- [ ] **Step 6: clippy + commit** + +```bash +cargo clippy -p kebab-app --all-targets -- -D warnings +git add crates/kebab-app/ +git commit -m "feat(p10-1a-2): wire code ingest dispatch (ingest_one_code_asset) + +Co-Authored-By: Claude Opus 4.7 (1M context) " +``` + +--- + +## Task 9: `code_lang_breakdown` in `kebab schema` + +**Files:** +- Modify: `crates/kebab-store-sqlite/src/store.rs` (add a `code_lang_breakdown()` query next to whatever computes `media_breakdown` ~line 608) +- Modify: `crates/kebab-app/src/schema.rs:170` (currently `code_lang_breakdown: BTreeMap::new()`) +- Test: extend the existing `schema.rs` test (`crates/kebab-app/src/schema.rs:196+`) to assert a real count after a code ingest, or a store-level unit test in `kebab-store-sqlite` + +- [ ] **Step 1: Write the failing test** + +In `crates/kebab-store-sqlite` add a unit test that inserts a document with `metadata.code_lang = Some("rust")` and asserts: + +```rust +#[test] +fn code_lang_breakdown_counts_by_code_lang() { + // ... open in-memory/temp store, put_document with code_lang=Some("rust") ... + let bd = store.code_lang_breakdown().unwrap(); + assert_eq!(bd.get("rust"), Some(&1)); +} +``` + +> Read an existing `kebab-store-sqlite` test for the store-setup harness and how documents/metadata are persisted (so the `code_lang` column / json path is correct — `metadata` is stored as JSON; the query likely uses `json_extract(metadata_json, '$.code_lang')`). + +- [ ] **Step 2: Run — expect fail** + +Run: `cargo test -p kebab-store-sqlite code_lang_breakdown_counts` +Expected: FAIL — `code_lang_breakdown` undefined. + +- [ ] **Step 3: Implement the store query** + +In `crates/kebab-store-sqlite/src/store.rs`, mirror the `media_breakdown` query. The exact column depends on how `Metadata` is stored — inspect the `put_document` insert and the existing `media_breakdown` SQL, then add: + +```rust +pub fn code_lang_breakdown(&self) -> anyhow::Result> { + // documents whose metadata JSON has a non-null code_lang + let mut stmt = self.conn.prepare( + "SELECT json_extract(metadata_json, '$.code_lang') AS cl, COUNT(*) \ + FROM documents \ + WHERE cl IS NOT NULL \ + GROUP BY cl", + )?; + let rows = stmt.query_map([], |r| { + Ok((r.get::<_, String>(0)?, r.get::<_, i64>(1)? as u32)) + })?; + let mut out = std::collections::BTreeMap::new(); + for row in rows { + let (k, v) = row?; + out.insert(k, v); + } + Ok(out) +} +``` + +> Adjust table/column names (`documents`, `metadata_json`) to the actual schema — grep the `media_breakdown` impl and copy its `FROM`/column conventions exactly. + +- [ ] **Step 4: Populate it in schema.rs** + +In `crates/kebab-app/src/schema.rs`, replace the `code_lang_breakdown: std::collections::BTreeMap::new(),` placeholder (line ~170) with a call to the new store method (mirror how `media_breakdown: counts.media_breakdown` is sourced — likely `app.sqlite.code_lang_breakdown()?`). + +- [ ] **Step 5: Run — expect pass + suite** + +Run: `cargo test -p kebab-store-sqlite code_lang_breakdown_counts` +Expected: PASS. +Run: `cargo test -p kebab-app schema` +Expected: PASS (existing `schema.v1` serialization tests still green; `code_lang_breakdown` now populated). + +- [ ] **Step 6: clippy + commit** + +```bash +cargo clippy -p kebab-store-sqlite -p kebab-app --all-targets -- -D warnings +git add crates/kebab-store-sqlite/ crates/kebab-app/ +git commit -m "feat(p10-1a-2): populate schema.v1 code_lang_breakdown + +Co-Authored-By: Claude Opus 4.7 (1M context) " +``` + +--- + +## Task 10: Full-suite gate + self-ingest snapshot + +**Files:** +- Create: `crates/kebab-chunk/tests/code_rust_ast_snapshot.rs` + fixture `crates/kebab-chunk/tests/fixtures/code-sample.rs` + baseline `code-sample.chunks.snapshot.json` (mirror `crates/kebab-chunk/tests/long_section_snapshot.rs`) + +- [ ] **Step 1: Add a snapshot integration test** + +Mirror `long_section_snapshot.rs` exactly (same `UPDATE_SNAPSHOTS=1` regen mechanism). Build a `CanonicalDocument` by running `kebab_parse_code::RustAstExtractor` on `tests/fixtures/code-sample.rs` (a representative file with a free fn, an impl with 2 methods, a struct, a trait, a top-level `use`/`const` block, and one >200-line fn to exercise the split), chunk with `CodeRustAstV1Chunker`, serialize, compare to baseline. + +> `kebab-chunk` may not depend on `kebab-parse-code` even as a dev-dep if that crosses a boundary — check `tasks/p10/p10-1a-2-rust-ast-chunker.md` Allowed deps. It does not list kebab-chunk→kebab-parse-code. So instead, build the `CanonicalDocument` **by hand** in the test (construct `Block::Code` units directly, like Task 7's `code_doc` helper) rather than invoking the extractor. The snapshot then locks the *chunker's* mapping/splitting, which is the unit under test here. (Extractor behavior is already locked by Task 6's tests.) + +- [ ] **Step 2: Generate the baseline** + +Run: `UPDATE_SNAPSHOTS=1 cargo test -p kebab-chunk code_rust_ast_snapshot` +Then run without the env var: +Run: `cargo test -p kebab-chunk code_rust_ast_snapshot` +Expected: PASS. + +- [ ] **Step 3: Full workspace suite (the real gate)** + +```bash +cargo test --workspace --no-fail-fast -j 1 +``` + +Expected: PASS. Pay attention to: citation/wire round-trip tests (Citation still 6 variants — `Code` from 1A-1 unchanged), any golden eval fixtures, asset/MediaType serialization tests. Fix faithfully; a regression here means an earlier task's match arm was wrong. + +```bash +cargo clippy --workspace --all-targets -- -D warnings +``` + +Expected: clean. + +- [ ] **Step 4: Manual smoke (SMOKE.md flow, isolated TempDir)** + +```bash +cargo build --release +rm -rf /tmp/kebab-p10smoke && mkdir -p /tmp/kebab-p10smoke/ws +cp crates/kebab-chunk/src/code_rust_ast_v1.rs /tmp/kebab-p10smoke/ws/ +cat > /tmp/kebab-p10smoke/config.toml <<'EOF' +[workspace] +root = "/tmp/kebab-p10smoke/ws" +[paths] +data_dir = "/tmp/kebab-p10smoke/data" +EOF +# (match the SMOKE.md config skeleton if these keys differ — read docs/SMOKE.md) +./target/release/kebab --config /tmp/kebab-p10smoke/config.toml ingest --json +./target/release/kebab --config /tmp/kebab-p10smoke/config.toml search "chunk" --json | head +./target/release/kebab --config /tmp/kebab-p10smoke/config.toml schema --json | grep code_lang +``` + +Expected: ingest report shows ≥1 ingested; a search hit with `"citation":{"kind":"code",...,"lang":"rust"}` and top-level `"code_lang":"rust"`, `"repo":...`; `schema --json` `code_lang_breakdown` has `"rust"`. + +- [ ] **Step 5: Commit** + +```bash +git add crates/kebab-chunk/tests/ +git commit -m "test(p10-1a-2): code-rust-ast-v1 chunker snapshot + full-suite gate + +Co-Authored-By: Claude Opus 4.7 (1M context) " +``` + +--- + +## Task 11: Docs, version bump, status flips + +**Files:** +- `README.md` — 지원 형식 / 명령 table: note Rust code ingest is now active (per CLAUDE.md README rule — a new media surface). Mermaid: add `code` source crossing the boundary only if a media type newly crosses it (it does — add a code-ingest edge to the existing diagram). +- `HANDOFF.md` — P10 phase row: note 1A-2 merged (Rust code ingest active, kebab self-dogfooding possible); add a one-line entry under 머지 후 결정 if a HOTFIXES item lands. +- `docs/ARCHITECTURE.md` — add the `kebab-app → kebab-parse-code` edge + `kebab-parse-code → tree-sitter/tree-sitter-rust` to the dependency graph; add the locked-in decision "tree-sitter lives parser-side, not chunker-side (design §6.3)" to the decisions table. +- `docs/SMOKE.md` — add the Rust code-ingest smoke steps (the Task 10 Step 4 flow), and the `[ingest.code]` config keys if not already documented. +- `tasks/INDEX.md` line ~143 + `tasks/p10/INDEX.md` row 1A-2 — flip 1A-1 to ✅ and 1A-2 to ✅ (on merge). +- `tasks/HOTFIXES.md` — add a dated entry for the `AST_CHUNK_MAX_LINES` constant-vs-config deviation (chunker can't see `IngestCodeCfg.ast_chunk_max_lines` through the fixed `Chunker` trait; pinned to the default 200; per-medium chunker registry is P+). Cross-link one line in `tasks/p10/p10-1a-2-rust-ast-chunker.md` Risks/notes. +- `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` §10.1 — record the 1A-2 surface per design §10.1 (already partly done for SourceSpan/MediaType in Tasks 2/4; ensure §10.1 mentions code ingest activation). +- `Cargo.toml` — workspace `version` minor bump (design §10.4: 1A-2 merge = 도그푸딩 가능 = bump trigger). e.g. `0.6.x` → `0.7.0` (check current value first). + +- [ ] **Step 1: Make all doc edits above.** Keep README narrow (usage only, one Mermaid). HANDOFF gets phase status. ARCHITECTURE gets the graph/decision. + +- [ ] **Step 2: Version bump** + +```bash +grep -m1 '^version' Cargo.toml # read current workspace version +# bump minor in Cargo.toml [workspace.package].version +cargo build --release # refresh Cargo.lock + binary identity +``` + +- [ ] **Step 3: Full suite once more (docs/version shouldn't break it, but the lock changed)** + +```bash +cargo test --workspace --no-fail-fast -j 1 +cargo clippy --workspace --all-targets -- -D warnings +``` + +Expected: PASS / clean. + +- [ ] **Step 4: Commit (bump = release commit, same commit per CLAUDE.md)** + +```bash +git add -A +git commit -m "docs(p10-1a-2): README/HANDOFF/ARCHITECTURE/SMOKE + HOTFIXES + status; chore: bump version + +Co-Authored-By: Claude Opus 4.7 (1M context) " +``` + +--- + +## Finalize: PR + review loop + release + +Per `tasks/HOTFIXES.md` workflow memory and CLAUDE.md Remote section (Gitea, not gh): + +- [ ] Use the **gitea-ops** skill: `gitea-pr` to open the PR (feature branch → main). Title: `feat(p10-1a-2): Rust AST chunker — tree-sitter-rust code ingest active`. Body summarizes: SourceSpan::Code internal variant, parser-side tree-sitter (design §6.3), code-rust-ast-v1 chunker + oversize split, MediaType::Code, schema code_lang_breakdown, frozen design §3.4/§10.1 sync, version bump. +- [ ] **Review loop mode** (do not ask single-shot): `gitea-pr-status --wait-ci` → `gitea-pr-diff` → analyze → `gitea-pr-review` (REQUEST_CHANGES/APPROVE) each round; actionable comments → follow-up commits; converge to APPROVE. +- [ ] On APPROVE: merge immediately (no asking), sync local `main`, delete `feat/p10-1a-2-rust-ast-chunker`. +- [ ] After merge: `cargo clean` (CLAUDE.md routine). Cut release via gitea-ops `gitea-release v` (release notes: Rust code ingest active, `Citation::Code` now populated, `MediaType::Code`, `schema.v1 code_lang_breakdown`, internal `SourceSpan::Code`). +- [ ] Flip `tasks/INDEX.md` / `tasks/p10/INDEX.md` 1A-2 → ✅ if not already in the merged commit; update memory phase-priorities note if the next-task priority shifts (P10-1B vs other). + +--- + +## Self-Review (completed by plan author) + +- **Spec coverage:** design §1A-2 (Rust chunker + tree-sitter-rust + activation) → Tasks 6/7/8; §3.3 (`code-rust-ast-v1`) → Task 7; §3.4 symbol path → Task 6 walk rules + Task 2 SourceSpan; §6.1 (rust.rs parser) → Task 6; §6.2 (kebab-chunk module) → Task 7; §6.3 dep graph (tree-sitter parser-side) → Task 1 + Task 7 forbidden-dep note; §9.1 Tier-1 + oversize fallback → Task 7 `split_oversize`; §10.4 version bump → Task 11; wire (no v2 — Citation::Code from 1A-1) → Task 10 Step 3 gate. Citation routing gap (1A-1 left unwired) → Tasks 2+3. `MediaType::Code` + routing → Tasks 4+5. schema breakdown → Task 9. Docs cascade → Task 11. +- **Placeholder scan:** novel logic (SourceSpan variant, citation arm, AST walk, chunker, split) is given in full. Mechanical mirrors (extractor scaffold, `ingest_one_code_asset`, store breakdown query, integration-test harness) are pinned to an exact existing file:line to copy with enumerated deltas — the established-pattern path the writing-plans skill endorses, not "TBD". +- **Type consistency:** `RustAstExtractor` / `RUST_PARSER_VERSION` / `CodeRustAstV1Chunker` / `VERSION_LABEL="code-rust-ast-v1"` / `SourceSpan::Code{line_start,line_end,symbol,lang}` / `Citation::Code` (1A-1 shape) used consistently across Tasks 2/3/6/7/8. `id_for_block(doc,"code",&[],ordinal,&span)` and `id_for_chunk(doc,cv,block_ids,hash)` match `crates/kebab-core/src/ids.rs:146,163`. diff --git a/tasks/p10/p10-1a-2-rust-ast-chunker.md b/tasks/p10/p10-1a-2-rust-ast-chunker.md new file mode 100644 index 0000000..979fa2b --- /dev/null +++ b/tasks/p10/p10-1a-2-rust-ast-chunker.md @@ -0,0 +1,47 @@ +# p10-1A-2 — Rust AST chunker + +**Status:** 🟡 진행 중 +**Contract sections:** §3.3 (chunker_version `code-rust-ast-v1`), §3.4 (symbol path — Rust convention), §3.4 frozen-design (`SourceSpan::Code` 신규 internal variant), §5 (code ingest 활성화), §6.1 (`kebab-parse-code/src/rust.rs` — tree-sitter-rust → CanonicalDocument), §6.2 (`kebab-chunk/src/code_rust_ast_v1.rs`), §9.1 (Tier 1 AST per-language + oversize fallback). +**Design:** [2026-05-15-kebab-code-ingest-design.md](../../docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md) §1A-2. +**Plan:** [2026-05-19-p10-1a-2-rust-ast-chunker.md](../../docs/superpowers/plans/2026-05-19-p10-1a-2-rust-ast-chunker.md). + +## Goal + +1A-1 의 프레임워크 위에 **Rust AST chunker 자체**를 올린다. `tree-sitter` + `tree-sitter-rust` 도입, `kebab-parse-code/src/rust.rs` (tree-sitter-rust → `CanonicalDocument`, AST 의미 단위마다 `Block::Code` + `SourceSpan::Code`), `kebab-chunk/src/code_rust_ast_v1.rs` (1 block → 1 chunk + oversize fallback split), `MediaType::Code` 신설, `kebab-app` dispatch. 머지 시점에 kebab 자기 자신 dogfooding 가능. + +## 동결된 설계 결정 (이 task 로 확정) + +- **tree-sitter 위치 = parser (`kebab-parse-code`)**, chunker 아님. design §6.3 의존성 그래프 (`kebab-parse-code → tree-sitter, tree-sitter-rust`) 가 authoritative. PDF 선례와 동형 — parser 가 구조화된 block 생성, chunker 가 매핑. §9.1 의 "chunker 가 AST" 서술은 *oversize fallback split* 만 chunker-side 라는 의미로 해석. +- **`SourceSpan::Code { line_start, line_end, symbol, lang }` 내부 variant 신설** (kebab-core). chunk 의 `source_spans_json` (chunks 테이블) 은 *내부 저장*이라 wire schema 아님 → wire major bump 불필요. `Citation::Code` (wire) 는 1A-1 에서 이미 추가됨. `citation_helper::citation_from_first_span` 에 `SourceSpan::Code → Citation::Code` arm 추가로 symbol/lang 이 자연스럽게 흐름. +- **`MediaType::Code(String)` 신설** — String = canonical code_lang (1A 는 `"rust"` 만 실제 처리, 그 외 인식된 code lang 은 `Skipped` — Tier 2/3 는 후속 phase). +- frozen design §3.4 의 `SourceSpan` enum 및 (해당 시) `MediaType` enum 목록을 같은 PR 에서 갱신. 본 task spec 은 머지 후 frozen. + +## Acceptance criteria + +- `cargo test --workspace --no-fail-fast -j 1` passes. +- 기존 markdown / PDF / image corpus regression test 무영향 (citation 5→6 variant: `Citation::Code` 는 1A-1 에 이미 존재; 기존 5 variant 직렬화 불변). +- `cargo clippy --workspace --all-targets -- -D warnings` passes. +- Rust fixture 한 개 (fn / impl method / struct / trait / top-level use + 200줄 초과 fn) ingest → chunk snapshot 안정 + `Citation::Code` 의 symbol/line 이 spec §3.4 Rust convention 과 일치. +- kebab 자기 crate 한 개를 isolated TempDir KB 에 ingest → `kebab search --json` 결과가 `citation.kind == "code"`, `repo`, `code_lang == "rust"` 반환 (SMOKE 절차). +- `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` §3.4 (SourceSpan / MediaType) + §10.1 갱신. +- README + HANDOFF + ARCHITECTURE + SMOKE + tasks/INDEX.md + tasks/p10/INDEX.md 갱신. +- workspace `Cargo.toml` version minor bump (도그푸딩 가능 = bump 트리거, design §10.4) + release cut. + +## Allowed dependencies + +- `kebab-parse-code` 에 `tree-sitter` + `tree-sitter-rust` 추가 (workspace deps 경유). 기존 `kebab-core` / `anyhow` / `gix` 유지. +- `kebab-chunk` 는 `kebab-core` 만 (chunker 는 `CanonicalDocument` 만 소비 — tree-sitter import 금지). +- `kebab-app → kebab-parse-code` (facade 가 Extractor 호출). + +## Forbidden dependencies + +- `kebab-chunk` 가 `tree-sitter*` import 금지 (AST 는 parser-side). +- UI crate (cli / mcp / tui) 가 `kebab-parse-code` 직접 import 금지 — `kebab-app` facade 만. +- `kebab-parse-code` 가 store / embed / llm / rag import 금지 (design §8 inheritance). + +## Risks / notes + +- tree-sitter-rust 의 grammar 버전에 따라 node kind 명칭 차이 가능 — `function_item` / `impl_item` / `struct_item` / `enum_item` / `trait_item` / `mod_item` / `use_declaration` 는 도입 버전으로 pin 후 테스트로 고정. +- `SourceSpan::Code` 추가로 `SourceSpan` 의 모든 exhaustive match (citation_helper, store-sqlite serde, search) 가 영향 — 컴파일러가 non-exhaustive 를 잡아주므로 전수 대응. +- oversize fallback (단일 fn > `ast_chunk_max_lines`) 의 `symbol [part i/N]` 표기는 1A-2 chunker 내부 한정. 일반 Tier-3 `code-text-paragraph-v1` 은 Phase 3. +- 머지 후 동작 deviation 은 `tasks/HOTFIXES.md` 에 dated 로그 + 본 spec `Risks / notes` 에 one-line cross-link.