63 KiB
p10-1A-2 Rust AST Chunker Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task. Steps use checkbox (
- [ ]) syntax for tracking.
Goal: Activate Rust code ingest end-to-end — .rs files parse via tree-sitter into one Block::Code per AST semantic unit, chunk 1:1 (with oversize fallback split), and surface as Citation::Code { symbol, lang, line_start, line_end } in search.
Architecture: tree-sitter lives in the parser (kebab-parse-code, per design §6.3 dependency graph; mirrors the proven PDF pattern where the parser emits structured blocks and the chunker maps them). kebab-parse-code/src/rust.rs is an Extractor producing a CanonicalDocument whose blocks are AST units carrying a new internal SourceSpan::Code variant. kebab-chunk/src/code_rust_ast_v1.rs maps each block to a chunk and splits oversize units. citation_helper gains one arm so the span flows to the existing (1A-1) Citation::Code wire shape. A new MediaType::Code(String) routes .rs files; non-Rust code langs stay MediaType::Other in 1A.
Tech Stack: Rust 2024 workspace, tree-sitter + tree-sitter-rust, existing kebab-core domain model, serde_json_canonicalizer + blake3 (chunk id / policy hash), gix (1A-1 repo detect).
Pre-flight
- Branch. From clean
main:
cd /home/altair823/kebab
git checkout main && git pull
git checkout -b feat/p10-1a-2-rust-ast-chunker
- Disk hygiene (CLAUDE.md — routine after each merged PR; #139 just merged):
cargo clean
Notes that apply throughout:
- Full suite is always
cargo test --workspace --no-fail-fast -j 1(CLAUDE.md — parallel link OOMs). Per-crate runs (cargo test -p <crate>) may run normally. cargo clippy --workspace --all-targets -- -D warningsis the CI gate; run before every commit that touches code.- Frozen task spec for this work:
tasks/p10/p10-1a-2-rust-ast-chunker.md(already written; do not edit retroactively).
Task 1: Add tree-sitter dependencies (workspace-deps pattern)
Files:
-
Modify:
Cargo.toml(workspace[workspace.dependencies], ends ~line 90 after thegixentry) -
Modify:
crates/kebab-parse-code/Cargo.toml([dependencies]) -
Step 1: Resolve + pin versions via cargo add
Run (this picks the latest compatible versions and writes them into the crate):
cargo add tree-sitter tree-sitter-rust -p kebab-parse-code
- Step 2: Move the resolved versions into workspace deps
Read the two version strings cargo add wrote into crates/kebab-parse-code/Cargo.toml. Then in the workspace Cargo.toml, append to [workspace.dependencies] directly after the gix = { ... } line (keep the existing comment style — one explanatory comment line):
# Rust source parsing for code ingest (kebab-parse-code, p10-1A-2). The
# chunker stays tree-sitter-free — AST work is parser-side per design §6.3.
tree-sitter = "<resolved major.minor>"
tree-sitter-rust = "<resolved major.minor>"
Then rewrite the crate's [dependencies] to use the workspace table (matching the existing anyhow/gix style):
[dependencies]
anyhow = { workspace = true }
gix = { workspace = true }
tree-sitter = { workspace = true }
tree-sitter-rust = { workspace = true }
[dev-dependencies]
tempfile = { workspace = true }
- Step 3: Verify it builds and the lock updated
Run: cargo build -p kebab-parse-code
Expected: compiles clean (skeleton still has no tree-sitter use yet — deps unused is fine, no -D warnings on a plain build).
- Step 4: Commit
git add Cargo.toml Cargo.lock crates/kebab-parse-code/Cargo.toml
git commit -m "build(p10-1a-2): add tree-sitter + tree-sitter-rust workspace deps
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
Task 2: SourceSpan::Code internal variant
Files:
- Modify:
crates/kebab-core/src/document.rs(SourceSpanenum, after theTime { start_ms, end_ms }variant ~line 144) - Modify:
docs/superpowers/specs/2026-04-27-kebab-final-form-design.md(§3.4SourceSpanenum listing, ~line 562) - Test:
crates/kebab-core/src/document.rs(#[cfg(test)] mod tests, or the file's existing test module)
SourceSpan is #[serde(rename_all = "lowercase", tag = "kind")] — the new variant serializes as {"kind":"code", ...}. This is the chunks-table source_spans_json internal shape, NOT a wire schema (wire Citation::Code already shipped in 1A-1), so no wire .v2 bump.
- Step 1: Write the failing test
Add to the kebab-core test module that covers SourceSpan (search mod tests in document.rs; if none, add one at end of file):
#[test]
fn source_span_code_round_trips_and_tags_lowercase() {
let s = SourceSpan::Code {
line_start: 10,
line_end: 42,
symbol: Some("foo::Bar::baz".to_string()),
lang: Some("rust".to_string()),
};
let v = serde_json::to_value(&s).unwrap();
assert_eq!(v["kind"], "code");
assert_eq!(v["line_start"], 10);
assert_eq!(v["line_end"], 42);
assert_eq!(v["symbol"], "foo::Bar::baz");
assert_eq!(v["lang"], "rust");
let back: SourceSpan = serde_json::from_value(v).unwrap();
assert_eq!(back, s);
}
- Step 2: Run it — expect compile failure
Run: cargo test -p kebab-core source_span_code_round_trips
Expected: FAIL — no variant named Code.
- Step 3: Add the variant
In crates/kebab-core/src/document.rs, add as the last variant of pub enum SourceSpan (after Time { ... }):
/// p10-1A-2: AST-unit span for code ingest. Internal storage shape
/// (chunks.source_spans_json) — `citation_helper` maps this to the
/// wire `Citation::Code` (added 1A-1). `symbol` is the per-language
/// self-reference path (design §3.4); `<top-level>` / `<module>` for
/// glue regions, never null for an identified unit. `lang` is the
/// canonical code_lang.
Code {
line_start: u32,
line_end: u32,
symbol: Option<String>,
lang: Option<String>,
},
- Step 4: Compile — fix every non-exhaustive match
Run: cargo build --workspace 2>&1 | grep -A2 "non-exhaustive\|E0004"
The compiler will flag every exhaustive match on SourceSpan. Known sites (handle each minimally — a Code arm that does the type-correct thing, NOT a catch-all _):
crates/kebab-search/src/citation_helper.rs— handled fully in Task 3; for now add a temporarySourceSpan::Code { .. } => Citation::Line { path, start: 1, end: 1, section }arm with a// TODO(Task 3)and replace it in Task 3.- Any
store-sqlite/search/idsite: add a faithful arm (e.g. id recipe already serializesSourceSpanvia serde — likely no match there; only fix realmatchstatements the compiler points at).
Run: cargo build --workspace
Expected: clean.
- Step 5: Run the test
Run: cargo test -p kebab-core source_span_code_round_trips
Expected: PASS.
- Step 6: Sync frozen design §3.4
In docs/superpowers/specs/2026-04-27-kebab-final-form-design.md, find pub enum SourceSpan (~line 562) and add the Code { line_start, line_end, symbol, lang } variant to the listing, with a one-line comment // p10-1A-2: internal code-unit span (see tasks/p10/p10-1a-2). Do not alter other variants.
- Step 7: clippy + commit
cargo clippy --workspace --all-targets -- -D warnings
git add crates/ docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
git commit -m "feat(p10-1a-2): add internal SourceSpan::Code variant + design §3.4 sync
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
Task 3: citation_helper Code arm
Files:
-
Modify:
crates/kebab-search/src/citation_helper.rs:21-74 -
Test:
crates/kebab-search/src/citation_helper.rs(add#[cfg(test)] mod testsif absent;lexical.rshasbuild_citation_*tests — mirror style there if a test module already imports the helper) -
Step 1: Write the failing test
Add a test (in citation_helper.rs add a test module, or extend the existing helper tests in crates/kebab-search/src/lexical.rs near build_citation_page_forwards_section):
#[test]
fn build_citation_code_maps_symbol_and_lang() {
use kebab_core::{Citation, SourceSpan, WorkspacePath};
let span = SourceSpan::Code {
line_start: 5,
line_end: 30,
symbol: Some("chunk::md_heading_v1::MdHeadingV1Chunker::chunk".into()),
lang: Some("rust".into()),
};
let c = super::citation_from_first_span(
"c1",
WorkspacePath("crates/kebab-chunk/src/md_heading_v1.rs".into()),
None,
Some(&span),
);
match c {
Citation::Code { path, line_start, line_end, symbol, lang } => {
assert_eq!(path.0, "crates/kebab-chunk/src/md_heading_v1.rs");
assert_eq!(line_start, 5);
assert_eq!(line_end, 30);
assert_eq!(symbol.as_deref(), Some("chunk::md_heading_v1::MdHeadingV1Chunker::chunk"));
assert_eq!(lang.as_deref(), Some("rust"));
}
other => panic!("expected Citation::Code, got {other:?}"),
}
}
- Step 2: Run — expect fail
Run: cargo test -p kebab-search build_citation_code_maps_symbol_and_lang
Expected: FAIL (currently the Task-2 temporary arm produces Citation::Line).
- Step 3: Replace the temporary arm
In citation_from_first_span, replace the Task-2 placeholder arm with the real mapping (place it directly after the SourceSpan::Time arm, before the Byte | None fallback):
Some(SourceSpan::Code { line_start, line_end, symbol, lang }) => Citation::Code {
path,
line_start: *line_start,
line_end: *line_end,
symbol: symbol.clone(),
lang: lang.clone(),
},
(section is unused for code — Citation::Code has no section field; this matches the spec's code citation shape.)
- Step 4: Run — expect pass
Run: cargo test -p kebab-search build_citation_code_maps_symbol_and_lang
Expected: PASS.
- Step 5: clippy + commit
cargo clippy -p kebab-search --all-targets -- -D warnings
git add crates/kebab-search/
git commit -m "feat(p10-1a-2): map SourceSpan::Code -> Citation::Code in citation_helper
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
Task 4: MediaType::Code(String) variant
Files:
- Modify:
crates/kebab-core/src/media.rs:38-44(MediaTypeenum) - Modify:
crates/kebab-app/src/ingest_progress.rs:99(media_label) - Modify:
docs/superpowers/specs/2026-04-27-kebab-final-form-design.md(only ifenum MediaTypeis enumerated there —grep -n "enum MediaType" docs/superpowers/specs/2026-04-27-kebab-final-form-design.md; if present, add theCode(String)variant + one-line comment, else skip) - Test:
crates/kebab-core/src/media.rstest module
MediaType is #[serde(rename_all = "lowercase")]; Code(String) serializes as {"code":"rust"}.
- Step 1: Write the failing test
Add to media.rs tests:
#[test]
fn media_type_code_serializes_lowercase_tagged() {
let m = MediaType::Code("rust".to_string());
let v = serde_json::to_value(&m).unwrap();
assert_eq!(v, serde_json::json!({ "code": "rust" }));
let back: MediaType = serde_json::from_value(v).unwrap();
assert_eq!(back, m);
}
- Step 2: Run — expect fail
Run: cargo test -p kebab-core media_type_code_serializes
Expected: FAIL — no variant named Code.
- Step 3: Add the variant + media_label arm
In crates/kebab-core/src/media.rs, add to pub enum MediaType immediately before Other(String):
/// p10-1A-2: a source-code file. Inner string is the canonical
/// code_lang (design §3.5). 1A activates `"rust"` only; other
/// recognized code langs are still routed `Other` until their phase.
Code(String),
In crates/kebab-app/src/ingest_progress.rs, add a match arm next to the MediaType::Other(_) => "other" arm (~line 99):
kebab_core::MediaType::Code(_) => "code",
Then cargo build --workspace 2>&1 | grep non-exhaustive and add a faithful arm to every other MediaType match the compiler flags (e.g. any UI/store display) — Code(lang) should render analogously to Other.
- Step 4: Run — expect pass + suite green
Run: cargo test -p kebab-core media_type_code_serializes
Expected: PASS.
Run: cargo test --workspace --no-fail-fast -j 1
Expected: PASS (catches any golden/asset serialization that enumerates MediaType variants; fix faithfully if any fixture counts variants).
- Step 5: design sync (conditional) + clippy + commit
grep -n "enum MediaType" docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
# if present, add Code(String) to that listing with a p10-1A-2 comment
cargo clippy --workspace --all-targets -- -D warnings
git add crates/ docs/
git commit -m "feat(p10-1a-2): add MediaType::Code(lang) variant
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
Task 5: Route .rs → MediaType::Code("rust")
Files:
- Modify:
crates/kebab-source-fs/src/media.rs:39(the_ => MediaType::Other(ext)fallthrough) - Test:
crates/kebab-source-fs/src/media.rstest module (media_type_fortests, ~line 49)
Scope to Rust only (1A-2 = Rust). Non-Rust extensions keep their current MediaType::Other mapping — minimal blast radius, regression-safe.
- Step 1: Write the failing test
Add near the existing media_type_for asserts:
#[test]
fn rust_files_map_to_media_code_rust() {
assert_eq!(
media_type_for(Path::new("crates/kebab-core/src/lib.rs")),
MediaType::Code("rust".to_string())
);
// non-Rust code extensions stay Other in 1A
assert_eq!(media_type_for(Path::new("a/b.py")), MediaType::Other("py".to_string()));
assert_eq!(media_type_for(Path::new("Cargo.toml")), MediaType::Other("toml".to_string()));
}
- Step 2: Run — expect fail
Run: cargo test -p kebab-source-fs rust_files_map_to_media_code_rust
Expected: FAIL — .rs currently → MediaType::Other("rs").
- Step 3: Add the routing arm
In crates/kebab-source-fs/src/media.rs, add an "rs" arm before the final _ => MediaType::Other(ext):
// p10-1A-2: Rust is the only code lang activated in 1A. Other
// recognized code langs stay Other until their phase (1B+).
"rs" => MediaType::Code("rust".to_string()),
- Step 4: Run — expect pass
Run: cargo test -p kebab-source-fs rust_files_map_to_media_code_rust
Expected: PASS.
- Step 5: clippy + commit
cargo clippy -p kebab-source-fs --all-targets -- -D warnings
git add crates/kebab-source-fs/
git commit -m "feat(p10-1a-2): route .rs files to MediaType::Code(rust)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
Task 6: kebab-parse-code Rust AST extractor
Files:
- Create:
crates/kebab-parse-code/src/rust.rs - Modify:
crates/kebab-parse-code/src/lib.rs(addpub mod rust;+ re-exportPARSER_VERSION,RustAstExtractor) - Create:
crates/kebab-parse-code/tests/fixtures/sample.rs(Rust fixture) - Test: inline
#[cfg(test)] mod testsinrust.rs
This is the core. The extractor walks the tree-sitter parse tree and emits one Block::Code per top-level AST semantic unit, each with SourceSpan::Code. The CanonicalDocument scaffold (doc_id, provenance, metadata, return struct) mirrors crates/kebab-parse-pdf/src/lib.rs:51-225 exactly — same Extractor trait impl shape, same id_for_doc / ProvenanceEvent / CanonicalDocument construction. Only the differences below change.
tree-sitter API note
Use the modern API: parser.set_language(&tree_sitter_rust::LANGUAGE.into()). If the resolved tree-sitter-rust predates the LANGUAGE: LanguageFn const (no LANGUAGE symbol), use parser.set_language(&tree_sitter_rust::language()) instead. Verify which by cargo doc -p tree-sitter-rust --no-deps or reading its docs; pick the one that compiles.
Semantic-unit rules (design §9.1 + §3.4)
Walk root_node() (kind source_file) named children. Maintain a mod_path: Vec<String> (module nesting), starting empty.
| node kind | unit | symbol |
|---|---|---|
function_item |
1 | mod_path::fn_name |
struct_item / enum_item / union_item / trait_item / type_item |
1 | mod_path::TypeName |
macro_definition |
1 | mod_path::name! |
impl_item |
1 per inner function_item |
mod_path::ImplType::method (ImplType = text of impl type field; for impl Trait for T, use Trait::method per §3.4) |
mod_item with declaration_list body |
recurse with mod_path + mod name pushed |
— |
use_declaration, extern_crate_declaration, const_item, static_item, mod_item without body, top-level attribute_item/macro_invocation |
accumulated into ONE grouped unit | <top-level> (or <module> if the whole file produced no fn/type/impl unit and the group is only mod_item declarations) |
-
Line range:
node.start_position().row + 1 ..= node.end_position().row + 1(1-based inclusive). Extendline_startupward over contiguous immediately-preceding siblingline_comment/block_comment/attribute_itemnodes (doc comments + attributes belong to their item — design §9.1 "선언 + doc comment"). -
Grouped unit line range: min start_line over the group .. max end_line over the group.
-
codefield: the exact source substring for those lines (splitsourceby\n, take[line_start-1 ..= line_end-1], rejoin with\n). -
Each unit →
Block::Code(CodeBlock { common: CommonBlock { block_id, heading_path: vec![], source_span }, lang: Some("rust".into()), code })wheresource_span = SourceSpan::Code { line_start, line_end, symbol: Some(sym), lang: Some("rust".into()) }.block_id = id_for_block(&doc_id, "code", &[], ordinal, &source_span)withordinal= 0-based unit index. -
Step 1: Create the fixture
Create crates/kebab-parse-code/tests/fixtures/sample.rs:
//! sample fixture
use std::fmt;
const ANSWER: u32 = 42;
/// Doc comment on a free fn.
pub fn parse(input: &str) -> usize {
input.len()
}
pub struct Foo {
pub n: u32,
}
impl Foo {
/// method doc
pub fn double(&self) -> u32 {
self.n * 2
}
fn name() -> &'static str {
"foo"
}
}
pub trait Greet {
fn hello(&self) -> String;
}
mod inner {
pub fn helper() -> bool {
true
}
}
- Step 2: Write the failing test
In rust.rs add:
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{Block, MediaType, SourceSpan};
fn extract_fixture() -> kebab_core::CanonicalDocument {
let bytes = std::fs::read(
concat!(env!("CARGO_MANIFEST_DIR"), "/tests/fixtures/sample.rs"),
)
.unwrap();
let asset = kebab_parse_code_test_support::fixed_rust_asset("crates/x/src/sample.rs");
let cfg = kebab_core::ExtractConfig::default();
let root = std::path::PathBuf::from("/tmp");
let ctx = kebab_core::ExtractContext { asset: &asset, workspace_root: &root, config: &cfg };
RustAstExtractor::new().extract(&ctx, &bytes).unwrap()
}
#[test]
fn extractor_supports_only_media_code_rust() {
let e = RustAstExtractor::new();
assert!(e.supports(&MediaType::Code("rust".into())));
assert!(!e.supports(&MediaType::Code("python".into())));
assert!(!e.supports(&MediaType::Markdown));
}
#[test]
fn emits_one_block_per_semantic_unit_with_symbols() {
let doc = extract_fixture();
let mut syms: Vec<(String, u32, u32)> = doc
.blocks
.iter()
.map(|b| match b {
Block::Code(c) => match &c.common.source_span {
SourceSpan::Code { symbol, line_start, line_end, lang } => {
assert_eq!(lang.as_deref(), Some("rust"));
(symbol.clone().unwrap(), *line_start, *line_end)
}
_ => panic!("code block must carry SourceSpan::Code"),
},
other => panic!("expected Block::Code, got {other:?}"),
})
.collect();
syms.sort();
let names: Vec<&str> = syms.iter().map(|(s, _, _)| s.as_str()).collect();
assert!(names.contains(&"parse"));
assert!(names.contains(&"Foo"));
assert!(names.contains(&"Foo::double"));
assert!(names.contains(&"Foo::name"));
assert!(names.contains(&"Greet"));
assert!(names.contains(&"inner::helper"));
assert!(names.contains(&"<top-level>")); // use + const grouped
// doc-comment line is folded into the unit it documents:
let parse_unit = syms.iter().find(|(s, _, _)| s == "parse").unwrap();
let parse_src = doc.blocks.iter().find_map(|b| match b {
Block::Code(c) if matches!(&c.common.source_span, SourceSpan::Code{symbol,..} if symbol.as_deref()==Some("parse")) => Some(c.code.clone()),
_ => None,
}).unwrap();
assert!(parse_src.contains("/// Doc comment on a free fn."), "doc comment folded in: {parse_src}");
let _ = parse_unit;
}
#[test]
fn deterministic_across_runs() {
let a = extract_fixture();
for _ in 0..50 {
assert_eq!(extract_fixture().blocks, a.blocks);
}
}
}
#[cfg(test)]
mod kebab_parse_code_test_support {
use kebab_core::*;
use time::OffsetDateTime;
pub fn fixed_rust_asset(path: &str) -> RawAsset {
RawAsset {
asset_id: AssetId("a".repeat(64)),
source_uri: SourceUri::File(std::path::PathBuf::from(path)),
workspace_path: WorkspacePath(path.to_string()),
media_type: MediaType::Code("rust".to_string()),
byte_len: 0,
checksum: Checksum("b".repeat(64)),
discovered_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
stored: AssetStorage::InPlace,
}
}
}
Before relying on it, verify
Checksum/SourceUri/AssetStoragefield shapes by readingcrates/kebab-core/src/asset.rs:63-95; adjust the test-support constructor to the actual variants (e.g.AssetStoragemay beExternal/InPlace— use whatever exists). This is a fixture-construction detail, not a design choice.
- Step 3: Run — expect fail
Run: cargo test -p kebab-parse-code emits_one_block_per_semantic_unit
Expected: FAIL — RustAstExtractor undefined.
- Step 4: Implement
rust.rs
Create crates/kebab-parse-code/src/rust.rs. Scaffold (Extractor impl, doc_id, provenance events, final CanonicalDocument {…}) is a direct adaptation of crates/kebab-parse-pdf/src/lib.rs:51-225 with these concrete differences:
pub const PARSER_VERSION: &str = "code-rust-v1";pub struct RustAstExtractor;+new()/DefaultlikePdfTextExtractor.fn supports(&self, m: &MediaType) -> bool { matches!(m, MediaType::Code(l) if l == "rust") }- agent strings:
"kb-parse-code"instead of"kb-parse-pdf". title: filename stem ofasset.workspace_path(reuse the samestrip_extension(filename_from_workspace_path(...))helpers — copy those two small fns fromkebab-parse-pdf/src/lib.rs:229+intorust.rs, or inline equivalent).lang: Lang("und".into())(natural-language detection out of scope, design §3.5).- metadata: same
Metadata { … }literal as PDF but set:source_type: useSourceType::Codeif the enum has it (grep -n "enum SourceType" crates/kebab-core/src/metadata.rs), elseSourceType::Note.code_lang: Some("rust".to_string())repo/git_branch/git_commit: fromcrate::repo::detect_repo. Resolve the file's absolute path: ifasset.source_uriisSourceUri::File(p)usep; join withctx.workspace_rootif relative.match detect_repo(&abs) { Some(r) => (Some(r.name), r.branch, r.commit), None => (None, None, None) }.
- blocks: replace the PDF per-page loop with the AST walk. Implementation:
fn build_blocks(
source: &str,
doc_id: &kebab_core::DocumentId,
) -> anyhow::Result<Vec<kebab_core::Block>> {
use kebab_core::{Block, CodeBlock, CommonBlock, SourceSpan, id_for_block};
let mut parser = tree_sitter::Parser::new();
parser
.set_language(&tree_sitter_rust::LANGUAGE.into())
.map_err(|e| anyhow::anyhow!("set tree-sitter-rust language: {e}"))?;
let tree = parser
.parse(source.as_bytes(), None)
.ok_or_else(|| anyhow::anyhow!("tree-sitter failed to parse Rust source"))?;
let lines: Vec<&str> = source.split('\n').collect();
// (symbol, start_line_1based, end_line_1based) in document order.
let mut units: Vec<(String, u32, u32)> = Vec::new();
// Pending glue (use/const/static/mod-decl/attr) accumulated into one
// <top-level> (or <module>) unit, flushed when a real unit appears or
// at end of a scope.
let mut glue: Vec<(usize, u32, u32)> = Vec::new(); // (is_mod_decl as 0/1 via usize, s, e)
fn node_name<'a>(n: &tree_sitter::Node, src: &'a str) -> Option<&'a str> {
n.child_by_field_name("name")
.map(|c| &src[c.start_byte()..c.end_byte()])
}
// Extend start upward over leading doc-comments / attributes.
fn unit_start(n: &tree_sitter::Node) -> u32 {
let mut start = n.start_position().row as u32 + 1;
let mut prev = n.prev_sibling();
while let Some(p) = prev {
let k = p.kind();
if k == "line_comment" || k == "block_comment" || k == "attribute_item" {
start = p.start_position().row as u32 + 1;
prev = p.prev_sibling();
} else {
break;
}
}
start
}
fn walk(
node: tree_sitter::Node,
src: &str,
mod_path: &[String],
units: &mut Vec<(String, u32, u32)>,
glue: &mut Vec<(usize, u32, u32)>,
) {
let mut cur = node.walk();
for child in node.named_children(&mut cur) {
let s = unit_start(&child);
let e = child.end_position().row as u32 + 1;
let prefix = if mod_path.is_empty() {
String::new()
} else {
format!("{}::", mod_path.join("::"))
};
match child.kind() {
"function_item" | "struct_item" | "enum_item" | "union_item"
| "trait_item" | "type_item" => {
if let Some(name) = node_name(&child, src) {
flush_glue(glue, units);
units.push((format!("{prefix}{name}"), s, e));
}
}
"macro_definition" => {
if let Some(name) = node_name(&child, src) {
flush_glue(glue, units);
units.push((format!("{prefix}{name}!"), s, e));
}
}
"impl_item" => {
flush_glue(glue, units);
let ty = child
.child_by_field_name("type")
.map(|c| src[c.start_byte()..c.end_byte()].trim().to_string());
let tr = child
.child_by_field_name("trait")
.map(|c| src[c.start_byte()..c.end_byte()].trim().to_string());
let owner = tr.or(ty).unwrap_or_else(|| "<impl>".to_string());
if let Some(body) = child.child_by_field_name("body") {
let mut bc = body.walk();
for m in body.named_children(&mut bc) {
if m.kind() == "function_item" {
if let Some(mn) = node_name(&m, src) {
let ms = unit_start(&m);
let me = m.end_position().row as u32 + 1;
units.push((format!("{prefix}{owner}::{mn}"), ms, me));
}
}
}
}
}
"mod_item" => {
if let Some(body) = child.child_by_field_name("body") {
flush_glue(glue, units);
let name = node_name(&child, src).unwrap_or("mod").to_string();
let mut np = mod_path.to_vec();
np.push(name);
walk(body, src, &np, units, glue);
} else {
glue.push((1, s, e)); // bare `mod foo;` declaration
}
}
"use_declaration" | "extern_crate_declaration" | "const_item"
| "static_item" | "attribute_item" | "macro_invocation" => {
glue.push((0, s, e));
}
_ => { /* line_comment / block_comment / unknown: ignore (folded via unit_start) */ }
}
}
flush_glue(glue, units);
}
fn flush_glue(glue: &mut Vec<(usize, u32, u32)>, units: &mut Vec<(String, u32, u32)>) {
if glue.is_empty() {
return;
}
let s = glue.iter().map(|(_, a, _)| *a).min().unwrap();
let e = glue.iter().map(|(_, _, b)| *b).max().unwrap();
let only_mod_decls = glue.iter().all(|(is_mod, _, _)| *is_mod == 1);
let sym = if only_mod_decls { "<module>" } else { "<top-level>" };
units.push((sym.to_string(), s, e));
glue.clear();
}
walk(tree.root_node(), source, &[], &mut units, &mut glue);
let total_lines = lines.len() as u32;
let mut blocks = Vec::with_capacity(units.len());
for (ordinal, (symbol, ls, le)) in units.into_iter().enumerate() {
let line_start = ls.max(1);
let line_end = le.min(total_lines.max(1));
let span = SourceSpan::Code {
line_start,
line_end,
symbol: Some(symbol),
lang: Some("rust".to_string()),
};
let block_id = id_for_block(doc_id, "code", &[], ordinal as u32, &span);
let code = lines[(line_start as usize - 1)..=(line_end as usize - 1)].join("\n");
blocks.push(Block::Code(CodeBlock {
common: CommonBlock {
block_id,
heading_path: Vec::new(),
source_span: span,
},
lang: Some("rust".to_string()),
code,
}));
}
Ok(blocks)
}
Notes for the implementer:
-
flush_glueordering: glue flushed before pushing a real unit so document order is preserved (glue that precedes the first fn becomes the<top-level>chunk spanning theuse/constregion; theunit_startdoc-comment extension keeps the fn's own doc comment with the fn, not the glue). -
A
glueflushed after a real unit between two fns is still a valid<top-level>unit (rare; acceptable). -
If
unitsis empty (e.g. an empty file) → emit zero blocks (consistent with empty-PDF-page behavior). -
The
eof a fixture's lastmod inner { … }etc. is end-of-block; line slicing uses inclusive 1-based. -
Step 5: Wire into lib.rs
In crates/kebab-parse-code/src/lib.rs:
pub mod rust;
pub use rust::{PARSER_VERSION as RUST_PARSER_VERSION, RustAstExtractor};
- Step 6: Run the tests — expect pass
Run: cargo test -p kebab-parse-code
Expected: PASS (extractor_supports_*, emits_one_block_per_semantic_unit_with_symbols, deterministic_across_runs).
If symbol names mismatch (tree-sitter-rust grammar field-name drift, e.g. impl_item type vs type_arguments), inspect with a scratch node.kind()/field dump and adjust the field names; pin behavior with the test.
- Step 7: clippy + commit
cargo clippy -p kebab-parse-code --all-targets -- -D warnings
git add crates/kebab-parse-code/
git commit -m "feat(p10-1a-2): tree-sitter-rust AST extractor (parser-side)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
Task 7: code-rust-ast-v1 chunker
Files:
- Create:
crates/kebab-chunk/src/code_rust_ast_v1.rs - Modify:
crates/kebab-chunk/src/lib.rs(mod+pub use) - Test: inline
#[cfg(test)] mod testsin the new file
The chunker consumes the AST CanonicalDocument and maps 1 Block::Code → 1 Chunk, except a unit longer than AST_CHUNK_MAX_LINES is split into [part i/N] sub-chunks. tree-sitter is NOT imported here (forbidden — AST is parser-side). Mirror crates/kebab-chunk/src/pdf_page_v1.rs for: VERSION_LABEL const, BYTES_PER_TOKEN = 3, POLICY_HASH_HEX_LEN = 16, policy_hash impl (identical blake3 recipe — cross-chunker fingerprint identity is required), per-chunk policy_hash variant to avoid chunk_id collision on split units, the upfront block-shape validation that bail!s on a non-code doc.
AST_CHUNK_MAX_LINES is a module constant (= 200) matching IngestCodeCfg::default().ast_chunk_max_lines. Threading the config value through the fixed Chunker trait needs a per-medium chunker registry — a P+ task; this mirrors the existing pdf-page-v1 "chunker_version hard-coded" deviation. Record it (Task 11 HOTFIXES).
- Step 1: Write failing tests
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
};
use time::OffsetDateTime;
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
let wp = WorkspacePath("crates/x/src/a.rs".into());
let aid = AssetId("a".repeat(64));
let pv = ParserVersion("code-rust-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let blocks = units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("rust".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
lang: Some("rust".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("rust".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
}
#[test]
fn chunker_version_is_code_rust_ast_v1() {
assert_eq!(CodeRustAstV1Chunker.chunker_version(),
ChunkerVersion("code-rust-ast-v1".into()));
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "pub fn parse() {}\n// x\n}"),
("Foo::double", 5, 7, "fn double() {}\n//\n}"),
]);
let chunks = CodeRustAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
for c in &chunks {
assert_eq!(c.source_spans.len(), 1);
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.chunker_version.0, "code-rust-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
_ => unreachable!(),
}
}
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
// 500-line fn → must split (AST_CHUNK_MAX_LINES = 200).
let body = (0..500).map(|i| format!(" let x{i} = {i};")).collect::<Vec<_>>().join("\n");
let code = format!("pub fn big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeRustAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort(); ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
#[test]
fn non_code_doc_errors() {
use kebab_core::TextBlock;
let mut doc = code_doc(&[("parse", 1, 1, "fn parse(){}")]);
doc.blocks = vec![Block::Paragraph(TextBlock {
common: CommonBlock {
block_id: kebab_core::BlockId("b".into()),
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
})];
let err = CodeRustAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeRustAstV1Chunker"));
}
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "fn parse(){}\n}")]);
let base: Vec<String> = CodeRustAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
for _ in 0..1000 {
let again: Vec<String> = CodeRustAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
assert_eq!(again, base);
}
}
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeRustAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
}
}
- Step 2: Run — expect fail
Run: cargo test -p kebab-chunk code_rust_ast
Expected: FAIL — CodeRustAstV1Chunker undefined.
- Step 3: Implement the chunker
Create crates/kebab-chunk/src/code_rust_ast_v1.rs:
//! `code-rust-ast-v1` — maps a tree-sitter-derived Rust AST
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
//!
//! tree-sitter is intentionally NOT a dependency here: AST work is
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
//! consumes the `CanonicalDocument`.
//!
//! `AST_CHUNK_MAX_LINES` is a constant matching
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
//! config threading needs a chunker registry (P+); same deviation
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
//! (`tasks/HOTFIXES.md`).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "code-rust-ast-v1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
const AST_CHUNK_MAX_LINES: u32 = 200;
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeRustAstV1Chunker;
impl Chunker for CodeRustAstV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeRustAstV1Chunker only handles code docs (got non-Code block)"
),
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
"CodeRustAstV1Chunker only handles code docs (got non-Code source_span)"
);
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let mut out: Vec<Chunk> = Vec::new();
for b in &doc.blocks {
let cb = match b {
Block::Code(c) => c,
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
let span_lines = le.saturating_sub(ls) + 1;
if span_lines <= AST_CHUNK_MAX_LINES {
let span = SourceSpan::Code {
line_start: ls,
line_end: le,
symbol: symbol.clone(),
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
let n = parts.len();
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
symbol: part_sym,
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
));
}
}
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = out.len(),
"code-rust-ast-v1 chunked",
);
Ok(out)
}
}
#[allow(clippy::too_many_arguments)]
fn make_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
block_ids: &[BlockId],
base_policy_hash: &str,
split_key: Option<u32>,
span: SourceSpan,
text: String,
) -> Chunk {
// Per-chunk policy_hash variant prevents chunk_id collision when one
// block splits into multiple parts (same block_ids). Mirrors
// pdf-page-v1. Single-chunk units use the base hash unchanged.
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
text,
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}
/// Split an oversize unit at blank-line paragraph boundaries, greedily
/// gluing paragraphs until ~`AST_CHUNK_MAX_LINES` lines accumulate.
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
/// 0-based within the unit (caller adds the unit's absolute `line_start`).
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
let lines: Vec<&str> = code.split('\n').collect();
let total = lines.len() as u32;
let mut out: Vec<(u32, u32, String)> = Vec::new();
let mut start: u32 = 0;
while start < total {
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
// Prefer ending on a blank line within the last 20% of the window
// so we don't cut mid-paragraph when a boundary is nearby.
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
if end < total {
if let Some(b) = (floor.min(end)..end)
.rev()
.find(|&i| lines[i as usize].trim().is_empty())
{
end = b + 1;
}
}
let text = lines[start as usize..end as usize].join("\n");
out.push((start, end.saturating_sub(1), text));
start = end;
}
if out.is_empty() {
out.push((0, total.saturating_sub(1), code.to_string()));
}
out
}
- Step 4: Wire into lib.rs
In crates/kebab-chunk/src/lib.rs add (matching the existing mod/pub use style):
mod code_rust_ast_v1;
pub use code_rust_ast_v1::CodeRustAstV1Chunker;
- Step 5: Run — expect pass
Run: cargo test -p kebab-chunk code_rust_ast
Expected: PASS (all 6 tests).
- Step 6: clippy + commit
cargo clippy -p kebab-chunk --all-targets -- -D warnings
git add crates/kebab-chunk/
git commit -m "feat(p10-1a-2): code-rust-ast-v1 chunker (1:1 + oversize split)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
Task 8: kebab-app dispatch — ingest_one_code_asset
Files:
-
Modify:
crates/kebab-app/Cargo.toml(addkebab-parse-code = { path = "../kebab-parse-code" }if not already a dep) -
Modify:
crates/kebab-app/src/lib.rs(ingest_one_assetmatch ~line 896; newingest_one_code_assetfn modeled oningest_one_pdf_assetat 1455-end) -
Test:
crates/kebab-app/tests/— addcode_ingest_smoke.rs(mirror an existing app integration test's harness) -
Step 1: Check/Add the crate dep
grep -n "kebab-parse-code" crates/kebab-app/Cargo.toml || \
echo 'kebab-parse-code = { path = "../kebab-parse-code" } # p10-1A-2' \
>> /dev/stderr
If absent, add kebab-parse-code = { path = "../kebab-parse-code" } under [dependencies] in crates/kebab-app/Cargo.toml (1A-1 may have added it for the skip helpers — verify).
- Step 2: Write the failing integration test
Create crates/kebab-app/tests/code_ingest_smoke.rs. Model the harness on an existing app integration test (read one under crates/kebab-app/tests/ for the App/Config/TempDir setup pattern). Core assertions:
// Build an isolated TempDir KB, drop a tiny .rs file in the workspace,
// run ingest, then assert a search returns a Citation::Code.
#[test]
fn rust_file_ingests_and_searches_as_code_citation() {
// ... TempDir + Config pointing workspace.root at it (copy the
// harness from the sibling integration test verbatim) ...
std::fs::write(workspace_root.join("demo.rs"),
"/// adds\npub fn add(a: i32, b: i32) -> i32 {\n a + b\n}\n").unwrap();
let report = kebab_app::ingest_with_config(&cfg, /* args per existing test */).unwrap();
assert!(report.ingested >= 1, "rust file ingested: {report:?}");
let hits = kebab_app::search_with_config(&cfg, "add", /* args */).unwrap();
let code_hit = hits.iter().find(|h| matches!(
h.citation, kebab_core::Citation::Code { .. }));
let h = code_hit.expect("a Citation::Code hit");
match &h.citation {
kebab_core::Citation::Code { lang, symbol, line_start, .. } => {
assert_eq!(lang.as_deref(), Some("rust"));
assert_eq!(symbol.as_deref(), Some("add"));
assert!(*line_start >= 1);
}
_ => unreachable!(),
}
assert_eq!(h.code_lang.as_deref(), Some("rust"));
}
Use the exact
*_with_configfacade signatures from a sibling test (the facade rule — CLAUDE.md — requires the_with_configform). Read one existingcrates/kebab-app/tests/*.rsto copy the harness; do not invent the Config builder.
- Step 3: Run — expect fail
Run: cargo test -p kebab-app rust_file_ingests_and_searches_as_code_citation
Expected: FAIL — code asset currently hits the _ => Skipped arm in ingest_one_asset.
- Step 4: Add the dispatch +
ingest_one_code_asset
In crates/kebab-app/src/lib.rs, in the ingest_one_asset match (~line 896, where MediaType::Pdf => { return ingest_one_pdf_asset(...) }), add before the catch-all _:
MediaType::Code(lang) if lang == "rust" => {
return ingest_one_code_asset(
app, asset, chunk_policy, embedder, vector_store,
existing_doc_ids, force_reingest,
);
}
// Non-Rust code langs activate in later phases (1B+); skip for now.
Add fn ingest_one_code_asset(...) modeled line-for-line on ingest_one_pdf_asset (lib.rs:1455-end) with these substitutions:
- parser_version:
ParserVersion(kebab_parse_code::RUST_PARSER_VERSION.to_string()) - extractor:
kebab_parse_code::RustAstExtractor::new().extract(&ctx, &bytes).context("kb-parse-code::RustAstExtractor::extract")? - chunker:
let chunker = CodeRustAstV1Chunker;+chunker.chunk(&canonical, chunk_policy).context("kb-chunk::CodeRustAstV1Chunker::chunk")? try_skip_unchanged(... &CodeRustAstV1Chunker.chunker_version() ...).context("... (code)")strings instead of(pdf)- import
CodeRustAstV1Chunker(it's re-exported fromkebab_chunk) at the top oflib.rsalongside the existingPdfPageV1Chunkerimport.
Everything else (read bytes, ExtractContext, put_asset/document/blocks/chunks, embed branch, IngestItem construction, kb:// skip) is identical.
- Step 5: Run — expect pass
Run: cargo test -p kebab-app rust_file_ingests_and_searches_as_code_citation
Expected: PASS.
- Step 6: clippy + commit
cargo clippy -p kebab-app --all-targets -- -D warnings
git add crates/kebab-app/
git commit -m "feat(p10-1a-2): wire code ingest dispatch (ingest_one_code_asset)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
Task 9: code_lang_breakdown in kebab schema
Files:
-
Modify:
crates/kebab-store-sqlite/src/store.rs(add acode_lang_breakdown()query next to whatever computesmedia_breakdown~line 608) -
Modify:
crates/kebab-app/src/schema.rs:170(currentlycode_lang_breakdown: BTreeMap::new()) -
Test: extend the existing
schema.rstest (crates/kebab-app/src/schema.rs:196+) to assert a real count after a code ingest, or a store-level unit test inkebab-store-sqlite -
Step 1: Write the failing test
In crates/kebab-store-sqlite add a unit test that inserts a document with metadata.code_lang = Some("rust") and asserts:
#[test]
fn code_lang_breakdown_counts_by_code_lang() {
// ... open in-memory/temp store, put_document with code_lang=Some("rust") ...
let bd = store.code_lang_breakdown().unwrap();
assert_eq!(bd.get("rust"), Some(&1));
}
Read an existing
kebab-store-sqlitetest for the store-setup harness and how documents/metadata are persisted (so thecode_langcolumn / json path is correct —metadatais stored as JSON; the query likely usesjson_extract(metadata_json, '$.code_lang')).
- Step 2: Run — expect fail
Run: cargo test -p kebab-store-sqlite code_lang_breakdown_counts
Expected: FAIL — code_lang_breakdown undefined.
- Step 3: Implement the store query
In crates/kebab-store-sqlite/src/store.rs, mirror the media_breakdown query. The exact column depends on how Metadata is stored — inspect the put_document insert and the existing media_breakdown SQL, then add:
pub fn code_lang_breakdown(&self) -> anyhow::Result<std::collections::BTreeMap<String, u32>> {
// documents whose metadata JSON has a non-null code_lang
let mut stmt = self.conn.prepare(
"SELECT json_extract(metadata_json, '$.code_lang') AS cl, COUNT(*) \
FROM documents \
WHERE cl IS NOT NULL \
GROUP BY cl",
)?;
let rows = stmt.query_map([], |r| {
Ok((r.get::<_, String>(0)?, r.get::<_, i64>(1)? as u32))
})?;
let mut out = std::collections::BTreeMap::new();
for row in rows {
let (k, v) = row?;
out.insert(k, v);
}
Ok(out)
}
Adjust table/column names (
documents,metadata_json) to the actual schema — grep themedia_breakdownimpl and copy itsFROM/column conventions exactly.
- Step 4: Populate it in schema.rs
In crates/kebab-app/src/schema.rs, replace the code_lang_breakdown: std::collections::BTreeMap::new(), placeholder (line ~170) with a call to the new store method (mirror how media_breakdown: counts.media_breakdown is sourced — likely app.sqlite.code_lang_breakdown()?).
- Step 5: Run — expect pass + suite
Run: cargo test -p kebab-store-sqlite code_lang_breakdown_counts
Expected: PASS.
Run: cargo test -p kebab-app schema
Expected: PASS (existing schema.v1 serialization tests still green; code_lang_breakdown now populated).
- Step 6: clippy + commit
cargo clippy -p kebab-store-sqlite -p kebab-app --all-targets -- -D warnings
git add crates/kebab-store-sqlite/ crates/kebab-app/
git commit -m "feat(p10-1a-2): populate schema.v1 code_lang_breakdown
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
Task 10: Full-suite gate + self-ingest snapshot
Files:
-
Create:
crates/kebab-chunk/tests/code_rust_ast_snapshot.rs+ fixturecrates/kebab-chunk/tests/fixtures/code-sample.rs+ baselinecode-sample.chunks.snapshot.json(mirrorcrates/kebab-chunk/tests/long_section_snapshot.rs) -
Step 1: Add a snapshot integration test
Mirror long_section_snapshot.rs exactly (same UPDATE_SNAPSHOTS=1 regen mechanism). Build a CanonicalDocument by running kebab_parse_code::RustAstExtractor on tests/fixtures/code-sample.rs (a representative file with a free fn, an impl with 2 methods, a struct, a trait, a top-level use/const block, and one >200-line fn to exercise the split), chunk with CodeRustAstV1Chunker, serialize, compare to baseline.
kebab-chunkmay not depend onkebab-parse-codeeven as a dev-dep if that crosses a boundary — checktasks/p10/p10-1a-2-rust-ast-chunker.mdAllowed deps. It does not list kebab-chunk→kebab-parse-code. So instead, build theCanonicalDocumentby hand in the test (constructBlock::Codeunits directly, like Task 7'scode_dochelper) rather than invoking the extractor. The snapshot then locks the chunker's mapping/splitting, which is the unit under test here. (Extractor behavior is already locked by Task 6's tests.)
- Step 2: Generate the baseline
Run: UPDATE_SNAPSHOTS=1 cargo test -p kebab-chunk code_rust_ast_snapshot
Then run without the env var:
Run: cargo test -p kebab-chunk code_rust_ast_snapshot
Expected: PASS.
- Step 3: Full workspace suite (the real gate)
cargo test --workspace --no-fail-fast -j 1
Expected: PASS. Pay attention to: citation/wire round-trip tests (Citation still 6 variants — Code from 1A-1 unchanged), any golden eval fixtures, asset/MediaType serialization tests. Fix faithfully; a regression here means an earlier task's match arm was wrong.
cargo clippy --workspace --all-targets -- -D warnings
Expected: clean.
- Step 4: Manual smoke (SMOKE.md flow, isolated TempDir)
cargo build --release
rm -rf /tmp/kebab-p10smoke && mkdir -p /tmp/kebab-p10smoke/ws
cp crates/kebab-chunk/src/code_rust_ast_v1.rs /tmp/kebab-p10smoke/ws/
cat > /tmp/kebab-p10smoke/config.toml <<'EOF'
[workspace]
root = "/tmp/kebab-p10smoke/ws"
[paths]
data_dir = "/tmp/kebab-p10smoke/data"
EOF
# (match the SMOKE.md config skeleton if these keys differ — read docs/SMOKE.md)
./target/release/kebab --config /tmp/kebab-p10smoke/config.toml ingest --json
./target/release/kebab --config /tmp/kebab-p10smoke/config.toml search "chunk" --json | head
./target/release/kebab --config /tmp/kebab-p10smoke/config.toml schema --json | grep code_lang
Expected: ingest report shows ≥1 ingested; a search hit with "citation":{"kind":"code",...,"lang":"rust"} and top-level "code_lang":"rust", "repo":...; schema --json code_lang_breakdown has "rust".
- Step 5: Commit
git add crates/kebab-chunk/tests/
git commit -m "test(p10-1a-2): code-rust-ast-v1 chunker snapshot + full-suite gate
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
Task 11: Docs, version bump, status flips
Files:
-
README.md— 지원 형식 / 명령 table: note Rust code ingest is now active (per CLAUDE.md README rule — a new media surface). Mermaid: addcodesource crossing the boundary only if a media type newly crosses it (it does — add a code-ingest edge to the existing diagram). -
HANDOFF.md— P10 phase row: note 1A-2 merged (Rust code ingest active, kebab self-dogfooding possible); add a one-line entry under 머지 후 결정 if a HOTFIXES item lands. -
docs/ARCHITECTURE.md— add thekebab-app → kebab-parse-codeedge +kebab-parse-code → tree-sitter/tree-sitter-rustto the dependency graph; add the locked-in decision "tree-sitter lives parser-side, not chunker-side (design §6.3)" to the decisions table. -
docs/SMOKE.md— add the Rust code-ingest smoke steps (the Task 10 Step 4 flow), and the[ingest.code]config keys if not already documented. -
tasks/INDEX.mdline ~143 +tasks/p10/INDEX.mdrow 1A-2 — flip 1A-1 to ✅ and 1A-2 to ✅ (on merge). -
tasks/HOTFIXES.md— add a dated entry for theAST_CHUNK_MAX_LINESconstant-vs-config deviation (chunker can't seeIngestCodeCfg.ast_chunk_max_linesthrough the fixedChunkertrait; pinned to the default 200; per-medium chunker registry is P+). Cross-link one line intasks/p10/p10-1a-2-rust-ast-chunker.mdRisks/notes. -
docs/superpowers/specs/2026-04-27-kebab-final-form-design.md§10.1 — record the 1A-2 surface per design §10.1 (already partly done for SourceSpan/MediaType in Tasks 2/4; ensure §10.1 mentions code ingest activation). -
Cargo.toml— workspaceversionminor bump (design §10.4: 1A-2 merge = 도그푸딩 가능 = bump trigger). e.g.0.6.x→0.7.0(check current value first). -
Step 1: Make all doc edits above. Keep README narrow (usage only, one Mermaid). HANDOFF gets phase status. ARCHITECTURE gets the graph/decision.
-
Step 2: Version bump
grep -m1 '^version' Cargo.toml # read current workspace version
# bump minor in Cargo.toml [workspace.package].version
cargo build --release # refresh Cargo.lock + binary identity
- Step 3: Full suite once more (docs/version shouldn't break it, but the lock changed)
cargo test --workspace --no-fail-fast -j 1
cargo clippy --workspace --all-targets -- -D warnings
Expected: PASS / clean.
- Step 4: Commit (bump = release commit, same commit per CLAUDE.md)
git add -A
git commit -m "docs(p10-1a-2): README/HANDOFF/ARCHITECTURE/SMOKE + HOTFIXES + status; chore: bump version
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>"
Finalize: PR + review loop + release
Per tasks/HOTFIXES.md workflow memory and CLAUDE.md Remote section (Gitea, not gh):
- Use the gitea-ops skill:
gitea-prto open the PR (feature branch → main). Title:feat(p10-1a-2): Rust AST chunker — tree-sitter-rust code ingest active. Body summarizes: SourceSpan::Code internal variant, parser-side tree-sitter (design §6.3), code-rust-ast-v1 chunker + oversize split, MediaType::Code, schema code_lang_breakdown, frozen design §3.4/§10.1 sync, version bump. - Review loop mode (do not ask single-shot):
gitea-pr-status --wait-ci→gitea-pr-diff→ analyze →gitea-pr-review(REQUEST_CHANGES/APPROVE) each round; actionable comments → follow-up commits; converge to APPROVE. - On APPROVE: merge immediately (no asking), sync local
main, deletefeat/p10-1a-2-rust-ast-chunker. - After merge:
cargo clean(CLAUDE.md routine). Cut release via gitea-opsgitea-release v<new version>(release notes: Rust code ingest active,Citation::Codenow populated,MediaType::Code,schema.v1 code_lang_breakdown, internalSourceSpan::Code). - Flip
tasks/INDEX.md/tasks/p10/INDEX.md1A-2 → ✅ if not already in the merged commit; update memory phase-priorities note if the next-task priority shifts (P10-1B vs other).
Self-Review (completed by plan author)
- Spec coverage: design §1A-2 (Rust chunker + tree-sitter-rust + activation) → Tasks 6/7/8; §3.3 (
code-rust-ast-v1) → Task 7; §3.4 symbol path → Task 6 walk rules + Task 2 SourceSpan; §6.1 (rust.rs parser) → Task 6; §6.2 (kebab-chunk module) → Task 7; §6.3 dep graph (tree-sitter parser-side) → Task 1 + Task 7 forbidden-dep note; §9.1 Tier-1 + oversize fallback → Task 7split_oversize; §10.4 version bump → Task 11; wire (no v2 — Citation::Code from 1A-1) → Task 10 Step 3 gate. Citation routing gap (1A-1 left unwired) → Tasks 2+3.MediaType::Code+ routing → Tasks 4+5. schema breakdown → Task 9. Docs cascade → Task 11. - Placeholder scan: novel logic (SourceSpan variant, citation arm, AST walk, chunker, split) is given in full. Mechanical mirrors (extractor scaffold,
ingest_one_code_asset, store breakdown query, integration-test harness) are pinned to an exact existing file:line to copy with enumerated deltas — the established-pattern path the writing-plans skill endorses, not "TBD". - Type consistency:
RustAstExtractor/RUST_PARSER_VERSION/CodeRustAstV1Chunker/VERSION_LABEL="code-rust-ast-v1"/SourceSpan::Code{line_start,line_end,symbol,lang}/Citation::Code(1A-1 shape) used consistently across Tasks 2/3/6/7/8.id_for_block(doc,"code",&[],ordinal,&span)andid_for_chunk(doc,cv,block_ids,hash)matchcrates/kebab-core/src/ids.rs:146,163.