diff --git a/docs/superpowers/plans/2026-05-21-p10-1d-c-cpp-ast-chunker.md b/docs/superpowers/plans/2026-05-21-p10-1d-c-cpp-ast-chunker.md
new file mode 100644
index 0000000..89c74e7
--- /dev/null
+++ b/docs/superpowers/plans/2026-05-21-p10-1d-c-cpp-ast-chunker.md
@@ -0,0 +1,930 @@
+# p10-1D C + C++ AST Chunkers Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Activate C + C++ code ingest end-to-end. This is the final entry in the P10 Tier 1 chunker family.
+
+**Architecture:** Same shape as 1B (multi-language single PR) and 1C-JK (JVM family): 2 new tree-sitter grammars + 2 extractors + 2 chunkers + media routing (delegated via `code_lang_for_path`, no change) + app dispatch arms. C symbol = function name only; C++ symbol = `namespace::Class::method` via recursive class/namespace nesting (Java/Kotlin + Python hybrid).
+
+**Tech Stack:** Rust 2024 workspace, `tree-sitter` 0.26 (already a dependency), `tree-sitter-c` + `tree-sitter-cpp` (NEW). 1A-2/1B/1C/p10-2/p10-3 infrastructure unchanged.
+
+**Memory note:** The host has OOM'd before and had to be rebooted. Run cargo per crate only. ONE full-suite + clippy invocation in Task J. NO `cargo test --workspace` outside that gate.
+
+---
+
+## Pre-flight
+
+Branch `feat/p10-1d-c-cpp` already exists (spec commit `8add684`).
+
+- [ ] **Disk hygiene**: Check `df -h /`. If usage is above 80%, run `cargo clean`.
+
+Reference files:
+- 1C-JK extractor: `crates/kebab-parse-code/src/{java,kotlin}.rs`: closest template for source-side identifier prefix (package vs namespace).
+- 1B Python extractor: `crates/kebab-parse-code/src/python.rs`: class-nesting recursion model (relevant for C++ class nesting).
+- 1A-2 chunker: `crates/kebab-chunk/src/code_rust_ast_v1.rs`: duplicate-with-substitution pattern.
+- 1B/1C/p10-2/p10-3 dispatch generalization: `crates/kebab-app/src/lib.rs::ingest_one_code_asset` (~L1796–2116).
Current allowlist + 4-arm match. +- spec: `tasks/p10/p10-1d-c-cpp-ast-chunker.md`. + +--- + +## Task A: Workspace deps (tree-sitter-c + tree-sitter-cpp) + +**Files:** +- Modify: `Cargo.toml` (`[workspace.dependencies]`, after `tree-sitter-kotlin-ng`) +- Modify: `crates/kebab-parse-code/Cargo.toml` + +- [ ] **Step 1**: `cargo add tree-sitter-c tree-sitter-cpp -p kebab-parse-code`. If either crate's actively-maintained name differs (e.g. `tree-sitter-cpp` vs `tree-sitter-cpp-ng`), verify on crates.io. The `tree-sitter-c` 0.24 / `tree-sitter-cpp` 0.23 line is the most common; verify compatibility with workspace `tree-sitter = "0.26"` (likely already supported via the `tree-sitter-language` shim). + +- [ ] **Step 2**: Lift the two resolved versions into `[workspace.dependencies]` (after `tree-sitter-kotlin-ng`): + +```toml +# C/C++ family grammars for code ingest (kebab-parse-code, p10-1D). +tree-sitter-c = "" +tree-sitter-cpp = "" +``` + +Switch crate's `Cargo.toml` entries to `{ workspace = true }`. + +- [ ] **Step 3**: `cargo build -p kebab-parse-code` → clean. Unused dep warning is fine. + +- [ ] **Step 4**: Commit: + +```bash +git add Cargo.toml Cargo.lock crates/kebab-parse-code/Cargo.toml +git commit -m "$(cat <<'EOF' +build(p10-1d): add tree-sitter-c + tree-sitter-cpp workspace deps + +Co-Authored-By: Claude Opus 4.7 (1M context) +EOF +)" +``` + +If a crate's resolved name has a non-obvious fork suffix (e.g. `tree-sitter-cpp-ng`), document it in the commit body. + +--- + +## Task B: C AST extractor (`kebab-parse-code/src/c.rs`) + +**Files:** +- Create: `crates/kebab-parse-code/src/c.rs` +- Modify: `crates/kebab-parse-code/src/lib.rs` (pub mod + `C_PARSER_VERSION` const) + +- [ ] **Step 1**: Create `crates/kebab-parse-code/src/c.rs`. Mirror `crates/kebab-parse-code/src/go.rs` (closest template — single-language, no namespace/package nesting, top-level units). Replace tree-sitter-go with tree-sitter-c: + +```rust +//! p10-1D: C AST extractor. 
+ +use crate::traits::{Extractor, ExtractContext}; +use anyhow::{Context, Result}; +use kebab_core::{Block, BlockId, CanonicalDocument, CodeBlock, CommonBlock, /*..*/, SourceSpan, id_for_block, id_for_doc}; +use tree_sitter::Parser; + +pub const C_PARSER_VERSION: &str = concat!("tree-sitter-c-", env!("CARGO_PKG_VERSION")); +// Or use the tree-sitter-c crate version: better to hardcode for stability. +// Look at how go.rs / rust.rs / etc. set their PARSER_VERSION. + +pub struct CAstExtractor { + parser: Parser, +} + +impl CAstExtractor { + pub fn new() -> Self { + let mut parser = Parser::new(); + parser.set_language(&tree_sitter_c::LANGUAGE.into()).expect("load tree-sitter-c"); + Self { parser } + } +} + +impl Extractor for CAstExtractor { + fn extract(&mut self, ctx: &ExtractContext, bytes: &[u8]) -> Result { + // ... mirror go.rs: + // 1. parse the tree + // 2. iterate source_file's named_children + // 3. for each top-level node: + // - function_definition → emit unit (symbol = fn name) + // - struct_specifier (named) → emit unit (symbol = struct name) + // - enum_specifier (named) → emit unit (symbol = enum name) + // - union_specifier (named) → emit unit (symbol = union name) + // - declaration → glue + // - preproc_include / preproc_def / preproc_function_def / preproc_ifdef → glue + // - else → glue + // 4. glue chunk if any glue accumulated + // 5. post-pass if 0 units + // ... + todo!("mirror go.rs structure with C-specific node-kind names") + } +} +``` + +**ACTION**: Read `crates/kebab-parse-code/src/go.rs` in full first. It's the closest template — single-language, no namespace prefix to thread through (C is even simpler than Go since there's no `package`). Port the structure: parse → iterate top-level → match on node-kind → emit units or accumulate glue. + +Node-kind name reference (tree-sitter-c): `function_definition`, `struct_specifier`, `enum_specifier`, `union_specifier`, `declaration`, `preproc_*`. 
Confirm by checking the crate's `node-types.json` if uncertain. + +**Function name extraction**: `function_definition` has a `declarator` field. The innermost `identifier` of that declarator is the function name. Mirror how go.rs extracts function names — it uses tree-sitter field traversal. + +- [ ] **Step 2**: Register the module in `crates/kebab-parse-code/src/lib.rs`: + +```rust +pub mod c; +pub use c::{CAstExtractor, C_PARSER_VERSION}; +``` + +- [ ] **Step 3**: Build: + +```bash +cargo build -p kebab-parse-code 2>&1 | tail -5 +``` + +Expected: clean. + +- [ ] **Step 4**: Commit (no test yet — Task D adds the snapshot test): + +```bash +git add crates/kebab-parse-code/src/c.rs crates/kebab-parse-code/src/lib.rs +git commit -m "$(cat <<'EOF' +feat(p10-1d): C AST extractor (tree-sitter-c) + +Top-level units: function_definition (symbol = fn name), struct_specifier, +enum_specifier, union_specifier (each emits 1 unit with the symbol being +the named identifier). Preprocessor directives + top-level declarations +group into a glue chunk. Empty file or zero units → +post-pass. + +C symbol = function name only — no namespace, no class nesting (design §3.4). + +Co-Authored-By: Claude Opus 4.7 (1M context) +EOF +)" +``` + +--- + +## Task C: C++ AST extractor (`kebab-parse-code/src/cpp.rs`) + +**Files:** +- Create: `crates/kebab-parse-code/src/cpp.rs` +- Modify: `crates/kebab-parse-code/src/lib.rs` + +- [ ] **Step 1**: Create `crates/kebab-parse-code/src/cpp.rs`. The closest template is `crates/kebab-parse-code/src/java.rs` (1C-JK) — it handles package prefix + class nesting via recursion. C++ adds namespace nesting (multiple levels possible). + +Pseudocode: + +```rust +//! p10-1D: C++ AST extractor. + +use crate::traits::{Extractor, ExtractContext}; +use anyhow::{Context, Result}; +use kebab_core::{/* ... 
*/}; +use tree_sitter::{Node, Parser}; + +pub const CPP_PARSER_VERSION: &str = "tree-sitter-cpp-"; + +pub struct CppAstExtractor { parser: Parser } + +impl CppAstExtractor { + pub fn new() -> Self { + let mut parser = Parser::new(); + parser.set_language(&tree_sitter_cpp::LANGUAGE.into()).expect("load tree-sitter-cpp"); + Self { parser } + } + + fn visit(&self, node: Node, source: &[u8], prefix: &[&str], units: &mut Vec<(String, Node)>, glue: &mut Vec) { + // prefix is the namespace/class chain so far (e.g. ["kebab", "chunk", "MdHeadingV1Chunker"]). + for child in node.named_children(&mut node.walk()) { + match child.kind() { + "namespace_definition" => { + let name = child.child_by_field_name("name") + .and_then(|n| n.utf8_text(source).ok()) + .unwrap_or(""); + let mut new_prefix = prefix.to_vec(); + new_prefix.push(name); + let body = child.child_by_field_name("body").unwrap_or(child); + self.visit(body, source, &new_prefix, units, glue); + } + "class_specifier" | "struct_specifier" if child.child_by_field_name("name").is_some() => { + let name = child.child_by_field_name("name") + .and_then(|n| n.utf8_text(source).ok()) + .unwrap_or(""); + // Emit the class itself as a unit. + let symbol = build_symbol(prefix, &[], name); // e.g. "kebab::chunk::Foo" + units.push((symbol, child)); + // Recurse for nested classes / methods. + let mut new_prefix = prefix.to_vec(); + new_prefix.push(name); + let body = child.child_by_field_name("body").unwrap_or(child); + self.visit(body, source, &new_prefix, units, glue); + } + "function_definition" => { + // declarator may be qualified_identifier (out-of-class def) or plain identifier. + let symbol = extract_fn_symbol(child, source, prefix); + units.push((symbol, child)); + // Do NOT recurse into function body — inner classes/lambdas left to a future revision. + } + "template_declaration" => { + // Recurse: unwrap to inner declarator (function_definition or class_specifier) + // and treat it as if it were directly there. 
Template params NOT in symbol. + self.visit(child, source, prefix, units, glue); + } + "enum_specifier" if child.child_by_field_name("name").is_some() => { + let name = child.child_by_field_name("name").and_then(|n| n.utf8_text(source).ok()).unwrap_or(""); + let symbol = build_symbol(prefix, &[], name); + units.push((symbol, child)); + } + "concept_definition" => { + let name = /* extract */; + let symbol = build_symbol(prefix, &[], &name); + units.push((symbol, child)); + } + _ => glue.push(child), + } + } + } +} + +fn build_symbol(prefix: &[&str], extras: &[&str], leaf: &str) -> String { + // Join with :: + let mut parts: Vec<&str> = prefix.iter().copied().collect(); + parts.extend_from_slice(extras); + parts.push(leaf); + parts.join("::") +} + +fn extract_fn_symbol(node: Node, source: &[u8], prefix: &[&str]) -> String { + // function_definition.declarator may be a function_declarator wrapping a + // qualified_identifier (out-of-class def like `void Foo::bar(){}`) or a + // plain identifier (free fn or in-namespace fn). + // Need to walk down to the leaf identifier and any qualifier chain. + // For qualified_identifier "Foo::bar::baz", break into ["Foo", "bar"] qualifier + "baz" leaf. + // ... + todo!("walk declarator → qualified_identifier → assemble symbol with prefix") +} + +// Extractor impl: parse, visit(root, ...), emit chunks-of-blocks per (symbol, node) pair + glue + fallback. +``` + +This is the most intricate extractor in p10-1D. **Action**: read `crates/kebab-parse-code/src/java.rs` for the recursion pattern, then `crates/kebab-parse-code/src/python.rs` for the class-nesting pattern, and combine. tree-sitter-cpp's node-types.json (or a quick `tree-sitter parse` against a sample file) confirms exact node-kind names. 
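
The symbol assembly that `extract_fn_symbol` leaves as `todo!()` can be prototyped on plain strings before it is wired to tree-sitter nodes. The sketch below is only an illustration under stated assumptions, not the planned implementation: `split_qualified` and `fn_symbol` are hypothetical helpers, they take the declarator's name as text (for example `Foo::bar`) instead of walking `qualified_identifier` nodes, and they split naively on `::`, so a name with template arguments such as `Foo<std::string>::bar` would need the node-based walk instead.

```rust
/// Hypothetical helper (not part of the plan): split the text of a qualified
/// name such as "Foo::bar" into its qualifier chain and its leaf identifier.
/// Naive by design: assumes "::" never appears inside template arguments.
fn split_qualified(text: &str) -> (Vec<&str>, &str) {
    let mut parts: Vec<&str> = text
        .split("::")
        .map(str::trim)
        .filter(|p| !p.is_empty()) // drops the empty piece left by a leading "::"
        .collect();
    let leaf = parts.pop().unwrap_or("");
    (parts, leaf)
}

/// Same shape as the plan's `build_symbol`: prefix + extras + leaf joined with "::".
fn build_symbol(prefix: &[&str], extras: &[&str], leaf: &str) -> String {
    let mut parts: Vec<&str> = prefix.to_vec();
    parts.extend_from_slice(extras);
    parts.push(leaf);
    parts.join("::")
}

/// Hypothetical helper: combine the enclosing namespace/class chain with the
/// (possibly qualified) name found in a function declarator.
fn fn_symbol(prefix: &[&str], declarator_name: &str) -> String {
    let (qualifiers, leaf) = split_qualified(declarator_name);
    build_symbol(prefix, &qualifiers, leaf)
}
```

Calling `fn_symbol(&["kebab", "chunk"], "Foo::bar")` joins the visitor's prefix with the out-of-class qualifier, while a plain name such as `main` with an empty prefix passes through unchanged. Keeping the split separate from the join means the tree-sitter version only has to replace `split_qualified`.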
+ +- [ ] **Step 2**: Register in `crates/kebab-parse-code/src/lib.rs`: + +```rust +pub mod cpp; +pub use cpp::{CppAstExtractor, CPP_PARSER_VERSION}; +``` + +- [ ] **Step 3**: Build: + +```bash +cargo build -p kebab-parse-code 2>&1 | tail -5 +``` + +Expected: clean. + +- [ ] **Step 4**: Commit: + +```bash +git add crates/kebab-parse-code/src/cpp.rs crates/kebab-parse-code/src/lib.rs +git commit -m "$(cat <<'EOF' +feat(p10-1d): C++ AST extractor (tree-sitter-cpp) + +Symbol = namespace::Class::method via recursive visit. namespace_definition +pushes namespace name (anonymous → ). class_specifier / struct_specifier +(named) emit class unit + recurse with class name pushed. function_definition +emits method unit (symbol may include qualified_identifier prefix for +out-of-class definitions). template_declaration unwraps to inner declarator +(template params NOT in symbol). enum_specifier + concept_definition emit +type-level units. extern "C" block content + using/include/define → glue. + +Constructor / destructor symbols use Class::Class / Class::~Class +convention. Operator overloads keep operator+ form. + +Co-Authored-By: Claude Opus 4.7 (1M context) +EOF +)" +``` + +--- + +## Task D: C chunker + snapshot test + +**Files:** +- Create: `crates/kebab-chunk/src/code_c_ast_v1.rs` +- Create: `crates/kebab-chunk/tests/fixtures/sample.c` +- Create: `crates/kebab-chunk/tests/code_c_ast_snapshot.rs` +- Modify: `crates/kebab-chunk/src/lib.rs` + +- [ ] **Step 1**: Create `crates/kebab-chunk/src/code_c_ast_v1.rs`. **Mirror `crates/kebab-chunk/src/code_go_ast_v1.rs`** (closest 1-extractor pattern, no nesting): + +```rust +//! p10-1D: C AST chunker. 
+ +use crate::tier2_shared::build_chunk; +use crate::{Chunker, ChunkPolicy}; +use anyhow::Result; +use kebab_core::{Block, Chunk, Document}; + +pub const VERSION_LABEL: &str = "code-c-ast-v1"; + +pub struct CodeCAstV1Chunker; + +impl Chunker for CodeCAstV1Chunker { + fn chunker_version(&self) -> &'static str { VERSION_LABEL } + fn policy_hash(&self, policy: &ChunkPolicy) -> String { + crate::tier2_shared::policy_hash(policy) + } + fn chunk(&self, doc: &Document, policy: &ChunkPolicy) -> Result> { + // Mirror code_go_ast_v1.rs's body — iterate doc.blocks, each Block::Code + // contributes 1 chunk via build_chunk. Apply oversize fallback per block + // via tier2_shared::push_chunks_with_oversize. + // ... + todo!("mirror code_go_ast_v1.rs verbatim, substituting VERSION_LABEL") + } +} +``` + +Read `code_go_ast_v1.rs` and port verbatim — the language-agnostic body iterates `doc.blocks` and emits chunks. Only the `VERSION_LABEL` and (potentially) symbol formatting helper change. + +- [ ] **Step 2**: Create `tests/fixtures/sample.c` (~30 lines, includes top-level fn, struct, enum, preprocessor): + +```c +#include +#include + +#define MAX_BUF 4096 + +typedef enum { + OK = 0, + ERR_PARSE, + ERR_IO, +} status_t; + +typedef struct { + int id; + char name[64]; + status_t status; +} record_t; + +static int counter = 0; + +int parse_record(const char *line, record_t *out) { + if (line == NULL || out == NULL) return ERR_PARSE; + return OK; +} + +void print_record(const record_t *r) { + printf("[%d] %s (status=%d)\n", r->id, r->name, r->status); +} + +int main(void) { + record_t r = { .id = 1, .name = "foo", .status = OK }; + print_record(&r); + return 0; +} +``` + +Expected snapshot: 3 function units (`parse_record`, `print_record`, `main`) + 1 enum unit (`status_t`) + 1 struct unit (`record_t`) + 1 `` glue (preproc + global var). Total ~6 chunks. + +- [ ] **Step 3**: Create `tests/code_c_ast_snapshot.rs` mirroring `tests/code_go_ast_snapshot.rs`. 
Assertions: + +```rust +// Pseudocode: +// 1. Load fixture sample.c +// 2. Run CAstExtractor → Document +// 3. Run CodeCAstV1Chunker.chunk(&doc, &policy) +// 4. Assert chunks.len() == expected (6). +// 5. Assert symbols (from chunks[i].source_spans[0]::SourceSpan::Code.symbol) match expected list: +// ["status_t", "record_t", "parse_record", "print_record", "main", ""] +// (order matches AST traversal order — verify by running once.) +// 6. Assert all chunks have lang = Some("c"). +``` + +- [ ] **Step 4**: Register module in `crates/kebab-chunk/src/lib.rs`: + +```rust +pub mod code_c_ast_v1; +pub use code_c_ast_v1::CodeCAstV1Chunker; +``` + +- [ ] **Step 5**: Run test: + +```bash +cargo test -p kebab-chunk --test code_c_ast_snapshot -- --nocapture 2>&1 | tail -25 +``` + +Expected: PASS. If chunk count or symbol order differs from expectation, INSPECT the actual output and update the test's expected list to match (run once to learn, codify on second run). + +- [ ] **Step 6**: Clippy + commit: + +```bash +cargo clippy -p kebab-chunk --all-targets -- -D warnings +git add crates/kebab-chunk/src/code_c_ast_v1.rs \ + crates/kebab-chunk/src/lib.rs \ + crates/kebab-chunk/tests/fixtures/sample.c \ + crates/kebab-chunk/tests/code_c_ast_snapshot.rs +git commit -m "$(cat <<'EOF' +feat(p10-1d): code-c-ast-v1 chunker + snapshot test + +Mirrors code-go-ast-v1's chunker pattern (1 chunk per AST unit + +glue + oversize fallback). Snapshot test against tests/fixtures/sample.c +(function + struct + enum + preprocessor) verifies symbol order + lang=c +stamping. + +Co-Authored-By: Claude Opus 4.7 (1M context) +EOF +)" +``` + +--- + +## Task E: C++ chunker + snapshot test + +**Files:** +- Create: `crates/kebab-chunk/src/code_cpp_ast_v1.rs` +- Create: `crates/kebab-chunk/tests/fixtures/sample.cpp` +- Create: `crates/kebab-chunk/tests/code_cpp_ast_snapshot.rs` +- Modify: `crates/kebab-chunk/src/lib.rs` + +- [ ] **Step 1**: Create `code_cpp_ast_v1.rs`. 
**Mirror `code_c_ast_v1.rs`** verbatim, only VERSION_LABEL differs: + +```rust +pub const VERSION_LABEL: &str = "code-cpp-ast-v1"; + +pub struct CodeCppAstV1Chunker; + +impl Chunker for CodeCppAstV1Chunker { + fn chunker_version(&self) -> &'static str { VERSION_LABEL } + // ... identical body — both languages use the same Block::Code → Chunk emission ... +} +``` + +The actual symbol-formatting work happens in the EXTRACTOR (Task C). The chunker's job is to iterate blocks the extractor produced and emit Chunks. Both C and C++ chunkers are essentially identical bodies. + +- [ ] **Step 2**: Create `tests/fixtures/sample.cpp` (~50 lines, includes namespace + nested class + method + free fn + template): + +```cpp +#include +#include + +namespace kebab { +namespace chunk { + +class MdHeadingV1Chunker { +public: + MdHeadingV1Chunker() = default; + ~MdHeadingV1Chunker() = default; + + std::string chunk_doc(const std::string& doc) { + return doc; + } + + int operator()(int x) const { + return x * 2; + } + +private: + int counter_ = 0; +}; + +template +T identity(T value) { + return value; +} + +} // namespace chunk + +void global_helper() { + // free function in kebab namespace +} + +} // namespace kebab + +int main() { + kebab::chunk::MdHeadingV1Chunker c; + return 0; +} +``` + +Expected snapshot symbols (verify on first run, then codify): +- `kebab::chunk::MdHeadingV1Chunker` (class unit) +- `kebab::chunk::MdHeadingV1Chunker::MdHeadingV1Chunker` (constructor) +- `kebab::chunk::MdHeadingV1Chunker::~MdHeadingV1Chunker` (destructor) +- `kebab::chunk::MdHeadingV1Chunker::chunk_doc` +- `kebab::chunk::MdHeadingV1Chunker::operator()` +- `kebab::chunk::identity` (template fn) +- `kebab::global_helper` +- `main` (free fn, no namespace) +- `` (include + using) + +~9 chunks total. + +- [ ] **Step 3**: Create `tests/code_cpp_ast_snapshot.rs` mirroring `code_c_ast_snapshot.rs`. Assert symbol list matches expected (run once to learn the actual order, codify). 
+ +- [ ] **Step 4**: Register module in `lib.rs`: + +```rust +pub mod code_cpp_ast_v1; +pub use code_cpp_ast_v1::CodeCppAstV1Chunker; +``` + +- [ ] **Step 5**: Run test: + +```bash +cargo test -p kebab-chunk --test code_cpp_ast_snapshot -- --nocapture 2>&1 | tail -30 +``` + +Expected: PASS. + +- [ ] **Step 6**: Clippy + commit: + +```bash +cargo clippy -p kebab-chunk --all-targets -- -D warnings +git add crates/kebab-chunk/src/code_cpp_ast_v1.rs \ + crates/kebab-chunk/src/lib.rs \ + crates/kebab-chunk/tests/fixtures/sample.cpp \ + crates/kebab-chunk/tests/code_cpp_ast_snapshot.rs +git commit -m "$(cat <<'EOF' +feat(p10-1d): code-cpp-ast-v1 chunker + snapshot test + +Identical chunker body to code-c-ast-v1; per-language work happens in the +CppAstExtractor (Task C). Snapshot fixture covers nested namespace + +class + ctor/dtor + method + operator overload + template fn + free fn + +top-level main, verifying namespace::Class::method symbol convention per +design §3.4. + +Co-Authored-By: Claude Opus 4.7 (1M context) +EOF +)" +``` + +--- + +## Task F: ingest_one_code_asset dispatch + tier3 fallback list extension + +**Files:** +- Modify: `crates/kebab-app/src/lib.rs` + +- [ ] **Step 1**: Top-of-file `use kebab_chunk::{...}` extend with `CodeCAstV1Chunker` + `CodeCppAstV1Chunker`: + +```rust +use kebab_chunk::{ + /* existing items */, + CodeCAstV1Chunker, + CodeCppAstV1Chunker, +}; +``` + +- [ ] **Step 2**: Allowlist (around line 953) extend: + +```rust +if matches!(lang.as_str(), + "rust" | "python" | "typescript" | "javascript" | "go" | "java" | "kotlin" + | "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod" + | "shell" + | "c" | "cpp") +``` + +- [ ] **Step 3**: `parser_version` match — add C/C++ arms (Tier 1, so they DO get a real parser version): + +```rust +let parser_version = match code_lang { + // ... existing 7 Tier 1 + Tier 2 + shell arms ... 
+ "c" => ParserVersion(kebab_parse_code::C_PARSER_VERSION.to_string()), + "cpp" => ParserVersion(kebab_parse_code::CPP_PARSER_VERSION.to_string()), + other => anyhow::bail!("unsupported code_lang: {other}"), +}; +``` + +- [ ] **Step 4**: `chunker_version` match — add C/C++ arms: + +```rust +let chunker_version = match code_lang { + // ... existing arms ... + "c" => CodeCAstV1Chunker.chunker_version(), + "cpp" => CodeCppAstV1Chunker.chunker_version(), + other => anyhow::bail!("unreachable chunker_version: {other}"), +}; +``` + +- [ ] **Step 5**: `canonical_result` extract match — add C/C++ arms: + +```rust +let canonical_result: anyhow::Result = match code_lang { + "rust" => RustAstExtractor::new().extract(&ctx, &bytes).context("..."), + // ... existing ... + "c" => CAstExtractor::new().extract(&ctx, &bytes) + .context("kb-parse-code::CAstExtractor::extract (code:c)"), + "cpp" => CppAstExtractor::new().extract(&ctx, &bytes) + .context("kb-parse-code::CppAstExtractor::extract (code:cpp)"), + // ... Tier 2 + shell ... + other => anyhow::bail!("unreachable (extract): {other}"), +}; +``` + +(Add `use kebab_parse_code::{CAstExtractor, CppAstExtractor};` at the top if not already wildcard-imported.) + +- [ ] **Step 6**: `chunks_result` match — add C/C++ arms: + +```rust +let chunks_result: anyhow::Result> = if extract_fell_back { + // ... existing ... +} else { + match code_lang { + "rust" => CodeRustAstV1Chunker.chunk(&canonical, chunk_policy).context("..."), + // ... existing ... + "c" => CodeCAstV1Chunker.chunk(&canonical, chunk_policy) + .context("kb-chunk::CodeCAstV1Chunker::chunk (code:c)"), + "cpp" => CodeCppAstV1Chunker.chunk(&canonical, chunk_policy) + .context("kb-chunk::CodeCppAstV1Chunker::chunk (code:cpp)"), + // ... existing ... 
+ other => anyhow::bail!("unreachable (chunk): {other}"), + } +}; +``` + +- [ ] **Step 7**: `tier3_fallback_cv` (p10-3 Critical fix) — C/C++ are fallback-eligible (extract may fail on `.h` C++ headers or malformed code): + +```rust +let tier3_fallback_cv = match code_lang { + "rust" | "python" | "typescript" | "javascript" + | "go" | "java" | "kotlin" + | "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod" + | "c" | "cpp" // p10-1d: + => Some(CodeTextParagraphV1Chunker.chunker_version()), + _ => None, +}; +``` + +(The exact location of this match is in `ingest_one_code_asset` between ~lines 1921-1927 per the p10-3 critical fix.) + +- [ ] **Step 8**: Build: + +```bash +cargo build -p kebab-app 2>&1 | tail -5 +``` + +Expected: clean. + +- [ ] **Step 9**: Per-crate test (no regression): + +```bash +cargo test -p kebab-app --lib -- --nocapture 2>&1 | tail -10 +``` + +Expected: 52 PASS (existing baseline). + +- [ ] **Step 10**: Clippy + commit: + +```bash +cargo clippy -p kebab-app --all-targets -- -D warnings +git add crates/kebab-app/src/lib.rs +git commit -m "$(cat <<'EOF' +feat(p10-1d): activate C + C++ in ingest_one_code_asset dispatch + +Extends 4-arm match (parser_version / chunker_version / extract / chunks) ++ allowlist + tier3_fallback_cv list with "c" + "cpp" arms. C uses +CAstExtractor + CodeCAstV1Chunker; C++ uses CppAstExtractor + +CodeCppAstV1Chunker. Both langs are Tier 3-fallback-eligible (e.g. .h +file with C++ syntax may fail tree-sitter-c parse → Tier 3 paragraph +fallback). 
+ +Co-Authored-By: Claude Opus 4.7 (1M context) +EOF +)" +``` + +--- + +## Task G: code_ingest_smoke integration tests (C + C++) + +**Files:** +- Modify: `crates/kebab-app/tests/code_ingest_smoke.rs` + +- [ ] **Step 1**: Append 2 tests at the end of the file (mirror the existing tier1 tests `c_ast_v1_*` if present; if not, mirror `rust_ast_v1_*` or `go_ast_v1_*`): + +```rust +#[test] +fn tier1_c_ingest_searchable() { + let env = TestEnv::lexical_only(); + let workspace = env.workspace_root(); + std::fs::write( + workspace.join("parser.c"), + "#include \n\nint parse_record(const char *line) {\n if (line == NULL) return -1;\n return 0;\n}\n", + ) + .unwrap(); + + let report = env.ingest().expect("ingest"); + assert!(report.new_docs >= 1, "expected at least 1 new doc"); + + let hits = env.search_code_lang("c", "parse_record").expect("search"); + assert!(!hits.is_empty(), "expected at least 1 c hit"); + + match &hits[0].citation { + Citation::Code { symbol, lang, .. } => { + assert_eq!(symbol.as_deref(), Some("parse_record"), "C symbol must be function name only"); + assert_eq!(lang.as_deref(), Some("c")); + } + other => panic!("expected Citation::Code, got {other:?}"), + } + assert_eq!( + hits[0].chunker_version.as_ref().map(|c| c.0.as_str()), + Some("code-c-ast-v1"), + ); +} + +#[test] +fn tier1_cpp_ingest_searchable() { + let env = TestEnv::lexical_only(); + let workspace = env.workspace_root(); + std::fs::write( + workspace.join("chunker.cpp"), + "namespace kebab {\nnamespace chunk {\nclass Foo {\npublic:\n void bar() { /* impl */ }\n};\n}\n}\n", + ) + .unwrap(); + + let report = env.ingest().expect("ingest"); + assert!(report.new_docs >= 1); + + let hits = env.search_code_lang("cpp", "bar").expect("search"); + assert!(!hits.is_empty(), "expected at least 1 cpp hit"); + + match &hits[0].citation { + Citation::Code { symbol, lang, .. } => { + // Symbol could be "kebab::chunk::Foo::bar" or "kebab::chunk::Foo" depending on which chunk hits first. 
+ assert!( + symbol.as_deref().map_or(false, |s| s.starts_with("kebab::chunk::Foo")), + "C++ symbol must start with namespace::Class prefix, got {:?}", symbol + ); + assert_eq!(lang.as_deref(), Some("cpp")); + } + other => panic!("expected Citation::Code, got {other:?}"), + } + assert_eq!( + hits[0].chunker_version.as_ref().map(|c| c.0.as_str()), + Some("code-cpp-ast-v1"), + ); +} +``` + +- [ ] **Step 2**: Run tests: + +```bash +cargo test -p kebab-app --test code_ingest_smoke tier1_c_ingest tier1_cpp_ingest -- --nocapture 2>&1 | tail -30 +``` + +Expected: 2 PASS. + +- [ ] **Step 3**: Full smoke regression: + +```bash +cargo test -p kebab-app --test code_ingest_smoke -- --nocapture 2>&1 | tail -30 +``` + +Expected: 18 PASS (16 existing + 2 new). + +- [ ] **Step 4**: Clippy + commit: + +```bash +cargo clippy -p kebab-app --tests -- -D warnings +git add crates/kebab-app/tests/code_ingest_smoke.rs +git commit -m "$(cat <<'EOF' +test(p10-1d): integration smoke tests for C + C++ + +Verifies end-to-end ingest + search + Citation::Code shape: +- tier1_c_ingest_searchable: .c file → --code-lang c search → symbol + = function name (no nesting), lang = "c", chunker_version = "code-c-ast-v1". +- tier1_cpp_ingest_searchable: .cpp file → --code-lang cpp search → + symbol starts with namespace::Class prefix, lang = "cpp", + chunker_version = "code-cpp-ast-v1". + +Brings code_ingest_smoke to 18 tests (Rust 3 + Python 1 + TS 1 + JS 1 + +Go 1 + Java 1 + Kotlin 1 + yaml 1 + dockerfile 1 + manifest 1 + shell 1 + +yaml-fallback 1 + 2 reingest-unchanged regression + c 1 + cpp 1). + +Co-Authored-By: Claude Opus 4.7 (1M context) +EOF +)" +``` + +--- + +## Task H: frozen design §10 activation log + +**Files:** +- Modify: `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` + +- [ ] **Step 1**: Find §10 activation log. 
Add the p10-1D entry right after the p10-3 entry:
+
+```
+**p10-1D activation (C + C++) (2026-05-21)**: Tier 1 chunker family complete: C (`code-c-ast-v1`, `.c`/`.h`) + C++ (`code-cpp-ast-v1`, `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx`) AST chunkers activated. C symbol = function name only; C++ symbol = `namespace::Class::method` (recursive namespace + class nesting). When a `.h` file contains C++ syntax, the tree-sitter-c parse fails and the p10-3 Tier 3 fallback picks it up automatically.
+```
+
+- [ ] **Step 2**: Commit:
+
+```bash
+git add docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md \
+  docs/superpowers/specs/2026-04-27-kebab-final-form-design.md 2>/dev/null
+git add docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
+git commit -m "$(cat <<'EOF'
+docs(p10-1d): activate C + C++ in frozen design §10
+
+P10 Tier 1 chunker family complete.
+
+Co-Authored-By: Claude Opus 4.7 (1M context)
+EOF
+)"
+```
+
+---
+
+## Task I: README + HANDOFF + ARCHITECTURE + SMOKE + tasks/INDEX + tasks/p10/INDEX
+
+**Files:**
+- Modify: `README.md` (Mermaid + ingest row), `HANDOFF.md`, `docs/ARCHITECTURE.md`, `docs/SMOKE.md`, `tasks/INDEX.md`, `tasks/p10/INDEX.md`
+
+- [ ] **Step 1 (README.md)**: Update the `kebab ingest` row's supported-langs list to include `.c` / `.h` → `code-c-ast-v1` and `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx` → `code-cpp-ast-v1`. Add `--code-lang c` / `--code-lang cpp` to the enumeration. Update the Mermaid `chunker[...]` node to include `code-c-ast-v1, code-cpp-ast-v1` inside the braces.
+
+- [ ] **Step 2 (HANDOFF.md)**: Append `, **1D ✅ (C + C++ AST chunkers, code-c-ast-v1 + code-cpp-ast-v1 — v0.16.0)**` to the P10 row. Update the one-line summary (`한 줄 요약`) to include C/C++. Update the next-candidates list (`다음 후보`): drop p10-1D; remaining: P9-5 desktop / P8 audio.
+
+- [ ] **Step 3 (docs/ARCHITECTURE.md)**: Code parser table row: append a mention of C + C++. Flowchart `pcode` node: append `+ P10-1D`. Directory tree chunkers list: add `code_c_ast_v1.rs` + `code_cpp_ast_v1.rs`.
+
+- [ ] **Step 4 (docs/SMOKE.md)**: Add a "## P10-1D C + C++ AST chunker" section after the P10-3 section. Include a walkthrough that ingests sample.c + sample.cpp and asserts on `--code-lang c` / `--code-lang cpp` searches. Append a verification checklist entry.
+
+- [ ] **Step 5 (tasks/INDEX.md + tasks/p10/INDEX.md)**: Flip the p10-1D row ⏳ → ✅ (v0.16.0).
+
+- [ ] **Step 6**: Commit:
+
+```bash
+git add README.md HANDOFF.md docs/ARCHITECTURE.md docs/SMOKE.md tasks/INDEX.md tasks/p10/INDEX.md
+git commit -m "$(cat <<'EOF'
+docs(p10-1d): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX sync
+
+P10 Tier 1 chunker family complete (Rust + Python + TS + JS + Go + Java +
+Kotlin + C + C++). Tier 2 (k8s + dockerfile + manifest) and Tier 3
+(paragraph fallback) already active. p10-1D activated + ✅ flipped.
+
+Co-Authored-By: Claude Opus 4.7 (1M context)
+EOF
+)"
+```
+
+---
+
+## Task J: workspace test gate + clippy
+
+- [ ] **Step 1**: Disk check (`df -h /`) + optional `cargo clean`.
+
+- [ ] **Step 2**: `cargo test --workspace --no-fail-fast -j 1 2>&1 | tail -80`. Expected: all PASS.
+
+- [ ] **Step 3**: `cargo clippy --workspace --all-targets -- -D warnings 2>&1 | tail -30`. Expected: clean.
+
+---
+
+## Task K: version bump + gitea PR + release
+
+**Files:**
+- Modify: `Cargo.toml`
+
+- [ ] **Step 1**: Workspace `version = "0.15.0"` → `"0.16.0"`.
+
+- [ ] **Step 2**: `cargo build -p kebab-cli` to refresh Cargo.lock.
+
+- [ ] **Step 3**: Commit:
+
+```bash
+git add Cargo.toml Cargo.lock
+git commit -m "$(cat <<'EOF'
+chore: bump version 0.15.0 → 0.16.0 (p10-1d C + C++ AST chunkers)
+
+Minor bump — additive new chunker_versions code-c-ast-v1 + code-cpp-ast-v1
++ new routing langs c / cpp + new tree-sitter-c / tree-sitter-cpp workspace
+deps. P10 Tier 1 chunker family complete. No DB migration, no wire schema
+major bump.
+
+Co-Authored-By: Claude Opus 4.7 (1M context)
+EOF
+)"
+```
+
+- [ ] **Step 4**: Push branch + open gitea PR via REST API.
Title: `feat(p10-1d): C + C++ AST chunkers — P10 Tier 1 chunker family complete`.
+
+- [ ] **Step 5**: Wait for code-reviewer APPROVE → merge via gitea REST API → cut `gitea-release v0.16.0`.
+
+---
+
+## Verification matrix
+
+| Check | Command | Expected |
+|-------|---------|----------|
+| C symbol | `kebab search --code-lang c --json` | `Citation::Code.symbol = "<function name>"` |
+| C++ symbol | `kebab search --code-lang cpp --json` | `Citation::Code.symbol = "namespace::Class::method"` |
+| .h fallback | `.h` with C++ syntax → ingest | Tier 3 fallback: `chunker_version = "code-text-paragraph-v1"`, lang = c |
+| code_lang_breakdown | `kebab schema --json` | `"c": N`, `"cpp": M` |
+
+---
+
+## Risks reminder (watch for these during implementation)
+
+- **tree-sitter grammar version resolution**: Use grammars compatible with tree-sitter 0.26. Default to the latest version on crates.io.
+- **tree-sitter-cpp node-kind names**: Parse a fixture to verify that the node kinds the spec assumes (`namespace_definition`, `class_specifier`, `function_definition`, `template_declaration`, `concept_definition`, etc.) match the actual grammar.
+- **Prefix recovery for out-of-class method definitions**: For `void Foo::bar()`, the declarator is `function_declarator > qualified_identifier > namespace_identifier "Foo" + identifier "bar"`. The spec's `extract_fn_symbol` must walk this chain exactly.
+- **Operator overload**: tree-sitter-cpp represents these as `operator_name`, or as a `field_identifier` of the form "operator+". Verify with a fixture.
+- **Post-merge deviations** go in `tasks/HOTFIXES.md` as dated log entries.
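
The `.h` fallback row in the verification matrix depends on the extension routing listed in the Task H activation entry: `.h` is routed to `c`, so a header that contains C++ syntax reaches tree-sitter-c first and only then drops to the Tier 3 paragraph chunker. A minimal sketch of that mapping follows, assuming a hypothetical `lang_for_extension` helper. The plan delegates real routing to `code_lang_for_path`, whose implementation is not shown here and may differ (for example in case handling).

```rust
use std::path::Path;

/// Hypothetical helper (not the repo's `code_lang_for_path`): map a file path
/// to the routing lang named in the plan, using only the extension lists from
/// the activation entry. Extensions are matched case-sensitively here.
fn lang_for_extension(path: &str) -> Option<&'static str> {
    let ext = Path::new(path).extension()?.to_str()?;
    match ext {
        // `.h` is deliberately routed to C; C++-flavoured headers rely on fallback.
        "c" | "h" => Some("c"),
        "cpp" | "cc" | "cxx" | "hpp" | "hh" | "hxx" => Some("cpp"),
        _ => None,
    }
}
```

Under this mapping `include/widget.h` resolves to `c` even if its contents are C++, which is the case the fallback row is meant to exercise.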