feat(p10-1d): C + C++ AST chunkers — P10 Tier 1 chunker family complete #156

Merged
altair823 merged 13 commits from feat/p10-1d-c-cpp into main 2026-05-21 15:48:37 +00:00

Summary

  • code-c-ast-v1: .c / .h AST chunker via tree-sitter-c 0.24.2. Top-level units: function_definition (symbol = fn name), struct_specifier / enum_specifier / union_specifier (named). C symbol = function name only (design §3.4). Macros, typedefs, global variables, and similar declarations go into the <top-level> glue chunk.
  • code-cpp-ast-v1: .cpp / .cc / .cxx / .hpp / .hh / .hxx AST chunker via tree-sitter-cpp 0.23.4. namespace + class recursion (Java/Kotlin + Python hybrid). Symbol = namespace::Class::method. Anonymous namespace → <anonymous>. Constructors, destructors, operator overloads, and template functions are all handled correctly (out-of-class definitions written as a qualified_identifier also get their prefix restored). Template params are not included in the symbol.
  • .h → C mapping (design §3.5): when a .h file contains C++ syntax (namespace / template / class), the tree-sitter-c parse fails and the file is automatically picked up by the p10-3 Tier 3 fallback (code-text-paragraph-v1).
  • tier3_fallback_cv extension + missing p10-3 fix landing: PR #155 (p10-3) was merged without the reviewer-flagged try_skip_unchanged 7-param fallback-aware fix (commit 2a39513 per implementer report); that commit does not exist on main. This PR's commit 1034de2 lands the fix alongside: the try_skip_unchanged 7th param fallback_chunker_version: Option<&ChunkerVersion>, the stored_is_tier3_fallback detection branch, updates to every call site, c/cpp added to the tier3_fallback_cv match, and 2 regression tests (tier3_yaml_fallback_reingest_is_unchanged + tier3_shell_reingest_is_unchanged).
  • P10 Tier 1 chunker family complete: Rust + Python + TS + JS + Go + Java + Kotlin + C + C++ (9 langs). Tier 2 (k8s + dockerfile + manifest) and Tier 3 (paragraph fallback) are already active. Next-phase candidates: P9-5 desktop / P8 audio (on hold).
  • wire schema: additive. Citation::Code.lang gains the values c / cpp, and chunker_version gains the values code-c-ast-v1 / code-cpp-ast-v1. No changes to the schema body.
  • frozen design: 2026-04-27 §10 activation log p10-1D entry. The C/C++ rows in §3.4 / §3.5 already match this decision, so no further changes are needed.
  • version bump: 0.15.0 → 0.16.0 (additive minor).

Test plan

  • [x] cargo test --workspace --no-fail-fast -j 1 PASS (memory-conscious -j 1)
  • [x] cargo clippy --workspace --all-targets -- -D warnings clean
  • [x] kebab-chunk: code-c-ast-v1 + code-cpp-ast-v1 snapshot tests + 39 lib tests PASS
  • [x] kebab-parse-code: 15 unit tests in cpp.rs covering namespace/class/ctor/dtor/operator/template/qualified_identifier edge cases
  • [x] kebab-app code_ingest_smoke: 18 tests (16 + tier1_c_ingest_searchable + tier1_cpp_ingest_searchable)
  • [x] p10-3 fix regression: tier3_yaml_fallback_reingest_is_unchanged + tier3_shell_reingest_is_unchanged confirm Unchanged on second ingest (was broken in main per the missing 2a39513 fix)
  • [ ] post-merge dogfood: add .c / .cpp files to a multi-root KB, then check the --code-lang c / --code-lang cpp search results and the Tier 3 fallback behavior (.h with namespace)
  • [ ] post-merge gitea-release v0.16.0

Branch

feat/p10-1d-c-cpp (head: 86aa180). 12 commits A-K + 1 follow-up fix. Spec: tasks/p10/p10-1d-c-cpp-ast-chunker.md. Plan: docs/superpowers/plans/2026-05-21-p10-1d-c-cpp-ast-chunker.md.

Critical: missing p10-3 fix landed in this PR

Background: PR #155 (p10-3, merged 2026-05-21 as 7a90df1) was reviewed CHANGES_REQUESTED for a try_skip_unchanged bug: the fallback path always re-ingested. The implementer reported a fix commit (2a39513), but it never actually made it to main (likely a local-only commit that got dropped during the merge dance). The bug shipped in the v0.15.0 release.

This PR's commit 1034de2 lands the original Option B1 fix verbatim. v0.16.0 ships with both p10-1D AND the corrected p10-3 behavior. Regression tests confirm reingest = Unchanged on the fallback path.

🤖 Generated with Claude Code

altair823 added 13 commits 2026-05-21 15:38:47 +00:00
Frozen contract: single PR with code-c-ast-v1 + code-cpp-ast-v1. C symbol
= function name only (no nesting). C++ symbol = namespace::Class::method
(recursion). .h → C (design §3.5); C++ headers' parse failure picked up
by p10-3 Tier 3 fallback. tree-sitter-c + tree-sitter-cpp workspace deps,
version bump 0.15.0 → 0.16.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tasks: workspace deps / C extractor / C++ extractor / C chunker + snapshot /
C++ chunker + snapshot / ingest dispatch + tier3_fallback_cv extension /
2 smoke tests / frozen design §10 / docs sync / workspace test gate /
version bump 0.15.0 → 0.16.0 + gitea PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Standard crate names resolved cleanly: tree-sitter-c v0.24.2 and
tree-sitter-cpp v0.23.4 are both compatible with workspace tree-sitter 0.26.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Top-level units: function_definition (symbol = fn name from declarator's
innermost identifier), struct_specifier, enum_specifier, union_specifier
(each emits 1 unit with the named identifier as symbol). Preprocessor
directives + top-level declarations group into a <top-level> glue chunk.
Empty file or zero units → <module> post-pass.

C symbol = function name only — no namespace, no class nesting (design §3.4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
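The "innermost identifier" rule from the C extractor commit above can be sketched roughly as follows. This is a minimal illustration, not the extractor's code: the real implementation walks tree-sitter nodes, while the `Declarator` enum and `innermost_identifier` function here are hypothetical stand-ins for that node chain.

```rust
/// Hypothetical, simplified model of a C declarator chain.
/// (Assumption: the real extractor inspects tree-sitter nodes instead.)
#[derive(Debug)]
enum Declarator {
    /// `*inner`, e.g. the `*` in `char *name(void)`
    Pointer(Box<Declarator>),
    /// `inner(params)`, the function part of the declarator
    Function(Box<Declarator>),
    /// The bare name at the bottom of the chain
    Identifier(String),
}

/// Peels wrapper declarators until the innermost identifier is reached.
/// That identifier is what would be used as the chunk symbol.
fn innermost_identifier(decl: &Declarator) -> &str {
    match decl {
        Declarator::Pointer(inner) | Declarator::Function(inner) => {
            innermost_identifier(inner)
        }
        Declarator::Identifier(name) => name.as_str(),
    }
}
```

For a definition shaped like `struct record *parse_record(const char *line)`, the chain would be Pointer → Function → Identifier, and the symbol comes out as the plain function name with no nesting.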
Symbol = namespace::Class::method via recursive build_blocks. namespace_definition
pushes namespace name (anonymous → <anonymous>). nested_namespace_specifier
(outer::inner) flattens all segments and pushes them. class_specifier / struct_specifier
(named) emit class unit + recurse with class name pushed. function_definition emits
method unit; symbol resolution unpacks declarator chain (pointer_declarator /
reference_declarator → function_declarator → identifier / field_identifier /
qualified_identifier / operator_name / destructor_name).

operator_cast (conversion operators, e.g. operator bool) handled as a direct
declarator kind on function_definition. template_declaration recurses with same
prefix (template params NOT in symbol). enum_specifier + concept_definition emit
type-level units. linkage_specification (extern "C") recurses into body with same
prefix. Other top-level nodes → <top-level> glue.

All 15 unit tests pass; build and clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
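The namespace::Class::method convention described in the C++ extractor commit above boils down to a scope stack that is joined with `::`. The sketch below shows only that naming step; the helper names (`qualified_symbol`, `namespace_segment`, `flatten_nested_namespace`) are hypothetical and are not taken from the extractor's recursive build_blocks.

```rust
/// Joins the enclosing scope names with `::` and appends the unit's own name.
/// A `name` that is already qualified (an out-of-class definition such as
/// `Class::method`) is appended as-is, so the namespace prefix is restored.
fn qualified_symbol(scope: &[String], name: &str) -> String {
    if scope.is_empty() {
        name.to_string()
    } else {
        format!("{}::{}", scope.join("::"), name)
    }
}

/// Segment pushed for a namespace: anonymous namespaces become `<anonymous>`.
fn namespace_segment(name: Option<&str>) -> String {
    name.unwrap_or("<anonymous>").to_string()
}

/// Flattens a nested namespace specifier such as `outer::inner` into
/// one segment per namespace level.
fn flatten_nested_namespace(spec: &str) -> Vec<String> {
    spec.split("::")
        .map(|segment| segment.trim().to_string())
        .filter(|segment| !segment.is_empty())
        .collect()
}
```

Template parameters never enter the scope stack in this sketch, which matches the rule that they are not part of the symbol.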
Mirrors code-go-ast-v1's chunker pattern. Snapshot test against
tests/fixtures/sample.c (function + typedef struct + typedef enum +
preprocessor) verifies symbol list + lang=c stamping.

Chunks produced (4 total):
- <top-level> glue: includes, defines, static vars, typedefs (lines 1-18)
- parse_record function (lines 20-23)
- print_record function (lines 25-27)
- main function (lines 29-33)

All chunks stamped with lang=c and chunker_version=code-c-ast-v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Identical chunker body to code-c-ast-v1 (per-language work happens in the
CppAstExtractor, Task C). Snapshot fixture covers nested namespace + class
+ ctor/dtor + method + operator overload + template fn + free fn + top-level
main, verifying namespace::Class::method symbol convention per design §3.4.

5 chunks emitted:
- <top-level> (includes, namespace opening)
- kebab::chunk::MdHeadingV1Chunker (class unit)
- kebab::identity (template function)
- kebab::global_helper (free function in namespace)
- main (top-level main function)

Template function symbols emit without <T> parameters per spec convention.
Namespace::Class::method pattern verified. All tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends 4-arm match (parser_version / chunker_version / extract / chunks)
+ allowlist + tier3_fallback_cv with "c" + "cpp" arms. C uses CAstExtractor
+ CodeCAstV1Chunker; C++ uses CppAstExtractor + CodeCppAstV1Chunker. Both
langs are Tier 3-fallback-eligible (e.g. .h file with C++ syntax may fail
tree-sitter-c parse → Tier 3 paragraph fallback per p10-3 wrapper).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
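As a rough picture of the routing this commit extends, the sketch below maps the extensions listed in the PR summary to a routing lang and then to the chunker version a file ends up stamped with. The function names and the `ast_parse_ok` flag are assumptions made for illustration; the actual dispatch is a 4-arm match inside the ingest code, and only the two langs added here are modeled.

```rust
/// Tier 3 paragraph fallback chunker version named in this PR.
const TIER3_FALLBACK: &str = "code-text-paragraph-v1";

/// Maps a file extension (without the dot) to a routing lang.
/// Only the two langs added in this PR are covered in this sketch.
fn lang_for_extension(ext: &str) -> Option<&'static str> {
    match ext {
        "c" | "h" => Some("c"),
        "cpp" | "cc" | "cxx" | "hpp" | "hh" | "hxx" => Some("cpp"),
        _ => None,
    }
}

/// Tier 1 chunker version for a routing lang.
fn tier1_chunker_version(lang: &str) -> Option<&'static str> {
    match lang {
        "c" => Some("code-c-ast-v1"),
        "cpp" => Some("code-cpp-ast-v1"),
        _ => None,
    }
}

/// Chunker version a file would be stamped with.
/// `ast_parse_ok` is a hypothetical stand-in for the AST extractor's result.
fn effective_chunker_version(ext: &str, ast_parse_ok: bool) -> Option<&'static str> {
    let lang = lang_for_extension(ext)?;
    let tier1 = tier1_chunker_version(lang)?;
    Some(if ast_parse_ok { tier1 } else { TIER3_FALLBACK })
}
```

A `.h` file that contains C++ syntax is the motivating case: it routes to lang c, the tree-sitter-c parse fails, and the file lands on the Tier 3 paragraph fallback instead of being dropped.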
PR #155 (p10-3) merged WITHOUT the reviewer's required Option B1 fix —
the implementer reported a commit SHA (2a39513) that never made it to main.
Result: every reingest of a Tier 3-fallback file (non-k8s YAML, invalid
YAML, AST extractor failure) re-runs full extract + chunk + embed because
the parser/chunker version comparison can never match (stored is
code-text-paragraph-v1 / none-v1, but caller uses Tier 1/2 dispatch
values).

This commit:
1. Adds the 7th param `fallback_chunker_version: Option<&ChunkerVersion>`
   to try_skip_unchanged + the stored_is_tier3_fallback detection branch
   (skip parser/chunker equality, keep embedder check).
2. Threads `None` through non-code call sites (md / image / pdf).
3. Code call site computes tier3_fallback_cv covering all Tier 1/2 langs
   that can fall back: rust / python / ts / js / go / java / kotlin /
   yaml / dockerfile / toml / json / xml / groovy / go-mod / c / cpp
   (p10-1D additions).
4. Adds tier3_yaml_fallback_reingest_is_unchanged + tier3_shell_reingest_is_unchanged
   regression tests (the originally-promised PR #155 regression coverage
   that also never made it to main).

Smoke tests: 14 + 2 = 16 PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
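The version-comparison part of the fix described in the commit above can be sketched as below. This is a simplified model under stated assumptions: the `Versions` struct, the `versions_allow_skip` name, and the reduced parameter list are hypothetical, and the real try_skip_unchanged takes seven parameters and does more than compare versions. Only `fallback_chunker_version: Option<&ChunkerVersion>` and the `stored_is_tier3_fallback` branch mirror names from this PR.

```rust
/// Newtype mirroring the `ChunkerVersion` name used in this PR.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ChunkerVersion(String);

/// Hypothetical bundle of the versions recorded for one document.
struct Versions {
    parser: String,
    chunker: ChunkerVersion,
    embedder: String,
}

/// Returns true when the stored versions permit skipping a re-ingest.
fn versions_allow_skip(
    stored: &Versions,
    current: &Versions,
    fallback_chunker_version: Option<&ChunkerVersion>,
) -> bool {
    // The embedder check is kept on every path.
    if stored.embedder != current.embedder {
        return false;
    }
    // If the stored chunker is the Tier 3 fallback, the caller's Tier 1/2
    // parser and chunker versions can never match, so that check is skipped.
    let stored_is_tier3_fallback = fallback_chunker_version == Some(&stored.chunker);
    if stored_is_tier3_fallback {
        return true;
    }
    stored.parser == current.parser && stored.chunker == current.chunker
}
```

Non-code call sites would pass `None`, which leaves their behavior as a plain equality check on all three versions.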
Verifies end-to-end ingest + search + Citation::Code shape:
- tier1_c_ingest_searchable: .c file → --code-lang c search → symbol
  = function name (no nesting), lang = "c", chunker_version = "code-c-ast-v1".
- tier1_cpp_ingest_searchable: .cpp file → --code-lang cpp search →
  symbol starts with namespace::Class prefix, lang = "cpp",
  chunker_version = "code-cpp-ast-v1".

Brings code_ingest_smoke to 18 tests (Tier 1: 9 → 11, Tier 2: 3,
Tier 3: 4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P10 Tier 1 chunker family complete (Rust + Python + TS + JS + Go + Java +
Kotlin + C + C++).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P10 Tier 1 chunker family complete (Rust + Python + TS + JS + Go + Java +
Kotlin + C + C++).

- README adds C/C++ to the ingest row + --code-lang c/cpp + Mermaid brace.
- HANDOFF flips p10-1D to complete (v0.16.0), updates the one-line summary + next candidates.
- ARCHITECTURE adds C/C++ to the code-parser row, extends flowchart pcode
  node, adds chunker tree entries.
- SMOKE adds P10-1D walkthrough section + verification checklist entry.
- tasks/INDEX + tasks/p10/INDEX flip p10-1D to complete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Minor bump — additive new chunker_versions code-c-ast-v1 + code-cpp-ast-v1
+ new routing langs c / cpp + new tree-sitter-c / tree-sitter-cpp workspace
deps. P10 Tier 1 chunker family complete. No DB migration, no wire schema
major bump.

Also lands the missing p10-3 try_skip_unchanged fallback-aware fix (Option
B1 — 7th param) that PR #155 was supposed to ship but never made it to main
(implementer reported commit SHA 2a39513 that didn't exist in the merged
branch). Same commit extends tier3_fallback_cv to include c/cpp.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
altair823 merged commit 75a4207aa1 into main 2026-05-21 15:48:37 +00:00
altair823 deleted branch feat/p10-1d-c-cpp 2026-05-21 15:48:38 +00:00

Reference: altair823-org/kebab#156