PR #156 reviewer nit #2. Documents the tension between the spec body
("struct_specifier (named, top-level) → 1 unit") and the actual behavior
for the C idiom `typedef struct { ... } Foo;`: the inner struct_specifier
is anonymous, so no unit is emitted and the typedef lands in the
<top-level> glue chunk. No workaround yet; revisit if dogfooding shows
this to be a frequent pain point.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Minor bump — additive new chunker_versions code-c-ast-v1 + code-cpp-ast-v1
+ new routing langs c / cpp + new tree-sitter-c / tree-sitter-cpp workspace
deps. P10 Tier 1 chunker family complete. No DB migration, no wire schema
major bump.
Also lands the missing p10-3 try_skip_unchanged fallback-aware fix (Option
B1, 7th param) that PR #155 was supposed to ship but that never reached main
(the implementer reported commit SHA 2a39513, which does not exist in the
merged branch). The same commit extends tier3_fallback_cv to include c/cpp.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #155 (p10-3) merged WITHOUT the reviewer's required Option B1 fix —
the implementer reported a commit SHA (2a39513) that never made it to main.
Result: every reingest of a Tier 3-fallback file (non-k8s YAML, invalid
YAML, AST extractor failure) re-runs full extract + chunk + embed because
the parser/chunker version comparison can never match (stored is
code-text-paragraph-v1 / none-v1, but caller uses Tier 1/2 dispatch
values).
This commit:
1. Adds the 7th param `fallback_chunker_version: Option<&ChunkerVersion>`
to try_skip_unchanged + the stored_is_tier3_fallback detection branch
(skip parser/chunker equality, keep embedder check).
2. Threads `None` through non-code call sites (md / image / pdf).
3. Code call site computes tier3_fallback_cv covering all Tier 1/2 langs
that can fall back: rust / python / ts / js / go / java / kotlin /
yaml / dockerfile / toml / json / xml / groovy / go-mod / c / cpp
(p10-1D additions).
4. Adds tier3_yaml_fallback_reingest_is_unchanged + tier3_shell_reingest_is_unchanged
regression tests (the originally-promised PR #155 regression coverage
that also never made it to main).
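
The skip decision from item 1 can be sketched roughly as below. This is a minimal illustration, not the real try_skip_unchanged: the `Versions` struct and `can_skip` function are hypothetical stand-ins, plain strings replace ChunkerVersion, and the sketch assumes any content-hash comparison has already passed.

```rust
/// Hypothetical stand-in for the version triple stored with an ingested asset.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Versions {
    parser: String,
    chunker: String,
    embedder: String,
}

/// Returns true when a reingest can be skipped.
/// `fallback_chunker_version` is Some(..) only for code langs that may fall
/// back to Tier 3; non-code call sites (md / image / pdf) pass None.
fn can_skip(
    stored: &Versions,
    caller: &Versions,
    fallback_chunker_version: Option<&str>,
) -> bool {
    let stored_is_tier3_fallback =
        fallback_chunker_version.map_or(false, |cv| stored.chunker == cv);
    if stored_is_tier3_fallback {
        // Stored rows carry the fallback versions, so parser/chunker equality
        // against the Tier 1/2 dispatch values can never hold. Only the
        // embedder check is kept.
        return stored.embedder == caller.embedder;
    }
    stored == caller
}
```

With `None` the function degrades to the plain three-way equality, which is why threading `None` through the non-code call sites leaves their behavior unchanged.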
Smoke tests: 14 + 2 = 16 PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends 4-arm match (parser_version / chunker_version / extract / chunks)
+ allowlist + tier3_fallback_cv with "c" + "cpp" arms. C uses CAstExtractor
+ CodeCAstV1Chunker; C++ uses CppAstExtractor + CodeCppAstV1Chunker. Both
langs are Tier 3-fallback-eligible (e.g. a .h file containing C++ syntax may
fail the tree-sitter-c parse and drop to the Tier 3 paragraph fallback via
the p10-3 wrapper).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors code-go-ast-v1's chunker pattern. Snapshot test against
tests/fixtures/sample.c (function + typedef struct + typedef enum +
preprocessor) verifies symbol list + lang=c stamping.
Chunks produced (4 total):
- <top-level> glue: includes, defines, static vars, typedefs (lines 1-18)
- parse_record function (lines 20-23)
- print_record function (lines 25-27)
- main function (lines 29-33)
All chunks stamped with lang=c and chunker_version=code-c-ast-v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symbol = namespace::Class::method via recursive build_blocks. namespace_definition
pushes namespace name (anonymous → <anonymous>). nested_namespace_specifier
(outer::inner) flattens all segments and pushes them. class_specifier / struct_specifier
(named) emit class unit + recurse with class name pushed. function_definition emits
method unit; symbol resolution unpacks declarator chain (pointer_declarator /
reference_declarator → function_declarator → identifier / field_identifier /
qualified_identifier / operator_name / destructor_name).
operator_cast (conversion operators, e.g. operator bool) handled as a direct
declarator kind on function_definition. template_declaration recurses with same
prefix (template params NOT in symbol). enum_specifier + concept_definition emit
type-level units. linkage_specification (extern "C") recurses into body with same
prefix. Other top-level nodes → <top-level> glue.
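
The prefix-stack recursion described above can be sketched as follows. The `Node` enum is a hypothetical, heavily simplified stand-in for the tree-sitter node kinds (no declarator unpacking, enums, concepts, or glue), and this `build_blocks` only collects symbol strings; it illustrates the push / recurse / pop shape and is not the extractor's actual code.

```rust
/// Simplified stand-in for the tree-sitter-cpp node kinds named above.
enum Node {
    /// namespace_definition; `None` models an anonymous namespace.
    Namespace { name: Option<String>, body: Vec<Node> },
    /// Named class_specifier / struct_specifier.
    Class { name: String, body: Vec<Node> },
    /// template_declaration wrapping one declaration.
    Template(Box<Node>),
    /// function_definition with an already-resolved name.
    Function { name: String },
    /// Anything else (would go to <top-level> glue in the real extractor).
    Other,
}

/// Walks `nodes`, pushing one `a::b::c` symbol per class and function unit.
fn build_blocks(nodes: &[Node], prefix: &mut Vec<String>, out: &mut Vec<String>) {
    for node in nodes {
        match node {
            Node::Namespace { name, body } => {
                prefix.push(name.clone().unwrap_or_else(|| "<anonymous>".to_string()));
                build_blocks(body, prefix, out);
                prefix.pop();
            }
            Node::Class { name, body } => {
                prefix.push(name.clone());
                out.push(prefix.join("::")); // the class unit itself
                build_blocks(body, prefix, out);
                prefix.pop();
            }
            // Template params are not part of the symbol: recurse with the same prefix.
            Node::Template(inner) => {
                build_blocks(std::slice::from_ref(&**inner), prefix, out)
            }
            Node::Function { name } => {
                let mut parts = prefix.clone();
                parts.push(name.clone());
                out.push(parts.join("::"));
            }
            Node::Other => {}
        }
    }
}
```

A `draw` method inside a templated `Widget` class in namespace `ns` would come out as `ns::Widget::draw` under this sketch.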
All 15 unit tests pass; build and clippy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Top-level units: function_definition (symbol = fn name from declarator's
innermost identifier), struct_specifier, enum_specifier, union_specifier
(each emits 1 unit with the named identifier as symbol). Preprocessor
directives + top-level declarations group into a <top-level> glue chunk.
Empty file or zero units → <module> post-pass.
C symbol = function name only — no namespace, no class nesting (design §3.4).
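
As an illustration of "fn name from declarator's innermost identifier", the sketch below unwraps a nested declarator until it reaches the identifier. The `Declarator` enum is a hypothetical simplification of the tree-sitter-c declarator nodes, not the extractor's real types.

```rust
/// Simplified stand-in for a C declarator chain.
enum Declarator {
    /// pointer_declarator, e.g. the `*` in `char *f(void)`.
    Pointer(Box<Declarator>),
    /// function_declarator, e.g. `f(void)`.
    Function(Box<Declarator>),
    /// The identifier at the bottom of the chain.
    Identifier(String),
}

/// Follows wrapper declarators down to the innermost identifier.
fn innermost_identifier(mut d: &Declarator) -> &str {
    loop {
        match d {
            Declarator::Pointer(inner) | Declarator::Function(inner) => d = &**inner,
            Declarator::Identifier(name) => return name.as_str(),
        }
    }
}
```

For a definition shaped like `char *parse_record(...)`, the chain pointer → function → identifier resolves to `parse_record`, which is the whole C symbol since there is no namespace or class nesting.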
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Standard crate names resolved cleanly: tree-sitter-c v0.24.2 and
tree-sitter-cpp v0.23.4 are both compatible with workspace tree-sitter 0.26.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Frozen contract: single PR with code-c-ast-v1 + code-cpp-ast-v1. C symbol
= function name only (no nesting). C++ symbol = namespace::Class::method
(recursion). .h → C (design §3.5); C++ headers' parse failure picked up
by p10-3 Tier 3 fallback. tree-sitter-c + tree-sitter-cpp workspace deps,
version bump 0.15.0 → 0.16.0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Minor bump — additive new chunker_version "code-text-paragraph-v1" + new
routing lang "shell" + new Tier 1/2 → Tier 3 fallback wrapper behavior.
No DB migration, no wire schema major bump (Citation::Code.lang values
remain a free string field).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- README adds Tier 3 to the ingest row (shell + fallback) and the Mermaid
chunker enumeration; --code-lang shell admitted.
- HANDOFF flips p10-3 to ✅ (v0.15.0) and updates the one-line summary + next
candidates.
- ARCHITECTURE adds Tier 3 to the code-parser row, extends the flowchart
pcode node, and lists code_text_paragraph_v1.rs in the chunker tree.
- SMOKE adds a P10-3 walkthrough (shell + non-k8s YAML fallback) and a
verification checklist entry.
- tasks/INDEX + tasks/p10/INDEX flip p10-3 to ✅.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add p10-3 activation log entry for Tier 3 paragraph fallback chunker
(code-text-paragraph-v1) with shell direct routing and fallback wrapper
for invalid YAML / AST failures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new tests verify end-to-end Tier 3 wiring:
- tier3_shell_ingest_searchable: .sh file → --code-lang shell search →
Citation::Code { symbol: None, lang: "shell" }, chunker_version
"code-text-paragraph-v1".
- tier3_yaml_fallback_picks_up_non_k8s_yaml: docker-compose-shaped yaml
(no apiVersion/kind) triggers k8s chunker's Ok(vec![]) result, fallback
retries with Tier 3 → Citation::Code { symbol: None, lang: "yaml" } and
chunker_version "code-text-paragraph-v1".
Also fixes a bug in CodeTextParagraphV1Chunker (Task B): short paragraphs
(≤80 lines) were emitted with split_key=None, causing all paragraphs from the
same document to share the same chunk_id (UNIQUE constraint violation at
put_chunks). Fix: always use para.line_start as split_key so every paragraph
gets a distinct id regardless of size.
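
The collision can be reproduced with a toy id function. This is a sketch under assumptions: the `id_for_chunk` below is a hypothetical string-based stand-in (the real id derivation is not shown in this log), and `distinct_ids` is a helper invented for the illustration.

```rust
use std::collections::HashSet;

/// Toy id: document id + symbol, plus a `#L{k}` suffix when a split key is set.
fn id_for_chunk(doc_id: &str, symbol: Option<&str>, split_key: Option<u32>) -> String {
    let mut id = format!("{doc_id}:{}", symbol.unwrap_or(""));
    if let Some(k) = split_key {
        id.push_str(&format!("#L{k}"));
    }
    id
}

/// Counts distinct ids for paragraphs starting at `line_starts`.
/// All Tier 3 paragraphs have symbol=None, so only the split key can differ.
fn distinct_ids(line_starts: &[u32], always_use_split_key: bool) -> usize {
    line_starts
        .iter()
        .map(|&start| {
            let split_key = if always_use_split_key { Some(start) } else { None };
            id_for_chunk("doc", None, split_key)
        })
        .collect::<HashSet<_>>()
        .len()
}
```

With three short paragraphs, the pre-fix behavior (no split key) yields a single id, while keying on the paragraph's starting line yields three.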
Brings code_ingest_smoke to 14 tests (Tier 1: 9, Tier 2: 3, Tier 3: 2).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the chunks match resolves, an Ok(empty) result (Tier 2 invalid YAML
/ non-k8s YAML / similar) or Err (Tier 1 extractor / chunker failure) is
retried against CodeTextParagraphV1Chunker. On retry, chunker_version is
swapped to "code-text-paragraph-v1" and canonical.parser_version to
"none-v1" so downstream stamping + try_skip_unchanged remain consistent.
Extract failure is handled similarly — when a Tier 1 extractor errors
(e.g. tree-sitter parse failure), a synthesize_tier2_document-shaped
fallback doc is built from raw bytes and routed through Tier 3 chunker
directly (extract_fell_back guard).
The shell direct path and Tier 2 synthesize_tier2_document extract failures
are exempt from the fallback chain: the former is already Tier 3, and the
latter is a real error.
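
The retry rule for the chunks step can be sketched as below. The `Stamped` struct and `chunk_with_fallback` function are hypothetical, chunks are reduced to strings, and the exemptions (shell direct path, synthesize_tier2_document failures) and the extract_fell_back guard are deliberately not modeled.

```rust
/// Chunks plus the versions that downstream stamping would record.
#[derive(Debug, PartialEq)]
struct Stamped {
    chunks: Vec<String>,
    chunker_version: &'static str,
    parser_version: &'static str,
}

/// Keeps a non-empty primary result; otherwise retries with the Tier 3
/// chunker and swaps both versions so later skip checks stay consistent.
fn chunk_with_fallback(
    primary: Result<Vec<String>, String>,
    primary_chunker_version: &'static str,
    primary_parser_version: &'static str,
    tier3: impl FnOnce() -> Vec<String>,
) -> Stamped {
    match primary {
        Ok(chunks) if !chunks.is_empty() => Stamped {
            chunks,
            chunker_version: primary_chunker_version,
            parser_version: primary_parser_version,
        },
        // Ok(empty) (e.g. non-k8s YAML) and Err (extractor / chunker failure)
        // take the same path.
        _ => Stamped {
            chunks: tier3(),
            chunker_version: "code-text-paragraph-v1",
            parser_version: "none-v1",
        },
    }
}
```

Passing the Tier 3 chunker as a closure keeps it from running at all when the primary chunker succeeds.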
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends ingest_one_code_asset's allowlist + 4-arm match (parser_version /
chunker_version / extract / chunks) to admit code_lang "shell" and route it
to CodeTextParagraphV1Chunker. parser_version "none-v1" + synthesize_tier2_document
reused.
Tier 1/2 fallback wrapper lands in the next commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Blank-line paragraph segmentation (whitespace-only lines as boundaries,
blank lines themselves never in any chunk's range). Paragraphs > 80 lines
split into 80-line windows with 20-line overlap (stride 60), sharing the
input lang and symbol=None per spec §9.3. tier2_shared exposes a new
build_chunk_no_symbol helper so Chunk id/hash/token semantics stay
identical with Tier 1/2. Extracts build_chunk_from_span as private core
so build_chunk and build_chunk_no_symbol share mechanics without drift.
4 unit tests cover multi-paragraph shell (4 paragraphs, blank-line
boundaries verified), 200-line oversize line-window split (chunks
1-80 / 61-140 / 121-200), empty file, and lang preservation when
input is yaml.
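
The segmentation and windowing rules can be sketched as follows. This is a minimal illustration that works on line numbers only; `paragraph_ranges` and `windows` are hypothetical names, and the handling of the final short window is an assumption chosen to reproduce the 1-80 / 61-140 / 121-200 split from the unit test.

```rust
const WINDOW: usize = 80; // max lines per chunk
const STRIDE: usize = 60; // 80-line window minus 20-line overlap

/// Splits one paragraph (1-based inclusive range) into overlapping windows.
fn windows(start: usize, end: usize) -> Vec<(usize, usize)> {
    if end - start + 1 <= WINDOW {
        return vec![(start, end)];
    }
    let mut out = Vec::new();
    let mut s = start;
    loop {
        let e = (s + WINDOW - 1).min(end);
        out.push((s, e));
        if e == end {
            break;
        }
        s += STRIDE;
    }
    out
}

/// Returns 1-based inclusive line ranges, one per chunk.
/// Whitespace-only lines are boundaries and never fall inside a range.
fn paragraph_ranges(text: &str) -> Vec<(usize, usize)> {
    let mut paragraphs = Vec::new();
    let mut start: Option<usize> = None;
    let mut last = 0;
    for (i, line) in text.lines().enumerate() {
        let n = i + 1;
        if line.trim().is_empty() {
            if let Some(s) = start.take() {
                paragraphs.push((s, n - 1));
            }
        } else if start.is_none() {
            start = Some(n);
        }
        last = n;
    }
    if let Some(s) = start {
        paragraphs.push((s, last));
    }
    paragraphs
        .into_iter()
        .flat_map(|(s, e)| windows(s, e))
        .collect()
}
```

A 200-line paragraph with no blank lines yields three ranges, each overlapping its predecessor by 20 lines.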
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tier 3 chunker (next task) needs to call the same Chunk-construction helper
to keep id / hash / token-count / policy_hash semantics identical with
Tier 2. Visibility-only change; signature and body unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The p10-2 spec's risks section calls out "huge ConfigMap", but no test exercised
the line-window split branch of tier2_shared::push_chunks_with_oversize.
This adds a 256-line ConfigMap fixture (generated inline) and asserts:
- ≥2 chunks emitted (split happened),
- all chunks share symbol `ConfigMap/prod/big`,
- chunk_ids all distinct (id_for_chunk's #L{k} suffix disambiguation),
- line ranges form a contiguous partition (prev.line_end + 1 == next.line_start).
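
The partition property asserted above can be expressed in a few lines. This sketch assumes a 200-line window (the log only states that splitting starts above 200 lines), and both function names are hypothetical; it does not reproduce push_chunks_with_oversize.

```rust
/// Splits lines 1..=total_lines into back-to-back windows of at most
/// `max_lines` lines. `max_lines` must be non-zero (step_by panics on 0).
fn line_windows(total_lines: usize, max_lines: usize) -> Vec<(usize, usize)> {
    (1..=total_lines)
        .step_by(max_lines)
        .map(|start| (start, (start + max_lines - 1).min(total_lines)))
        .collect()
}

/// True when every range starts exactly one line after the previous one ends.
fn is_contiguous_partition(ranges: &[(usize, usize)]) -> bool {
    ranges.windows(2).all(|pair| pair[0].1 + 1 == pair[1].0)
}
```

Under the 200-line assumption, a 256-line fixture splits into 1-200 and 201-256. Overlapping windows such as those used by the Tier 3 paragraph chunker would fail this check by design.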
Reviewer nit #1 (PR #153 code-reviewer).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Earlier comment claimed the function "mirrors RustAstExtractor pattern" but
the two differ: RustAstExtractor joins ctx.workspace_root to handle relative
paths, while Tier 2 trusts FsSourceConnector's absolute-path invariant.
Rephrase to document the actual rationale + the Kb URI fallback.
Reviewer nit #3 (PR #153 code-reviewer).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds code-java-ast-v1 / code-kotlin-ast-v1, which had been missing since
p10-1C-JK, plus p10-2's k8s-manifest-resource-v1 / dockerfile-file-v1 /
manifest-file-v1. To keep the notation simple, code-* entries are grouped
in braces.
Reviewer nit #2 (PR #153 code-reviewer).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User-visible surface sync per the docs-split rule:
- README adds Tier 2 langs (yaml / dockerfile / toml / json / xml / groovy / go-mod) to the ingest support list and --code-lang options.
- HANDOFF flips p10-2 phase row to ✅ (v0.14.0) and updates the next-task candidates.
- ARCHITECTURE extends crates/kebab-chunk/src/ tree with k8s_manifest_resource_v1.rs / dockerfile_file_v1.rs / manifest_file_v1.rs / tier2_shared.rs, plus a Tier 2 note on the code-parser row and flowchart node.
- SMOKE adds a Tier 2 smoke walkthrough (k8s yaml + Dockerfile + Cargo.toml ingest + --code-lang search) and a P10-2 entry in the verification checklist.
- tasks/INDEX + tasks/p10/INDEX flip p10-2 to ✅ (v0.14.0).
Workspace test gate (-j 1) + clippy --workspace pass cleanly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds yaml / dockerfile / toml / json / xml / groovy / go-mod arms to the
existing 7-arm AST match. parser_version unified to "none-v1" for Tier 2.
synthesize_tier2_document builds a minimal Document (single Block::Code
with raw file text) since Tier 2 has no parse step. allowlist in
ingest_one_asset extended to admit Tier 2 langs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reads entire Dockerfile / Dockerfile.* / *.dockerfile content and emits a
single Chunk with symbol "<dockerfile>", code_lang "dockerfile", line range
1..EOF. Oversize >200 lines splits into line-windows sharing the symbol via
tier2_shared::push_chunks_with_oversize.
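
The three filename shapes listed above can be matched on the basename alone. The sketch below is an assumption about how such a check might look: `is_dockerfile` is a hypothetical helper, and case-sensitive matching is assumed rather than confirmed.

```rust
use std::path::Path;

/// True for `Dockerfile`, `Dockerfile.<suffix>`, and `<name>.dockerfile`.
fn is_dockerfile(path: &str) -> bool {
    let Some(base) = Path::new(path).file_name().and_then(|n| n.to_str()) else {
        return false;
    };
    base == "Dockerfile" || base.starts_with("Dockerfile.") || base.ends_with(".dockerfile")
}
```

Only the final path component is inspected, so a directory named `Dockerfile.d` does not cause files inside it to match.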
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Splits multi-document YAML by ^---\s*$, requires apiVersion + kind string
fields per document, emits 1 chunk per recognized k8s resource. Symbol =
<kind>/<namespace>/<name> or <kind>/<name> (cluster-scoped). Invalid YAML
returns 0 chunks (handled by p10-3 paragraph fallback). Oversize >200 lines
splits into line-windows sharing the same symbol.
tier2_shared module hosts the oversize fallback + Chunk-construction helper
mirroring code_rust_ast_v1's Chunk shape. Task E (dockerfile) and Task F
(manifest) will reuse it.
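
The document split and symbol shape can be sketched without a YAML library. Both functions are hypothetical illustrations: `split_documents` implements the `^---\s*$` separator rule with plain string checks, and `k8s_symbol` takes already-extracted fields, so the apiVersion + kind validation is not modeled.

```rust
/// Splits multi-document YAML on lines matching `^---\s*$`.
/// Documents containing only whitespace are dropped.
fn split_documents(text: &str) -> Vec<String> {
    let mut docs = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    for line in text.lines() {
        if line.trim_end() == "---" {
            docs.push(current.join("\n"));
            current.clear();
        } else {
            current.push(line);
        }
    }
    docs.push(current.join("\n"));
    docs.retain(|d| !d.trim().is_empty());
    docs
}

/// `<kind>/<namespace>/<name>`, or `<kind>/<name>` for cluster-scoped resources.
fn k8s_symbol(kind: &str, namespace: Option<&str>, name: &str) -> String {
    match namespace {
        Some(ns) => format!("{kind}/{ns}/{name}"),
        None => format!("{kind}/{name}"),
    }
}
```

A ConfigMap named `big` in namespace `prod` gets the symbol `ConfigMap/prod/big`; a cluster-scoped Namespace named `prod` gets `Namespace/prod`.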
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces 1A-1 era inline match block with a single call to
kebab_parse_code::code_lang_for_path, per design §3.5 single-source-of-truth
rule. Adds Tier 2 routing test (yaml / dockerfile / toml / json / xml /
groovy / go-mod) and preserves all non-code extension branches.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Frozen contract for the p10-2 single PR: activation of 3 chunkers, k8s
identification via apiVersion+kind, Dockerfile/manifest basename matching,
code_lang_for_path source-of-truth consolidation, frozen design §3.5 +
§10.1 deltas, and version bump 0.13.0 → 0.14.0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors code_go_ast_snapshot pattern. In-memory CanonicalDocument (no
kebab-parse-code dep — boundary §6.3).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Duplicates code-java-ast-v1, leaving the language-agnostic body unchanged.
Cross-chunker policy_hash identity is asserted against md-heading-v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>