실제로 doc count 임을 명시 (PR #161 워커 리뷰 MEDIUM 반영)
JSON schema description 은 PR-C 본체에서 'code chunk count' →
'doc count' 로 정정했으나 Rust struct field 의 rustdoc 은 같은
오기재를 그대로 carry — Gemini round 2 가 JSON schema 만 봤고
rustdoc 은 miss. 워커 둘 다 동일 finding (MEDIUM).
implementation 변경 없음 — 의미가 doc count 였던 사실이 처음부터
일관. wording 만 맞춤.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
closure of HOTFIXES 2026-05-21. C typedef-wrapped anonymous
struct/enum/union 이 typedef alias 이름으로 symbol unit 방출.
- crates/kebab-parse-code/src/c.rs: type_definition 분기 추가.
inner anonymous struct_specifier / enum_specifier / union_specifier
탐지 → declarator field 의 type_identifier 재귀 추출 → synthetic
unit (typedef alias). named inner aggregate / plain alias 는
기존대로 glue. PARSER_VERSION code-c-v1 → code-c-v2.
recover_typedef_alias + extract_typedef_alias_name helper 추가.
- crates/kebab-store-sqlite/src/store.rs: 두 helper 신규
(parser_version bump cascade 용 doc-id 기반 orphan purge).
- stale_chunk_ids_for_workspace_path_except_doc_id(workspace_path,
keep_doc_id) — sister of stale_chunk_ids_at, doc_id 기반.
- purge_document_at_workspace_path_except_doc_id(workspace_path,
keep_doc_id) — CASCADE document/chunks 제거, assets 보존.
keep_doc_id="" 가 "모든 doc 제거" 사용.
- crates/kebab-app/src/lib.rs: try_skip_unchanged 의 parser_mismatch
분기에서 purge_workspace_path_for_parser_bump 호출. helper 가
app.vector() 로 lazy 접근 + delete_by_chunk_ids + SQLite document
row 제거. Ok(None) 반환 전 cleanup 끝나서 caller 의 새 INSERT 시
idx_docs_workspace_path UNIQUE 충돌 회피.
- tests:
- c.rs unit tests 4 신규 — typedef_struct_emits_unit /
typedef_enum_emits_unit / typedef_union_emits_unit /
typedef_to_existing_type_stays_glue (negative).
- tier1_c_ingest_searchable: parser_version assertion code-c-v1 →
code-c-v2.
- 회귀: bytes-edit 경로 (asset_id 변경) 의 기존 purge_orphan_at_workspace_path
+ purge_vector_orphans_for_workspace_path 는 그대로 — 신규 분기와
공존, 기존 test 모두 PASS.
미해결 (Risks): nested typedef (typedef struct { struct {...} inner; }
Outer;) 의 inner 익명 struct 는 여전히 glue — v2 의 1차 범위는
top-level typedef alias 만.
cargo test --workspace --no-fail-fast -j 1 + clippy 통과.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #159 worker-1 review 의 LOW 가독성 nit 반영 — CLI stderr [hint]
line + --json hint shape 통합 test 가 없었음.
- search_plain_emits_short_query_hint_to_stderr — 빈 KB + 2자 query
→ stderr 가 "[hint]" + "3자 이상" 포함 확인.
- search_json_emits_hint_field_for_short_query — 동일 입력 --json
→ search_response.v1.hint 필드 set + 표준 advisory 문자열 정합.
- search_json_omits_hint_field_when_query_is_long_enough — 3자
query → hint 필드 absent (additive serializer 의 None 제외 동작).
wire_search_response 5 → 8 PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trigram tokenizer 가 snippet 단위 + 단어 경계 + BM25 raw score 분포를
모두 바꿔서 unicode61 assumption 기반의 3 test 가 regression.
- wire_search_response::search_json_truncates_with_max_tokens +
search_plain_emits_truncated_hint_to_stderr: 단일 doc + 작은
max_tokens 로는 snippet 이 짧아서 budget loop 가 trip 안 함.
다중 doc fixture (5 doc) + budget 30 token 으로 hit-pop 경로
통해 truncated=true 보장.
- fetch_integration::fetch_chunk_with_context_returns_neighbors:
fixture body 의 2-char tokens (A1/A3 등) 가 trigram 비호환으로
0-hit. apples/banana/cherry/durian/elder 5-char unique words
로 갱신, query 도 cherry 로 deterministic pin.
- eval/runner::runner_per_query_snapshot_matches_fixture: trigram
token stream 으로 BM25 raw score 변동. UPDATE_SNAPSHOTS=1 로
regenerate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3개 신규 unit tests in tests/fts.rs §7:
1. fts_trigram_korean_3char_substring_hits — Codex sqlite 3.45.1 검증
동작 5개 assert pin: raw 3자 substring hit (충돌은/발생한),
quoted phrase hit (\"해시 충돌\"/\"시 충\"), raw 해시충 0-hit (원문
미존재).
2. fts_trigram_korean_short_query_zero_hit_pinned — 2자 한국어 query
(충돌·키) 0-hit 회귀 감지. trigram 구조 변경 시 먼저 fail.
3. fts_trigram_english_substring_hits — substring recall 동작 변경
pin (token→tokenizer, to 0-hit).
검증: cargo test -p kebab-store-sqlite --test fts → 13/13 PASS
(신규 3 + 기존 10).
Step 1c (multi-token 한국어 query e.g. \"해시 충돌\") 와 Step 5
(lexical BM25 snapshot 갱신) 는 Task A5 의 build_match_string()
재설계 후 진행.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P10 dogfooding found that a k8s manifest with 2+ documents (e.g.
Deployment + Service in one file) fails to ingest:
UNIQUE constraint failed: chunks.chunk_id
Root cause: tier2_shared::push_chunks_with_oversize's non-oversize branch
hardcoded split_key = None. K8sManifestResourceV1Chunker calls it once per
resource; with split_key None every resource from the same document gets
the same id_hash (= base_policy_hash) → identical chunk_id. p10-3's
code_text_paragraph_v1 had the same bug (fixed in df3c5b8) but it calls
build_chunk_no_symbol directly — the push_chunks_with_oversize path was
never fixed.
Fix: push_chunks_with_oversize gains a base_split_key parameter for the
non-oversize single-chunk case. k8s chunker passes Some(resource.line_start)
so each resource gets a distinct chunk_id; dockerfile / manifest pass None
(1 chunk per file — no sibling collision, chunk_id stays stable).
Regression coverage: k8s_multi_doc_emits_one_chunk_per_resource now asserts
chunk_id distinctness; new integration test
tier2_k8s_multi_resource_yaml_ingests_without_collision ingests a real
2-document YAML end-to-end.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reviewer nit #3: the hand-built fixed_doc() only verified chunker 1:1
mapping. New tests invoke CppAstExtractor against tests/fixtures/sample.cpp
and snapshot the real extractor → chunker pipeline (14 blocks emitted
covering namespace::chunk::Class, ctor/dtor/operator/template/free-fn
convention, glue <top-level> blocks between units).
Adds kebab-parse-code as a dev-dep of kebab-chunk (same precedent as
kebab-parse-md). Both the existing hand-built test AND the new
extractor-driven tests are kept — the former for fast chunker-only
validation, the latter for end-to-end regression detection.
Added tests:
- code_cpp_ast_extractor_snapshot: asserts all 8 named symbol units are present
- code_cpp_ast_extractor_chunks_deterministic: chunker output is stable
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #155 (p10-3) merged WITHOUT the reviewer's required Option B1 fix —
the implementer reported a commit SHA (2a39513) that never made it to main.
Result: every reingest of a Tier 3-fallback file (non-k8s YAML, invalid
YAML, AST extractor failure) re-runs full extract + chunk + embed because
the parser/chunker version comparison can never match (stored is
code-text-paragraph-v1 / none-v1, but caller uses Tier 1/2 dispatch
values).
This commit:
1. Adds the 7th param `fallback_chunker_version: Option<&ChunkerVersion>`
to try_skip_unchanged + the stored_is_tier3_fallback detection branch
(skip parser/chunker equality, keep embedder check).
2. Threads `None` through non-code call sites (md / image / pdf).
3. Code call site computes tier3_fallback_cv covering all Tier 1/2 langs
that can fall back: rust / python / ts / js / go / java / kotlin /
yaml / dockerfile / toml / json / xml / groovy / go-mod / c / cpp
(p10-1D additions).
4. Adds tier3_yaml_fallback_reingest_is_unchanged + tier3_shell_reingest_is_unchanged
regression tests (the originally-promised PR #155 regression coverage
that also never made it to main).
Smoke tests: 14 + 2 = 16 PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends 4-arm match (parser_version / chunker_version / extract / chunks)
+ allowlist + tier3_fallback_cv with "c" + "cpp" arms. C uses CAstExtractor
+ CodeCAstV1Chunker; C++ uses CppAstExtractor + CodeCppAstV1Chunker. Both
langs are Tier 3-fallback-eligible (e.g. .h file with C++ syntax may fail
tree-sitter-c parse → Tier 3 paragraph fallback per p10-3 wrapper).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors code-go-ast-v1's chunker pattern. Snapshot test against
tests/fixtures/sample.c (function + typedef struct + typedef enum +
preprocessor) verifies symbol list + lang=c stamping.
Chunks produced (4 total):
- <top-level> glue: includes, defines, static vars, typedefs (lines 1-18)
- parse_record function (lines 20-23)
- print_record function (lines 25-27)
- main function (lines 29-33)
All chunks stamped with lang=c and chunker_version=code-c-ast-v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symbol = namespace::Class::method via recursive build_blocks. namespace_definition
pushes namespace name (anonymous → <anonymous>). nested_namespace_specifier
(outer::inner) flattens all segments and pushes them. class_specifier / struct_specifier
(named) emit class unit + recurse with class name pushed. function_definition emits
method unit; symbol resolution unpacks declarator chain (pointer_declarator /
reference_declarator → function_declarator → identifier / field_identifier /
qualified_identifier / operator_name / destructor_name).
operator_cast (conversion operators, e.g. operator bool) handled as a direct
declarator kind on function_definition. template_declaration recurses with same
prefix (template params NOT in symbol). enum_specifier + concept_definition emit
type-level units. linkage_specification (extern "C") recurses into body with same
prefix. Other top-level nodes → <top-level> glue.
All 15 unit tests pass; build and clippy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Top-level units: function_definition (symbol = fn name from declarator's
innermost identifier), struct_specifier, enum_specifier, union_specifier
(each emits 1 unit with the named identifier as symbol). Preprocessor
directives + top-level declarations group into a <top-level> glue chunk.
Empty file or zero units → <module> post-pass.
C symbol = function name only — no namespace, no class nesting (design §3.4).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Standard crate names resolved cleanly: tree-sitter-c v0.24.2 and
tree-sitter-cpp v0.23.4 are both compatible with workspace tree-sitter 0.26.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new tests verify end-to-end Tier 3 wiring:
- tier3_shell_ingest_searchable: .sh file → --code-lang shell search →
Citation::Code { symbol: None, lang: "shell" }, chunker_version
"code-text-paragraph-v1".
- tier3_yaml_fallback_picks_up_non_k8s_yaml: docker-compose-shaped yaml
(no apiVersion/kind) triggers k8s chunker's Ok(vec![]) result, fallback
retries with Tier 3 → Citation::Code { symbol: None, lang: "yaml" } and
chunker_version "code-text-paragraph-v1".
Also fixes a bug in CodeTextParagraphV1Chunker (Task B): short paragraphs
(≤80 lines) were emitted with split_key=None, causing all paragraphs from the
same document to share the same chunk_id (UNIQUE constraint violation at
put_chunks). Fix: always use para.line_start as split_key so every paragraph
gets a distinct id regardless of size.
Brings code_ingest_smoke to 14 tests (Tier 1: 9, Tier 2: 3, Tier 3: 2).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the chunks match resolves, an Ok(empty) result (Tier 2 invalid YAML
/ non-k8s YAML / similar) or Err (Tier 1 extractor / chunker failure) is
retried against CodeTextParagraphV1Chunker. On retry, chunker_version is
swapped to "code-text-paragraph-v1" and canonical.parser_version to
"none-v1" so downstream stamping + try_skip_unchanged remain consistent.
Extract failure is handled similarly — when a Tier 1 extractor errors
(e.g. tree-sitter parse failure), a synthesize_tier2_document-shaped
fallback doc is built from raw bytes and routed through Tier 3 chunker
directly (extract_fell_back guard).
shell direct path + Tier 2 extract synthesize_tier2_document failures
are exempted from the fallback chain (they ARE Tier 3 already, or the
error is real).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends ingest_one_code_asset's allowlist + 4-arm match (parser_version /
chunker_version / extract / chunks) to admit code_lang "shell" and route it
to CodeTextParagraphV1Chunker. parser_version "none-v1" + synthesize_tier2_document
reused.
Tier 1/2 fallback wrapper lands in the next commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Blank-line paragraph segmentation (whitespace-only lines as boundaries,
blank lines themselves never in any chunk's range). Paragraphs > 80 lines
split into 80-line windows with 20-line overlap (stride 60), sharing the
input lang and symbol=None per spec §9.3. tier2_shared exposes a new
build_chunk_no_symbol helper so Chunk id/hash/token semantics stay
identical with Tier 1/2. Extracts build_chunk_from_span as private core
so build_chunk and build_chunk_no_symbol share mechanics without drift.
4 unit tests cover multi-paragraph shell (4 paragraphs, blank-line
boundaries verified), 200-line oversize line-window split (chunks
1-80 / 61-140 / 121-200), empty file, and lang preservation when
input is yaml.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tier 3 chunker (next task) needs to call the same Chunk-construction helper
to keep id / hash / token-count / policy_hash semantics identical with
Tier 2. Visibility-only change; signature and body unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec p10-2 risks section calls out "거대 ConfigMap" but no test exercised
the line-window split branch of tier2_shared::push_chunks_with_oversize.
This adds a 256-line ConfigMap fixture (generated inline) and asserts:
- ≥2 chunks emitted (split happened),
- all chunks share symbol `ConfigMap/prod/big`,
- chunk_ids all distinct (id_for_chunk's #L{k} suffix disambiguation),
- line ranges form a contiguous partition (prev.line_end + 1 == next.line_start).
Reviewer nit #1 (PR #153 code-reviewer).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Earlier comment claimed the function "mirrors RustAstExtractor pattern" but
the two differ: RustAstExtractor joins ctx.workspace_root to handle relative
paths, while Tier 2 trusts FsSourceConnector's absolute-path invariant.
Rephrase to document the actual rationale + the Kb URI fallback.
Reviewer nit #3 (PR #153 code-reviewer).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds yaml / dockerfile / toml / json / xml / groovy / go-mod arms to the
existing 7-arm AST match. parser_version unified to "none-v1" for Tier 2.
synthesize_tier2_document builds a minimal Document (single Block::Code
with raw file text) since Tier 2 has no parse step. allowlist in
ingest_one_asset extended to admit Tier 2 langs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reads entire Dockerfile / Dockerfile.* / *.dockerfile content and emits a
single Chunk with symbol "<dockerfile>", code_lang "dockerfile", line range
1..EOF. Oversize >200 lines splits into line-windows sharing the symbol via
tier2_shared::push_chunks_with_oversize.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Splits multi-document YAML by ^---\s*$, requires apiVersion + kind string
fields per document, emits 1 chunk per recognized k8s resource. Symbol =
<kind>/<namespace>/<name> or <kind>/<name> (cluster-scoped). Invalid YAML
returns 0 chunks (handled by p10-3 paragraph fallback). Oversize >200 lines
splits into line-windows sharing the same symbol.
tier2_shared module hosts the oversize fallback + Chunk-construction helper
mirroring code_rust_ast_v1's Chunk shape. Task E (dockerfile) and Task F
(manifest) will reuse it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces 1A-1 era inline match block with a single call to
kebab_parse_code::code_lang_for_path, per design §3.5 single-source-of-truth
rule. Adds Tier 2 routing test (yaml / dockerfile / toml / json / xml /
groovy / go-mod) and preserves all non-code extension branches.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors code_go_ast_snapshot pattern. In-memory CanonicalDocument (no
kebab-parse-code dep — boundary §6.3).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Duplicate of code-java-ast-v1 with language-agnostic body unchanged. Cross-
chunker policy_hash identity asserted vs md-heading-v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Uses tree-sitter-kotlin-ng (bare tree-sitter-kotlin is stuck on tree-sitter
0.21-0.23, incompatible with our 0.26). Mirrors JavaAstExtractor (JVM family,
source-side package extraction + class-nesting) with Kotlin grammar quirks:
- Root is `source_file`, not `program`.
- `package_header` child is `qualified_identifier` (its slice text is the
dotted path); the bare `identifier` shape is also accepted as a fallback.
- `class_declaration` is the single node kind for `class` / `data class` /
`sealed class` / `interface` / `enum class` — distinguished only by its
`modifiers` child. Body is `class_body` for non-enum, `enum_class_body`
for enum class; neither carries a `body` field name, so the extractor
looks the body up by node kind rather than `child_by_field_name("body")`.
- `companion_object` is its own node kind (NOT object_declaration with a
modifier); its `name` field is optional, so the extractor fills in the
implicit Kotlin convention name `Companion`.
- `function_declaration` is allowed at top level (unlike Java), emitted as
`<pkg>.<fn_name>`; the same node kind nested in `class_body` becomes
`<pkg>.<...>.<Class>.<method>` via the same mod_path mechanism.
- `secondary_constructor` has no `name` field; symbol uses the enclosing
class name (Java duplication convention: `<pkg>.<...>.<Class>.<Class>`).
- Enum bodies (`enum_class_body`) are NOT recursed — `enum_entry` is not
emitted as a unit (matches Java 1차 scope).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Duplicate of code-rust-ast-v1 / code-go-ast-v1 with language-agnostic body
unchanged. Cross-chunker policy_hash identity asserted vs md-heading-v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bare tree-sitter-kotlin v0.3.8 requires tree-sitter >=0.21,<0.23 which
conflicts with the workspace's tree-sitter 0.26 (links = "tree-sitter"
is a singleton). tree-sitter-kotlin-ng v1.1.0 (from
tree-sitter-grammars/tree-sitter-kotlin) uses the tree-sitter-language
0.1 shim which is compatible with tree-sitter 0.26. Using
tree-sitter-kotlin-ng as the Kotlin grammar crate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces Go bail! arms with GoAstExtractor + CodeGoAstV1Chunker. Adds
go_file_ingests_and_searches_as_code_citation integration test — asserts
citation.lang=go, symbol=chunk.ParseDoc, code_lang=go.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Duplicate of code-rust-ast-v1 / code-{python,ts,js}-ast-v1 with language-agnostic
body unchanged. Cross-chunker policy_hash identity asserted vs md-heading-v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>