- tasks/HOTFIXES.md: 새 2026-05-24 PR-B closure entry — extractor 의
type_definition 분기, PARSER_VERSION bump, same-workspace_path
orphan purge, 사용자 영향, 잔여 nested typedef Risks.
- tasks/HOTFIXES.md: 기존 2026-05-21 typedef 항목의 Status / Next step
을 v0.17.0 closure 표현으로 갱신 (관찰 기록은 frozen 유지).
- tasks/p10/p10-1d-c-cpp-ast-chunker.md: Risks 의 typedef idiom 라인
을 closure ✅ + 잔여 nested typedef 안내로 갱신.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
closure of HOTFIXES 2026-05-21. C typedef-wrapped anonymous
struct/enum/union 이 typedef alias 이름으로 symbol unit 방출.
- crates/kebab-parse-code/src/c.rs: type_definition 분기 추가.
inner anonymous struct_specifier / enum_specifier / union_specifier
탐지 → declarator field 의 type_identifier 재귀 추출 → synthetic
unit (typedef alias). named inner aggregate / plain alias 는
기존대로 glue. PARSER_VERSION code-c-v1 → code-c-v2.
recover_typedef_alias + extract_typedef_alias_name helper 추가.
- crates/kebab-store-sqlite/src/store.rs: 두 helper 신규
(parser_version bump cascade 용 doc-id 기반 orphan purge).
- stale_chunk_ids_for_workspace_path_except_doc_id(workspace_path,
keep_doc_id) — sister of stale_chunk_ids_at, doc_id 기반.
- purge_document_at_workspace_path_except_doc_id(workspace_path,
keep_doc_id) — CASCADE document/chunks 제거, assets 보존.
keep_doc_id="" 가 "모든 doc 제거" 사용.
- crates/kebab-app/src/lib.rs: try_skip_unchanged 의 parser_mismatch
분기에서 purge_workspace_path_for_parser_bump 호출. helper 가
app.vector() 로 lazy 접근 + delete_by_chunk_ids + SQLite document
row 제거. Ok(None) 반환 전 cleanup 끝나서 caller 의 새 INSERT 시
idx_docs_workspace_path UNIQUE 충돌 회피.
- tests:
- c.rs unit tests 4 신규 — typedef_struct_emits_unit /
typedef_enum_emits_unit / typedef_union_emits_unit /
typedef_to_existing_type_stays_glue (negative).
- tier1_c_ingest_searchable: parser_version assertion code-c-v1 →
code-c-v2.
- 회귀: bytes-edit 경로 (asset_id 변경) 의 기존 purge_orphan_at_workspace_path
+ purge_vector_orphans_for_workspace_path 는 그대로 — 신규 분기와
공존, 기존 test 모두 PASS.
미해결 (Risks): nested typedef (typedef struct { struct {...} inner; }
Outer;) 의 inner 익명 struct 는 여전히 glue — v2 의 1차 범위는
top-level typedef alias 만.
cargo test --workspace --no-fail-fast -j 1 + clippy 통과.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #159 worker-1 review 의 LOW 가독성 nit 반영 — CLI stderr [hint]
line + --json hint shape 통합 test 가 없었음.
- search_plain_emits_short_query_hint_to_stderr — 빈 KB + 2자 query
→ stderr 가 "[hint]" + "3자 이상" 포함 확인.
- search_json_emits_hint_field_for_short_query — 동일 입력 --json
→ search_response.v1.hint 필드 set + 표준 advisory 문자열 정합.
- search_json_omits_hint_field_when_query_is_long_enough — 3자
query → hint 필드 absent (additive serializer 의 None 제외 동작).
wire_search_response 5 → 8 PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trigram tokenizer 가 snippet 단위 + 단어 경계 + BM25 raw score 분포를
모두 바꿔서 unicode61 assumption 기반의 3 test 가 regression.
- wire_search_response::search_json_truncates_with_max_tokens +
search_plain_emits_truncated_hint_to_stderr: 단일 doc + 작은
max_tokens 로는 snippet 이 짧아서 budget loop 가 trip 안 함.
다중 doc fixture (5 doc) + budget 30 token 으로 hit-pop 경로
통해 truncated=true 보장.
- fetch_integration::fetch_chunk_with_context_returns_neighbors:
fixture body 의 2-char tokens (A1/A3 등) 가 trigram 비호환으로
0-hit. apples/banana/cherry/durian/elder 5-char unique words
로 갱신, query 도 cherry 로 deterministic pin.
- eval/runner::runner_per_query_snapshot_matches_fixture: trigram
token stream 으로 BM25 raw score 변동. UPDATE_SNAPSHOTS=1 로
regenerate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- HOTFIXES: 새 2026-05-24 절 — v0.17.0 closure 영향 (한국어
lexical 3-gram, 영어 substring 변경, BM25 분포, 디스크 용량,
heading_path JSON 노이즈 관찰). 기존 2026-05-22 한국어 lexical
항목의 Status / Next step 을 closure 표현으로 갱신.
- HANDOFF: 머지 후 발견 deviation 절에 2026-05-24 entry +
기존 2026-05-22 항목을 closure cross-link 로 정리. P10
백로그 한국어 tokenizer 항목 ✅ v0.17.0 + "다음 task 후보"
follow-up 라인의 상태 갱신.
- README: 검색 명령 행에 trigram 동작 + hint + 디스크 용량 한 줄.
- SMOKE: 새 "한국어 trigram 검색 (v0.17.0)" 절 — 도그푸딩 query
시퀀스 (충돌은 raw / 해시 충돌 multi-token / Rust 충돌은
mixed / 충돌 2자 + stderr / --json hint 검증) + 영어 substring
동작 변경 안내.
- SKILL.md: search 절에 hint 필드 안내 한 줄 — agent 가
short query 케이스에서 같은 query 재시도 대신 사용자에게
surface 하도록.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3개 신규 unit tests in tests/fts.rs §7:
1. fts_trigram_korean_3char_substring_hits — Codex sqlite 3.45.1 검증
동작 5개 assert pin: raw 3자 substring hit (충돌은/발생한),
quoted phrase hit (\"해시 충돌\"/\"시 충\"), raw 해시충 0-hit (원문
미존재).
2. fts_trigram_korean_short_query_zero_hit_pinned — 2자 한국어 query
(충돌·키) 0-hit 회귀 감지. trigram 구조 변경 시 먼저 fail.
3. fts_trigram_english_substring_hits — substring recall 동작 변경
pin (token→tokenizer, to 0-hit).
검증: cargo test -p kebab-store-sqlite --test fts → 13/13 PASS
(신규 3 + 기존 10).
Step 1c (multi-token 한국어 query e.g. \"해시 충돌\") 와 Step 5
(lexical BM25 snapshot 갱신) 는 Task A5 의 build_match_string()
재설계 후 진행.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
도그푸딩 실 한국어 위키 문서 (hash-table.md, 4512줄 mediawiki HTML,
CC-BY-SA) 는 크기·라이선스 부담으로 직접 commit 회피. 대신 도그푸딩
query 들 (해시 충돌·충돌은·시 충·해시충·충돌) 을 모두 cover 하는 합성
fixture 작성. trigram tokenizer 의 정확한 매칭 동작 (3자 substring
hit, 2자 0-hit, raw vs quoted phrase) 검증용.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #158 code-reviewer recommendation. Records the dogfood-discovered
k8s multi-resource chunk_id collision + the deliberate decision NOT to
bump chunker_version (dogfood-only stage, single-resource k8s chunk_id
shift is benign churn). Cross-link added to p10-2 spec Risks/notes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Patch bump — bug fix only (P10 dogfood-discovered k8s multi-resource
chunk_id collision). New binary needed to resume dogfooding. No wire
schema change, no DB migration.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P10 dogfooding found that a k8s manifest with 2+ documents (e.g.
Deployment + Service in one file) fails to ingest:
UNIQUE constraint failed: chunks.chunk_id
Root cause: tier2_shared::push_chunks_with_oversize's non-oversize branch
hardcoded split_key = None. K8sManifestResourceV1Chunker calls it once per
resource; with split_key None every resource from the same document gets
the same id_hash (= base_policy_hash) → identical chunk_id. p10-3's
code_text_paragraph_v1 had the same bug (fixed in df3c5b8) but it calls
build_chunk_no_symbol directly — the push_chunks_with_oversize path was
never fixed.
Fix: push_chunks_with_oversize gains a base_split_key parameter for the
non-oversize single-chunk case. k8s chunker passes Some(resource.line_start)
so each resource gets a distinct chunk_id; dockerfile / manifest pass None
(1 chunk per file — no sibling collision, chunk_id stays stable).
Regression coverage: k8s_multi_doc_emits_one_chunk_per_resource now asserts
chunk_id distinctness; new integration test
tier2_k8s_multi_resource_yaml_ingests_without_collision ingests a real
2-document YAML end-to-end.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reviewer nit #3: the hand-built fixed_doc() only verified chunker 1:1
mapping. New tests invoke CppAstExtractor against tests/fixtures/sample.cpp
and snapshot the real extractor → chunker pipeline (14 blocks emitted
covering namespace::chunk::Class, ctor/dtor/operator/template/free-fn
convention, glue <top-level> blocks between units).
Adds kebab-parse-code as a dev-dep of kebab-chunk (same precedent as
kebab-parse-md). Both the existing hand-built test AND the new
extractor-driven tests are kept — the former for fast chunker-only
validation, the latter for end-to-end regression detection.
Added tests:
- code_cpp_ast_extractor_snapshot: asserts all 8 named symbol units are present
- code_cpp_ast_extractor_chunks_deterministic: chunker output is stable
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #156 reviewer nit #2. Documents the tension between spec body
("struct_specifier (named, top-level) → 1 unit") and the actual behavior
for the C idiom `typedef struct { ... } Foo;` — the inner struct_specifier
is anonymous, so the extractor falls into glue. Workaround: dogfood-driven
revisit if frequent pain point emerges.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Minor bump — additive new chunker_versions code-c-ast-v1 + code-cpp-ast-v1
+ new routing langs c / cpp + new tree-sitter-c / tree-sitter-cpp workspace
deps. P10 Tier 1 chunker family complete. No DB migration, no wire schema
major bump.
Also lands the missing p10-3 try_skip_unchanged fallback-aware fix (Option
B1 — 7th param) that PR #155 was supposed to ship but never made it to main
(implementer reported commit SHA 2a39513 that didn't exist in the merged
branch). Same commit extends tier3_fallback_cv to include c/cpp.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #155 (p10-3) merged WITHOUT the reviewer's required Option B1 fix —
the implementer reported a commit SHA (2a39513) that never made it to main.
Result: every reingest of a Tier 3-fallback file (non-k8s YAML, invalid
YAML, AST extractor failure) re-runs full extract + chunk + embed because
the parser/chunker version comparison can never match (stored is
code-text-paragraph-v1 / none-v1, but caller uses Tier 1/2 dispatch
values).
This commit:
1. Adds the 7th param `fallback_chunker_version: Option<&ChunkerVersion>`
to try_skip_unchanged + the stored_is_tier3_fallback detection branch
(skip parser/chunker equality, keep embedder check).
2. Threads `None` through non-code call sites (md / image / pdf).
3. Code call site computes tier3_fallback_cv covering all Tier 1/2 langs
that can fall back: rust / python / ts / js / go / java / kotlin /
yaml / dockerfile / toml / json / xml / groovy / go-mod / c / cpp
(p10-1D additions).
4. Adds tier3_yaml_fallback_reingest_is_unchanged + tier3_shell_reingest_is_unchanged
regression tests (the originally-promised PR #155 regression coverage
that also never made it to main).
Smoke tests: 14 + 2 = 16 PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends 4-arm match (parser_version / chunker_version / extract / chunks)
+ allowlist + tier3_fallback_cv with "c" + "cpp" arms. C uses CAstExtractor
+ CodeCAstV1Chunker; C++ uses CppAstExtractor + CodeCppAstV1Chunker. Both
langs are Tier 3-fallback-eligible (e.g. .h file with C++ syntax may fail
tree-sitter-c parse → Tier 3 paragraph fallback per p10-3 wrapper).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors code-go-ast-v1's chunker pattern. Snapshot test against
tests/fixtures/sample.c (function + typedef struct + typedef enum +
preprocessor) verifies symbol list + lang=c stamping.
Chunks produced (4 total):
- <top-level> glue: includes, defines, static vars, typedefs (lines 1-18)
- parse_record function (lines 20-23)
- print_record function (lines 25-27)
- main function (lines 29-33)
All chunks stamped with lang=c and chunker_version=code-c-ast-v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symbol = namespace::Class::method via recursive build_blocks. namespace_definition
pushes namespace name (anonymous → <anonymous>). nested_namespace_specifier
(outer::inner) flattens all segments and pushes them. class_specifier / struct_specifier
(named) emit class unit + recurse with class name pushed. function_definition emits
method unit; symbol resolution unpacks declarator chain (pointer_declarator /
reference_declarator → function_declarator → identifier / field_identifier /
qualified_identifier / operator_name / destructor_name).
operator_cast (conversion operators, e.g. operator bool) handled as a direct
declarator kind on function_definition. template_declaration recurses with same
prefix (template params NOT in symbol). enum_specifier + concept_definition emit
type-level units. linkage_specification (extern "C") recurses into body with same
prefix. Other top-level nodes → <top-level> glue.
All 15 unit tests pass; build and clippy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Top-level units: function_definition (symbol = fn name from declarator's
innermost identifier), struct_specifier, enum_specifier, union_specifier
(each emits 1 unit with the named identifier as symbol). Preprocessor
directives + top-level declarations group into a <top-level> glue chunk.
Empty file or zero units → <module> post-pass.
C symbol = function name only — no namespace, no class nesting (design §3.4).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Standard crate names resolved cleanly: tree-sitter-c v0.24.2 and
tree-sitter-cpp v0.23.4 are both compatible with workspace tree-sitter 0.26.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Frozen contract: single PR with code-c-ast-v1 + code-cpp-ast-v1. C symbol
= function name only (no nesting). C++ symbol = namespace::Class::method
(recursion). .h → C (design §3.5); C++ headers' parse failure picked up
by p10-3 Tier 3 fallback. tree-sitter-c + tree-sitter-cpp workspace deps,
version bump 0.15.0 → 0.16.0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Minor bump — additive new chunker_version "code-text-paragraph-v1" + new
routing lang "shell" + new Tier 1/2 → Tier 3 fallback wrapper behavior.
No DB migration, no wire schema major bump (Citation::Code.lang values
remain a free string field).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- README adds Tier 3 to the ingest row (shell + fallback) and the Mermaid
chunker enumeration; --code-lang shell admitted.
- HANDOFF flips p10-3 to ✅ (v0.15.0) and updates the 한 줄 요약 + next
candidates.
- ARCHITECTURE adds Tier 3 to the code-parser row, extends the flowchart
pcode node, and lists code_text_paragraph_v1.rs in the chunker tree.
- SMOKE adds a P10-3 walkthrough (shell + non-k8s YAML fallback) and a
verification checklist entry.
- tasks/INDEX + tasks/p10/INDEX flip p10-3 to ✅.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add p10-3 activation log entry for Tier 3 paragraph fallback chunker
(code-text-paragraph-v1) with shell direct routing and fallback wrapper
for invalid YAML / AST failures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new tests verify end-to-end Tier 3 wiring:
- tier3_shell_ingest_searchable: .sh file → --code-lang shell search →
Citation::Code { symbol: None, lang: "shell" }, chunker_version
"code-text-paragraph-v1".
- tier3_yaml_fallback_picks_up_non_k8s_yaml: docker-compose-shaped yaml
(no apiVersion/kind) triggers k8s chunker's Ok(vec![]) result, fallback
retries with Tier 3 → Citation::Code { symbol: None, lang: "yaml" } and
chunker_version "code-text-paragraph-v1".
Also fixes a bug in CodeTextParagraphV1Chunker (Task B): short paragraphs
(≤80 lines) were emitted with split_key=None, causing all paragraphs from the
same document to share the same chunk_id (UNIQUE constraint violation at
put_chunks). Fix: always use para.line_start as split_key so every paragraph
gets a distinct id regardless of size.
Brings code_ingest_smoke to 14 tests (Tier 1: 9, Tier 2: 3, Tier 3: 2).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the chunks match resolves, an Ok(empty) result (Tier 2 invalid YAML
/ non-k8s YAML / similar) or Err (Tier 1 extractor / chunker failure) is
retried against CodeTextParagraphV1Chunker. On retry, chunker_version is
swapped to "code-text-paragraph-v1" and canonical.parser_version to
"none-v1" so downstream stamping + try_skip_unchanged remain consistent.
Extract failure is handled similarly — when a Tier 1 extractor errors
(e.g. tree-sitter parse failure), a synthesize_tier2_document-shaped
fallback doc is built from raw bytes and routed through Tier 3 chunker
directly (extract_fell_back guard).
shell direct path + Tier 2 extract synthesize_tier2_document failures
are exempted from the fallback chain (they ARE Tier 3 already, or the
error is real).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends ingest_one_code_asset's allowlist + 4-arm match (parser_version /
chunker_version / extract / chunks) to admit code_lang "shell" and route it
to CodeTextParagraphV1Chunker. parser_version "none-v1" + synthesize_tier2_document
reused.
Tier 1/2 fallback wrapper lands in the next commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Blank-line paragraph segmentation (whitespace-only lines as boundaries,
blank lines themselves never in any chunk's range). Paragraphs > 80 lines
split into 80-line windows with 20-line overlap (stride 60), sharing the
input lang and symbol=None per spec §9.3. tier2_shared exposes a new
build_chunk_no_symbol helper so Chunk id/hash/token semantics stay
identical with Tier 1/2. Extracts build_chunk_from_span as private core
so build_chunk and build_chunk_no_symbol share mechanics without drift.
4 unit tests cover multi-paragraph shell (4 paragraphs, blank-line
boundaries verified), 200-line oversize line-window split (chunks
1-80 / 61-140 / 121-200), empty file, and lang preservation when
input is yaml.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tier 3 chunker (next task) needs to call the same Chunk-construction helper
to keep id / hash / token-count / policy_hash semantics identical with
Tier 2. Visibility-only change; signature and body unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec p10-2 risks section calls out "거대 ConfigMap" but no test exercised
the line-window split branch of tier2_shared::push_chunks_with_oversize.
This adds a 256-line ConfigMap fixture (generated inline) and asserts:
- ≥2 chunks emitted (split happened),
- all chunks share symbol `ConfigMap/prod/big`,
- chunk_ids all distinct (id_for_chunk's #L{k} suffix disambiguation),
- line ranges form a contiguous partition (prev.line_end + 1 == next.line_start).
Reviewer nit #1 (PR #153 code-reviewer).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>