Compare commits

63 Commits

Author SHA1 Message Date
0c69b9621b chore: bump version 0.17.0 → 0.17.1
v0.17.1 patch release, cut after merging the two v0.17.0 post-dogfood follow-up PRs.

- PR #162: [models.llm] request_timeout_secs config + recommended-model guide
- PR #163: guide to installing ollama without sudo + kebab ask --stream UX recommendation

Both are additive only (config field) or docs only: no breaking wire change
and no impact on existing users. Patch bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 03:34:12 +00:00
0d69d85757 Merge pull request 'docs: install ollama without sudo + recommend ask --stream (v0.17.0 post-dogfood)' (#163) from docs/ollama-install-and-stream into main
Reviewed-on: #163
2026-05-25 03:26:24 +00:00
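a67300317b docs(ollama): guide to installing without sudo + recommend ask --stream (v0.17.0 post-dogfood)
Moves two patterns used during the extended dogfooding into README + SMOKE.

(1) Install ollama into an isolated directory without sudo / systemd: download
    the tarball, unpack it into a user directory such as
    /opt/ollama/{bin,models,logs}, and separate the model location with the
    OLLAMA_MODELS env var. Useful where root access is restricted, such as
    containers, WSL2, or corporate machines. The same pattern was verified
    on the dogfooding machine under /build/cache/ollama.

(2) For models with a long cold start (8B+ or the first call), recommend
    `kebab ask --stream`: even with the same inference time, progressive
    tokens surface quickly within the 5-minute timeout limit. The p9-fb-33
    streaming path is stated explicitly as a UX recommendation.

No code changes, docs only. The same sub-bullet + bash snippet pattern is
added to both README and SMOKE.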

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 03:23:35 +00:00
abb05ebc23 Merge pull request 'feat: [models.llm] request_timeout_secs config + recommended-model guide' (#162) from feat/llm-timeout-config into main
Reviewed-on: #162
2026-05-25 03:21:19 +00:00
26fdc4f344 docs(llm-timeout): document the 0-as-disable pitfall + HOTFIXES typo + terminology cleanup
Addresses the PR #162 worker review.

- MEDIUM (W2) + LOW (W1): with reqwest, request_timeout_secs = 0 does not
  mean "disabled" but an instant timeout (every request fails immediately).
  Documented in three places (LlmCfg field rustdoc, ollama.rs module-level
  comment, README), with a recommendation to use a large finite value such
  as u64::MAX / 86400.
- NIT (W1): typo '답변이 인 5분' in the HOTFIXES 2026-05-25 entry →
  '답변이 5분' (one character removed).
- NIT (W1): the internal jargon '확장 도그푸딩' ("extended dogfooding") in
  README + HOTFIXES → unified to '후속 도그푸딩' ("follow-up dogfooding").

No change in code behavior, doc only. cargo test request_timeout 3 PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 03:14:41 +00:00
3f5e0e6e90 feat(llm): [models.llm] request_timeout_secs config + recommended-model guide
Bundles two findings from the v0.17.0 extended dogfooding (2026-05-25)
into one PR.

(1) Turns the hard-coded 300s timeout of llm.generate_stream into a config
    knob. On CPU-only machines, 8B+ models (gemma4:e4b etc.) could not
    finish the first RAG answer within 5 minutes and failed with
    `error: kb-rag: llm.generate_stream`.

    - kebab-config::LlmCfg gains the additive field
      request_timeout_secs: u64
      (#[serde(default = "default_llm_request_timeout_secs")],
      default 300). Old configs that omit the key still parse and behave
      the same.
    - env override KEBAB_MODELS_LLM_REQUEST_TIMEOUT_SECS.
    - The REQUEST_TIMEOUT constant in kebab-llm-local::ollama.rs is removed →
      OllamaLanguageModel::new builds the reqwest client with
      Duration::from_secs(llm.request_timeout_secs). The doc comment is
      updated to match.
    - 3 new unit tests: default pinned at 300 / env override / legacy
      config (field missing) backward-compat.

(2) docs: one paragraph in the README prerequisites section + the ollama
    notes in docs/SMOKE.md:
    CPU only / RAM ≤ 16 GB ⇒ ≤ 4B Q4 models recommended
    (gemma3:4b / qwen2.5:3b / phi3:mini). Advance notice of the timeout
    pattern when trying 8B+. How to use the request_timeout_secs knob.

    HOTFIXES 2026-05-25 entry: records the two changes above + open items
    (the same hard-coded 300s in kebab-parse-image OCR is out of scope and
    listed as a follow-up; stronger emphasis on the ask --stream
    recommendation comes later).

workspace cargo test -j 1 + clippy pass. The code change is backwards-compat
(additive serde field), so existing users are unaffected.
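
A minimal sketch of how the timeout could be resolved, using only the standard library. The helper name `resolve_request_timeout` and the precedence (a parseable env value wins over the config value) are assumptions for illustration; the commit itself only states the 300 default, the env override, and the `Duration::from_secs` conversion.

```rust
use std::time::Duration;

/// Default taken from the commit message (300 seconds).
const DEFAULT_LLM_REQUEST_TIMEOUT_SECS: u64 = 300;

/// Hypothetical helper (not the project's API): pick the timeout from an
/// optional config value and an optional raw env-var string.
/// Assumption: a parseable env value overrides the config value.
fn resolve_request_timeout(config_secs: Option<u64>, env_raw: Option<&str>) -> Duration {
    // An unparseable env value is ignored rather than treated as an error.
    let from_env = env_raw.and_then(|raw| raw.trim().parse::<u64>().ok());
    let secs = from_env
        .or(config_secs)
        .unwrap_or(DEFAULT_LLM_REQUEST_TIMEOUT_SECS);
    // Note: 0 yields a zero Duration, which is not "no timeout".
    Duration::from_secs(secs)
}
```

A legacy config without the key would go through `resolve_request_timeout(None, None)` and keep the 300-second behavior, while a value of 0 produces a zero duration, which matches the pitfall documented in the follow-up commit.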

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 03:01:03 +00:00
578a60e3bb docs(v0.17.0): HANDOFF — version + PR-A/B/C closure entries (R1)
- One-line summary v0.16.1 → v0.17.0 + release notes URL + one-line
  summaries of PR-A/B/C.
- "Deviations found after merge" section: adds 2026-05-24 closure entries
  for PR-B / PR-C in addition to PR-A.
- P10 round 2 follow-up line under "Next task candidates": all three items
  marked as closed in v0.17.0.
- The chunk_breakdown + C typedef items in the "P10 dogfooding backlog" are
  also marked as closed in v0.17.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 20:55:47 +00:00
64f518e08e docs(v0.17.0): HANDOFF + INDEX — v0.17.0 cut sync (R1)
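- HANDOFF: one-line summary v0.16.1 → v0.17.0, release notes URL,
  one-line summaries of all three of PR-A/B/C. Adds PR-B / PR-C closure
  entries to the "Deviations found after merge" section. Marks the three
  items under "Next task candidates" + "P10 backlog" as closed in v0.17.0.
- INDEX: new "P10 Dogfooding Feedback (v0.17.0)" subsection at the bottom
  of the P10 section, listing the 3 PR-A/B/C items (format recommended by
  Gemini round 2).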

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 20:54:39 +00:00
fa9f91ead4 chore: bump version 0.16.1 → 0.17.0
v0.17.0 release cut, after merging PR-A (Korean trigram FTS tokenizer +
lexical builder + hint surface), PR-B (C typedef alias unit + parser_version
cascade + orphan purge), and PR-C (code_lang_chunk_breakdown additive
wire field).

Breaking changes:
- V007 migration (chunks_fts unicode61 → trigram): original chunks /
  embeddings / vectors are unchanged, and the FTS shadow is backfilled
  automatically. V007 is applied immediately on the next open (no re-ingest
  needed). The kebab.sqlite file grows by roughly 2-5x, or by several
  hundred MB.
- English lexical search changes to substring matching (token also hits
  tokenization/tokenizer; recall ↑ / word boundaries ↓).
- C parser_version code-c-v1 → code-c-v2 (typedef alias extraction
  cascade). Old doc/chunks/vector entries for the same file are cleaned up
  automatically by the same-workspace_path orphan purge.

Additive (backwards-compat):
- SearchResponse.hint additive field: guidance when a query is
  trigram-incompatible, e.g. a 2-character Korean query.
- schema.v1.stats.code_lang_chunk_breakdown additive field: per-language
  distribution at chunk granularity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 20:52:14 +00:00
9ee89c2a94 Merge pull request 'feat: v0.17.0 PR-C — code_lang_chunk_breakdown additive wire field' (#161) from feat/code-lang-chunk-breakdown into main
Reviewed-on: #161
2026-05-24 20:35:28 +00:00
13a3361ba2 docs(v0.17.0/PR-C): rustdoc: state that code_lang_breakdown / repo_breakdown
are actually doc counts (addresses MEDIUM from the PR #161 worker review)

The JSON schema description was corrected in the main PR-C change from
'code chunk count' to 'doc count', but the rustdoc on the Rust struct fields
carried the same misstatement: Gemini round 2 looked only at the JSON schema
and missed the rustdoc. Both workers reported the same finding (MEDIUM).

No implementation change: the meaning was consistently a doc count from the
start. Only the wording is aligned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 20:35:01 +00:00
0def913abd feat(v0.17.0/PR-C): code_lang_chunk_breakdown additive wire field
closure of HOTFIXES 2026-05-22 "code_lang_breakdown chunk granularity"
LOW. Chunk-level companion of the existing doc-count metric.

- crates/kebab-store-sqlite/src/store.rs: code_lang_chunk_breakdown()
  method. chunks INNER JOIN documents → COUNT(c.chunk_id) GROUP BY
  metadata_json.code_lang, NULL skipped. BTreeMap<String, u32>.
  + lib unit test code_lang_chunk_breakdown_counts_chunks_not_docs
  (1 rust doc + 3 chunks → rust=3 chunks vs rust=1 doc).
- crates/kebab-app/src/schema.rs: Stats.code_lang_chunk_breakdown
  additive field + collect_stats builder.
  stats_includes_code_lang_and_repo_breakdown_fields in tests_stats_ext
  now also verifies the new field.
- docs/wire-schema/v1/schema.schema.json: spec for the new additive field
  + corrected descriptions of the existing code_lang_breakdown /
  repo_breakdown ("code chunk count" → "doc count", per Gemini round 2).
- tasks/HOTFIXES.md: 2026-05-24 PR-C closure entry.

Wire additive, no schema_version bump needed. Compatible with v0.16.x callers.
cargo test --workspace --no-fail-fast -j 1 + clippy pass.
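
The difference between the chunk-level and doc-level counts can be sketched in memory as follows. `ChunkRow` and both function names are hypothetical stand-ins for the SQL join described above, not the store's actual code; rows without a language are skipped, as in the NULL handling the commit mentions.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for one chunk row joined to its document.
struct ChunkRow<'a> {
    doc_id: &'a str,
    code_lang: Option<&'a str>,
}

/// Chunk-level breakdown: every chunk with a language counts once.
fn chunk_breakdown(rows: &[ChunkRow]) -> BTreeMap<String, u32> {
    let mut out = BTreeMap::new();
    for row in rows {
        if let Some(lang) = row.code_lang {
            *out.entry(lang.to_string()).or_insert(0) += 1;
        }
    }
    out
}

/// Doc-level breakdown: each (language, document) pair counts once.
fn doc_breakdown(rows: &[ChunkRow]) -> BTreeMap<String, u32> {
    let mut seen = BTreeSet::new();
    let mut out = BTreeMap::new();
    for row in rows {
        if let Some(lang) = row.code_lang {
            // insert() returns true only the first time a pair is seen.
            if seen.insert((lang, row.doc_id)) {
                *out.entry(lang.to_string()).or_insert(0) += 1;
            }
        }
    }
    out
}
```

With one Rust document split into three chunks, the chunk-level map reports rust=3 while the doc-level map reports rust=1, which is the distinction the new unit test pins.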

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 20:35:01 +00:00
ff9d5f5f86 Merge pull request 'feat: v0.17.0 PR-B — C typedef-wrapped struct/enum/union → typedef alias unit' (#160) from feat/c-typedef-struct-unit into main
Reviewed-on: #160
2026-05-24 20:33:15 +00:00
70a5068c0d docs(v0.17.0/PR-B/B2): HOTFIXES 2026-05-24 closure + p10-1d Risks update
- tasks/HOTFIXES.md: new 2026-05-24 PR-B closure entry covering the
  extractor's type_definition branch, the PARSER_VERSION bump, the
  same-workspace_path orphan purge, user impact, and the remaining
  nested-typedef risk.
- tasks/HOTFIXES.md: Status / Next step of the existing 2026-05-21 typedef
  entry updated to v0.17.0 closure wording (the observation record stays
  frozen).
- tasks/p10/p10-1d-c-cpp-ast-chunker.md: the typedef idiom line under Risks
  updated to closed + a note on the remaining nested typedef case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 20:32:36 +00:00
93ddece111 feat(v0.17.0/PR-B/B1): C typedef extractor + parser_version bump + orphan purge cascade
closure of HOTFIXES 2026-05-21. A C typedef-wrapped anonymous
struct/enum/union now emits a symbol unit named after the typedef alias.

- crates/kebab-parse-code/src/c.rs: adds a type_definition branch.
  Detects an inner anonymous struct_specifier / enum_specifier /
  union_specifier → recursively extracts the type_identifier from the
  declarator field → synthetic unit (typedef alias). Named inner
  aggregates / plain aliases stay glue as before.
  PARSER_VERSION code-c-v1 → code-c-v2.
  Adds the recover_typedef_alias + extract_typedef_alias_name helpers.

- crates/kebab-store-sqlite/src/store.rs: two new helpers
  (doc-id based orphan purge for the parser_version bump cascade).
  - stale_chunk_ids_for_workspace_path_except_doc_id(workspace_path,
    keep_doc_id): sister of stale_chunk_ids_at, keyed on doc_id.
  - purge_document_at_workspace_path_except_doc_id(workspace_path,
    keep_doc_id): CASCADE removal of document/chunks, assets preserved.
  keep_doc_id="" is used to mean "remove every doc".

- crates/kebab-app/src/lib.rs: the parser_mismatch branch of
  try_skip_unchanged calls purge_workspace_path_for_parser_bump. The helper
  accesses app.vector() lazily, then runs delete_by_chunk_ids and removes
  the SQLite document row. Cleanup finishes before Ok(None) is returned, so
  the caller's new INSERT avoids the idx_docs_workspace_path UNIQUE
  conflict.

- tests:
  - 4 new c.rs unit tests: typedef_struct_emits_unit /
    typedef_enum_emits_unit / typedef_union_emits_unit /
    typedef_to_existing_type_stays_glue (negative).
  - tier1_c_ingest_searchable: parser_version assertion code-c-v1 →
    code-c-v2.
- Regression: the existing purge_orphan_at_workspace_path
  + purge_vector_orphans_for_workspace_path on the bytes-edit path
  (asset_id change) are unchanged; they coexist with the new branch, and
  all existing tests PASS.

Unresolved (Risks): for a nested typedef (typedef struct { struct {...} inner; }
Outer;), the inner anonymous struct is still glue. The first scope of v2
covers top-level typedef aliases only.

cargo test --workspace --no-fail-fast -j 1 + clippy pass.
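
The "purge everything at this workspace path except one doc id" rule can be illustrated with an in-memory sketch. `DocRow` and `purge_at_path_except` are hypothetical names; the real helpers operate on SQLite rows and the vector store, which this sketch does not model.

```rust
/// Hypothetical in-memory document row; the real store is SQLite.
#[derive(Debug, Clone, PartialEq)]
struct DocRow {
    doc_id: String,
    workspace_path: String,
}

/// Remove every document at `workspace_path` whose id differs from
/// `keep_doc_id`, returning the purged ids. Because real doc ids are
/// assumed non-empty, passing "" removes every document at that path.
fn purge_at_path_except(
    docs: &mut Vec<DocRow>,
    workspace_path: &str,
    keep_doc_id: &str,
) -> Vec<String> {
    let mut purged = Vec::new();
    docs.retain(|d| {
        let stale = d.workspace_path == workspace_path && d.doc_id != keep_doc_id;
        if stale {
            purged.push(d.doc_id.clone());
        }
        !stale
    });
    purged
}
```

Running the purge before the new row is inserted is what keeps a unique index on the workspace path from rejecting the re-ingested document.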

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 20:30:57 +00:00
67559fb3ce Merge pull request 'feat: v0.17.0 Korean trigram FTS tokenizer + lexical builder + hint surface' (#159) from feat/korean-trigram-tokenizer into main
Reviewed-on: #159
2026-05-24 20:29:00 +00:00
d79e432916 test(v0.17.0/A5): CLI hint surface e2e coverage (worker-1 nit)
Addresses the LOW readability nit from the PR #159 worker-1 review: there
was no integration test for the CLI stderr [hint] line + the --json hint shape.

- search_plain_emits_short_query_hint_to_stderr: empty KB + 2-character
  query → checks that stderr contains "[hint]" + "3자 이상" ("3 or more
  characters").
- search_json_emits_hint_field_for_short_query: same input with --json
  → search_response.v1.hint field is set and matches the standard advisory
  string.
- search_json_omits_hint_field_when_query_is_long_enough: 3-character
  query → hint field absent (the additive serializer skips None).

wire_search_response 5 → 8 PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 12:45:11 +00:00
0ee18149e7 test(v0.17.0/A5 follow-up): trigram tokenizer downstream test fixes
The trigram tokenizer changes snippet granularity, word boundaries, and the
BM25 raw score distribution, so 3 tests built on unicode61 assumptions
regressed.

- wire_search_response::search_json_truncates_with_max_tokens +
  search_plain_emits_truncated_hint_to_stderr: with a single doc + a small
  max_tokens, the snippet is too short to trip the budget loop.
  A multi-doc fixture (5 docs) + a 30-token budget now guarantees
  truncated=true via the hit-pop path.
- fetch_integration::fetch_chunk_with_context_returns_neighbors:
  the 2-char tokens in the fixture body (A1/A3 etc.) are
  trigram-incompatible and returned 0 hits. Updated to the 5 unique words
  apples/banana/cherry/durian/elder (each 5+ chars), with the query pinned
  deterministically to cherry.
- eval/runner::runner_per_query_snapshot_matches_fixture: BM25 raw scores
  shift with the trigram token stream. Regenerated with UPDATE_SNAPSHOTS=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 12:21:34 +00:00
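8a68289499 docs(v0.17.0/A6): HANDOFF + HOTFIXES + README + SMOKE + SKILL: Korean trigram closure
- HOTFIXES: new 2026-05-24 section on the impact of the v0.17.0 closure
  (Korean lexical 3-gram, English substring change, BM25 distribution,
  disk usage, observed heading_path JSON noise). Status / Next step of the
  existing 2026-05-22 Korean lexical entry updated to closure wording.
- HANDOFF: 2026-05-24 entry added to the "Deviations found after merge"
  section, and the existing 2026-05-22 entry tidied into a closure
  cross-link. The Korean tokenizer item in the P10 backlog is marked closed
  in v0.17.0, and the status of the follow-up line under "Next task
  candidates" is updated.
- README: one line on trigram behavior + hint + disk usage in the search
  command row.
- SMOKE: new "한국어 trigram 검색 (v0.17.0)" ("Korean trigram search")
  section with the dogfooding query sequence (충돌은 raw / 해시 충돌
  multi-token / Rust 충돌은 mixed / 충돌 2-character + stderr / --json hint
  check) + a note on the English substring behavior change.
- SKILL.md: one line on the hint field in the search section, so that an
  agent surfaces the short-query case to the user instead of retrying the
  same query.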

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 11:54:44 +00:00
6ac7fea7b9 feat(v0.17.0/A5): trigram-aware build_match_string + SearchResponse.hint
Main body of PR-A. plan Task A4 Step 1c + A5.

- lexical.rs::build_match_string redesigned: whole-phrase + token-AND,
  OR-combined; tokens shorter than 3 characters are dropped; returns None
  when no candidate remains (avoids an empty MATCH). Raw single-quote mode
  is kept.
- SearchResponse.hint additive: the short_query_hint helper sets it for
  the empty result + trimmed < 3 chars + non-raw case.
- CLI 'kebab search' prints one [hint] line to stderr (text mode).
- TUI SearchState.short_query_hint + poll_worker stale-aware set
  + fire_search/mark_input_changed reset + shown in dynamic_status.
- docs/wire-schema/v1/search_response.schema.json hint additive.
- New unit tests (lexical 9 PASS, 2 existing expectations updated) +
  integration regression (search_korean: multi_token + mixed, 3 PASS) +
  BM25 snapshot regen (trigram token stream).
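
One possible reading of the redesigned builder, as a sketch rather than the project's actual code: the function body, the exact output format, and the `quote_fts5` helper are assumptions reconstructed from the bullet above (whole phrase OR token-AND, tokens under 3 characters dropped, None when nothing is left, single-quoted input passed through raw).

```rust
/// Quote a value as an FTS5 string literal, doubling inner double quotes.
fn quote_fts5(s: &str) -> String {
    format!("\"{}\"", s.replace('"', "\"\""))
}

/// Hypothetical reconstruction; returns None when nothing is searchable.
fn build_match_string(query: &str) -> Option<String> {
    let trimmed = query.trim();

    // Raw mode: a query fully wrapped in single quotes is passed verbatim.
    if trimmed.chars().count() >= 2 && trimmed.starts_with('\'') && trimmed.ends_with('\'') {
        // The quote is one ASCII byte, so byte slicing is safe here.
        let inner = &trimmed[1..trimmed.len() - 1];
        return if inner.is_empty() { None } else { Some(inner.to_string()) };
    }

    let mut candidates = Vec::new();

    // Whole-phrase candidate: needs at least 3 characters (one trigram).
    if trimmed.chars().count() >= 3 {
        candidates.push(quote_fts5(trimmed));
    }

    // Token-AND candidate, built only from tokens of 3+ characters.
    let all: Vec<&str> = trimmed.split_whitespace().collect();
    let kept: Vec<String> = all
        .iter()
        .filter(|t| t.chars().count() >= 3)
        .map(|t| quote_fts5(t))
        .collect();
    if all.len() > 1 && !kept.is_empty() {
        candidates.push(format!("({})", kept.join(" AND ")));
    }

    if candidates.is_empty() {
        None
    } else {
        Some(candidates.join(" OR "))
    }
}
```

Under these assumptions, "해시 충돌" keeps only the whole-phrase candidate because both tokens are 2 characters long, while the bare 2-character query "충돌" yields None, which is the case where the hint would be surfaced instead of running an empty MATCH.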

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 11:54:25 +00:00
fe123c0c6d test(A4): korean + english trigram matching at FTS level
3 new unit tests in tests/fts.rs §7:

1. fts_trigram_korean_3char_substring_hits: pins 5 asserts on behavior that
   Codex verified against sqlite 3.45.1: raw 3-character substring hit
   (충돌은/발생한), quoted phrase hit ("해시 충돌"/"시 충"), raw 해시충
   0-hit (not present in the source text).
2. fts_trigram_korean_short_query_zero_hit_pinned: regression guard for the
   0-hit on 2-character Korean queries (충돌·키). Fails first if the trigram
   structure changes.
3. fts_trigram_english_substring_hits: pins the substring recall behavior
   change (token→tokenizer, to 0-hit).

Verification: cargo test -p kebab-store-sqlite --test fts → 13/13 PASS
(3 new + 10 existing).

Step 1c (multi-token Korean queries, e.g. "해시 충돌") and Step 5
(lexical BM25 snapshot update) proceed after the build_match_string()
redesign in Task A5.
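
Why 3-character substrings hit and 2-character queries cannot is easier to see with a small model of trigram matching. This is a simplified stand-in, not SQLite's tokenizer: it works on raw characters, ignores case folding, and treats a phrase match as a plain substring check, which is an assumption made for illustration.

```rust
/// Character trigrams of `text`: the overlapping 3-character windows that
/// a trigram tokenizer would index.
fn trigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    chars.windows(3).map(|w| w.iter().collect()).collect()
}

/// Simplified match model: a query can hit only if it produces at least
/// one trigram and occurs as a substring of the document.
fn trigram_substring_hit(doc: &str, query: &str) -> bool {
    !trigrams(query).is_empty() && doc.contains(query)
}
```

In this model "token" hits a document containing "tokenizer", "to" never hits because it yields no trigram, and "해시충" misses a document that contains "해시 충돌은" because the space is part of the indexed text.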

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 00:57:37 +00:00
753b1ff5e5 task(A4-step0): synthetic korean fixture for trigram tests
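The real Korean wiki document from dogfooding (hash-table.md, 4512 lines of
mediawiki HTML, CC-BY-SA) is not committed directly because of its size and
licensing burden. Instead, a synthetic fixture covers all the dogfooding
queries (해시 충돌·충돌은·시 충·해시충·충돌). It is used to verify the exact
matching behavior of the trigram tokenizer (3-character substring hit,
2-character 0-hit, raw vs quoted phrase).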

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 00:54:30 +00:00
8dcedc4b11 feat(p10-r2): V007 trigram migration + design §5.5 + fts diff-check
Task A2 + A3 in one bundle.

New migrations/V007__fts_trigram.sql:
- DROP + recreate the chunks_fts shadow (tokenize = trigram).
- Recreate the chunks_ai/ad/au triggers (same as V002).
- Backfill INSERT from chunks: no user re-ingest needed, V007 is automatic.
- V002 is kept as-is for historical cold-upgrade replay.

design §5.5 update:
- Only the tokenize value in the verbatim block is swapped to trigram.
- Adds a prose paragraph at the top of §5.5 with the rationale for Korean
  support + trade-offs (English lexical change, BM25 distribution,
  disk ~2-10x, not contentless).

crates/kebab-store-sqlite/tests/fts.rs:
- fts_v002_matches_design_section_5_5_verbatim renamed to
  fts_v007_matches_design_section_5_5_verbatim.
- include_str! path in extract_migration_5_5_verbatim_block() changed to
  V007__fts_trigram.sql. Comments / assertion messages updated to V007.
- The V002 cold-upgrade tests (fts_v002_backfill_*) are kept as-is.

Verification: cargo test -p kebab-store-sqlite --test fts → 10/10 PASS
(including `fts_v007_matches_design_section_5_5_verbatim`).

Reflects the Codex round 1/2 findings: the contentless correction in
design §5.5 and stating the rationale for adopting the trigram tokenizer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 00:52:40 +00:00
8781c6112b task(A1): builder baseline + sqlite version + snapshot locations
Task A1 steps 1-3 done. Fills the baseline note slot of plan A5.

Key findings:
- build_match_string() (lexical.rs:177-200): trim → strip_single_quotes for
  raw FTS verbatim / otherwise whitespace split + escape_fts5_token ("..."
  + inner doubling) + space join (implicit AND).
- raw mode = a single-quote pair '...' wraps the entire trimmed query
  (lexical.rs:167).
- SQLite: rusqlite 0.32 + libsqlite3-sys 0.30.1 bundled (in-tree, SQLite
  ~3.46.x) → trigram is available.
- Snapshot: tests/lexical.rs::lexical_snapshot_run_1 + tests/hybrid.rs::
  hybrid_snapshot_run_1 (regenerate with KEBAB_UPDATE_SNAPSHOTS=1).
  The inline normalize_bm25_top_score is not numerically affected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 00:47:24 +00:00
14197b5e02 docs(p10-round-2): HANDOFF + HOTFIXES sync for v0.17.0 follow-up
Reflects the follow-up candidates from P10 dogfooding round 2 in the HANDOFF
"Next task" / "P10 backlog" sections. Aligns the round 2 entries in HOTFIXES
(Korean lexical limitation + code_lang_breakdown + ranking deferred).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 00:43:31 +00:00
584247f1ea spec+plan(v0.17.0): korean trigram tokenizer + dogfood fixes
Follow-up to P10 dogfooding round 2 (2026-05-22). Replaces the SQLite FTS5
tokenizer unicode61 with trigram to support Korean lexical search, plus 2
small bug fixes (C typedef-wrapped struct not exposed, aggregation unit of
code_lang_breakdown).

Incorporates the Codex + Gemini round 1/2/3 reviews:
- [r1] 2-character Korean query 0-hit, build_match_string() breaking on
  multi-token input, contentless → shadow, parser_version cascade,
  BM25/heading_path/disk
- [r2] same-workspace_path orphan purge (so the parser bump cascade actually
  works), trigram test examples verified on sqlite 3.45.1, recommended
  builder design (whole phrase OR)
- [r3] SMOKE scenario corrections, TUI stale hint prevention,
  search_response.v1 hint field, new purge helpers, unified single-quote raw
  mode, fixture introduced

PR layout: PR-A (trigram + builder + guidance), PR-B (C typedef + orphan
purge), PR-C (stats + wire). v0.17.0 release cut after all three merge.

design: docs/superpowers/specs/2026-05-22-korean-trigram-tokenizer-design.md
plan:   docs/superpowers/plans/2026-05-22-korean-trigram-tokenizer.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 00:43:31 +00:00
a0c0dca321 fix(dogfood): k8s multi-resource YAML chunk_id collision (#158) 2026-05-21 23:57:49 +00:00
667495ae6a docs(dogfood): HOTFIXES entry for k8s multi-resource chunk_id collision
PR #158 code-reviewer recommendation. Records the dogfood-discovered
k8s multi-resource chunk_id collision + the deliberate decision NOT to
bump chunker_version (dogfood-only stage, single-resource k8s chunk_id
shift is benign churn). Cross-link added to p10-2 spec Risks/notes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 23:57:34 +00:00
08d72a12e0 chore: bump version 0.16.0 → 0.16.1 (k8s multi-resource chunk_id fix)
Patch bump — bug fix only (P10 dogfood-discovered k8s multi-resource
chunk_id collision). New binary needed to resume dogfooding. No wire
schema change, no DB migration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 23:54:33 +00:00
1969c8e3b5 fix(dogfood): k8s multi-resource YAML chunk_id collision
P10 dogfooding found that a k8s manifest with 2+ documents (e.g.
Deployment + Service in one file) fails to ingest:
  UNIQUE constraint failed: chunks.chunk_id

Root cause: tier2_shared::push_chunks_with_oversize's non-oversize branch
hardcoded split_key = None. K8sManifestResourceV1Chunker calls it once per
resource; with split_key None every resource from the same document gets
the same id_hash (= base_policy_hash) → identical chunk_id. p10-3's
code_text_paragraph_v1 had the same bug (fixed in df3c5b8) but it calls
build_chunk_no_symbol directly — the push_chunks_with_oversize path was
never fixed.

Fix: push_chunks_with_oversize gains a base_split_key parameter for the
non-oversize single-chunk case. k8s chunker passes Some(resource.line_start)
so each resource gets a distinct chunk_id; dockerfile / manifest pass None
(1 chunk per file — no sibling collision, chunk_id stays stable).

Regression coverage: k8s_multi_doc_emits_one_chunk_per_resource now asserts
chunk_id distinctness; new integration test
tier2_k8s_multi_resource_yaml_ingests_without_collision ingests a real
2-document YAML end-to-end.
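
The collision and its fix can be sketched with a simplified id builder. The real chunk_id is derived from a hash; the string format and the `chunk_id` function below are hypothetical and only show why a missing split key makes sibling chunks identical.

```rust
/// Hypothetical id builder: the real implementation hashes its inputs,
/// a readable string is used here to make the collision visible.
fn chunk_id(doc_id: &str, base_policy_hash: &str, split_key: Option<u32>) -> String {
    match split_key {
        // With a split key (e.g. the resource's starting line), siblings differ.
        Some(key) => format!("{doc_id}:{base_policy_hash}:{key}"),
        // Without one, every chunk of the same document gets the same id.
        None => format!("{doc_id}:{base_policy_hash}"),
    }
}
```

Two resources from one manifest that both pass None produce the same id, which is what the UNIQUE constraint rejected; passing each resource's starting line keeps them distinct, while single-chunk files can keep passing None so their ids stay stable.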

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 23:49:37 +00:00
c6207d196e chore(p10-1d-followup): reviewer nit cleanup — C extractor tests + HOTFIXES + cpp snapshot (#157) 2026-05-21 22:47:38 +00:00
840c6c40a6 test(p10-1d-followup): cpp snapshot exercises actual CppAstExtractor
Reviewer nit #3: the hand-built fixed_doc() only verified chunker 1:1
mapping. New tests invoke CppAstExtractor against tests/fixtures/sample.cpp
and snapshot the real extractor → chunker pipeline (14 blocks emitted
covering namespace::chunk::Class, ctor/dtor/operator/template/free-fn
convention, glue <top-level> blocks between units).

Adds kebab-parse-code as a dev-dep of kebab-chunk (same precedent as
kebab-parse-md). Both the existing hand-built test AND the new
extractor-driven tests are kept — the former for fast chunker-only
validation, the latter for end-to-end regression detection.

Added tests:
- code_cpp_ast_extractor_snapshot: asserts all 8 named symbol units are present
- code_cpp_ast_extractor_chunks_deterministic: chunker output is stable

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 22:43:57 +00:00
b81574afa9 docs(p10-1d-followup): HOTFIXES entry — typedef-wrapped struct/enum in C falls into glue
PR #156 reviewer nit #2. Documents the tension between spec body
("struct_specifier (named, top-level) → 1 unit") and the actual behavior
for the C idiom `typedef struct { ... } Foo;` — the inner struct_specifier
is anonymous, so the extractor falls into glue. Workaround: dogfood-driven
revisit if frequent pain point emerges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 22:40:04 +00:00
6beff35a2f test(p10-1d-followup): add in-file unit tests to C AST extractor
Mirrors the cpp.rs 15-test pattern. Covers function_definition (incl.
pointer-return, static/extern/inline), struct_specifier / enum_specifier /
union_specifier (named), anonymous struct/enum/union → glue, typedef-wrapped
struct → glue (per spec risks note), preprocessor directives → glue, empty
file → <module> post-pass, preprocessor-only → <module>, mixed fn + glue →
<top-level> present, determinism (20 runs). 17 tests total.

Reviewer nit #1 (PR #156 code-reviewer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 22:39:36 +00:00
75a4207aa1 feat(p10-1d): C + C++ AST chunkers — P10 Tier 1 chunker family complete (#156) 2026-05-21 15:48:34 +00:00
86aa180ad7 chore: bump version 0.15.0 → 0.16.0 (p10-1d C + C++ AST chunkers)
Minor bump — additive new chunker_versions code-c-ast-v1 + code-cpp-ast-v1
+ new routing langs c / cpp + new tree-sitter-c / tree-sitter-cpp workspace
deps. P10 Tier 1 chunker family complete. No DB migration, no wire schema
major bump.

Also lands the missing p10-3 try_skip_unchanged fallback-aware fix (Option
B1 — 7th param) that PR #155 was supposed to ship but never made it to main
(implementer reported commit SHA 2a39513 that didn't exist in the merged
branch). Same commit extends tier3_fallback_cv to include c/cpp.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 15:38:00 +00:00
802c573c07 docs(p10-1d): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX sync
P10 Tier 1 chunker family complete (Rust + Python + TS + JS + Go + Java +
Kotlin + C + C++).

- README adds C/C++ to the ingest row + --code-lang c/cpp + Mermaid brace.
- HANDOFF flips p10-1D to done (v0.16.0), updates the one-line summary + next candidates.
- ARCHITECTURE adds C/C++ to the code-parser row, extends flowchart pcode
  node, adds chunker tree entries.
- SMOKE adds P10-1D walkthrough section + verification checklist entry.
- tasks/INDEX + tasks/p10/INDEX flip p10-1D to done.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 14:35:59 +00:00
438870ee25 docs(p10-1d): activate C + C++ in frozen design §10
P10 Tier 1 chunker family complete (Rust + Python + TS + JS + Go + Java +
Kotlin + C + C++).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 14:32:26 +00:00
192835e5bf test(p10-1d): integration smoke tests for C + C++
Verifies end-to-end ingest + search + Citation::Code shape:
- tier1_c_ingest_searchable: .c file → --code-lang c search → symbol
  = function name (no nesting), lang = "c", chunker_version = "code-c-ast-v1".
- tier1_cpp_ingest_searchable: .cpp file → --code-lang cpp search →
  symbol starts with namespace::Class prefix, lang = "cpp",
  chunker_version = "code-cpp-ast-v1".

Brings code_ingest_smoke to 18 tests (Tier 1: 9 → 11, Tier 2: 3,
Tier 3: 4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 14:31:35 +00:00
1034de25a2 fix(p10-3+p10-1d): land the missing try_skip_unchanged fallback-aware fix
PR #155 (p10-3) merged WITHOUT the reviewer's required Option B1 fix —
the implementer reported a commit SHA (2a39513) that never made it to main.
Result: every reingest of a Tier 3-fallback file (non-k8s YAML, invalid
YAML, AST extractor failure) re-runs full extract + chunk + embed because
the parser/chunker version comparison can never match (stored is
code-text-paragraph-v1 / none-v1, but caller uses Tier 1/2 dispatch
values).

This commit:
1. Adds the 7th param `fallback_chunker_version: Option<&ChunkerVersion>`
   to try_skip_unchanged + the stored_is_tier3_fallback detection branch
   (skip parser/chunker equality, keep embedder check).
2. Threads `None` through non-code call sites (md / image / pdf).
3. Code call site computes tier3_fallback_cv covering all Tier 1/2 langs
   that can fall back: rust / python / ts / js / go / java / kotlin /
   yaml / dockerfile / toml / json / xml / groovy / go-mod / c / cpp
   (p10-1D additions).
4. Adds tier3_yaml_fallback_reingest_is_unchanged + tier3_shell_reingest_is_unchanged
   regression tests (the originally-promised PR #155 regression coverage
   that also never made it to main).

Smoke tests: 14 + 2 = 16 PASS.
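
The fallback-aware skip decision described in step 1 might look roughly like the sketch below. The `Versions` struct, the function name, and the reduction to three string fields are assumptions; the real try_skip_unchanged takes seven parameters and also deals with content hashes, which are left out here.

```rust
/// Hypothetical snapshot of the versions recorded for a stored document
/// or computed for the current ingest.
struct Versions<'a> {
    parser: &'a str,
    chunker: &'a str,
    embedder: &'a str,
}

/// Decide whether a re-ingest can be skipped, given an optional Tier 3
/// fallback chunker version for languages that may fall back.
fn can_skip_unchanged(
    stored: &Versions,
    current: &Versions,
    fallback_chunker_version: Option<&str>,
) -> bool {
    // A document stored under the fallback chunker can never match the
    // Tier 1/2 dispatch values, so only the embedder is compared.
    let stored_is_tier3_fallback = fallback_chunker_version == Some(stored.chunker);
    if stored_is_tier3_fallback {
        return stored.embedder == current.embedder;
    }
    stored.parser == current.parser
        && stored.chunker == current.chunker
        && stored.embedder == current.embedder
}
```

Non-code call sites would pass None and keep the strict comparison, which matches the "threads None through" note in step 2.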

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 14:19:17 +00:00
d1560be80d feat(p10-1d): activate C + C++ in ingest_one_code_asset dispatch
Extends 4-arm match (parser_version / chunker_version / extract / chunks)
+ allowlist + tier3_fallback_cv with "c" + "cpp" arms. C uses CAstExtractor
+ CodeCAstV1Chunker; C++ uses CppAstExtractor + CodeCppAstV1Chunker. Both
langs are Tier 3-fallback-eligible (e.g. .h file with C++ syntax may fail
tree-sitter-c parse → Tier 3 paragraph fallback per p10-3 wrapper).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:56:45 +00:00
b2a2902e38 feat(p10-1d): code-cpp-ast-v1 chunker + snapshot test
Identical chunker body to code-c-ast-v1 (per-language work happens in the
CppAstExtractor, Task C). Snapshot fixture covers nested namespace + class
+ ctor/dtor + method + operator overload + template fn + free fn + top-level
main, verifying namespace::Class::method symbol convention per design §3.4.

5 chunks emitted:
- <top-level> (includes, namespace opening)
- kebab::chunk::MdHeadingV1Chunker (class unit)
- kebab::identity (template function)
- kebab::global_helper (free function in namespace)
- main (top-level main function)

Template function symbols emit without <T> parameters per spec convention.
Namespace::Class::method pattern verified. All tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:46:12 +00:00
03cd41c48f feat(p10-1d): code-c-ast-v1 chunker + snapshot test
Mirrors code-go-ast-v1's chunker pattern. Snapshot test against
tests/fixtures/sample.c (function + typedef struct + typedef enum +
preprocessor) verifies symbol list + lang=c stamping.

Chunks produced (4 total):
- <top-level> glue: includes, defines, static vars, typedefs (lines 1-18)
- parse_record function (lines 20-23)
- print_record function (lines 25-27)
- main function (lines 29-33)

All chunks stamped with lang=c and chunker_version=code-c-ast-v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:41:19 +00:00
926042049c feat(p10-1d): C++ AST extractor (tree-sitter-cpp)
Symbol = namespace::Class::method via recursive build_blocks. namespace_definition
pushes namespace name (anonymous → <anonymous>). nested_namespace_specifier
(outer::inner) flattens all segments and pushes them. class_specifier / struct_specifier
(named) emit class unit + recurse with class name pushed. function_definition emits
method unit; symbol resolution unpacks declarator chain (pointer_declarator /
reference_declarator → function_declarator → identifier / field_identifier /
qualified_identifier / operator_name / destructor_name).

operator_cast (conversion operators, e.g. operator bool) handled as a direct
declarator kind on function_definition. template_declaration recurses with same
prefix (template params NOT in symbol). enum_specifier + concept_definition emit
type-level units. linkage_specification (extern "C") recurses into body with same
prefix. Other top-level nodes → <top-level> glue.

All 15 unit tests pass; build and clippy clean.
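
The symbol convention itself reduces to joining a prefix stack with the unit's own name. The sketch below assumes the extractor keeps the enclosing namespace and class names in a slice while recursing; `qualified_symbol` is a hypothetical helper, not a function from the extractor.

```rust
/// Join the enclosing namespace/class names with the unit's own name
/// using the C++ scope separator.
fn qualified_symbol(prefix: &[&str], name: &str) -> String {
    let mut parts: Vec<&str> = prefix.to_vec();
    parts.push(name);
    parts.join("::")
}
```

A method inside a class inside nested namespaces gets every enclosing segment, a free function at file scope keeps its bare name, and an anonymous namespace would contribute the `<anonymous>` segment mentioned above.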

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:37:58 +00:00
e0a29225da feat(p10-1d): C AST extractor (tree-sitter-c)
Top-level units: function_definition (symbol = fn name from declarator's
innermost identifier), struct_specifier, enum_specifier, union_specifier
(each emits 1 unit with the named identifier as symbol). Preprocessor
directives + top-level declarations group into a <top-level> glue chunk.
Empty file or zero units → <module> post-pass.

C symbol = function name only — no namespace, no class nesting (design §3.4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:29:36 +00:00
b541567946 build(p10-1d): add tree-sitter-c + tree-sitter-cpp workspace deps
Standard crate names resolved cleanly: tree-sitter-c v0.24.2 and
tree-sitter-cpp v0.23.4 are both compatible with workspace tree-sitter 0.26.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:19:00 +00:00
a58d400abd docs(p10-1d): implementation plan (11 tasks A-K, subagent-driven)
Tasks: workspace deps / C extractor / C++ extractor / C chunker + snapshot /
C++ chunker + snapshot / ingest dispatch + tier3_fallback_cv extension /
2 smoke tests / frozen design §10 / docs sync / workspace test gate /
version bump 0.15.0 → 0.16.0 + gitea PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:15:22 +00:00
8add684ffc docs(p10-1d): task spec for C + C++ AST chunkers
Frozen contract: single PR with code-c-ast-v1 + code-cpp-ast-v1. C symbol
= function name only (no nesting). C++ symbol = namespace::Class::method
(recursion). .h → C (design §3.5); C++ headers' parse failure picked up
by p10-3 Tier 3 fallback. tree-sitter-c + tree-sitter-cpp workspace deps,
version bump 0.15.0 → 0.16.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:12:11 +00:00
7a90df1485 feat(p10-3): Tier 3 paragraph + line-window fallback chunker: shell direct + Tier 1/2 0-chunk/Err automatically picked up (#155) 2026-05-21 12:27:18 +00:00
46f408dc0f chore: bump version 0.14.0 → 0.15.0 (p10-3 Tier 3 paragraph fallback)
Minor bump — additive new chunker_version "code-text-paragraph-v1" + new
routing lang "shell" + new Tier 1/2 → Tier 3 fallback wrapper behavior.
No DB migration, no wire schema major bump (Citation::Code.lang values
remain a free string field).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 12:05:53 +00:00
49e60fb314 docs(p10-3): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX sync
- README adds Tier 3 to the ingest row (shell + fallback) and the Mermaid
  chunker enumeration; --code-lang shell admitted.
- HANDOFF flips p10-3 to ✅ (v0.15.0) and updates the one-line summary (한 줄 요약) + next
  candidates.
- ARCHITECTURE adds Tier 3 to the code-parser row, extends the flowchart
  pcode node, and lists code_text_paragraph_v1.rs in the chunker tree.
- SMOKE adds a P10-3 walkthrough (shell + non-k8s YAML fallback) and a
  verification checklist entry.
- tasks/INDEX + tasks/p10/INDEX flip p10-3 to ✅.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:43:38 +00:00
6bc7a83d3c docs(p10-3): activate Tier 3 in frozen design §10.1
Add p10-3 activation log entry for Tier 3 paragraph fallback chunker
(code-text-paragraph-v1) with shell direct routing and fallback wrapper
for invalid YAML / AST failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:39:49 +00:00
df3c5b8caf test(p10-3): integration smoke tests for Tier 3 (shell + yaml fallback)
Two new tests verify end-to-end Tier 3 wiring:
- tier3_shell_ingest_searchable: .sh file → --code-lang shell search →
  Citation::Code { symbol: None, lang: "shell" }, chunker_version
  "code-text-paragraph-v1".
- tier3_yaml_fallback_picks_up_non_k8s_yaml: docker-compose-shaped yaml
  (no apiVersion/kind) triggers k8s chunker's Ok(vec![]) result, fallback
  retries with Tier 3 → Citation::Code { symbol: None, lang: "yaml" } and
  chunker_version "code-text-paragraph-v1".

Also fixes a bug in CodeTextParagraphV1Chunker (Task B): short paragraphs
(≤80 lines) were emitted with split_key=None, causing all paragraphs from the
same document to share the same chunk_id (UNIQUE constraint violation at
put_chunks). Fix: always use para.line_start as split_key so every paragraph
gets a distinct id regardless of size.

Brings code_ingest_smoke to 14 tests (Tier 1: 9, Tier 2: 3, Tier 3: 2).
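
The collision fixed in this commit can be sketched in a few lines. This is an illustration only: `sketch_chunk_id` and `distinct_ids` are hypothetical names, and the `#L{line}` suffix is an assumed id format, not kebab's actual id derivation (which is not shown on this page).

```rust
use std::collections::HashSet;

/// Hypothetical id scheme: the document id, plus a line suffix when a
/// split key is present. Assumed format, for illustration only.
fn sketch_chunk_id(doc_id: &str, split_key: Option<usize>) -> String {
    match split_key {
        Some(line) => format!("{doc_id}#L{line}"),
        None => doc_id.to_string(),
    }
}

/// Count how many distinct ids a list of split keys produces for one document.
fn distinct_ids(doc_id: &str, keys: &[Option<usize>]) -> usize {
    keys.iter()
        .map(|k| sketch_chunk_id(doc_id, *k))
        .collect::<HashSet<_>>()
        .len()
}
```

Under this assumed scheme, three paragraphs that all carry `split_key = None` collapse to one id, which is the kind of clash a UNIQUE constraint rejects; keying each paragraph by its start line keeps the ids distinct.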

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:37:44 +00:00
5051ea7534 feat(p10-3): Tier 1/2 → Tier 3 fallback wrapper in ingest_one_code_asset
After the chunks match resolves, an Ok(empty) result (Tier 2 invalid YAML
/ non-k8s YAML / similar) or Err (Tier 1 extractor / chunker failure) is
retried against CodeTextParagraphV1Chunker. On retry, chunker_version is
swapped to "code-text-paragraph-v1" and canonical.parser_version to
"none-v1" so downstream stamping + try_skip_unchanged remain consistent.

Extract failure is handled similarly — when a Tier 1 extractor errors
(e.g. tree-sitter parse failure), a synthesize_tier2_document-shaped
fallback doc is built from raw bytes and routed through Tier 3 chunker
directly (extract_fell_back guard).

shell direct path + Tier 2 extract synthesize_tier2_document failures
are exempted from the fallback chain (they ARE Tier 3 already, or the
error is real).
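
A minimal sketch of the retry rule described in this commit, assuming chunks are plain strings and the Tier 3 chunker is passed in as a closure. `Outcome` and `with_tier3_fallback` are hypothetical names; the real `ingest_one_code_asset` is not shown on this page, and the exemptions for the shell direct path and Tier 2 synthesize failures are not modeled here.

```rust
/// Result of chunking one asset, with the versions that get stamped downstream.
#[derive(Debug, PartialEq)]
struct Outcome {
    chunks: Vec<String>,
    chunker_version: &'static str,
    parser_version: &'static str,
}

/// Keep the primary (Tier 1/2) result when it produced chunks; otherwise
/// retry with the Tier 3 chunker and swap both version strings.
fn with_tier3_fallback(
    primary: Result<Vec<String>, String>,
    primary_chunker: &'static str,
    primary_parser: &'static str,
    tier3: impl FnOnce() -> Vec<String>,
) -> Outcome {
    match primary {
        Ok(chunks) if !chunks.is_empty() => Outcome {
            chunks,
            chunker_version: primary_chunker,
            parser_version: primary_parser,
        },
        // Ok(empty) and Err both land here.
        _ => Outcome {
            chunks: tier3(),
            chunker_version: "code-text-paragraph-v1",
            parser_version: "none-v1",
        },
    }
}
```

Swapping both version strings together is what keeps the stamped metadata consistent with the chunker that actually produced the chunks.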

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:32:49 +00:00
88d7fbc182 feat(p10-3): activate shell direct routing through Tier 3 chunker
Extends ingest_one_code_asset's allowlist + 4-arm match (parser_version /
chunker_version / extract / chunks) to admit code_lang "shell" and route it
to CodeTextParagraphV1Chunker. parser_version "none-v1" + synthesize_tier2_document
reused.

Tier 1/2 fallback wrapper lands in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:28:41 +00:00
0b7d8af759 feat(p10-3): code-text-paragraph-v1 chunker — paragraph + line-window fallback
Blank-line paragraph segmentation (whitespace-only lines as boundaries,
blank lines themselves never in any chunk's range). Paragraphs > 80 lines
split into 80-line windows with 20-line overlap (stride 60), sharing the
input lang and symbol=None per spec §9.3. tier2_shared exposes a new
build_chunk_no_symbol helper so Chunk id/hash/token semantics stay
identical with Tier 1/2. Extracts build_chunk_from_span as private core
so build_chunk and build_chunk_no_symbol share mechanics without drift.

4 unit tests cover multi-paragraph shell (4 paragraphs, blank-line
boundaries verified), 200-line oversize line-window split (chunks
1-80 / 61-140 / 121-200), empty file, and lang preservation when
input is yaml.
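
The 80-line window with 20-line overlap (stride 60) can be sketched as below. `line_windows` is a hypothetical helper, not the chunker's actual code. Only the 200-line case (1-80 / 61-140 / 121-200) comes from the commit message; how a shorter tail window is handled is an assumption.

```rust
/// Split a paragraph of `total` lines into 1-based inclusive line windows.
/// Assumes `window > overlap`, so the stride is positive.
fn line_windows(total: usize, window: usize, overlap: usize) -> Vec<(usize, usize)> {
    assert!(window > overlap, "window must exceed overlap");
    if total == 0 {
        return Vec::new();
    }
    if total <= window {
        // Paragraphs at or under the window size are emitted whole.
        return vec![(1, total)];
    }
    let stride = window - overlap;
    let mut out = Vec::new();
    let mut start = 1;
    loop {
        let end = (start + window - 1).min(total);
        out.push((start, end));
        if end == total {
            break;
        }
        start += stride;
    }
    out
}
```

With `total = 200`, `window = 80`, `overlap = 20`, the starts advance 1, 61, 121 and the last window ends exactly at line 200, matching the unit test described above.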

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:22:48 +00:00
9342b9543f refactor(p10-3): expose tier2_shared::build_chunk as pub(crate)
Tier 3 chunker (next task) needs to call the same Chunk-construction helper
to keep id / hash / token-count / policy_hash semantics identical with
Tier 2. Visibility-only change; signature and body unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:17:51 +00:00
a8aa03042f docs(p10-3): implementation plan (9 tasks A-I, subagent-driven)
Tasks: tier2_shared visibility upgrade / Tier 3 chunker + 4 unit tests /
shell direct routing / Tier 1/2 fallback wrapper / 2 smoke tests / frozen
design §10.1+§10 / docs sync (6 files) / workspace test gate / version
bump 0.14.0→0.15.0 + gitea PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:16:55 +00:00
9d4a60aac5 docs(p10-3): task spec for Tier 3 paragraph + line-window fallback chunker
Frozen contract for p10-3 single PR: code-text-paragraph-v1 chunker
(blank-line paragraph split + 80-line/20-overlap line-window for oversize),
shell direct routing, Tier 1/2 fallback wrapper (0-chunk or Err → Tier 3
retry with chunker_version + parser_version swap), tier2_shared::build_chunk
pub(crate) exposure, frozen design §10.1 + §10 deltas, version bump
0.14.0 → 0.15.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:55:16 +00:00
8ce7a911ee chore(p10-2-followup): reviewer nit cleanup: Mermaid + comments + oversize test (#154) 2026-05-20 14:44:39 +00:00
75c1c7b911 test(p10-2-followup): cover tier2_shared oversize fallback with >200-line k8s ConfigMap
Spec p10-2 risks section calls out a "giant ConfigMap" (거대 ConfigMap) but no test exercised
the line-window split branch of tier2_shared::push_chunks_with_oversize.
This adds a 256-line ConfigMap fixture (generated inline) and asserts:
- ≥2 chunks emitted (split happened),
- all chunks share symbol `ConfigMap/prod/big`,
- chunk_ids all distinct (id_for_chunk's #L{k} suffix disambiguation),
- line ranges form a contiguous partition (prev.line_end + 1 == next.line_start).
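
The last assertion can be expressed as a small check over adjacent ranges. `is_contiguous_partition` is a hypothetical helper written for illustration, not the test's actual code.

```rust
/// True when each range starts exactly one line after the previous one ends
/// (1-based inclusive ranges, given in order).
fn is_contiguous_partition(ranges: &[(usize, usize)]) -> bool {
    ranges.windows(2).all(|pair| pair[0].1 + 1 == pair[1].0)
}
```

Overlapping windows such as 1-80 / 61-140 fail this check by design, so it suits a split that partitions the lines rather than one that overlaps them.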

Reviewer nit #1 (PR #153 code-reviewer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:41:16 +00:00
b5c12ecb6f docs(p10-2-followup): clarify synthesize_tier2_document path resolution comment
Earlier comment claimed the function "mirrors RustAstExtractor pattern" but
the two differ: RustAstExtractor joins ctx.workspace_root to handle relative
paths, while Tier 2 trusts FsSourceConnector's absolute-path invariant.
Rephrase to document the actual rationale + the Kb URI fallback.

Reviewer nit #3 (PR #153 code-reviewer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:39:02 +00:00
a1192ce3b2 docs(p10-2-followup): README Mermaid chunker_version list — Java/Kotlin + Tier 2
Adds code-java-ast-v1 / code-kotlin-ast-v1, which had been missing since p10-1C-JK, plus
p10-2's k8s-manifest-resource-v1 / dockerfile-file-v1 / manifest-file-v1.
The code-* entries are grouped in braces to keep the notation simple.

Reviewer nit #2 (PR #153 code-reviewer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:35:20 +00:00
65 changed files with 9057 additions and 183 deletions

69
Cargo.lock generated

@@ -4127,7 +4127,7 @@ dependencies = [
[[package]]
name = "kebab-app"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"anyhow",
"base64 0.22.1",
@@ -4172,12 +4172,13 @@ dependencies = [
[[package]]
name = "kebab-chunk"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"anyhow",
"blake3",
"kebab-core",
"kebab-normalize",
"kebab-parse-code",
"kebab-parse-md",
"serde_json",
"serde_json_canonicalizer",
@@ -4188,7 +4189,7 @@ dependencies = [
[[package]]
name = "kebab-cli"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"anyhow",
"clap",
@@ -4209,7 +4210,7 @@ dependencies = [
[[package]]
name = "kebab-config"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"anyhow",
"dirs 5.0.1",
@@ -4224,7 +4225,7 @@ dependencies = [
[[package]]
name = "kebab-core"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"anyhow",
"blake3",
@@ -4238,7 +4239,7 @@ dependencies = [
[[package]]
name = "kebab-embed"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"anyhow",
"blake3",
@@ -4252,7 +4253,7 @@ dependencies = [
[[package]]
name = "kebab-embed-local"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"anyhow",
"fastembed",
@@ -4265,7 +4266,7 @@ dependencies = [
[[package]]
name = "kebab-eval"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"anyhow",
"kebab-app",
@@ -4284,7 +4285,7 @@ dependencies = [
[[package]]
name = "kebab-llm"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"anyhow",
"kebab-core",
@@ -4293,7 +4294,7 @@ dependencies = [
[[package]]
name = "kebab-llm-local"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"anyhow",
"kebab-config",
@@ -4310,7 +4311,7 @@ dependencies = [
[[package]]
name = "kebab-mcp"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"anyhow",
"kebab-app",
@@ -4328,7 +4329,7 @@ dependencies = [
[[package]]
name = "kebab-normalize"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"anyhow",
"kebab-core",
@@ -4343,7 +4344,7 @@ dependencies = [
[[package]]
name = "kebab-parse-code"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"anyhow",
"gix",
@@ -4353,6 +4354,8 @@ dependencies = [
"time",
"tracing",
"tree-sitter",
"tree-sitter-c",
"tree-sitter-cpp",
"tree-sitter-go",
"tree-sitter-java",
"tree-sitter-javascript",
@@ -4364,7 +4367,7 @@ dependencies = [
[[package]]
name = "kebab-parse-image"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"ab_glyph",
"anyhow",
@@ -4388,7 +4391,7 @@ dependencies = [
[[package]]
name = "kebab-parse-md"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"anyhow",
"kebab-core",
@@ -4405,7 +4408,7 @@ dependencies = [
[[package]]
name = "kebab-parse-pdf"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"anyhow",
"blake3",
@@ -4418,7 +4421,7 @@ dependencies = [
[[package]]
name = "kebab-parse-types"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"kebab-core",
"serde",
@@ -4426,7 +4429,7 @@ dependencies = [
[[package]]
name = "kebab-rag"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"anyhow",
"blake3",
@@ -4447,7 +4450,7 @@ dependencies = [
[[package]]
name = "kebab-search"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"anyhow",
"globset",
@@ -4466,7 +4469,7 @@ dependencies = [
[[package]]
name = "kebab-source-fs"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"anyhow",
"blake3",
@@ -4485,7 +4488,7 @@ dependencies = [
[[package]]
name = "kebab-store-sqlite"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"anyhow",
"blake3",
@@ -4506,7 +4509,7 @@ dependencies = [
[[package]]
name = "kebab-store-vector"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"anyhow",
"arrow",
@@ -4530,7 +4533,7 @@ dependencies = [
[[package]]
name = "kebab-tui"
version = "0.14.0"
version = "0.17.1"
dependencies = [
"anyhow",
"crossterm",
@@ -8531,6 +8534,26 @@ dependencies = [
"tree-sitter-language",
]
[[package]]
name = "tree-sitter-c"
version = "0.24.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a9b2eb57a55fed6b00812912e730b7a275cf4fe98bfd6a5d76263d4438371728"
dependencies = [
"cc",
"tree-sitter-language",
]
[[package]]
name = "tree-sitter-cpp"
version = "0.23.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "df2196ea9d47b4ab4a31b9297eaa5a5d19a0b121dceb9f118f6790ad0ab94743"
dependencies = [
"cc",
"tree-sitter-language",
]
[[package]]
name = "tree-sitter-go"
version = "0.25.0"


@@ -31,7 +31,7 @@ edition = "2024"
rust-version = "1.85"
license = "MIT OR Apache-2.0"
repository = "https://github.com/altair823/kebab"
version = "0.14.0"
version = "0.17.1"
[workspace.dependencies]
anyhow = "1"
@@ -99,6 +99,9 @@ tree-sitter-go = "0.25.0"
# JVM family grammars for code ingest (kebab-parse-code, p10-1C-JK).
tree-sitter-java = "0.23.5"
tree-sitter-kotlin-ng = "1.1.0" # bare tree-sitter-kotlin requires ts <0.23; -ng uses tree-sitter-language 0.1 (ts 0.26 compat)
# C/C++ family grammars for code ingest (kebab-parse-code, p10-1D).
tree-sitter-c = "0.24.2"
tree-sitter-cpp = "0.23.4"
# Disk-footprint trim for dev / test builds. Codegen, opt-level, and
# behavior are unchanged — only DWARF debug info is reduced (line


@@ -4,7 +4,7 @@
## One-line summary
P0~P5 + P6 + P7 + P9-1/2/3/4 (Library / Search / Ask / Inspect) merged. `kebab ingest` handles markdown / image / PDF / source code (Rust / Python / TS / JS / Go / Java / Kotlin) / Tier 2 resource files (yaml/k8s / dockerfile / toml / json / xml / groovy / go-mod). `kebab search` / `kebab ask` return results across media with page / code citations. `kebab tui` provides 4 panels (Library + Search + Ask + Inspect). P10-2 (Tier 2 resource-aware) is done; next candidates = P10-1D (C/C++), P10-3 (Tier 3 fallback), P9-5 (desktop tauri), or the on-hold P8 (audio).
P0~P5 + P6 + P7 + P9-1/2/3/4 (Library / Search / Ask / Inspect) + all of P10 merged (currently **v0.17.0**). `kebab ingest` handles markdown / image / PDF / source code (Rust / Python / TS / JS / Go / Java / Kotlin / C / C++) / Tier 2 resource files (yaml/k8s / dockerfile / toml / json / xml / groovy / go-mod) + Tier 3 paragraph fallback (shell / non-k8s YAML / AST failure cases). `kebab search` / `kebab ask` return results across media with page / code citations. `kebab tui` provides 4 panels (Library + Search + Ask + Inspect). **v0.17.0 cut (2026-05-24)**: Korean trigram FTS5 tokenizer (PR #159) + C typedef alias unit (PR #160) + `code_lang_chunk_breakdown` additive (PR #161); for the detailed user impact see the [release notes](https://gitea.altair823.xyz/altair823-org/kebab/releases/tag/v0.17.0). Structurally, the only remaining component is P9-5 (desktop tauri); P8 (audio) is on hold at the user's request.
## Phase roadmap
@@ -20,7 +20,7 @@ P0~P5 + P6 + P7 + P9-1/2/3/4 (Library / Search / Ask / Inspect) merged.
| **P7** | PDF text + page citation | `kebab-parse-pdf` | P5 | ✅ done (3/3 components, page-level chunker + ingest wiring) |
| **P8** | audio transcription + timestamp citation | `kebab-parse-audio` | P5 | ⏸ on hold (the whisper-rs system dep needs a brainstorm) |
| **P9** | TUI + desktop app | `kebab-tui`, `kebab-desktop` | P5 | 🟡 in progress (4/5 components: P9-1/2/3/4 done [Library / Search / Ask / Inspect], P9-5 desktop planned · dogfooding feedback **20/20 ✅**) |
| **P10** | code ingest framework | `kebab-parse-code` | P5 | 🟡 in progress: 1A-1 ✅ (wire schema + parse-code skeleton + filter flags), 1A-2 ✅ (Rust AST chunker, `code-rust-ast-v1`, v0.7.0), 1B ✅ (Python/TS/JS AST chunkers, v0.8.0 and later), **1C-Go ✅ (Go AST chunker, `code-go-ast-v1`, v0.12.0)**, **1C-JavaKotlin ✅ (Java + Kotlin AST chunkers, `code-java-ast-v1` / `code-kotlin-ast-v1`, v0.13.0)**, **2 ✅ (Tier 2 resource-aware: yaml/k8s + dockerfile + manifest, `k8s-manifest-resource-v1` / `dockerfile-file-v1` / `manifest-file-v1`, v0.14.0)** |
| **P10** | code ingest framework | `kebab-parse-code` | P5 | 🟡 in progress: 1A-1 ✅ (wire schema + parse-code skeleton + filter flags), 1A-2 ✅ (Rust AST chunker, `code-rust-ast-v1`, v0.7.0), 1B ✅ (Python/TS/JS AST chunkers, v0.8.0 and later), **1C-Go ✅ (Go AST chunker, `code-go-ast-v1`, v0.12.0)**, **1C-JavaKotlin ✅ (Java + Kotlin AST chunkers, `code-java-ast-v1` / `code-kotlin-ast-v1`, v0.13.0)**, **2 ✅ (Tier 2 resource-aware: yaml/k8s + dockerfile + manifest, `k8s-manifest-resource-v1` / `dockerfile-file-v1` / `manifest-file-v1`, v0.14.0)**, **3 ✅ (Tier 3 paragraph fallback: code-text-paragraph-v1, v0.15.0)**, **1D ✅ (C + C++ AST chunkers, code-c-ast-v1 + code-cpp-ast-v1, v0.16.0)** |
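P0~P5 are serial. P6~P9 can run in parallel after P5.
@@ -32,6 +32,10 @@ P0~P5 are serial. P6~P9 can run in parallel after P5.
The dated log of every deviation / hotfix found after merge is in [tasks/HOTFIXES.md](tasks/HOTFIXES.md). This summary lists only the items that "save whoever takes over a lot of time if they know them up front":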
- **2026-05-24 v0.17.0 PR-C `code_lang_chunk_breakdown` additive (closure of 2026-05-22 LOW)**: a new key in `schema.v1.stats` that aggregates chunk counts. It is the sister of the existing `code_lang_breakdown` (doc count). The JSON schema descriptions of the two existing fields, which wrongly said "chunk count", are also corrected to "doc count". Wire additive; no schema_version bump needed. Details: `tasks/HOTFIXES.md` (2026-05-24 PR-C).
- **2026-05-24 v0.17.0 PR-B C typedef alias unit (closure of 2026-05-21)**: the `type_definition` branch of `kebab-parse-code::c::extract_blocks` emits a synthetic unit for an inner anonymous struct/enum/union, named after the declarator's typedef alias. `PARSER_VERSION code-c-v1` → `code-c-v2` bump + a `purge_workspace_path_for_parser_bump` cascade for the same-asset/different-doc_id case (new helpers `stale_chunk_ids_for_workspace_path_except_doc_id` + `purge_document_at_workspace_path_except_doc_id`). No user action needed (the next ingest reprocesses automatically). Details: `tasks/HOTFIXES.md` (2026-05-24 PR-B).
- **2026-05-24 v0.17.0 PR-A Korean trigram tokenizer adopted (closure of 2026-05-22 Korean lexical)**: `chunks_fts` moves from FTS5 `unicode61` → `trigram` via the V007 migration (automatic backfill, no re-ingest needed). `lexical.rs::build_match_string` is redesigned to be trigram-aware: a multi-token Korean query (`해시 충돌`) hits as a whole-phrase candidate, and a mixed Korean/English query (`Rust 충돌은`) is OR-combined. Queries of 2 characters or fewer return 0 hits plus a CLI/TUI/wire `hint`. English lexical search also becomes substring matching (recall ↑ / word boundaries ↓). `kebab.sqlite` grows ~2-5x (trigram index). Details: `tasks/HOTFIXES.md` (2026-05-24).
- **2026-05-22 P10 comprehensive dogfooding round 2 (limits of Korean lexical search)**: Korean queries in `kebab search --mode lexical` got almost 0 hits under the FTS5 `unicode61` tokenizer (tokens are whole space-delimited words, so partial matching is impossible). The default hybrid mode works for Korean because the `multilingual-e5-small` vector side carries it. **closure**: the 2026-05-24 v0.17.0 entry above.
- **2026-05-20 P10-1B (Rust 1A symbol path inconsistency + expression-level functions not emitted)**: (a) Rust `code-rust-ast-v1` uses file-scope nesting only (no workspace path prefix), while 1B's Python/TypeScript/JavaScript use a workspace path → module path prefix (inconsistency accepted; a retrofit needs a chunker_version bump + reindex and is on hold until the user explicitly asks); (b) expression-level functions in TS/JS such as `const foo = () => {...}` are handled as `<top-level>` glue (only declaration-level units are in 1B's first-pass scope). Details: the two entries in `tasks/HOTFIXES.md` (2026-05-20).
- **2026-05-19 P10-1A-2 (code_rust_ast_v1.rs + SourceType)**: the `AST_CHUNK_MAX_LINES` constant does not read `IngestCodeCfg.ast_chunk_max_lines` and is fixed at the module constant 200 (the Chunker trait does not expose per-medium config); because there is no `SourceType::Code` variant, code files are classified as `SourceType::Note`. Both items are recorded in `tasks/HOTFIXES.md` (2026-05-19).
- **2026-05-07 fb-26 (progress.rs)** — `Aborted` unconditional writeln (TTY duplicate) + `Completed` TTY no summary fixed; `KEBAB_PROGRESS=plain` env + quiet suppression added
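
Why queries of 2 characters or fewer return 0 hits under a trigram index can be seen from a naive trigram split. This is a sketch only: `trigrams` is a hypothetical helper that works on Unicode scalar values and does not reproduce the SQLite FTS5 `trigram` tokenizer's details (such as case folding) or kebab's `build_match_string`.

```rust
/// Every run of three consecutive characters in `query`.
/// Inputs shorter than three characters produce no trigrams at all.
fn trigrams(query: &str) -> Vec<String> {
    let chars: Vec<char> = query.chars().collect();
    chars
        .windows(3)
        .map(|w| w.iter().collect::<String>())
        .collect()
}
```

A two-character query such as `충돌` yields no trigram, so there is nothing to look up in the index. Every trigram of `token` also occurs in `tokenizer`, which is the substring-matching behavior described for English queries.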
@@ -81,13 +85,13 @@ P0~P5 are serial. P6~P9 can run in parallel after P5.
## Next task candidates
- **P9-2 TUI search**: fills the `App.search` slot. Enables `/` in Library.
- **P9-3 TUI ask**: fills the `App.ask` slot. Enables `?`.
- **P9-4 TUI inspect**: fills the `App.inspect` slot. Enables `Enter`.
- **P9-5 desktop tauri**: separate branch. High value for a PDF citation rendering UI.
- **P8 audio brainstorm**: needs a user decision on whether to accept the whisper-rs system dep or use an external transcription endpoint. Low priority given the user's pattern (mostly books + PDF, no interest in audio).
The only structurally unfinished component is P9-5. Everything else is dogfooding follow-up (the "P10 dogfooding backlog" below) or awaiting a user decision.
P9-2/3/4 can proceed in parallel thanks to P9-1's parallel-safety contract (sub-state slot pattern): they do not touch the same `App`.
- **P9-5 desktop tauri**: the last remaining P9 component. `kebab-desktop` crate + Tauri app, separate branch. High value for a PDF citation rendering UI. Matches the user's priorities (P9 first · mostly books/PDF).
- **P10 dogfooding round 2 follow-up**: ✅ all three items closed by the v0.17.0 cut (2026-05-24) (Korean trigram PR-A + C typedef alias PR-B + code_lang_chunk_breakdown additive PR-C). Detailed cross-links: the "P10 dogfooding backlog" section below + `tasks/HOTFIXES.md` (2026-05-24 PR-A/B/C).
- **P8 audio brainstorm**: needs a user decision on whether to accept the whisper-rs system dep or use an external transcription endpoint. On hold given the user's pattern (mostly books + PDF, no interest in audio).
- **fb-41 multi-hop reasoning**: ⏳ not implemented, XL, requires eval infrastructure first + a brainstorm.
- **Rust symbol path retrofit**: Rust `code-rust-ast-v1` symbols are file-scope-only (1B+ use a module prefix). It needs a `code-rust-ast-v2` bump + the cost of re-ingesting the Rust corpus, so it is on hold until the user explicitly asks. HOTFIXES `2026-05-20`.
### P9 dogfooding backlog (fb-26 ~ fb-42): release split
@@ -96,11 +100,20 @@ P9-2/3/4 can proceed in parallel thanks to P9-1's parallel-safety contract (sub-state slot pattern)
- **0.3.0 (agent foundation)** ✅ cut 2026-05-07: fb-26 (log), fb-27 (introspection/error wire), fb-28 (readonly/quiet). ~~fb-29 (daemon)~~ → 🚫 **deferred**: fb-30 stdio MCP delivers the same value without the daemon complexity.
- **0.4.0 (agent integration, MCP)** ✅ cut: fb-30 (MCP stdio), fb-31 (single-file/stdin ingest).
- **0.5.0 (agent surface refinement, additive)** ✅ cut 2026-05-10: fb-32 (stale doc indicator), fb-33 (streaming ask), fb-34 (output budget controls), fb-35 (verbatim fetch), fb-36 (search filter args), fb-37 (trace + stats). All are wire schema additive minor changes.
- **0.6.0 (RAG quality)** 🟡 in progress: fb-38 (score semantics) ✅ merged (2026-05-10), fb-40 (fact-grounded answer / rag-v2 prompt) ✅ merged (2026-05-10), fb-39 (retrieval precision tuning, embedding_version cascade) not started (requires an eval golden set first).
- **0.7.0 or P+**: fb-41 (multi-hop reasoning, XL), fb-42 (bulk multi-query / rerank, Nice).
- **0.6.0 (RAG quality)** ✅ mostly merged (2026-05-10): fb-38 (score semantics) ✅, fb-39 (eval foundation, `precision_at_k_chunk` metric) ✅, fb-39b (embedding upgrade, multilingual-e5-large default) ✅, fb-40 (fact-grounded answer / rag-v2 prompt) ✅. Remaining = actually applying the fb-39 retrieval precision levers (requires expanding the eval golden set first).
- **0.7.0 or P+**: fb-41 (multi-hop reasoning, XL) ⏳ not implemented · needs a brainstorm; fb-42 (bulk multi-query) ✅ merged (2026-05-10, bulk only; the rerank hint is deferred).
The `target_version` field in each fb spec's frontmatter is the source of truth. The release subheaders in INDEX.md use the same grouping.
### P10 dogfooding backlog (2026-05-22 round 2)
Follow-up candidates found in P10 comprehensive dogfooding round 2 (`/build/cache/dogfood-p10b/`, 8 OSS repos + 10 Korean wiki documents). Details + priority rationale are in `tasks/HOTFIXES.md` (2026-05-22).
- **Korean lexical tokenizer**: ✅ merged in v0.17.0 (2026-05-24) PR-A (#159). V007 trigram migration with automatic backfill + `build_match_string` redesign + CLI/TUI/wire hint. See HOTFIXES `2026-05-24 PR-A`.
- **code_lang_chunk_breakdown per-chunk aggregation (LOW)**: ✅ merged in v0.17.0 (2026-05-24) PR-C (#161). Additive field in `schema.v1.stats`. See HOTFIXES `2026-05-24 PR-C`.
- **C typedef-wrapped struct (LOW)**: ✅ merged in v0.17.0 (2026-05-24) PR-B (#160). `type_definition` branch + `PARSER_VERSION code-c-v2` bump + orphan purge cascade. See HOTFIXES `2026-05-24 PR-B`.
- **ranking glue chunk bias (deferred)**: an automatic heuristic risks misalignment with user intent. Surface changes stay at 0 until the user explicitly asks. Brainstorm again after 1+ weeks of real use.
## Verified operational behavior (release binary, fastembed enabled)
Right after the P7-3 merge, a 25-scenario smoke run passed on a workspace with 5 markdown + image + PDF assets, covering doctor / ingest / list / inspect / search (lex/vec/hybrid) / re-ingest / byte-edit re-ingest / corrupt PDF / RAG ask + page citation. The detailed scenario table is in the conversation record; the procedure for running it directly on a workspace is in [docs/SMOKE.md](docs/SMOKE.md).


@@ -6,6 +6,20 @@
- **Rust toolchain** ≥ 1.85 (the workspace uses edition 2024 + resolver 3). [rustup](https://rustup.rs) recommended.
- **Ollama**: used by `kebab ask` and by image OCR/caption. Install it from `https://ollama.com/download`, then run `ollama serve`. The default LLM is the gemma4 family (`ollama pull gemma4:e4b`); OCR / caption use the same family, so pulling one model is enough. For a larger variant, override it in config with e.g. `gemma4:26b`. Specify host:port in `[models.llm].endpoint` in the config.
- **Recommended models for CPU-only / RAM ≤ 16 GB environments**: gemma4:e4b (8B) is heavy for CPU inference, so a single RAG answer easily exceeds 5 minutes. It then hits the default 300 s limit of `[models.llm] request_timeout_secs` and fails with `error: kb-rag: llm.generate_stream` (HOTFIXES 2026-05-25). Switching to a ≤ 4B Q4 model such as `gemma3:4b` / `qwen2.5:3b` / `phi3:mini` gives stable answers in 1-3 minutes (verified in extended dogfooding). If model storage is a concern, the location can be moved with the `OLLAMA_MODELS=/path` env.
- **`request_timeout_secs` knob (v0.17.0)**: raise the limit with `[models.llm] request_timeout_secs = 1200` (or `KEBAB_MODELS_LLM_REQUEST_TIMEOUT_SECS=1200`) to try larger models as well. Note that RAM stays occupied longer while the response is being generated. **`= 0` does not disable the timeout; it means "immediate timeout"** (reqwest semantics). For an effectively unlimited timeout, use a large finite value such as `u64::MAX` or `86400`.
- **Installing without sudo (isolated directory)**: if you would rather not let `install.sh` touch `/usr/local/bin/ollama` and the `systemd` unit, download only the binary tarball, unpack it into a user directory, and separate the model location via env.
```bash
mkdir -p /opt/ollama/{models,logs}
curl -fL https://ollama.com/download/ollama-linux-amd64.tar.zst -o /tmp/ollama.tar.zst
zstd -d /tmp/ollama.tar.zst -o /tmp/ollama.tar && tar -xf /tmp/ollama.tar -C /opt/ollama/
# This unpacks bin/ollama + lib/ollama/. The model directory is separated via OLLAMA_MODELS.
OLLAMA_MODELS=/opt/ollama/models OLLAMA_HOST=127.0.0.1:11434 \
/opt/ollama/bin/ollama serve > /opt/ollama/logs/serve.log 2>&1 &
/opt/ollama/bin/ollama pull gemma3:4b
```
Use this as-is when you want to move the load off the root disk (`~/.ollama/models` is the default). Useful in containers without systemd / WSL2 / company machines.
- **`kebab ask --stream` recommended (fb-33)**: when model cold start is long (8B+ or the first call), streaming tokens to stderr as ndjson with `--stream` shows the first token quickly even within the 5-minute timeout limit, which improves perceived responsiveness. For the same inference time, progressive output is more reliable than wait-and-pray. CLI: `kebab ask "..." --stream 2> events.ndjson > final.json`. MCP hosts are also advised to use it automatically when the `streaming_ask` capability flag is `true`.
- **Build disk**: on the first build `target/` reaches 6~10 GB (Lance + DataFusion + fastembed). Check free space.
- **fastembed model**: the first `kebab ingest` automatically downloads `multilingual-e5-large` (~1.3 GB, fb-39b). Setting `model = "multilingual-e5-small"` in `config.toml` uses the previous model.
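
The `request_timeout_secs = 0` pitfall noted above could be caught at config-load time with a guard along these lines. This is a hypothetical sketch: `request_timeout` is not part of kebab, and whether kebab validates the value at all is not stated here.

```rust
use std::time::Duration;

/// Hypothetical config guard (not kebab's code): reject 0, which an HTTP
/// client would treat as a zero-length timeout rather than "disabled".
fn request_timeout(secs: u64) -> Result<Duration, String> {
    if secs == 0 {
        return Err(
            "request_timeout_secs = 0 means an immediate timeout; use a large finite value such as 86400"
                .to_string(),
        );
    }
    Ok(Duration::from_secs(secs))
}
```

Rejecting the value outright, instead of silently rewriting it, keeps the configured number and the effective timeout the same.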
@@ -70,8 +84,8 @@ kebab doctor
| Command | Behavior |
|------|------|
| `kebab init` | Creates the data directory + config.toml under the XDG paths |
| `kebab ingest [<path>]` | Indexes Markdown / images / PDF / Rust source code (idempotent). On a TTY it shows a progress bar on stderr; non-TTY (CI / pipe) prints one line at a time to stderr; `--json` streams `ingest_progress.v1` lines to stdout and ends with `ingest_report.v1`. A single Ctrl-C finishes the current asset and then aborts (partial commits are preserved, re-runs are idempotent); a second Ctrl-C is a hard exit. If a Markdown title is missing from the frontmatter, it is filled in automatically in the order first H1 → H2 → first 80 characters of the first paragraph → file name (parser_version `md-frontmatter-v2`); already-indexed docs also pick up the new title on the next ingest. **Incremental** (p9-fb-23): from the second ingest onward, parse/chunk/embed/vector upsert is skipped automatically for unchanged docs (blake3 and parser/chunker/embedder versions all identical). The final summary shows an `N unchanged` count. `--force-reingest` ignores the skip and forces reprocessing. **Supported formats** (the extractor is chosen automatically and cannot be set in config): Markdown (`.md`), images (`.png` / `.jpg` / `.jpeg`, OCR + caption), PDF (`.pdf`), **source code** (`.rs` → `code-rust-ast-v1`, `.py` → `code-python-ast-v1`, `.ts`/`.tsx` → `code-ts-ast-v1`, `.js`/`.mjs`/`.cjs`/`.jsx` → `code-js-ast-v1`, `.go` → `code-go-ast-v1`, `.java` → `code-java-ast-v1`, `.kt`/`.kts` → `code-kotlin-ast-v1`, all tree-sitter AST chunkers; **Tier 2 resource files**: `.yaml`/`.yml` → `k8s-manifest-resource-v1` (parses apiVersion+kind), `Dockerfile`/`Dockerfile.*`/`*.dockerfile` → `dockerfile-file-v1` (whole file), `Cargo.toml`/`pyproject.toml`/`.toml`/`package.json`/`tsconfig.json`/`.json`/`pom.xml`/`.xml`/`build.gradle`/`.gradle`/`go.mod` → `manifest-file-v1` (whole file), covering yaml (k8s) / dockerfile / toml / json / xml / groovy / go-mod). Other extensions are skipped automatically: the reason goes into `IngestItem.warnings` (e.g. `"unsupported media type: .docx"`), the counts are classified in `IngestReport.skipped_by_extension`, and the CLI / TUI summary shows the breakdown. Code chunks carry `citation.lang = "<lang>"` + `symbol` + line range under `citation.kind = "code"`, and `code_lang` + `repo` (the directory name found by the `.git/` walk-up) are backfilled at the SearchHit top level. The `--code-lang rust` / `--code-lang python` / `--code-lang typescript` / `--code-lang javascript` / `--code-lang go` / `--code-lang java` / `--code-lang kotlin` / `--code-lang yaml` / `--code-lang dockerfile` / `--code-lang toml` / `--code-lang json` / `--code-lang xml` / `--code-lang groovy` / `--code-lang go-mod` / `--media code` filters allow per-language and code-only search (p10-1A-1 filter flags). Python symbols use a workspace path → dotted module path prefix (e.g. `kebab_eval.metrics.compute_mrr`), TS/JS symbols use a slash-style module path prefix (e.g. `src/Foo.Foo.search`), Go symbols take the form `package.Func` / `package.(*Receiver).Method`, and Java / Kotlin symbols take the form `com.foo.Foo.bar` (package + class + method/field). |
| `kebab search --mode {lexical,vector,hybrid} "<query>" [--no-cache] [--max-tokens N] [--snippet-chars N] [--cursor <opaque>] [--tag T] [--lang L] [--path-glob G] [--trust-min LEVEL] [--media TYPE] [--ingested-after RFC3339] [--doc-id ID] [--trace] [--bulk] [--repo NAME ...] [--code-lang LIST]` | Search. hybrid uses RRF fusion and includes citations. Repeating the same query (normalized with NFKC + trim + lowercase) within one process hits the in-process LRU cache (capacity = `[search] cache_capacity`, default 256). `--no-cache` forces a bypass, for debugging. When an ingest commit occurs, the `kv['corpus_revision']` bump makes every entry stale automatically. **`--max-tokens` / `--snippet-chars` / `--cursor` (p9-fb-34)**: agent budget controls. `--json` output is the `search_response.v1` wrapper (`{hits, next_cursor, truncated}`), which is not compatible with the pre-fb-34 bare array. mismatched cursor → `error.v1.code = stale_cursor`. **filter flags (p9-fb-36):** `--tag` is a repeatable flag (`--tag rust --tag async`) with OR matching, `--media` takes `,`-separated multiple values with OR matching, and the remaining flags combine with AND. `--trust-min` is one of `primary\|secondary\|generated` (includes that level and above). `--ingested-after` is RFC3339 UTC; a parse failure gives `error.v1.code = config_invalid` (exit 2). `--media md` is normalized as an alias of `markdown`. An unknown `--media` value always yields empty hits (not an error). **`--trace` (p9-fb-37)**: exposes lexical / vector pre-fusion candidates + RRF union + per-stage timing (`lexical_ms` / `vector_ms` / `fusion_ms` / `total_ms`) in `search_response.v1.trace`. Trace requests bypass the cache (always cold, even without `--no-cache`). **`--bulk` (p9-fb-42)**: runs N queries at once from stdin ndjson. With `--json`, stdout gets per-query ndjson (`bulk_search_item.v1`) and stderr gets a summary (`bulk_summary: total=N succeeded=S failed=F`). Cap 100. When an agent runs sub-queries in a batch after query decomposition, this is a single round-trip: the App instance is reused, so the cache / embedder cold-start cost is paid only once. A per-query failure is isolated in the item's `error` (error.v1) and the other queries continue. **code corpus filters (p10-1A-1):** `--repo` is repeatable (`--repo kebab --repo other`) with OR matching. `--code-lang` takes repeated or comma-separated multiple values (`--code-lang rust,python`); unknown values give empty hits. `--media code` includes all Tier 1/2/3 code chunks. At the 1A-1 stage there were no indexed code chunks, so the filter always returned empty results; it takes effect after the 1A-2 (Rust AST chunker) merge. |
| `kebab ingest [<path>]` | Indexes Markdown / images / PDF / Rust source code (idempotent). On a TTY it shows a progress bar on stderr; non-TTY (CI / pipe) prints one line at a time to stderr; `--json` streams `ingest_progress.v1` lines to stdout and ends with `ingest_report.v1`. A single Ctrl-C finishes the current asset and then aborts (partial commits are preserved, re-runs are idempotent); a second Ctrl-C is a hard exit. If a Markdown title is missing from the frontmatter, it is filled in automatically in the order first H1 → H2 → first 80 characters of the first paragraph → file name (parser_version `md-frontmatter-v2`); already-indexed docs also pick up the new title on the next ingest. **Incremental** (p9-fb-23): from the second ingest onward, parse/chunk/embed/vector upsert is skipped automatically for unchanged docs (blake3 and parser/chunker/embedder versions all identical). The final summary shows an `N unchanged` count. `--force-reingest` ignores the skip and forces reprocessing. **Supported formats** (the extractor is chosen automatically and cannot be set in config): Markdown (`.md`), images (`.png` / `.jpg` / `.jpeg`, OCR + caption), PDF (`.pdf`), **source code** (`.rs` → `code-rust-ast-v1`, `.py` → `code-python-ast-v1`, `.ts`/`.tsx` → `code-ts-ast-v1`, `.js`/`.mjs`/`.cjs`/`.jsx` → `code-js-ast-v1`, `.go` → `code-go-ast-v1`, `.java` → `code-java-ast-v1`, `.kt`/`.kts` → `code-kotlin-ast-v1`, `.c`/`.h` → `code-c-ast-v1`, `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx` → `code-cpp-ast-v1`, all tree-sitter AST chunkers; **Tier 2 resource files**: `.yaml`/`.yml` → `k8s-manifest-resource-v1` (parses apiVersion+kind), `Dockerfile`/`Dockerfile.*`/`*.dockerfile` → `dockerfile-file-v1` (whole file), `Cargo.toml`/`pyproject.toml`/`.toml`/`package.json`/`tsconfig.json`/`.json`/`pom.xml`/`.xml`/`build.gradle`/`.gradle`/`go.mod` → `manifest-file-v1` (whole file), covering yaml (k8s) / dockerfile / toml / json / xml / groovy / go-mod); **Tier 3 paragraph fallback** (`.sh`/`.bash`/`.zsh` → `code-text-paragraph-v1`, blank-line paragraph split + 80-line/20-overlap line-window. When Tier 1/2 yields 0 chunks or Err, it falls back automatically and picks up cases such as non-k8s YAML. symbol = None, and lang is preserved from the original.). Other extensions are skipped automatically: the reason goes into `IngestItem.warnings` (e.g. `"unsupported media type: .docx"`), the counts are classified in `IngestReport.skipped_by_extension`, and the CLI / TUI summary shows the breakdown. Code chunks carry `citation.lang = "<lang>"` + `symbol` + line range under `citation.kind = "code"`, and `code_lang` + `repo` (the directory name found by the `.git/` walk-up) are backfilled at the SearchHit top level. The `--code-lang rust` / `--code-lang python` / `--code-lang typescript` / `--code-lang javascript` / `--code-lang go` / `--code-lang java` / `--code-lang kotlin` / `--code-lang yaml` / `--code-lang dockerfile` / `--code-lang toml` / `--code-lang json` / `--code-lang xml` / `--code-lang groovy` / `--code-lang go-mod` / `--code-lang shell` / `--code-lang c` / `--code-lang cpp` / `--media code` filters allow per-language and code-only search (p10-1A-1 filter flags). Python symbols use a workspace path → dotted module path prefix (e.g. `kebab_eval.metrics.compute_mrr`), TS/JS symbols use a slash-style module path prefix (e.g. `src/Foo.Foo.search`), Go symbols take the form `package.Func` / `package.(*Receiver).Method`, and Java / Kotlin symbols take the form `com.foo.Foo.bar` (package + class + method/field). |
| `kebab search --mode {lexical,vector,hybrid} "<query>" [--no-cache] [--max-tokens N] [--snippet-chars N] [--cursor <opaque>] [--tag T] [--lang L] [--path-glob G] [--trust-min LEVEL] [--media TYPE] [--ingested-after RFC3339] [--doc-id ID] [--trace] [--bulk] [--repo NAME ...] [--code-lang LIST]` | Search. hybrid uses RRF fusion and includes citations. Repeating the same query (normalized with NFKC + trim + lowercase) within one process hits the in-process LRU cache (capacity = `[search] cache_capacity`, default 256). `--no-cache` forces a bypass, intended for debugging. Each ingest commit bumps `kv['corpus_revision']`, which automatically marks every entry stale. **`--max-tokens` / `--snippet-chars` / `--cursor` (p9-fb-34)**: agent budget controls. `--json` output is the `search_response.v1` wrapper (`{hits, next_cursor, truncated}`), which is not compatible with the pre-fb-34 bare array. A mismatched cursor → `error.v1.code = stale_cursor`. **filter flags (p9-fb-36):** `--tag` is a repeatable flag (`--tag rust --tag async`) with OR matching, `--media` takes `,`-separated multiple values with OR matching, and different flags combine with AND. `--trust-min` is one of `primary\|secondary\|generated` (includes that level and above). `--ingested-after` is RFC3339 UTC; a parse failure yields `error.v1.code = config_invalid` (exit 2). `--media md` is normalized as an alias of `markdown`. An unknown `--media` value always yields empty hits (not an error). **`--trace` (p9-fb-37)**: exposes lexical / vector pre-fusion candidates + the RRF union + per-stage timing (`lexical_ms` / `vector_ms` / `fusion_ms` / `total_ms`) in `search_response.v1.trace`. Trace requests bypass the cache (always cold, even without `--no-cache`). **`--bulk` (p9-fb-42)**: runs N queries at once from stdin ndjson. With `--json`, stdout carries per-query ndjson (`bulk_search_item.v1`) and stderr a summary (`bulk_summary: total=N succeeded=S failed=F`). Cap 100. When an agent runs sub-queries in bulk after query decomposition, this is a single round-trip: the App instance is reused, so cache / embedder cold-start costs are paid only once. Per-query failures are isolated in the item's `error` (error.v1) and the other queries continue. **code corpus filters (p10-1A-1):** `--repo` is repeatable (`--repo kebab --repo other`) with OR matching. `--code-lang` takes repeated or comma-separated values (`--code-lang rust,python`); unknown values yield empty hits. `--media code` includes code chunks from all of Tier 1/2/3. At the 1A-1 stage there are no indexed code chunks, so the filter always returns empty results; it takes effect once 1A-2 (Rust AST chunker) is merged. **v0.17.0 trigram tokenizer (behavior change for Korean + English):** `chunks_fts` runs on the FTS5 `trigram` tokenizer: Korean queries match as substrings of 3 or more characters (multi-token queries such as `해시 충돌`, "hash collision", also hit as whole-phrase candidates), and English also matches as substrings (`token` also hits `tokenizer`; recall ↑ / word boundaries ↓). Queries of 2 characters or fewer return 0 hits plus the stderr line `[hint] 3자 이상 키워드 권장` ("keywords of 3+ characters recommended") and the `search_response.v1.hint` field (except in raw FTS5 mode `'...'`). The `kebab.sqlite` file grows by roughly 2-5x, or by hundreds of MB, because of the larger trigram index (V007 backfills automatically; no re-ingest needed). |
| `kebab list docs` | List indexed documents |
| `kebab inspect doc <id>` / `kebab inspect chunk <id>` | View raw records |
| `kebab fetch chunk <id> [--context N]` / `kebab fetch doc <id> [--max-tokens N]` / `kebab fetch span <doc_id> <ls> <le> [--max-tokens N]` | (p9-fb-35) verbatim text fetch from indexed corpus. wire = `fetch_result.v1` (kind discriminator). chunk: target + ±N ordinal-context chunks. doc: full normalized markdown. span: 1-based line range (PDF/audio rejected as `error.v1.code = span_not_supported`). chars/4 budget on doc/span. |
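
The `hybrid` search mode in the table above is described as RRF fusion of the lexical and vector result lists. The following is a minimal sketch of Reciprocal Rank Fusion, not the project's retriever: the function name `rrf_fuse`, the use of plain string ids, and the constant `k = 60.0` in the usage note are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion (illustrative sketch, not kebab's retriever).
/// Each id scores the sum of 1 / (k + rank) over every list it appears in,
/// where rank starts at 1 for the top hit of a list.
fn rrf_fuse(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (idx, id) in list.iter().enumerate() {
            let rank = (idx + 1) as f64;
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest score first; ties are broken by id so the order is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Calling `rrf_fuse(&[lexical_ids, vector_ids], 60.0)` rewards ids that rank well in both lists without needing the two scorers to share a score scale.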
@@ -132,7 +146,7 @@ flowchart TB
subgraph Pipeline["Domain + pipeline"]
parse["parse-md / parse-pdf / parse-image / parse-code"]
chunker["chunker (md-heading-v1, pdf-page-v1, code-rust-ast-v1, code-python-ast-v1, code-ts-ast-v1, code-js-ast-v1, code-go-ast-v1)"]
chunker["chunker (md-heading-v1, pdf-page-v1, code-{rust,python,ts,js,go,java,kotlin,c,cpp}-ast-v1, k8s-manifest-resource-v1, dockerfile-file-v1, manifest-file-v1, code-text-paragraph-v1)"]
embedder["embedder (fastembed multilingual-e5-large)"]
retriever["retriever (lexical / vector / hybrid RRF)"]
rag["RAG pipeline"]


@@ -73,6 +73,37 @@ pub struct SearchResponse {
/// p9-fb-37: present when caller passed `SearchOpts.trace = true`.
/// Consumers that ignore trace should leave this `None`.
pub trace: Option<kebab_core::SearchTrace>,
/// v0.17.0 A5 Step 4b: human / agent-readable advisory string set
/// when the empty hit list is likely due to a query shorter than the
/// FTS5 trigram tokenizer's 3-char minimum. `None` otherwise. CLI
/// surfaces it on stderr (text mode); MCP / `--json` consumers
/// surface it however they prefer. See
/// `docs/superpowers/specs/2026-05-22-korean-trigram-tokenizer-design.md`
/// §3.3.
pub hint: Option<String>,
}
/// v0.17.0 A5 Step 4b: decide whether to attach a "3자 이상 키워드 권장"
/// hint to a `SearchResponse`. Fires only when the result set is empty
/// *and* the trimmed query is shorter than the trigram tokenizer can
/// resolve. Raw FTS5 mode (`'...'`) opts out — the user explicitly
/// invoked FTS5 syntax. Identical condition powers the CLI stderr line
/// and (separately) the TUI status bar.
pub fn short_query_hint(query_text: &str, hits_empty: bool) -> Option<String> {
if !hits_empty {
return None;
}
let trimmed = query_text.trim();
let bytes = trimmed.as_bytes();
// Raw single-quote mode: user opted into FTS5 syntax, no advisory.
if bytes.len() >= 2 && bytes[0] == b'\'' && bytes[bytes.len() - 1] == b'\'' {
return None;
}
if trimmed.chars().count() < 3 {
Some("3자 이상 키워드 권장 (trigram tokenizer 제약)".to_string())
} else {
None
}
}
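
The 3-character threshold in `short_query_hint` follows from how a trigram index sees text. The sketch below is a simplified model under stated assumptions, not SQLite's FTS5 implementation: `trigrams` and `is_candidate` are hypothetical helpers, and `is_candidate` only checks the necessary condition that every query trigram occurs in the text.

```rust
/// Sliding 3-character windows over a string, counted in Unicode scalar
/// values (the same unit `chars().count()` uses). Hypothetical helper.
fn trigrams(s: &str) -> Vec<String> {
    let chars: Vec<char> = s.chars().collect();
    chars
        .windows(3)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// A text can only match when the query produces at least one trigram and
/// every query trigram also occurs in the text. Simplified model only.
fn is_candidate(query: &str, text: &str) -> bool {
    let q = trigrams(query);
    if q.is_empty() {
        // Fewer than 3 characters: nothing to look up in the index.
        return false;
    }
    let t = trigrams(text);
    q.iter().all(|g| t.contains(g))
}
```

Under this model a 2-character query yields no trigrams at all, which is the case the hint is meant to explain, while `token` is a candidate for `tokenizer` because all of its windows are shared.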
/// Facade state — see module docs for lifetime rules.
@@ -418,11 +449,13 @@ impl App {
// Trace path skips the budget loop. Caller will inspect
// `hits.len()` and `trace.timing` rather than paginate.
let hint = short_query_hint(&query.text, hits.is_empty());
return Ok(SearchResponse {
hits,
next_cursor: None,
truncated: false,
trace: Some(trace),
hint,
});
}
@@ -505,11 +538,13 @@ impl App {
None
};
let hint = short_query_hint(&query.text, hits.is_empty());
Ok(SearchResponse {
hits,
next_cursor,
truncated,
trace: None,
hint,
})
}


@@ -96,6 +96,11 @@ fn serialize_search_response(r: &SearchResponse) -> Value {
None => Value::Null,
};
map.insert("trace".to_string(), trace_v);
// v0.17.0 A5 Step 4b: only emit `hint` when set — matches
// the CLI wire wrapper's additive emit pattern.
if let Some(hint) = &r.hint {
map.insert("hint".to_string(), Value::String(hint.clone()));
}
}
v
}


@@ -39,7 +39,7 @@ use std::sync::Arc;
use anyhow::{Context, anyhow};
use serde::{Deserialize, Serialize};
use kebab_chunk::{CodeGoAstV1Chunker, CodeJavaAstV1Chunker, CodeJsAstV1Chunker, CodeKotlinAstV1Chunker, CodePythonAstV1Chunker, CodeRustAstV1Chunker, CodeTsAstV1Chunker, DockerfileFileV1Chunker, K8sManifestResourceV1Chunker, ManifestFileV1Chunker, MdHeadingV1Chunker, PdfPageV1Chunker};
use kebab_chunk::{CodeCAstV1Chunker, CodeCppAstV1Chunker, CodeGoAstV1Chunker, CodeJavaAstV1Chunker, CodeJsAstV1Chunker, CodeKotlinAstV1Chunker, CodePythonAstV1Chunker, CodeRustAstV1Chunker, CodeTextParagraphV1Chunker, CodeTsAstV1Chunker, DockerfileFileV1Chunker, K8sManifestResourceV1Chunker, ManifestFileV1Chunker, MdHeadingV1Chunker, PdfPageV1Chunker};
use kebab_core::{
Answer, Block, CanonicalDocument, Chunk, ChunkId, ChunkPolicy, ChunkerVersion, Chunker,
DocFilter, DocSummary, DocumentId, DocumentStore, Embedder, EmbeddingInput,
@@ -50,7 +50,7 @@ use kebab_core::{
use kebab_llm_local::OllamaLanguageModel;
use kebab_normalize::build_canonical_document;
use kebab_parse_image::{ImageExtractor, OllamaVisionOcr, apply_caption, apply_ocr};
use kebab_parse_code::{GoAstExtractor, JavaAstExtractor, JavascriptAstExtractor, KotlinAstExtractor, PythonAstExtractor, RustAstExtractor, TypescriptAstExtractor};
use kebab_parse_code::{CAstExtractor, CppAstExtractor, GoAstExtractor, JavaAstExtractor, JavascriptAstExtractor, KotlinAstExtractor, PythonAstExtractor, RustAstExtractor, TypescriptAstExtractor};
use kebab_parse_pdf::PdfTextExtractor;
use kebab_parse_md::{BodyHints, parse_blocks, parse_frontmatter};
use kebab_source_fs::FsSourceConnector;
@@ -69,7 +69,7 @@ pub mod reset;
pub mod schema;
mod staleness;
pub use app::{App, SearchResponse};
pub use app::{App, SearchResponse, short_query_hint};
pub use ingest_progress::{AggregateCounts, IngestEvent, render_skipped_breakdown};
pub use reset::{ResetReport, ResetScope, enumerate_orphans};
pub use error_wire::{ERROR_V1_ID, ErrorV1, StructuredError, classify};
@@ -795,6 +795,7 @@ fn try_skip_unchanged(
current_chunker_version: &ChunkerVersion,
current_embedding_version: Option<&kebab_core::EmbeddingVersion>,
force_reingest: bool,
fallback_chunker_version: Option<&ChunkerVersion>, // p10-3 fix
) -> anyhow::Result<Option<kebab_core::IngestItem>> {
if force_reingest {
return Ok(None);
@@ -829,12 +830,72 @@ fn try_skip_unchanged(
if existing_doc.source_asset_id != asset.asset_id {
return Ok(None);
}
// p10-3 fix: detect "stored doc was previously Tier 3 fallback".
// When a Tier 1/2 extractor emits empty chunks, the fallback wrapper
// retries with CodeTextParagraphV1Chunker and stores
// last_chunker_version = "code-text-paragraph-v1" + parser_version = "none-v1".
// On the next ingest the caller computes current_parser_version /
// current_chunker_version from the Tier 1/2 dispatch (e.g.
// "k8s-manifest-resource-v1"), which can never match the stored
// fallback values, causing spurious re-ingests. Detect this case
// and bypass the parser/chunker equality checks — only the embedder
// version still must match.
let stored_is_tier3_fallback = fallback_chunker_version.is_some_and(|fbv| {
existing_doc.last_chunker_version.as_ref() == Some(fbv)
&& existing_doc.parser_version.0 == "none-v1"
});
if stored_is_tier3_fallback {
// Embedder version still must match.
let embedder_match = existing_doc.last_embedding_version.as_ref()
== current_embedding_version;
if !embedder_match {
return Ok(None);
}
let candidate_doc_id = existing_doc.doc_id.clone();
tracing::debug!(
target: "kebab-app::ingest",
path = %asset.workspace_path.0,
doc_id = %candidate_doc_id.0,
"skip-unchanged: tier 3 fallback state detected; bypassing parser/chunker equality"
);
return Ok(Some(kebab_core::IngestItem {
kind: kebab_core::IngestItemKind::Unchanged,
doc_id: Some(candidate_doc_id),
doc_path: asset.workspace_path.clone(),
asset_id: Some(asset.asset_id.clone()),
byte_len: Some(asset.byte_len),
block_count: u32::try_from(existing_doc.blocks.len()).ok(),
chunk_count: None,
parser_version: Some(existing_doc.parser_version.clone()),
chunker_version: existing_doc.last_chunker_version.clone(),
warnings: Vec::new(),
error: None,
}));
}
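
The Tier 3 bypass above can be restated as a small decision function. This is a hedged sketch with simplified stand-in types: `StoredDoc` and `can_skip` are hypothetical, versions are plain strings, the byte-hash and asset checks that precede this step are omitted, and the plain equality used for the non-fallback path is an assumption because that part of the function is not shown in the hunk.

```rust
/// Hypothetical, simplified view of the stored document row.
struct StoredDoc {
    parser_version: String,
    chunker_version: Option<String>,
    embedding_version: Option<String>,
}

/// Returns true when the stored row may be reported as unchanged.
fn can_skip(
    stored: &StoredDoc,
    current_parser: &str,
    current_chunker: &str,
    current_embedding: Option<&str>,
    fallback_chunker: Option<&str>,
) -> bool {
    let embedder_match = stored.embedding_version.as_deref() == current_embedding;
    // Stored as Tier 3 fallback: fallback chunker plus the "none-v1" sentinel.
    let stored_is_fallback = fallback_chunker.is_some_and(|f| {
        stored.chunker_version.as_deref() == Some(f) && stored.parser_version == "none-v1"
    });
    if stored_is_fallback {
        // Parser/chunker equality is bypassed; only the embedder must match.
        return embedder_match;
    }
    stored.parser_version == current_parser
        && stored.chunker_version.as_deref() == Some(current_chunker)
        && embedder_match
}
```

Passing `None` as `fallback_chunker`, as the Markdown, image, and PDF call sites do, disables the bypass and leaves the strict comparison in place.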
// 2. Parser unchanged: parser_version is baked into id_for_doc so
// a version bump yields a different doc_id and the row above
// would have been missing. Checking here explicitly keeps the
// logic self-documenting and guards against future id_for_doc
// changes.
if existing_doc.parser_version != *current_parser_version {
// v0.17.0 PR-B: parser_version bump cascade. Same bytes (same
// asset_id) → asset-keyed `stale_chunk_ids_at` is a no-op, but
// the stale `documents` row at this workspace_path still
// collides with `idx_docs_workspace_path` on the next INSERT
// and the LanceDB rows under the old chunk_ids orphan. Sweep
// both stores here, before returning Ok(None), so the caller's
// full-ingest path lands a clean slate. The `keep_doc_id = ""`
// sentinel removes every doc at this path (the new doc_id is
// not yet known here — it's computed downstream from the new
// PARSER_VERSION).
purge_workspace_path_for_parser_bump(app, asset).with_context(|| {
format!(
"parser-bump orphan purge at {}",
asset.workspace_path.0
)
})?;
return Ok(None);
}
// 3. Chunker unchanged.
@@ -948,11 +1009,12 @@ fn ingest_one_asset(
force_reingest,
);
}
// p10-1A-2 / 1B: code ingest dispatch. p10-2: Tier 2 langs added.
// p10-1A-2 / 1B: code ingest dispatch. p10-2: Tier 2 langs added. p10-3: shell added. p10-1D: c/cpp added.
MediaType::Code(lang)
if matches!(lang.as_str(),
"rust" | "python" | "typescript" | "javascript" | "go" | "java" | "kotlin"
| "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod") =>
| "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod"
| "shell" | "c" | "cpp") =>
{
return ingest_one_code_asset(
app,
@@ -1016,6 +1078,7 @@ fn ingest_one_asset(
&MdHeadingV1Chunker.chunker_version(),
embedder.map(|e| e.model_version()).as_ref(),
force_reingest,
None,
)? {
return Ok(item);
}
@@ -1210,6 +1273,7 @@ fn ingest_one_image_asset(
&MdHeadingV1Chunker.chunker_version(),
embedder.map(|e| e.model_version()).as_ref(),
force_reingest,
None,
)? {
return Ok(item);
}
@@ -1438,6 +1502,53 @@ fn record_image_analysis_failure(
warning_notes.push(note);
}
/// v0.17.0 PR-B: parser-bump cascade. When a code extractor ships a
/// new `PARSER_VERSION` (e.g. `code-c-v1` → `code-c-v2`), the same
/// (workspace_path, asset_id) pair re-emerges with a fresh `doc_id`.
/// The existing asset-keyed [`purge_vector_orphans_for_workspace_path`]
/// only fires on asset_id changes (file bytes edited) and is a no-op
/// here. Without an explicit doc-keyed sweep the next INSERT raises
/// `idx_docs_workspace_path` UNIQUE and the LanceDB rows under the
/// stale chunk_ids orphan. This helper:
///
/// 1. Fetches every stale chunk_id at `workspace_path` from SQLite
/// (`keep_doc_id = ""` means "all existing docs are stale" —
/// `try_skip_unchanged` calls this before the new doc_id is
/// computed).
/// 2. Deletes the matching vectors from every Lance table (no-op if
/// embeddings are disabled).
/// 3. Sweeps the SQLite `documents` row (CASCADE drops `blocks` /
/// `chunks` / `embedding_records`). The `assets` row stays — same
/// bytes, same asset_id, only the derived `doc_id` changed.
fn purge_workspace_path_for_parser_bump(
app: &App,
asset: &RawAsset,
) -> anyhow::Result<()> {
let path = &asset.workspace_path.0;
let stale = app
.sqlite
.stale_chunk_ids_for_workspace_path_except_doc_id(path, "")
.context("SqliteStore::stale_chunk_ids_for_workspace_path_except_doc_id")?;
if !stale.is_empty() {
if let Some(vec_store) = app.vector().context("App::vector")? {
use kebab_core::VectorStore as _;
vec_store
.delete_by_chunk_ids(&stale)
.context("VectorStore::delete_by_chunk_ids (parser-bump orphans)")?;
}
}
app.sqlite
.purge_document_at_workspace_path_except_doc_id(path, "")
.context("SqliteStore::purge_document_at_workspace_path_except_doc_id")?;
tracing::debug!(
target: "kebab-app",
path = %path,
count = stale.len(),
"purged orphan vectors + document for parser_version bump"
);
Ok(())
}
/// HOTFIXES 2026-05-02 P7-3 follow-up: when a tracked file's bytes
/// change, `purge_orphan_at_workspace_path` (in `kebab-store-sqlite`)
/// sweeps the SQLite chain (documents → blocks / chunks / embedding_records)
@@ -1656,6 +1767,7 @@ fn ingest_one_pdf_asset(
&PdfPageV1Chunker.chunker_version(),
embedder.map(|e| e.model_version()).as_ref(),
force_reingest,
None,
)? {
return Ok(item);
}
@@ -1835,11 +1947,16 @@ fn ingest_one_code_asset(
// p10-2: Tier 2 has no parse step — sentinel "none-v1".
"yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod"
=> ParserVersion("none-v1".to_string()),
// p10-3: shell direct routes to Tier 3 (no parse step).
"shell" => ParserVersion("none-v1".to_string()),
// p10-1D: C + C++ AST extractors.
"c" => ParserVersion(kebab_parse_code::C_PARSER_VERSION.to_string()),
"cpp" => ParserVersion(kebab_parse_code::CPP_PARSER_VERSION.to_string()),
other => anyhow::bail!("unsupported code_lang: {other}"),
};
// p10-1b Task D/G/J/L: chunker_version per-lang.
let chunker_version = match code_lang {
let mut chunker_version = match code_lang {
"rust" => CodeRustAstV1Chunker.chunker_version(),
"python" => CodePythonAstV1Chunker.chunker_version(),
"typescript" => CodeTsAstV1Chunker.chunker_version(),
@@ -1852,9 +1969,26 @@ fn ingest_one_code_asset(
"dockerfile" => DockerfileFileV1Chunker.chunker_version(),
"toml" | "json" | "xml" | "groovy" | "go-mod"
=> ManifestFileV1Chunker.chunker_version(),
// p10-3:
"shell" => CodeTextParagraphV1Chunker.chunker_version(),
// p10-1D: C + C++ AST chunkers.
"c" => CodeCAstV1Chunker.chunker_version(),
"cpp" => CodeCppAstV1Chunker.chunker_version(),
other => anyhow::bail!("unreachable chunker_version: {other}"),
};
// p10-3 fix: if this lang can fall back to Tier 3, compute the fallback
// chunker_version so try_skip_unchanged can detect the stored-as-Tier-3
// state and skip parser/chunker equality checks.
let tier3_fallback_cv: Option<ChunkerVersion> = match code_lang {
"rust" | "python" | "typescript" | "javascript"
| "go" | "java" | "kotlin"
| "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod"
| "c" | "cpp" // p10-1D
=> Some(CodeTextParagraphV1Chunker.chunker_version()),
_ => None,
};
if let Some(item) = try_skip_unchanged(
app,
asset,
@@ -1862,6 +1996,7 @@ fn ingest_one_code_asset(
&chunker_version,
embedder.map(|e| e.model_version()).as_ref(),
force_reingest,
tier3_fallback_cv.as_ref(),
)? {
return Ok(item);
}
@@ -1877,70 +2012,159 @@ fn ingest_one_code_asset(
};
// p10-1b Task D/G/J/L: extractor per-lang.
let mut canonical = match code_lang {
// p10-3: capture Result so Tier 1 extractor errors can fall back to Tier 3.
let canonical_result: anyhow::Result<kebab_core::CanonicalDocument> = match code_lang {
"rust" => RustAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::RustAstExtractor::extract (code:rust)")?,
.context("kb-parse-code::RustAstExtractor::extract (code:rust)"),
"python" => PythonAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::PythonAstExtractor::extract (code:python)")?,
.context("kb-parse-code::PythonAstExtractor::extract (code:python)"),
"typescript" => TypescriptAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::TypescriptAstExtractor::extract (code:typescript)")?,
.context("kb-parse-code::TypescriptAstExtractor::extract (code:typescript)"),
"javascript" => JavascriptAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::JavascriptAstExtractor::extract (code:javascript)")?,
.context("kb-parse-code::JavascriptAstExtractor::extract (code:javascript)"),
"go" => GoAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::GoAstExtractor::extract (code:go)")?,
.context("kb-parse-code::GoAstExtractor::extract (code:go)"),
"java" => JavaAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::JavaAstExtractor::extract (code:java)")?,
.context("kb-parse-code::JavaAstExtractor::extract (code:java)"),
"kotlin" => KotlinAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::KotlinAstExtractor::extract (code:kotlin)")?,
.context("kb-parse-code::KotlinAstExtractor::extract (code:kotlin)"),
// p10-2 Tier 2: no extractor — synthesize Document directly from raw bytes.
"yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod" => {
synthesize_tier2_document(asset, &bytes, code_lang, &parser_version)?
synthesize_tier2_document(asset, &bytes, code_lang, &parser_version)
}
// p10-3: shell reuses the same synthesizer.
"shell" => synthesize_tier2_document(asset, &bytes, "shell", &parser_version),
// p10-1D: C + C++ AST extractors.
"c" => CAstExtractor::new()
.extract(&ctx, &bytes)
.context("kebab-parse-code::CAstExtractor::extract (code:c)"),
"cpp" => CppAstExtractor::new()
.extract(&ctx, &bytes)
.context("kebab-parse-code::CppAstExtractor::extract (code:cpp)"),
other => anyhow::bail!("unreachable (extract): {other}"),
};
// p10-3: Tier 1 extractor failure → fall back to Tier 3 synthesized doc.
// Tier 2 (yaml/dockerfile/…) and shell errors are real (e.g. non-UTF-8) — propagate.
let mut canonical = match canonical_result {
Ok(d) => d,
Err(e) if code_lang == "shell"
|| matches!(code_lang, "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod") =>
{
return Err(e).context("synthesize_tier2_document failed for tier 2/3 lang");
}
Err(e) => {
// Tier 1 extractor errored — fall back to Tier 3 synthesized doc.
tracing::warn!(
workspace_path = %asset.workspace_path.0,
code_lang = code_lang,
error = %e,
"tier1 extract errored; falling back to tier 3 synthesized doc"
);
chunker_version = CodeTextParagraphV1Chunker.chunker_version();
let tier3_parser_version = ParserVersion("none-v1".to_string());
synthesize_tier2_document(asset, &bytes, code_lang, &tier3_parser_version)
.context("synthesize_tier2_document for tier 3 fallback after extract error")?
}
};
// p10-1b Task D/G/J/L: chunker per-lang.
let chunks = match code_lang {
"rust" => CodeRustAstV1Chunker
// p10-3: track whether the extract stage already fell back to Tier 3.
// Tier 2 langs already have "none-v1" parser_version normally, so exclude them
// from the extract_fell_back guard with the !matches! exclusion.
let extract_fell_back = canonical.parser_version.0 == "none-v1"
&& !matches!(code_lang, "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod" | "shell");
let chunks_result: anyhow::Result<Vec<Chunk>> = if extract_fell_back {
// Tier 1 lang whose extractor errored — go straight to Tier 3 chunker.
CodeTextParagraphV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeRustAstV1Chunker::chunk (code:rust)")?,
"python" => CodePythonAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodePythonAstV1Chunker::chunk (code:python)")?,
"typescript" => CodeTsAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeTsAstV1Chunker::chunk (code:typescript)")?,
"javascript" => CodeJsAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeJsAstV1Chunker::chunk (code:javascript)")?,
"go" => CodeGoAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeGoAstV1Chunker::chunk (code:go)")?,
"java" => CodeJavaAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeJavaAstV1Chunker::chunk (code:java)")?,
"kotlin" => CodeKotlinAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeKotlinAstV1Chunker::chunk (code:kotlin)")?,
// p10-2 Tier 2:
"yaml" => K8sManifestResourceV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::K8sManifestResourceV1Chunker::chunk")?,
"dockerfile" => DockerfileFileV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::DockerfileFileV1Chunker::chunk")?,
"toml" | "json" | "xml" | "groovy" | "go-mod"
=> ManifestFileV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::ManifestFileV1Chunker::chunk")?,
other => anyhow::bail!("unreachable (chunk): {other}"),
.context("kb-chunk::CodeTextParagraphV1Chunker::chunk (tier 3 after extract fallback)")
} else {
match code_lang {
"rust" => CodeRustAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeRustAstV1Chunker::chunk (code:rust)"),
"python" => CodePythonAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodePythonAstV1Chunker::chunk (code:python)"),
"typescript" => CodeTsAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeTsAstV1Chunker::chunk (code:typescript)"),
"javascript" => CodeJsAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeJsAstV1Chunker::chunk (code:javascript)"),
"go" => CodeGoAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeGoAstV1Chunker::chunk (code:go)"),
"java" => CodeJavaAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeJavaAstV1Chunker::chunk (code:java)"),
"kotlin" => CodeKotlinAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeKotlinAstV1Chunker::chunk (code:kotlin)"),
// p10-2 Tier 2:
"yaml" => K8sManifestResourceV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::K8sManifestResourceV1Chunker::chunk"),
"dockerfile" => DockerfileFileV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::DockerfileFileV1Chunker::chunk"),
"toml" | "json" | "xml" | "groovy" | "go-mod" => ManifestFileV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::ManifestFileV1Chunker::chunk"),
// p10-3:
"shell" => CodeTextParagraphV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeTextParagraphV1Chunker::chunk (code:shell)"),
// p10-1D: C + C++ AST chunkers.
"c" => CodeCAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kebab-chunk::CodeCAstV1Chunker::chunk (code:c)"),
"cpp" => CodeCppAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kebab-chunk::CodeCppAstV1Chunker::chunk (code:cpp)"),
other => anyhow::bail!("unreachable (chunk): {other}"),
}
};
// p10-3: Tier 1/2 0-chunk OR error → Tier 3 fallback retry.
// "shell" direct path is already Tier 3 — don't retry-double-up.
let chunks: Vec<Chunk> = match chunks_result {
Ok(v) if !v.is_empty() => v,
other if code_lang == "shell" => other?, // shell propagates directly
Ok(_empty) => {
tracing::warn!(
workspace_path = %asset.workspace_path.0,
code_lang = code_lang,
"tier1/2 emitted 0 chunks; falling back to tier 3 (code-text-paragraph-v1)"
);
chunker_version = CodeTextParagraphV1Chunker.chunker_version();
canonical.parser_version = ParserVersion("none-v1".to_string());
CodeTextParagraphV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeTextParagraphV1Chunker::chunk (tier 3 fallback)")?
}
Err(e) => {
tracing::warn!(
workspace_path = %asset.workspace_path.0,
code_lang = code_lang,
error = %e,
"tier1/2 chunker errored; falling back to tier 3 (code-text-paragraph-v1)"
);
chunker_version = CodeTextParagraphV1Chunker.chunker_version();
canonical.parser_version = ParserVersion("none-v1".to_string());
CodeTextParagraphV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeTextParagraphV1Chunker::chunk (tier 3 fallback after error)")?
}
};
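
The four arms of the match above reduce to a small decision table. The sketch below models it with stand-in types: `ChunkPlan` and `plan_chunks` are hypothetical names, chunks are plain strings, and errors are `String` rather than `anyhow::Error`.

```rust
/// Hypothetical outcome labels for the chunk stage.
#[derive(Debug, PartialEq)]
enum ChunkPlan {
    /// Use the chunks the primary chunker produced.
    Keep(Vec<String>),
    /// Retry with the paragraph chunker (code-text-paragraph-v1).
    FallbackToTier3,
}

fn plan_chunks(
    code_lang: &str,
    result: Result<Vec<String>, String>,
) -> Result<ChunkPlan, String> {
    match result {
        // Non-empty output from any tier is used as-is.
        Ok(v) if !v.is_empty() => Ok(ChunkPlan::Keep(v)),
        // shell is already on the Tier 3 path: propagate empty output or the
        // error instead of retrying the same chunker.
        other if code_lang == "shell" => other.map(ChunkPlan::Keep),
        // Tier 1/2 produced nothing or failed: fall back to Tier 3.
        Ok(_) | Err(_) => Ok(ChunkPlan::FallbackToTier3),
    }
}
```

Arm order matters here: the `shell` guard has to come before the generic empty/error arm, otherwise a shell file would retry the chunker it just ran.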
// Stamp chunker + embedding versions so incremental skip detection has
@@ -2101,11 +2325,15 @@ fn synthesize_tier2_document(
},
];
// Resolve abs path for repo detection (mirrors RustAstExtractor pattern).
let workspace_root = std::path::PathBuf::new(); // not needed for detect_repo walk
// Resolve absolute path for repo detection. FsSourceConnector always
// emits absolute paths in SourceUri::File (verified in connector.rs); Kb
// URIs were rejected earlier in ingest_one_code_asset (returns Skipped),
// so the fallback below is purely defensive. This does NOT mirror
// RustAstExtractor — that extractor joins ctx.workspace_root for relative
// paths, but Tier 2 trusts the connector invariant.
let abs_path = match &asset.source_uri {
kebab_core::SourceUri::File(p) => p.clone(),
kebab_core::SourceUri::Kb(_) => workspace_root,
kebab_core::SourceUri::Kb(_) => std::path::PathBuf::new(),
};
let (repo, git_branch, git_commit) = match kebab_parse_code::detect_repo(&abs_path) {
Some(r) => (Some(r.name), r.branch, r.commit),
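
Repo detection is described elsewhere in this change as a `.git/` walk-up that yields the directory name. The sketch below shows one way such a walk can work using only the standard library. It is an assumption-laden illustration, not the body of `kebab_parse_code::detect_repo`: `repo_name_by_walk_up` is a hypothetical helper and it ignores the branch and commit that the real function also returns.

```rust
use std::path::Path;

/// Walk up from `start` and return the name of the nearest ancestor
/// directory that contains a `.git` entry. Hypothetical helper.
fn repo_name_by_walk_up(start: &Path) -> Option<String> {
    start
        .ancestors()
        // `ancestors()` yields `start` itself first, then each parent in turn.
        .find(|dir| dir.join(".git").exists())
        // An empty or root path has no file name, so the result is None.
        .and_then(|dir| dir.file_name())
        .map(|name| name.to_string_lossy().into_owned())
}
```

In this sketch an empty `PathBuf`, like the one the `SourceUri::Kb` arm falls back to, can never produce a repo name because the empty path has no final component.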


@@ -63,14 +63,26 @@ pub struct Stats {
/// p9-fb-37: docs whose `updated_at` exceeds the staleness threshold.
#[serde(default)]
pub stale_doc_count: u64,
/// p10-1A-1: code language breakdown (chunk counts by canonical lowercase
/// language identifier). Empty until 1A-2 produces code chunks.
/// p10-1A-1: code language breakdown (**doc** counts by canonical
/// lowercase language identifier). Empty until 1A-2 produces code
/// docs. v0.17.0 PR-C: doc-count semantics corrected here (the
/// previous "chunk counts" wording was a longstanding mis-label —
/// implementation has always been `COUNT(*) FROM documents
/// GROUP BY code_lang`). Use `code_lang_chunk_breakdown` for the
/// chunk-level companion.
#[serde(default)]
pub code_lang_breakdown: std::collections::BTreeMap<String, u32>,
/// p10-1A-1: repo breakdown (chunk counts by `metadata.repo` value).
/// Empty until 1A-2 produces code chunks.
/// p10-1A-1: repo breakdown (**doc** counts by `metadata.repo`
/// value). Empty until 1A-2 produces code docs. v0.17.0 PR-C:
/// doc-count wording corrected (mirror of code_lang_breakdown).
#[serde(default)]
pub repo_breakdown: std::collections::BTreeMap<String, u32>,
/// v0.17.0 PR-C: sister of [`Self::code_lang_breakdown`] returning
/// chunk counts instead of doc counts. Indexing-pressure metric —
/// one PDF spec → 200 chunks vs one Rust file → 5 chunks shows up
/// here in a way `code_lang_breakdown` (doc count) hides.
#[serde(default)]
pub code_lang_chunk_breakdown: std::collections::BTreeMap<String, u32>,
}
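
The difference between `code_lang_breakdown` (doc counts) and `code_lang_chunk_breakdown` (chunk counts) can be illustrated in memory. This is a sketch only: the real values come from SQL aggregation in the store, and `DocRow`, `breakdowns`, and the choice to leave out rows without a language are assumptions.

```rust
use std::collections::BTreeMap;

/// Hypothetical flattened row: one indexed document, its language (if it is
/// a code document), and how many chunks it produced.
struct DocRow {
    code_lang: Option<&'static str>,
    chunk_count: u32,
}

/// Returns (doc counts per language, chunk counts per language).
fn breakdowns(docs: &[DocRow]) -> (BTreeMap<String, u32>, BTreeMap<String, u32>) {
    let mut by_doc: BTreeMap<String, u32> = BTreeMap::new();
    let mut by_chunk: BTreeMap<String, u32> = BTreeMap::new();
    for d in docs {
        // Assumption: documents without a code language are left out.
        if let Some(lang) = d.code_lang {
            *by_doc.entry(lang.to_string()).or_insert(0) += 1;
            *by_chunk.entry(lang.to_string()).or_insert(0) += d.chunk_count;
        }
    }
    (by_doc, by_chunk)
}
```

Two languages with the same number of files can differ widely in the second map, which is the indexing-pressure signal the new field is meant to expose.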
const KEBAB_VERSION: &str = env!("CARGO_PKG_VERSION");
@@ -171,6 +183,9 @@ fn collect_stats(
// p10-1A-2 follow-up: dogfooding (2026-05-20) revealed this was a
// placeholder — mirror of code_lang_breakdown for the repo field.
repo_breakdown: store.repo_breakdown()?,
// v0.17.0 PR-C: chunk-level companion (closes HOTFIXES
// 2026-05-22 "code_lang_breakdown chunk granularity" LOW).
code_lang_chunk_breakdown: store.code_lang_chunk_breakdown()?,
})
}
@@ -210,6 +225,11 @@ mod tests_stats_ext {
v.get("repo_breakdown").is_some(),
"Stats JSON must include repo_breakdown: {v}"
);
// v0.17.0 PR-C: chunk-level companion field.
assert!(
v.get("code_lang_chunk_breakdown").is_some(),
"Stats JSON must include code_lang_chunk_breakdown (v0.17.0 PR-C): {v}"
);
// Empty BTreeMap serializes as `{}` — confirm it's an object, not null.
assert!(
v["code_lang_breakdown"].is_object(),
@@ -219,6 +239,10 @@ mod tests_stats_ext {
v["repo_breakdown"].is_object(),
"repo_breakdown must be an object: {v}"
);
assert!(
v["code_lang_chunk_breakdown"].is_object(),
"code_lang_chunk_breakdown must be an object: {v}"
);
}
#[test]


@@ -850,6 +850,181 @@ fn tier2_cargo_toml_ingest_searchable() {
);
}
/// p10-3 Task E: a `.sh` file is ingested via the shell direct-Tier-3 path
/// and the resulting `Citation::Code` hit must carry `lang="shell"`,
/// `symbol=None`, `line_start >= 1`, and
/// `chunker_version = "code-text-paragraph-v1"`.
#[test]
fn tier3_shell_ingest_searchable() {
let env = TestEnv::lexical_only();
std::fs::write(
env.workspace_root.join("deploy.sh"),
"#!/usr/bin/env bash\nset -e\necho hello\n\nkebab ingest --json\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
assert!(report.new >= 1, "shell file ingested: {report:?}");
let sh_item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("deploy.sh"))
.expect("deploy.sh item present");
assert_eq!(
sh_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("none-v1"),
"parser_version must be none-v1 for shell (Tier 3 direct)"
);
assert_eq!(
sh_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-text-paragraph-v1"),
"chunker_version must be code-text-paragraph-v1 for shell"
);
let query = kebab_core::SearchQuery {
text: "kebab".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 10,
filters: kebab_core::SearchFilters {
code_lang: vec!["shell".to_string()],
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, Citation::Code { .. }))
.expect("at least one Citation::Code hit for 'kebab'");
match &h.citation {
Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(
lang.as_deref(),
Some("shell"),
"citation.lang must be 'shell'"
);
assert_eq!(*symbol, None, "Tier 3 symbol must be None");
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("shell"),
"SearchHit.code_lang must be 'shell'"
);
assert_eq!(
h.chunker_version.0.as_str(),
"code-text-paragraph-v1",
"shell chunks must be stamped with the Tier 3 chunker_version"
);
}
/// p10-3 Task E: a docker-compose-shaped YAML file (no `apiVersion`/`kind`)
/// is ingested; the k8s chunker returns `Ok(vec![])` and the Tier 3 fallback
/// wrapper retries with `CodeTextParagraphV1Chunker`. The resulting
/// `Citation::Code` hit must carry `lang="yaml"`, `symbol=None`,
/// `line_start >= 1`, and `chunker_version = "code-text-paragraph-v1"`.
#[test]
fn tier3_yaml_fallback_picks_up_non_k8s_yaml() {
let env = TestEnv::lexical_only();
// docker-compose-shaped YAML — version + services but no apiVersion/kind.
// The k8s chunker returns Ok(vec![]); Tier 3 fallback should pick this up.
std::fs::write(
env.workspace_root.join("docker-compose.yml"),
"version: '3'\nservices:\n api:\n image: nginx:latest\n ports:\n - 8080:80\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
assert!(
report.new >= 1,
"expected non-k8s yaml ingested via Tier 3, got {} new docs",
report.new
);
let yaml_item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("docker-compose.yml"))
.expect("docker-compose.yml item present");
assert_eq!(
yaml_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("none-v1"),
"parser_version must be none-v1 after Tier 3 fallback"
);
assert_eq!(
yaml_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-text-paragraph-v1"),
"chunker_version must be code-text-paragraph-v1 after Tier 3 fallback"
);
let query = kebab_core::SearchQuery {
text: "nginx".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 10,
filters: kebab_core::SearchFilters {
code_lang: vec!["yaml".to_string()],
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, Citation::Code { .. }))
.expect("at least one Citation::Code hit for 'nginx'");
match &h.citation {
Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(
lang.as_deref(),
Some("yaml"),
"citation.lang must be 'yaml'"
);
assert_eq!(*symbol, None, "Tier 3 fallback symbol must be None");
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("yaml"),
"SearchHit.code_lang must be 'yaml'"
);
assert_eq!(
h.chunker_version.0.as_str(),
"code-text-paragraph-v1",
"non-k8s yaml fallback must be stamped code-text-paragraph-v1"
);
}
/// Re-ingesting the same `.rs` file without changes must report
/// `Unchanged` (incremental-skip path exercised).
#[test]
@@ -889,3 +1064,328 @@ fn rust_file_re_ingest_is_unchanged() {
);
assert_eq!(item2.doc_id, item1.doc_id);
}
/// p10-3 fix regression: a docker-compose YAML that falls back to Tier 3
/// (k8s chunker returns empty, CodeTextParagraphV1Chunker retries) must
/// report Unchanged on the second ingest rather than re-processing.
/// Before the fix, try_skip_unchanged returned None because the stored
/// last_chunker_version ("code-text-paragraph-v1" / parser_version
/// "none-v1") never matched the caller's dispatch values.
#[test]
fn tier3_yaml_fallback_reingest_is_unchanged() {
let env = TestEnv::lexical_only();
std::fs::write(
env.workspace_root.join("docker-compose.yml"),
"version: '3'\nservices:\n api:\n image: nginx:latest\n",
)
.unwrap();
let report1 =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("first ingest");
let item1 = report1
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("docker-compose.yml"))
.expect("docker-compose.yml in first report");
assert!(
matches!(item1.kind, IngestItemKind::New),
"first ingest must be New, got {:?}", item1.kind
);
assert_eq!(
item1.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-text-paragraph-v1"),
"first ingest must use Tier 3 fallback chunker"
);
let report2 =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("second ingest");
let item2 = report2
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("docker-compose.yml"))
.expect("docker-compose.yml in second report");
assert!(
matches!(item2.kind, IngestItemKind::Unchanged),
"second ingest must be Unchanged, got {:?}", item2.kind
);
}
/// p10-1d Task G: a `.c` file with a single top-level function is ingested
/// and the resulting `Citation::Code` hit must carry `lang="c"`,
/// `symbol="parse_record"` (function name only — no nesting in C), and
/// `chunker_version = "code-c-ast-v1"`.
#[test]
fn tier1_c_ingest_searchable() {
let env = TestEnv::lexical_only();
std::fs::write(
env.workspace_root.join("parser.c"),
"#include <stdio.h>\n\nint parse_record(const char *line) {\n if (line == NULL) return -1;\n return 0;\n}\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
assert!(report.new >= 1, "c file ingested: {report:?}");
let c_item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("parser.c"))
.expect("parser.c item present");
assert_eq!(
c_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("code-c-v2"),
"parser_version must be code-c-v2 (v0.17.0 PR-B: typedef-wrapped struct/enum/union is emitted as a typedef alias unit)"
);
assert_eq!(
c_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-c-ast-v1"),
"chunker_version must be code-c-ast-v1"
);
let query = kebab_core::SearchQuery {
text: "parse_record".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 10,
filters: kebab_core::SearchFilters {
code_lang: vec!["c".to_string()],
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, Citation::Code { .. }))
.expect("at least one Citation::Code hit for 'parse_record'");
match &h.citation {
Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(lang.as_deref(), Some("c"), "citation.lang must be 'c'");
assert_eq!(
symbol.as_deref(),
Some("parse_record"),
"C symbol must be function name only (no nesting)"
);
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("c"),
"SearchHit.code_lang must be 'c'"
);
assert_eq!(
h.chunker_version.0.as_str(),
"code-c-ast-v1",
"C chunks must be stamped with code-c-ast-v1"
);
}
/// p10-1d Task G: a `.cpp` file with nested namespace + class is ingested
/// and the resulting `Citation::Code` hit must carry `lang="cpp"`, a
/// `symbol` that starts with `"kebab::chunk::Foo"` (namespace::Class or
/// namespace::Class::method), and `chunker_version = "code-cpp-ast-v1"`.
#[test]
fn tier1_cpp_ingest_searchable() {
let env = TestEnv::lexical_only();
std::fs::write(
env.workspace_root.join("chunker.cpp"),
"namespace kebab {\nnamespace chunk {\nclass Foo {\npublic:\n void bar() { /* impl */ }\n};\n}\n}\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
assert!(report.new >= 1, "cpp file ingested: {report:?}");
let cpp_item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("chunker.cpp"))
.expect("chunker.cpp item present");
assert_eq!(
cpp_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("code-cpp-v1"),
"parser_version must be code-cpp-v1"
);
assert_eq!(
cpp_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-cpp-ast-v1"),
"chunker_version must be code-cpp-ast-v1"
);
let query = kebab_core::SearchQuery {
text: "bar".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 10,
filters: kebab_core::SearchFilters {
code_lang: vec!["cpp".to_string()],
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, Citation::Code { .. }))
.expect("at least one Citation::Code hit for 'bar'");
match &h.citation {
Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(lang.as_deref(), Some("cpp"), "citation.lang must be 'cpp'");
// Symbol could be "kebab::chunk::Foo" (class) or "kebab::chunk::Foo::bar"
// (method) depending on which chunk ranks first.
assert!(
symbol.as_deref().is_some_and(|s| s.starts_with("kebab::chunk::Foo")),
"C++ symbol must start with namespace::Class prefix, got {:?}", symbol
);
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("cpp"),
"SearchHit.code_lang must be 'cpp'"
);
assert_eq!(
h.chunker_version.0.as_str(),
"code-cpp-ast-v1",
"C++ chunks must be stamped with code-cpp-ast-v1"
);
}
/// P10 dogfood regression: a k8s YAML with 2 documents (Deployment + Service
/// separated by `---`) must ingest without a UNIQUE constraint violation.
/// Before the fix, push_chunks_with_oversize emitted split_key=None for each
/// resource, giving every resource chunk the same id_hash → identical chunk_id
/// → SQLite UNIQUE constraint failure on the second resource.
#[test]
fn tier2_k8s_multi_resource_yaml_ingests_without_collision() {
let env = TestEnv::lexical_only();
let k8s_dir = env.workspace_root.join("k8s");
std::fs::create_dir_all(&k8s_dir).unwrap();
std::fs::write(
k8s_dir.join("k8s-multi.yaml"),
"apiVersion: apps/v1\nkind: Deployment\nmetadata:\n name: api\n namespace: prod\nspec:\n replicas: 2\n---\napiVersion: v1\nkind: Service\nmetadata:\n name: api\n namespace: prod\nspec:\n selector:\n app: api\n",
)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
// The bug: this would land in report with an error + UNIQUE constraint message.
let item = report
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("k8s-multi.yaml"))
.expect("k8s-multi.yaml in report");
assert!(
item.error.is_none(),
"multi-resource k8s yaml must ingest without error, got: {:?}",
item.error
);
assert!(
matches!(item.kind, IngestItemKind::New),
"expected New, got {:?}",
item.kind
);
// Both resources must be searchable (≥2 hits: Deployment/prod/api + Service/prod/api).
let query = kebab_core::SearchQuery {
text: "api".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 10,
filters: kebab_core::SearchFilters {
code_lang: vec!["yaml".to_string()],
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
assert!(
hits.len() >= 2,
"expected ≥2 hits (Deployment + Service), got {}",
hits.len()
);
}
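
The collision described in the doc comment of `tier2_k8s_multi_resource_yaml_ingests_without_collision` can be illustrated with a minimal sketch. It reuses the `#L{k}` salt format that `make_chunk` applies in the C/C++ chunkers later in this diff, and assumes (as the doc comment implies) that every other input to the chunk id is identical for both resources. The helper names `id_hash` and `distinct_ids` and the sample start lines are illustrative only; the k8s chunker's actual fix is not shown in this section.

```rust
use std::collections::HashSet;

/// Illustrative only: derive the per-chunk hash input the way `make_chunk`
/// does in the C/C++ chunkers (base policy hash, optionally salted with a
/// start line).
fn id_hash(base_policy_hash: &str, split_key: Option<u32>) -> String {
    match split_key {
        Some(k) => format!("{base_policy_hash}#L{k}"),
        None => base_policy_hash.to_string(),
    }
}

/// Count how many distinct hash inputs a set of split keys produces.
/// Fewer distinct values than keys means colliding chunk ids (assuming
/// all other id inputs are equal).
fn distinct_ids(base_policy_hash: &str, keys: &[Option<u32>]) -> usize {
    keys.iter()
        .map(|k| id_hash(base_policy_hash, *k))
        .collect::<HashSet<_>>()
        .len()
}
```

With `split_key = None` for both the Deployment and the Service, the two hash inputs are the same string, which is the shape of the UNIQUE constraint failure; salting each resource with its own start line (for example 1 and 9) keeps them apart.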
/// p10-3 fix regression: a shell file (direct Tier 3, not a fallback)
/// must also report Unchanged on re-ingest. Shell goes straight to
/// CodeTextParagraphV1Chunker so `stored_is_tier3_fallback` is false
/// (parser_version is "none-v1" and chunker matches the current dispatch),
/// but the normal equality path should pass regardless.
#[test]
fn tier3_shell_reingest_is_unchanged() {
let env = TestEnv::lexical_only();
std::fs::write(
env.workspace_root.join("deploy.sh"),
"#!/usr/bin/env bash\nset -e\necho hello\n",
)
.unwrap();
let report1 =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("first ingest");
let item1 = report1
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("deploy.sh"))
.expect("deploy.sh in first report");
assert!(
matches!(item1.kind, IngestItemKind::New),
"first ingest must be New, got {:?}", item1.kind
);
let report2 =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("second ingest");
let item2 = report2
.items
.as_ref()
.expect("items present")
.iter()
.find(|i| i.doc_path.0.ends_with("deploy.sh"))
.expect("deploy.sh in second report");
assert!(
matches!(item2.kind, IngestItemKind::Unchanged),
"shell reingest must be Unchanged, got {:?}", item2.kind
);
}


@@ -38,12 +38,16 @@ fn fetch_chunk_returns_target_only_when_no_context() {
#[test]
fn fetch_chunk_with_context_returns_neighbors() {
let env = common::TestEnv::new();
let body = "# H1\n\nA1\n\n# H2\n\nA2\n\n# H3\n\nA3\n\n# H4\n\nA4\n\n# H5\n\nA5\n";
// v0.17.0 trigram tokenizer: terms must be ≥3 Unicode chars to
// match. The earlier fixture used 2-char tokens like `A1`/`A3` for
// section bodies, which get zero hits under trigram. Use a unique word
// of at least 5 chars per section so the query can pin one chunk deterministically.
let body = "# H1\n\napples\n\n# H2\n\nbanana\n\n# H3\n\ncherry\n\n# H4\n\ndurian\n\n# H5\n\nelder\n";
common::ingest_md(&env, "multi.md", body);
let app = env.app();
let q = kebab_core::SearchQuery {
text: "A3".to_string(),
text: "cherry".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 1,
filters: kebab_core::SearchFilters::default(),


@@ -46,3 +46,88 @@ fn korean_lexical_query_returns_korean_document() {
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
);
}
/// A4 Step 1c — multi-token Korean query (`해시 충돌`) must hit when
/// the lexical builder routes it through a whole-phrase MATCH candidate.
///
/// Expected: FAIL until A5 (`build_match_string` redesign) lands. The
/// current builder emits `"해시" "충돌"` (an AND of the two tokens), but
/// the FTS5 trigram tokenizer cannot match terms shorter than 3 chars,
/// so each side gets zero hits. A5 introduces a whole-phrase candidate
/// (`"해시 충돌"`) OR'd with the token AND, restoring hits for the
/// dominant Korean usage pattern.
#[test]
fn lexical_multi_token_korean_query_hits() {
let env = TestEnv::lexical_only();
// Copy the synthetic Korean fixture (introduced in A4 Step 0) into
// the test workspace. The fixture contains the exact phrase
// "해시 충돌" multiple times.
let dest = env.workspace_root.join("hash-table.md");
let src = std::path::PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("..")
.join("..")
.join("fixtures")
.join("search")
.join("korean")
.join("hash-table.md");
std::fs::copy(&src, &dest).expect("copy korean fixture");
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true)
.expect("ingest must succeed");
let hits = kebab_app::search_with_config(
env.config.clone(),
common::lexical_query("해시 충돌"),
)
.expect("search must succeed");
assert!(
!hits.is_empty(),
"multi-token Korean query '해시 충돌' must hit the hash-table fixture; got {:?}",
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
);
let any_hash_table = hits.iter().any(|h| h.doc_path.0.contains("hash-table"));
assert!(
any_hash_table,
"expected at least one hit on the hash-table fixture, got: {:?}",
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
);
}
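
Why the 2-char tokens get zero hits while the whole phrase can match is easier to see with a rough sketch of trigram generation. This is a simplification under stated assumptions: it slides a 3-character window over Unicode scalar values and ignores the case folding that SQLite's trigram tokenizer applies by default. The function name `trigrams` is hypothetical and is not part of kebab.

```rust
/// Simplified model of a trigram tokenizer: every contiguous run of three
/// characters (whitespace included) becomes one token. Inputs shorter than
/// three characters yield no tokens at all.
fn trigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    chars
        .windows(3)
        .map(|w| w.iter().collect::<String>())
        .collect()
}
```

Under this model `해시` and `충돌` each produce no trigram, so neither can be looked up on its own, whereas the 5-character phrase `해시 충돌` produces three trigrams that span the space between the words.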
/// A4 Step 1c — mixed Korean+English multi-token query (`Rust 충돌은`).
/// Both tokens are ≥3 chars, so the redesigned builder (A5) emits
/// `("Rust 충돌은") OR ("Rust" AND "충돌은")`. With trigram tokenizer
/// each side has substring coverage in the document, so the AND branch
/// alone is enough. Expected: FAIL pre-A5, PASS post-A5.
#[test]
fn lexical_mixed_korean_english_multi_token_query_hits() {
let env = TestEnv::lexical_only();
let doc_path = env.workspace_root.join("rust-hash.md");
std::fs::write(
&doc_path,
"# Rust 해시 테이블\n\nRust 의 std::collections::HashMap 에서 \
해시 충돌은 SipHash 로 완화한다.\n",
)
.expect("write rust-hash fixture");
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true)
.expect("ingest must succeed");
let hits = kebab_app::search_with_config(
env.config.clone(),
common::lexical_query("Rust 충돌은"),
)
.expect("search must succeed");
assert!(
!hits.is_empty(),
"mixed Korean+English multi-token query 'Rust 충돌은' must hit the rust-hash fixture; got {:?}",
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
);
let any_rust_hash = hits.iter().any(|h| h.doc_path.0.contains("rust-hash"));
assert!(
any_rust_hash,
"expected at least one hit on the rust-hash fixture, got: {:?}",
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
);
}
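
The match-string shape named in the two doc comments above, a whole-phrase candidate OR'd with a per-token AND, can be sketched as follows. This is not the project's `build_match_string`, whose source is not part of this section; `sketch_match_string` and `fts5_quote` are hypothetical names, and whitespace tokenization is an assumption.

```rust
/// Quote a term as an FTS5 string literal, doubling embedded double quotes.
fn fts5_quote(term: &str) -> String {
    format!("\"{}\"", term.replace('"', "\"\""))
}

/// Hypothetical builder: for multi-token queries, emit
/// `("<whole phrase>") OR ("<t1>" AND "<t2>" ...)`; a single token is
/// just quoted, and an empty query yields an empty string.
fn sketch_match_string(query: &str) -> String {
    let tokens: Vec<&str> = query.split_whitespace().collect();
    match tokens.as_slice() {
        [] => String::new(),
        [only] => fts5_quote(only),
        many => {
            let phrase = fts5_quote(&many.join(" "));
            let all = many
                .iter()
                .map(|t| fts5_quote(t))
                .collect::<Vec<_>>()
                .join(" AND ");
            format!("({phrase}) OR ({all})")
        }
    }
}
```

For `Rust 충돌은` this yields `("Rust 충돌은") OR ("Rust" AND "충돌은")`, the form quoted in the doc comment. For `해시 충돌` only the phrase branch can contribute hits, since each token alone is shorter than a trigram.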


@@ -16,12 +16,13 @@ tracing = { workspace = true }
serde_yaml = { workspace = true }
[dev-dependencies]
# kb-parse-md / kb-normalize are dev-only — used by the snapshot integration
# test to build a CanonicalDocument from a fixture Markdown file. Forbidden as
# regular deps per design §8 (chunker consumes CanonicalDocument from kb-core
# only); `cargo tree -p kb-chunk --depth 1` (default scope, excludes dev-deps)
# confirms this.
kebab-parse-md = { path = "../kebab-parse-md" }
kebab-normalize = { path = "../kebab-normalize" }
serde_json = { workspace = true }
time = { workspace = true }
# kb-parse-md / kb-normalize / kb-parse-code are dev-only — used by the
# snapshot integration tests to build a CanonicalDocument from fixture files.
# Forbidden as regular deps per design §8 (chunker consumes CanonicalDocument
# from kb-core only); `cargo tree -p kb-chunk --depth 1` (default scope,
# excludes dev-deps) confirms this.
kebab-parse-md = { path = "../kebab-parse-md" }
kebab-parse-code = { path = "../kebab-parse-code" }
kebab-normalize = { path = "../kebab-normalize" }
serde_json = { workspace = true }
time = { workspace = true }


@@ -0,0 +1,322 @@
//! `code-c-ast-v1` — maps a tree-sitter-derived C AST
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
//!
//! tree-sitter is intentionally NOT a dependency here: AST work is
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
//! consumes the `CanonicalDocument`.
//!
//! `AST_CHUNK_MAX_LINES` is a constant matching
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
//! config threading needs a chunker registry (P+); same deviation
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
//! (`tasks/HOTFIXES.md`).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "code-c-ast-v1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
const AST_CHUNK_MAX_LINES: u32 = 200;
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeCAstV1Chunker;
impl Chunker for CodeCAstV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeCAstV1Chunker only handles code docs (got non-Code block)"
),
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
"CodeCAstV1Chunker only handles code docs (got non-Code source_span)"
);
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let mut out: Vec<Chunk> = Vec::new();
for b in &doc.blocks {
let cb = match b {
Block::Code(c) => c,
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
let span_lines = le.saturating_sub(ls) + 1;
if span_lines <= AST_CHUNK_MAX_LINES {
let span = SourceSpan::Code {
line_start: ls,
line_end: le,
symbol: symbol.clone(),
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
let n = parts.len();
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
symbol: part_sym,
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
));
}
}
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = out.len(),
"code-c-ast-v1 chunked",
);
Ok(out)
}
}
#[allow(clippy::too_many_arguments)]
fn make_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
block_ids: &[BlockId],
base_policy_hash: &str,
split_key: Option<u32>,
span: SourceSpan,
text: String,
) -> Chunk {
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
text,
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}
/// Split an oversize unit into windows of at most `AST_CHUNK_MAX_LINES`
/// lines. Each non-final window ends at the last blank line found in its
/// final fifth; if there is none, it is cut hard at the limit.
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
/// 0-based within the unit (caller adds the unit's absolute `line_start`).
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
let lines: Vec<&str> = code.split('\n').collect();
let total = lines.len() as u32;
let mut out: Vec<(u32, u32, String)> = Vec::new();
let mut start: u32 = 0;
while start < total {
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
if end < total {
if let Some(b) = (floor.min(end)..end)
.rev()
.find(|&i| lines[i as usize].trim().is_empty())
{
end = b + 1;
}
}
let text = lines[start as usize..end as usize].join("\n");
out.push((start, end.saturating_sub(1), text));
start = end;
}
if out.is_empty() {
out.push((0, total.saturating_sub(1), code.to_string()));
}
out
}
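
The windowing in `split_oversize` is easier to follow with a small limit. The sketch below mirrors the same loop but takes the limit as a parameter and returns only the 0-based line offsets; the name `split_at_blank_lines` and the `max_lines` parameter are illustrative and do not exist in the crate.

```rust
/// Illustrative re-statement of the oversize split: take up to `max_lines`
/// lines, and for every window except the last, pull the end back to the
/// last blank line found in the window's final fifth (if any).
/// Returns inclusive `(start, end)` line offsets, 0-based.
fn split_at_blank_lines(code: &str, max_lines: usize) -> Vec<(usize, usize)> {
    assert!(max_lines > 0, "max_lines must be positive");
    let lines: Vec<&str> = code.split('\n').collect();
    let total = lines.len();
    let mut out = Vec::new();
    let mut start = 0;
    while start < total {
        let mut end = (start + max_lines).min(total);
        // Only the last 20% of the window is searched for a blank line.
        let floor = start + max_lines * 4 / 5;
        if end < total {
            if let Some(b) = (floor.min(end)..end)
                .rev()
                .find(|&i| lines[i].trim().is_empty())
            {
                end = b + 1; // the blank line stays with the earlier part
            }
        }
        out.push((start, end - 1));
        start = end;
    }
    out
}
```

With 15 lines, a blank line at offset 8 and `max_lines = 10`, the first window would end at offset 9, but the blank line inside offsets 8..10 pulls the cut back, giving parts `(0, 8)` and `(9, 14)`.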
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
};
use time::OffsetDateTime;
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
let wp = WorkspacePath("crates/x/src/a.c".into());
let aid = AssetId("a".repeat(64));
let pv = ParserVersion("code-c-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let blocks = units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("c".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
lang: Some("c".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("c".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
}
#[test]
fn chunker_version_is_code_c_ast_v1() {
assert_eq!(CodeCAstV1Chunker.chunker_version(),
ChunkerVersion("code-c-ast-v1".into()));
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "int parse() {\n\t// x\n}"),
("print", 5, 7, "void print() {\n\t//\n\treturn;\n}"),
]);
let chunks = CodeCAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
for c in &chunks {
assert_eq!(c.source_spans.len(), 1);
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.chunker_version.0, "code-c-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
_ => unreachable!(),
}
}
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!("\tx{i} = {i};\n")).collect::<Vec<_>>().join("");
let code = format!("int big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeCAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort(); ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
#[test]
fn non_code_doc_errors() {
use kebab_core::TextBlock;
let mut doc = code_doc(&[("parse", 1, 1, "int parse() {}")]);
doc.blocks = vec![Block::Paragraph(TextBlock {
common: CommonBlock {
block_id: kebab_core::BlockId("b".into()),
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
})];
let err = CodeCAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeCAstV1Chunker"));
}
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "int parse() {}\n")]);
let base: Vec<String> = CodeCAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
for _ in 0..1000 {
let again: Vec<String> = CodeCAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
assert_eq!(again, base);
}
}
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeCAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
}
}


@@ -0,0 +1,322 @@
//! `code-cpp-ast-v1` — maps a tree-sitter-derived C++ AST
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
//!
//! tree-sitter is intentionally NOT a dependency here: AST work is
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
//! consumes the `CanonicalDocument`.
//!
//! `AST_CHUNK_MAX_LINES` is a constant matching
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
//! config threading needs a chunker registry (P+); same deviation
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
//! (`tasks/HOTFIXES.md`).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "code-cpp-ast-v1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
const AST_CHUNK_MAX_LINES: u32 = 200;
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeCppAstV1Chunker;
impl Chunker for CodeCppAstV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeCppAstV1Chunker only handles code docs (got non-Code block)"
),
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
"CodeCppAstV1Chunker only handles code docs (got non-Code source_span)"
);
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let mut out: Vec<Chunk> = Vec::new();
for b in &doc.blocks {
let cb = match b {
Block::Code(c) => c,
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
let span_lines = le.saturating_sub(ls) + 1;
if span_lines <= AST_CHUNK_MAX_LINES {
let span = SourceSpan::Code {
line_start: ls,
line_end: le,
symbol: symbol.clone(),
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
let n = parts.len();
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
symbol: part_sym,
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
));
}
}
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = out.len(),
"code-cpp-ast-v1 chunked",
);
Ok(out)
}
}
#[allow(clippy::too_many_arguments)]
fn make_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
block_ids: &[BlockId],
base_policy_hash: &str,
split_key: Option<u32>,
span: SourceSpan,
text: String,
) -> Chunk {
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
text,
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}
/// Split an oversize unit into windows of at most `AST_CHUNK_MAX_LINES`
/// lines. Each non-final window ends at the last blank line found in its
/// final fifth; if there is none, it is cut hard at the limit.
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
/// 0-based within the unit (caller adds the unit's absolute `line_start`).
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
let lines: Vec<&str> = code.split('\n').collect();
let total = lines.len() as u32;
let mut out: Vec<(u32, u32, String)> = Vec::new();
let mut start: u32 = 0;
while start < total {
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
if end < total {
if let Some(b) = (floor.min(end)..end)
.rev()
.find(|&i| lines[i as usize].trim().is_empty())
{
end = b + 1;
}
}
let text = lines[start as usize..end as usize].join("\n");
out.push((start, end.saturating_sub(1), text));
start = end;
}
if out.is_empty() {
out.push((0, total.saturating_sub(1), code.to_string()));
}
out
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
};
use time::OffsetDateTime;
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
let wp = WorkspacePath("crates/x/src/a.cpp".into());
let aid = AssetId("a".repeat(64));
let pv = ParserVersion("code-cpp-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let blocks = units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("cpp".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
lang: Some("cpp".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("cpp".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
}
#[test]
fn chunker_version_is_code_cpp_ast_v1() {
assert_eq!(CodeCppAstV1Chunker.chunker_version(),
ChunkerVersion("code-cpp-ast-v1".into()));
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "int parse() {\n\t// x\n}"),
("print", 5, 7, "void print() {\n\t//\n\treturn;\n}"),
]);
let chunks = CodeCppAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
for c in &chunks {
assert_eq!(c.source_spans.len(), 1);
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.chunker_version.0, "code-cpp-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
_ => unreachable!(),
}
}
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!("\tx{i} = {i};\n")).collect::<Vec<_>>().join("");
let code = format!("int big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeCppAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort(); ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
#[test]
fn non_code_doc_errors() {
use kebab_core::TextBlock;
let mut doc = code_doc(&[("parse", 1, 1, "int parse() {}")]);
doc.blocks = vec![Block::Paragraph(TextBlock {
common: CommonBlock {
block_id: kebab_core::BlockId("b".into()),
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
})];
let err = CodeCppAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeCppAstV1Chunker"));
}
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "int parse() {}\n")]);
let base: Vec<String> = CodeCppAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
for _ in 0..1000 {
let again: Vec<String> = CodeCppAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
assert_eq!(again, base);
}
}
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeCppAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
}
}
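
Review note: the greedy blank-line split in `split_oversize` is easier to follow with a small worked example. The sketch below is a stand-alone approximation, not the crate's code: `MAX_LINES = 10` is a hypothetical stand-in for `AST_CHUNK_MAX_LINES` (whose real value is not shown in this diff), and `split_at_blank_lines` / `to_absolute` are illustrative names.

```rust
/// Hypothetical stand-in for `AST_CHUNK_MAX_LINES`; the real value is an assumption.
const MAX_LINES: u32 = 10;

/// Split `code` into parts of at most `MAX_LINES` lines, preferring to cut
/// just after a blank line found in the last fifth of the current window.
/// Returns `(first_offset, last_offset, text)` with 0-based inclusive offsets.
fn split_at_blank_lines(code: &str) -> Vec<(u32, u32, String)> {
    let lines: Vec<&str> = code.split('\n').collect();
    let total = lines.len() as u32;
    let mut out = Vec::new();
    let mut start = 0u32;
    while start < total {
        // Exclusive end of the window, clamped to the input.
        let mut end = (start + MAX_LINES).min(total);
        let floor = start + MAX_LINES * 4 / 5;
        if end < total {
            // Search backwards from the window end down to `floor`.
            if let Some(b) = (floor.min(end)..end)
                .rev()
                .find(|&i| lines[i as usize].trim().is_empty())
            {
                end = b + 1;
            }
        }
        let text = lines[start as usize..end as usize].join("\n");
        out.push((start, end - 1, text));
        start = end;
    }
    out
}

/// Convert a 0-based inclusive offset pair into absolute file line numbers,
/// given the absolute first line of the enclosing unit.
fn to_absolute(unit_line_start: u32, part: (u32, u32)) -> (u32, u32) {
    (unit_line_start + part.0, unit_line_start + part.1)
}
```

With 25 input lines and a single blank line at offset 8, the first part is cut right after the blank line (offsets 0 to 8); the remaining parts fall back to hard cuts at the window size because no blank line sits in the search range.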

@@ -0,0 +1,170 @@
//! p10-3: Tier 3 paragraph + line-window fallback chunker.
//!
//! Splits code/text files on blank-line paragraph boundaries. Paragraphs
//! with more than 80 lines are further split into 80-line windows with a
//! 20-line overlap (stride 60) — the same oversize pattern used by Tier 1/2
//! chunkers but without AST structure, hence no symbol.
//!
//! Per spec §9.3: all emitted chunks carry `symbol: None`.
use crate::tier2_shared::{build_chunk_no_symbol, policy_hash};
use anyhow::Result;
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, ChunkerVersion, Chunker};
pub const VERSION_LABEL: &str = "code-text-paragraph-v1";
/// Lines-per-window for the oversize fallback (Tier 3).
const FALLBACK_LINES_PER_CHUNK: usize = 80;
/// Overlap between consecutive windows.
const FALLBACK_LINES_OVERLAP: usize = 20;
// stride = FALLBACK_LINES_PER_CHUNK - FALLBACK_LINES_OVERLAP = 60.
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeTextParagraphV1Chunker;
impl Chunker for CodeTextParagraphV1Chunker {
    fn chunker_version(&self) -> ChunkerVersion {
        ChunkerVersion(VERSION_LABEL.to_string())
    }
    fn policy_hash(&self, policy: &ChunkPolicy) -> String {
        policy_hash(policy)
    }
    fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
        // Expect a single Block::Code carrying the full source text.
        let (text, lang_str) = match doc.blocks.first() {
            Some(Block::Code(cb)) => (cb.code.as_str(), cb.lang.as_deref().unwrap_or("")),
            _ => return Ok(vec![]),
        };
        let mut chunks = Vec::new();
        for para in split_paragraphs(text) {
            push_paragraph(&mut chunks, doc, policy, &para, lang_str)?;
        }
        tracing::debug!(
            target: "kebab-chunk",
            doc_id = %doc.doc_id,
            chunks = chunks.len(),
            "code-text-paragraph-v1 chunked",
        );
        Ok(chunks)
    }
}
/// A contiguous run of non-blank lines from the source text.
struct Paragraph {
    /// Lines joined with `\n` (no trailing newline).
    text: String,
    /// 1-indexed line number of the first line in the source file.
    line_start: u32,
    /// 1-indexed line number of the last line in the source file.
    line_end: u32,
}
/// Split `text` into `Paragraph`s separated by blank (all-whitespace) lines.
///
/// Blank lines are treated as boundaries and are NOT included in any
/// paragraph's line range. Paragraphs that would consist entirely of blank
/// lines are skipped.
fn split_paragraphs(text: &str) -> Vec<Paragraph> {
    let mut paragraphs = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    let mut current_start: Option<u32> = None;
    for (idx, line) in text.lines().enumerate() {
        let line_no = (idx + 1) as u32;
        let is_blank = line.trim().is_empty();
        if is_blank {
            if let Some(start) = current_start.take() {
                let end = start + current.len() as u32 - 1;
                paragraphs.push(Paragraph {
                    text: current.join("\n"),
                    line_start: start,
                    line_end: end,
                });
                current.clear();
            }
        } else {
            if current_start.is_none() {
                current_start = Some(line_no);
            }
            current.push(line);
        }
    }
    // Flush any trailing paragraph not terminated by a blank line.
    if let Some(start) = current_start {
        let end = start + current.len() as u32 - 1;
        paragraphs.push(Paragraph {
            text: current.join("\n"),
            line_start: start,
            line_end: end,
        });
    }
    paragraphs
}
/// Emit one or more chunks for a single paragraph.
///
/// Paragraphs with ≤ `FALLBACK_LINES_PER_CHUNK` lines become a single chunk.
/// Larger paragraphs are split into overlapping windows of
/// `FALLBACK_LINES_PER_CHUNK` lines with stride `FALLBACK_LINES_PER_CHUNK -
/// FALLBACK_LINES_OVERLAP`. The last window may be shorter. Window starts
/// are passed as `split_key` so `id_for_chunk` can produce distinct ids
/// across windows.
fn push_paragraph(
    out: &mut Vec<Chunk>,
    doc: &CanonicalDocument,
    policy: &ChunkPolicy,
    para: &Paragraph,
    lang: &str,
) -> Result<()> {
    let n_lines = (para.line_end - para.line_start + 1) as usize;
    if n_lines <= FALLBACK_LINES_PER_CHUNK {
        // Use line_start as split_key so each paragraph gets a distinct
        // chunk_id even when block_ids is empty (no symbol, no AST structure).
        // Without this, all short paragraphs from the same doc share the same
        // base_policy_hash and therefore the same id_for_chunk result.
        out.push(build_chunk_no_symbol(
            doc,
            policy,
            &para.text,
            para.line_start,
            para.line_end,
            lang,
            VERSION_LABEL,
            Some(para.line_start),
        ));
        return Ok(());
    }
    // Oversize: line-window split with overlap.
    let stride = FALLBACK_LINES_PER_CHUNK - FALLBACK_LINES_OVERLAP;
    let lines: Vec<&str> = para.text.lines().collect();
    let mut i = 0usize;
    loop {
        let end = (i + FALLBACK_LINES_PER_CHUNK).min(lines.len());
        let window_text = lines[i..end].join("\n");
        let window_start = para.line_start + i as u32;
        let window_end = para.line_start + (end as u32) - 1;
        // Use window_start as split_key so chunk_ids are unique across windows.
        out.push(build_chunk_no_symbol(
            doc,
            policy,
            &window_text,
            window_start,
            window_end,
            lang,
            VERSION_LABEL,
            Some(window_start),
        ));
        if end == lines.len() {
            break;
        }
        i += stride;
    }
    Ok(())
}
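
Review note: the two pieces of line arithmetic in this file (1-indexed paragraph ranges that exclude blank lines, and overlapping windows with stride `FALLBACK_LINES_PER_CHUNK - FALLBACK_LINES_OVERLAP`) can be checked in isolation. The sketch below is a simplified stand-alone version under stated assumptions: `paragraph_ranges` and `window_ranges` are illustrative names that do not exist in the crate, and they return plain index pairs instead of `Paragraph` or `Chunk` values.

```rust
/// 1-indexed inclusive line ranges of blank-line-separated paragraphs.
/// Blank (all-whitespace) lines act as boundaries and belong to no range.
fn paragraph_ranges(text: &str) -> Vec<(u32, u32)> {
    let mut ranges = Vec::new();
    let mut start: Option<u32> = None;
    let mut last = 0u32;
    for (idx, line) in text.lines().enumerate() {
        let line_no = idx as u32 + 1;
        if line.trim().is_empty() {
            // A blank line closes the paragraph in progress, if any.
            if let Some(s) = start.take() {
                ranges.push((s, last));
            }
        } else {
            if start.is_none() {
                start = Some(line_no);
            }
            last = line_no;
        }
    }
    // Flush a trailing paragraph not terminated by a blank line.
    if let Some(s) = start {
        ranges.push((s, last));
    }
    ranges
}

/// Half-open `[start, end)` window offsets over `n_lines` lines, each window
/// holding at most `per_chunk` lines and sharing `overlap` lines with the
/// previous one. The last window may be shorter.
fn window_ranges(n_lines: usize, per_chunk: usize, overlap: usize) -> Vec<(usize, usize)> {
    assert!(overlap < per_chunk, "stride must be positive");
    let stride = per_chunk - overlap;
    let mut out = Vec::new();
    let mut i = 0;
    loop {
        let end = (i + per_chunk).min(n_lines);
        out.push((i, end));
        if end == n_lines {
            break;
        }
        i += stride;
    }
    out
}
```

For a 200-line paragraph with the constants used here (80 lines per window, 20 lines of overlap), `window_ranges(200, 80, 20)` yields three windows starting at offsets 0, 60 and 120.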

@@ -43,6 +43,7 @@ impl Chunker for DockerfileFileV1Chunker {
"<dockerfile>",
"dockerfile",
VERSION_LABEL,
None,
)?;
tracing::debug!(

@@ -85,6 +85,7 @@ impl Chunker for K8sManifestResourceV1Chunker {
&symbol,
"yaml",
VERSION_LABEL,
Some(slice.line_start),
)?;
}

@@ -15,6 +15,8 @@
//! embedder, the retriever, the LLM, the RAG layer, or the UI layers.
//! It consumes `CanonicalDocument` purely through `kb-core` types.
mod code_c_ast_v1;
mod code_cpp_ast_v1;
mod code_go_ast_v1;
mod code_java_ast_v1;
mod code_js_ast_v1;
@@ -28,7 +30,10 @@ mod tier2_shared;
pub mod k8s_manifest_resource_v1;
pub mod dockerfile_file_v1;
pub mod manifest_file_v1;
pub mod code_text_paragraph_v1;
pub use code_c_ast_v1::CodeCAstV1Chunker;
pub use code_cpp_ast_v1::CodeCppAstV1Chunker;
pub use code_go_ast_v1::CodeGoAstV1Chunker;
pub use code_java_ast_v1::CodeJavaAstV1Chunker;
pub use code_js_ast_v1::CodeJsAstV1Chunker;
@@ -41,3 +46,4 @@ pub use pdf_page_v1::PdfPageV1Chunker;
pub use k8s_manifest_resource_v1::K8sManifestResourceV1Chunker;
pub use dockerfile_file_v1::DockerfileFileV1Chunker;
pub use manifest_file_v1::ManifestFileV1Chunker;
pub use code_text_paragraph_v1::CodeTextParagraphV1Chunker;

@@ -44,6 +44,7 @@ impl Chunker for ManifestFileV1Chunker {
"<manifest>",
lang,
VERSION_LABEL,
None,
)?;
tracing::debug!(

@@ -25,6 +25,13 @@ pub(crate) fn policy_hash(policy: &ChunkPolicy) -> String {
/// Emit one chunk for `(text, line_start..=line_end, symbol, lang)`, splitting
/// into line-windows of at most `AST_CHUNK_MAX_LINES` if the slice is oversize.
/// Mirrors the oversize path in `code_rust_ast_v1`'s `chunk` impl.
///
/// `base_split_key` is used as the `split_key` for the non-oversize single-chunk
/// case. Callers that emit multiple chunks from the same document (e.g.
/// `K8sManifestResourceV1Chunker` — one call per k8s resource) MUST pass
/// `Some(line_start)` so that each call produces a distinct `chunk_id`.
/// Single-chunk callers (dockerfile-file-v1, manifest-file-v1) pass `None` to
/// keep chunk_ids stable (no sibling can collide when there's only one chunk).
#[allow(clippy::too_many_arguments)]
pub(crate) fn push_chunks_with_oversize(
out: &mut Vec<Chunk>,
@@ -36,6 +43,7 @@ pub(crate) fn push_chunks_with_oversize(
symbol: &str,
lang: &str,
chunker_version: &str,
base_split_key: Option<u32>,
) -> Result<()> {
let n_lines = (line_end - line_start + 1).max(1);
let cv = ChunkerVersion(chunker_version.to_string());
@@ -51,7 +59,7 @@ pub(crate) fn push_chunks_with_oversize(
line_end,
symbol,
lang,
- None,
+ base_split_key,
));
return Ok(());
}
@@ -88,7 +96,7 @@ pub(crate) fn push_chunks_with_oversize(
/// for normal single-chunk emission. Mirrors the `Some(part_ls)` / `None`
/// split_key pattern in 1A-2.
#[allow(clippy::too_many_arguments)]
- fn build_chunk(
+ pub(crate) fn build_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
base_policy_hash: &str,
@@ -105,7 +113,49 @@ fn build_chunk(
symbol: Some(symbol.to_string()),
lang: Some(lang.to_string()),
};
build_chunk_from_span(doc, chunker_version, base_policy_hash, text, span, split_key)
}
/// Like `build_chunk` but emits `symbol: None`. Used by Tier 3 (per spec §9.3).
///
/// Accepts `policy: &ChunkPolicy` and `chunker_version: &str` (string slice)
/// so callers don't need to pre-compute the hash and version wrapper.
/// `split_key` is `Some(window_start)` for oversize line-window splits and
/// `Some(line_start)` for paragraphs that fit in a single chunk.
#[allow(clippy::too_many_arguments)]
pub(crate) fn build_chunk_no_symbol(
doc: &CanonicalDocument,
policy: &ChunkPolicy,
text: &str,
line_start: u32,
line_end: u32,
lang: &str,
chunker_version: &str,
split_key: Option<u32>,
) -> Chunk {
let cv = ChunkerVersion(chunker_version.to_string());
let base_policy_hash = policy_hash(policy);
let span = SourceSpan::Code {
line_start,
line_end,
symbol: None,
lang: Some(lang.to_string()),
};
build_chunk_from_span(doc, &cv, &base_policy_hash, text, span, split_key)
}
/// Core chunk-building logic shared by `build_chunk` and `build_chunk_no_symbol`.
///
/// Takes a pre-built `SourceSpan` so the only difference between the two
/// `pub(crate)` helpers is whether `symbol` is `Some` or `None`. All id/hash/
/// token mechanics are identical.
fn build_chunk_from_span(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
base_policy_hash: &str,
text: &str,
span: SourceSpan,
split_key: Option<u32>,
) -> Chunk {
// id_hash mirrors code_rust_ast_v1's make_chunk logic:
// split_key Some(k) => "{base_policy_hash}#L{k}"
// split_key None => base_policy_hash
@@ -114,7 +164,7 @@ fn build_chunk(
None => base_policy_hash.to_string(),
};
- // block_ids: Tier 2 chunkers have no per-block structure (the whole file
+ // block_ids: Tier 2/3 chunkers have no per-block structure (the whole file
// is one Block::Code), so we pass an empty slice — same as using the doc-
// level slice without explicit block granularity.
let block_ids: Vec<BlockId> = vec![];
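
Review note: the `split_key` rule described in the comments above (`Some(k)` appends `#L{k}` to the base policy hash, `None` leaves it unchanged) is why multi-chunk callers must pass `Some(line_start)`. The sketch below reproduces only that string rule; `id_hash` and `distinct_ids` are illustrative names, and the real `id_for_chunk` inputs beyond this string are not shown in the diff.

```rust
use std::collections::HashSet;

/// Build the hash input that distinguishes sibling chunks of one document:
/// `Some(k)` => "{base_policy_hash}#L{k}", `None` => the base hash unchanged.
fn id_hash(base_policy_hash: &str, split_key: Option<u32>) -> String {
    match split_key {
        Some(k) => format!("{base_policy_hash}#L{k}"),
        None => base_policy_hash.to_string(),
    }
}

/// Count how many distinct hash inputs a set of split keys produces for one
/// document. Fewer distinct values than keys means sibling chunks collide.
fn distinct_ids(base_policy_hash: &str, keys: &[Option<u32>]) -> usize {
    keys.iter()
        .map(|k| id_hash(base_policy_hash, *k))
        .collect::<HashSet<_>>()
        .len()
}
```

Three chunks emitted with `None` collapse to one hash input, while the same three chunks keyed by their start lines stay distinct. A single-chunk caller can keep passing `None` without risk, which matches the Dockerfile and manifest chunkers in this diff.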

@@ -0,0 +1,196 @@
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
//! representative C code `CanonicalDocument`.
//!
//! This is an integration test. It intentionally does not use
//! `kebab-parse-code` (design §6.3 / §8 boundary: AST extraction is parser-side).
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
//! units, which is the same pattern used in `code_go_ast_v1.rs`'s
//! internal `code_doc` test helper.
//!
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
use std::path::PathBuf;
use kebab_chunk::CodeCAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
fn fixed_doc() -> CanonicalDocument {
let wp = WorkspacePath("projects/record.c".into());
let aid = AssetId("c".repeat(64));
// Pin parser_version so doc_id / block_ids are reproducible.
let pv = ParserVersion("code-c-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
// Representative units (units 0-3 are merged into the single
// `<top-level>` unit spanning lines 1-18 below):
// 0. imports + defines (lines 1-4, ≤200)
// 1. status_t enum typedef (lines 6-9, ≤200)
// 2. record_t struct typedef (lines 11-16, ≤200)
// 3. static counter decl glue (line 18, ≤200)
// 4. parse_record fn (lines 20-23, ≤200)
// 5. print_record fn (lines 25-27, ≤200)
// 6. main fn (lines 29-33, ≤200)
let raw_units: Vec<(&str, u32, u32, String)> = vec![
(
"<top-level>",
1,
18,
"#include <stdio.h>\n#include <stdlib.h>\n\n#define MAX_BUF 4096\n\ntypedef enum {\n OK = 0,\n ERR_PARSE,\n ERR_IO,\n} status_t;\n\ntypedef struct {\n int id;\n char name[64];\n status_t status;\n} record_t;\n\nstatic int counter = 0;".to_string(),
),
(
"parse_record",
20,
23,
"int parse_record(const char *line, record_t *out) {\n if (line == NULL || out == NULL) return ERR_PARSE;\n return OK;\n}".to_string(),
),
(
"print_record",
25,
27,
"void print_record(const record_t *r) {\n printf(\"[%d] %s (status=%d)\\n\", r->id, r->name, r->status);\n}".to_string(),
),
(
"main",
29,
33,
"int main(void) {\n record_t r = { .id = 1, .name = \"foo\", .status = OK };\n print_record(&r);\n return 0;\n}".to_string(),
),
];
let blocks: Vec<Block> = raw_units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("c".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("c".into()),
code: code.clone(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "record.c".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("c".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn fixed_policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-c-ast-v1".into()),
}
}
#[test]
fn code_c_ast_chunks_snapshot() {
let doc = fixed_doc();
let policy = fixed_policy();
let chunks = CodeCAstV1Chunker.chunk(&doc, &policy).expect("chunk");
let actual = serde_json::to_value(&chunks).unwrap();
let dir = fixtures_dir();
let baseline_path = dir.join("code-sample.c.chunks.snapshot.json");
let baseline_text = match std::fs::read_to_string(&baseline_path) {
Ok(s) => s,
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
std::fs::create_dir_all(&dir).unwrap();
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
return;
}
Err(e) => panic!(
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
baseline_path.display()
),
};
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
if actual != expected {
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
eprintln!("updated baseline {}", baseline_path.display());
return;
}
let pretty = serde_json::to_string_pretty(&actual).unwrap();
panic!(
"code-c-ast-v1 chunks snapshot drift\n\
--- expected ({}) ---\n{baseline_text}\n\
--- actual ---\n{pretty}\n\
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
baseline_path.display()
);
}
}
/// Determinism cross-check: re-running the same pipeline yields the same
/// chunk_ids byte-for-byte.
#[test]
fn code_c_ast_chunks_are_deterministic() {
let policy = fixed_policy();
let baseline: Vec<String> = CodeCAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..5 {
let again: Vec<String> = CodeCAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, baseline);
}
}
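
Review note: the compare-or-rebake flow in `code_c_ast_chunks_snapshot` is repeated across the snapshot tests and could be factored into a helper. The sketch below is one possible shape, not code from this PR: `check_snapshot` and `SnapshotOutcome` are hypothetical names, it compares raw strings instead of parsed `serde_json::Value`s, and it takes an `update` flag where the tests read the `UPDATE_SNAPSHOTS` environment variable.

```rust
use std::fs;
use std::path::Path;

/// Result of a successful snapshot check.
#[derive(Debug, PartialEq)]
enum SnapshotOutcome {
    /// The baseline exists and equals the actual output.
    Matched,
    /// The baseline was created or overwritten because `update` was set.
    Written,
}

/// Compare `actual` against the baseline at `path`.
/// A missing or drifted baseline is an error unless `update` is true,
/// in which case the baseline is (re)written.
fn check_snapshot(path: &Path, actual: &str, update: bool) -> Result<SnapshotOutcome, String> {
    match fs::read_to_string(path) {
        Ok(expected) if expected == actual => Ok(SnapshotOutcome::Matched),
        Ok(_) | Err(_) if update => {
            if let Some(dir) = path.parent() {
                fs::create_dir_all(dir).map_err(|e| e.to_string())?;
            }
            fs::write(path, actual).map_err(|e| e.to_string())?;
            Ok(SnapshotOutcome::Written)
        }
        Ok(expected) => Err(format!(
            "snapshot drift\n--- expected ---\n{expected}\n--- actual ---\n{actual}"
        )),
        Err(e) => Err(format!("missing baseline {}: {e}", path.display())),
    }
}
```

A test would call it with something like `check_snapshot(&baseline_path, &pretty, std::env::var("UPDATE_SNAPSHOTS").is_ok())` and unwrap the result. Returning `Result` instead of panicking keeps the helper itself testable.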

@@ -0,0 +1,325 @@
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
//! representative C++ code `CanonicalDocument`.
//!
//! Two complementary kinds of test (each paired with a determinism check):
//! 1. `code_cpp_ast_chunks_snapshot` — hand-built `fixed_doc()` validates the
//! chunker's 1:1 mapping (design §6.3 / §8 boundary: no parse-code dep needed).
//! 2. `code_cpp_ast_extractor_snapshot` — invokes `CppAstExtractor` against the
//! real `tests/fixtures/sample.cpp` fixture, validating the extractor → chunker
//! end-to-end pipeline. `kebab-parse-code` is a dev-dep (same pattern as
//! `kebab-parse-md` in Markdown snapshot tests).
//!
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
use std::path::PathBuf;
use kebab_chunk::CodeCppAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
};
use kebab_parse_code::CppAstExtractor;
use serde_json::Value;
use time::OffsetDateTime;
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
fn fixed_doc() -> CanonicalDocument {
let wp = WorkspacePath("projects/record.cpp".into());
let aid = AssetId("c".repeat(64));
// Pin parser_version so doc_id / block_ids are reproducible.
let pv = ParserVersion("code-cpp-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
// Representative units (C++ specific):
// 0. includes + namespace opening (lines 1-4, ≤200)
// 1. class definition (lines 6-20, ≤200)
// 2. template function (lines 22-25, ≤200)
// 3. namespace closing + free fn (lines 27-29, ≤200)
// 4. main fn (lines 31-34, ≤200)
let raw_units: Vec<(&str, u32, u32, String)> = vec![
(
"<top-level>",
1,
4,
"#include <string>\n#include <vector>\n\nnamespace kebab {".to_string(),
),
(
"kebab::chunk::MdHeadingV1Chunker",
6,
20,
"class MdHeadingV1Chunker {\npublic:\n MdHeadingV1Chunker() = default;\n ~MdHeadingV1Chunker() = default;\n\n std::string chunk_doc(const std::string& doc) {\n return doc;\n }\n\n int operator()(int x) const {\n return x * 2;\n }\n\nprivate:\n int counter_ = 0;\n};".to_string(),
),
(
"kebab::identity",
22,
25,
"template <typename T>\nT identity(T value) {\n return value;\n}".to_string(),
),
(
"kebab::global_helper",
27,
29,
"void global_helper() {\n // free function in kebab namespace\n}".to_string(),
),
(
"main",
31,
34,
"int main() {\n kebab::chunk::MdHeadingV1Chunker c;\n return 0;\n}".to_string(),
),
];
let blocks: Vec<Block> = raw_units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("cpp".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("cpp".into()),
code: code.clone(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "record.cpp".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("cpp".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn fixed_policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-cpp-ast-v1".into()),
}
}
// ---------------------------------------------------------------------------
// Helper: run the real CppAstExtractor against tests/fixtures/sample.cpp
// ---------------------------------------------------------------------------
fn extract_cpp_fixture() -> CanonicalDocument {
use kebab_core::{
AssetId, AssetStorage, Checksum, ExtractConfig, ExtractContext, Extractor, RawAsset,
SourceUri, WorkspacePath,
};
use std::path::PathBuf;
let bytes = std::fs::read(fixtures_dir().join("sample.cpp")).expect("read sample.cpp fixture");
let src = String::from_utf8(bytes).expect("fixture is valid UTF-8");
let wp = WorkspacePath("tests/fixtures/sample.cpp".to_string());
let asset = RawAsset {
asset_id: AssetId("e".repeat(64)),
source_uri: SourceUri::File(PathBuf::from("tests/fixtures/sample.cpp")),
workspace_path: wp,
media_type: kebab_core::MediaType::Code("cpp".to_string()),
byte_len: src.len() as u64,
checksum: Checksum("f".repeat(64)),
discovered_at: time::OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
stored: AssetStorage::Reference {
path: PathBuf::from("tests/fixtures/sample.cpp"),
sha: Checksum("f".repeat(64)),
},
};
let cfg = ExtractConfig::default();
let root = PathBuf::from("/tmp");
let ctx = ExtractContext {
asset: &asset,
workspace_root: &root,
config: &cfg,
};
CppAstExtractor::new().extract(&ctx, src.as_bytes()).unwrap()
}
// ---------------------------------------------------------------------------
// Test 1 (hand-built): chunker-only 1:1 mapping validation
// ---------------------------------------------------------------------------
#[test]
fn code_cpp_ast_chunks_snapshot() {
let doc = fixed_doc();
let policy = fixed_policy();
let chunks = CodeCppAstV1Chunker.chunk(&doc, &policy).expect("chunk");
let actual = serde_json::to_value(&chunks).unwrap();
let dir = fixtures_dir();
let baseline_path = dir.join("code-sample.cpp.chunks.snapshot.json");
let baseline_text = match std::fs::read_to_string(&baseline_path) {
Ok(s) => s,
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
std::fs::create_dir_all(&dir).unwrap();
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
return;
}
Err(e) => panic!(
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
baseline_path.display()
),
};
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
if actual != expected {
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
eprintln!("updated baseline {}", baseline_path.display());
return;
}
let pretty = serde_json::to_string_pretty(&actual).unwrap();
panic!(
"code-cpp-ast-v1 chunks snapshot drift\n\
--- expected ({}) ---\n{baseline_text}\n\
--- actual ---\n{pretty}\n\
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
baseline_path.display()
);
}
}
/// Determinism cross-check: re-running the same pipeline yields the same
/// chunk_ids byte-for-byte.
#[test]
fn code_cpp_ast_chunks_are_deterministic() {
let policy = fixed_policy();
let baseline: Vec<String> = CodeCppAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..5 {
let again: Vec<String> = CodeCppAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, baseline);
}
}
// ---------------------------------------------------------------------------
// Test 2 (real extractor): end-to-end extractor → chunker pipeline
// ---------------------------------------------------------------------------
/// Validates that the real `CppAstExtractor` processes `sample.cpp` and
/// emits the expected set of symbols through the full chunker pipeline.
///
/// `sample.cpp` contains:
/// - `#include` directives + nested namespace `kebab::chunk` → glue + struct unit
/// - `class MdHeadingV1Chunker` with methods (ctor, dtor, chunk_doc, operator())
/// - `template <typename T> T identity(T value)` (template fn)
/// - `void kebab::global_helper()` (free fn in namespace)
/// - `int main()` (global free fn)
#[test]
fn code_cpp_ast_extractor_snapshot() {
let doc = extract_cpp_fixture();
// Verify the extractor emits all expected named units.
let block_syms: Vec<Option<String>> = doc.blocks.iter().filter_map(|b| match b {
Block::Code(c) => match &c.common.source_span {
SourceSpan::Code { symbol, .. } => Some(symbol.clone()),
_ => None,
},
_ => None,
}).collect();
// Must include namespace-qualified class and its methods
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker")),
"class unit missing: {block_syms:?}"
);
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::MdHeadingV1Chunker")),
"ctor unit missing: {block_syms:?}"
);
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::~MdHeadingV1Chunker")),
"dtor unit missing: {block_syms:?}"
);
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::chunk_doc")),
"chunk_doc unit missing: {block_syms:?}"
);
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::operator()")),
"operator() unit missing: {block_syms:?}"
);
// Template function (inside kebab::chunk namespace in the fixture)
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::identity")),
"identity template fn unit missing: {block_syms:?}"
);
// Free function in outer namespace
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::global_helper")),
"global_helper unit missing: {block_syms:?}"
);
// Global main
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("main")),
"main unit missing: {block_syms:?}"
);
}
/// End-to-end chunker output from real extractor is deterministic.
#[test]
fn code_cpp_ast_extractor_chunks_deterministic() {
let doc1 = extract_cpp_fixture();
let doc2 = extract_cpp_fixture();
assert_eq!(doc1.blocks, doc2.blocks, "extractor output non-deterministic");
let policy = fixed_policy();
let chunks1 = CodeCppAstV1Chunker.chunk(&doc1, &policy).unwrap();
let chunks2 = CodeCppAstV1Chunker.chunk(&doc2, &policy).unwrap();
assert_eq!(
chunks1.iter().map(|c| c.chunk_id.0.clone()).collect::<Vec<_>>(),
chunks2.iter().map(|c| c.chunk_id.0.clone()).collect::<Vec<_>>(),
"chunker output non-deterministic"
);
}
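
Review note: `code_cpp_ast_extractor_snapshot` repeats the same `any(...)` assertion once per symbol. A table-driven check reports every missing symbol in one failure. The sketch below is a suggestion only; `missing_symbols` is a hypothetical helper name, and the symbol strings in the usage note are taken from the assertions above.

```rust
/// Return every expected symbol that does not appear in `found`.
/// `found` mirrors the `Vec<Option<String>>` collected from the blocks'
/// `SourceSpan::Code { symbol, .. }` fields; `None` entries never match.
fn missing_symbols<'a>(found: &[Option<String>], expected: &[&'a str]) -> Vec<&'a str> {
    expected
        .iter()
        .copied()
        .filter(|want| !found.iter().any(|s| s.as_deref() == Some(*want)))
        .collect()
}
```

In the test this could replace the chain of assertions with a single one, for example `assert!(missing_symbols(&block_syms, &["kebab::chunk::identity", "kebab::global_helper", "main"]).is_empty())`, printing the returned list in the failure message.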

@@ -0,0 +1,270 @@
//! Behavioural tests for `CodeTextParagraphV1Chunker`.
//!
//! Documents are constructed manually (no kebab-parse-code dependency) by
//! placing raw text into a single `Block::Code`, mirroring the pattern used
//! in `k8s_manifest_resource_v1.rs`.
use std::path::PathBuf;
use kebab_chunk::CodeTextParagraphV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
// ── helpers ──────────────────────────────────────────────────────────────────
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
/// Build a `CanonicalDocument` with a single `Block::Code` containing `text`
/// and the supplied `lang` label.
fn text_doc(lang: &str, text: &str) -> CanonicalDocument {
let wp = WorkspacePath("scripts/sample.sh".into());
let aid = AssetId("d".repeat(64));
let pv = ParserVersion("code-text-paragraph-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let line_count = text.lines().count() as u32;
let span = SourceSpan::Code {
line_start: 1,
line_end: line_count.max(1),
symbol: None,
lang: Some(lang.into()),
};
let bid = id_for_block(&doc_id, "code", &[], 0, &span);
let block = Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some(lang.into()),
code: text.to_string(),
});
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "sample.sh".into(),
lang: Lang("und".into()),
blocks: vec![block],
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some(lang.into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-text-paragraph-v1".into()),
}
}
// ── tests ─────────────────────────────────────────────────────────────────────
/// `sample_shell.sh` has 4 paragraphs separated by 3 blank lines:
/// - paragraph 1: lines 1-2 (shebang + set -euo pipefail)
/// - paragraph 2: lines 4-7 (env setup block)
/// - paragraph 3: lines 9-11 (ingest block)
/// - paragraph 4: lines 13-15 (report block)
///
/// We assert:
/// - exactly 4 chunks (one per paragraph)
/// - all symbols are None (Tier 3 spec §9.3)
/// - all langs are "shell"
/// - line ranges are strictly ascending and do NOT include the blank lines
/// (lines 3, 8, 12 must not appear in any range)
#[test]
fn shell_multi_paragraph_splits_on_blank_lines() {
let fixture_path = fixtures_dir().join("sample_shell.sh");
let text = std::fs::read_to_string(&fixture_path)
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
let doc = text_doc("shell", &text);
let chunks = CodeTextParagraphV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
4,
"expected 4 chunks (one per paragraph), got {}: {chunks:#?}",
chunks.len()
);
// All symbols must be None (Tier 3 requirement).
for (i, chunk) in chunks.iter().enumerate() {
match &chunk.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(
symbol.is_none(),
"chunk[{i}] symbol must be None for Tier 3 chunker, got {symbol:?}"
);
}
other => panic!("chunk[{i}]: expected Code span, got {other:?}"),
}
}
// All langs must be "shell".
for (i, chunk) in chunks.iter().enumerate() {
match &chunk.source_spans[0] {
SourceSpan::Code { lang, .. } => {
assert_eq!(
lang.as_deref(),
Some("shell"),
"chunk[{i}] lang must be 'shell', got {lang:?}"
);
}
other => panic!("chunk[{i}]: expected Code span, got {other:?}"),
}
}
// Line ranges must be strictly ascending with no overlap,
// and blank lines (3, 8, 12) must not be included in any range.
let expected_ranges: &[(u32, u32)] = &[(1, 2), (4, 7), (9, 11), (13, 15)];
let actual_ranges: Vec<(u32, u32)> = chunks
.iter()
.map(|c| match &c.source_spans[0] {
SourceSpan::Code {
line_start,
line_end,
..
} => (*line_start, *line_end),
other => panic!("expected Code span, got {other:?}"),
})
.collect();
assert_eq!(
actual_ranges, expected_ranges,
"line ranges mismatch: got {actual_ranges:?}, expected {expected_ranges:?}"
);
}
/// `sample_long_paragraph.txt` has exactly 200 non-blank lines and no blank
/// lines, so the entire file is one paragraph. 200 > 80 (FALLBACK_LINES_PER_CHUNK),
/// so the oversize window split fires with stride 60:
/// - window 1: lines 1-80
/// - window 2: lines 61-140
/// - window 3: lines 121-200
///
/// All chunk_ids must be distinct (the #L{window_start} split_key suffix).
#[test]
fn single_long_paragraph_line_window_split() {
let fixture_path = fixtures_dir().join("sample_long_paragraph.txt");
let text = std::fs::read_to_string(&fixture_path)
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
assert_eq!(
text.lines().count(),
200,
"fixture must have exactly 200 lines"
);
let doc = text_doc("shell", &text);
let chunks = CodeTextParagraphV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
3,
"expected 3 window chunks for 200-line paragraph, got {}: {chunks:#?}",
chunks.len()
);
let expected_ranges: &[(u32, u32)] = &[(1, 80), (61, 140), (121, 200)];
let actual_ranges: Vec<(u32, u32)> = chunks
.iter()
.map(|c| match &c.source_spans[0] {
SourceSpan::Code {
line_start,
line_end,
..
} => (*line_start, *line_end),
other => panic!("expected Code span, got {other:?}"),
})
.collect();
assert_eq!(
actual_ranges, expected_ranges,
"window ranges mismatch: got {actual_ranges:?}, expected {expected_ranges:?}"
);
// All chunk_ids must be distinct (#L{window_start} suffix differentiates them).
let ids: std::collections::HashSet<_> = chunks.iter().map(|c| c.chunk_id.clone()).collect();
assert_eq!(
ids.len(),
chunks.len(),
"oversize window chunks must have distinct chunk_ids"
);
}
/// An empty source file (no non-blank lines) must yield zero chunks.
#[test]
fn empty_file_emits_zero_chunks() {
let doc = text_doc("shell", "");
let chunks = CodeTextParagraphV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
0,
"empty file must yield 0 chunks, got {}: {chunks:#?}",
chunks.len()
);
}
/// The `lang` field on each emitted chunk must match the `lang` passed to
/// `text_doc`, regardless of content. `symbol` must be `None` (Tier 3 spec).
#[test]
fn lang_field_preserved_from_input_doc() {
let doc = text_doc("yaml", "key1: value1\nkey2: value2\n");
let chunks = CodeTextParagraphV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert!(!chunks.is_empty(), "expected at least one chunk");
match &chunks[0].source_spans[0] {
SourceSpan::Code { lang, symbol, .. } => {
assert_eq!(
lang.as_deref(),
Some("yaml"),
"lang must be 'yaml', got {lang:?}"
);
assert!(
symbol.is_none(),
"symbol must be None for Tier 3 chunker, got {symbol:?}"
);
}
other => panic!("expected Code span, got {other:?}"),
}
}


@@ -0,0 +1,86 @@
[
{
"block_ids": [
"8149e12ca002489acb4a0f74c97a061a"
],
"chunk_id": "ec3cf06ae56c8e9796bbc9196438b7c5",
"chunker_version": "code-c-ast-v1",
"doc_id": "6bec42dd593920a060541db16c4e8e45",
"heading_path": [],
"policy_hash": "ecfad2ec1223662d",
"source_spans": [
{
"kind": "code",
"lang": "c",
"line_end": 18,
"line_start": 1,
"symbol": "<top-level>"
}
],
"text": "#include <stdio.h>\n#include <stdlib.h>\n\n#define MAX_BUF 4096\n\ntypedef enum {\n OK = 0,\n ERR_PARSE,\n ERR_IO,\n} status_t;\n\ntypedef struct {\n int id;\n char name[64];\n status_t status;\n} record_t;\n\nstatic int counter = 0;",
"token_estimate": 78
},
{
"block_ids": [
"1baaa89f21a47b2f32d6396a24a85454"
],
"chunk_id": "c2d7a81c898106733ef2e703774a6a4a",
"chunker_version": "code-c-ast-v1",
"doc_id": "6bec42dd593920a060541db16c4e8e45",
"heading_path": [],
"policy_hash": "ecfad2ec1223662d",
"source_spans": [
{
"kind": "code",
"lang": "c",
"line_end": 23,
"line_start": 20,
"symbol": "parse_record"
}
],
"text": "int parse_record(const char *line, record_t *out) {\n if (line == NULL || out == NULL) return ERR_PARSE;\n return OK;\n}",
"token_estimate": 41
},
{
"block_ids": [
"8d0e14cbcc6d1e92d7878ab796ea68b8"
],
"chunk_id": "0e4d7b131ab64eba03b51903b5d8f96d",
"chunker_version": "code-c-ast-v1",
"doc_id": "6bec42dd593920a060541db16c4e8e45",
"heading_path": [],
"policy_hash": "ecfad2ec1223662d",
"source_spans": [
{
"kind": "code",
"lang": "c",
"line_end": 27,
"line_start": 25,
"symbol": "print_record"
}
],
"text": "void print_record(const record_t *r) {\n printf(\"[%d] %s (status=%d)\\n\", r->id, r->name, r->status);\n}",
"token_estimate": 35
},
{
"block_ids": [
"9c2ede84423871b615d48c38fefb1853"
],
"chunk_id": "e076f8edb2ff141d7e99b4106bb95157",
"chunker_version": "code-c-ast-v1",
"doc_id": "6bec42dd593920a060541db16c4e8e45",
"heading_path": [],
"policy_hash": "ecfad2ec1223662d",
"source_spans": [
{
"kind": "code",
"lang": "c",
"line_end": 33,
"line_start": 29,
"symbol": "main"
}
],
"text": "int main(void) {\n record_t r = { .id = 1, .name = \"foo\", .status = OK };\n print_record(&r);\n return 0;\n}",
"token_estimate": 38
}
]


@@ -0,0 +1,107 @@
[
{
"block_ids": [
"53292605459065d170cd36c118e20546"
],
"chunk_id": "50a5b324300d9082eac4ce2a422810e1",
"chunker_version": "code-cpp-ast-v1",
"doc_id": "fff1e1f0a7ff70ef682937470e5d1d28",
"heading_path": [],
"policy_hash": "71f3c07bb9ec1d09",
"source_spans": [
{
"kind": "code",
"lang": "cpp",
"line_end": 4,
"line_start": 1,
"symbol": "<top-level>"
}
],
"text": "#include <string>\n#include <vector>\n\nnamespace kebab {",
"token_estimate": 18
},
{
"block_ids": [
"f349acad94c9fa4cf9ad1c0a93e83610"
],
"chunk_id": "0e6bc7c522665af8a4b0f66afb9d29c8",
"chunker_version": "code-cpp-ast-v1",
"doc_id": "fff1e1f0a7ff70ef682937470e5d1d28",
"heading_path": [],
"policy_hash": "71f3c07bb9ec1d09",
"source_spans": [
{
"kind": "code",
"lang": "cpp",
"line_end": 20,
"line_start": 6,
"symbol": "kebab::chunk::MdHeadingV1Chunker"
}
],
"text": "class MdHeadingV1Chunker {\npublic:\n MdHeadingV1Chunker() = default;\n ~MdHeadingV1Chunker() = default;\n\n std::string chunk_doc(const std::string& doc) {\n return doc;\n }\n\n int operator()(int x) const {\n return x * 2;\n }\n\nprivate:\n int counter_ = 0;\n};",
"token_estimate": 95
},
{
"block_ids": [
"8b9811387717d0bd4abf84abcc35b8b1"
],
"chunk_id": "d9326d252905b665b2adb9a416c20451",
"chunker_version": "code-cpp-ast-v1",
"doc_id": "fff1e1f0a7ff70ef682937470e5d1d28",
"heading_path": [],
"policy_hash": "71f3c07bb9ec1d09",
"source_spans": [
{
"kind": "code",
"lang": "cpp",
"line_end": 25,
"line_start": 22,
"symbol": "kebab::identity"
}
],
"text": "template <typename T>\nT identity(T value) {\n return value;\n}",
"token_estimate": 21
},
{
"block_ids": [
"1754cb6b971f6a4cb292f144a4f0570b"
],
"chunk_id": "56ee5f991de4a413c016da8dc4acfc35",
"chunker_version": "code-cpp-ast-v1",
"doc_id": "fff1e1f0a7ff70ef682937470e5d1d28",
"heading_path": [],
"policy_hash": "71f3c07bb9ec1d09",
"source_spans": [
{
"kind": "code",
"lang": "cpp",
"line_end": 29,
"line_start": 27,
"symbol": "kebab::global_helper"
}
],
"text": "void global_helper() {\n // free function in kebab namespace\n}",
"token_estimate": 22
},
{
"block_ids": [
"14b5f3393d6d25f822f5b70763d24acd"
],
"chunk_id": "c0d7c043cdd575c530db3909b54cc906",
"chunker_version": "code-cpp-ast-v1",
"doc_id": "fff1e1f0a7ff70ef682937470e5d1d28",
"heading_path": [],
"policy_hash": "71f3c07bb9ec1d09",
"source_spans": [
{
"kind": "code",
"lang": "cpp",
"line_end": 34,
"line_start": 31,
"symbol": "main"
}
],
"text": "int main() {\n kebab::chunk::MdHeadingV1Chunker c;\n return 0;\n}",
"token_estimate": 23
}
]


@@ -0,0 +1,33 @@
#include <stdio.h>
#include <stdlib.h>
#define MAX_BUF 4096
typedef enum {
OK = 0,
ERR_PARSE,
ERR_IO,
} status_t;
typedef struct {
int id;
char name[64];
status_t status;
} record_t;
static int counter = 0;
int parse_record(const char *line, record_t *out) {
if (line == NULL || out == NULL) return ERR_PARSE;
return OK;
}
void print_record(const record_t *r) {
printf("[%d] %s (status=%d)\n", r->id, r->name, r->status);
}
int main(void) {
record_t r = { .id = 1, .name = "foo", .status = OK };
print_record(&r);
return 0;
}


@@ -0,0 +1,40 @@
#include <string>
#include <vector>
namespace kebab {
namespace chunk {
class MdHeadingV1Chunker {
public:
MdHeadingV1Chunker() = default;
~MdHeadingV1Chunker() = default;
std::string chunk_doc(const std::string& doc) {
return doc;
}
int operator()(int x) const {
return x * 2;
}
private:
int counter_ = 0;
};
template <typename T>
T identity(T value) {
return value;
}
} // namespace chunk
void global_helper() {
// free function in kebab namespace
}
} // namespace kebab
int main() {
kebab::chunk::MdHeadingV1Chunker c;
return 0;
}


@@ -0,0 +1,200 @@
line 001
line 002
line 003
line 004
line 005
line 006
line 007
line 008
line 009
line 010
line 011
line 012
line 013
line 014
line 015
line 016
line 017
line 018
line 019
line 020
line 021
line 022
line 023
line 024
line 025
line 026
line 027
line 028
line 029
line 030
line 031
line 032
line 033
line 034
line 035
line 036
line 037
line 038
line 039
line 040
line 041
line 042
line 043
line 044
line 045
line 046
line 047
line 048
line 049
line 050
line 051
line 052
line 053
line 054
line 055
line 056
line 057
line 058
line 059
line 060
line 061
line 062
line 063
line 064
line 065
line 066
line 067
line 068
line 069
line 070
line 071
line 072
line 073
line 074
line 075
line 076
line 077
line 078
line 079
line 080
line 081
line 082
line 083
line 084
line 085
line 086
line 087
line 088
line 089
line 090
line 091
line 092
line 093
line 094
line 095
line 096
line 097
line 098
line 099
line 100
line 101
line 102
line 103
line 104
line 105
line 106
line 107
line 108
line 109
line 110
line 111
line 112
line 113
line 114
line 115
line 116
line 117
line 118
line 119
line 120
line 121
line 122
line 123
line 124
line 125
line 126
line 127
line 128
line 129
line 130
line 131
line 132
line 133
line 134
line 135
line 136
line 137
line 138
line 139
line 140
line 141
line 142
line 143
line 144
line 145
line 146
line 147
line 148
line 149
line 150
line 151
line 152
line 153
line 154
line 155
line 156
line 157
line 158
line 159
line 160
line 161
line 162
line 163
line 164
line 165
line 166
line 167
line 168
line 169
line 170
line 171
line 172
line 173
line 174
line 175
line 176
line 177
line 178
line 179
line 180
line 181
line 182
line 183
line 184
line 185
line 186
line 187
line 188
line 189
line 190
line 191
line 192
line 193
line 194
line 195
line 196
line 197
line 198
line 199
line 200
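
The windows asserted in `single_long_paragraph_line_window_split` (1-80, 61-140, 121-200) follow from an 80-line window advanced with a stride of 60. Below is a minimal sketch of that arithmetic. `line_windows` is a hypothetical helper, not the chunker's actual function, and clamping the final window to the end of the file is an assumption about the tail handling.

```rust
/// Hypothetical helper (not part of kebab): 1-based inclusive line windows of
/// at most `window` lines, each starting `stride` lines after the previous one.
fn line_windows(total: u32, window: u32, stride: u32) -> Vec<(u32, u32)> {
    let mut out = Vec::new();
    if window == 0 || stride == 0 {
        // A zero-width window or a zero stride would never make progress.
        return out;
    }
    let mut start = 1u32;
    while start <= total {
        // Clamp the window to the last line of the file (assumed behaviour).
        let end = start.saturating_add(window - 1).min(total);
        out.push((start, end));
        if end == total {
            break;
        }
        start = start.saturating_add(stride);
    }
    out
}
```

For the 200-line fixture, `line_windows(200, 80, 60)` gives the three ranges the test expects. Consecutive windows share 20 lines because the stride is 20 lines shorter than the window.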


@@ -0,0 +1,15 @@
#!/usr/bin/env bash
set -euo pipefail
# First paragraph: env setup
export KEBAB_HOME="${KEBAB_HOME:-$HOME/.local/share/kebab}"
mkdir -p "$KEBAB_HOME"
cd "$KEBAB_HOME"
# Second paragraph: ingest
echo "ingesting workspace..."
kebab ingest --config /etc/kebab/config.toml
# Third paragraph: report
echo "done"
kebab schema --json | jq '.stats'
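
The test `shell_multi_paragraph_splits_on_blank_lines` expects this fixture to split into the ranges (1, 2), (4, 7), (9, 11) and (13, 15), with the blank separator lines 3, 8 and 12 excluded. Below is a minimal sketch of blank-line paragraph splitting. `paragraph_ranges` is a hypothetical helper, and treating whitespace-only lines as blank is an assumption, since the chunker's own rule is not shown in this diff.

```rust
/// Hypothetical helper (not part of kebab): 1-based inclusive line ranges of
/// each run of consecutive non-blank lines in `text`.
fn paragraph_ranges(text: &str) -> Vec<(u32, u32)> {
    let mut ranges = Vec::new();
    let mut start: Option<u32> = None; // first line of the open paragraph
    let mut last = 0u32; // most recent non-blank line seen
    for (idx, line) in text.lines().enumerate() {
        let n = idx as u32 + 1;
        if line.trim().is_empty() {
            // A blank line closes the open paragraph, if there is one.
            if let Some(s) = start.take() {
                ranges.push((s, last));
            }
        } else {
            if start.is_none() {
                start = Some(n);
            }
            last = n;
        }
    }
    // Close a paragraph that runs to the end of the input.
    if let Some(s) = start {
        ranges.push((s, last));
    }
    ranges
}
```

An empty input produces no ranges, which matches the zero-chunk expectation in `empty_file_emits_zero_chunks`.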


@@ -140,6 +140,17 @@ fn k8s_multi_doc_emits_one_chunk_per_resource() {
for chunk in &chunks {
assert_eq!(chunk.chunker_version.0, "k8s-manifest-resource-v1");
}
// Every chunk from a multi-resource file must have a distinct chunk_id.
// Without the fix, all non-oversize resources get split_key=None which
// collapses to the same id_hash (= base_policy_hash) → UNIQUE constraint
// violation on the second resource.
let ids: std::collections::HashSet<_> = chunks.iter().map(|c| c.chunk_id.clone()).collect();
assert_eq!(
ids.len(),
chunks.len(),
"every k8s resource chunk must have a distinct chunk_id (multi-resource collision regression)"
);
}
/// A YAML document with an indentation error (tab in a space-indented context)
@@ -203,3 +214,75 @@ rules:
other => panic!("expected Code span, got {other:?}"),
}
}
/// A 200+ line resource exercises the line-window split branch of
/// `tier2_shared::push_chunks_with_oversize`. All chunks must share the same
/// symbol (`<Kind>/<ns>/<name>`), their line ranges must form a contiguous
/// partition, and their chunk_ids must all differ (the `#L{k}` suffix on
/// `id_for_chunk` ensures uniqueness across windows). The risks section of
/// spec p10-2 explicitly flags the "giant ConfigMap" case; this test covers it.
#[test]
fn k8s_oversize_splits_into_line_windows_sharing_symbol() {
// ConfigMap with 250 data keys → ~256 total lines, > AST_CHUNK_MAX_LINES (200).
let mut yaml = String::from(
"apiVersion: v1\nkind: ConfigMap\nmetadata:\n name: big\n namespace: prod\ndata:\n",
);
for i in 0..250 {
yaml.push_str(&format!(" key{i}: value{i}\n"));
}
let doc = yaml_doc(&yaml);
let chunks = K8sManifestResourceV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert!(
chunks.len() >= 2,
"expected ≥2 chunks for oversize resource, got {}",
chunks.len()
);
// Every chunk must share the same symbol + lang.
let expected_symbol = "ConfigMap/prod/big";
for (i, c) in chunks.iter().enumerate() {
match &c.source_spans[0] {
SourceSpan::Code { symbol, lang, .. } => {
assert_eq!(
symbol.as_deref(),
Some(expected_symbol),
"chunk[{i}] symbol must equal `{expected_symbol}`"
);
assert_eq!(lang.as_deref(), Some("yaml"));
}
other => panic!("chunk[{i}]: expected Code span, got {other:?}"),
}
}
// chunk_ids must all be distinct (oversize fallback's #L{k} suffix).
let ids: std::collections::HashSet<_> = chunks.iter().map(|c| c.chunk_id.clone()).collect();
assert_eq!(
ids.len(),
chunks.len(),
"oversize chunks must have distinct chunk_ids (the #L{{k}} suffix should disambiguate)"
);
// Line ranges must form a contiguous partition: chunk[i].line_end + 1 == chunk[i+1].line_start.
let ranges: Vec<(u32, u32)> = chunks
.iter()
.map(|c| match &c.source_spans[0] {
SourceSpan::Code { line_start, line_end, .. } => (*line_start, *line_end),
other => panic!("expected Code span, got {other:?}"),
})
.collect();
for w in ranges.windows(2) {
let (_, prev_end) = w[0];
let (next_start, _) = w[1];
assert_eq!(
prev_end + 1,
next_start,
"line ranges must be contiguous: {} → {} (got gap or overlap)",
prev_end,
next_start
);
}
}
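
The oversize test above checks two properties: the windows tile the resource with no gap or overlap, and a per-window `#L{k}` suffix keeps the chunk ids apart even though every window shares one symbol. Below is a minimal sketch of both. `contiguous_windows`, `split_key` and `all_distinct` are hypothetical names. Using 200 lines per window (borrowed from the `AST_CHUNK_MAX_LINES` comment) and building the key as symbol plus suffix are assumptions, since the real `push_chunks_with_oversize` and `id_for_chunk` are not shown in this diff.

```rust
use std::collections::HashSet;

/// Hypothetical helper: split lines 1..=total into consecutive,
/// non-overlapping windows of at most `max_lines` lines.
fn contiguous_windows(total: u32, max_lines: u32) -> Vec<(u32, u32)> {
    let mut out = Vec::new();
    if max_lines == 0 {
        return out;
    }
    let mut start = 1u32;
    while start <= total {
        let end = start.saturating_add(max_lines - 1).min(total);
        out.push((start, end));
        if end == total {
            break;
        }
        // The next window starts right after this one, so ranges stay contiguous.
        start = end + 1;
    }
    out
}

/// Hypothetical split key: the shared symbol plus a `#L{start}` suffix.
fn split_key(symbol: &str, line_start: u32) -> String {
    format!("{symbol}#L{line_start}")
}

/// True when no two keys are equal.
fn all_distinct(keys: &[String]) -> bool {
    let unique: HashSet<&String> = keys.iter().collect();
    unique.len() == keys.len()
}
```

Under these assumptions the 256-line ConfigMap splits into (1, 200) and (201, 256). Both windows carry the symbol `ConfigMap/prod/big`, and only the suffix differs.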


@@ -933,6 +933,15 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
let next = resp.next_cursor.as_deref().unwrap_or("(none)");
eprintln!("[truncated; use --cursor {next} for the next page]");
}
// v0.17.0 A5 Step 4: short-query advisory. `resp.hint`
// is `Some` only when the result list is empty and the
// trimmed query is shorter than the trigram tokenizer
// can resolve (raw FTS5 mode opts out). stderr so it
// doesn't pollute the stdout hit list. `--json` skips
// this branch entirely; the field rides the wire.
if let Some(hint) = &resp.hint {
eprintln!("[hint] {hint}");
}
if *trace {
if let Some(t) = &resp.trace {
eprintln!();


@@ -92,6 +92,14 @@ pub fn wire_search_response(r: &kebab_app::SearchResponse) -> Value {
map.insert("trace".to_string(), trace_v);
}
}
// v0.17.0 A5 Step 4b: emit `hint` only when set. Keeps responses
// that don't carry a hint backward-compatible with v0 consumers
// that don't know the field.
if let Some(hint) = &r.hint {
if let Value::Object(ref mut map) = v {
map.insert("hint".to_string(), Value::String(hint.clone()));
}
}
tag_object(v, "search_response.v1")
}
@@ -292,6 +300,7 @@ mod tests {
next_cursor: Some("opaque-cursor-abc".to_string()),
truncated: true,
trace: None,
hint: None,
};
let v = wire_search_response(&r);
assert_eq!(schema_of(&v), Some("search_response.v1"));
@@ -405,6 +414,7 @@ mod tests {
}],
timing: TraceTiming { lexical_ms: 5, vector_ms: 0, fusion_ms: 1, total_ms: 7 },
}),
hint: None,
};
let v = wire_search_response(&r);
assert_eq!(schema_of(&v), Some("search_response.v1"));
@@ -420,6 +430,7 @@ mod tests {
next_cursor: None,
truncated: false,
trace: None,
hint: None,
};
let v = wire_search_response(&r);
assert!(v.get("trace").is_none(), "trace field absent when None");


@@ -47,8 +47,20 @@ fn search_json_emits_search_response_v1_wrapper() {
fn search_json_truncates_with_max_tokens() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
let body: String = "rust ownership is a memory model. ".repeat(10);
fs::write(workspace.join("a.md"), format!("# T\n\n{body}\n")).unwrap();
// v0.17.0 trigram tokenizer makes FTS5 snippet() tokens 3-char wide
// (was full words under unicode61), so an individual snippet stays
// at roughly 60 chars, too short to ever exceed the snippet-shorten
// budget cap on a single-hit fixture. To still exercise the budget
// loop deterministically, we ingest multiple hits and pick a budget
// small enough that the loop has to *pop* hits, which flips
// truncated=true regardless of snippet length.
for i in 0..5 {
fs::write(
workspace.join(format!("d{i}.md")),
format!("# T{i}\n\nrust ownership is a memory model.\n"),
)
.unwrap();
}
common::ingest(&cfg, &workspace);
let (stdout, _stderr) = common::run_search_with_args(
@@ -211,8 +223,15 @@ fn search_stale_cursor_returns_error_v1_with_stale_cursor_code() {
fn search_plain_emits_truncated_hint_to_stderr() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
let body: String = "rust ownership is a memory model. ".repeat(10);
fs::write(workspace.join("a.md"), format!("# T\n\n{body}\n")).unwrap();
// v0.17.0 trigram tokenizer — same multi-doc rationale as
// `search_json_truncates_with_max_tokens` above.
for i in 0..5 {
fs::write(
workspace.join(format!("d{i}.md")),
format!("# T{i}\n\nrust ownership is a memory model.\n"),
)
.unwrap();
}
common::ingest(&cfg, &workspace);
let (_stdout, stderr) = common::run_search_with_args(
@@ -224,3 +243,76 @@ fn search_plain_emits_truncated_hint_to_stderr() {
"stderr must carry truncated hint: {stderr:?}"
);
}
#[test]
fn search_plain_emits_short_query_hint_to_stderr() {
// v0.17.0 A5 Step 6: 2-char query under trigram tokenizer emits
// empty hits + stderr `[hint]` advisory. Empty workspace is enough
// — hits are always empty so the hint condition depends only on
// query length (<3 chars trimmed) + non-raw mode + hits.is_empty.
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
common::ingest(&cfg, &workspace);
let (_stdout, stderr) = common::run_search_with_args(
&cfg,
&["--mode", "lexical", "ab"],
);
assert!(
stderr.contains("[hint]"),
"stderr must carry short-query hint: {stderr:?}"
);
assert!(
stderr.contains("3자 이상"),
"hint message must mention '3자 이상' (Korean advisory): {stderr:?}"
);
}
#[test]
fn search_json_emits_hint_field_for_short_query() {
// v0.17.0 A5 Step 6: --json mode carries the same advisory on the
// `search_response.v1.hint` additive field. Empty hits + 2-char
// query + non-raw mode trips the helper. Verifies the MCP-visible
// surface (agents read the field instead of parsing stderr).
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
common::ingest(&cfg, &workspace);
let (stdout, _stderr) = common::run_search_with_args(
&cfg,
&["--json", "--mode", "lexical", "ab"],
);
let v: Value = serde_json::from_str(stdout.trim())
.unwrap_or_else(|e| panic!("not JSON: {stdout:?}: {e}"));
assert!(
v["hits"].as_array().unwrap().is_empty(),
"empty hits expected for short query in empty KB: {v}"
);
assert_eq!(
v["hint"].as_str().expect("hint field set on short empty result"),
"3자 이상 키워드 권장 (trigram tokenizer 제약)",
"hint must carry the standard advisory: {v}"
);
}
#[test]
fn search_json_omits_hint_field_when_query_is_long_enough() {
// v0.17.0 A5 Step 6 (negative case): 3+ char query never trips
// hint, even on an empty KB. Verifies `wire_search_response`
// omits the additive `hint` field when `None` so existing wire
// consumers stay backward-compatible.
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
common::ingest(&cfg, &workspace);
let (stdout, _stderr) = common::run_search_with_args(
&cfg,
&["--json", "--mode", "lexical", "abc"],
);
let v: Value = serde_json::from_str(stdout.trim())
.unwrap_or_else(|e| panic!("not JSON: {stdout:?}: {e}"));
assert!(
v.get("hint").is_none(),
"hint must be absent for ≥3-char queries: {v}"
);
}
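
The three hint tests above pin down when the advisory appears: the hit list is empty, the trimmed query is shorter than 3 characters, and raw FTS5 mode is off. Below is a minimal sketch of that predicate. `short_query_hint` is a hypothetical helper, not the function in `kebab_app`. Counting length in Unicode scalar values rather than bytes is an assumption. The message text is copied from the test expectation.

```rust
/// Advisory text, copied from the assertion in
/// `search_json_emits_hint_field_for_short_query`.
const SHORT_QUERY_HINT: &str = "3자 이상 키워드 권장 (trigram tokenizer 제약)";

/// Hypothetical helper: returns the advisory only when all three
/// conditions described in the test comments hold.
fn short_query_hint(query: &str, hits_empty: bool, raw_mode: bool) -> Option<&'static str> {
    // Assumption: "shorter than 3 chars" is measured in `char`s after trimming.
    let too_short = query.trim().chars().count() < 3;
    if hits_empty && too_short && !raw_mode {
        Some(SHORT_QUERY_HINT)
    } else {
        None
    }
}
```

Returning `Option` fits the wire behaviour shown earlier in the diff: a `None` hint leaves the `hint` key out of the JSON, so consumers that do not know the field are unaffected.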


@@ -122,6 +122,23 @@ pub struct LlmCfg {
pub endpoint: String,
pub temperature: f32,
pub seed: u64,
/// v0.17.0 post-dogfood: Hard ceiling on a single HTTP exchange to
/// the LLM endpoint (Ollama, etc.). Cold-loading an 8B+ model on
/// CPU-only hosts can spend 60-90s on model load + several minutes
/// on a first inference, blowing past the old hard-coded 300s cap
/// and surfacing as `error: kb-rag: llm.generate_stream` to the
/// user. Config-driven so 16-GB / CPU-only deployments using small
/// (≤4B) models can keep the original 300s and large-model dogfood
/// can dial it up (e.g. 1200s) without rebuilding.
///
/// **Edge case — `0` is NOT a disable sentinel.**
/// `reqwest::blocking::ClientBuilder::timeout(Duration::from_secs(0))` sets a
/// 0-second read timeout, so every request fails *immediately* with
/// `error: kb-rag: ollama timeout`. To approximate "no cap", use a
/// large finite value (e.g. `u64::MAX` ≈ 5.8 × 10¹¹ years, or
/// just a generous number like `86400`).
#[serde(default = "default_llm_request_timeout_secs")]
pub request_timeout_secs: u64,
}
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
@@ -147,6 +164,13 @@ fn default_cache_capacity() -> usize {
256
}
/// v0.17.0 post-dogfood: matches the legacy hard-coded ceiling so
/// existing configs that omit the field keep behaving identically.
/// Overridable per config / `KEBAB_MODELS_LLM_REQUEST_TIMEOUT_SECS`.
fn default_llm_request_timeout_secs() -> u64 {
300
}
fn default_stale_threshold_days() -> u32 {
30
}
@@ -363,12 +387,14 @@ impl Config {
// Standardize on the gemma4 family: the OCR (P6-2) and caption (P6-3)
// adapters use the same family. Users who want a larger
// variant (e.g. gemma4:26b) can override it in their own
// config.toml.
// config.toml. For CPU-only / ≤16 GB RAM environments, a ≤4B Q4
// model such as gemma3:4b is recommended (see README).
model: "gemma4:e4b".to_string(),
context_tokens: 32768,
endpoint: "http://127.0.0.1:11434".to_string(),
temperature: 0.0,
seed: 0,
request_timeout_secs: default_llm_request_timeout_secs(),
},
},
search: SearchCfg {
@@ -621,6 +647,11 @@ impl Config {
self.models.llm.seed = n;
}
}
"KEBAB_MODELS_LLM_REQUEST_TIMEOUT_SECS" => {
if let Ok(n) = v.parse::<u64>() {
self.models.llm.request_timeout_secs = n;
}
}
// search
"KEBAB_SEARCH_DEFAULT_K" => {
@@ -873,6 +904,103 @@ mod tests {
assert!((c.models.llm.temperature - 0.7).abs() < 1e-6);
}
/// v0.17.0 post-dogfood: matches the legacy hard-coded 300s cap so
/// existing configs that omit the new field are not affected.
#[test]
fn default_llm_request_timeout_secs_is_300() {
assert_eq!(Config::defaults().models.llm.request_timeout_secs, 300);
}
#[test]
fn env_overrides_models_llm_request_timeout_secs() {
let mut env = HashMap::new();
env.insert(
"KEBAB_MODELS_LLM_REQUEST_TIMEOUT_SECS".to_string(),
"1200".to_string(),
);
let c = Config::defaults().apply_env(&env);
assert_eq!(c.models.llm.request_timeout_secs, 1200);
}
/// v0.17.0 post-dogfood: a config file written before the field
/// existed (no `request_timeout_secs` key) must still parse and fall
/// back to the 300s default — backwards-compat invariant.
#[test]
fn legacy_config_without_request_timeout_secs_uses_default() {
let toml_src = r#"
schema_version = 1
[workspace]
root = "/tmp/x"
exclude = []
[storage]
data_dir = "/tmp/x"
sqlite = "/tmp/x/kebab.sqlite"
vector_dir = "/tmp/x/lancedb"
asset_dir = "/tmp/x/assets"
artifact_dir = "/tmp/x/artifacts"
model_dir = "/tmp/x/models"
runs_dir = "/tmp/x/runs"
copy_threshold_mb = 100
[indexing]
max_parallel_extractors = 2
max_parallel_embeddings = 1
watch_filesystem = false
[chunking]
target_tokens = 500
overlap_tokens = 80
respect_markdown_headings = true
chunker_version = "md-heading-v1"
[models.embedding]
provider = "fastembed"
model = "multilingual-e5-large"
version = "v1"
dimensions = 1024
batch_size = 64
[models.llm]
provider = "ollama"
model = "gemma3:4b"
context_tokens = 4096
endpoint = "http://127.0.0.1:11434"
temperature = 0.0
seed = 0
[search]
default_k = 10
hybrid_fusion = "rrf"
rrf_k = 60
snippet_chars = 220
[rag]
prompt_template_version = "rag-v2"
score_gate = 0.3
explain_default = false
max_context_tokens = 8000
[image.ocr]
enabled = false
engine = "ollama-vision"
model = "gemma3:4b"
languages = ["eng"]
max_pixels = 1600
[image.caption]
enabled = false
max_pixels = 768
prompt_template_version = "caption-v1"
[ui]
theme = "dark"
"#;
let c: Config = toml::from_str(toml_src).expect("parse legacy config");
assert_eq!(c.models.llm.request_timeout_secs, 300);
}
#[test]
fn env_overrides_indexing_watch_filesystem_bool() {
let mut env = HashMap::new();

View File

@@ -5,7 +5,7 @@
"chunk_id": "chunk000000000000000000000000000000",
"doc_id": "doc00000000000000000000000000000000",
"heading_path": [],
"score": 0.3429983854293823
"score": 0.35202541947364807
},
"has_answer": false,
"hits_count": 1,
@@ -19,7 +19,7 @@
"chunk_id": "chunk000000000000000000000000000002",
"doc_id": "doc00000000000000000000000000000002",
"heading_path": [],
"score": 0.3585492968559265
"score": 0.3414848744869232
},
"has_answer": false,
"hits_count": 1,


@@ -48,10 +48,17 @@ use serde::{Deserialize, Serialize};
use crate::error::LlmError;
/// Hard ceiling on a single HTTP exchange. Cold-loading a 14B model on
/// first call can take ~30s; 5 minutes is generous without being
/// open-ended.
const REQUEST_TIMEOUT: Duration = Duration::from_secs(300);
// v0.17.0 post-dogfood: the per-request ceiling now lives in
// `kebab_config::LlmCfg::request_timeout_secs` (default 300s) so users
// running larger models on CPU-only hosts can extend it without a
// rebuild. Cold-loading an 8B+ model on first call routinely takes
// 60-90 s plus multi-minute inference; 300s was the legacy hard
// ceiling and remains the default for back-compat.
//
// Edge case: `request_timeout_secs = 0` becomes
// `Duration::from_secs(0)` which is reqwest's "fail immediately", NOT
// "disable". The field doc explains the workaround (use u64::MAX or a
// large finite value).
/// `reqwest::blocking` adapter implementing [`LanguageModel`] over Ollama's
/// local HTTP API. Construction is cheap and offline; the first network
@@ -79,7 +86,7 @@ impl OllamaLanguageModel {
pub fn new(config: &kebab_config::Config) -> anyhow::Result<Self> {
let llm = &config.models.llm;
let client = reqwest::blocking::Client::builder()
.timeout(REQUEST_TIMEOUT)
.timeout(Duration::from_secs(llm.request_timeout_secs))
.build()?;
Ok(Self {
client,
@@ -262,9 +269,11 @@ struct OllamaLine {
///
/// Timeout invariant: the iterator has no inherent stop condition for an
/// indefinitely-stalled server — only the underlying
/// `reqwest::blocking::Client`'s read timeout (`REQUEST_TIMEOUT`, 300s)
/// breaks the hang. Callers needing tighter cancellation should adjust
/// the client timeout in [`OllamaLanguageModel::new`].
/// `reqwest::blocking::Client`'s read timeout (configured via
/// `kebab_config::LlmCfg::request_timeout_secs`, default 300 s) breaks
/// the hang. Callers needing tighter / looser bounds should set
/// `[models.llm] request_timeout_secs = N` (or
/// `KEBAB_MODELS_LLM_REQUEST_TIMEOUT_SECS=N`) before building.
struct OllamaStream {
reader: BufReader<reqwest::blocking::Response>,
line_buf: Vec<u8>,


@@ -22,6 +22,8 @@ tree-sitter-javascript = { workspace = true }
tree-sitter-go = { workspace = true }
tree-sitter-java = { workspace = true }
tree-sitter-kotlin-ng = { workspace = true }
tree-sitter-c = { workspace = true }
tree-sitter-cpp = { workspace = true }
[dev-dependencies]
tempfile = { workspace = true }


@@ -0,0 +1,720 @@
//! `kebab-parse-code::c` — tree-sitter C AST extractor (P10-1D Task B).
//!
//! Implements [`kebab_core::Extractor`] for [`MediaType::Code("c")`].
//! Walks the tree-sitter parse tree and emits one [`Block::Code`] per
//! top-level AST semantic unit:
//!
//! - `function_definition` → 1 unit, symbol = function name (extracted
//! from the declarator's innermost `identifier`; this also handles
//! pointer-returning functions, whose declarator is wrapped in
//! `pointer_declarator`).
//! - `struct_specifier` (named) → 1 unit, symbol = struct name.
//! - `enum_specifier` (named) → 1 unit, symbol = enum name.
//! - `union_specifier` (named) → 1 unit, symbol = union name.
//!
//! Everything else (`declaration`, `preproc_*`, `type_definition`,
//! `linkage_specification`, etc.) collapses into a single `<top-level>`
//! glue chunk. If the file produces zero units **and** zero glue, the
//! `<module>` post-pass emits one unit covering the whole file (1A-2
//! pattern).
//!
//! A C symbol is the bare unit name (function, struct, enum, or union) with
//! no namespace or class nesting (design §3.4 C row). Per design §3.4 / §9.1 / §9 versioning.
use anyhow::Result;
use kebab_core::{
Block, CanonicalDocument, CodeBlock, CommonBlock, Extractor, Lang, MediaType, Metadata,
ParserVersion, Provenance, ProvenanceEvent, ProvenanceKind, SourceSpan, SourceType, TrustLevel,
id_for_block, id_for_doc,
};
use serde_json::Map;
use time::OffsetDateTime;
use crate::scaffold::{filename_from_workspace_path, strip_extension};
pub const PARSER_VERSION: &str = "code-c-v2";
/// C AST extractor. Per-unit blocks via tree-sitter-c 0.24.2
/// (`LANGUAGE: LanguageFn`) parsed by tree-sitter 0.26.
pub struct CAstExtractor;
impl CAstExtractor {
pub fn new() -> Self {
Self
}
}
impl Default for CAstExtractor {
fn default() -> Self {
Self::new()
}
}
impl Extractor for CAstExtractor {
fn supports(&self, m: &MediaType) -> bool {
matches!(m, MediaType::Code(l) if l == "c")
}
fn parser_version(&self) -> ParserVersion {
ParserVersion(PARSER_VERSION.to_string())
}
fn extract(
&self,
ctx: &kebab_core::ExtractContext<'_>,
bytes: &[u8],
) -> Result<CanonicalDocument> {
let asset = ctx.asset;
if !self.supports(&asset.media_type) {
anyhow::bail!(
"kebab-parse-code: unsupported media_type for CAstExtractor: {:?}",
asset.media_type
);
}
let parser_version = self.parser_version();
let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, &parser_version);
let source = String::from_utf8(bytes.to_vec())
.map_err(|e| anyhow::anyhow!("kebab-parse-code: C source is not valid UTF-8: {e}"))?;
let blocks = build_blocks(&source, &doc_id)?;
let unit_count = blocks.len() as u32;
let now = OffsetDateTime::now_utc();
let mut events: Vec<ProvenanceEvent> = Vec::with_capacity(2);
events.push(ProvenanceEvent {
at: asset.discovered_at,
agent: "kb-source-fs".to_string(),
kind: ProvenanceKind::Discovered,
note: None,
});
events.push(ProvenanceEvent {
at: now,
agent: "kb-parse-code".to_string(),
kind: ProvenanceKind::Parsed,
note: Some(format!(
"parser_version={}; unit_count={}",
parser_version.0, unit_count
)),
});
let title = {
let fname = filename_from_workspace_path(&asset.workspace_path.0);
strip_extension(&fname)
};
let abs_path = match &asset.source_uri {
kebab_core::SourceUri::File(p) => {
if p.is_absolute() {
p.clone()
} else {
ctx.workspace_root.join(p)
}
}
kebab_core::SourceUri::Kb(_) => ctx.workspace_root.to_path_buf(),
};
let (repo, git_branch, git_commit) = match crate::repo::detect_repo(&abs_path) {
Some(r) => (Some(r.name), r.branch, r.commit),
None => (None, None, None),
};
let metadata = Metadata {
aliases: Vec::new(),
tags: Vec::new(),
created_at: asset.discovered_at,
updated_at: asset.discovered_at,
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Map::new(),
repo,
git_branch,
git_commit,
code_lang: Some("c".to_string()),
};
tracing::debug!(
target: "kebab-parse-code",
"extracted C doc_id={} workspace_path={} units={}",
doc_id.0,
asset.workspace_path.0,
unit_count
);
Ok(CanonicalDocument {
doc_id,
source_asset_id: asset.asset_id.clone(),
workspace_path: asset.workspace_path.clone(),
title,
lang: Lang("und".to_string()),
blocks,
metadata,
provenance: Provenance { events },
parser_version,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
})
}
}
/// Walk down the declarator chain of a `function_definition` to find
/// the innermost `identifier` — the function name.
///
/// The tree for `int *foo(int x) { ... }` looks like:
/// ```text
/// function_definition
/// type: primitive_type "int"
/// declarator: pointer_declarator
/// declarator: function_declarator
/// declarator: identifier "foo"
/// parameters: parameter_list
/// body: compound_statement
/// ```
/// We walk `declarator` fields recursively until we reach an `identifier`
/// or run out of nodes. Returns `None` if no identifier is found
/// (malformed / unsupported declarator shape).
fn extract_fn_name<'a>(decl_node: tree_sitter::Node, src: &'a str) -> Option<&'a str> {
let mut cur = decl_node;
loop {
match cur.kind() {
"identifier" => return Some(&src[cur.start_byte()..cur.end_byte()]),
// pointer_declarator, function_declarator, array_declarator,
// attributed_declarator, parenthesized_declarator —
// all carry a `declarator` field pointing deeper.
_ => {
if let Some(inner) = cur.child_by_field_name("declarator") {
cur = inner;
} else {
// No further `declarator` field; give up.
return None;
}
}
}
}
}
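
Review note: the declarator walk above can be illustrated without tree-sitter. The sketch below is a minimal model, assuming a hypothetical `Decl` enum in place of `tree_sitter::Node`; `Decl` and `innermost_name` are illustrative names and not part of this crate. It mirrors the same loop: stop at an identifier, otherwise follow the nested `declarator` field, and give up when there is none.

```rust
/// Toy stand-in for a declarator node (hypothetical; the real code walks
/// `tree_sitter::Node` values via `child_by_field_name("declarator")`).
enum Decl<'a> {
    /// An `identifier` leaf holding the function name.
    Identifier(&'a str),
    /// Any wrapper kind (pointer, function, array, ...) with an optional
    /// nested `declarator` field.
    Wrapper(Option<Box<Decl<'a>>>),
}

/// Follow nested declarators until an identifier is reached.
/// Returns `None` when a wrapper has no further `declarator` field.
fn innermost_name<'a>(decl: &Decl<'a>) -> Option<&'a str> {
    let mut cur = decl;
    loop {
        match cur {
            Decl::Identifier(name) => return Some(*name),
            Decl::Wrapper(Some(inner)) => cur = &**inner,
            Decl::Wrapper(None) => return None,
        }
    }
}
```

For `int *foo(int x) { ... }` the model is two wrappers (pointer, then function) around `Identifier("foo")`, and the walk yields `Some("foo")`.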
fn build_blocks(
source: &str,
doc_id: &kebab_core::DocumentId,
) -> anyhow::Result<Vec<kebab_core::Block>> {
let mut parser = tree_sitter::Parser::new();
parser
.set_language(&tree_sitter_c::LANGUAGE.into())
.map_err(|e| anyhow::anyhow!("set tree-sitter-c language: {e}"))?;
let tree = parser
.parse(source.as_bytes(), None)
.ok_or_else(|| anyhow::anyhow!("tree-sitter failed to parse C source"))?;
let lines: Vec<&str> = source.split('\n').collect();
let root = tree.root_node();
// units: (symbol, line_start, line_end, is_real_semantic_unit).
// Glue is accumulated as (start, end) pairs and flushed into one
// "<top-level>" block (or "<module>" if no real unit exists).
let mut units: Vec<(String, u32, u32, bool)> = Vec::new();
let mut glue: Vec<(u32, u32)> = Vec::new();
/// Walk preceding `comment` siblings to extend the unit's line range
/// upward, folding doc / line comments into the unit (1B pattern).
fn unit_start(n: &tree_sitter::Node) -> u32 {
let mut start = n.start_position().row as u32 + 1;
let mut prev = n.prev_sibling();
while let Some(p) = prev {
if p.kind() == "comment" {
start = p.start_position().row as u32 + 1;
prev = p.prev_sibling();
} else {
break;
}
}
start
}
let mut cur = root.walk();
for child in root.named_children(&mut cur) {
let s = unit_start(&child);
let e = child.end_position().row as u32 + 1;
match child.kind() {
"function_definition" => {
if let Some(decl) = child.child_by_field_name("declarator") {
if let Some(name) = extract_fn_name(decl, source) {
flush_glue(&mut glue, &mut units);
units.push((name.to_string(), s, e, true));
} else {
// Could not extract name — treat as glue.
glue.push((s, e));
}
} else {
glue.push((s, e));
}
}
"struct_specifier" | "enum_specifier" | "union_specifier" => {
if let Some(name_node) = child.child_by_field_name("name") {
let name = &source[name_node.start_byte()..name_node.end_byte()];
flush_glue(&mut glue, &mut units);
units.push((name.to_string(), s, e, true));
} else {
// Anonymous struct/enum/union at the top level (not
// wrapped in typedef) — glue. typedef-wrapped case
// is recovered in the `type_definition` arm below.
glue.push((s, e));
}
}
"type_definition" => {
// v0.17.0 PR-B: typedef-wrapped anonymous aggregate
// recovery. `typedef struct { ... } Foo;` exposes only
// the alias `Foo` as a useful symbol — the inner
// struct_specifier has no `name` field. Pre-v0.17.0
// this whole construct collapsed into glue and hid the
// alias from search (HOTFIXES 2026-05-21). v2 recovers
// the alias from the `declarator` field and emits a
// synthetic unit so `Citation::Code.symbol = "Foo"`.
// Plain `typedef int MyInt;` (no inner aggregate) stays
// glue — there's no struct body to name.
if let Some(name) = recover_typedef_alias(child, source) {
flush_glue(&mut glue, &mut units);
units.push((name, s, e, true));
} else {
glue.push((s, e));
}
}
// Everything else: preprocessor directives, plain declarations
// (global var / fn prototype), linkage_specification, etc.
// — all collapse into glue.
_ => {
glue.push((s, e));
}
}
}
flush_glue(&mut glue, &mut units);
// Post-pass: if the file has no real semantic unit (only glue, or
// completely empty), rename the single glue unit to "<module>" and
// emit it. If there are zero units AND zero glue, synthesise a
// single "<module>" covering the whole file.
let has_real_unit = units.iter().any(|(_, _, _, is_real)| *is_real);
if units.is_empty() {
// Completely empty or whitespace-only file (a comment-only file
// still yields glue, because `comment` nodes hit the catch-all arm).
let total = lines.len() as u32;
units.push((
"<module>".to_string(),
1,
total.max(1),
false,
));
}
// If there is only glue (no real unit) the single pushed "<top-level>"
// label should be "<module>" — rename it now.
if !has_real_unit {
for (sym, _, _, _) in units.iter_mut() {
if sym == "<top-level>" {
*sym = "<module>".to_string();
}
}
}
let total_lines = lines.len() as u32;
let mut blocks = Vec::with_capacity(units.len());
for (ordinal, (symbol, ls, le, _is_real)) in units.into_iter().enumerate() {
let line_start = ls.max(1);
let line_end = le.min(total_lines.max(1));
let span = SourceSpan::Code {
line_start,
line_end,
symbol: Some(symbol),
lang: Some("c".to_string()),
};
let block_id = id_for_block(doc_id, "code", &[], ordinal as u32, &span);
let code = lines[(line_start as usize - 1)..=(line_end as usize - 1)].join("\n");
blocks.push(Block::Code(CodeBlock {
common: CommonBlock {
block_id,
heading_path: Vec::new(),
source_span: span,
},
lang: Some("c".to_string()),
code,
}));
}
Ok(blocks)
}
/// v0.17.0 PR-B: try to recover the typedef alias name from a
/// `type_definition` node *iff* the inner type-specifier is an
/// anonymous struct/enum/union. Returns `None` for any other shape
/// (named aggregate handled elsewhere, plain type alias has no body
/// worth naming).
fn recover_typedef_alias(node: tree_sitter::Node, source: &str) -> Option<String> {
let mut has_anon_aggregate = false;
let mut cursor = node.walk();
for sub in node.children(&mut cursor) {
match sub.kind() {
"struct_specifier" | "enum_specifier" | "union_specifier" => {
if sub.child_by_field_name("name").is_none() {
has_anon_aggregate = true;
} else {
// Named inner aggregate (e.g. `typedef struct Pt {...} P;`)
// — the named struct itself is the primary symbol and
// is *not* extracted at the top level today (it lives
// inside `type_definition`, not as a sibling
// `struct_specifier`). For v2 we keep behavior conservative:
// return None so the type_definition stays glue, matching
// pre-v2 behavior for this minor case. Real-world C tends
// to use one of: bare named struct, typedef alias only,
// or typedef on anonymous body — the latter is what we fix.
return None;
}
}
_ => {}
}
}
if !has_anon_aggregate {
return None;
}
let decl = node.child_by_field_name("declarator")?;
extract_typedef_alias_name(decl, source).map(str::to_string)
}
/// Extract the typedef alias identifier from a declarator subtree.
/// Handles the common shapes: direct `type_identifier`, or one wrapped
/// in pointer / function declarator nodes. Returns the first
/// `type_identifier` found in a depth-first, left-to-right walk.
fn extract_typedef_alias_name<'a>(
decl: tree_sitter::Node,
source: &'a str,
) -> Option<&'a str> {
if decl.kind() == "type_identifier" {
return Some(&source[decl.start_byte()..decl.end_byte()]);
}
let mut cursor = decl.walk();
for sub in decl.children(&mut cursor) {
if let Some(found) = extract_typedef_alias_name(sub, source) {
return Some(found);
}
}
None
}
fn flush_glue(glue: &mut Vec<(u32, u32)>, units: &mut Vec<(String, u32, u32, bool)>) {
if glue.is_empty() {
return;
}
let s = glue.iter().map(|(a, _)| *a).min().unwrap();
let e = glue.iter().map(|(_, b)| *b).max().unwrap();
units.push(("<top-level>".to_string(), s, e, false));
glue.clear();
}
// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------
#[cfg(test)]
pub(crate) mod tests_support {
use kebab_core::*;
use std::path::PathBuf;
use time::OffsetDateTime;
pub fn fixed_code_asset(workspace_path: &str, lang: &str) -> RawAsset {
RawAsset {
asset_id: AssetId("a".repeat(64)),
source_uri: SourceUri::File(PathBuf::from(workspace_path)),
workspace_path: WorkspacePath(workspace_path.to_string()),
media_type: MediaType::Code(lang.to_string()),
byte_len: 0,
checksum: Checksum("b".repeat(64)),
discovered_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
stored: AssetStorage::Reference {
path: PathBuf::from(workspace_path),
sha: Checksum("b".repeat(64)),
},
}
}
pub fn extract_c(src: &str, path: &str) -> kebab_core::CanonicalDocument {
use super::CAstExtractor;
use kebab_core::Extractor;
let asset = fixed_code_asset(path, "c");
let cfg = ExtractConfig::default();
let root = PathBuf::from("/tmp");
let ctx = ExtractContext {
asset: &asset,
workspace_root: &root,
config: &cfg,
};
CAstExtractor::new().extract(&ctx, src.as_bytes()).unwrap()
}
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{Block, MediaType, SourceSpan};
fn syms(doc: &kebab_core::CanonicalDocument) -> Vec<String> {
doc.blocks
.iter()
.filter_map(|b| match b {
Block::Code(c) => match &c.common.source_span {
SourceSpan::Code { symbol, .. } => symbol.clone(),
_ => None,
},
_ => None,
})
.collect()
}
#[test]
fn extractor_supports_only_media_code_c() {
let e = CAstExtractor::new();
assert!(e.supports(&MediaType::Code("c".into())));
assert!(!e.supports(&MediaType::Code("cpp".into())));
assert!(!e.supports(&MediaType::Code("rust".into())));
assert!(!e.supports(&MediaType::Markdown));
}
#[test]
fn c_extractor_simple_function() {
let src = "int add(int a, int b) { return a + b; }\n";
let doc = tests_support::extract_c(src, "x/math.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "add"), "got {s:?}");
}
#[test]
fn c_extractor_pointer_return_function() {
let src = "int *find(int *arr, int n) { return arr; }\n";
let doc = tests_support::extract_c(src, "x/find.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "find"), "ptr-return fn missing: {s:?}");
}
#[test]
fn c_extractor_static_function() {
let src = "static void helper(void) {}\n";
let doc = tests_support::extract_c(src, "x/helper.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "helper"), "static fn missing: {s:?}");
}
#[test]
fn c_extractor_extern_function() {
let src = "extern int compute(int x);\n";
// extern prototype is a declaration → glue
let doc = tests_support::extract_c(src, "x/compute.c");
let s = syms(&doc);
// declaration (prototype) falls into glue → "<module>"
assert!(
s.iter().any(|x| x == "<module>"),
"expected <module> for extern proto: {s:?}"
);
}
#[test]
fn c_extractor_inline_function() {
let src = "inline int square(int x) { return x * x; }\n";
let doc = tests_support::extract_c(src, "x/square.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "square"), "inline fn missing: {s:?}");
}
#[test]
fn c_extractor_named_struct() {
let src = "struct Point { int x; int y; };\n";
let doc = tests_support::extract_c(src, "x/point.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "Point"), "struct missing: {s:?}");
}
#[test]
fn c_extractor_named_enum() {
let src = "enum Color { RED, GREEN, BLUE };\n";
let doc = tests_support::extract_c(src, "x/color.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "Color"), "enum missing: {s:?}");
}
#[test]
fn c_extractor_named_union() {
let src = "union Data { int i; float f; };\n";
let doc = tests_support::extract_c(src, "x/data.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "Data"), "union missing: {s:?}");
}
#[test]
fn c_extractor_anonymous_struct_falls_into_glue() {
// Anonymous struct (no name field) → glue → "<module>" (only glue, no real unit)
let src = "struct { int x; int y; } origin;\n";
let doc = tests_support::extract_c(src, "x/anon.c");
let s = syms(&doc);
// anonymous struct is a declaration containing anonymous struct_specifier → glue
assert!(
s.iter().any(|x| x == "<module>"),
"expected <module> for anon struct: {s:?}"
);
// Must NOT emit a unit named after anything else
assert!(
!s.iter().any(|x| x == "origin"),
"unexpected 'origin' unit: {s:?}"
);
}
#[test]
fn c_extractor_typedef_struct_emits_unit() {
// v0.17.0 PR-B: `typedef struct { ... } Foo;` was previously a
// hotfix-tracked deviation (HOTFIXES.md 2026-05-21) — the inner
// struct_specifier is anonymous so the named-struct arm didn't
// fire, dropping the whole construct into glue and hiding the
// `Foo` alias from symbol search. The v2 extractor recovers the
// typedef alias from the `declarator` field on the
// `type_definition` node and emits a synthetic unit with that
// name. parser_version bumped `code-c-v1` → `code-c-v2`.
let src = "typedef struct { int x; int y; } Point;\n";
let doc = tests_support::extract_c(src, "x/typedef.c");
let s = syms(&doc);
// The typedef alias surfaces as a Code symbol.
assert!(
s.iter().any(|x| x == "Point"),
"expected 'Point' unit from typedef alias: {s:?}"
);
// No `<module>` (the file has exactly one semantic unit now,
// the typedef alias — no glue-only fallback needed).
assert!(
!s.iter().any(|x| x == "<module>"),
"no <module> fallback expected when typedef emits a unit: {s:?}"
);
}
#[test]
fn c_extractor_typedef_enum_emits_unit() {
// Parallel coverage for enum_specifier — same typedef-alias
// synthesis path. `typedef enum { A, B } Color;` → unit `Color`.
let src = "typedef enum { A, B } Color;\n";
let doc = tests_support::extract_c(src, "x/typedef_enum.c");
let s = syms(&doc);
assert!(
s.iter().any(|x| x == "Color"),
"expected 'Color' unit from typedef enum alias: {s:?}"
);
}
#[test]
fn c_extractor_typedef_union_emits_unit() {
// Parallel coverage for union_specifier.
let src = "typedef union { int i; float f; } IntOrFloat;\n";
let doc = tests_support::extract_c(src, "x/typedef_union.c");
let s = syms(&doc);
assert!(
s.iter().any(|x| x == "IntOrFloat"),
"expected 'IntOrFloat' unit from typedef union alias: {s:?}"
);
}
#[test]
fn c_extractor_typedef_to_existing_type_stays_glue() {
// Negative case: `typedef int MyInt;` has no inner struct/enum/
// union — there's no struct body to attach the alias to, so the
// construct falls into glue (becomes `<module>` when alone).
// Confirms the new arm only emits a unit for typedefs of anonymous aggregates.
let src = "typedef int MyInt;\n";
let doc = tests_support::extract_c(src, "x/typedef_alias.c");
let s = syms(&doc);
assert!(
s.iter().any(|x| x == "<module>"),
"expected <module> for plain typedef alias: {s:?}"
);
assert!(
!s.iter().any(|x| x == "MyInt"),
"plain typedef alias must not emit a unit: {s:?}"
);
}
#[test]
fn c_extractor_preprocessor_directives_are_glue() {
let src = "#include <stdio.h>\n#define MAX 100\n#ifdef DEBUG\n#endif\n";
let doc = tests_support::extract_c(src, "x/macros.c");
let s = syms(&doc);
// Only preprocessor → no real unit → "<module>"
assert!(
s.iter().any(|x| x == "<module>"),
"expected <module> for preproc-only file: {s:?}"
);
assert_eq!(s.len(), 1, "expected exactly 1 block: {s:?}");
}
#[test]
fn c_extractor_multiple_functions_correct_count() {
let src = "int foo(void) { return 1; }\nint bar(void) { return 2; }\nint baz(void) { return 3; }\n";
let doc = tests_support::extract_c(src, "x/multi.c");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "foo"), "foo missing: {s:?}");
assert!(s.iter().any(|x| x == "bar"), "bar missing: {s:?}");
assert!(s.iter().any(|x| x == "baz"), "baz missing: {s:?}");
assert_eq!(s.len(), 3, "expected 3 units: {s:?}");
}
#[test]
fn c_extractor_empty_file_produces_module() {
let src = "";
let doc = tests_support::extract_c(src, "x/empty.c");
let s = syms(&doc);
assert_eq!(s, vec!["<module>"], "expected <module>: got {s:?}");
}
#[test]
fn c_extractor_preprocessor_only_produces_module() {
let src = "#include <stdlib.h>\n#define VERSION \"1.0\"\n";
let doc = tests_support::extract_c(src, "x/header.c");
let s = syms(&doc);
assert!(
s.iter().any(|x| x == "<module>"),
"expected <module> for preproc-only file: {s:?}"
);
}
#[test]
fn c_extractor_mixed_functions_and_glue() {
let src = r#"#include <stdio.h>
int compute(int x) {
return x * 2;
}
extern int lookup(int key);
void print_result(int v) {
printf("%d\n", v);
}
"#;
let doc = tests_support::extract_c(src, "x/mixed.c");
let s = syms(&doc);
// Two real functions + one glue block
assert!(s.iter().any(|x| x == "compute"), "compute missing: {s:?}");
assert!(s.iter().any(|x| x == "print_result"), "print_result missing: {s:?}");
assert!(
s.iter().any(|x| x == "<top-level>"),
"<top-level> glue missing: {s:?}"
);
}
#[test]
fn c_extractor_deterministic_across_runs() {
let src = r#"
struct Node { int val; };
int sum(int a, int b) { return a + b; }
void noop(void) {}
"#;
let a = tests_support::extract_c(src, "x/det.c");
for _ in 0..20 {
assert_eq!(
tests_support::extract_c(src, "x/det.c").blocks,
a.blocks
);
}
}
}
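
Review note: the unit/glue grouping and the `<module>` post-pass exercised by the tests above can be sketched in isolation. This is a minimal model under stated assumptions, not the extractor itself: `Item` and `group` are hypothetical names standing in for the classified tree-sitter nodes and for the body of `build_blocks`, while `flush_glue` follows the same merge rule (minimum start line, maximum end line).

```rust
/// One top-level item after classification (hypothetical stand-in for the
/// tree-sitter nodes the extractor walks).
enum Item {
    /// A named unit: (symbol, first line, last line).
    Unit(&'static str, u32, u32),
    /// Anything else: (first line, last line).
    Glue(u32, u32),
}

/// Merge all pending glue ranges into one `<top-level>` entry.
fn flush_glue(glue: &mut Vec<(u32, u32)>, units: &mut Vec<(String, u32, u32, bool)>) {
    let (Some(s), Some(e)) = (
        glue.iter().map(|(a, _)| *a).min(),
        glue.iter().map(|(_, b)| *b).max(),
    ) else {
        return; // nothing pending
    };
    units.push(("<top-level>".to_string(), s, e, false));
    glue.clear();
}

/// Group items into (symbol, first line, last line) blocks.
fn group(items: &[Item], total_lines: u32) -> Vec<(String, u32, u32)> {
    let mut units = Vec::new();
    let mut glue = Vec::new();
    for item in items {
        match *item {
            Item::Unit(name, s, e) => {
                // A real unit closes the glue run that precedes it.
                flush_glue(&mut glue, &mut units);
                units.push((name.to_string(), s, e, true));
            }
            Item::Glue(s, e) => glue.push((s, e)),
        }
    }
    flush_glue(&mut glue, &mut units);

    let has_real_unit = units.iter().any(|u| u.3);
    if units.is_empty() {
        // No units and no glue: one block covering the whole file.
        units.push(("<module>".to_string(), 1, total_lines.max(1), false));
    }
    if !has_real_unit {
        // Glue-only file: relabel the glue block(s) as `<module>`.
        for u in units.iter_mut() {
            if u.0 == "<top-level>" {
                u.0 = "<module>".to_string();
            }
        }
    }
    units.into_iter().map(|(sym, s, e, _)| (sym, s, e)).collect()
}
```

Under this model a file shaped like the `mixed_functions_and_glue` fixture (include, function, prototype, function) produces two separate `<top-level>` blocks, one per glue run, while a glue-only or empty input produces a single `<module>` block.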

@@ -0,0 +1,883 @@
//! `kebab-parse-code::cpp` — tree-sitter C++ AST extractor (P10-1D Task C).
//!
//! Implements [`kebab_core::Extractor`] for [`MediaType::Code("cpp")`].
//! Walks the tree-sitter parse tree and emits one [`Block::Code`] per
//! top-level AST semantic unit, each carrying [`SourceSpan::Code`] with
//! the unit's `::` separated symbol path (design §3.4 C++ row).
//!
//! ## Symbol formation
//!
//! Symbol = `namespace::Class::method` via recursive `build_blocks`:
//!
//! - `namespace_definition` (named) → push namespace name, recurse into body.
//! - Anonymous namespace (`namespace { ... }`) → push `<anonymous>`, recurse.
//! - `nested_namespace_specifier` (`outer::inner`) → push all segments, recurse.
//! - `class_specifier` / `struct_specifier` (named) → emit class unit + recurse
//! into body with class name pushed.
//! - `function_definition` → emit method/function unit. Symbol is built from
//! the prefix chain + the extracted declarator name component.
//! - Out-of-class method def (`void Foo::bar() {}`) — the declarator's inner
//! node is a `qualified_identifier`; its scope chain is prepended to the
//! current prefix to form the full symbol.
//! - `template_declaration` → recurse into named children with same prefix;
//! the inner function/class body is matched by its own arm. Template params
//! are NOT included in the symbol.
//! - `enum_specifier` (named) → emit type unit.
//! - `concept_definition` (C++20) → emit type unit.
//! - `linkage_specification` (extern "C") → recurse into body with same prefix.
//!
//! ## Constructor / destructor / operator overload
//!
//! - Constructor: `function_declarator > identifier` matching the class name.
//! Symbol = `Class::Class` (name duplicated, same convention as Java).
//! - Destructor: `function_declarator > destructor_name`. Symbol = `Class::~Class`.
//! - Operator overload: `function_declarator > operator_name`. Symbol = `Class::operator+`.
//! - Conversion operator: `function_definition.declarator` is `operator_cast`.
//! Symbol = `Class::operator <type>` (e.g. `Class::operator bool`).
//!
//! ## Glue
//!
//! Everything not in the unit list (preproc, declarations, using, typedef,
//! etc.) is glue: each contiguous run of glue between units merges into one
//! `<top-level>` chunk. If the file has no real unit, the `<module>` post-pass
//! relabels the glue as `<module>`; if it has neither units nor glue, it emits
//! one `<module>` unit covering the whole file.
//!
//! Per design §3.4 / §9.1 / §9 versioning.
use anyhow::Result;
use kebab_core::{
Block, CanonicalDocument, CodeBlock, CommonBlock, Extractor, Lang, MediaType, Metadata,
ParserVersion, Provenance, ProvenanceEvent, ProvenanceKind, SourceSpan, SourceType, TrustLevel,
id_for_block, id_for_doc,
};
use serde_json::Map;
use time::OffsetDateTime;
use crate::scaffold::{filename_from_workspace_path, strip_extension};
pub const PARSER_VERSION: &str = "code-cpp-v1";
/// C++ AST extractor. Per-unit blocks via tree-sitter-cpp 0.23.4
/// (`LANGUAGE: LanguageFn`) parsed by tree-sitter 0.26.
pub struct CppAstExtractor;
impl CppAstExtractor {
pub fn new() -> Self {
Self
}
}
impl Default for CppAstExtractor {
fn default() -> Self {
Self::new()
}
}
impl Extractor for CppAstExtractor {
fn supports(&self, m: &MediaType) -> bool {
matches!(m, MediaType::Code(l) if l == "cpp")
}
fn parser_version(&self) -> ParserVersion {
ParserVersion(PARSER_VERSION.to_string())
}
fn extract(
&self,
ctx: &kebab_core::ExtractContext<'_>,
bytes: &[u8],
) -> Result<CanonicalDocument> {
let asset = ctx.asset;
if !self.supports(&asset.media_type) {
anyhow::bail!(
"kebab-parse-code: unsupported media_type for CppAstExtractor: {:?}",
asset.media_type
);
}
let parser_version = self.parser_version();
let doc_id = id_for_doc(&asset.workspace_path, &asset.asset_id, &parser_version);
let source = String::from_utf8(bytes.to_vec()).map_err(|e| {
anyhow::anyhow!("kebab-parse-code: C++ source is not valid UTF-8: {e}")
})?;
let blocks = build_blocks_top(&source, &doc_id)?;
let unit_count = blocks.len() as u32;
let now = OffsetDateTime::now_utc();
let mut events: Vec<ProvenanceEvent> = Vec::with_capacity(2);
events.push(ProvenanceEvent {
at: asset.discovered_at,
agent: "kb-source-fs".to_string(),
kind: ProvenanceKind::Discovered,
note: None,
});
events.push(ProvenanceEvent {
at: now,
agent: "kb-parse-code".to_string(),
kind: ProvenanceKind::Parsed,
note: Some(format!(
"parser_version={}; unit_count={}",
parser_version.0, unit_count
)),
});
let title = {
let fname = filename_from_workspace_path(&asset.workspace_path.0);
strip_extension(&fname)
};
let abs_path = match &asset.source_uri {
kebab_core::SourceUri::File(p) => {
if p.is_absolute() {
p.clone()
} else {
ctx.workspace_root.join(p)
}
}
kebab_core::SourceUri::Kb(_) => ctx.workspace_root.to_path_buf(),
};
let (repo, git_branch, git_commit) = match crate::repo::detect_repo(&abs_path) {
Some(r) => (Some(r.name), r.branch, r.commit),
None => (None, None, None),
};
let metadata = Metadata {
aliases: Vec::new(),
tags: Vec::new(),
created_at: asset.discovered_at,
updated_at: asset.discovered_at,
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Map::new(),
repo,
git_branch,
git_commit,
code_lang: Some("cpp".to_string()),
};
tracing::debug!(
target: "kebab-parse-code",
"extracted C++ doc_id={} workspace_path={} units={}",
doc_id.0,
asset.workspace_path.0,
unit_count
);
Ok(CanonicalDocument {
doc_id,
source_asset_id: asset.asset_id.clone(),
workspace_path: asset.workspace_path.clone(),
title,
lang: Lang("und".to_string()),
blocks,
metadata,
provenance: Provenance { events },
parser_version,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
})
}
}
// ---------------------------------------------------------------------------
// Core block-building logic
// ---------------------------------------------------------------------------
/// Top-level entry: parse source, walk the `translation_unit` root, assemble
/// units + glue, apply the `<module>` post-pass, and emit `Block::Code`s.
fn build_blocks_top(
source: &str,
doc_id: &kebab_core::DocumentId,
) -> anyhow::Result<Vec<kebab_core::Block>> {
let mut parser = tree_sitter::Parser::new();
parser
.set_language(&tree_sitter_cpp::LANGUAGE.into())
.map_err(|e| anyhow::anyhow!("set tree-sitter-cpp language: {e}"))?;
let tree = parser
.parse(source.as_bytes(), None)
.ok_or_else(|| anyhow::anyhow!("tree-sitter failed to parse C++ source"))?;
let lines: Vec<&str> = source.split('\n').collect();
let root = tree.root_node();
// units: (symbol, line_start, line_end, is_real_semantic_unit).
// Glue is accumulated as (start, end) pairs and flushed into one
// "<top-level>" block (or "<module>" if no real unit exists).
let mut units: Vec<(String, u32, u32, bool)> = Vec::new();
let mut glue: Vec<(u32, u32)> = Vec::new();
build_blocks(root, source, &[], &mut units, &mut glue);
flush_glue(&mut glue, &mut units);
// Post-pass: if the file has no real semantic unit (only glue, or
// completely empty), rename the single glue unit to "<module>".
// If there are zero units AND zero glue, synthesize a single
// "<module>" covering the whole file.
let has_real_unit = units.iter().any(|(_, _, _, is_real)| *is_real);
if units.is_empty() {
let total = lines.len() as u32;
units.push(("<module>".to_string(), 1, total.max(1), false));
}
if !has_real_unit {
for (sym, _, _, _) in units.iter_mut() {
if sym == "<top-level>" {
*sym = "<module>".to_string();
}
}
}
let total_lines = lines.len() as u32;
let mut blocks = Vec::with_capacity(units.len());
for (ordinal, (symbol, ls, le, _is_real)) in units.into_iter().enumerate() {
let line_start = ls.max(1);
let line_end = le.min(total_lines.max(1));
let span = SourceSpan::Code {
line_start,
line_end,
symbol: Some(symbol),
lang: Some("cpp".to_string()),
};
let block_id = id_for_block(doc_id, "code", &[], ordinal as u32, &span);
let code = lines[(line_start as usize - 1)..=(line_end as usize - 1)].join("\n");
blocks.push(Block::Code(CodeBlock {
common: CommonBlock {
block_id,
heading_path: Vec::new(),
source_span: span,
},
lang: Some("cpp".to_string()),
code,
}));
}
Ok(blocks)
}
/// Walk preceding `comment` siblings to extend the unit's line range upward,
/// folding leading doc / line comments into the unit (1B pattern).
fn unit_start(n: &tree_sitter::Node) -> u32 {
let mut start = n.start_position().row as u32 + 1;
let mut prev = n.prev_sibling();
while let Some(p) = prev {
if p.kind() == "comment" {
start = p.start_position().row as u32 + 1;
prev = p.prev_sibling();
} else {
break;
}
}
start
}
fn flush_glue(glue: &mut Vec<(u32, u32)>, units: &mut Vec<(String, u32, u32, bool)>) {
if glue.is_empty() {
return;
}
let s = glue.iter().map(|(a, _)| *a).min().unwrap();
let e = glue.iter().map(|(_, b)| *b).max().unwrap();
units.push(("<top-level>".to_string(), s, e, false));
glue.clear();
}
/// Walk a scope node (translation_unit, declaration_list, field_declaration_list)
/// emitting unit + glue blocks. `prefix` is the current namespace/class chain
/// (e.g. `["kebab", "Chunk", "Foo"]`).
///
/// This function does NOT flush pending glue before returning; the caller
/// is responsible for flushing at the scope boundary. `build_blocks_top`
/// flushes after the top-level walk, and the namespace/class arms flush
/// right after recursing into a body so that glue doesn't leak across scopes.
fn build_blocks(
node: tree_sitter::Node,
source: &str,
prefix: &[String],
units: &mut Vec<(String, u32, u32, bool)>,
glue: &mut Vec<(u32, u32)>,
) {
let mut cur = node.walk();
for child in node.named_children(&mut cur) {
let s = unit_start(&child);
let e = child.end_position().row as u32 + 1;
match child.kind() {
"namespace_definition" => {
// Flush pending glue before starting this namespace block.
flush_glue(glue, units);
let name_node = child.child_by_field_name("name");
let body = child
.child_by_field_name("body")
.unwrap_or(child);
match name_node {
None => {
// Anonymous namespace: push "<anonymous>", recurse.
let mut new_prefix = prefix.to_vec();
new_prefix.push("<anonymous>".to_string());
build_blocks(body, source, &new_prefix, units, glue);
flush_glue(glue, units);
}
Some(nn) => match nn.kind() {
"namespace_identifier" => {
let name = &source[nn.start_byte()..nn.end_byte()];
let mut new_prefix = prefix.to_vec();
new_prefix.push(name.to_string());
build_blocks(body, source, &new_prefix, units, glue);
flush_glue(glue, units);
}
"nested_namespace_specifier" => {
// e.g. `namespace outer::inner { ... }`
// All named children are namespace_identifier nodes.
let mut new_prefix = prefix.to_vec();
let mut nc = nn.walk();
for seg in nn.named_children(&mut nc) {
new_prefix.push(source[seg.start_byte()..seg.end_byte()].to_string());
}
build_blocks(body, source, &new_prefix, units, glue);
flush_glue(glue, units);
}
_ => {
// Unknown name kind — treat entire namespace as glue.
glue.push((s, e));
}
},
}
}
"class_specifier" | "struct_specifier" => {
let name_node = child.child_by_field_name("name");
let Some(nn) = name_node else {
// Anonymous class/struct — glue.
glue.push((s, e));
continue;
};
// `type_identifier` for a plain name; otherwise template_type or
// qualified_identifier. Either way the node's full text is the
// symbol segment (so template args are included when present).
let name = &source[nn.start_byte()..nn.end_byte()];
flush_glue(glue, units);
let sym = build_symbol(prefix, &[name]);
units.push((sym, s, e, true));
if let Some(body) = child.child_by_field_name("body") {
let mut new_prefix = prefix.to_vec();
new_prefix.push(name.to_string());
build_blocks(body, source, &new_prefix, units, glue);
flush_glue(glue, units);
}
}
"function_definition" => {
let decl = child.child_by_field_name("declarator");
let Some(decl_node) = decl else {
glue.push((s, e));
continue;
};
match extract_fn_symbol(decl_node, source, prefix) {
Some(sym) => {
flush_glue(glue, units);
units.push((sym, s, e, true));
}
None => {
glue.push((s, e));
}
}
}
"template_declaration" => {
// Unwrap: recurse into named children with same prefix.
// The inner function/class/concept will be matched by their own
// arms. template_parameter_list is also a named child, but it
// matches none of the unit arms, so it falls through to glue.
build_blocks(child, source, prefix, units, glue);
// Do NOT flush glue here — template body may be part of a glue group.
}
"enum_specifier" => {
if let Some(nn) = child.child_by_field_name("name") {
let name = &source[nn.start_byte()..nn.end_byte()];
flush_glue(glue, units);
let sym = build_symbol(prefix, &[name]);
units.push((sym, s, e, true));
} else {
// Anonymous enum — glue.
glue.push((s, e));
}
}
"concept_definition" => {
// C++20. Has required "name" field (identifier).
if let Some(nn) = child.child_by_field_name("name") {
let name = &source[nn.start_byte()..nn.end_byte()];
flush_glue(glue, units);
let sym = build_symbol(prefix, &[name]);
units.push((sym, s, e, true));
} else {
glue.push((s, e));
}
}
"linkage_specification" => {
// extern "C" { ... }: a transparent wrapper. Recurse into the body
// with the same prefix so inner definitions are extracted.
let body = child.child_by_field_name("body").unwrap_or(child);
// The wrapper itself is neither emitted as a unit nor pushed as
// glue; the recursion classifies its children individually.
build_blocks(body, source, prefix, units, glue);
}
// Everything else: preproc, declarations, using, typedef, etc.
_ => {
glue.push((s, e));
}
}
}
}
/// Join prefix + extras into a `::` separated symbol.
fn build_symbol(prefix: &[String], extras: &[&str]) -> String {
let mut parts: Vec<&str> = prefix.iter().map(String::as_str).collect();
parts.extend_from_slice(extras);
parts.join("::")
}
/// Extract the symbol for a `function_definition` given its top-level
/// `declarator` node. Returns `None` if the name cannot be determined.
///
/// The declarator chain may be:
/// - `function_declarator` (plain fn or method)
/// - `pointer_declarator` wrapping `function_declarator` (fn returning pointer)
/// - `reference_declarator` wrapping `function_declarator` (fn returning ref)
/// - `operator_cast` (conversion operator — e.g. `operator bool`)
///
/// The inner `function_declarator.declarator` is one of:
/// - `identifier` → free fn or constructor, symbol = `prefix::name`
/// - `field_identifier` → method in class body, symbol = `prefix::name`
/// - `destructor_name` → `~Foo`, symbol = `prefix::~Foo`
/// - `operator_name` → `operator+` etc., symbol = `prefix::operator+`
/// - `qualified_identifier` → out-of-class def `Foo::bar` or `ns::Foo::bar`;
/// the scope chain is extracted and appended to prefix.
///
/// For `qualified_identifier`, the scope hierarchy (which may itself be a
/// `qualified_identifier`) is flattened into a list of segments. These
/// segments are APPENDED to the current prefix, not substituted for it, so a
/// definition written inside a namespace body keeps its enclosing scope.
/// Example: `void ns::Foo::bar() {}` at top level
/// with prefix=[] → segments=[ns, Foo, bar] → symbol = `ns::Foo::bar`.
fn extract_fn_symbol(
decl_node: tree_sitter::Node,
source: &str,
prefix: &[String],
) -> Option<String> {
// Walk down pointer/reference wrapper layers to reach the
// function_declarator (or operator_cast at definition level).
let fn_decl = unwrap_to_fn_declarator(decl_node, source)?;
match fn_decl.kind() {
"operator_cast" => {
// e.g. `operator bool() const` — the function_definition.declarator
// IS the operator_cast (no function_declarator wrapper).
// Symbol = `prefix::operator <type>`.
let type_node = fn_decl.child_by_field_name("type")?;
let type_text = &source[type_node.start_byte()..type_node.end_byte()];
Some(build_symbol(prefix, &[&format!("operator {type_text}")]))
}
"function_declarator" => {
let inner = fn_decl.child_by_field_name("declarator")?;
extract_name_node(inner, source, prefix)
}
_ => None,
}
}
/// Walk pointer_declarator / reference_declarator chains down to the
/// first `function_declarator` or `operator_cast` node.
///
/// Returns `None` if no such node is found (e.g. a function definition
/// whose declarator is malformed or unknown).
fn unwrap_to_fn_declarator<'a>(
mut node: tree_sitter::Node<'a>,
_source: &str,
) -> Option<tree_sitter::Node<'a>> {
loop {
match node.kind() {
"function_declarator" | "operator_cast" => return Some(node),
"pointer_declarator" => {
node = node.child_by_field_name("declarator")?;
}
"reference_declarator" | "rvalue_reference_declarator" => {
// reference_declarator has no `declarator` field; its child is a
// named node without a field name, so take the first named child.
let mut walker = node.walk();
node = node.named_children(&mut walker).next()?;
}
_ => return None,
}
}
}
/// Given the innermost name node of a function_declarator, produce the symbol.
fn extract_name_node(
inner: tree_sitter::Node,
source: &str,
prefix: &[String],
) -> Option<String> {
match inner.kind() {
"identifier" | "field_identifier" => {
let name = &source[inner.start_byte()..inner.end_byte()];
Some(build_symbol(prefix, &[name]))
}
"destructor_name" => {
// destructor_name text includes the `~` prefix (e.g. "~Foo").
let full = &source[inner.start_byte()..inner.end_byte()];
Some(build_symbol(prefix, &[full]))
}
"operator_name" => {
// Full text e.g. "operator+", "operator->", "operator()".
let full = &source[inner.start_byte()..inner.end_byte()];
Some(build_symbol(prefix, &[full]))
}
"template_function" | "template_method" => {
// Template function like `foo<int>()`. Use the `name` field
// (the identifier / field_identifier before `<`).
let name_node = inner.child_by_field_name("name")?;
let name = &source[name_node.start_byte()..name_node.end_byte()];
Some(build_symbol(prefix, &[name]))
}
"qualified_identifier" => {
// Out-of-class method definition. Flatten the nested
// qualified_identifier chain into ordered segments.
// Example: `ns::Foo::method`
// qualified_identifier {
// scope: namespace_identifier "ns"
// name: qualified_identifier {
// scope: namespace_identifier "Foo"
// name: identifier "method"
// }
// }
// → ["ns", "Foo", "method"]
//
// These segments are combined with the current prefix so that an
// out-of-class def `void Foo::bar() {}` written inside a
// namespace body with prefix=["ns"] produces `ns::Foo::bar`.
let mut segments: Vec<String> = Vec::new();
flatten_qualified_id(inner, source, &mut segments);
if segments.is_empty() {
return None;
}
// Build: prefix + all segments (scope chain + leaf).
let mut all: Vec<&str> = prefix.iter().map(String::as_str).collect();
for seg in &segments {
all.push(seg.as_str());
}
Some(all.join("::"))
}
_ => None,
}
}
/// Recursively flatten a `qualified_identifier` node into ordered string
/// segments. For `ns::Foo::method` this produces `["ns", "Foo", "method"]`.
fn flatten_qualified_id(node: tree_sitter::Node, source: &str, out: &mut Vec<String>) {
// A qualified_identifier has:
// scope: namespace_identifier | (None for global-scope `::foo`)
// name: identifier | field_identifier | destructor_name |
// operator_name | qualified_identifier | template_function |
// template_method | ...
let scope_node = node.child_by_field_name("scope");
let name_node = node.child_by_field_name("name");
if let Some(s) = scope_node {
out.push(source[s.start_byte()..s.end_byte()].to_string());
}
match name_node {
Some(n) if n.kind() == "qualified_identifier" => {
// Recurse: more nesting.
flatten_qualified_id(n, source, out);
}
Some(n) => {
// Leaf name — push its text.
out.push(source[n.start_byte()..n.end_byte()].to_string());
}
None => {}
}
}
// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------
#[cfg(test)]
pub(crate) mod tests_support {
use kebab_core::*;
use std::path::PathBuf;
use time::OffsetDateTime;
pub fn fixed_code_asset(workspace_path: &str, lang: &str) -> RawAsset {
RawAsset {
asset_id: AssetId("a".repeat(64)),
source_uri: SourceUri::File(PathBuf::from(workspace_path)),
workspace_path: WorkspacePath(workspace_path.to_string()),
media_type: MediaType::Code(lang.to_string()),
byte_len: 0,
checksum: Checksum("b".repeat(64)),
discovered_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
stored: AssetStorage::Reference {
path: PathBuf::from(workspace_path),
sha: Checksum("b".repeat(64)),
},
}
}
pub fn extract_cpp(src: &str, path: &str) -> kebab_core::CanonicalDocument {
use super::CppAstExtractor;
use kebab_core::Extractor;
let asset = fixed_code_asset(path, "cpp");
let cfg = ExtractConfig::default();
let root = PathBuf::from("/tmp");
let ctx = ExtractContext {
asset: &asset,
workspace_root: &root,
config: &cfg,
};
CppAstExtractor::new().extract(&ctx, src.as_bytes()).unwrap()
}
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{Block, MediaType, SourceSpan};
fn syms(doc: &kebab_core::CanonicalDocument) -> Vec<String> {
let mut s: Vec<String> = doc
.blocks
.iter()
.filter_map(|b| match b {
Block::Code(c) => match &c.common.source_span {
SourceSpan::Code { symbol, .. } => symbol.clone(),
_ => None,
},
_ => None,
})
.collect();
s.sort();
s
}
#[test]
fn extractor_supports_only_media_code_cpp() {
let e = CppAstExtractor::new();
assert!(e.supports(&MediaType::Code("cpp".into())));
assert!(!e.supports(&MediaType::Code("c".into())));
assert!(!e.supports(&MediaType::Code("rust".into())));
assert!(!e.supports(&MediaType::Markdown));
}
#[test]
fn free_function() {
let src = "void foo() {}\n";
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "foo"), "got {s:?}");
}
#[test]
fn namespace_and_class() {
let src = r#"
namespace ns {
class Foo {
public:
void method() {}
Foo() {}
~Foo() {}
int operator+(const Foo& o) { return 0; }
};
}
"#;
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "ns::Foo"), "ns::Foo missing: {s:?}");
assert!(s.iter().any(|x| x == "ns::Foo::method"), "method missing: {s:?}");
assert!(s.iter().any(|x| x == "ns::Foo::Foo"), "ctor missing: {s:?}");
assert!(s.iter().any(|x| x == "ns::Foo::~Foo"), "dtor missing: {s:?}");
assert!(s.iter().any(|x| x == "ns::Foo::operator+"), "op+ missing: {s:?}");
}
#[test]
fn anonymous_namespace() {
let src = r#"
namespace {
void hidden_fn() {}
}
"#;
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
let s = syms(&doc);
assert!(
s.iter().any(|x| x == "<anonymous>::hidden_fn"),
"anon fn missing: {s:?}"
);
}
#[test]
fn nested_namespace_specifier() {
let src = r#"
namespace outer::inner {
void fn_in_nested() {}
}
"#;
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
let s = syms(&doc);
assert!(
s.iter().any(|x| x == "outer::inner::fn_in_nested"),
"nested ns fn missing: {s:?}"
);
}
#[test]
fn out_of_class_method_def() {
let src = r#"
void ns::Foo::method() { }
"#;
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
let s = syms(&doc);
assert!(
s.iter().any(|x| x == "ns::Foo::method"),
"out-of-class method missing: {s:?}"
);
}
#[test]
fn template_declaration() {
let src = r#"
template<typename T>
class Bar {
void tmpl_method() {}
};
template<typename T>
void tmpl_free_fn(T x) {}
"#;
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "Bar"), "Bar class missing: {s:?}");
assert!(
s.iter().any(|x| x == "Bar::tmpl_method"),
"Bar::tmpl_method missing: {s:?}"
);
assert!(
s.iter().any(|x| x == "tmpl_free_fn"),
"tmpl_free_fn missing: {s:?}"
);
}
#[test]
fn enum_and_concept() {
let src = r#"
enum class Color { Red, Green };
template<typename T>
concept Printable = requires(T t) { t.print(); };
"#;
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "Color"), "Color missing: {s:?}");
assert!(s.iter().any(|x| x == "Printable"), "Printable missing: {s:?}");
}
#[test]
fn extern_c_block() {
let src = r#"
extern "C" {
void c_fn1() {}
void c_fn2() {}
}
"#;
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "c_fn1"), "c_fn1 missing: {s:?}");
assert!(s.iter().any(|x| x == "c_fn2"), "c_fn2 missing: {s:?}");
}
#[test]
fn conversion_operator() {
let src = r#"
class Foo {
operator bool() const { return true; }
};
"#;
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
let s = syms(&doc);
assert!(
s.iter().any(|x| x == "Foo::operator bool"),
"conversion op missing: {s:?}"
);
}
#[test]
fn empty_file_produces_module() {
let src = "";
let doc = tests_support::extract_cpp(src, "x/empty.cpp");
let s = syms(&doc);
assert_eq!(s, vec!["<module>"], "expected <module>: got {s:?}");
}
#[test]
fn glue_only_produces_module() {
let src = "#include <vector>\nusing namespace std;\n";
let doc = tests_support::extract_cpp(src, "x/glue.cpp");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "<module>"), "expected <module>: got {s:?}");
}
#[test]
fn ptr_returning_function() {
let src = "int* ptr_fn(int x) { return &x; }\n";
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
let s = syms(&doc);
assert!(s.iter().any(|x| x == "ptr_fn"), "ptr_fn missing: {s:?}");
}
#[test]
fn ref_returning_operator() {
let src = r#"
class Foo {
Foo& operator=(const Foo& o) { return *this; }
};
"#;
let doc = tests_support::extract_cpp(src, "x/foo.cpp");
let s = syms(&doc);
assert!(
s.iter().any(|x| x == "Foo::operator="),
"operator= missing: {s:?}"
);
}
#[test]
fn deterministic_across_runs() {
let src = r#"
namespace ns {
class Foo {
void method() {}
};
}
void free_fn() {}
"#;
let a = tests_support::extract_cpp(src, "x/foo.cpp");
for _ in 0..20 {
assert_eq!(tests_support::extract_cpp(src, "x/foo.cpp").blocks, a.blocks);
}
}
}
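
The prefix-plus-segments rule used by `extract_name_node` and `flatten_qualified_id` can be illustrated without tree-sitter. The sketch below is a simplified model, not the extractor itself: `Name`, `flatten`, and `symbol_for` are hypothetical names introduced only for this example, and the enum stands in for the `scope` / `name` fields of a `qualified_identifier` node.

```rust
/// Toy stand-in for a tree-sitter name node (hypothetical type, used only
/// in this sketch).
enum Name {
    /// Leaf name such as `method`, `~Foo`, or `operator+`.
    Leaf(String),
    /// `scope::name`; `scope` is `None` for a global-scope `::foo`.
    Qualified { scope: Option<String>, name: Box<Name> },
}

/// Flatten nested scopes into ordered segments, outermost first.
fn flatten(name: &Name, out: &mut Vec<String>) {
    match name {
        Name::Leaf(text) => out.push(text.clone()),
        Name::Qualified { scope, name } => {
            if let Some(s) = scope {
                out.push(s.clone());
            }
            // The `name` side may itself be qualified, so recurse.
            flatten(name, out);
        }
    }
}

/// Join the enclosing prefix and the flattened segments with `::`.
fn symbol_for(prefix: &[&str], name: &Name) -> String {
    let mut parts: Vec<String> = prefix.iter().map(|p| p.to_string()).collect();
    flatten(name, &mut parts);
    parts.join("::")
}
```

Under this model, `ns::Foo::method` with an empty prefix and `Foo::bar` with prefix `["ns"]` both resolve to a three-segment symbol, which is the behavior the `out_of_class_method_def` test pins for the first case.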


@@ -13,6 +13,8 @@
//! `kebab-parse-*` crates per design §8: must NOT depend on store / embed
//! / llm / rag.
pub mod c;
pub mod cpp;
pub mod go;
pub mod java;
pub mod javascript;
@@ -25,6 +27,8 @@ pub(crate) mod scaffold;
pub mod skip;
pub mod typescript;
pub use c::{PARSER_VERSION as C_PARSER_VERSION, CAstExtractor};
pub use cpp::{PARSER_VERSION as CPP_PARSER_VERSION, CppAstExtractor};
pub use go::{PARSER_VERSION as GO_PARSER_VERSION, GoAstExtractor};
pub use java::{PARSER_VERSION as JAVA_PARSER_VERSION, JavaAstExtractor};
pub use javascript::{PARSER_VERSION as JS_PARSER_VERSION, JavascriptAstExtractor};


@@ -162,18 +162,35 @@ impl Retriever for LexicalRetriever {
/// Translate a user-typed query into an FTS5 match string.
///
/// Rules (from the task spec):
/// v0.17.0 — trigram-aware redesign (see design §5.5 + plan
/// `docs/superpowers/plans/2026-05-22-korean-trigram-tokenizer.md`
/// Task A5). The FTS5 tokenizer is `trigram` so any term shorter than
/// three Unicode chars has no index entry and would zero out an AND
/// branch. Korean compounds typically split into 2-char eojeols (e.g.
/// `해시 충돌`), so a naive token AND drops the dominant usage pattern.
///
/// - The query is wrapped in a single pair of `'...'` → strip the quotes
/// and pass the inner text through verbatim. The user has explicitly
/// opted into FTS5 syntax (e.g. `'rust AND cargo'`, `'foo*'`).
/// Rules:
///
/// - Otherwise: split on whitespace, escape every token by wrapping it
/// in `"..."` (FTS5 string literal), with any inner `"` doubled. Join
/// with spaces — FTS5 default operator is implicit AND.
/// - Raw mode (unchanged): the query is wrapped in a single pair of
/// `'...'` → strip the quotes and pass the inner text through verbatim.
/// The user has explicitly opted into FTS5 syntax (e.g.
/// `'rust AND cargo'`, `'foo*'`).
///
/// - An empty / whitespace-only token list → return `None` (caller
/// short-circuits to `Ok(vec![])`).
/// - Otherwise build up to two MATCH candidates:
/// 1. **whole-phrase**: the entire trimmed input wrapped as one FTS5
/// string literal, *only* if it has ≥3 Unicode chars. FTS5 treats
/// a quoted string with spaces as a phrase match.
/// 2. **token AND**: whitespace-split tokens, kept only when each has
/// ≥3 Unicode chars (shorter ones are dropped — they would zero
/// out the AND under trigram).
///
/// - Combine: `(whole) OR (token_and)` when both exist *and differ*;
/// either alone when only one exists; `None` when neither exists
/// (caller short-circuits to `Ok(vec![])`, avoiding an FTS5 syntax
/// error from an empty MATCH).
///
/// - A single-token long query (`러스트`, `foo`) yields `whole == token_and`
/// → return the bare quoted form so the OR doesn't duplicate.
fn build_match_string(text: &str) -> Option<String> {
let trimmed = text.trim();
if trimmed.is_empty() {
@@ -186,14 +203,27 @@ fn build_match_string(text: &str) -> Option<String> {
}
return Some(inner_trim.to_string());
}
let tokens: Vec<String> = trimmed
.split_whitespace()
.map(escape_fts5_token)
.collect();
if tokens.is_empty() {
None
} else {
Some(tokens.join(" "))
const MIN_TRIGRAM_CHARS: usize = 3;
let whole_candidate: Option<String> = (trimmed.chars().count() >= MIN_TRIGRAM_CHARS)
.then(|| escape_fts5_token(trimmed));
let token_and_candidate: Option<String> = {
let toks: Vec<String> = trimmed
.split_whitespace()
.filter(|t| t.chars().count() >= MIN_TRIGRAM_CHARS)
.map(escape_fts5_token)
.collect();
(!toks.is_empty()).then(|| toks.join(" "))
};
match (whole_candidate, token_and_candidate) {
(None, None) => None,
(Some(w), None) => Some(w),
(None, Some(a)) => Some(a),
(Some(w), Some(a)) if w == a => Some(w),
(Some(w), Some(a)) => Some(format!("({w}) OR ({a})")),
}
}
@@ -555,30 +585,31 @@ mod tests {
}
#[test]
fn build_match_string_default_is_quoted_and_anded() {
fn build_match_string_default_emits_or_of_phrase_and_and() {
// Two long tokens: both whole-phrase and token-AND candidates
// exist and differ, so the builder combines them with OR.
let s = build_match_string("rust cargo").unwrap();
// Two tokens, each quoted, joined by a space (implicit AND).
assert_eq!(s, r#""rust" "cargo""#);
assert_eq!(s, r#"("rust cargo") OR ("rust" "cargo")"#);
}
#[test]
fn build_match_string_escapes_special_chars() {
// `*`, `(`, `)`, `:`, `^`, `"` should all be wrapped inside
// FTS5 string-literal quotes so they're treated as literal
// text rather than FTS5 operators.
// text rather than FTS5 operators. Every token is ≥3 chars,
// so both the whole-phrase and token-AND candidates exist.
let s = build_match_string(r#"foo* (bar) baz:qux ^head he"llo"#).unwrap();
assert_eq!(
s,
r#""foo*" "(bar)" "baz:qux" "^head" "he""llo""#
r#"("foo* (bar) baz:qux ^head he""llo") OR ("foo*" "(bar)" "baz:qux" "^head" "he""llo")"#
);
// The doubled `""` is FTS5's way of embedding a literal quote
// inside a string literal.
// inside a string literal. Appears in both whole-phrase and
// token-AND halves.
assert!(s.contains(r#"he""llo"#));
// Sanity: every special character lives between matching `"`
// delimiters — there is no bare-token (unquoted) span anywhere.
// We check this by confirming the string starts and ends with `"`
// and the count of unescaped `"` is even (each token is wrapped).
assert!(s.starts_with('"') && s.ends_with('"'));
// Sanity: the combined expression is `(...) OR (...)` so it
// starts with `(` and ends with `)`.
assert!(s.starts_with('(') && s.ends_with(')'));
}
#[test]
@@ -588,6 +619,55 @@ mod tests {
assert_eq!(s, "foo OR bar*");
}
// ── v0.17.0 trigram-aware redesign coverage ──────────────────────────
/// 2-char Korean query (`충돌`) yields neither a whole-phrase nor a
/// token-AND candidate → `None`. Caller short-circuits to an empty
/// hit list rather than hitting an FTS5 syntax error from an empty MATCH.
#[test]
fn build_match_string_short_korean_returns_none() {
assert!(build_match_string("충돌").is_none());
assert!(build_match_string("").is_none());
assert!(build_match_string(" 충돌 ").is_none());
}
/// `해시 충돌` — both tokens are 2 chars (dropped from the AND), but
/// the whole-phrase candidate (`"해시 충돌"`, 5 chars total) survives.
/// This is the dominant Korean usage pattern targeted by A5.
#[test]
fn build_match_string_whole_phrase_only_when_all_tokens_short() {
let s = build_match_string("해시 충돌").unwrap();
assert_eq!(s, r#""해시 충돌""#);
}
/// Single long token: whole-phrase and token-AND candidates collapse
/// to the same string. The builder returns the bare quoted form so
/// the MATCH expression doesn't carry a redundant `(x) OR (x)`.
#[test]
fn build_match_string_single_long_token_no_duplicate_or() {
assert_eq!(build_match_string("러스트").unwrap(), r#""러스트""#);
assert_eq!(build_match_string("rust").unwrap(), r#""rust""#);
}
/// Mixed Korean+English multi-token query where every token is ≥3
/// chars: both candidates exist and differ, OR-combined.
#[test]
fn build_match_string_mixed_lang_emits_or_of_phrase_and_and() {
let s = build_match_string("Rust 충돌은").unwrap();
assert_eq!(s, r#"("Rust 충돌은") OR ("Rust" "충돌은")"#);
}
/// One ≥3 token + one <3 token: short token is dropped from the
/// AND, leaving a single long token there; whole-phrase exists
/// independently. Both candidates differ → OR-combined.
#[test]
fn build_match_string_drops_short_token_in_and_keeps_whole() {
// "키" (1 char) dropped from AND; "해시테이블" (5 chars) kept.
// Whole phrase "키 해시테이블" (7 chars) keeps the short token.
let s = build_match_string("키 해시테이블").unwrap();
assert_eq!(s, r#"("키 해시테이블") OR ("해시테이블")"#);
}
#[test]
fn normalize_bm25_top_score_in_unit_interval() {
// A "perfect" hit is bm25 = -1.0 → normalized 0.5.
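
The candidate rules documented on `build_match_string` can be tried in isolation with the minimal sketch below. It is a simplified model under stated assumptions: raw `'...'` mode is left out, and `quote_fts5`, `long_enough`, and `build_match` are hypothetical names; `quote_fts5` assumes the real `escape_fts5_token` wraps its input in `"..."` and doubles inner quotes, as the tests in this diff suggest.

```rust
const MIN_TRIGRAM_CHARS: usize = 3;

/// Wrap text as an FTS5 string literal, doubling any inner `"`.
/// (Assumed behavior of the real `escape_fts5_token`.)
fn quote_fts5(text: &str) -> String {
    format!("\"{}\"", text.replace('"', "\"\""))
}

/// Count Unicode scalar values, not bytes: "충돌" is 2 chars but 6 bytes,
/// so a byte-length check would wrongly accept it.
fn long_enough(s: &str) -> bool {
    s.chars().count() >= MIN_TRIGRAM_CHARS
}

/// Simplified builder: whole-phrase candidate OR token-AND candidate.
fn build_match(text: &str) -> Option<String> {
    let trimmed = text.trim();
    let whole = long_enough(trimmed).then(|| quote_fts5(trimmed));
    let tokens: Vec<String> = trimmed
        .split_whitespace()
        .filter(|t| long_enough(t))
        .map(quote_fts5)
        .collect();
    // Space-joined string literals are an implicit AND in FTS5.
    let token_and = (!tokens.is_empty()).then(|| tokens.join(" "));

    match (whole, token_and) {
        (None, None) => None,
        (Some(w), None) => Some(w),
        (None, Some(a)) => Some(a),
        // Single long token: both candidates are identical, keep one.
        (Some(w), Some(a)) if w == a => Some(w),
        (Some(w), Some(a)) => Some(format!("({w}) OR ({a})")),
    }
}
```

With this model, `build_match("해시 충돌")` keeps only the whole-phrase candidate because both tokens are two chars long, while `build_match("충돌")` returns `None`.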


@@ -19,9 +19,9 @@
"indexed_at": "2024-01-01T00:00:00Z",
"rank": 1,
"retrieval": {
"fusion_score": 1.4490997273242101e-6,
"fusion_score": 1.4615362715630908e-6,
"lexical_rank": 1,
"lexical_score": 1.4490997273242101e-6,
"lexical_score": 1.4615362715630908e-6,
"method": "lexical",
"vector_rank": null,
"vector_score": null
@@ -51,9 +51,9 @@
"indexed_at": "2024-01-01T00:00:00Z",
"rank": 2,
"retrieval": {
"fusion_score": 9.641424867368187e-7,
"fusion_score": 9.207039965986041e-7,
"lexical_rank": 2,
"lexical_score": 9.641424867368187e-7,
"lexical_score": 9.207039965986041e-7,
"method": "lexical",
"vector_rank": null,
"vector_score": null


@@ -464,6 +464,74 @@ impl SqliteStore {
}
Ok(out)
}
/// v0.17.0 PR-B: sister of [`Self::stale_chunk_ids_at`] for the
/// `parser_version` bump cascade. When `doc_id` depends on
/// `parser_version` (design §9) and an extractor ships a new
/// `PARSER_VERSION`, the next ingest computes a fresh `doc_id` for
/// the *same* `(workspace_path, asset_id)` pair. The existing
/// asset_id-keyed [`Self::stale_chunk_ids_at`] does NOT fire (same
/// asset), so the legacy `chunks` rows and their LanceDB shadows
/// would orphan. This helper queries by `workspace_path` instead,
/// excluding the freshly-computed `keep_doc_id` so a re-entry
/// during the same ingest doesn't re-sweep the new row.
///
/// Caller usage: pass the *new* `doc_id` if known; pass an empty
/// string when called before the new INSERT (the case in
/// `try_skip_unchanged`) — all existing docs at `workspace_path`
/// are then collected as stale.
pub fn stale_chunk_ids_for_workspace_path_except_doc_id(
&self,
workspace_path: &str,
keep_doc_id: &str,
) -> Result<Vec<kebab_core::ChunkId>> {
let conn = self.lock_conn();
let mut stmt = conn
.prepare(
"SELECT c.chunk_id
FROM chunks c
INNER JOIN documents d ON c.doc_id = d.doc_id
WHERE d.workspace_path = ?1 AND d.doc_id != ?2",
)
.map_err(StoreError::from)?;
let rows = stmt
.query_map(params![workspace_path, keep_doc_id], |row| {
row.get::<_, String>(0)
})
.map_err(StoreError::from)?;
let mut out: Vec<kebab_core::ChunkId> = Vec::new();
for row in rows {
let id = row.map_err(StoreError::from)?;
out.push(kebab_core::ChunkId(id));
}
Ok(out)
}
/// v0.17.0 PR-B: sweep the SQLite document chain (`documents` →
/// `blocks` / `chunks` / `embedding_records` via CASCADE) for every
/// row at `workspace_path` whose `doc_id` differs from `keep_doc_id`.
/// Pair with [`Self::stale_chunk_ids_for_workspace_path_except_doc_id`]
/// — caller fetches the chunk_ids first, hands them to
/// `VectorStore::delete_by_chunk_ids`, then calls this sweep.
/// `assets` row is preserved (same bytes, same asset_id — only the
/// derived `doc_id` changed).
///
/// `keep_doc_id = ""` deletes every doc at `workspace_path`
/// (semantics mirror the sister helper above — used by
/// `try_skip_unchanged` before the new INSERT exists).
pub fn purge_document_at_workspace_path_except_doc_id(
&self,
workspace_path: &str,
keep_doc_id: &str,
) -> Result<()> {
let conn = self.lock_conn();
conn.execute(
"DELETE FROM documents WHERE workspace_path = ?1 AND doc_id != ?2",
params![workspace_path, keep_doc_id],
)
.map_err(StoreError::from)?;
Ok(())
}
}
/// Sweep stale `assets` + `documents` + downstream rows when the file
@@ -824,6 +892,45 @@ impl SqliteStore {
Ok(out)
}
/// v0.17.0 PR-C: per-code-language **chunk** count for
/// `schema.v1.stats`. Companion to [`Self::code_lang_breakdown`] —
/// that one returns *document* counts. Stats observers wanting
/// indexing-pressure granularity (a single PDF spec → 200 chunks,
/// vs a single Rust file → 5 chunks) need the chunk-level view.
///
/// SQL joins `chunks → documents`, reads
/// `metadata_json->'$.code_lang'` on the doc side, groups by the
/// language, and skips rows where `code_lang IS NULL`. Returns
/// `BTreeMap<String, u32>` mirroring the doc-count helper above
/// so callers can serialize both with the same shape.
pub fn code_lang_chunk_breakdown(
&self,
) -> anyhow::Result<std::collections::BTreeMap<String, u32>> {
use anyhow::Context;
let conn = self.read_conn();
let mut stmt = conn
.prepare(
"SELECT json_extract(d.metadata_json, '$.code_lang') AS cl, \
COUNT(c.chunk_id) \
FROM chunks c \
INNER JOIN documents d ON c.doc_id = d.doc_id \
WHERE cl IS NOT NULL \
GROUP BY cl",
)
.context("prepare code_lang_chunk_breakdown")?;
let rows = stmt
.query_map([], |r| {
Ok((r.get::<_, String>(0)?, r.get::<_, i64>(1)? as u32))
})
.context("query code_lang_chunk_breakdown")?;
let mut out = std::collections::BTreeMap::new();
for row in rows {
let (k, v) = row.context("read code_lang_chunk_breakdown row")?;
out.insert(k, v);
}
Ok(out)
}
/// p10-1A-2 follow-up (dogfooding 2026-05-20): per-repo doc count for
/// `schema.v1`.
///
@@ -973,6 +1080,108 @@ mod tests {
assert_eq!(bd.len(), 1, "expected exactly 1 entry, got: {bd:?}");
}
/// v0.17.0 PR-C: `code_lang_chunk_breakdown` counts *chunks* (not
/// docs) grouped by `documents.metadata_json.code_lang`. Differs
/// from `code_lang_breakdown` (doc count) by joining `chunks` and
/// summing chunk rows so one Rust file with 3 chunks reports
/// `rust=3` here vs `rust=1` in the doc-count helper.
///
/// Uses a side rusqlite connection (FK enforcement off) so a single
/// doc + multiple chunks fixture can be inserted without standing
/// up `assets` companions.
#[test]
fn code_lang_chunk_breakdown_counts_chunks_not_docs() {
let (dir, store) = open_fresh_store();
let db_path = dir.path().join("kebab.sqlite");
let conn = rusqlite::Connection::open(&db_path).unwrap();
conn.pragma_update(None, "foreign_keys", "OFF").unwrap();
// 1 Rust doc + 3 chunks → chunk_breakdown rust=3 / doc_breakdown rust=1.
conn.execute(
"INSERT INTO documents (
doc_id, asset_id, workspace_path,
source_type, trust_level, parser_version,
doc_version, schema_version,
metadata_json, provenance_json,
created_at, updated_at
) VALUES (
'doc-rust-1', 'asset-1', 'src/main.rs',
'reference', 'primary', 'test-v1',
1, 1,
'{\"code_lang\":\"rust\"}', '{}',
'2024-01-01T00:00:00Z', '2024-01-01T00:00:00Z'
)",
[],
)
.unwrap();
for i in 0..3u32 {
conn.execute(
"INSERT INTO chunks (
chunk_id, doc_id, text, heading_path_json, section_label,
source_spans_json, token_estimate, chunker_version,
policy_hash, block_ids_json, created_at
) VALUES (?, 'doc-rust-1', ?, '[]', NULL, '[]', 0, 'cv1', 'h', '[]', '2024-01-01T00:00:00Z')",
rusqlite::params![format!("rust-chunk-{i:0>26}"), format!("body {i}")],
)
.unwrap();
}
// 1 markdown doc + 1 chunk → code_lang = null → must be skipped.
conn.execute(
"INSERT INTO documents (
doc_id, asset_id, workspace_path,
source_type, trust_level, parser_version,
doc_version, schema_version,
metadata_json, provenance_json,
created_at, updated_at
) VALUES (
'doc-md-1', 'asset-2', 'notes/readme.md',
'markdown', 'primary', 'test-v1',
1, 1,
'{\"code_lang\":null}', '{}',
'2024-01-01T00:00:00Z', '2024-01-01T00:00:00Z'
)",
[],
)
.unwrap();
conn.execute(
"INSERT INTO chunks (
chunk_id, doc_id, text, heading_path_json, section_label,
source_spans_json, token_estimate, chunker_version,
policy_hash, block_ids_json, created_at
) VALUES ('md-chunk-00000000000000000000000', 'doc-md-1', 'm', '[]', NULL, '[]', 0, 'cv1', 'h', '[]', '2024-01-01T00:00:00Z')",
[],
)
.unwrap();
drop(conn);
let chunk_bd = store.code_lang_chunk_breakdown().unwrap();
assert_eq!(
chunk_bd.get("rust"),
Some(&3u32),
"expected rust=3 chunks (1 doc × 3 chunks): {chunk_bd:?}"
);
assert!(
!chunk_bd.contains_key("null"),
"null code_lang must be skipped: {chunk_bd:?}"
);
assert_eq!(
chunk_bd.len(),
1,
"expected exactly 1 language entry: {chunk_bd:?}"
);
// Sanity: the existing doc-count helper still returns 1 for rust,
// proving the two metrics differ as intended.
let doc_bd = store.code_lang_breakdown().unwrap();
assert_eq!(
doc_bd.get("rust"),
Some(&1u32),
"doc-count helper unchanged: {doc_bd:?}"
);
}
/// p10-1A-2 follow-up: `repo_breakdown` counts docs by
/// `metadata_json.repo`.
///
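
The call order described for `stale_chunk_ids_for_workspace_path_except_doc_id` and `purge_document_at_workspace_path_except_doc_id` (collect chunk ids, delete the vector shadows, then purge the documents) can be sketched with in-memory stand-ins. Everything below is hypothetical scaffolding for illustration: `Doc`, `MemStore`, and `sweep` are not part of the crate, and a `BTreeSet` plays the role of the vector store.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical document row: where it lives and which chunks it owns.
struct Doc {
    workspace_path: String,
    chunk_ids: Vec<String>,
}

/// Hypothetical in-memory stand-in for the SQLite store, keyed by doc_id.
#[derive(Default)]
struct MemStore {
    docs: BTreeMap<String, Doc>,
}

impl MemStore {
    /// Chunk ids of every doc at `workspace_path` except `keep_doc_id`.
    fn stale_chunk_ids(&self, workspace_path: &str, keep_doc_id: &str) -> Vec<String> {
        self.docs
            .iter()
            .filter(|(doc_id, doc)| {
                doc.workspace_path == workspace_path && doc_id.as_str() != keep_doc_id
            })
            .flat_map(|(_, doc)| doc.chunk_ids.iter().cloned())
            .collect()
    }

    /// Delete every doc at `workspace_path` except `keep_doc_id`.
    fn purge_docs(&mut self, workspace_path: &str, keep_doc_id: &str) {
        self.docs
            .retain(|doc_id, doc| doc.workspace_path != workspace_path || doc_id == keep_doc_id);
    }
}

/// Sweep in the documented order and return the chunk ids that were removed.
fn sweep(
    store: &mut MemStore,
    vectors: &mut BTreeSet<String>,
    workspace_path: &str,
    keep_doc_id: &str,
) -> Vec<String> {
    // 1. Collect stale chunk ids while the rows still exist.
    let stale = store.stale_chunk_ids(workspace_path, keep_doc_id);
    // 2. Drop the vector-store shadows of those chunks.
    for id in &stale {
        vectors.remove(id);
    }
    // 3. Only then purge the documents (chunks go with them).
    store.purge_docs(workspace_path, keep_doc_id);
    stale
}
```

The ordering matters in this model for the same reason the doc comment gives: once the document rows are gone, the chunk ids needed to clean the vector store can no longer be queried. Passing an empty `keep_doc_id` sweeps every doc at the path.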


@@ -370,17 +370,19 @@ fn extract_design_5_5_fts_block() -> String {
fts_slice[..last_end + "END;".len()].to_string()
}
/// Extract the §5.5 verbatim block from the V002 migration, between the
/// `── §5.5 verbatim block ──` anchor markers the file already carries.
/// Extract the §5.5 verbatim block from the V007 migration, between the
/// `── §5.5 verbatim block ──` anchor markers V007 carries. V007 replaced
/// V002's unicode61 tokenizer with trigram; V002 stays in place for
/// historical cold-upgrade replay, but V007 is now the source of truth.
fn extract_migration_5_5_verbatim_block() -> String {
let migration = include_str!("../../../migrations/V002__fts.sql");
let migration = include_str!("../../../migrations/V007__fts_trigram.sql");
// The opening anchor line ends with `── §5.5 verbatim block ─...`.
let open_marker = "§5.5 verbatim block";
let close_marker = "End §5.5 verbatim block";
let open_idx = migration
.find(open_marker)
.expect("V002 must carry the `§5.5 verbatim block` opening anchor");
.expect("V007 must carry the `§5.5 verbatim block` opening anchor");
let after_open_line = open_idx
+ migration[open_idx..]
.find('\n')
@@ -389,7 +391,7 @@ fn extract_migration_5_5_verbatim_block() -> String {
let close_idx = migration[after_open_line..]
.find(close_marker)
.expect("V002 must carry the `End §5.5 verbatim block` closing anchor")
.expect("V007 must carry the `End §5.5 verbatim block` closing anchor")
+ after_open_line;
// Walk back from the close marker to the start of its comment line.
let close_line_start = migration[..close_idx]
@@ -400,12 +402,14 @@ fn extract_migration_5_5_verbatim_block() -> String {
migration[after_open_line..close_line_start].to_string()
}
/// CI diff guard: the §5.5 block in `migrations/V002__fts.sql` must
/// match the design doc verbatim (whitespace-normalized). If the
/// design doc moves the section, renames the heading, or edits the
/// SQL, this test fails first. Same for migration drift.
/// CI diff guard: the §5.5 block in `migrations/V007__fts_trigram.sql`
/// must match the design doc verbatim (whitespace-normalized). V007
/// replaced V002's unicode61 tokenizer with trigram (2026-05-23).
/// V002 stays in place for historical replay of cold-upgrade paths
/// but is no longer compared against the design doc — V007 is now
/// the source of truth.
#[test]
fn fts_v002_matches_design_section_5_5_verbatim() {
fn fts_v007_matches_design_section_5_5_verbatim() {
let design = extract_design_5_5_fts_block();
let migration_block = extract_migration_5_5_verbatim_block();
@@ -428,7 +432,7 @@ fn fts_v002_matches_design_section_5_5_verbatim() {
let migration_n = normalize_ws(&migration_block);
assert_eq!(
design_n, migration_n,
"V002__fts.sql §5.5 block must match design doc §5.5 verbatim \
"V007__fts_trigram.sql §5.5 block must match design doc §5.5 verbatim \
(whitespace-normalized). If you intentionally changed one, \
update the other in the same commit."
);
@@ -477,3 +481,115 @@ fn fts_store_drop_releases_wal_files() {
.expect("main DB file should be removable after store drop");
}
}
// ── 7. Trigram tokenizer behavior (V007) — Korean + English ──────────
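/// Checks that V007's trigram tokenizer matches contiguous Korean
/// substrings of 3 or more chars. Pins the behavior verified in Codex
/// rounds 1/2 against sqlite 3.45.1:
/// - a raw query that is a whitespace-free substring of 3+ chars hits.
/// - a raw query containing whitespace is split by FTS5 at token
/// boundaries → 0-hit if both tokens are shorter than 3 chars.
/// - a quoted phrase (whitespace inside "...") is matched as one substring.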
#[test]
fn fts_trigram_korean_3char_substring_hits() {
let env = common::TestEnv::new();
let store = SqliteStore::open(&env.config()).unwrap();
store.run_migrations().unwrap();
let conn = raw_conn_no_fk(&env);
insert_chunk(
&conn,
&"k".repeat(32),
&"d".repeat(32),
"[]",
"해시 충돌은 키와 값을 매핑할 때 발생한다",
);
// Raw contiguous substring of 3+ chars with no whitespace → hit.
assert_eq!(
count_match(&conn, "충돌은"),
1,
"raw 3-char whitespace-free substring '충돌은' must hit"
);
assert_eq!(
count_match(&conn, "발생한"),
1,
"raw 3-char whitespace-free substring '발생한' must hit"
);
// Quoted phrase (contains whitespace) → hit via substring matching.
assert_eq!(
count_match(&conn, "\"해시 충돌\""),
1,
"quoted whole phrase '해시 충돌' (5 chars including space)"
);
assert_eq!(
count_match(&conn, "\"시 충\""),
1,
"quoted phrase '시 충' across the space boundary"
);
// raw with no whitespace but substring not present in source → 0-hit.
assert_eq!(
count_match(&conn, "해시충"),
0,
"source text has no whitespace-free '해시충' trigram, so 0-hit"
);
}
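/// Core constraint of the V007 trigram tokenizer: a query shorter than 3
/// Unicode chars has no indexable unit, so it is always a 0-hit. Follows
/// design §3.4 plus the user decision (the lexical core returns a normal
/// 0-hit; the CLI/TUI wrapper prints the advisory message). Regression
/// guard: if the trigram structure changes or another tokenizer is
/// introduced, this test fails first.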
#[test]
fn fts_trigram_korean_short_query_zero_hit_pinned() {
let env = common::TestEnv::new();
let store = SqliteStore::open(&env.config()).unwrap();
store.run_migrations().unwrap();
let conn = raw_conn_no_fk(&env);
insert_chunk(
&conn,
&"k".repeat(32),
&"d".repeat(32),
"[]",
"해시 충돌은 키와 값을 매핑할 때 발생한다",
);
// 2-char Korean query: the key case reported from dogfooding ('충돌'/'값').
assert_eq!(count_match(&conn, "충돌"), 0, "2-char Korean query");
// 1-char Korean query.
assert_eq!(count_match(&conn, "값"), 0, "1-char Korean query");
}
/// The V007 trigram tokenizer also does substring matching for English:
/// recall goes up, word-boundary precision goes down. Explicitly pins the
/// behavior change described in design §3.4.
#[test]
fn fts_trigram_english_substring_hits() {
let env = common::TestEnv::new();
let store = SqliteStore::open(&env.config()).unwrap();
store.run_migrations().unwrap();
let conn = raw_conn_no_fk(&env);
insert_chunk(
&conn,
&"e".repeat(32),
&"d".repeat(32),
"[]",
"the tokenizer normalizes whitespace before matching",
);
// trigram substring — 'token' hits inside 'tokenizer'.
assert_eq!(
count_match(&conn, "token"),
1,
"substring of 'tokenizer' — trigram recall"
);
assert_eq!(
count_match(&conn, "izer"),
1,
"substring of 'tokenizer'"
);
// 3-char-minimum applies to English too.
assert_eq!(count_match(&conn, "to"), 0, "2-char English query");
}


@@ -153,6 +153,12 @@ pub struct SearchState {
/// `Ctrl-L`); the previous draft kept one for "symmetry" but
/// it was dead code.
pub worker_rx: Option<std::sync::mpsc::Receiver<SearchWorkerMessage>>,
/// v0.17.0 A5 Step 5: advisory text shown when the last completed
/// search returned no hits and the (trimmed) query is shorter than
/// the FTS5 trigram tokenizer's 3-char minimum. `None` whenever
/// the input changes (so a stale hint never overlaps a fresh
/// typing session) or the next search returns ≥1 hit.
pub short_query_hint: Option<String>,
}
/// p9-fb-08: payload posted by the search worker on completion.
@@ -179,6 +185,7 @@ impl Default for SearchState {
preview: None,
generation: 0,
worker_rx: None,
short_query_hint: None,
}
}
}


@@ -393,6 +393,20 @@ fn dynamic_status(app: &App) -> String {
if app.search.as_ref().map(|s| s.searching).unwrap_or(false) {
return "searching…".to_string();
}
// v0.17.0 A5 Step 5: short-query advisory has higher priority than
// the idle slot but lower than active operations (streaming /
// searching / ingest progress) — the user should always see what
// is happening *now* before reading guidance about the last
// empty result. Slot only fires while focused on Search.
if app.focus == Pane::Search {
if let Some(hint) = app
.search
.as_ref()
.and_then(|s| s.short_query_hint.as_deref())
{
return hint.to_string();
}
}
if let Some(state) = app.ingest_state.as_ref() {
return crate::ingest_progress::status_line(state);
}


@@ -333,7 +333,7 @@ pub fn handle_key_search(state: &mut App, key: KeyEvent) -> KeyOutcome {
s.mode = cycle_mode(s.mode);
// Force re-search at the new mode if there's a query.
if !s.input.as_str().trim().is_empty() {
s.input_dirty_at = Some(time::OffsetDateTime::now_utc());
mark_input_changed(s);
}
KeyOutcome::Continue
}
@@ -360,7 +360,7 @@ pub fn handle_key_search(state: &mut App, key: KeyEvent) -> KeyOutcome {
(KeyCode::Backspace, _) => {
if !s.input.is_empty() {
s.input.pop_char();
s.input_dirty_at = Some(time::OffsetDateTime::now_utc());
mark_input_changed(s);
}
KeyOutcome::Continue
}
@@ -388,7 +388,7 @@ pub fn handle_key_search(state: &mut App, key: KeyEvent) -> KeyOutcome {
}
(KeyCode::Delete, _) => {
if s.input.delete_after().is_some() {
s.input_dirty_at = Some(time::OffsetDateTime::now_utc());
mark_input_changed(s);
}
KeyOutcome::Continue
}
@@ -402,7 +402,7 @@ pub fn handle_key_search(state: &mut App, key: KeyEvent) -> KeyOutcome {
s.preview = None;
} else {
s.input.push_char('j');
s.input_dirty_at = Some(time::OffsetDateTime::now_utc());
mark_input_changed(s);
}
KeyOutcome::Continue
}
@@ -412,7 +412,7 @@ pub fn handle_key_search(state: &mut App, key: KeyEvent) -> KeyOutcome {
s.preview = None;
} else {
s.input.push_char('k');
s.input_dirty_at = Some(time::OffsetDateTime::now_utc());
mark_input_changed(s);
}
KeyOutcome::Continue
}
@@ -426,7 +426,7 @@ pub fn handle_key_search(state: &mut App, key: KeyEvent) -> KeyOutcome {
// bindings (and don't currently match any Search
// command, so they're a safe fall-through to Continue).
s.input.push_char(c);
s.input_dirty_at = Some(time::OffsetDateTime::now_utc());
mark_input_changed(s);
KeyOutcome::Continue
}
// Normal mode + un-handled Char → no-op (no typing in
@@ -435,6 +435,16 @@ pub fn handle_key_search(state: &mut App, key: KeyEvent) -> KeyOutcome {
}
}
/// v0.17.0 A5 Step 5: every input-mutation site in `handle_key_search`
/// funnels through this helper so the debounce stamp and the
/// short-query advisory stay in sync. Reset is eager — the stale
/// advisory from the previous result set must not visually overlap
/// with a fresh typing session.
fn mark_input_changed(s: &mut crate::app::SearchState) {
s.input_dirty_at = Some(time::OffsetDateTime::now_utc());
s.short_query_hint = None;
}
fn cycle_mode(m: SearchMode) -> SearchMode {
match m {
SearchMode::Lexical => SearchMode::Vector,
@@ -603,6 +613,11 @@ pub(crate) fn fire_search(state: &mut App) -> anyhow::Result<()> {
s.generation = s.generation.wrapping_add(1);
s.searching = true;
s.input_dirty_at = None;
// v0.17.0 A5 Step 5: hint belongs to the *prior* result set —
// a fresh worker spawn invalidates it so the status bar
// doesn't keep showing the old advisory while the new
// query is in flight.
s.short_query_hint = None;
let q_text = s.input.as_str().to_string();
s.last_query = Some((q_text.clone(), s.mode));
(q_text, s.mode, s.generation)
@@ -676,6 +691,18 @@ pub fn poll_worker(state: &mut App) {
s.searching = false;
match result {
Ok(hits) => {
// v0.17.0 A5 Step 5: stale-aware short-query hint.
// The worker carries no copy of the query text;
// we ground the advisory on `s.last_query` which
// was snapshotted at `fire_search` time and (by
// the generation guard above) still matches what
// the user submitted for *this* result set. If
// input has drifted since spawn, the gen-check
// already returned early.
let q_text =
s.last_query.as_ref().map(|(t, _)| t.as_str()).unwrap_or("");
s.short_query_hint =
kebab_app::short_query_hint(q_text, hits.is_empty());
s.hits = hits;
s.selected_hit = 0;
s.preview = None;
@@ -683,6 +710,7 @@ pub fn poll_worker(state: &mut App) {
Err(e) => {
s.hits.clear();
s.selected_hit = 0;
s.short_query_hint = None;
state.error_overlay =
Some(crate::error_popup::ErrorOverlay::from_anyhow(&e));
}


@@ -22,7 +22,7 @@ Cargo workspace, 함수 호출 기반 모듈러 모놀리스. UI binary (`kebab-
| OCR | Ollama vision LM (default `gemma4:e4b`) — `OcrEngine` trait 으로 Tesseract / Apple Vision 등 future swap (HOTFIXES P6-2) |
| Image caption | Ollama vision LM, runtime gate `image.caption.enabled` (default OFF) |
| PDF parser | `lopdf` per-page 텍스트, `chunker_version = "pdf-page-v1"` 가 PDF 자산에 하드코딩 (HOTFIXES P7-3) |
| code parser | `tree-sitter` + `tree-sitter-rust` / `tree-sitter-python` / `tree-sitter-typescript` / `tree-sitter-javascript` / `tree-sitter-go` / `tree-sitter-java` / `tree-sitter-kotlin-ng`**parser-side** (`kebab-parse-code`), chunker-side 아님 (design §6.3). chunker versions: Rust = `code-rust-ast-v1`, Python = `code-python-ast-v1`, TypeScript = `code-ts-ast-v1`, JavaScript = `code-js-ast-v1`, Go = `code-go-ast-v1`, Java = `code-java-ast-v1`, Kotlin = `code-kotlin-ast-v1`. `ast_chunk_max_lines = 200` 상수 고정 (HOTFIXES 2026-05-19 — Chunker trait 이 per-medium config 미노출). Kotlin grammar 은 `tree-sitter-kotlin-ng` 사용 — bare `tree-sitter-kotlin` 은 tree-sitter 0.210.23 에 고착되어 있어 사용 불가. **Tier 2 (p10-2)**: YAML/k8s → `serde_yaml` + `k8s-manifest-resource-v1` (apiVersion+kind per resource), Dockerfile → `dockerfile-file-v1` (whole-file), Cargo.toml/go.mod/.json/.xml/.groovy → `manifest-file-v1` (whole-file). Tier 2 chunkers live in `kebab-chunk`; no tree-sitter grammar needed (structure from file type, not AST). |
| code parser | `tree-sitter` + `tree-sitter-rust` / `tree-sitter-python` / `tree-sitter-typescript` / `tree-sitter-javascript` / `tree-sitter-go` / `tree-sitter-java` / `tree-sitter-kotlin-ng`**parser-side** (`kebab-parse-code`), chunker-side 아님 (design §6.3). chunker versions: Rust = `code-rust-ast-v1`, Python = `code-python-ast-v1`, TypeScript = `code-ts-ast-v1`, JavaScript = `code-js-ast-v1`, Go = `code-go-ast-v1`, Java = `code-java-ast-v1`, Kotlin = `code-kotlin-ast-v1`. `ast_chunk_max_lines = 200` 상수 고정 (HOTFIXES 2026-05-19 — Chunker trait 이 per-medium config 미노출). Kotlin grammar 은 `tree-sitter-kotlin-ng` 사용 — bare `tree-sitter-kotlin` 은 tree-sitter 0.210.23 에 고착되어 있어 사용 불가. **Tier 2 (p10-2)**: YAML/k8s → `serde_yaml` + `k8s-manifest-resource-v1` (apiVersion+kind per resource), Dockerfile → `dockerfile-file-v1` (whole-file), Cargo.toml/go.mod/.json/.xml/.groovy → `manifest-file-v1` (whole-file). Tier 2 chunkers live in `kebab-chunk`; no tree-sitter grammar needed (structure from file type, not AST). **Tier 3 (p10-3)**: shell scripts (`.sh`/`.bash`/`.zsh`) direct → `code-text-paragraph-v1` (blank-line paragraph segmentation + 80-line / 20-overlap line-window for oversize). Same chunker also serves as fallback when Tier 1/2 emit 0 chunks or Err — non-k8s YAML / invalid YAML / AST extractor failures all picked up. symbol = None; lang preserved from input doc. **Tier 1 family complete (p10-1D)**: C (`tree-sitter-c`, `code-c-ast-v1`, `.c`/`.h`) + C++ (`tree-sitter-cpp`, `code-cpp-ast-v1`, `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx`). C symbol = function name only; C++ symbol = `namespace::Class::method` (recursive nesting). `.h` 가 C++ syntax 만나면 tree-sitter-c parse 실패 → Tier 3 fallback. |
| 1B symbol path | workspace path → module path: Python = dotted prefix (`kebab_eval.metrics.compute_mrr`), TypeScript/JavaScript = slash-style prefix (`src/Foo.Foo.search`). Rust 1A-2 는 file-scope nesting 만 (workspace prefix 없음, 비일관 수용 — HOTFIXES 2026-05-20). |
| TUI | Ratatui + crossterm — P9-1 Library 패널, P9-2/3/4 진행 예정 |
| Desktop | Tauri 2 + `pdfjs-dist` (native PDF render backend 금지) — P9-5 |
@@ -52,7 +52,7 @@ flowchart TB
ppdf["kebab-parse-pdf"]
pimg["kebab-parse-image"]
paud["kebab-parse-audio<br/>(P8 보류)"]
pcode["kebab-parse-code<br/>(P10-1A-2 + P10-1B + P10-1C-Go + P10-1C-JK + P10-2)"]
pcode["kebab-parse-code<br/>(P10-1A-2 + P10-1B + P10-1C-Go + P10-1C-JK + P10-2 + P10-3 + P10-1D)"]
ptypes["kebab-parse-types"]
norm["kebab-normalize"]
chunk["kebab-chunk"]
@@ -127,7 +127,7 @@ flowchart TB
Direct UI → store/llm/parse dependencies are forbidden. Every user-facing entry point goes through the `kebab-app` facade only (frozen design §8). For `kebab-cli` to honor the `--config <path>` flag, Config is threaded explicitly through the `kebab_app::*_with_config(cfg, …)` companions; see the `--config` entry in [tasks/HOTFIXES.md](../tasks/HOTFIXES.md) for the detailed reasoning.
`kebab-parse-code` 의 외부 tree-sitter grammar crate 의존: P10-1A-2 에서 `tree-sitter-rust` 추가, P10-1B 에서 `tree-sitter-python` / `tree-sitter-typescript` / `tree-sitter-javascript` 추가, P10-1C-Go 에서 `tree-sitter-go` 추가, P10-1C-JK 에서 `tree-sitter-java` / `tree-sitter-kotlin-ng` 추가. 모두 `kebab-parse-code` 에만 격리 (facade 룰 — UI crate / chunker 가 직접 import 금지). Kotlin 은 `tree-sitter-kotlin-ng` 사용 (bare `tree-sitter-kotlin` 은 tree-sitter 0.210.23 에 고착 — 사용 불가).
`kebab-parse-code` 의 외부 tree-sitter grammar crate 의존: P10-1A-2 에서 `tree-sitter-rust` 추가, P10-1B 에서 `tree-sitter-python` / `tree-sitter-typescript` / `tree-sitter-javascript` 추가, P10-1C-Go 에서 `tree-sitter-go` 추가, P10-1C-JK 에서 `tree-sitter-java` / `tree-sitter-kotlin-ng` 추가, P10-1D 에서 `tree-sitter-c` / `tree-sitter-cpp` 추가. 모두 `kebab-parse-code` 에만 격리 (facade 룰 — UI crate / chunker 가 직접 import 금지). Kotlin 은 `tree-sitter-kotlin-ng` 사용 (bare `tree-sitter-kotlin` 은 tree-sitter 0.210.23 에 고착 — 사용 불가).
## 디렉토리 구조
@@ -165,12 +165,15 @@ kebab/
│ ├── kebab-source-fs/ # 워크스페이스 walk + checksum (P1-1)
│ ├── kebab-parse-md/ # Markdown frontmatter + blocks (P1-2/3)
│ ├── kebab-normalize/ # ParsedBlock → CanonicalDocument (P1-4)
│ ├── kebab-chunk/ # heading-aware + pdf-page-v1 + code-*-ast-v1 (Tier 1) + k8s-manifest-resource-v1 + dockerfile-file-v1 + manifest-file-v1 + tier2_shared (P10-2) chunker (P1-5, P7-2, P10-1A-2, P10-1B, P10-1C-Go, P10-1C-JK, P10-2)
│ ├── kebab-chunk/ # heading-aware + pdf-page-v1 + code-*-ast-v1 (Tier 1) + k8s-manifest-resource-v1 + dockerfile-file-v1 + manifest-file-v1 + tier2_shared (P10-2) + code-text-paragraph-v1 (P10-3) chunker (P1-5, P7-2, P10-1A-2, P10-1B, P10-1C-Go, P10-1C-JK, P10-2, P10-3, P10-1D)
│ │ └── src/
│ │ ├── code_*_ast_v1.rs # Tier 1 AST chunkers (rust/python/ts/js/go/java/kotlin)
│ │ ├── code_*_ast_v1.rs # Tier 1 AST chunkers (rust/python/ts/js/go/java/kotlin/c/cpp)
│ │ ├── code_c_ast_v1.rs # Tier 1 (p10-1D): C top-level fn / struct / enum / union
│ │ ├── code_cpp_ast_v1.rs # Tier 1 (p10-1D): C++ namespace::Class::method (recursive nesting)
│ │ ├── k8s_manifest_resource_v1.rs # Tier 2 (p10-2): YAML multi-doc, apiVersion+kind per resource
│ │ ├── dockerfile_file_v1.rs # Tier 2 (p10-2): whole-file Dockerfile
│ │ ├── manifest_file_v1.rs # Tier 2 (p10-2): whole-file Cargo.toml / go.mod / .json / .xml / .groovy
│ │ ├── code_text_paragraph_v1.rs # Tier 3 (p10-3): blank-line paragraph + 80/20 line-window fallback
│ │ └── tier2_shared.rs # Tier 2 (p10-2): shared oversize fallback + Chunk builder helpers
│ ├── kebab-store-sqlite/ # SQLite + FTS5 (V001/V002/V003) (P1-6, P2-1, P3-3)
│ ├── kebab-search/ # Lexical + Vector + Hybrid retriever (P2-2, P3-4)
@@ -181,7 +184,7 @@ kebab/
│ ├── kebab-eval/ # golden query runner + metrics (P5-1, P5-2)
│ ├── kebab-parse-image/ # ImageExtractor + Ollama OCR + caption (P6)
│ ├── kebab-parse-pdf/ # lopdf per-page text extractor (P7-1)
│ ├── kebab-parse-code/ # tree-sitter AST extractors: Rust (P10-1A-2), Python + TypeScript + JavaScript (P10-1B), Go (P10-1C-Go), Java + Kotlin (P10-1C-JK — java.rs + kotlin.rs); chunker lives in kebab-chunk
│ ├── kebab-parse-code/ # tree-sitter AST extractors: Rust (P10-1A-2), Python + TypeScript + JavaScript (P10-1B), Go (P10-1C-Go), Java + Kotlin (P10-1C-JK — java.rs + kotlin.rs), C + C++ (P10-1D — c.rs + cpp.rs); chunker lives in kebab-chunk
│ ├── kebab-app/ # facade (P0 시그니처 + P3-5/P6-4/P7-3 본체)
│ ├── kebab-tui/ # Ratatui shell + Library 패널 (P9-1)
│ ├── kebab-mcp/ # stdio MCP server — tools: schema, doctor, search, ask (P9-FB-30)


@@ -21,6 +21,30 @@ cargo build --release -p kebab-cli # debug 도 무방. 디버그가 더 빠르
# On a separate host such as a Mac
OLLAMA_HOST=0.0.0.0:11434 ollama serve
ollama pull gemma4:e4b # the default; use gemma4:26b if you want a larger variant
# On CPU-only / RAM ≤ 16 GB hosts, prefer a ≤ 4B Q4 model (gemma3:4b / qwen2.5:3b, etc.).
# With 8B+ models the first RAG answer easily exceeds the 5-minute limit
# (the default [models.llm] request_timeout_secs) and fails with
# `error: kb-rag: llm.generate_stream`.
# To raise the limit, add request_timeout_secs = 1200 to the config
# or set the KEBAB_MODELS_LLM_REQUEST_TIMEOUT_SECS=1200 env var. See HOTFIXES 2026-05-25.
```
Installing into an isolated directory without sudo / systemd (useful for containers,
WSL2, corporate machines, and similar environments):
```bash
# Download only the tarball, unpack it into a user directory, and keep models in a separate directory via OLLAMA_MODELS.
mkdir -p /opt/ollama/{models,logs}
curl -fL https://ollama.com/download/ollama-linux-amd64.tar.zst -o /tmp/ollama.tar.zst
zstd -d /tmp/ollama.tar.zst -o /tmp/ollama.tar && tar -xf /tmp/ollama.tar -C /opt/ollama/
OLLAMA_MODELS=/opt/ollama/models OLLAMA_HOST=127.0.0.1:11434 \
/opt/ollama/bin/ollama serve > /opt/ollama/logs/serve.log 2>&1 &
/opt/ollama/bin/ollama pull gemma3:4b
# To stop: pkill -f "ollama serve"
```
For models with a long cold start (8B+ or the first call), try `kebab ask --stream`:
tokens are streamed to stderr as ndjson, so the first token surfaces quickly even
within the 5-minute timeout (fb-33). See the "Streaming ask (fb-33)" section below for the full commands.
```
Verify reachability from this machine:
@@ -140,6 +164,37 @@ KB ask "이 KB 안에서 ..." --mode hybrid --k 5 # 9. RAG 답변 (Ollama
KB --json ask "..." --mode hybrid # 10. 기계 친화 출력 검증
```
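### Korean trigram search (v0.17.0)
`chunks_fts` uses the FTS5 `trigram` tokenizer, so a Korean query matches as a substring of 3 or more characters. V007 backfills automatically: for an existing KB, replacing the binary with v0.17.0+ applies it immediately (no re-ingest needed). The `kebab.sqlite` file grows because of the larger trigram index, by roughly 2-5x or several hundred MB.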
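The 3-character minimum follows from how trigram indexing works. The sketch below is a simplified model, not SQLite's implementation: `trigrams` and `trigram_match` are hypothetical names, the query is treated as one quoted phrase, and case folding and FTS5 query syntax are ignored.

```rust
/// Splits `text` into overlapping 3-character windows, counted in Unicode
/// scalar values. A text shorter than 3 chars yields no windows at all.
fn trigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    chars.windows(3).map(|w| w.iter().collect::<String>()).collect()
}

/// Simplified phrase match: the query needs at least one trigram, and the
/// whole query must occur in the document as a contiguous substring.
fn trigram_match(doc: &str, query: &str) -> bool {
    // No trigram means there is nothing to look up in the index.
    !trigrams(query).is_empty() && doc.contains(query)
}
```

Under this model a 2-character query such as `충돌` can never match, while `충돌은` matches any document that contains it, regardless of word boundaries.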
Place `fixtures/search/korean/hash-table.md` (or an equivalent) in the workspace, ingest it, then:
```bash
# 3 consecutive chars as a substring (raw; the source contains "해시 충돌은" / "충돌은 발생")
KB search --mode lexical "충돌은"
# multi-token Korean: the builder rewrites it to ("해시 충돌") OR ("해시" "충돌")
# (each token is 2 chars, so the token-AND candidate is trigram-incompatible; the whole phrase hits)
KB search --mode lexical "해시 충돌"
# mixed Korean/English: both tokens are 3+ chars, so whole-phrase and token-AND are both candidates
KB search --mode lexical "Rust 충돌은"
# 2-char query: a normal 0 hit plus stderr `[hint] 3자 이상 키워드 권장`
KB search --mode lexical "충돌"
# the --json output for the same case includes the search_response.v1.hint field
KB search --mode lexical "충돌" --json | jq '.hits | length, .hint'
# → 0
# → "3자 이상 키워드 권장 (trigram tokenizer 제약)"
# raw FTS5 mode (input wrapped in single quotes): explicit user intent, no hint printed
KB search --mode lexical "'충돌'"
```
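The hint rule exercised by the commands above can be sketched as follows. This is an assumption-laden stand-in, not the real `kebab_app::short_query_hint`: the function name, the empty-query guard, and the exact condition are guesses based on the behavior described here (0 hits and a trimmed query under 3 characters). Raw FTS5 mode is not modeled.

```rust
/// Minimum query length, in Unicode chars, that the trigram tokenizer can index.
const TRIGRAM_MIN_CHARS: usize = 3;

/// Hypothetical stand-in for the short-query advisory.
/// Returns a hint only when the search had no hits and the trimmed query is
/// non-empty but shorter than the trigram minimum.
fn short_query_hint_sketch(query: &str, no_hits: bool) -> Option<String> {
    // Count characters, not bytes: one Hangul syllable is 3 bytes in UTF-8.
    let len = query.trim().chars().count();
    if no_hits && len > 0 && len < TRIGRAM_MIN_CHARS {
        Some("3자 이상 키워드 권장 (trigram tokenizer 제약)".to_string())
    } else {
        None
    }
}
```

Counting with `chars()` rather than `len()` matters here, because a byte-length check would treat a single Korean character as already meeting the minimum.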
English lexical search also becomes substring matching: `KB search --mode lexical "token"` also hits `tokenizer` / `tokenize` (recall up, word-boundary precision down).
### Stale doc indicator
Each search hit and RAG citation carries `indexed_at` (RFC3339 of the doc's last
@@ -502,6 +557,100 @@ KB --json schema | jq '.stats.code_lang_breakdown'
- **Dockerfile**: `<dockerfile>` (고정 심볼, 전체 파일이 단일 chunk).
- **TOML / JSON / XML / Groovy / go.mod**: `<manifest>` (고정 심볼, 전체 파일이 단일 chunk). 단, 파일이 `tier2_shared` 의 oversize threshold 초과 시 줄 단위 fallback chunk.
## P10-3 Tier 3 paragraph fallback
Same isolated KB setup as P10-2. `.sh` files go in directly; non-k8s YAML goes in through the fallback.
```bash
# 1) shell script (direct Tier 3)
cat > /tmp/kebab-smoke/workspace/deploy.sh <<'EOF'
#!/usr/bin/env bash
set -e
echo "ingesting..."
kebab ingest
echo "done"
kebab schema --json | jq '.stats'
EOF
# 2) non-k8s YAML (Tier 2 emits 0 chunks → Tier 3 fallback)
cat > /tmp/kebab-smoke/workspace/docker-compose.yml <<'EOF'
version: '3'
services:
api:
image: nginx:latest
ports:
- 8080:80
EOF
# 3) ingest
KB ingest
# 4) per-language search (confirm citation.symbol = None)
KB search --mode hybrid "ingest" --code-lang shell --json | \
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang, chunker: .chunker_version}]}'
# expected: symbol = null, lang = "shell", chunker_version = "code-text-paragraph-v1"
KB search --mode hybrid "nginx" --code-lang yaml --json | \
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang, chunker: .chunker_version}]}'
# expected: symbol = null, lang = "yaml", chunker_version = "code-text-paragraph-v1"
# 5) confirm the shell count in schema stats
KB --json schema | jq '.stats.code_lang_breakdown'
# expected: {"shell": N, "yaml": M, ...}
```
**Tier 3 citation.symbol convention**: always `null`. No semantic unit is identified. `lang` preserves the original lang (shell → `"shell"`, yaml → `"yaml"`, and so on).
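The Tier 3 segmentation (blank-line paragraphs, with an 80-line window and 20-line overlap for oversize paragraphs) can be sketched roughly as below. This is an illustration under assumptions, not the `code-text-paragraph-v1` source: the function names are hypothetical, and ranges are half-open line indices.

```rust
/// Groups consecutive non-blank lines into paragraphs, returned as
/// half-open line ranges `(start, end)`.
fn paragraphs(lines: &[&str]) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    for (i, line) in lines.iter().enumerate() {
        if line.trim().is_empty() {
            // A blank line closes the paragraph that is currently open.
            if let Some(s) = start.take() {
                out.push((s, i));
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    if let Some(s) = start {
        out.push((s, lines.len()));
    }
    out
}

/// Splits `[start, end)` into windows of `window` lines that overlap by `overlap`.
fn line_windows(start: usize, end: usize, window: usize, overlap: usize) -> Vec<(usize, usize)> {
    assert!(window > overlap, "window must exceed overlap");
    let step = window - overlap;
    let mut out = Vec::new();
    let mut s = start;
    loop {
        let e = (s + window).min(end);
        out.push((s, e));
        if e == end {
            break;
        }
        s += step;
    }
    out
}

/// Paragraphs of at most 80 lines stay whole; larger ones use 80/20 windows.
fn chunk_ranges(lines: &[&str]) -> Vec<(usize, usize)> {
    paragraphs(lines)
        .into_iter()
        .flat_map(|(s, e)| if e - s > 80 { line_windows(s, e, 80, 20) } else { vec![(s, e)] })
        .collect()
}
```

With these parameters each window starts 60 lines after the previous one, so a 200-line paragraph is covered by three overlapping windows.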
## P10-1D C + C++ AST chunkers
Same isolated KB setup as P10-3. `.c` and `.cpp` files are each handled by their own AST chunker.
```bash
# 1) C file: top-level function symbol
cat > /tmp/kebab-smoke/workspace/parser.c <<'EOF'
#include <stdio.h>
int parse_record(const char *line) {
if (line == NULL) return -1;
return 0;
}
EOF
# 2) C++ file: namespace::Class::method symbol
cat > /tmp/kebab-smoke/workspace/chunker.cpp <<'EOF'
namespace kebab {
namespace chunk {
class Foo {
public:
void bar() { /* impl */ }
};
} // namespace chunk
} // namespace kebab
EOF
# 3) ingest
KB ingest
# 4) per-language search (check citation.symbol)
KB search --mode hybrid "parse_record" --code-lang c --json | \
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang}]}'
# expected: symbol = "parse_record" (function name only), lang = "c"
KB search --mode hybrid "bar" --code-lang cpp --json | \
jq '{hits: [.hits[] | {symbol: .citation.symbol, lang: .citation.lang}]}'
# expected: symbol = "kebab::chunk::Foo" or "kebab::chunk::Foo::bar" (namespace::Class[::method]), lang = "cpp"
# 5) confirm the C/C++ counts in schema stats
KB --json schema | jq '.stats.code_lang_breakdown'
# expected: {"c": N, "cpp": M, ...}
```
**Tier 1 (p10-1D) citation.symbol convention**: C uses the function name only (no nesting, as in `parse_record`). C++ uses `namespace::Class::method` (recursive namespace + class nesting). When a `.h` file contains C++ syntax (namespace / template / class), the tree-sitter-c parse fails and the file is picked up automatically by the p10-3 Tier 3 fallback (`code-text-paragraph-v1`).
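The routing described above (extension picks the primary chunker; an empty or failed primary result falls through to Tier 3) might look like the sketch below. The function names and the `Result<usize, String>` shape are assumptions for illustration; the real dispatch lives in `kebab-app` and goes through `code_lang_for_path`.

```rust
/// Chunker version used by the Tier 3 paragraph fallback.
const TIER3: &str = "code-text-paragraph-v1";

/// Hypothetical extension → primary chunker mapping for the p10-1D / p10-3 types.
fn primary_chunker(ext: &str) -> Option<&'static str> {
    match ext {
        "c" | "h" => Some("code-c-ast-v1"),
        "cpp" | "cc" | "cxx" | "hpp" | "hh" | "hxx" => Some("code-cpp-ast-v1"),
        "sh" | "bash" | "zsh" => Some(TIER3),
        _ => None,
    }
}

/// Picks the chunker that ends up producing the chunks: the primary one when it
/// emitted at least one chunk, otherwise the Tier 3 fallback.
/// `primary_result` models "number of chunks emitted" or a parse/extract error.
fn effective_chunker(ext: &str, primary_result: Result<usize, String>) -> Option<&'static str> {
    let primary = primary_chunker(ext)?;
    match primary_result {
        Ok(n) if n > 0 => Some(primary),
        // 0 chunks or Err: fall back to paragraph segmentation.
        _ => Some(TIER3),
    }
}
```

A `.h` header that is really C++ is the motivating case: the C extractor fails, and the file still gets indexed through the fallback instead of being dropped.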
## Verification checklist
- `kebab doctor` honors the `--config` path and prints the `storage.data_dir` from it (not the XDG default).
@@ -537,6 +686,8 @@ rm -rf /tmp/kebab-smoke # 통째로 정리
- (P10-1C-Go) `.go` 파일을 워크스페이스에 두면 `kebab ingest` 가 `code-go-ast-v1` 로 처리. `--code-lang go` 검색이 `citation.symbol` 에 `<package>.<Func>` / `<package>.(*Receiver).<Method>` 형식 결과를 반환하면 wiring 정상. `kebab schema --json | jq .stats.code_lang_breakdown` 에 `"go": N` 등장 확인.
- (P10-1C-JK) `.java` 파일은 `code-java-ast-v1`, `.kt`/`.kts` 파일은 `code-kotlin-ast-v1` 로 처리. `--code-lang java` / `--code-lang kotlin` 검색이 `citation.symbol` 에 `com.foo.Foo.bar` 형식 결과를 반환하면 wiring 정상. `kebab schema --json | jq .stats.code_lang_breakdown` 에 `"java": N` / `"kotlin": N` 등장 확인.
- (P10-2) `.yaml`/`.yml` 파일은 apiVersion+kind 파싱으로 k8s resource 별 chunk 생성 (`k8s-manifest-resource-v1`). `Dockerfile`/`Dockerfile.*` 는 전체 파일 단일 chunk (`dockerfile-file-v1`). `.toml`/`.json`/`.xml`/`.groovy`/`go.mod` 는 전체 파일 단일 chunk (`manifest-file-v1`). `--code-lang yaml` / `--code-lang dockerfile` / `--code-lang toml` 검색이 `citation.symbol` 에 각각 `Deployment/default/my-app` / `<dockerfile>` / `<manifest>` 형식 결과를 반환하면 wiring 정상. `kebab schema --json | jq .stats.code_lang_breakdown` 에 `"yaml": N` / `"dockerfile": N` / `"toml": N` 등장 확인.
- (P10-3) `.sh`/`.bash`/`.zsh` files go directly to Tier 3 (`code-text-paragraph-v1`). For non-k8s YAML (yaml without apiVersion+kind) the k8s chunker emits 0 chunks and the Tier 3 fallback picks it up. Wiring is correct if `--code-lang shell` / `--code-lang yaml` searches return results with `citation.symbol = null` and `chunker_version = "code-text-paragraph-v1"`. Confirm that `"shell": N` appears in `kebab schema --json | jq .stats.code_lang_breakdown`.
- (P10-1D) `.c` / `.h` files use `code-c-ast-v1` (function-name-only symbol). `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx` use `code-cpp-ast-v1` (`namespace::Class::method` symbol). Confirm that `--code-lang c` / `--code-lang cpp` searches work and that `"c": N` / `"cpp": M` appear in `kebab schema --json | jq .stats.code_lang_breakdown`. A `.h` file with C++ content (namespace and the like) is picked up automatically by the Tier 3 (`code-text-paragraph-v1`) fallback.
- (P7-3 + follow-up) 동일 path 에 byte 가 다른 PDF 를 두 번째 ingest 하면 `purge_vector_orphans_for_workspace_path` 가 옛 chunk_id 를 LanceDB 에서 먼저 삭제, 이어서 `purge_orphan_at_workspace_path` 가 옛 doc / chunks / embedding_records 를 SQLite 에서 sweep. 새 byte 가 새 `doc_id` 로 색인됨. `IngestReport` 에 그 자산만 `new+=1` (다른 자산은 `updated`). 두 store 모두 정합 — 옛 본문 검색 시 옛 chunks 가 더 이상 surface 되지 않음.
### Embedding upgrade (fb-39b)


@@ -0,0 +1,930 @@
# p10-1D C + C++ AST Chunkers Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Activate C + C++ code ingest end-to-end. P10 Tier 1 chunker family final entry.
**Architecture:** Same shape as 1B (multi-language single PR) and 1C-JK (JVM family). 2 new tree-sitter grammars + 2 extractors + 2 chunkers + media routing (delegated via `code_lang_for_path`, no change) + app dispatch arms. C symbol = function name only; C++ symbol = `namespace::Class::method` via recursive class/namespace nesting (Java/Kotlin + Python hybrid).
**Tech Stack:** Rust 2024 workspace, `tree-sitter` 0.26 (already), `tree-sitter-c` + `tree-sitter-cpp` (NEW). 1A-2/1B/1C/p10-2/p10-3 infrastructure unchanged.
**Memory note:** Host has been OOM'd previously (it had to be rebooted). Per-crate cargo only. ONE full-suite + clippy invocation in Task J. NO `cargo test --workspace` outside that gate.
---
## Pre-flight
Branch `feat/p10-1d-c-cpp` already exists (spec commit `8add684`).
- [ ] **Disk hygiene**: check `df -h /`. If usage is above 80%, run `cargo clean`.
Reference files:
- 1C-JK extractor: `crates/kebab-parse-code/src/{java,kotlin}.rs` — closest template for source-side identifier prefix (package vs namespace).
- 1B Python extractor: `crates/kebab-parse-code/src/python.rs` — class-nesting recursion model (relevant for C++ class nesting).
- 1A-2 chunker: `crates/kebab-chunk/src/code_rust_ast_v1.rs` — duplicate-with-substitution pattern.
- 1B/1C/p10-2/p10-3 dispatch generalization: `crates/kebab-app/src/lib.rs::ingest_one_code_asset` (~L1796–2116). Current allowlist + 4-arm match.
- spec: `tasks/p10/p10-1d-c-cpp-ast-chunker.md`.
---
## Task A: Workspace deps (tree-sitter-c + tree-sitter-cpp)
**Files:**
- Modify: `Cargo.toml` (`[workspace.dependencies]`, after `tree-sitter-kotlin-ng`)
- Modify: `crates/kebab-parse-code/Cargo.toml`
- [ ] **Step 1**: `cargo add tree-sitter-c tree-sitter-cpp -p kebab-parse-code`. If either crate's actively-maintained name differs (e.g. `tree-sitter-cpp` vs `tree-sitter-cpp-ng`), verify on crates.io. The `tree-sitter-c` 0.24 / `tree-sitter-cpp` 0.23 line is the most common; verify compatibility with workspace `tree-sitter = "0.26"` (likely already supported via the `tree-sitter-language` shim).
- [ ] **Step 2**: Lift the two resolved versions into `[workspace.dependencies]` (after `tree-sitter-kotlin-ng`):
```toml
# C/C++ family grammars for code ingest (kebab-parse-code, p10-1D).
tree-sitter-c = "<resolved>"
tree-sitter-cpp = "<resolved>"
```
Switch crate's `Cargo.toml` entries to `{ workspace = true }`.
- [ ] **Step 3**: `cargo build -p kebab-parse-code` → clean. Unused dep warning is fine.
- [ ] **Step 4**: Commit:
```bash
git add Cargo.toml Cargo.lock crates/kebab-parse-code/Cargo.toml
git commit -m "$(cat <<'EOF'
build(p10-1d): add tree-sitter-c + tree-sitter-cpp workspace deps
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
If a crate's resolved name has a non-obvious fork suffix (e.g. `tree-sitter-cpp-ng`), document it in the commit body.
---
## Task B: C AST extractor (`kebab-parse-code/src/c.rs`)
**Files:**
- Create: `crates/kebab-parse-code/src/c.rs`
- Modify: `crates/kebab-parse-code/src/lib.rs` (pub mod + `C_PARSER_VERSION` const)
- [ ] **Step 1**: Create `crates/kebab-parse-code/src/c.rs`. Mirror `crates/kebab-parse-code/src/go.rs` (closest template — single-language, no namespace/package nesting, top-level units). Replace tree-sitter-go with tree-sitter-c:
```rust
//! p10-1D: C AST extractor.
use crate::traits::{Extractor, ExtractContext};
use anyhow::{Context, Result};
use kebab_core::{Block, BlockId, CanonicalDocument, CodeBlock, CommonBlock, /*..*/, SourceSpan, id_for_block, id_for_doc};
use tree_sitter::Parser;
// NOTE: env!("CARGO_PKG_VERSION") expands to the version of kebab-parse-code,
// not of the tree-sitter-c grammar crate. Hardcoding the grammar version is
// likely more stable; check how go.rs / rust.rs / etc. set their PARSER_VERSION.
pub const C_PARSER_VERSION: &str = concat!("tree-sitter-c-", env!("CARGO_PKG_VERSION"));
pub struct CAstExtractor {
parser: Parser,
}
impl CAstExtractor {
pub fn new() -> Self {
let mut parser = Parser::new();
parser.set_language(&tree_sitter_c::LANGUAGE.into()).expect("load tree-sitter-c");
Self { parser }
}
}
impl Extractor for CAstExtractor {
fn extract(&mut self, ctx: &ExtractContext, bytes: &[u8]) -> Result<CanonicalDocument> {
// ... mirror go.rs:
// 1. parse the tree
// 2. iterate source_file's named_children
// 3. for each top-level node:
// - function_definition → emit unit (symbol = fn name)
// - struct_specifier (named) → emit unit (symbol = struct name)
// - enum_specifier (named) → emit unit (symbol = enum name)
// - union_specifier (named) → emit unit (symbol = union name)
// - declaration → glue
// - preproc_include / preproc_def / preproc_function_def / preproc_ifdef → glue
// - else → glue
// 4. <top-level> glue chunk if any glue accumulated
// 5. <module> post-pass if 0 units
// ...
todo!("mirror go.rs structure with C-specific node-kind names")
}
}
```
**ACTION**: Read `crates/kebab-parse-code/src/go.rs` in full first. It's the closest template — single-language, no namespace prefix to thread through (C is even simpler than Go since there's no `package`). Port the structure: parse → iterate top-level → match on node-kind → emit units or accumulate glue.
Node-kind name reference (tree-sitter-c): `function_definition`, `struct_specifier`, `enum_specifier`, `union_specifier`, `declaration`, `preproc_*`. Confirm by checking the crate's `node-types.json` if uncertain.
**Function name extraction**: `function_definition` has a `declarator` field. The innermost `identifier` of that declarator is the function name. Mirror how go.rs extracts function names — it uses tree-sitter field traversal.
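The "innermost identifier" rule can be illustrated without tree-sitter. The sketch below uses a hypothetical `Declarator` enum as a stand-in for the nested declarator nodes of a `function_definition`; the real extractor would walk `tree_sitter::Node` values through the `declarator` field instead.

```rust
/// Simplified model of a C declarator chain (an assumption for illustration,
/// not tree-sitter's API). Each wrapper holds the declarator it decorates.
enum Declarator {
    Identifier(String),
    Pointer(Box<Declarator>),
    Function(Box<Declarator>),
    Parenthesized(Box<Declarator>),
}

/// Walks down through the wrappers until the identifier leaf is reached.
fn innermost_identifier(d: &Declarator) -> &str {
    match d {
        Declarator::Identifier(name) => name.as_str(),
        Declarator::Pointer(inner)
        | Declarator::Function(inner)
        | Declarator::Parenthesized(inner) => innermost_identifier(inner),
    }
}
```

For `int parse_record(const char *line)` the chain is a function declarator around an identifier; for a pointer-returning function such as `char *dup_line(...)` there is one more pointer layer on the outside, and the same walk still reaches the name.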
- [ ] **Step 2**: Register the module in `crates/kebab-parse-code/src/lib.rs`:
```rust
pub mod c;
pub use c::{CAstExtractor, C_PARSER_VERSION};
```
- [ ] **Step 3**: Build:
```bash
cargo build -p kebab-parse-code 2>&1 | tail -5
```
Expected: clean.
- [ ] **Step 4**: Commit (no test yet — Task D adds the snapshot test):
```bash
git add crates/kebab-parse-code/src/c.rs crates/kebab-parse-code/src/lib.rs
git commit -m "$(cat <<'EOF'
feat(p10-1d): C AST extractor (tree-sitter-c)
Top-level units: function_definition (symbol = fn name), struct_specifier,
enum_specifier, union_specifier (each emits 1 unit with the symbol being
the named identifier). Preprocessor directives + top-level declarations
group into a <top-level> glue chunk. Empty file or zero units → <module>
post-pass.
C symbol = function name only — no namespace, no class nesting (design §3.4).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task C: C++ AST extractor (`kebab-parse-code/src/cpp.rs`)
**Files:**
- Create: `crates/kebab-parse-code/src/cpp.rs`
- Modify: `crates/kebab-parse-code/src/lib.rs`
- [ ] **Step 1**: Create `crates/kebab-parse-code/src/cpp.rs`. The closest template is `crates/kebab-parse-code/src/java.rs` (1C-JK) — it handles package prefix + class nesting via recursion. C++ adds namespace nesting (multiple levels possible).
Pseudocode:
```rust
//! p10-1D: C++ AST extractor.
use crate::traits::{Extractor, ExtractContext};
use anyhow::{Context, Result};
use kebab_core::{/* ... */};
use tree_sitter::{Node, Parser};
pub const CPP_PARSER_VERSION: &str = "tree-sitter-cpp-<resolved>";
pub struct CppAstExtractor { parser: Parser }
impl CppAstExtractor {
pub fn new() -> Self {
let mut parser = Parser::new();
parser.set_language(&tree_sitter_cpp::LANGUAGE.into()).expect("load tree-sitter-cpp");
Self { parser }
}
fn visit(&self, node: Node, source: &[u8], prefix: &[&str], units: &mut Vec<(String, Node)>, glue: &mut Vec<Node>) {
// prefix is the namespace/class chain so far (e.g. ["kebab", "chunk", "MdHeadingV1Chunker"]).
for child in node.named_children(&mut node.walk()) {
match child.kind() {
"namespace_definition" => {
let name = child.child_by_field_name("name")
.and_then(|n| n.utf8_text(source).ok())
.unwrap_or("<anonymous>");
let mut new_prefix = prefix.to_vec();
new_prefix.push(name);
let body = child.child_by_field_name("body").unwrap_or(child);
self.visit(body, source, &new_prefix, units, glue);
}
"class_specifier" | "struct_specifier" if child.child_by_field_name("name").is_some() => {
let name = child.child_by_field_name("name")
.and_then(|n| n.utf8_text(source).ok())
.unwrap_or("<anonymous>");
// Emit the class itself as a unit.
let symbol = build_symbol(prefix, &[], name); // e.g. "kebab::chunk::Foo"
units.push((symbol, child));
// Recurse for nested classes / methods.
let mut new_prefix = prefix.to_vec();
new_prefix.push(name);
let body = child.child_by_field_name("body").unwrap_or(child);
self.visit(body, source, &new_prefix, units, glue);
}
"function_definition" => {
// declarator may be qualified_identifier (out-of-class def) or plain identifier.
let symbol = extract_fn_symbol(child, source, prefix);
units.push((symbol, child));
// Do NOT recurse into function body — inner classes/lambdas left to a future revision.
}
"template_declaration" => {
// Recurse: unwrap to inner declarator (function_definition or class_specifier)
// and treat it as if it were directly there. Template params NOT in symbol.
self.visit(child, source, prefix, units, glue);
}
"enum_specifier" if child.child_by_field_name("name").is_some() => {
let name = child.child_by_field_name("name").and_then(|n| n.utf8_text(source).ok()).unwrap_or("<anonymous>");
let symbol = build_symbol(prefix, &[], name);
units.push((symbol, child));
}
"concept_definition" => {
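// Assumes concept_definition exposes a `name` field like enum_specifier; confirm against node-types.json.
let name = child.child_by_field_name("name").and_then(|n| n.utf8_text(source).ok()).unwrap_or("<anonymous>");
let symbol = build_symbol(prefix, &[], name);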
units.push((symbol, child));
}
_ => glue.push(child),
}
}
}
}
fn build_symbol(prefix: &[&str], extras: &[&str], leaf: &str) -> String {
// Join with ::
let mut parts: Vec<&str> = prefix.iter().copied().collect();
parts.extend_from_slice(extras);
parts.push(leaf);
parts.join("::")
}
fn extract_fn_symbol(node: Node, source: &[u8], prefix: &[&str]) -> String {
// function_definition.declarator may be a function_declarator wrapping a
// qualified_identifier (out-of-class def like `void Foo::bar(){}`) or a
// plain identifier (free fn or in-namespace fn).
// Need to walk down to the leaf identifier and any qualifier chain.
// For qualified_identifier "Foo::bar::baz", break into ["Foo", "bar"] qualifier + "baz" leaf.
// ...
todo!("walk declarator → qualified_identifier → assemble symbol with prefix")
}
// Extractor impl: parse, visit(root, ...), emit chunks-of-blocks per (symbol, node) pair + <top-level> glue + <module> fallback.
```
This is the most intricate extractor in p10-1D. **Action**: read `crates/kebab-parse-code/src/java.rs` for the recursion pattern, then `crates/kebab-parse-code/src/python.rs` for the class-nesting pattern, and combine. tree-sitter-cpp's node-types.json (or a quick `tree-sitter parse` against a sample file) confirms exact node-kind names.
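The qualifier handling that `extract_fn_symbol` leaves as a `todo!` can be sketched on plain strings. This is a minimal illustration only: `split_qualified` and `fn_symbol_from_text` are hypothetical helpers that operate on the declarator's name text, whereas the real extractor would walk the tree-sitter nodes. It also assumes the name carries no template arguments.

```rust
/// Join the namespace/class prefix, extra qualifiers, and the leaf with `::`.
fn build_symbol(prefix: &[&str], extras: &[&str], leaf: &str) -> String {
    let mut parts: Vec<&str> = prefix.to_vec();
    parts.extend_from_slice(extras);
    parts.push(leaf);
    parts.join("::")
}

/// Hypothetical helper: split a name such as `Foo::bar::baz` into its
/// qualifier chain (`["Foo", "bar"]`) and leaf (`"baz"`).
/// A plain identifier yields an empty chain; a leading `::` is ignored.
fn split_qualified(name: &str) -> (Vec<&str>, &str) {
    let mut parts: Vec<&str> = name
        .split("::")
        .map(str::trim)
        .filter(|p| !p.is_empty())
        .collect();
    let leaf = parts.pop().unwrap_or("<anonymous>");
    (parts, leaf)
}

/// Hypothetical helper: combine the enclosing prefix with the qualifiers
/// written on an out-of-class definition like `void Foo::bar() {}`.
fn fn_symbol_from_text(prefix: &[&str], declarator_name: &str) -> String {
    let (qualifiers, leaf) = split_qualified(declarator_name);
    build_symbol(prefix, &qualifiers, leaf)
}
```

With an enclosing prefix of `["kebab"]`, the name `Foo::bar` becomes `kebab::Foo::bar`, and a plain `identity` under `["kebab", "chunk"]` becomes `kebab::chunk::identity`.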
- [ ] **Step 2**: Register in `crates/kebab-parse-code/src/lib.rs`:
```rust
pub mod cpp;
pub use cpp::{CppAstExtractor, CPP_PARSER_VERSION};
```
- [ ] **Step 3**: Build:
```bash
cargo build -p kebab-parse-code 2>&1 | tail -5
```
Expected: clean.
- [ ] **Step 4**: Commit:
```bash
git add crates/kebab-parse-code/src/cpp.rs crates/kebab-parse-code/src/lib.rs
git commit -m "$(cat <<'EOF'
feat(p10-1d): C++ AST extractor (tree-sitter-cpp)
Symbol = namespace::Class::method via recursive visit. namespace_definition
pushes namespace name (anonymous → <anonymous>). class_specifier / struct_specifier
(named) emit class unit + recurse with class name pushed. function_definition
emits method unit (symbol may include qualified_identifier prefix for
out-of-class definitions). template_declaration unwraps to inner declarator
(template params NOT in symbol). enum_specifier + concept_definition emit
type-level units. extern "C" block content + using/include/define → glue.
Constructor / destructor symbols use Class::Class / Class::~Class
convention. Operator overloads keep operator+ form.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task D: C chunker + snapshot test
**Files:**
- Create: `crates/kebab-chunk/src/code_c_ast_v1.rs`
- Create: `crates/kebab-chunk/tests/fixtures/sample.c`
- Create: `crates/kebab-chunk/tests/code_c_ast_snapshot.rs`
- Modify: `crates/kebab-chunk/src/lib.rs`
- [ ] **Step 1**: Create `crates/kebab-chunk/src/code_c_ast_v1.rs`. **Mirror `crates/kebab-chunk/src/code_go_ast_v1.rs`** (closest 1-extractor pattern, no nesting):
```rust
//! p10-1D: C AST chunker.
use crate::tier2_shared::build_chunk;
use crate::{Chunker, ChunkPolicy};
use anyhow::Result;
use kebab_core::{Block, Chunk, Document};
pub const VERSION_LABEL: &str = "code-c-ast-v1";
pub struct CodeCAstV1Chunker;
impl Chunker for CodeCAstV1Chunker {
fn chunker_version(&self) -> &'static str { VERSION_LABEL }
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
crate::tier2_shared::policy_hash(policy)
}
fn chunk(&self, doc: &Document, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
// Mirror code_go_ast_v1.rs's body — iterate doc.blocks, each Block::Code
// contributes 1 chunk via build_chunk. Apply oversize fallback per block
// via tier2_shared::push_chunks_with_oversize.
// ...
todo!("mirror code_go_ast_v1.rs verbatim, substituting VERSION_LABEL")
}
}
```
Read `code_go_ast_v1.rs` and port verbatim — the language-agnostic body iterates `doc.blocks` and emits chunks. Only the `VERSION_LABEL` and (potentially) symbol formatting helper change.
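For orientation, the "one chunk per code block, with an oversize fallback" shape can be sketched with stand-in types. Everything here is hypothetical: `Block`, `Chunk`, and `chunk_blocks` are simplified placeholders, not the `kebab_core` types. The line-window split is only an assumed stand-in for whatever `tier2_shared::push_chunks_with_oversize` actually does.

```rust
/// Hypothetical, simplified stand-in for a document block.
#[derive(Debug, Clone, PartialEq)]
enum Block {
    Code { symbol: String, text: String },
    Other,
}

/// Hypothetical, simplified stand-in for an emitted chunk.
#[derive(Debug, Clone, PartialEq)]
struct Chunk {
    symbol: String,
    text: String,
}

/// Emit one chunk per code block. Blocks longer than `max_lines` are split
/// into consecutive line windows that keep the same symbol (assumed
/// oversize policy, for illustration only).
fn chunk_blocks(blocks: &[Block], max_lines: usize) -> Vec<Chunk> {
    let max_lines = max_lines.max(1); // guard against a zero-sized window
    let mut chunks = Vec::new();
    for block in blocks {
        // Non-code blocks contribute nothing in this sketch.
        let Block::Code { symbol, text } = block else { continue };
        let lines: Vec<&str> = text.lines().collect();
        if lines.len() <= max_lines {
            chunks.push(Chunk { symbol: symbol.clone(), text: text.clone() });
            continue;
        }
        for window in lines.chunks(max_lines) {
            chunks.push(Chunk { symbol: symbol.clone(), text: window.join("\n") });
        }
    }
    chunks
}
```

Nothing in this body is language-specific, which is why the plan treats the C chunker as a port of the Go one.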
- [ ] **Step 2**: Create `tests/fixtures/sample.c` (~30 lines, includes top-level fn, struct, enum, preprocessor):
```c
#include <stdio.h>
#include <stdlib.h>
#define MAX_BUF 4096
typedef enum {
OK = 0,
ERR_PARSE,
ERR_IO,
} status_t;
typedef struct {
int id;
char name[64];
status_t status;
} record_t;
static int counter = 0;
int parse_record(const char *line, record_t *out) {
if (line == NULL || out == NULL) return ERR_PARSE;
return OK;
}
void print_record(const record_t *r) {
printf("[%d] %s (status=%d)\n", r->id, r->name, r->status);
}
int main(void) {
record_t r = { .id = 1, .name = "foo", .status = OK };
print_record(&r);
return 0;
}
```
Expected snapshot: 3 function units (`parse_record`, `print_record`, `main`) + 1 enum unit (`status_t`) + 1 struct unit (`record_t`) + 1 `<top-level>` glue (preproc + global var). Total ~6 chunks.
- [ ] **Step 3**: Create `tests/code_c_ast_snapshot.rs` mirroring `tests/code_go_ast_snapshot.rs`. Assertions:
```rust
// Pseudocode:
// 1. Load fixture sample.c
// 2. Run CAstExtractor → Document
// 3. Run CodeCAstV1Chunker.chunk(&doc, &policy)
// 4. Assert chunks.len() == expected (6).
// 5. Assert symbols (from chunks[i].source_spans[0]::SourceSpan::Code.symbol) match expected list:
// ["status_t", "record_t", "parse_record", "print_record", "main", "<top-level>"]
// (order matches AST traversal order — verify by running once.)
// 6. Assert all chunks have lang = Some("c").
```
- [ ] **Step 4**: Register module in `crates/kebab-chunk/src/lib.rs`:
```rust
pub mod code_c_ast_v1;
pub use code_c_ast_v1::CodeCAstV1Chunker;
```
- [ ] **Step 5**: Run test:
```bash
cargo test -p kebab-chunk --test code_c_ast_snapshot -- --nocapture 2>&1 | tail -25
```
Expected: PASS. If chunk count or symbol order differs from expectation, INSPECT the actual output and update the test's expected list to match (run once to learn, codify on second run).
- [ ] **Step 6**: Clippy + commit:
```bash
cargo clippy -p kebab-chunk --all-targets -- -D warnings
git add crates/kebab-chunk/src/code_c_ast_v1.rs \
crates/kebab-chunk/src/lib.rs \
crates/kebab-chunk/tests/fixtures/sample.c \
crates/kebab-chunk/tests/code_c_ast_snapshot.rs
git commit -m "$(cat <<'EOF'
feat(p10-1d): code-c-ast-v1 chunker + snapshot test
Mirrors code-go-ast-v1's chunker pattern (1 chunk per AST unit + <top-level>
glue + oversize fallback). Snapshot test against tests/fixtures/sample.c
(function + struct + enum + preprocessor) verifies symbol order + lang=c
stamping.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task E: C++ chunker + snapshot test
**Files:**
- Create: `crates/kebab-chunk/src/code_cpp_ast_v1.rs`
- Create: `crates/kebab-chunk/tests/fixtures/sample.cpp`
- Create: `crates/kebab-chunk/tests/code_cpp_ast_snapshot.rs`
- Modify: `crates/kebab-chunk/src/lib.rs`
- [ ] **Step 1**: Create `code_cpp_ast_v1.rs`. **Mirror `code_c_ast_v1.rs`** verbatim, only VERSION_LABEL differs:
```rust
pub const VERSION_LABEL: &str = "code-cpp-ast-v1";
pub struct CodeCppAstV1Chunker;
impl Chunker for CodeCppAstV1Chunker {
fn chunker_version(&self) -> &'static str { VERSION_LABEL }
// ... identical body — both languages use the same Block::Code → Chunk emission ...
}
```
The actual symbol-formatting work happens in the EXTRACTOR (Task C). The chunker's job is to iterate blocks the extractor produced and emit Chunks. Both C and C++ chunkers are essentially identical bodies.
- [ ] **Step 2**: Create `tests/fixtures/sample.cpp` (~50 lines, includes namespace + nested class + method + free fn + template):
```cpp
#include <string>
#include <vector>
namespace kebab {
namespace chunk {
class MdHeadingV1Chunker {
public:
MdHeadingV1Chunker() = default;
~MdHeadingV1Chunker() = default;
std::string chunk_doc(const std::string& doc) {
return doc;
}
int operator()(int x) const {
return x * 2;
}
private:
int counter_ = 0;
};
template <typename T>
T identity(T value) {
return value;
}
} // namespace chunk
void global_helper() {
// free function in kebab namespace
}
} // namespace kebab
int main() {
kebab::chunk::MdHeadingV1Chunker c;
return 0;
}
```
Expected snapshot symbols (verify on first run, then codify):
- `kebab::chunk::MdHeadingV1Chunker` (class unit)
- `kebab::chunk::MdHeadingV1Chunker::MdHeadingV1Chunker` (constructor)
- `kebab::chunk::MdHeadingV1Chunker::~MdHeadingV1Chunker` (destructor)
- `kebab::chunk::MdHeadingV1Chunker::chunk_doc`
- `kebab::chunk::MdHeadingV1Chunker::operator()`
- `kebab::chunk::identity` (template fn)
- `kebab::global_helper`
- `main` (free fn, no namespace)
- `<top-level>` (the two `#include` lines; this fixture contains no `using` directive)
~9 chunks total.
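The order of the expected symbols above follows from a depth-first walk with a prefix stack. The sketch below reproduces that walk on a toy tree. `Item`, `collect_symbols`, and `symbols_for` are hypothetical names for illustration; the real extractor works on tree-sitter nodes and also emits `<top-level>` glue, which this sketch omits.

```rust
/// Hypothetical miniature of the C++ constructs the extractor distinguishes.
enum Item {
    Namespace { name: &'static str, body: Vec<Item> },
    Class { name: &'static str, body: Vec<Item> },
    Function { name: &'static str },
}

/// Join the current prefix and a leaf name with `::`.
fn join(prefix: &[&str], leaf: &str) -> String {
    let mut parts = prefix.to_vec();
    parts.push(leaf);
    parts.join("::")
}

/// Depth-first walk: namespaces only extend the prefix, classes emit a unit
/// and then extend the prefix, functions emit a unit and are not entered.
fn collect_symbols(items: &[Item], prefix: &mut Vec<&'static str>, out: &mut Vec<String>) {
    for item in items {
        match item {
            Item::Namespace { name, body } => {
                prefix.push(*name);
                collect_symbols(body, prefix, out);
                prefix.pop();
            }
            Item::Class { name, body } => {
                out.push(join(prefix.as_slice(), name));
                prefix.push(*name);
                collect_symbols(body, prefix, out);
                prefix.pop();
            }
            Item::Function { name } => out.push(join(prefix.as_slice(), name)),
        }
    }
}

/// Convenience wrapper that starts with an empty prefix.
fn symbols_for(items: &[Item]) -> Vec<String> {
    let mut out = Vec::new();
    collect_symbols(items, &mut Vec::new(), &mut out);
    out
}
```

Because the class unit is pushed before its body is visited, the class symbol always precedes its members, which matches the order listed for `sample.cpp`.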
- [ ] **Step 3**: Create `tests/code_cpp_ast_snapshot.rs` mirroring `code_c_ast_snapshot.rs`. Assert symbol list matches expected (run once to learn the actual order, codify).
- [ ] **Step 4**: Register module in `lib.rs`:
```rust
pub mod code_cpp_ast_v1;
pub use code_cpp_ast_v1::CodeCppAstV1Chunker;
```
- [ ] **Step 5**: Run test:
```bash
cargo test -p kebab-chunk --test code_cpp_ast_snapshot -- --nocapture 2>&1 | tail -30
```
Expected: PASS.
- [ ] **Step 6**: Clippy + commit:
```bash
cargo clippy -p kebab-chunk --all-targets -- -D warnings
git add crates/kebab-chunk/src/code_cpp_ast_v1.rs \
crates/kebab-chunk/src/lib.rs \
crates/kebab-chunk/tests/fixtures/sample.cpp \
crates/kebab-chunk/tests/code_cpp_ast_snapshot.rs
git commit -m "$(cat <<'EOF'
feat(p10-1d): code-cpp-ast-v1 chunker + snapshot test
Identical chunker body to code-c-ast-v1; per-language work happens in the
CppAstExtractor (Task C). Snapshot fixture covers nested namespace +
class + ctor/dtor + method + operator overload + template fn + free fn +
top-level main, verifying namespace::Class::method symbol convention per
design §3.4.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task F: ingest_one_code_asset dispatch + tier3 fallback list extension
**Files:**
- Modify: `crates/kebab-app/src/lib.rs`
- [ ] **Step 1**: Top-of-file `use kebab_chunk::{...}` extend with `CodeCAstV1Chunker` + `CodeCppAstV1Chunker`:
```rust
use kebab_chunk::{
/* existing items */,
CodeCAstV1Chunker,
CodeCppAstV1Chunker,
};
```
- [ ] **Step 2**: Allowlist (around line 953) extend:
```rust
if matches!(lang.as_str(),
"rust" | "python" | "typescript" | "javascript" | "go" | "java" | "kotlin"
| "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod"
| "shell"
| "c" | "cpp")
```
- [ ] **Step 3**: `parser_version` match — add C/C++ arms (Tier 1, so they DO get a real parser version):
```rust
let parser_version = match code_lang {
// ... existing 7 Tier 1 + Tier 2 + shell arms ...
"c" => ParserVersion(kebab_parse_code::C_PARSER_VERSION.to_string()),
"cpp" => ParserVersion(kebab_parse_code::CPP_PARSER_VERSION.to_string()),
other => anyhow::bail!("unsupported code_lang: {other}"),
};
```
- [ ] **Step 4**: `chunker_version` match — add C/C++ arms:
```rust
let chunker_version = match code_lang {
// ... existing arms ...
"c" => CodeCAstV1Chunker.chunker_version(),
"cpp" => CodeCppAstV1Chunker.chunker_version(),
other => anyhow::bail!("unreachable chunker_version: {other}"),
};
```
- [ ] **Step 5**: `canonical_result` extract match — add C/C++ arms:
```rust
let canonical_result: anyhow::Result<CanonicalDocument> = match code_lang {
"rust" => RustAstExtractor::new().extract(&ctx, &bytes).context("..."),
// ... existing ...
"c" => CAstExtractor::new().extract(&ctx, &bytes)
.context("kb-parse-code::CAstExtractor::extract (code:c)"),
"cpp" => CppAstExtractor::new().extract(&ctx, &bytes)
.context("kb-parse-code::CppAstExtractor::extract (code:cpp)"),
// ... Tier 2 + shell ...
other => anyhow::bail!("unreachable (extract): {other}"),
};
```
(Add `use kebab_parse_code::{CAstExtractor, CppAstExtractor};` at the top if not already wildcard-imported.)
- [ ] **Step 6**: `chunks_result` match — add C/C++ arms:
```rust
let chunks_result: anyhow::Result<Vec<Chunk>> = if extract_fell_back {
// ... existing ...
} else {
match code_lang {
"rust" => CodeRustAstV1Chunker.chunk(&canonical, chunk_policy).context("..."),
// ... existing ...
"c" => CodeCAstV1Chunker.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeCAstV1Chunker::chunk (code:c)"),
"cpp" => CodeCppAstV1Chunker.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeCppAstV1Chunker::chunk (code:cpp)"),
// ... existing ...
other => anyhow::bail!("unreachable (chunk): {other}"),
}
};
```
- [ ] **Step 7**: `tier3_fallback_cv` (p10-3 Critical fix) — C/C++ are fallback-eligible (extract may fail on `.h` C++ headers or malformed code):
```rust
let tier3_fallback_cv = match code_lang {
"rust" | "python" | "typescript" | "javascript"
| "go" | "java" | "kotlin"
| "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod"
| "c" | "cpp" // p10-1d: C + C++ added
=> Some(CodeTextParagraphV1Chunker.chunker_version()),
_ => None,
};
```
(The exact location of this match is in `ingest_one_code_asset` between ~lines 1921-1927 per the p10-3 critical fix.)
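The effect of the fallback list can be sketched in isolation. This is a hedged illustration: `tier3_fallback_eligible` and `effective_chunker_version` are hypothetical helpers, and the real `ingest_one_code_asset` logic is more involved. Only the language list and the `code-text-paragraph-v1` label are taken from this plan.

```rust
/// Languages whose failed AST extract may fall back to paragraph chunking.
/// The list mirrors the match in Step 7; "shell" is not part of it.
fn tier3_fallback_eligible(code_lang: &str) -> bool {
    matches!(
        code_lang,
        "rust" | "python" | "typescript" | "javascript"
            | "go" | "java" | "kotlin"
            | "yaml" | "dockerfile" | "toml" | "json" | "xml" | "groovy" | "go-mod"
            | "c" | "cpp"
    )
}

/// Hypothetical helper: pick the chunker version that ends up stamped on
/// the chunks, given whether the AST extract succeeded.
fn effective_chunker_version(
    code_lang: &str,
    ast_version: &'static str,
    extract_ok: bool,
) -> Option<&'static str> {
    if extract_ok {
        Some(ast_version)
    } else if tier3_fallback_eligible(code_lang) {
        Some("code-text-paragraph-v1")
    } else {
        None
    }
}
```

Under these assumptions, a `.h` file routed as `c` whose extract fails would end up with `code-text-paragraph-v1`, which is the `.h` fallback case in the verification matrix.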
- [ ] **Step 8**: Build:
```bash
cargo build -p kebab-app 2>&1 | tail -5
```
Expected: clean.
- [ ] **Step 9**: Per-crate test (no regression):
```bash
cargo test -p kebab-app --lib -- --nocapture 2>&1 | tail -10
```
Expected: 52 PASS (existing baseline).
- [ ] **Step 10**: Clippy + commit:
```bash
cargo clippy -p kebab-app --all-targets -- -D warnings
git add crates/kebab-app/src/lib.rs
git commit -m "$(cat <<'EOF'
feat(p10-1d): activate C + C++ in ingest_one_code_asset dispatch
Extends 4-arm match (parser_version / chunker_version / extract / chunks)
+ allowlist + tier3_fallback_cv list with "c" + "cpp" arms. C uses
CAstExtractor + CodeCAstV1Chunker; C++ uses CppAstExtractor +
CodeCppAstV1Chunker. Both langs are Tier 3-fallback-eligible (e.g. .h
file with C++ syntax may fail tree-sitter-c parse → Tier 3 paragraph
fallback).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task G: code_ingest_smoke integration tests (C + C++)
**Files:**
- Modify: `crates/kebab-app/tests/code_ingest_smoke.rs`
- [ ] **Step 1**: Append 2 tests at the end of the file (mirror the existing tier1 tests `c_ast_v1_*` if present; if not, mirror `rust_ast_v1_*` or `go_ast_v1_*`):
```rust
#[test]
fn tier1_c_ingest_searchable() {
let env = TestEnv::lexical_only();
let workspace = env.workspace_root();
std::fs::write(
workspace.join("parser.c"),
"#include <stdio.h>\n\nint parse_record(const char *line) {\n if (line == NULL) return -1;\n return 0;\n}\n",
)
.unwrap();
let report = env.ingest().expect("ingest");
assert!(report.new_docs >= 1, "expected at least 1 new doc");
let hits = env.search_code_lang("c", "parse_record").expect("search");
assert!(!hits.is_empty(), "expected at least 1 c hit");
match &hits[0].citation {
Citation::Code { symbol, lang, .. } => {
assert_eq!(symbol.as_deref(), Some("parse_record"), "C symbol must be function name only");
assert_eq!(lang.as_deref(), Some("c"));
}
other => panic!("expected Citation::Code, got {other:?}"),
}
assert_eq!(
hits[0].chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-c-ast-v1"),
);
}
#[test]
fn tier1_cpp_ingest_searchable() {
let env = TestEnv::lexical_only();
let workspace = env.workspace_root();
std::fs::write(
workspace.join("chunker.cpp"),
"namespace kebab {\nnamespace chunk {\nclass Foo {\npublic:\n void bar() { /* impl */ }\n};\n}\n}\n",
)
.unwrap();
let report = env.ingest().expect("ingest");
assert!(report.new_docs >= 1);
let hits = env.search_code_lang("cpp", "bar").expect("search");
assert!(!hits.is_empty(), "expected at least 1 cpp hit");
match &hits[0].citation {
Citation::Code { symbol, lang, .. } => {
// Symbol could be "kebab::chunk::Foo::bar" or "kebab::chunk::Foo" depending on which chunk hits first.
assert!(
symbol.as_deref().is_some_and(|s| s.starts_with("kebab::chunk::Foo")),
"C++ symbol must start with namespace::Class prefix, got {symbol:?}"
);
assert_eq!(lang.as_deref(), Some("cpp"));
}
other => panic!("expected Citation::Code, got {other:?}"),
}
assert_eq!(
hits[0].chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-cpp-ast-v1"),
);
}
```
- [ ] **Step 2**: Run tests:
```bash
cargo test -p kebab-app --test code_ingest_smoke -- tier1_c_ingest tier1_cpp_ingest --nocapture 2>&1 | tail -30
```
Expected: 2 PASS.
- [ ] **Step 3**: Full smoke regression:
```bash
cargo test -p kebab-app --test code_ingest_smoke -- --nocapture 2>&1 | tail -30
```
Expected: 18 PASS (16 existing + 2 new).
- [ ] **Step 4**: Clippy + commit:
```bash
cargo clippy -p kebab-app --tests -- -D warnings
git add crates/kebab-app/tests/code_ingest_smoke.rs
git commit -m "$(cat <<'EOF'
test(p10-1d): integration smoke tests for C + C++
Verifies end-to-end ingest + search + Citation::Code shape:
- tier1_c_ingest_searchable: .c file → --code-lang c search → symbol
= function name (no nesting), lang = "c", chunker_version = "code-c-ast-v1".
- tier1_cpp_ingest_searchable: .cpp file → --code-lang cpp search →
symbol starts with namespace::Class prefix, lang = "cpp",
chunker_version = "code-cpp-ast-v1".
Brings code_ingest_smoke to 18 tests (Rust 3 + Python 1 + TS 1 + JS 1 +
Go 1 + Java 1 + Kotlin 1 + yaml 1 + dockerfile 1 + manifest 1 + shell 1 +
yaml-fallback 1 + 2 reingest-unchanged regression + c 1 + cpp 1).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task H: frozen design §10 activation log
**Files:**
- Modify: `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md`
- [ ] **Step 1**: Find §10 activation log. Add p10-1D entry right after the p10-3 entry:
```
**p10-1D activation (C + C++) (2026-05-21)**: Tier 1 chunker family complete. Activates the C (`code-c-ast-v1`, `.c`/`.h`) and C++ (`code-cpp-ast-v1`, `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx`) AST chunkers. C symbol = function name only; C++ symbol = `namespace::Class::method` (recursive namespace + class nesting). When a `.h` file contains C++ syntax, the tree-sitter-c parse fails and the p10-3 Tier 3 fallback picks it up automatically.
```
- [ ] **Step 2**: Commit:
```bash
git add docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
git commit -m "$(cat <<'EOF'
docs(p10-1d): activate C + C++ in frozen design §10
P10 Tier 1 chunker family complete.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task I: README + HANDOFF + ARCHITECTURE + SMOKE + tasks/INDEX + tasks/p10/INDEX
**Files:**
- Modify: `README.md` (Mermaid + ingest row), `HANDOFF.md`, `docs/ARCHITECTURE.md`, `docs/SMOKE.md`, `tasks/INDEX.md`, `tasks/p10/INDEX.md`
- [ ] **Step 1 — README.md**: Update the `kebab ingest` row's supported-langs list to include `.c` / `.h` → `code-c-ast-v1` and `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx` → `code-cpp-ast-v1`. Add `--code-lang c` / `--code-lang cpp` to the enumeration. Update the Mermaid `chunker[...]` node to include `code-c-ast-v1, code-cpp-ast-v1` in the brace.
- [ ] **Step 2 — HANDOFF.md**: Append `, **1D ✅ (C + C++ AST chunkers, code-c-ast-v1 + code-cpp-ast-v1 — v0.16.0)**` to the P10 row. Update the one-line summary section (`한 줄 요약`) to include C/C++. Update the next-candidates section (`다음 후보`): drop p10-1D; remaining: P9-5 desktop / P8 audio.
- [ ] **Step 3 — docs/ARCHITECTURE.md**: code parser table row: append C + C++ row mention. Flowchart `pcode` node: append `+ P10-1D`. Directory tree chunkers list: add `code_c_ast_v1.rs` + `code_cpp_ast_v1.rs`.
- [ ] **Step 4 — docs/SMOKE.md**: Add a "## P10-1D C + C++ AST chunker" section after the P10-3 section. Walkthrough with sample.c + sample.cpp ingest + `--code-lang c` / `--code-lang cpp` search assertions. Append verification checklist entry.
- [ ] **Step 5 — tasks/INDEX.md + tasks/p10/INDEX.md**: Flip p10-1D row ⏳ → ✅ (v0.16.0).
- [ ] **Step 6**: Commit:
```bash
git add README.md HANDOFF.md docs/ARCHITECTURE.md docs/SMOKE.md tasks/INDEX.md tasks/p10/INDEX.md
git commit -m "$(cat <<'EOF'
docs(p10-1d): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX sync
P10 Tier 1 chunker family complete (Rust + Python + TS + JS + Go + Java +
Kotlin + C + C++). Tier 2 (k8s + dockerfile + manifest) and Tier 3
(paragraph fallback) already active. Activates p10-1D and flips it to ✅.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task J: workspace test gate + clippy
- [ ] **Step 1**: Disk check (`df -h /`) + optional `cargo clean`.
- [ ] **Step 2**: `cargo test --workspace --no-fail-fast -j 1 2>&1 | tail -80`. Expected: all PASS.
- [ ] **Step 3**: `cargo clippy --workspace --all-targets -- -D warnings 2>&1 | tail -30`. Expected: clean.
---
## Task K: version bump + gitea PR + release
**Files:**
- Modify: `Cargo.toml`
- [ ] **Step 1**: Workspace `version = "0.15.0"` → `"0.16.0"`.
- [ ] **Step 2**: `cargo build -p kebab-cli` to refresh Cargo.lock.
- [ ] **Step 3**: Commit:
```bash
git add Cargo.toml Cargo.lock
git commit -m "$(cat <<'EOF'
chore: bump version 0.15.0 → 0.16.0 (p10-1d C + C++ AST chunkers)
Minor bump — additive new chunker_versions code-c-ast-v1 + code-cpp-ast-v1
+ new routing langs c / cpp + new tree-sitter-c / tree-sitter-cpp workspace
deps. P10 Tier 1 chunker family complete. No DB migration, no wire schema
major bump.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
- [ ] **Step 4**: Push branch + open gitea PR via REST API. Title: `feat(p10-1d): C + C++ AST chunkers — P10 Tier 1 chunker family complete`.
- [ ] **Step 5**: Wait for code-reviewer APPROVE → merge via gitea REST API → cut `gitea-release v0.16.0`.
---
## Verification matrix
| Check | Command | Expected |
|------|------|------|
| C symbol | `kebab search --code-lang c --json` | `Citation::Code.symbol = "<fn_name>"` |
| C++ symbol | `kebab search --code-lang cpp --json` | `Citation::Code.symbol = "namespace::Class::method"` |
| .h fallback | `.h` with C++ syntax → ingest | Tier 3 fallback: `chunker_version = "code-text-paragraph-v1"`, lang = c |
| code_lang_breakdown | `kebab schema --json` | `"c": N`, `"cpp": M` |
---
## Risks reminder (watch for these during implementation)
- **tree-sitter grammar version resolution**: use grammars compatible with tree-sitter 0.26. Default to the latest version on crates.io.
- **tree-sitter-cpp node-kind names**: verify by parsing a fixture that the spec's assumptions (`namespace_definition`, `class_specifier`, `function_definition`, `template_declaration`, `concept_definition`, etc.) match the actual grammar.
- **Restoring the prefix of out-of-class method definitions**: the declarator of `void Foo::bar()` is `function_declarator > qualified_identifier > namespace_identifier "Foo" + identifier "bar"`. The spec's `extract_fn_symbol` must walk this chain exactly.
- **Operator overload**: appears in tree-sitter-cpp as `operator_name` or as a `field_identifier` of the form "operator+". Verify with a fixture.
- **Post-merge deviations** go into the dated log in `tasks/HOTFIXES.md`.

@@ -0,0 +1,396 @@
# Korean trigram FTS tokenizer + dogfood bugfix implementation plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Replace kebab's FTS5 tokenizer, `unicode61` → `trigram`, to enable Korean lexical search, and close two small bugs from the same dogfooding round along with it (C typedef struct not exposed; aggregation unit of code_lang_breakdown).
**Architecture:** Three independent changes ship as separate PRs (A/B/C). PR-A uses a V007 migration to rebuild only the `chunks_fts` shadow table (the source `chunks` table and embeddings are untouched), redesigns `lexical.rs::build_match_string()` for trigram, and adds a short-query hint to the CLI/TUI. PR-B adds typedef alias unit emission to the C extractor, plus **a `parser_version` bump `code-c-v1` → `code-c-v2` and a same-workspace_path orphan purge** (added after Codex round 2 verification). PR-C adds an additive wire field and corrects the description of the existing stats field. After all three merge, cut the v0.17.0 release.
**Tech Stack:** Rust 2024, SQLite FTS5, refinery migrations, tree-sitter-c, cargo test.
**Workflow:** Claude (`executor` agent) writes the code, Codex + Gemini review each PR diff (`/ask codex`, `/ask gemini`), and PRs go through gitea-ops. Design: `docs/superpowers/specs/2026-05-22-korean-trigram-tokenizer-design.md`.
---
## File Structure
**Create:**
- `migrations/V007__fts_trigram.sql`: rebuild chunks_fts with the trigram tokenizer + backfill
**Modify:**
- `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md`: §5.5 verbatim SQL block + correct the "contentless" wording
- `crates/kebab-store-sqlite/tests/fts.rs`: CI diff-check test + Korean trigram (3 or more characters) + 2-character query 0-hit pin + English substring tests
- `crates/kebab-search/src/lexical.rs`: redesign `build_match_string()` (required) + update the BM25 snapshot
- `crates/kebab-cli/src/main.rs` (or the search wrapper): hint message when a query of 2 characters or fewer returns 0 results
- `crates/kebab-tui/src/search.rs`: same hint
- `crates/kebab-parse-code/src/c.rs`: typedef-wrapped struct → synthetic unit + `PARSER_VERSION` bump
- `crates/kebab-app/src/lib.rs`: same-workspace_path orphan purge on the ingest path (parser_version mismatch with an identical asset)
- `crates/kebab-store-sqlite/src/fts.rs`: correct the "contentless FTS5" wording in the module header comment (it is actually a regular FTS5 shadow table)
- `crates/kebab-store-sqlite/src/store.rs`: `code_lang_chunk_breakdown()` (JOIN documents)
- `crates/kebab-app/src/schema.rs`: `Stats.code_lang_chunk_breakdown`
- `docs/wire-schema/v1/schema.schema.json`: additive `code_lang_chunk_breakdown` field
- `docs/SMOKE.md`: add a Korean search scenario
- `README.md`, `HANDOFF.md`, `tasks/HOTFIXES.md`, `tasks/p10/p10-1d-c-cpp-ast-chunker.md`
- `Cargo.toml`: workspace `version`
(`crates/kebab-cli/src/wire.rs` is not modified: `wire_schema()` serializes `SchemaV1` wholesale through serde, so the new field from change 2 is included automatically.)
---
## PR-A — FTS5 trigram tokenizer
Branch: `feat/korean-trigram-tokenizer`. The design doc is also included in this PR (it is not yet committed to main).
### Task A1: Record current query builder behavior + confirm the SQLite version
The Codex review already established that the current `build_match_string()` (lexical.rs:177) is incompatible with trigram (whitespace split → `"..."` AND-joined → Korean multi-token queries return 0 hits). This task only records the builder's exact behavior and confirms the SQLite version; the redesign itself is Task A5 (required).
**Files:**
- Read: `crates/kebab-search/src/lexical.rs` (body of `build_match_string()`, MATCH query construction at lines 260-290, lexical snapshot around line 506)
- [x] **Step 1: Record builder behavior**: `build_match_string()` (lexical.rs:177-200) baseline:
1. `text.trim()` → trimmed. Empty → return `None`.
2. If `strip_single_quotes(trimmed)` matches (that is, `'...'` wraps the whole input and the closing quote is the last char of trimmed) → when inner.trim() is non-empty, `Some(inner.to_string())` (raw FTS5 verbatim mode).
3. Otherwise → `trimmed.split_whitespace().map(escape_fts5_token).collect()` → `None` if empty, else join with ` ` (FTS5 default implicit AND).
- `escape_fts5_token` (lexical.rs:218): wraps the token in `"..."` and doubles any inner `"`.
- No separate handling of prefix `*`; the user must enter it in raw mode.
- Raw mode entry condition: the user wraps the whole trimmed input in single quotes `'...'` (stated in the comment at `lexical.rs:167`).
- MATCH call: lexical.rs:281 `WHERE chunks_fts MATCH ?` (bound parameter).
- [x] **Step 2: Confirm the SQLite version**: `Cargo.toml`: `rusqlite = { version = "0.32", features = ["bundled"] }` + `Cargo.lock` `libsqlite3-sys = "0.30.1"` (independent of the system sqlite; built in-tree). libsqlite3-sys 0.30.1 bundles SQLite ~3.46.x, so trigram (3.34+) is available. Per the design decision, use `tokenize = 'trigram'` alone (case-insensitive by default). The `remove_diacritics` option is not used.
- [x] **Step 3: Locate the lexical snapshot**: the "lexical.rs:506" cited in Codex round 1 turned out to be `fn normalize_bm25` (BM25 score → (0,1] mapping), a numerical transformation that the token stream does not affect. The real snapshots are:
- `crates/kebab-search/tests/lexical.rs:1012` `lexical_snapshot_run_1`: fixture-based, regenerated with the `KEBAB_UPDATE_SNAPSHOTS=1` env var ("baseline snapshot must exist; run with KEBAB_UPDATE_SNAPSHOTS=1 to seed").
- `crates/kebab-search/tests/hybrid.rs:121` `hybrid_snapshot_run_1`: same pattern (`hybrid_snapshot drift`). Affected by Korean trigram (the token stream changes).
- inline `crates/kebab-search/src/lexical.rs:592` `normalize_bm25_top_score_in_unit_interval`: numerical, unaffected (only confirm there is no regression).
Regenerate both lexical_snapshot_run_1 and hybrid_snapshot_run_1 in Task A4 Step 5.
### Task A2: Write the V007 migration
**Files:**
- Create: `migrations/V007__fts_trigram.sql`
- Read: `migrations/V002__fts.sql` (to copy the trigger bodies verbatim)
- [x] **Step 1: Write V007**: create it with the content below. The columns are identical to V002; only `tokenize` changes. The trigger bodies are identical to V002.
```sql
-- V007__fts_trigram.sql
-- Replace the chunks_fts tokenizer: unicode61 -> trigram.
-- Korean is agglutinative; unicode61 tokenizes whole eojeol (with
-- particles attached) so substring matching fails. trigram indexes
-- 3-character grams, enabling Korean partial matches. See design §5.5
-- and tasks/HOTFIXES.md (2026-05-22).
--
-- chunks_fts is a shadow of chunks; this migration rebuilds it in
-- place and backfills from chunks, so no re-ingest is required.
DROP TRIGGER IF EXISTS chunks_au;
DROP TRIGGER IF EXISTS chunks_ad;
DROP TRIGGER IF EXISTS chunks_ai;
DROP TABLE IF EXISTS chunks_fts;
CREATE VIRTUAL TABLE chunks_fts USING fts5(
chunk_id UNINDEXED,
doc_id UNINDEXED,
heading_path,
text,
tokenize = 'trigram'
);
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
INSERT INTO chunks_fts(chunk_id, doc_id, heading_path, text)
VALUES (new.chunk_id, new.doc_id, new.heading_path_json, new.text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
DELETE FROM chunks_fts WHERE chunk_id = old.chunk_id;
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
DELETE FROM chunks_fts WHERE chunk_id = old.chunk_id;
INSERT INTO chunks_fts(chunk_id, doc_id, heading_path, text)
VALUES (new.chunk_id, new.doc_id, new.heading_path_json, new.text);
END;
INSERT INTO chunks_fts(chunk_id, doc_id, heading_path, text)
SELECT chunk_id, doc_id, heading_path_json, text FROM chunks;
```
> Before Step 1, compare the `CREATE VIRTUAL TABLE` column list and trigger bodies in `migrations/V002__fts.sql` against the SQL above and confirm that they match V002 exactly in trigger bodies and column names (`heading_path_json`, etc.). If they differ, treat V002 as the source of truth.
- [x] **Step 2: Verify the migration applies**: `cargo test -p kebab-store-sqlite` passes (10/10 fts tests + all store tests PASS). The V007 backfill also works.
### Task A3: Update the design §5.5 verbatim block + CI diff-check
**Files:**
- Modify: `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` (§5.5, lines ~1024-1043)
- Modify: `crates/kebab-store-sqlite/tests/fts.rs` (`fts_v002_matches_design_section_5_5_verbatim`, lines ~408-435)
- [x] **Step 1: Confirm the diff-check test baseline**: during A2 verification, `fts_v002_matches_design_section_5_5_verbatim` PASSes (V002 and the design both still say unicode61, so they match). Adding V007 by itself does not break the existing test.
- [x] **Step 2: Update design §5.5**: `tokenize = 'unicode61 remove_diacritics 2'` → `'trigram'`. Add one prose paragraph above the §5.5 body covering the rationale for adopting trigram for Korean, the trade-offs, and an explicit "not contentless" note.
- [x] **Step 3: Retarget the diff-check test at V007**: change the `include_str!` path in `extract_migration_5_5_verbatim_block()` to `V007__fts_trigram.sql`, rename the function `fts_v002_matches_design_section_5_5_verbatim` → `fts_v007_matches_design_section_5_5_verbatim`, and update the assertion message.
- [x] **Step 4: Confirm the tests pass**: `cargo test -p kebab-store-sqlite --test fts` → 10/10 PASS (including `fts_v007_matches_design_section_5_5_verbatim`).
- [ ] **Step 5: Commit**: commit A2 + A3 together.
### Task A4: Korean/English trigram matching tests
**Files:**
- Create: `fixtures/search/korean/hash-table.md` (or equivalent): Korean fixture derived from the dogfooding document (see Step 0)
- Modify: `crates/kebab-store-sqlite/tests/fts.rs`
- Modify: `crates/kebab-app/tests/search_korean.rs` (regression pin + multi-token assert + fixture integration)
- Update: lexical BM25 snapshot (location from A1 Step 3)
- [x] **Step 0: Introduce the Korean fixture (Gemini round 3 medium)**: the real dogfooding document (`/build/cache/dogfood-p10b/workspace/docs/hash-table.md`, 4512 lines of MediaWiki HTML output from Korean Wikipedia, CC-BY-SA) is not committed directly because of its size and license burden. Instead, write and commit a **synthetic fixture** `fixtures/search/korean/hash-table.md` that covers all the dogfooding queries (`충돌은`/`해시 충돌`/`시 충`/`해시충`/`충돌`). Expected behavior per verification query:
- raw `MATCH '충돌은'` → hit (the source text contains `해시 충돌은 발생한다`)
- quoted `MATCH '"해시 충돌"'` → hit (whole phrase)
- quoted `MATCH '"시 충"'` → hit (phrase)
- raw `MATCH '해시충'` → 0-hit (the source text has no contiguous `해시충` without a space)
- raw `MATCH '충돌'` (2 characters) → 0-hit (inherent to trigram)
Follow-up verification that needs a real wiki document fixture is deferred to a separate task.
- [x] **Step 1: Korean trigram matching test**: `fts_trigram_korean_3char_substring_hits` (fts.rs §7). All 5 asserts (raw 3-character hit, quoted phrase hit, `해시충` 0-hit) pass.
- [x] **Step 1b: 2-character query 0-hit pin**: `fts_trigram_korean_short_query_zero_hit_pinned` (`충돌`/`키` 0-hit).
- [ ] **Step 1c: Multi-token Korean query test**: at the `crates/kebab-search` or `crates/kebab-app` integration level. The user query `해시 충돌` must hit through `build_match_string()`. Expected: FAIL at A4 (the current builder ANDs `"해시" "충돌"`, which is a trigram 0-hit), PASS after the Task A5 builder redesign.
- [x] **Step 2: English substring behavior pin**: `fts_trigram_english_substring_hits` (`token` → `tokenizer` hits, `to` 0-hit).
- [x] **Step 3: Confirm passing (partial)**: `cargo test -p kebab-store-sqlite --test fts` → 13/13 PASS (Steps 1/1b/2 + the existing 10). Step 1c comes after A5.
- [ ] **Step 4: Integration regression check**: `cargo test -p kebab-app search_korean` (`러스트` is 3 characters, so it still passes under trigram). Add a `해시 충돌` multi-token assert to `search_korean.rs` (passes after A5).
- [ ] **Step 5: Update the lexical BM25 snapshot**: update the snapshot files identified in A1 Step 3 to the trigram token stream (regenerate with `KEBAB_UPDATE_SNAPSHOTS=1` as noted there, `cargo insta accept`, or manually). Because the snippet token unit becomes a trigram, also review the expected values of word-budget tests.
- [ ] **Step 6: Commit**: `git commit` (test: korean + english trigram matching + bm25 snapshot).
### Task A5: Redesign the lexical.rs query builder (required)
Codex verification: the current `build_match_string()` (lexical.rs:177) splits on whitespace, wraps each token in `"..."`, and joins the tokens with implicit AND. Whenever a token is 2 characters or fewer, the trigram MATCH returns 0 hits, so multi-token Korean queries such as `해시 충돌` break. This task redesigns the builder to work with trigram.
**User decision** (policy for Korean queries of 2 characters or fewer): the lexical core returns a normal 0-hit (no change); the CLI/TUI layer prints the guidance message ("3자 이상 키워드 권장", i.e. "keywords of 3 or more characters recommended").
**A1 baseline notes** (filled in during Task A1 Step 1):
`build_match_string(text: &str) -> Option<String>` (lexical.rs:177-200) baseline:
1. `text.trim()` → trimmed. Empty → `None`.
2. If `strip_single_quotes(trimmed)` matches (single quotes `'...'` wrap the whole trimmed string and the closing quote is the last char; `'foo' bar` is not raw) → when `inner.trim()` is non-empty, `Some(inner.to_string())` (raw FTS5, verbatim).
3. Otherwise → `trimmed.split_whitespace().map(escape_fts5_token).collect()` → `None` if empty, else joined with ` ` (FTS5 default implicit AND).
`escape_fts5_token(tok)` (lexical.rs:218): wraps in `"..."` and doubles any inner `"`.
Regression guard for the redesign: the entry condition for raw mode (single quotes `'...'`) stays as is. `escape_fts5_token` also stays (trigram still needs FTS5 special characters escaped). Only the token composition on the non-raw path changes.
SQLite: rusqlite 0.32 + libsqlite3-sys 0.30.1 **bundled** (in-tree). SQLite ~3.46.x → trigram is available.
Snapshot: `crates/kebab-search/tests/lexical.rs::lexical_snapshot_run_1` + `crates/kebab-search/tests/hybrid.rs::hybrid_snapshot_run_1` (both regenerate with `KEBAB_UPDATE_SNAPSHOTS=1`). The inline `normalize_bm25_top_score_in_unit_interval` is not affected numerically.
**Files:**
- Modify: `crates/kebab-search/src/lexical.rs` (`build_match_string()`)
- Modify: `crates/kebab-cli/src/main.rs` or the search result handling wrapper (guidance message)
- Modify: `crates/kebab-tui/src/search.rs` or the result renderer (guidance message)
- [ ] **Step 1: Write the builder redesign tests (confirm failure)**: tests in which the multi-token Korean query `해시 충돌` and the mixed Korean/English query `Rust 충돌은` both hit, plus a regression test for entering raw FTS mode (the user wraps the query in single quotes `'...'`, `lexical.rs:167`). Expected: FAIL.
- [ ] **Step 2: Redesign `build_match_string()`**: Codex round 2 recommendation (verified algorithm):
1. Raw single-quote mode (the user wraps the query in single quotes `'...'`, `lexical.rs:167`) is kept as is.
2. `whole = escape_fts5_phrase(trimmed)` is always the first candidate, but only when `trimmed.chars().count() >= 3`.
3. Of the whitespace-separated tokens, only those with `chars().count() >= 3` form the escaped-token AND candidate.
4. If both candidates exist, emit `(<whole>) OR (<token_and>)`; if only one exists, emit it unchanged.
5. **If there is no candidate at all, return `None` (an empty MATCH is forbidden: it is an FTS5 syntax error).** On `None` the caller skips the SQL entirely and returns an empty result.
With this, `해시 충돌` (each token 2 characters, whole 5 characters) → hits via the whole-phrase candidate; `충돌` (whole 2 characters, 0 tokens) → `None` → 0-hit; `Rust 충돌은` (both tokens ≥ 3) → AND and whole are both candidates → OR hit. Escaping must still handle `"` and `*` under trigram, so the existing logic is reinforced.
- [ ] **Step 3: Confirm the tests pass**: the new Step 1 tests + Task A4 Steps 1c and 4 (`해시 충돌`) PASS.
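- [ ] **Step 4: Guidance message, CLI**: in the `kebab search` result handling in `crates/kebab-cli/src/main.rs`, when the result is empty and **`query.trim().chars().count() < 3`** (measured on the whole trimmed query), print one line to stderr: "3자 이상 키워드 권장 (trigram tokenizer 제약)" ("keywords of 3 or more characters recommended (trigram tokenizer constraint)"). **Do not use the condition "every token is shorter than 3 characters"** (Codex round 3 medium): it would falsely trigger on valid whole-phrase queries such as `해시 충돌`. In `--json` mode the stderr guidance is not printed (the wire hint is delivered separately in Step 4b).
- [ ] **Step 4b: Add a `hint` field to wire `search_response.v1` (MCP visibility, Gemini round 3 high)**: the hint must also reach the search response used by `--json` mode and MCP so that an LLM/agent can understand the "0 results + fewer than 3 characters" case. Changes:
- Add an additive `hint: Option<String>` field to `SearchResponse` in `crates/kebab-app/src/schema.rs` (or wherever the search response type is defined).
- When the search result is empty and `query.trim().chars().count() < 3`, set `hint = Some("3자 이상 키워드 권장 (trigram tokenizer 제약)")`; otherwise `None`.
- Include the hint in the serialized `search` tool result of `crates/kebab-mcp` (fine if serde does this automatically; verify).
- Specify the additive field `hint: { type: ["string", "null"] }` in `docs/wire-schema/v1/search_response.schema.json` (or the search response schema file).
- The Step 4 stderr guidance in the CLI is for human visibility and the wire hint is for agent visibility; the two are complementary and use the same condition.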
- [ ] **Step 5: Guidance message, TUI**: implementation recommended by Codex rounds 2/3 (based on the actual structure of `search.rs`/`app.rs`/`run.rs`):
- Add a `short_query_hint: Option<String>` field to `SearchState` (near `crates/kebab-tui/src/app.rs:116`).
- **Stale hint prevention (Codex round 3 high)**: today the generation counter increases only on `fire_search`, not on input mutation. When `poll_worker` receives a worker result, it sets the hint only if `last_query == current SearchState.input.content && last_mode == current mode`. On a mismatch (the user is typing a new query) the hint is skipped, so a stale worker result never overwrites the screen for the new input.
- In addition, reset `short_query_hint = None` whenever the input changes (`set_input` and similar).
- Condition for setting the hint: `last_query.trim().chars().count() < 3` (whole trimmed query, unified per Codex round 3 medium; token-based branching is forbidden) + hits empty + not raw mode.
- Display: show one line when `short_query_hint` is `Some`, either in `dynamic_status` (near `crates/kebab-tui/src/run.rs:389`) or in the empty-render branch of the Search pane result area.
- [ ] **Step 6: Guidance message tests**: capture CLI stderr and test each non-printing case (`--json`, queries of 3 or more characters, results ≥ 1). Unit test for the TUI guidance display.
- [ ] **Step 7: Full verification**: `cargo test -p kebab-search -p kebab-cli -p kebab-tui` → new + existing tests PASS.
- [ ] **Step 8: Commit**: `git commit` (feat: trigram-aware query builder + short-query guidance).
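The five-step builder algorithm from Step 2 can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the code in `lexical.rs`: `escape_fts5_phrase` and `strip_single_quotes` are stand-ins written for illustration (the real helpers may differ), an empty raw body is assumed to yield `None`, and collapsing the OR when both candidates are identical is an addition that the plan does not spell out.

```rust
/// Wrap `s` as one FTS5 phrase: surround it with double quotes and double
/// any inner double quote. (Stand-in helper; assumed, not the real one.)
fn escape_fts5_phrase(s: &str) -> String {
    format!("\"{}\"", s.replace('"', "\"\""))
}

/// Return the inner text when `s` is entirely wrapped in single quotes.
/// `'foo' bar` is not raw because the closing quote is not the last char.
fn strip_single_quotes(s: &str) -> Option<&str> {
    s.strip_prefix('\'')?.strip_suffix('\'')
}

/// Build a trigram-aware FTS5 MATCH string, or `None` when no candidate
/// exists (the caller then skips the SQL and returns an empty result).
fn build_match_string(text: &str) -> Option<String> {
    let trimmed = text.trim();
    if trimmed.is_empty() {
        return None;
    }
    // 1. Raw single-quote mode is passed through verbatim.
    if let Some(inner) = strip_single_quotes(trimmed) {
        let inner = inner.trim();
        // Assumption: an empty raw body yields None.
        return (!inner.is_empty()).then(|| inner.to_string());
    }
    // 2. Whole-phrase candidate, only when the trimmed query has >= 3 chars.
    let whole = (trimmed.chars().count() >= 3).then(|| escape_fts5_phrase(trimmed));
    // 3. AND candidate built from whitespace tokens of >= 3 chars.
    let tokens: Vec<String> = trimmed
        .split_whitespace()
        .filter(|t| t.chars().count() >= 3)
        .map(escape_fts5_phrase)
        .collect();
    let token_and = (!tokens.is_empty()).then(|| tokens.join(" "));
    // 4-5. Combine with OR, or return None when there is no candidate.
    match (whole, token_and) {
        // Addition in this sketch: skip the OR when both sides are identical.
        (Some(w), Some(t)) if w == t => Some(w),
        (Some(w), Some(t)) => Some(format!("({w}) OR ({t})")),
        (w, t) => w.or(t),
    }
}
```

Under these assumptions `build_match_string("해시 충돌")` yields the single phrase `"해시 충돌"`, while `build_match_string("충돌")` yields `None`, which matches the two cases called out in Step 2. Lengths are counted with `chars().count()` rather than `len()` because a Hangul syllable occupies 3 bytes in UTF-8.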
### Task A6: Sync user documentation
**Files:**
- Modify: `README.md`, `HANDOFF.md`, `tasks/HOTFIXES.md`, `docs/SMOKE.md`
- [ ] **Step 1: README**: add one line to the Search/Configuration section: for KBs that contain Korean, `--mode lexical`/`hybrid` work on trigram (3-gram) substrings (queries of 3 or more characters recommended). Add one more line noting that the SQLite file (`kebab.sqlite`) grows because of the larger trigram index (roughly 2-5x, or on the order of hundreds of MB, for the dogfooding KB; Gemini round 3 low).
- [ ] **Step 2: HANDOFF**: update the 2026-05-22 Korean lexical entry in the "머지 후 발견된 버그/결정" (bugs/decisions found after merge) section to "resolved by trigram in v0.17.0". Update the status of the Korean tokenizer entry in "P10 dogfooding 백로그" (P10 dogfooding backlog).
- [ ] **Step 3: HOTFIXES**: close out the "Next step (미진행)" (not started) of the 2026-05-22 Korean lexical entry as v0.17.0 / V007. Record the trigram adoption, the English behaviour change, the disk usage increase, and the `heading_path` JSON noise follow-up as dated entries.
- [ ] **Step 4: SMOKE.md**: add a Korean search scenario (Codex round 3 high: the hit queries must not contradict the scenario's own assertions):
- fixture: ingest the `fixtures/search/korean/hash-table.md` (or equivalent) committed in A4 Step 0.
- `kebab search --mode lexical '충돌은'` (a contiguous 3-character substring without spaces in the source) → confirm a hit.
- `kebab search '해시 충돌'` (multi-token; the builder hits via the whole-phrase candidate) → confirm a hit.
- `kebab search --mode lexical '충돌'` (2 characters) → confirm 0-hit + the "3자 이상 키워드 권장" stderr guidance.
- `kebab search --mode lexical '충돌' --json` → confirm an empty hits array + the `hint` field (Step 4b).
- Note the automatic V007 backfill (no re-ingest needed) + the SQLite file size increase (roughly 2-5x, or hundreds of MB, for the dogfooding KB).
- [ ] **Step 4b: SKILL.md (Gemini round 3 medium)**: add one line to the `mcp__kebab__search` section or the Don't section of `integrations/claude-code/kebab/SKILL.md`: "For Korean lexical search, keywords of 3 or more characters give better search quality and recall. Korean queries of 2 characters or fewer (e.g. '값', '키', '충돌') return lexical 0-hit because of how the trigram tokenizer works; check the `hint` field of search_response."
- [ ] **Step 5: Commit**: `git commit` (docs: trigram tokenizer — README/HANDOFF/HOTFIXES/SMOKE/SKILL).
### Task A7: Create PR-A + review loop
- [ ] **Step 1: Full verification**: `cargo test --workspace --no-fail-fast -j 1` + `cargo clippy --workspace --all-targets -- -D warnings`. Confirm both pass.
- [ ] **Step 2: Create the PR**: `feat/korean-trigram-tokenizer` → main PR via gitea-ops. State the design doc link and the automatic V007 backfill (no re-ingest needed) in the body.
- [ ] **Step 3: Review**: review the PR diff with `/ask codex` + `/ask gemini`. Combine both reviews and apply them; when applying, commit to the same branch and re-verify.
- [ ] **Step 4: Merge**: merge once the review changes are applied and CI is green.
---
## PR-B — C typedef-wrapped struct fix
Branch: `feat/c-typedef-struct-unit`.
### Task B1: typedef extractor fix (TDD)
**Files:**
- Modify: `crates/kebab-parse-code/src/c.rs` (extractor lines ~254-262, `PARSER_VERSION` line 34, tests lines ~492-505)
- [ ] **Step 1: Rewrite the existing test (confirm failure)**: rename `c_extractor_typedef_struct_falls_into_glue` → `c_extractor_typedef_struct_emits_unit`. For the input `typedef struct { int x; int y; } Point;`, assert that a unit named `Point` is emitted. Expected: FAIL (it currently falls into glue).
- [ ] **Step 2: Fix the extractor**: when handling a top-level `type_definition` node, if it contains an anonymous `struct_specifier`/`enum_specifier`/`union_specifier` (no name field), extract the name from the `declarator` (typedef alias) of the `type_definition` and emit a unit under that name. The named-struct path is left unchanged. Before changing code, read the current node branching in `c.rs` (the `struct_specifier | enum_specifier | union_specifier` arm) and the child structure of `type_definition` in tree-sitter-c, and match them.
- [ ] **Step 3: Confirm the test passes**: `cargo test -p kebab-parse-code c_extractor_typedef` → PASS.
- [ ] **Step 4: Named-struct regression check**: full `cargo test -p kebab-parse-code` → all existing C extractor tests (named struct, glue, etc.) PASS.
- [ ] **Step 5: parser_version bump**: bump `PARSER_VERSION = "code-c-v1"` at `crates/kebab-parse-code/src/c.rs:34` to `"code-c-v2"`. **Do not touch the chunker (`code-c-ast-v1` in `crates/kebab-chunk/src/code_c_ast_v1.rs`)**: only the extractor output changes and the chunker logic is the same. If C extractor snapshot/integration tests assert the `parser_version` string, update them to `code-c-v2`.
- [ ] **Step 5b: same-workspace_path orphan purge (Codex round 2 critical)**: the parser_version bump alone does refresh the doc_id, but **when the file bytes are identical (same asset_id), the existing ingest's `stale_chunk_ids_at` (which triggers on an asset_id change) does not fire, so the old doc_id row + old chunk rows + Lance vectors remain as orphans and the `idx_docs_workspace_path` UNIQUE constraint can conflict**. Reinforcement:
- **Introduce new helpers (Codex round 3 medium)**: P7-3's `stale_chunk_ids_at` (`store.rs:440`) / `purge_orphan_at_workspace_path` (`store.rs:497`) only handle `asset_id != new_asset_id`, so they are a no-op for a parser-only bump. Rather than calling or extending the existing helpers, add two new helpers to `crates/kebab-store-sqlite/src/store.rs`:
- `stale_chunk_ids_for_workspace_path_except_doc_id(workspace_path, new_doc_id) -> Vec<ChunkId>`: collects the chunk_ids owned by other doc_ids at the same workspace_path.
- `purge_document_at_workspace_path_except_doc_id(workspace_path, new_doc_id)`: removes the rows of other doc_ids at the same workspace_path together with their chunks.
- In the code asset ingest branch of `crates/kebab-app/src/lib.rs` (right after the parser mismatch check, near `lib.rs:812`/`882`), call the two helpers in sequence: collect chunk_ids → `VectorStore::delete_by_chunk_ids` (the P7-3 hotfix helper, reusable because it is chunk_id based) → delete the document/chunks rows → continue the normal ingest under the new doc_id.
- Test: ingest a fixture C file once under `code-c-v1` → mock-change `PARSER_VERSION` to `v2` and re-ingest the same fixture → confirm that the old doc_id row is gone and only the new doc_id remains, that Lance holds vectors only for the new chunk_ids, and that no UNIQUE conflict occurs.
- [ ] **Step 5c: Regression test, existing purge still works for a different asset**: in the case where the bytes really change (asset_id changes), confirm that `stale_chunk_ids_at` still cleans up as before (the Step 5b change must not break the existing path).
- [ ] **Step 6: Confirm the tests pass**: full `cargo test -p kebab-parse-code` → PASS.
- [ ] **Step 7: Commit**: `git commit` (fix: C typedef-wrapped struct emits named unit, parser_version code-c-v2).
### Task B2: HOTFIXES + spec update, PR-B
**Files:**
- Modify: `tasks/HOTFIXES.md`, `tasks/p10/p10-1d-c-cpp-ast-chunker.md`
- [ ] **Step 1: HOTFIXES**: update the Status/Next step of the 2026-05-21 "typedef-wrapped struct/enum in C falls into glue" entry to v0.17.0 closure.
- [ ] **Step 2: spec Risks**: in the Risks/notes of `p10-1d-c-cpp-ast-chunker.md`, update in one line that typedef alias units are now emitted (top-level only; nested anonymous structs still fall into glue). Leave the frozen spec body untouched and change only the Risks section.
- [ ] **Step 3: Commit + PR**: `git commit` (docs) → create PR-B via gitea-ops.
- [ ] **Step 4: Review loop**: `/ask codex` + `/ask gemini` review → apply → merge.
---
## PR-C — code_lang_chunk_breakdown
Branch: `feat/code-lang-chunk-breakdown`.
### Task C1: Add the store function (TDD)
**Files:**
- Modify: `crates/kebab-store-sqlite/src/store.rs` (next to `code_lang_breakdown`, lines ~801-825)
- [ ] **Step 1: Write the test (confirm failure)**: add a store test that checks whether `code_lang_chunk_breakdown()` returns the per-language chunk count based on the `chunks` table. Use a fixture with several chunks in one doc to confirm that the value differs from the doc-level aggregate. Expected: FAIL (function does not exist).
- [ ] **Step 2: Implement the function**: follow the existing `code_lang_breakdown()` pattern but use `chunks` as the source, pulling the language column from `chunks`. If `chunks` has no code_lang of its own, `chunks JOIN documents` and take code_lang from `documents`, then `COUNT(chunks)`. Before Step 2, check the `chunks` and `documents` schemas to see where code_lang lives. The return type is the same `BTreeMap<String, u32>` as `code_lang_breakdown`.
- [ ] **Step 3: Confirm the test passes**: `cargo test -p kebab-store-sqlite code_lang_chunk` → PASS.
- [ ] **Step 4: Commit**: `git commit` (feat: code_lang_chunk_breakdown store query).
### Task C2: Add the wire field (TDD)
**Files:**
- Modify: `crates/kebab-app/src/schema.rs` (`Stats`, lines ~69·170·202-219)
- Modify: `docs/wire-schema/v1/schema.schema.json` (`code_lang_chunk_breakdown` field)
- [ ] **Step 1: Extend the stats test (confirm failure)**: add existence and value checks for the `code_lang_chunk_breakdown` field to the `stats_includes_code_lang_and_repo_breakdown_fields` test in `schema.rs`. The fixture has several chunks in one doc (confirm that the doc count and chunk count are filled with different values). Expected: FAIL (field does not exist).
- [ ] **Step 2: Add the Stats field**: add `code_lang_chunk_breakdown: BTreeMap<String, u32>` to `Stats` and fill it at the stats build site by calling `code_lang_chunk_breakdown()` from Task C1. Keep the existing `code_lang_breakdown` field (removing it would be wire breaking).
- [ ] **Step 3: Confirm wire.rs serializes automatically**: `crates/kebab-cli/src/wire.rs::wire_schema()` serializes `SchemaV1` as a whole through serde, so no separate code change is needed. The existing schema wrapper test under `cargo test -p kebab-cli wire` confirms that the new field appears in the wire JSON output automatically (or add a new assertion).
- [ ] **Step 4: Confirm the tests pass**: `cargo test -p kebab-app schema` + `cargo test -p kebab-cli` → PASS.
- [ ] **Step 5: Update the wire schema JSON (required) + correct existing field descriptions**: in the `Stats` definition of `docs/wire-schema/v1/schema.schema.json`:
- Add `code_lang_chunk_breakdown` additively in the same shape as the existing `code_lang_breakdown` (`{"type": "object", "additionalProperties": {"type": "integer", "minimum": 0}}`).
- Gemini round 2 finding: if the descriptions of the existing `code_lang_breakdown`·`repo_breakdown` wrongly say "chunk count" (the actual implementation is a doc count), correct them to "doc count". State "chunk count" explicitly in the description of the added `code_lang_chunk_breakdown` field.
If CI compares schema against implementation, confirm that it passes too.
- [ ] **Step 6: Commit + PR**: `git commit` (feat: code_lang_chunk_breakdown wire field) → create PR-C via gitea-ops.
- [ ] **Step 7: Review loop**: `/ask codex` + `/ask gemini` review → apply → merge.
---
## Release — v0.17.0
### Task R1: version bump + release cut
- [ ] **Step 1: Preconditions**: confirm that PR-A·B·C are all merged into main. `git pull`, then `cargo test --workspace --no-fail-fast -j 1` green.
- [ ] **Step 2: version bump**: workspace `version` in `Cargo.toml`, `0.16.1` → `0.17.0`. `cargo build` refreshes `Cargo.lock` automatically.
- [ ] **Step 3: Commit**: `git commit` (`chore: bump version 0.16.1 → 0.17.0`).
- [ ] **Step 4: release**: `gitea-release v0.17.0` from gitea-ops. Release notes: Korean lexical search now uses trigram, English lexical search changes to substring matching, C typedef symbols are exposed, new `schema.v1.stats.code_lang_chunk_breakdown` field, automatic V007 migration (no re-ingest needed).
- [ ] **Step 5: HANDOFF/INDEX**: update the version (`v0.17.0`) in the one-line summary and the Phase table of `HANDOFF.md`. At the bottom of the P10 section of `tasks/INDEX.md`, create a "P10 Dogfooding Feedback" section and list the v0.17.0 work (Korean trigram + C typedef + code_lang_chunk_breakdown), following the fb-01~42 format used for P9 (Gemini round 2 recommendation).
---
## Self-Review (after applying Codex+Gemini reviews)
**Spec coverage:** design §3 (change 1) → PR-A Tasks A1-A7, §4 (change 2) → PR-C, §5 (change 3) → PR-B, §6 (PR layout/release) → Task R1, §8 (tests) → the test steps of each task + the 2-character/multi-token/snippet cases in A4, §9 Risks → A5 (builder redesign)·A4 (English behaviour/heading_path noise)·B1 (nested typedef). §10 version cascade → B1 Step 5 (parser_version), R1 (workspace version). Nothing is missing.
**Placeholder scan:** the "A1 baseline notes" in Task A5 are an intentional dynamic slot inside the plan: A1 Step 1 fills them in and A5 reads them. There are no other "TBD/TODO" markers. The full V007 SQL is embedded. The exact code (build_match_string redesign, typedef node branching in c.rs, location of chunks JOIN documents) is specified as "read the file and implement", which is an execution instruction, not a placeholder.
**Type consistency:** the name `code_lang_chunk_breakdown` is identical across the store function (C1), the Stats field (C2 Step 2), and the wire JSON schema (C2 Step 5). The `BTreeMap<String, u32>` return type matches the existing `code_lang_breakdown`. The `chunks_fts` column names are identical in V007, design §5.5, and the diff-check test. The string `parser_version = "code-c-v2"` matches across B1 Step 5, the test updates, and design §5·§10.
**Changes from review (round 1):**
- Added the `lexical.rs::build_match_string()` redesign to the body of change 1 (A5 made mandatory).
- Policy for Korean queries of 2 characters or fewer = 0-hit + CLI/TUI guidance (user decision).
- Corrected the C typedef cascade from chunker_version → **parser_version** (`code-c-v1` → `code-c-v2`).
- Corrected the "contentless" wording in design §3.1 (V002 is a regular FTS5 shadow table).
- Listed the heading_path JSON noise, the disk usage increase, and BM25 snapshot drift under Risks.
- Added missing tasks: SMOKE.md update (A6 Step 4), `docs/wire-schema/v1/schema.schema.json` update (C2 Step 5).
- Removed an incorrect task: modifying `wire.rs` (not needed because serde serializes automatically).
**Changes from review (round 2):**
- **[Critical]** Added a same-workspace_path orphan purge step to PR-B (B1 Step 5b/5c): with a parser_version bump alone, the old doc_id/chunk/vector become orphans in the same-asset case, with a risk of UNIQUE conflict. The actual cascade behaviour is spelled out in the body of design §5.
- **[High]** Corrected the leftover "code-c-ast-v2 chunker bump" in the design §2 table and the plan Architecture to "code-c-v2 parser_version bump".
- **[High]** Corrected the trigram test examples in A4 Step 1 to the behaviour Codex verified on sqlite 3.45.1: use quoted phrases and contiguous substrings without spaces (`'해시충'`/`'시 충'` being 0-hit is correct).
- **[High]** Changed the builder algorithm in A5 Step 2 to the Codex recommendation: whole-phrase candidate + AND of tokens of 3 or more characters → combined with OR, returning `None` when there is no candidate (empty MATCH forbidden).
- **[Medium]** Made the TUI guidance in A5 Step 5 concrete: `SearchState.short_query_hint` field + set in `poll_worker` + shown in `dynamic_status`.
- **[Low]** Added `crates/kebab-store-sqlite/src/fts.rs` (correcting "contentless" in a code comment) to File Structure.
- **[Low]** Added the description correction for the existing stats fields (`code_lang_breakdown`·`repo_breakdown`) to C2 Step 5 (they are actually doc counts).
- **[Low]** Made the INDEX.md update location in R1 Step 5 concrete: the "P10 Dogfooding Feedback" section.
**Changes from review (round 3):**
- **[Codex High]** Corrected the hit queries in the SMOKE.md scenario from `해시충` (absent from the source text) → `충돌은` (3 contiguous characters) + `해시 충돌` (whole phrase). The scenario now also verifies the hint field in JSON mode.
- **[Codex High]** Stale prevention for the TUI short_query_hint: `poll_worker` sets the hint only when `last_query == current input + mode` matches, and the hint is reset when the input changes.
- **[Gemini High]** Added the additive `hint: Option<String>` field to `search_response.v1` (A5 Step 4b) to improve `--json`/MCP visibility. It complements the CLI stderr guidance.
- **[Codex Medium]** Named the PR-B helpers explicitly: the new helpers `stale_chunk_ids_for_workspace_path_except_doc_id` + `purge_document_at_workspace_path_except_doc_id`, which bypass the asset_id condition of the P7-3 helpers.
- **[Codex Medium]** Unified the raw FTS mode notation as single quotes `'...'` (A1 Step 1, A5 Step 1, item 1 of the A5 Step 2 recommendation), following the actual code at `lexical.rs:167`.
- **[Codex Medium]** Fixed the short-query CLI condition to `query.trim().chars().count() < 3` and removed the "every token < 3" branch (avoids false triggers on valid whole-phrase queries). The TUI uses the same condition.
- **[Gemini Medium]** A4 Step 0: copy and commit the Korean dogfooding fixture under `fixtures/search/korean/`, with a LICENSE notice.
- **[Gemini Medium]** A6 Step 4b: one line in `integrations/claude-code/kebab/SKILL.md` on the 3-character recommendation + the hint field.
- **[Gemini Low]** Quantified the README disk usage note (roughly 2-5x, or on the order of hundreds of MB).


@@ -1004,6 +1004,17 @@ CREATE INDEX idx_blocks_doc_id ON blocks(doc_id);
### 5.5 Chunks + FTS5
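Tokenizer = `trigram` (V007, 2026-05-23). Under unicode61 each Korean eojeol
(a word unit with particles and endings attached) became a single token, which
made lexical partial matching impossible; trigram resolves this. Korean queries
of 2 characters or fewer still return 0 hits because of how trigram works (for
single tokens this is not a regression), and multi-token queries match because
`lexical.rs::build_match_string()` ORs in a whole-phrase candidate. Trade-offs:
English lexical search also moves to substring matching (recall up, word-boundary
precision down), the BM25 raw score distribution changes (rank-based RRF hybrid
is barely affected), and the SQLite file grows roughly 2-10×. Details:
`tasks/HOTFIXES.md` (2026-05-22) +
`docs/superpowers/specs/2026-05-22-korean-trigram-tokenizer-design.md`.
`chunks_fts` is a regular FTS5 shadow table and is not contentless (the V002 /
V007 DDL has no `content=''`).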
```sql
CREATE TABLE chunks (
chunk_id TEXT PRIMARY KEY,
@@ -1026,7 +1037,7 @@ CREATE VIRTUAL TABLE chunks_fts USING fts5(
doc_id UNINDEXED,
heading_path,
text,
-tokenize = 'unicode61 remove_diacritics 2'
+tokenize = 'trigram'
);
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
@@ -1551,6 +1562,10 @@ transitional 형태) 의 source of truth.
**p10-2 activation (Tier 2 chunker) (2026-05-20)**: three Tier 2 resource-aware chunkers activated: k8s-manifest-resource-v1 (`.yaml`/`.yml`), dockerfile-file-v1 (`Dockerfile`), manifest-file-v1 (config files such as `Cargo.toml`). Additional code_lang mappings: XML (`.xml`, `pom.xml`), Groovy (`build.gradle`, `.gradle`), Go module (`go.mod`).
**p10-3 activation (Tier 3 paragraph fallback) (2026-05-21)**: Tier 3 chunker `code-text-paragraph-v1` activated. Shell scripts (`.sh`/`.bash`/`.zsh`) are routed to it directly, and it retries as an automatic fallback when Tier 1/2 yields 0 chunks or an Err. Non-k8s YAML, invalid YAML, and AST failure cases are all picked up. lang preserves the input (shell → "shell", yaml → "yaml", etc.); symbol is always None.
**p10-1D activation (C + C++) (2026-05-21)**: the P10 Tier 1 chunker family is complete: C (`code-c-ast-v1`, `.c`/`.h`) + C++ (`code-cpp-ast-v1`, `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx`) AST chunkers activated. C symbol = function name only (no nesting); C++ symbol = `namespace::Class::method` (recursive namespace + class nesting). When a `.h` file contains C++ syntax, the tree-sitter-c parse fails → it is picked up automatically by the p10-3 Tier 3 fallback.
### 10.2 MCP server transport (fb-30)
`kebab mcp` is a stdio JSON-RPC server. Rust SDK = `rmcp 1.6`. Tool surface


@@ -0,0 +1,143 @@
---
title: "v0.17.0 design — Korean trigram FTS tokenizer + P10 round-2 dogfood bug fixes"
date: 2026-05-22
status: draft
contract_sections: ["§5.5", "§9"]
---
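# v0.17.0 design — Korean trigram FTS tokenizer + P10 round-2 dogfood bug fixes
## 1. Background
P10 comprehensive dogfooding round 2 (2026-05-22, `tasks/HOTFIXES.md`) surfaced three issues:
- Korean `kebab search --mode lexical` returns almost 0 hits under the FTS5 `unicode61` tokenizer. unicode61 splits tokens only at whitespace and punctuation, so a Korean eojeol (including its particles and endings) becomes one whole token and partial matching does not work.
- `code_lang_breakdown` aggregates doc counts rather than chunk counts, so in code-heavy KBs the per-language chunk distribution lacks granularity.
- The alias of a C `typedef struct {...} Foo;` is not exposed as a search symbol.
This design bundles all three into a single v0.17.0 release cycle. The main piece is the Korean tokenizer (change 1); the other two are small bug fixes from the same dogfooding round (changes 2 and 3).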
## 2. Scope
| # | Change | crate | cascade |
|---|------|-------|---------|
| 1 | FTS5 `unicode61` → `trigram` tokenizer | kebab-store-sqlite, migrations | V007 migration, design §5.5 update, release cut |
| 2 | `code_lang_chunk_breakdown` wire field | kebab-store-sqlite, kebab-app, kebab-cli | wire additive (not a release trigger) |
| 3 | C typedef-wrapped struct → synthetic unit | kebab-parse-code, kebab-app(ingest), kebab-store-sqlite(purge) | **`parser_version`** bump (`code-c-v1` → `code-c-v2`) + same-workspace_path orphan purge |
The three changes touch independent code paths. Each goes in its own PR, worked on back to back in one session, and v0.17.0 is cut once after all three are merged.
## 3. Change 1 — FTS5 trigram tokenizer (main piece)
### 3.1 Current state
`chunks_fts` in `migrations/V002__fts.sql` is an FTS5 virtual table (the V002 DDL has no `content=''`, so it is a regular FTS5 shadow table, not contentless) created with `tokenize = 'unicode61 remove_diacritics 2'`. INSERT/UPDATE/DELETE on the `chunks` table are synchronized into `chunks_fts` by triggers (`chunks_ai` / `chunks_ad` / `chunks_au`). In other words, `chunks` is the source of truth and `chunks_fts` is the search shadow.
The same SQL is embedded verbatim in design §5.5 (`docs/superpowers/specs/2026-04-27-kebab-final-form-design.md`, lines 1024-1043), and the test `fts_v002_matches_design_section_5_5_verbatim` (`crates/kebab-store-sqlite/tests/fts.rs`) is the CI diff-check that compares the two after whitespace normalization.
### 3.2 Changes
New migration `migrations/V007__fts_trigram.sql`:
1. `DROP TRIGGER` (`chunks_ai`/`chunks_ad`/`chunks_au`) + `DROP TABLE chunks_fts;`: explicitly removes the virtual table and its triggers.
2. `CREATE VIRTUAL TABLE chunks_fts USING fts5(..., tokenize = 'trigram');`: the column layout (`chunk_id`/`doc_id` UNINDEXED, `heading_path`, `text`) is the same as V002; only the tokenizer changes.
3. Recreate the `chunks_ai`/`chunks_ad`/`chunks_au` triggers with the same bodies as V002.
4. `INSERT INTO chunks_fts(chunk_id, doc_id, heading_path, text) SELECT chunk_id, doc_id, heading_path_json, text FROM chunks;`: re-indexes every existing chunk (same pattern as the V002 backfill).
The `chunks` source rows, embeddings, and vector index are not touched at all. Because the migration rebuilds only the FTS shadow, **users do not need to run `kebab ingest` again**: when the 0.17.0 binary opens an existing DB, V007 is applied automatically and the backfill completes as part of it. There is no expensive fastembed recomputation.
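### 3.3 Accompanying updates
- Update the verbatim block in design §5.5 to the V007 SQL. This changes the frozen design, so it is one of the release triggers. Wherever the design body says "contentless", correct it to "shadow / non-contentless" at the same time.
- CI diff-check test: the function name contains `v002`, so rename it to `fts_v007_matches_design_section_5_5_verbatim` and switch the comparison target to the V007 file.
- `rebuild_chunks_fts` in `crates/kebab-store-sqlite/src/fts.rs` needs no code change because the column layout is the same (the tokenizer exists only in the table DDL). Only verify its behaviour.
- **Redesigning `build_match_string()` at `crates/kebab-search/src/lexical.rs:177` is the main piece of this PR.** Codex review verified that the current builder splits on whitespace, wraps each token in `"..."`, and joins with implicit AND → under trigram, tokens of 2 characters or fewer (e.g. `해시`, `충돌`) cannot match → multi-token Korean queries such as `해시 충돌` return 0 hits. A trigram-aware redesign is required. Recommendation: drop tokens shorter than 3 characters (or treat them as raw), and when the whole query is 3 or more characters also add the whole query phrase as an OR candidate.
- **Policy for Korean queries of 2 characters or fewer (user decision)**: the lexical core returns a normal 0-hit (no change); when the result is empty and the query is shorter than 3 characters, the CLI/TUI layer prints one line, "3자 이상 키워드 권장 (trigram tokenizer 제약)" ("keywords of 3 or more characters recommended (trigram tokenizer constraint)"). `--json` mode does not print the guidance, to keep the wire output intact. In hybrid mode the vector side usually supplies results, so the guidance often does not appear.
- Update the lexical BM25 snapshot test near `crates/kebab-search/src/lexical.rs:506`: the token stream changes from words to trigrams, so the raw score distribution and the token unit of `snippet()` change.
- Add `code_lang_chunk_breakdown` from change 2 to `docs/wire-schema/v1/schema.schema.json` (handled in PR-C).
- Add a Korean search scenario to `docs/SMOKE.md` (handled in PR-A).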
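The short-query guidance condition described above (empty result and fewer than 3 characters in the whole trimmed query) might look like the following. This is only a sketch: the function name `short_query_hint`, the constant name, and the `hit_count` parameter are hypothetical and do not come from the codebase.

```rust
/// Guidance text for queries too short for the trigram tokenizer.
/// (Hypothetical constant name; the message text is taken from the design.)
const SHORT_QUERY_HINT: &str = "3자 이상 키워드 권장 (trigram tokenizer 제약)";

/// Return the guidance only when the search came back empty AND the whole
/// trimmed query has fewer than 3 Unicode characters. The length is counted
/// with `chars()`, not bytes: a Hangul syllable is 3 bytes in UTF-8.
fn short_query_hint(query: &str, hit_count: usize) -> Option<&'static str> {
    let too_short = query.trim().chars().count() < 3;
    (hit_count == 0 && too_short).then_some(SHORT_QUERY_HINT)
}
```

Measuring the whole trimmed query rather than individual tokens keeps a query such as `해시 충돌` (two 2-character tokens, 5 characters in total) from triggering the guidance.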
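### 3.4 Trade-offs
- trigram indexes only substrings of 3 or more characters (Unicode chars), as Codex verified on sqlite 3.45.1. Queries shorter than 3 characters (`값`/`키`/`충돌`) return lexical 0-hit. Under unicode61, tokenization was per eojeol and partial matching of a single token did not work either, so for single tokens this is not a regression.
- Multi-token Korean queries (`해시 충돌`), however, only hit when accompanied by the query builder redesign from §3.3. The builder redesign is the main piece of this PR.
- When a query of 2 characters or fewer returns 0 hits, the CLI/TUI prints guidance (§3.3, user decision).
- English lexical search also switches to substring matching: recall goes up and word-boundary precision may go down. The English search behaviour of lexical-only KBs changes; this is an intended behaviour change, pinned by tests.
- **BM25 score distribution changes**: the algorithm stays the same, but the token stream changes from words to overlapping trigrams, so raw score, term frequency, and document length all change. Update the lexical snapshot (§3.3). The tokens of `snippet()` are also trigram based, so the meaning of the word budget changes. hybrid (RRF) is rank based, so the effect on ranking itself is small, but the exposed `retrieval.lexical_score` value moves.
- **DB disk usage increases**: a trigram index is typically 2-10x larger than unicode61 (both the chunk body and heading_path are trigram indexed). The `kebab.sqlite` file of an existing KB grows after V007 is applied. State this in the release notes.
- **`heading_path_json` JSON noise**: trigram indexes the JSON punctuation (`[`, `"`, `,`) and the words inside it (e.g. `app`, `src`) as 3-grams → a query can accidentally overlap with JSON syntax or common path words and produce false positives. v0.17.0 keeps the column layout (the decision on a column filter or converting headings to plain text comes after dogfooding); listed under Risks.
- `remove_diacritics` in the trigram tokenizer depends on the SQLite version (3.45.0+). For compatibility, use `tokenize = 'trigram'` alone (case-insensitive by default). The SQLite version of the build environment is checked in the plan phase.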
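To make the 3-character rule concrete, here is a simplified model of what a trigram index can and cannot match. It is a sketch for intuition only, not SQLite's implementation: the names `trigrams` and `phrase_hits` are hypothetical, case folding is ignored, and FTS5 query parsing (for example, a raw unquoted query being split at spaces) is not modelled.

```rust
/// All overlapping 3-character windows of `s`, counted in Unicode scalar
/// values (not bytes). Whitespace is kept as an ordinary character.
fn trigrams(s: &str) -> Vec<String> {
    let chars: Vec<char> = s.chars().collect();
    chars.windows(3).map(|w| w.iter().collect()).collect()
}

/// Simplified model of a quoted-phrase MATCH against a trigram index:
/// the phrase must yield at least one trigram and occur contiguously.
fn phrase_hits(text: &str, phrase: &str) -> bool {
    !trigrams(phrase).is_empty() && text.contains(phrase)
}
```

In this model a 2-character query such as `충돌` produces no trigram at all and therefore can never hit, whereas the quoted phrase `시 충` does hit because the space is part of the window.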
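### 3.5 User impact
- Old binaries (≤0.16.x) are incompatible with a DB that has V007 applied → a v0.17.0 release cut is required (CLAUDE.md release cascade: a V00X migration is a trigger).
- In Korean document KBs, `--mode lexical` / `--mode hybrid` work correctly (substrings of 3 or more characters). The "lexical contribution to Korean hybrid is 0" problem found in dogfooding is resolved.
- The `kebab.sqlite` file grows because of the larger trigram index (after the automatic V007 backfill). Announce this in the release notes.
- Searching with a query of 2 characters or fewer returns lexical 0-hit + a CLI/TUI guidance message (§3.3).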
## 4. Change 2 — code_lang_chunk_breakdown
The existing `code_lang_breakdown()` in `crates/kebab-store-sqlite/src/store.rs` (doc count, `documents` GROUP BY) stays as is, and `code_lang_chunk_breakdown()` is added. The `chunks` table has no `code_lang` column of its own, so use `chunks JOIN documents ON chunks.doc_id = documents.doc_id`, pull `code_lang` from `documents.metadata_json`, and GROUP BY with `COUNT(chunks.chunk_id)`. The return type is the same `BTreeMap<String, u32>` as the existing function.
Add a `code_lang_chunk_breakdown: BTreeMap<String, u32>` field to `Stats` in `crates/kebab-app/src/schema.rs` and fill it at the stats build site by calling the new function. `crates/kebab-cli/src/wire.rs::wire_schema()` serializes `SchemaV1` as a whole through serde, so **no separate change is needed**: the new field is included in the wire output automatically. `docs/wire-schema/v1/schema.schema.json`, however, must gain `code_lang_chunk_breakdown` as an additive field (required).
The existing `code_lang_breakdown` field is kept (removing it would be wire breaking). An additive field → no migration or `schema_version` bump is needed, and it is not a release trigger.
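The difference between the doc-level and chunk-level aggregates can be illustrated with an in-memory version of the join. This is a sketch only: the `Document` and `Chunk` structs are hypothetical stand-ins for the tables, and the real function would run a SQL query against SQLite instead.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a `documents` row.
struct Document {
    doc_id: String,
    code_lang: Option<String>, // None for non-code docs such as markdown
}

/// Hypothetical stand-in for a `chunks` row.
struct Chunk {
    doc_id: String,
}

/// Count chunks per language by joining each chunk to its document.
fn code_lang_chunk_breakdown(docs: &[Document], chunks: &[Chunk]) -> BTreeMap<String, u32> {
    // doc_id -> code_lang lookup (the `documents` side of the JOIN).
    let lang_by_doc: BTreeMap<&str, &str> = docs
        .iter()
        .filter_map(|d| d.code_lang.as_deref().map(|l| (d.doc_id.as_str(), l)))
        .collect();
    let mut out = BTreeMap::new();
    for c in chunks {
        // Chunks whose document has no code_lang are skipped.
        if let Some(lang) = lang_by_doc.get(c.doc_id.as_str()) {
            *out.entry(lang.to_string()).or_insert(0u32) += 1;
        }
    }
    out
}
```

With one Rust document split into three chunks, the doc-level breakdown would report `rust: 1` while this chunk-level breakdown reports `rust: 3`, which is exactly the granularity gap the new field is meant to close.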
## 5. Change 3 — C typedef-wrapped struct fix
When the extractor in `crates/kebab-parse-code/src/c.rs` meets a top-level `type_definition` node, it detects an anonymous `struct_specifier`/`enum_specifier`/`union_specifier` inside it and emits a synthetic unit under the **typedef alias name** (extracted from the `declarator` of the `type_definition`). Named structs keep the existing path.
The **`parser_version` bump** (`PARSER_VERSION = "code-c-v1"` → `"code-c-v2"` at `crates/kebab-parse-code/src/c.rs:34`) is the cascade key of this change, because the extractor output changes. design §9 cascade: `doc_id` is derived from `(workspace_path, asset_id, parser_version)`, so bumping parser_version alone refreshes the doc_id. The chunker (`code-c-ast-v1` in `crates/kebab-chunk/src/code_c_ast_v1.rs`) is **not touched**: the chunker logic is the same.
**Actual cascade behaviour (verified by Codex round 2)**: when only parser_version changes and the file bytes are identical, `asset_id` is the same, so `stale_chunk_ids_at` on the existing ingest path (which triggers on an asset_id change) does not fire. On the `documents` INSERT under the new doc_id, either the `idx_docs_workspace_path` UNIQUE constraint conflicts, or the old doc_id row and the old chunk/vector rows remain as orphans. This PR must therefore include a **same-workspace_path orphan purge**: in the parser-mismatch branch of ingest, collect the chunk_ids of old rows at `(workspace_path, other doc_id)`, call `VectorStore::delete_by_chunk_ids` (the P7-3 hotfix helper), and replace the `documents` row. This is a separate step in plan B1.
The project is currently in the dogfood stage, so there is no prod KB.
The existing test `c_extractor_typedef_struct_falls_into_glue` now describes the opposite behaviour, so it is rewritten as `c_extractor_typedef_struct_emits_unit`. Update the HOTFIXES 2026-05-21 entry to closure, and update the Risks/notes of spec `tasks/p10/p10-1d-c-cpp-ast-chunker.md`.
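The cascade problem above can be reproduced in a few lines. The sketch below is illustrative only: the real doc_id derivation is not shown in this document, so a `DefaultHasher` over the triple stands in for it, and `DocRow` plus `purge_at_workspace_path_except` are hypothetical in-memory analogues of the `documents` rows and the proposed purge helpers.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in doc_id: any hash over the triple shows the effect.
/// (Assumption: the real scheme differs; only the inputs are from the design.)
fn doc_id(workspace_path: &str, asset_id: &str, parser_version: &str) -> u64 {
    let mut h = DefaultHasher::new();
    (workspace_path, asset_id, parser_version).hash(&mut h);
    h.finish()
}

/// Hypothetical in-memory analogue of a `documents` row with its chunks.
struct DocRow {
    doc_id: u64,
    workspace_path: String,
    chunk_ids: Vec<u32>,
}

/// Remove rows at `workspace_path` whose doc_id differs from `new_doc_id`
/// and return their chunk ids, which must also be deleted from the vector store.
fn purge_at_workspace_path_except(
    rows: &mut Vec<DocRow>,
    workspace_path: &str,
    new_doc_id: u64,
) -> Vec<u32> {
    let mut stale = Vec::new();
    rows.retain(|r| {
        let orphan = r.workspace_path == workspace_path && r.doc_id != new_doc_id;
        if orphan {
            stale.extend(r.chunk_ids.iter().copied());
        }
        !orphan
    });
    stale
}
```

Because the asset_id input is unchanged in the parser-only bump case, a purge keyed on asset_id has nothing to act on; keying on workspace_path and excluding the new doc_id is what lets the old row be found.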
## 6. PR layout / release
- **PR-A**: change 1 (trigram tokenizer). `feat/*` branch: code + V007 migration + design §5.5 + task spec in one PR (rule: a design change and the task spec that references it go in the same PR).
- **PR-B**: change 3 (C typedef). `feat/*` branch.
- **PR-C**: change 2 (code_lang_chunk_breakdown). `feat/*` branch.
- After all three merge, make a commit such as `chore: bump version 0.16.1 → 0.17.0` and run `gitea-release v0.17.0` on that same commit right away. The release notes focus on the surface affected by dogfooding: Korean lexical search behaviour, C symbol exposure, and the new `schema.v1.stats` field.
PR-A includes a design change, so the README/HANDOFF/ARCHITECTURE sync rule applies: one line on Korean search behaviour in the README Search/Configuration section, the HANDOFF "머지 후 발견된 버그/결정" (bugs/decisions found after merge) section, and a status update for the HOTFIXES round-2 entry.
## 7. Working method (team)
- **Writing code**: Claude Code, using the OMC `executor` agent; complex parts such as the migration and the extractor use `model=opus`.
- **Review**: Codex + Gemini review the diff of each PR (`/ask codex`, `/ask gemini`, via OMC ask routing). Claude combines both reviews and applies them.
- **PR creation and merge**: gitea-ops skill (Gitea REST API).
- Each PR = implement → codex+gemini review → apply → merge loop.
## 8. Test strategy
- Change 1:
- `crates/kebab-store-sqlite/tests/fts.rs`: V007 ↔ design §5.5 diff-check (test renamed to `fts_v007_matches_design_section_5_5_verbatim`).
- Korean trigram matching test: **only contiguous substrings of 3 or more characters hit**. Based on the fixture `"해시 충돌은 키와 값을 매핑할 때 발생한다"` (verified by Codex on sqlite 3.45.1): raw `MATCH '충돌은'` hits (3 contiguous characters without spaces), quoted phrase `MATCH '"해시 충돌"'` hits, quoted phrase `MATCH '"시 충"'` hits; raw `MATCH '해시충'`/`MATCH '시 충'`, on the other hand, are 0-hit (the former because the source has no such trigram, the latter because FTS5 treats the space in raw input as a token boundary). Test with quoted phrases or contiguous substrings without spaces.
- **2-character query 0-hit pin test**: a 2-character query such as `MATCH '충돌'` must return 0 results (detects regressions in the trigram setup).
- **Multi-token Korean query test** (kebab-search / kebab-app integration): the user query `해시 충돌` hits after passing through the redesigned `build_match_string()` (via the whole-phrase candidate `"해시 충돌"`). FAIL when A4 is written, PASS after A5.
- Pin English substring behaviour (a `token` query hits `tokenizer` and similar words).
- Update the lexical BM25 snapshot (near `crates/kebab-search/src/lexical.rs:506` or under `crates/kebab-search/tests/`).
- Existing `crates/kebab-app/tests/search_korean.rs` regression pin (`러스트`, 3 characters) + add a `해시 충돌` multi-token assert.
- CLI/TUI guidance message tests (query shorter than 3 characters + 0 results): stderr check in `kebab-cli`, Search pane unit test in `kebab-tui`.
- Change 2: verify the `code_lang_chunk_breakdown` field in the `crates/kebab-app/src/schema.rs` stats test (a fixture with several chunks in one doc, giving a value different from the doc count). JSON validation of `docs/wire-schema/v1/schema.schema.json`.
- Change 3: rewrite the `c.rs` typedef test (the `Point` alias is emitted as a unit), confirm `parser_version = "code-c-v2"`, no regression for named structs.
- Overall: `cargo test --workspace --no-fail-fast -j 1`, `cargo clippy --workspace --all-targets -- -D warnings`.
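## 9. Risks / notes
- The `lexical.rs::build_match_string()` redesign is the main piece of this PR: multi-token Korean queries, the policy for tokens shorter than 3 characters, and lexical snapshot drift. Codex verification confirmed that the current builder is incompatible with trigram (`해시 충돌` is 0-hit). An empty MATCH is an FTS5 syntax error, so return `None` when there is no candidate (SQL is not executed).
- PR-B's parser_version cascade: the same-bytes + parser-bump case (orphan vector/document rows) is not cleaned up by the existing asset_id-based purge in ingest (verified by Codex round 2). The explicit same-workspace_path purge is a component of PR-B. (This is a general case; every future parser_version bump may need the same reinforcement.)
- `heading_path_json` JSON noise: v0.17.0 keeps the column layout; after dogfooding, revisit a column filter (restricting lexical queries to `{text} : <q>`) or converting headings to plain text. Listed as a HOTFIXES follow-up entry.
- SQLite file size increase (trigram index): state it in the release notes. It does not affect search accuracy.
- English lexical behaviour change (substring matching): state it in the release notes.
- Lexical BM25 raw score distribution change: hybrid (RRF) is rank based, so the ranking impact is small, but the exposed `retrieval.lexical_score` value moves. The wire schema is unchanged, but any external tool that compares score values would be affected.
- C typedef fix, synthetic unit naming: in a nested typedef (`typedef struct { struct {...} inner; } Outer;`) the inner anonymous struct still falls into glue. The first iteration covers only top-level typedef aliases. State this in the spec Risks.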
## 10. contract_sections / 버전 cascade
- design §5.5 (Chunks + FTS5) — 변경 1 이 갱신 (tokenize 값 + "shadow / non-contentless" 표현).
- design §9 (versioning cascade) — 변경 3 의 **`parser_version` bump** (`code-c-v1``code-c-v2`) 가 cascade 사례. doc_id 가 `(workspace_path, asset_id, parser_version)` 기반이라 parser bump 만으로 다음 ingest 가 전체 재처리. chunker_version 은 chunk_id 에만 영향이라 본 fix 에는 불필요.
- 버전: workspace `Cargo.toml``version` 을 0.16.1 → 0.17.0 (minor bump, pre-1.0 단계 surface 변경 누적).


@@ -81,12 +81,17 @@
},
"code_lang_breakdown": {
"type": "object",
"description": "p10-1A-1: per-language code chunk count. Key = lowercase language name (e.g. 'rust', 'python'). Populated after 1A-2 lands; empty on markdown-only corpora.",
"description": "p10-1A-1: per-language **doc** count (one entry per indexed code document). Key = lowercase language name (e.g. 'rust', 'python'). Empty on markdown-only corpora. Pair with `code_lang_chunk_breakdown` for chunk-level granularity (one file's 200 chunks vs one doc).",
"additionalProperties": { "type": "integer", "minimum": 0 }
},
"repo_breakdown": {
"type": "object",
"description": "p10-1A-1: per-repo code chunk count. Key = repo name as detected by kebab-parse-code::repo. Empty on markdown-only corpora.",
"description": "p10-1A-1: per-repo **doc** count. Key = repo name as detected by kebab-parse-code::repo. Empty on markdown-only corpora.",
"additionalProperties": { "type": "integer", "minimum": 0 }
},
"code_lang_chunk_breakdown": {
"type": "object",
"description": "v0.17.0 PR-C: per-language **chunk** count (closes HOTFIXES 2026-05-22 'code_lang_breakdown chunk granularity'). Companion to `code_lang_breakdown` (doc count) — chunk-level granularity is the indexing-pressure metric (a 200-chunk PDF + a 5-chunk Rust file both appear as `1 doc` but `200` vs `5` chunks). Key = lowercase language name. Empty on markdown-only corpora.",
"additionalProperties": { "type": "integer", "minimum": 0 }
}
}


@@ -29,6 +29,10 @@
}
}
}
},
"hint": {
"type": "string",
"description": "v0.17.0 A5 Step 4b: advisory string set when the empty hit list is likely due to a query shorter than the FTS5 trigram tokenizer's 3-char minimum. Field is omitted when no advisory applies. Raw FTS5 mode ('...') opts out. MCP / agent consumers should surface this so users understand the empty result rather than retrying the same short query."
}
}
}


@@ -0,0 +1,27 @@
# 해시 테이블
해시 테이블은 키와 값을 매핑하는 자료 구조다. 해시 함수로 키를 인덱스로
변환해 평균 상수 시간에 조회·삽입·삭제한다.
## 해시 충돌
두 개 이상의 서로 다른 키가 같은 인덱스로 매핑될 때 해시 충돌이 발생한다.
해시 충돌은 잘 설계된 해시 함수에서도 피할 수 없으며, 적재율이 올라갈수록
충돌 빈도가 증가한다.
### 해시 충돌 해결법
- **체이닝**: 같은 버킷에 연결 리스트로 충돌한 항목들을 묶는다. 구현이
단순하고 적재율이 1을 넘어도 동작한다.
- **개방 주소법**: 빈 버킷을 찾아 다음 위치에 저장한다. 선형 탐사, 제곱
탐사, 이중 해싱이 있다.
## 적재율과 재해싱
적재율은 저장된 항목 수를 버킷 수로 나눈 값이다. 임계 적재율을 넘으면
테이블을 키워 재해싱한다 — 모든 항목을 새 테이블에 다시 매핑한다.
## 응용
캐시, 색인, 중복 제거, 데이터베이스 인덱스, 컴파일러의 심볼 테이블 등
광범위하게 쓰인다.


@@ -60,6 +60,7 @@ Input:
- Cite back to the user as `doc_path § heading_path[-1]` so they can open the source.
- When `truncated: true`, the budget loop modified the page (snippet shortening or k reduction). `next_cursor` is **independent** — non-null whenever more hits may be reachable. Caller may widen `max_tokens` (re-issue same query for fuller snippets / more hits per page) or follow `next_cursor` (advance through more hits) or both. Mismatched cursor (corpus_revision changed) returns `error.v1.code = stale_cursor` — re-issue the search to obtain a fresh one.
- **`trace: true` (p9-fb-37)** — debug aid. Response carries an extra `trace` block: `lexical[]` + `vector[]` (pre-fusion candidates), `rrf_inputs[]` (RRF union before final cut), and `timing` (`lexical_ms`, `vector_ms`, `fusion_ms`, `total_ms`). Trace bypasses the search cache (always cold). Use sparingly — it bloats the wire response and is for diagnosing "why did this hit / not hit", not normal retrieval.
- **`hint` (v0.17.0)** — optional advisory string on `search_response.v1`. Present only when the result is empty AND the trimmed query is shorter than the FTS5 trigram tokenizer's 3-char minimum. Surface it to the user instead of retrying the same short query. Korean lexical search benefits most from ≥3-char keywords (`충돌` zero-hit, `충돌은` substring-hit). Raw FTS5 mode (`'...'`) opts out — the user opted into FTS5 syntax. Vector / hybrid modes carry the field too but it's rarely triggered (semantic embeddings handle short queries).
### `mcp__kebab__bulk_search`


@@ -0,0 +1,60 @@
-- V007__fts_trigram.sql — Replace chunks_fts tokenizer: unicode61 → trigram.
--
-- Per design §5.5 (chunks_fts virtual table + chunks_ai/ad/au triggers).
-- The CREATE VIRTUAL TABLE / CREATE TRIGGER block below is reproduced
-- VERBATIM from `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md`
-- §5.5; CI diff-checks this against the design doc (test
-- `fts_v007_matches_design_section_5_5_verbatim` in
-- `crates/kebab-store-sqlite/tests/fts.rs`).
--
-- Tokenizer choice: trigram. Korean is agglutinative — unicode61 tokenizes
-- whole eojeol (조사·어미 attached) so substring matching fails. trigram
-- indexes 3-character grams, enabling Korean partial matches. Trade-offs:
-- DB size grows (~2-10×), English lexical also moves to substring match
-- (recall↑, precision↓), BM25 score distribution shifts. See
-- `tasks/HOTFIXES.md` (2026-05-22) and the v0.17.0 design doc.
--
-- chunks_fts is a shadow of chunks (NOT contentless — V002 DDL has no
-- `content=''`); this migration drops the old shadow, recreates it with
-- the new tokenizer, recreates the sync triggers (verbatim from V002),
-- and backfills from `chunks`. The `chunks` table and embeddings are
-- untouched, so users do NOT need to re-ingest after upgrading to
-- v0.17.0 — the migration is fully automatic.
DROP TRIGGER IF EXISTS chunks_au;
DROP TRIGGER IF EXISTS chunks_ad;
DROP TRIGGER IF EXISTS chunks_ai;
DROP TABLE IF EXISTS chunks_fts;
-- ── §5.5 verbatim block ────────────────────────────────────────────────
CREATE VIRTUAL TABLE chunks_fts USING fts5(
chunk_id UNINDEXED,
doc_id UNINDEXED,
heading_path,
text,
tokenize = 'trigram'
);
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
INSERT INTO chunks_fts(chunk_id, doc_id, heading_path, text)
VALUES (new.chunk_id, new.doc_id, new.heading_path_json, new.text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
DELETE FROM chunks_fts WHERE chunk_id = old.chunk_id;
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
DELETE FROM chunks_fts WHERE chunk_id = old.chunk_id;
INSERT INTO chunks_fts(chunk_id, doc_id, heading_path, text)
VALUES (new.chunk_id, new.doc_id, new.heading_path_json, new.text);
END;
-- ── End §5.5 verbatim block ───────────────────────────────────────────
-- One-shot backfill from existing chunks. Mirrors the V002 backfill
-- pattern — direct INSERT into chunks_fts bypasses chunks_ai trigger
-- (trigger fires on chunks INSERT, not chunks_fts INSERT), so no
-- double-insert. Refinery runs V007 exactly once via its bookkeeping
-- table, so this is naturally idempotent across restarts.
INSERT INTO chunks_fts(chunk_id, doc_id, heading_path, text)
SELECT chunk_id, doc_id, heading_path_json, text FROM chunks;


@@ -14,6 +14,129 @@ historical contract that was implemented; this file accumulates the
deltas so phase 5+ readers can find the live behavior without diffing
git history.
## 2026-05-25 - v0.17.0 post-dogfood: `[models.llm] request_timeout_secs` knob + recommended-model guide
Found during the v0.17.0 follow-up dogfooding: when a user tried the default `gemma4:e4b` (8B Q4, 9.6 GB) on a CPU-only / 16 GB RAM machine, the first RAG answer always exceeded the 5-minute (hard-coded 300 s) limit and failed with `error: kb-rag: llm.generate_stream`. Memory was also under pressure, with ollama RSS at 10.7 GB and 2 GB free. For the results of the follow-up dogfooding (32 minutes / 199 mem-monitor samples), see this entry in `tasks/HOTFIXES.md` and the dogfooding report in the conversation.
**Changes**:
- Adds an additive `request_timeout_secs: u64` field to `crates/kebab-config/src/lib.rs::LlmCfg` (`#[serde(default = "default_llm_request_timeout_secs")]`, default `300`). Old configs that omit the field still parse and behave identically (3 new unit tests pin default / env override / legacy parse).
- env override `KEBAB_MODELS_LLM_REQUEST_TIMEOUT_SECS`.
- Removes the `REQUEST_TIMEOUT` constant from `crates/kebab-llm-local/src/ollama.rs`. `OllamaLanguageModel::new` builds the reqwest blocking client with `Duration::from_secs(llm.request_timeout_secs)`. The doc comment is updated to match.
- Adds, to the prerequisites section of `README.md` and the ollama notes in `docs/SMOKE.md`, the recommended models (≤ 4B Q4: `gemma3:4b` / `qwen2.5:3b` / `phi3:mini`) and a one-line anchor to the timeout knob. Warns in advance about the timeout pattern when trying 8B+ models.
- Adds `request_timeout_secs: default_llm_request_timeout_secs()` to the LlmCfg literal in `crates/kebab-config/src/lib.rs::Config::defaults`, plus a one-line comment with the CPU-only recommendation.
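The resolution order described above (built-in default of 300 s, optional config value, env override) can be sketched with the standard library alone. This is an illustration under assumptions, not the crate's code: `resolve_request_timeout` is a hypothetical name, the real project uses serde defaults and its own env layer, and falling back when the env value does not parse is a choice made only for this sketch.

```rust
use std::time::Duration;

/// Default taken from the entry above (the previously hard-coded 300 s).
const DEFAULT_LLM_REQUEST_TIMEOUT_SECS: u64 = 300;

/// Hypothetical helper: the env value wins over the config value, which
/// wins over the default. An env value that is not a valid u64 is ignored
/// (an assumption of this sketch, not documented behavior).
fn resolve_request_timeout(config_value: Option<u64>, env_value: Option<&str>) -> Duration {
    let secs = env_value
        .and_then(|v| v.trim().parse::<u64>().ok())
        .or(config_value)
        .unwrap_or(DEFAULT_LLM_REQUEST_TIMEOUT_SECS);
    Duration::from_secs(secs)
}
```

A caller would pass the parsed config field and the value of `KEBAB_MODELS_LLM_REQUEST_TIMEOUT_SECS` (for example via `std::env::var(...).ok().as_deref()`), then hand the resulting `Duration` to the HTTP client builder.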
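**Not done (out of scope)**:
- `crates/kebab-parse-image/src/ocr.rs::REQUEST_TIMEOUT` is the same hard-coded 300 s. OCR is usually short, so it is less of a burden than the LLM, but for consistency it will be revisited next round with the same knob (or a separate one).
- Emphasize recommending `kebab ask --stream` (fb-33): it surfaces the first token quickly during a 5-minute cold start, a UX improvement. A one-line addition to README/SKILL.md follows later.
**Follow-up dogfooding baseline preserved**: `/build/cache/dogfood-v017/` (466 MB workspace + DB + memory.log), `/build/cache/ollama/` (21 GB binary + gemma3:4b/gemma4:e4b models). Kept for regression comparison in the next round.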
Cross-link: `crates/kebab-config/src/lib.rs::LlmCfg::request_timeout_secs`, `crates/kebab-llm-local/src/ollama.rs::OllamaLanguageModel::new`.
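## 2026-05-24 - v0.17.0: Korean trigram FTS5 tokenizer adopted (closure of 2026-05-22 Korean lexical)
The V007 migration switches the `chunks_fts` tokenizer from `unicode61` to `trigram`. The `chunks` source rows, embeddings, and vector index are untouched; only the FTS shadow is rebuilt and automatically backfilled, so users do not need to re-run `kebab ingest` (swap the binary and V007 is applied on the next open). The other two follow-ups from the same round (`code_lang_chunk_breakdown`, C typedef) are separate PRs (PR-C / PR-B).
**Korean lexical behavior**: substring matching for 3+ characters. A multi-token query made of 2-character tokens, such as `해시 충돌`, takes the form `("해시 충돌") OR ("해시" "충돌")` through the trigram-aware redesign of `crates/kebab-search/src/lexical.rs::build_match_string` and hits via the whole-phrase candidate (each token is 2 characters, so the token-AND candidate is 0-hit under trigram and is dropped automatically). A mixed Korean/English query `Rust 충돌은` (both tokens ≥3 characters) is also OR-combined. Queries of 2 characters or fewer (`충돌` / `키`) correctly return 0 hits, accompanied by the CLI stderr message `[hint] 3자 이상 키워드 권장 (trigram tokenizer 제약)` (roughly: keywords of 3+ characters recommended, a trigram tokenizer constraint), the additive `search_response.v1.hint` field, and the same notice in the TUI status bar. Raw FTS5 single-quote mode (`'...'`) is explicit user intent, so no hint is emitted. Regression pins: `lexical_multi_token_korean_query_hits` + `lexical_mixed_korean_english_multi_token_query_hits` (`crates/kebab-app/tests/search_korean.rs`).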
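A rough sketch of a builder with the shape described above (whole-phrase candidate OR token-AND candidate, tokens under 3 characters dropped, `None` when nothing survives, raw single-quote mode passed through). It is one plausible reading of this entry, not the code in `lexical.rs`: the helper `quote`, the whitespace normalization, and the exact parenthesization are assumptions.

```rust
/// Wrap a term as an FTS5 string literal, doubling embedded double quotes.
fn quote(term: &str) -> String {
    format!("\"{}\"", term.replace('"', "\"\""))
}

/// Hypothetical trigram-aware MATCH builder (sketch only).
fn build_match_string(query: &str) -> Option<String> {
    let trimmed = query.trim();
    // Raw FTS5 mode: '...' is passed through untouched (assumed convention).
    if trimmed.len() >= 2 && trimmed.starts_with('\'') && trimmed.ends_with('\'') {
        let inner = trimmed[1..trimmed.len() - 1].trim();
        return if inner.is_empty() { None } else { Some(inner.to_string()) };
    }
    let tokens: Vec<&str> = trimmed.split_whitespace().collect();
    let mut candidates: Vec<String> = Vec::new();
    // Whole-phrase candidate: usable once the phrase has at least one trigram.
    let phrase = tokens.join(" ");
    if phrase.chars().count() >= 3 {
        candidates.push(format!("({})", quote(&phrase)));
    }
    // Token-AND candidate: only tokens of 3+ characters can match under trigram.
    if tokens.len() > 1 {
        let long: Vec<String> = tokens
            .iter()
            .copied()
            .filter(|t| t.chars().count() >= 3)
            .map(quote)
            .collect();
        if !long.is_empty() {
            candidates.push(format!("({})", long.join(" ")));
        }
    }
    // An empty MATCH would be an FTS5 syntax error, so signal "run no SQL".
    if candidates.is_empty() { None } else { Some(candidates.join(" OR ")) }
}
```

With this reading, `해시 충돌` keeps only its whole-phrase candidate, `Rust 충돌은` keeps both candidates, and `충돌` yields `None`.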
**English lexical behavior change**: now substring matching. The query `token` also hits `tokenizer` (recall ↑, word-boundary precision ↓). Intended change; regression pin = `fts_trigram_english_substring_hits` (`crates/kebab-store-sqlite/tests/fts.rs`).
**Lexical BM25 score distribution**: the algorithm is the same, but the token stream changes from words to overlapping trigrams, so raw score, TF, and doc length all differ. Both `crates/kebab-search/tests/lexical.rs::lexical_snapshot_run_1` and `crates/kebab-search/tests/hybrid.rs::hybrid_snapshot_run_1` are regenerated against the trigram baseline. hybrid (RRF) is rank-based, so the ranking impact is minimal, but the exposed `retrieval.lexical_score` values change.
**Disk usage**: a trigram index is typically 2-10× the size of a unicode61 index. After the V007 automatic backfill the `kebab.sqlite` file grows (for the dogfooding KB, roughly 2-5× or several hundred MB). Noted in the release notes.
**`heading_path_json` JSON noise (observed, not fixed)**: trigram indexes the JSON punctuation (`[`, `"`, `,`) and the words inside it (`app`, `src`) as 3-grams, so a query that happens to overlap JSON syntax or common path words can produce false positives. v0.17.0 keeps the column layout; after dogfooding, decide between a column filter (restricting to `{text} : <q>`) and converting headings to plain text. To be recorded as a follow-up dogfooding entry.
**MCP / agent visibility**: adds an additive `hint: Option<String>` field to `search_response.v1`. It is set only when the result is empty, the trimmed query's `chars().count()` is below 3, and raw mode is not in use (helper `kebab_app::short_query_hint`). The search section of `integrations/claude-code/kebab/SKILL.md` gains the guidance "for Korean lexical search, 3+ characters are recommended; check the `hint` field".
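The three conditions for setting `hint` can be written down compactly. The sketch below borrows the helper's name from the text, but its signature, the raw-mode check, and the English message are assumptions made for illustration; the real helper lives in `kebab_app` and may differ.

```rust
/// Assumed raw-mode convention: the whole query is wrapped in single quotes.
fn is_raw_fts5(query: &str) -> bool {
    let q = query.trim();
    q.len() >= 2 && q.starts_with('\'') && q.ends_with('\'')
}

/// Hypothetical signature: returns an advisory only for an empty result,
/// a trimmed query under 3 characters, and non-raw mode.
fn short_query_hint(query: &str, hit_count: usize) -> Option<String> {
    let trimmed = query.trim();
    if hit_count > 0 || is_raw_fts5(trimmed) || trimmed.chars().count() >= 3 {
        return None;
    }
    // Placeholder wording; the actual CLI message is the Korean string quoted above.
    Some("query is shorter than 3 characters; the trigram tokenizer cannot match it".to_string())
}
```

Counting with `chars()` rather than `len()` matters here: a two-syllable Korean query is 6 bytes in UTF-8 but only 2 characters.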
Cross-link: `migrations/V007__fts_trigram.sql`, `crates/kebab-search/src/lexical.rs::build_match_string`, design §5.5, `docs/superpowers/specs/2026-05-22-korean-trigram-tokenizer-design.md`.
## 2026-05-22 - p10 comprehensive dogfooding (round 2): limits of Korean lexical search + code_lang_breakdown
**Origin**: P10 comprehensive dogfooding round 2 (`/build/cache/dogfood-p10b/`). 8 repos from varied OSS codebases (rust / python / go / ts / js / java / c / cpp) plus 10 Korean wiki technical articles (converted from HTML to gfm with pandoc). Ingested after enabling the `multilingual-e5-small` embedding: `scanned=2663 updated=2080 errors=0` (the k8s multi-resource chunk_id collision was found and fixed in the same round; see the 2026-05-21 entry below).
### Korean lexical search is useless under the FTS5 unicode61 tokenizer (vector/hybrid work around it)
**Symptom**: Korean queries with `kebab search --mode lexical` return almost 0 hits. "충돌" appears 37 times in the body of hash-table.md (21 times as a standalone eojeol), yet lexical returns 0 hits. Four Korean queries measured in lexical mode: `충돌` 0 / `해시 충돌` 0 / `컴파일러 최적화` 0 / `트리 순회 방법` 1.
**Cause**: `chunks_fts` uses `tokenize = 'unicode61 remove_diacritics 2'` (`migrations/V002__fts.sql:24`, design §5.5 verbatim block). unicode61 splits tokens only at whitespace and punctuation boundaries, so in Korean a whole eojeol becomes one token with its particles and endings attached, and partial matching fails. The V002 header comment had already flagged this as "Korean morphological tokenizer is a P+ note".
**Verification (vector/hybrid workaround confirmed)**: the same 4 queries measured with `--mode vector` / `--mode hybrid` all return 10 hits. `multilingual-e5-small` semantic search handles Korean correctly. In other words, a KB with embeddings enabled has **working Korean search in the default hybrid mode**. However, hybrid is an RRF (lexical+vector) fusion, so for Korean queries the lexical contribution is 0 and the result is effectively vector-only (score evidence: only `트리 순회 방법`, which also hit in lexical, has a hybrid score of 1.000; the other Korean queries score 0.500).
**Status**: ✅ closed. Resolved in v0.17.0 (2026-05-24) by the V007 trigram migration and the trigram-aware redesign of `lexical.rs::build_match_string`. For the impact, see the 2026-05-24 section above. What follows is the original round-2 observation record from before closure (frozen).
**Workaround (pre-v0.17.0)**: for a Korean-document KB, enabling embeddings (`[models.embedding] provider = "fastembed"`) was effectively mandatory, since vector / hybrid carried Korean.
**Resolution (v0.17.0)**: adopted the FTS5 builtin `trigram` tokenizer. `chunks_fts` is recreated by the V007 migration (`chunks` source rows / embeddings / vectors unchanged; only the FTS shadow is automatically backfilled, so no re-ingest is needed). The design §5.5 verbatim block and the CI diff-check (`fts_v007_matches_design_section_5_5_verbatim`) are updated alongside.
### code_lang_breakdown counts docs, not chunks
**Symptom**: `schema.v1.stats.code_lang_breakdown` reports the number of *documents* per language. When inspecting the per-language chunk distribution of a code-heavy KB, doc-level granularity is less useful.
**Status**: LOW. `code_lang_breakdown` was deliberately implemented as a doc count by p10-1A-2 (doc comment on `store.rs::code_lang_breakdown` + `COUNT(*) FROM documents GROUP BY code_lang`). Strictly speaking it deviates from the "per-language distribution" intent of design §3.5, but it is limited to statistics display and does not affect search/ingest behavior.
**Next step**: a small follow-up that adds chunk-level counting or switches to it. If the wire schema is affected, consider handling it as an additive field (`code_lang_chunk_breakdown`).
### ranking: glue chunks as top hit (still deferred)
The ranking bias between body vs test / glue chunks observed in the multi-root dogfooding (2026-05-20) was reconfirmed in round 2. An automatic heuristic risks user-intent misalignment, so surface changes stay at 0 until the user explicitly asks (per the project memory `project_ranking_deferred` decision).
Cross-link: `tasks/p10/INDEX.md`, `migrations/V002__fts.sql`, design §5.5 / §3.5.
## 2026-05-24 - v0.17.0 PR-B: C typedef-wrapped struct/enum/union is emitted as a typedef alias unit (closure of 2026-05-21)
Adds a `type_definition` branch to `crates/kebab-parse-code/src/c.rs::extract_blocks`. For a typedef whose inner `struct_specifier` / `enum_specifier` / `union_specifier` is anonymous (no name field), it extracts the typedef alias identifier from the declarator and emits a synthetic unit. A named inner aggregate (`typedef struct Pt { ... } P;`) and a plain alias (`typedef int MyInt;`) remain glue as before (only top-level typedef-wrapped anonymous aggregates are in v2's first-pass scope).
**parser_version cascade**: `PARSER_VERSION` is bumped from `code-c-v1` to `code-c-v2`. Per design §9, `doc_id = (workspace_path, asset_id, parser_version)`. The same file (asset_id unchanged) plus a new parser_version yields a new doc_id. As a result, an INSERT of the new doc_id is attempted at a workspace_path still held by the old doc_id, which violates the `idx_docs_workspace_path` UNIQUE constraint.
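The cascade can be illustrated with a toy derivation. The real `doc_id` hashing scheme is not shown in this document, so the sketch below substitutes the standard library's `DefaultHasher` purely as a stand-in; the function name and the `u64` return type are assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for the real derivation: any deterministic function of
/// (workspace_path, asset_id, parser_version) shows the same effect.
fn doc_id(workspace_path: &str, asset_id: &str, parser_version: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    (workspace_path, asset_id, parser_version).hash(&mut hasher);
    hasher.finish()
}
```

Holding `workspace_path` and `asset_id` fixed and changing only `parser_version` from `code-c-v1` to `code-c-v2` produces a different id, which is why the old row at the same path has to be purged before the new one is inserted.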
**Same-workspace_path orphan purge (B1 Step 5b)**: two new helpers in `crates/kebab-store-sqlite/src/store.rs`: `stale_chunk_ids_for_workspace_path_except_doc_id(workspace_path, keep_doc_id)` (collects chunk_ids) and `purge_document_at_workspace_path_except_doc_id(workspace_path, keep_doc_id)` (removes the document/chunks via CASCADE). The parser_mismatch branch of `crates/kebab-app/src/lib.rs::try_skip_unchanged` calls the `purge_workspace_path_for_parser_bump` wrapper, which first cleans up the LanceDB orphans of the old chunk_ids with `delete_by_chunk_ids`, then removes the SQLite document row, then returns `Ok(None)` so that the caller INSERTs with the new doc_id. The existing `purge_orphan_at_workspace_path` (the asset_id-changed case) is unchanged, so the bytes-changed path does not regress.
**User impact**: C files in an existing v0.16.x KB are reprocessed automatically at the next ingest with the v0.17.0 binary (parser_version mismatch → cleanup → new doc). No explicit re-ingest command is needed (the next `kebab ingest` handles it). `typedef struct {...} Foo;` is exposed to search as `Citation::Code.symbol = "Foo"`.
**Unresolved (Risks)**: the inner anonymous struct of a nested typedef (`typedef struct { struct {...} inner; } Outer;`) is still glue; v2's first-pass scope covers only top-level typedef aliases.
Cross-link: `crates/kebab-parse-code/src/c.rs::recover_typedef_alias`, `tasks/p10/p10-1d-c-cpp-ast-chunker.md` Risks/notes section.
## 2026-05-24 - v0.17.0 PR-C: `code_lang_chunk_breakdown` additive wire field (closure of 2026-05-22 LOW)
Adds the additive field `code_lang_chunk_breakdown: { <lang>: <chunk_count> }` to `schema.v1.stats`. It is a sister of the existing `code_lang_breakdown` (doc count): it counts chunks to expose indexing-pressure granularity. This closes the limitation where one PDF spec with 200 chunks and one Rust file with 5 chunks both appeared as the same `1 doc`.
**Implementation**: `crates/kebab-store-sqlite/src/store.rs::code_lang_chunk_breakdown()` runs `chunks INNER JOIN documents`, groups by `json_extract(d.metadata_json, '$.code_lang')`, and takes `COUNT(c.chunk_id)`. It returns `BTreeMap<String, u32>` (same shape as the existing helper). A field of the same name is added to `crates/kebab-app/src/schema.rs::Stats` and populated from the `collect_stats` builder. The additive field is specified in `docs/wire-schema/v1/schema.schema.json`. **Additive change: not wire breaking, no `schema_version` bump needed.**
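To make the doc-count versus chunk-count distinction concrete, here is an in-memory analogue of the two aggregations. The `Doc` and `Chunk` structs are hypothetical stand-ins for the `documents` and `chunks` rows; the real helpers run SQL, and only the `BTreeMap<String, u32>` return shape is taken from the text.

```rust
use std::collections::BTreeMap;

/// Hypothetical row stand-ins: a document with an optional code language,
/// and a chunk that points back at its document.
struct Doc { doc_id: u32, code_lang: Option<&'static str> }
struct Chunk { doc_id: u32 }

/// Doc-level count per language (what `code_lang_breakdown` reports).
fn code_lang_breakdown(docs: &[Doc]) -> BTreeMap<String, u32> {
    let mut out = BTreeMap::new();
    for lang in docs.iter().filter_map(|d| d.code_lang) {
        *out.entry(lang.to_string()).or_insert(0) += 1;
    }
    out
}

/// Chunk-level count per language: join chunks to their document's language.
fn code_lang_chunk_breakdown(docs: &[Doc], chunks: &[Chunk]) -> BTreeMap<String, u32> {
    let lang_of: BTreeMap<u32, &str> = docs
        .iter()
        .filter_map(|d| d.code_lang.map(|l| (d.doc_id, l)))
        .collect();
    let mut out = BTreeMap::new();
    for chunk in chunks {
        // Chunks of non-code documents (no code_lang) are skipped.
        if let Some(lang) = lang_of.get(&chunk.doc_id) {
            *out.entry((*lang).to_string()).or_insert(0) += 1;
        }
    }
    out
}
```

One Rust document with 5 chunks shows up as `rust: 1` in the first map and `rust: 5` in the second, which is the gap the new field is meant to expose.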
**Gemini round 2 recommendation applied**: the JSON schema descriptions of the existing `code_lang_breakdown` / `repo_breakdown` wrongly said "code chunk count" (they are actually doc counts); these are corrected to "doc count". Only the new field says "chunk count". Users can tell the two metrics apart from the schema alone.
**User impact**: a new key appears in `kebab schema --json` output, and likewise in the MCP `schema` tool. Calls sent by old v0.16.x clients keep working (additive).
Cross-link: `crates/kebab-store-sqlite/src/store.rs::code_lang_chunk_breakdown`, `docs/wire-schema/v1/schema.schema.json`.
## 2026-05-21 - p10-2: k8s multi-resource YAML chunk_id collision
**Origin**: P10 comprehensive dogfooding (`/tmp/kebab-p10-dogfood/`, 16 files). A YAML file holding 2+ k8s documents (Deployment + Service, separated by `---`) fails to ingest.
**Symptom**: `DocumentStore::put_chunks (code): UNIQUE constraint failed: chunks.chunk_id`. The document row is created but with 0 chunks, so it is unsearchable. The p10-2 integration test `tier2_k8s_yaml_ingest_searchable` used only a single-Deployment fixture and missed it.
**Cause**: the non-oversize branch of `tier2_shared::push_chunks_with_oversize` hard-codes `split_key = None`. `K8sManifestResourceV1Chunker` calls it once per resource, and every resource of the same document shares `doc_id` + `chunker_version` + `base_policy_hash` with `split_key = None`, giving an identical `id_hash` and therefore an identical `chunk_id`. p10-3's `code_text_paragraph_v1` had the same bug and was fixed in `df3c5b8`, but that fix covered the direct `build_chunk_no_symbol` call path; the `push_chunks_with_oversize` path was left unfixed.
**Fix** (PR #158, v0.16.1): adds `base_split_key: Option<u32>` to `push_chunks_with_oversize`. The k8s chunker passes `Some(resource.line_start)`, giving each resource a distinct chunk_id. dockerfile / manifest pass `None` (1 chunk per file, no collision, chunk_id unchanged).
**Deviation note**: the chunk_id of single-resource k8s YAML also changes, `None → Some(1)` (within `id_hash`, `base_policy_hash` → `base_policy_hash#L1`). `chunker_version` (`k8s-manifest-resource-v1`) is deliberately not bumped: p10-2 was merged in v0.14.0 (about a week earlier) during the dogfood stage, so there is no prod KB. A KB that indexed single-resource k8s YAML between v0.14.0 and v0.16.0 may be left with orphaned old chunks on re-ingest (not a UNIQUE conflict, since the ids differ), but `kebab reset` or a re-ingest sweep cleans them up. During the dogfood-only stage this is a lighter choice than a chunker_version bump (full re-process).
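The collision and its fix reduce to whether the hashed key differs per resource. The sketch below models only the `base_policy_hash` / `base_policy_hash#L<n>` part mentioned in the deviation note; `id_hash_input` and `count_distinct` are hypothetical names, and the other hash inputs (`doc_id`, `chunker_version`) are omitted because they are identical for all resources of one document.

```rust
use std::collections::HashSet;

/// Hypothetical key builder: `None` leaves the base hash untouched,
/// `Some(line)` appends a `#L<line>` suffix as described in the note above.
fn id_hash_input(base_policy_hash: &str, split_key: Option<u32>) -> String {
    match split_key {
        Some(line) => format!("{base_policy_hash}#L{line}"),
        None => base_policy_hash.to_string(),
    }
}

/// Number of distinct keys produced for the resources of one document.
fn count_distinct(base_policy_hash: &str, split_keys: &[Option<u32>]) -> usize {
    split_keys
        .iter()
        .map(|k| id_hash_input(base_policy_hash, *k))
        .collect::<HashSet<_>>()
        .len()
}
```

Two resources with `None` collapse to one key (the UNIQUE failure), while `Some(1)` and `Some(12)` stay distinct.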
Cross-link: `tasks/p10/p10-2-tier2-resource-aware.md` Risks/notes section.
## 2026-05-21 — p10-1D: typedef-wrapped struct/enum in C falls into glue
**Origin**: PR #156 (p10-1d) code-reviewer review. Verified during dogfood.
**Symptom**: `typedef struct { ... } Foo;` in a `.c` file does NOT emit a struct-level unit. tree-sitter-c classifies the construct as a top-level `type_definition` with an *anonymous* inner `struct_specifier` (no `name` field), so the extractor's `struct_specifier` arm doesn't fire — the whole declaration falls into `<top-level>` glue. The named typedef alias `Foo` is therefore not searchable as a symbol.
**Status**: ✅ closed. Resolved in v0.17.0 (2026-05-24) PR-B by adding a `type_definition` branch to the extractor. For the impact, see the 2026-05-24 PR-B section above. What follows is the round-2 dogfood observation record from before closure (frozen).
**Workaround (pre-v0.17.0)**: search the struct by its field/function names, or use `--code-lang c` to broaden scope. Typedef-aliased struct names won't surface as `Citation::Code.symbol`.
**Resolution (v0.17.0)**: when the extractor encounters a top-level `type_definition` node containing an anonymous inner `struct_specifier` / `enum_specifier` / `union_specifier`, it emits a synthetic unit named after the typedef alias in the `declarator` field. `PARSER_VERSION` is bumped from `code-c-v1` to `code-c-v2`. The design §9 cascade applies: for the same `(workspace_path, asset_id)`, `doc_id` is computed differently under the new parser_version. The change ships with the same-workspace_path orphan purge helpers (`stale_chunk_ids_for_workspace_path_except_doc_id` + `purge_document_at_workspace_path_except_doc_id`) to avoid leaving old doc/chunks rows and LanceDB orphans behind.
Cross-link: `tasks/p10/p10-1d-c-cpp-ast-chunker.md` Risks/notes section.
## 2026-05-20 — p10-1B: Rust 1A-2 symbol path is file-scope-only; 1B+ uses workspace path → module prefix
**What changed**: symbols produced by P10-1A-2's Rust `code-rust-ast-v1` chunker use only file-scope mod-path nesting (e.g. `Foo::double`). From P10-1B onward, Python / TypeScript / JavaScript symbols include a workspace path → module path prefix (e.g. `kebab_eval.metrics.compute_mrr`, `src/Foo.Foo.search`).


@@ -144,9 +144,17 @@ P0~P5 는 직렬. P6~P9 는 P5 이후 병렬 가능.
- [p10-1B Python + TS/JS AST chunkers](p10/p10-1b-py-ts-js-ast-chunkers.md): 🟡 PR open (code complete, awaiting merge)
- p10-1C-Go Go AST chunker: 🟡 PR open (v0.12.0, `code-go-ast-v1`)
- p10-1C-JavaKotlin Java + Kotlin AST chunkers: 🟢 PR open (v0.13.0, `code-java-ast-v1` / `code-kotlin-ast-v1`)
- p10-1D C + C++ AST chunkers —
- p10-1D C + C++ AST chunkers: ✅ merged (v0.16.0, `code-c-ast-v1` + `code-cpp-ast-v1`)
- p10-2 Tier 2 resource-aware: ✅ merged (v0.14.0, `k8s-manifest-resource-v1` / `dockerfile-file-v1` / `manifest-file-v1`)
- p10-3 Tier 3 paragraph + line-window fallback —
- p10-3 Tier 3 paragraph + line-window fallback: ✅ merged (v0.15.0, `code-text-paragraph-v1`)
### 🎯 P10 Dogfooding Feedback (v0.17.0)
Three follow-ups found in dogfooding round 2 (2026-05-22). spec + plan: `docs/superpowers/specs/2026-05-22-korean-trigram-tokenizer-design.md`, `docs/superpowers/plans/2026-05-22-korean-trigram-tokenizer.md`. release: [v0.17.0](https://gitea.altair823.xyz/altair823-org/kebab/releases/tag/v0.17.0).
- **PR-A Korean trigram FTS5 tokenizer + lexical builder + hint**: ✅ merged (#159, 2026-05-24). The V007 migration switches `chunks_fts` from `unicode61` to `trigram`. Trigram-aware redesign of `lexical.rs::build_match_string` (whole-phrase OR token-AND, tokens under 3 characters dropped, raw FTS5 mode retained). Additive `SearchResponse.hint` field + CLI/TUI notice. English lexical search also changes to substring matching.
- **PR-B C typedef alias unit + parser_version cascade**: ✅ merged (#160, 2026-05-24). `type_definition` branch: a synthetic unit named after the alias of a top-level typedef-wrapped anonymous struct/enum/union. `PARSER_VERSION` bump `code-c-v1` → `code-c-v2` + same-workspace_path orphan purge cascade.
- **PR-C `code_lang_chunk_breakdown` additive wire field**: ✅ merged (#161, 2026-05-24). Adds a chunk-count sister field to `schema.v1.stats` and corrects the JSON schema descriptions of the existing `code_lang_breakdown` / `repo_breakdown` (mislabeled "chunk count" → "doc count").
## Post-merge hotfixes


@@ -7,8 +7,8 @@
| 1B | Python + TS/JS AST chunkers | 🟡 PR open (code complete, awaiting merge) |
| 1C-Go | Go AST chunker (`code-go-ast-v1`) | 🟡 PR open (v0.12.0) |
| 1C-JavaKotlin | Java + Kotlin AST chunkers (`code-java-ast-v1` / `code-kotlin-ast-v1`) | 🟢 PR open (v0.13.0) |
| 1D | C + C++ AST chunkers | |
| 1D | C + C++ AST chunkers | ✅ merged (v0.16.0) |
| 2 | Tier 2 resource-aware (k8s / Dockerfile / manifest) | ✅ merged (v0.14.0) |
| 3 | Tier 3 paragraph + line-window fallback | |
| 3 | Tier 3 paragraph + line-window fallback | ✅ merged (v0.15.0) |
Design: [2026-05-15-kebab-code-ingest-design.md](../../docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md)


@@ -0,0 +1,120 @@
# p10-1D — C + C++ AST chunkers
**Status:** 🟡 in progress
**Contract sections:** §3.3 (chunker_version `code-c-ast-v1` + `code-cpp-ast-v1`), §3.4 (symbol path — C `func_name`, C++ `namespace::Class::method`), §3.5 (code_lang `c` + `cpp`, ext `.c`/`.h` / `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx`), §6.1 (`kebab-parse-code/src/{c,cpp}.rs`), §6.2 (`kebab-chunk/src/code_{c,cpp}_ast_v1.rs`), §9.1 (Tier 1 AST per-language + oversize fallback), §10 (activation log).
**Design:** [2026-05-15-kebab-code-ingest-design.md](../../docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md) §1D (C + C++ part).
**Plan:** [2026-05-21-p10-1d-c-cpp-ast-chunker.md](../../docs/superpowers/plans/2026-05-21-p10-1d-c-cpp-ast-chunker.md).
## Goal
Activate two AST chunkers, C and C++, in a single PR on top of the p10-1A-2 / 1B / 1C / p10-2 / p10-3 infrastructure. This is the last of P10's Tier 1 chunker family. From the moment of merge, `.c` / `.h` / `.cpp` / `.cc` / `.cxx` / `.hpp` / `.hh` / `.hxx` files can be dogfooded.
`.h` maps to C, as the design states. For a `.h` in a C++ project, tree-sitter-c may fail to parse C++ syntax such as namespace / template. On failure the file is picked up automatically by p10-3's Tier 3 fallback (already wired).
## Frozen design decisions (fixed by this task)
### C extractor (`code-c-ast-v1`)
- **Symbol** = function name only. Exactly as in design §3.4: no nesting, no namespace. E.g. `parse_blocks`.
- **Top-level units**:
- `function_definition` (named) → 1 unit, symbol = function name
- `struct_specifier` (named, top-level) → 1 unit, symbol = struct name
- `enum_specifier` (named, top-level) → 1 unit, symbol = enum name
- `union_specifier` (named, top-level) → 1 unit, symbol = union name
- `declaration` (top-level — typedef / global var / fn prototype) → glue `<top-level>`
- `preproc_include` / `preproc_def` / `preproc_function_def` / `preproc_ifdef` and other preprocessor nodes → glue `<top-level>`
- **Static / extern / inline fn**: handled like an ordinary fn (storage class qualifiers are ignored; the symbol is only the fn name from the declarator).
- **Nested declarations inside an inner struct / enum** (possible in C too): the 1B Python class-nesting rule is not applied. Inner types are rare in C and the outer type is typically a typedef wrapper, so only top-level units are emitted.
- **Empty file or 0 units** → `<module>` post-pass (1A-2 pattern).
### C++ extractor (`code-cpp-ast-v1`)
- **Symbol** = `namespace::Class::method` (as in design §3.4). Without a namespace, `Class::method` or `func_name`. E.g. `kebab::chunk::MdHeadingV1Chunker::chunk_doc`.
- **Top-level units + recursion**:
- `namespace_definition` (named) → recurse with namespace name pushed (Python class-nesting + Java/Kotlin package-prefix hybrid).
- **Anonymous namespace** (`namespace { ... }`) → push `<anonymous>` as the namespace name (consistent with the Python `<unnamed>` pattern).
- `class_specifier` / `struct_specifier` (top-level or in namespace or nested in class, named) → recurse with class name pushed.
- `function_definition` (top-level or in namespace or in class) → 1 unit, symbol per nesting (`namespace::Class::method` / `namespace::func` / `Class::method` / `func_name`).
- `template_declaration` → recurse / emit depending on the inner declarator type (function template → method emit, class template → class recurse). Template type params (`<T>`, `<typename T>`) are not included in the symbol (same as the Go generic handling).
- `enum_specifier` (named) → 1 unit, symbol per nesting.
- `concept_definition` (C++20) → 1 unit, symbol per nesting (treat as type-level definition).
- `using_declaration` / `using_directive` / `preproc_include` / `preproc_def` etc. → glue `<top-level>`.
- Definitions inside an `extern "C"` block → handled as ordinary fns (the block itself is glue).
- **Method out-of-class definition** (defined outside the namespace in `Class::method` form): restore the prefix by following the `qualified_identifier` of tree-sitter-cpp's `function_declarator`, extracting it from the declarator's `Class::method` itself.
- **Operator overload** (`operator+`, `operator()` etc.): symbol = `Class::operator+` as-is.
- **Constructor / destructor**: symbol = `Class::Class` / `Class::~Class` (convention).
- **Empty file or 0 units** → `<module>` post-pass.
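The symbol rule above amounts to joining the current nesting stack with the unit name. A minimal sketch, assuming the extractor keeps a stack of enclosing namespace / class names while it recurses (`symbol_path` is a hypothetical helper, not a function from `cpp.rs`):

```rust
/// Join the enclosing scopes (outermost first) and the unit's own name
/// with `::`. An empty scope stack yields the bare name.
fn symbol_path(scope: &[&str], name: &str) -> String {
    let mut parts: Vec<&str> = scope.to_vec();
    parts.push(name);
    parts.join("::")
}
```

An anonymous namespace would push the literal `<anonymous>` onto the stack, and an out-of-class definition can pass the already qualified declarator text (`Class::method`) as `name` with whatever scope encloses it.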
### Common
- **`<top-level>` glue grouping**: preprocessor nodes, global vars, using declarations, and anything else outside semantic units → 1 glue chunk per file.
- **Oversize fallback**: same `AST_CHUNK_MAX_LINES = 200` as 1A-2.
- **Fallback guarantee for `.h`**: on C parser failure, p10-3's Tier 3 fallback wrapper (already wired) picks the file up → `Citation::Code { symbol: None, lang: "c" }` + `code-text-paragraph-v1`.
### Module layout
```
crates/kebab-parse-code/src/
├── c.rs [new] - C AST extractor (PARSER_VERSION `tree-sitter-c-<ver>`)
├── cpp.rs [new] - C++ AST extractor (PARSER_VERSION `tree-sitter-cpp-<ver>`)
└── lib.rs [edit] - pub use + expose the C_PARSER_VERSION / CPP_PARSER_VERSION constants
crates/kebab-chunk/src/
├── code_c_ast_v1.rs [new] - VERSION_LABEL `code-c-ast-v1`. 1A-2 pattern (canonical Document → Vec<Chunk>).
├── code_cpp_ast_v1.rs [new] - VERSION_LABEL `code-cpp-ast-v1`. Same pattern.
└── lib.rs [edit] - 2 pub use entries
crates/kebab-source-fs/src/media.rs [no edit needed] - code_lang_for_path delegation pattern unchanged (single source of truth since Task C of p10-2).
crates/kebab-parse-code/src/lang.rs [no edit needed] - the `.c`/`.h`/`.cpp` etc. mappings have existed since 1A-1.
crates/kebab-app/src/lib.rs [edit] - add "c" + "cpp" to the allowlist + 4-arm match in ingest_one_code_asset. Also add both to the tier3 fallback list.
crates/kebab-chunk/tests/ [new]
├── fixtures/sample.c - C fixture (top-level fn + struct)
├── fixtures/sample.cpp - C++ fixture (namespace + class + method)
├── code_c_ast_snapshot.rs - C snapshot test
└── code_cpp_ast_snapshot.rs - C++ snapshot test
crates/kebab-app/tests/code_ingest_smoke.rs [edit] - 2 new integration tests (c + cpp). 16 + 2 = 18.
Cargo.toml workspace.dependencies [edit] - tree-sitter-c + tree-sitter-cpp.
crates/kebab-parse-code/Cargo.toml [edit] - new entries for the 2 deps above.
```
## Acceptance criteria
- `cargo test --workspace --no-fail-fast -j 1` PASS (memory-conscious `-j 1`).
- `cargo clippy --workspace --all-targets -- -D warnings` clean.
- C fixture (`tests/fixtures/sample.c`) + C++ fixture (`tests/fixtures/sample.cpp`) ingest → stable chunk snapshots. All chunks in the C snapshot are `Citation::Code { lang: "c", symbol: Some(<fn|struct|enum name>), ... }`. Chunks in the C++ snapshot include namespace + class nesting (`kebab::chunk::Foo::bar`).
- With `.c` / `.cpp` files placed in an isolated TempDir KB, `kebab search --code-lang c --json` / `--code-lang cpp --json` each return `Citation::Code`. Integration tests `tier1_c_ingest_searchable` + `tier1_cpp_ingest_searchable` (existing 16 + 2 = 18).
- `"c"` + `"cpp"` counts appear in `kebab schema --json | jq .stats.code_lang_breakdown` (after ingesting .c/.cpp files).
- README + HANDOFF + docs/ARCHITECTURE + docs/SMOKE + tasks/INDEX + tasks/p10/INDEX updated.
- One line added to the frozen design 2026-04-27 §10 activation log.
- workspace `Cargo.toml` minor bump (0.15.0 → 0.16.0), gitea-release v0.16.0.
## Allowed dependencies
- `kebab-parse-code`: add `tree-sitter-c` + `tree-sitter-cpp` workspace deps. Existing deps are kept.
- Two new modules in `kebab-chunk` (`code_c_ast_v1.rs`, `code_cpp_ast_v1.rs`): language-agnostic body, importing tree-sitter is forbidden. Reuse the existing `tier2_shared::build_chunk` (pub(crate)).
- `kebab-app`, `kebab-source-fs`: no new crate deps.
## Forbidden dependencies
- `kebab-chunk` must not import tree-sitter-c / tree-sitter-cpp directly (boundary §6.3).
- `kebab-parse-code` must not import store / embed / llm / rag directly.
- UI crates (`kebab-cli` / `kebab-mcp` / `kebab-tui`) must not import `kebab-parse-code` / `kebab-chunk` directly; only the `kebab-app` facade.
## Risks / notes
- **tree-sitter-c / tree-sitter-cpp compatibility**: must be compatible with tree-sitter 0.26 (current workspace). Resolving may require a fork that uses the `tree-sitter-language` shim (the tree-sitter-kotlin-ng pattern from 1C-JK); prefer the most actively maintained crate on crates.io. If that fails, consider a separate fork.
- **`.h` parse failure**: when the C parser meets a C++ header (`namespace`, `template`, `class`) it produces a partial parse with error nodes. The 1A-2 extractor pattern ignores error nodes and continues with a recoverable parse, so the emitted result may be *incomplete*. If the chunk count drops to 0, p10-3's Tier 3 fallback picks the file up automatically (already wired). On a partial emit only part of the file is indexed and the Tier 3 fallback does not run. Review in HOTFIXES after dogfood.
- **Method out-of-class definition** (`Class::method` form): restore the prefix when the declarator of tree-sitter-cpp's `function_definition` is a `qualified_identifier`. Verified by fixture.
- **Template specialization** (`template<> class Foo<int>`): only the `class_specifier` name inside tree-sitter-cpp's `template_declaration` is extracted, so only `Foo` goes into the symbol and `<int>` is excluded. Consistent with the design's ignore-generics rule.
- **fn inside an `extern "C"` block**: handled as an ordinary fn. The outer wrapping block is glue.
- **Anonymous union / struct** (`struct { int x; }` inside a variable): rare, and only named ones become units. Anonymous ones are glue.
- **typedef-wrapped struct/enum idiom** (`typedef struct { ... } Foo;`): ✅ resolved in v0.17.0 (2026-05-24) PR-B. The extractor's `type_definition` branch detects an inner anonymous `struct_specifier` / `enum_specifier` / `union_specifier` and emits a synthetic unit named after the declarator's typedef alias. Ships with the `PARSER_VERSION` bump `code-c-v1` → `code-c-v2` and the same-workspace_path orphan purge cascade. **Still unresolved**: the inner anonymous struct of a nested typedef (`typedef struct { struct {...} inner; } Outer;`) is still glue; v2's first-pass scope covers only top-level typedef aliases. See [HOTFIXES.md 2026-05-21 entry](../HOTFIXES.md) (frozen observation) + the 2026-05-24 closure entry.
- **Macro-heavy code** (Linux kernel etc.): even a function-like macro such as `#define FOO(x) ...` is not recognized as a fn by the parser. It is handled as preprocessor glue, so no symbol is captured. Intended behavior (the parser does no macro expansion).
- **`__attribute__((...))`** annotations: tree-sitter-c's attribute node is a sibling next to the declarator. It can be ignored and does not affect function name extraction.
- **Fixture size**: sample.c is ~30 lines (top-level fn + struct + enum + preprocessor), sample.cpp is ~50 lines (nested namespace + class + method + template + free fn). Separate verification of the oversize fallback is already covered by 1A-2's long_section_snapshot pattern (add a separate fixture if needed).
- **Post-merge deviations** go into the dated log in `tasks/HOTFIXES.md`, cross-linked from this spec's `Risks / notes`.


@@ -118,3 +118,4 @@ _ → skip (reserved for the p10-3 fallback)
- **`pom.xml` aggregate parent POM**: very large (hundreds to thousands of lines). Split via the oversize fallback. Verify once with a huge fixture.
- **`media.rs` cleanup**: replace the inline `match extension` duplication accumulated since 1A-1 with calls to `code_lang_for_path`. Existing unit-test behavior is preserved (the tests only look at result values, so they should pass).
- **Post-merge deviations** go in the dated `tasks/HOTFIXES.md` log + a one-line cross-link in this spec's `Risks / notes`.
- **[HOTFIXES 2026-05-21]** multi-resource k8s YAML (2+ documents) failed to ingest due to `chunk_id` collisions: the non-oversize branch of `push_chunks_with_oversize` hardcoded `split_key = None`. Fixed in PR #158 (v0.16.1) with a `base_split_key` parameter. See `tasks/HOTFIXES.md` 2026-05-21 entry.


@@ -0,0 +1,116 @@
# p10-3 — Tier 3 paragraph + line-window fallback chunker
**Status:** 🟡 in progress
**Contract sections:** §3.3 (chunker_version `code-text-paragraph-v1`), §3.5 (code_lang routing: activate `shell` + clarify "unsupported / Tier 3 fallback"), §6.2 (`kebab-chunk/src/code_text_paragraph_v1.rs`), §6.3 (expose `tier2_shared::build_chunk` as `pub(crate)`), §9.3 (Tier 3 definition), §10.1 (one line in the activation log).
**Design:** [2026-05-15-kebab-code-ingest-design.md](../../docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md) §1.3 (Phase 3) + §9.3.
**Plan:** [2026-05-20-p10-3-tier3-paragraph-fallback.md](../../docs/superpowers/plans/2026-05-20-p10-3-tier3-paragraph-fallback.md).
## Goal
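On top of the framework from p10-1A-2 / 1B / 1C / 1A-1 and the p10-2 Tier 2 infrastructure, activate the Tier 3 paragraph fallback chunker. Single PR. From the moment of merge:
- `.sh` / `.bash` / `.zsh` files are indexed at paragraph granularity.
- 0-chunk results from p10-2 (non-k8s YAML, invalid YAML), Tier 1 AST extractor failures, etc. automatically fall back to Tier 3 and get indexed, so files that were previously skipped become searchable.
## Frozen design decisions (finalized by this task)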
### chunker (`code-text-paragraph-v1`)
- **Input**: `Document` with a single `Block::Code { text, lang, ... }`. Same shape as the output of Tier 2's `synthesize_tier2_document`, so the fallback wrapper reuses the same doc.
- **VERSION_LABEL**: `"code-text-paragraph-v1"`.
- **Paragraph splitting**: iterate over `text.lines()`. A blank line (strictly empty or whitespace-only) is a paragraph boundary. The blank line itself belongs to no paragraph (it is excluded from every chunk's line range). Empty paragraphs (all whitespace) are skipped.
- **Paragraph size rules** (design §9.3 defaults as-is, hardcoded):
  - paragraph line count ≤ 80 → emit 1 chunk.
  - paragraph line count > 80 → line-window split with window size 80 / overlap 20 (stride 60), i.e. lines 1-80, 61-140, 121-200, …; the last window runs to the end of the paragraph (EOF for a single-paragraph file) and is ≤ 80 lines.
  - `FALLBACK_LINES_PER_CHUNK = 80` and `FALLBACK_LINES_OVERLAP = 20` are both hardcoded constants (same pattern as 1A-2's `AST_CHUNK_MAX_LINES = 200`: not exposed in user config; exposure to be considered in a future HOTFIXES).
- **Citation**: `SourceSpan::Code { line_start, line_end, symbol: None, lang: <input lang> }`. `symbol = None` uniformly (Tier 3 does not identify semantic units). `lang` preserves the input Document's `Block::Code.lang` as-is: shell → `"shell"`, k8s skip → `"yaml"`, Rust extractor failure → `"rust"`, etc.
- **chunk_id collision avoidance**: when a single paragraph is line-window split, pass `window_start` as the `split_key` of `id_for_chunk` (same as the Tier 2 `#L{k}` pattern).
- **Edge cases**:
  - Entire file is blank lines only → 0 chunks emitted (no fallback of the fallback). `tracing::warn!`.
  - Single paragraph with ≤ 80 lines → 1 chunk, line range 1..N.
  - Huge file with no blank lines (the whole file is one paragraph) → line-window split.
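
The splitting rules above can be sketched roughly as follows. This is a minimal illustration, not the actual `code_text_paragraph_v1.rs`: `LineRange`, `split_paragraphs`, `window_split`, and `chunk_ranges` are hypothetical names, and the sketch only computes 1-based inclusive line ranges. The real chunker is described as building each `Chunk` via `tier2_shared::build_chunk`, which is not reproduced here.

```rust
// Constants named in the spec; the values are the design §9.3 defaults.
const FALLBACK_LINES_PER_CHUNK: usize = 80;
const FALLBACK_LINES_OVERLAP: usize = 20;

/// Hypothetical helper type: a 1-based, inclusive line range.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LineRange {
    pub start: usize,
    pub end: usize,
}

/// Split text into paragraphs. Blank or whitespace-only lines are
/// boundaries and belong to no paragraph.
pub fn split_paragraphs(text: &str) -> Vec<LineRange> {
    let mut out = Vec::new();
    let mut current: Option<LineRange> = None;
    for (idx, line) in text.lines().enumerate() {
        let line_no = idx + 1;
        if line.trim().is_empty() {
            // Boundary: close the open paragraph, if any.
            if let Some(p) = current.take() {
                out.push(p);
            }
        } else {
            // Extend the open paragraph or start a new one.
            current = Some(match current {
                Some(p) => LineRange { start: p.start, end: line_no },
                None => LineRange { start: line_no, end: line_no },
            });
        }
    }
    if let Some(p) = current {
        out.push(p);
    }
    out
}

/// Split one paragraph into windows of 80 lines with a 20-line overlap
/// (stride 60). A paragraph of at most 80 lines yields a single window.
pub fn window_split(p: LineRange) -> Vec<LineRange> {
    let stride = FALLBACK_LINES_PER_CHUNK - FALLBACK_LINES_OVERLAP;
    let mut out = Vec::new();
    let mut start = p.start;
    loop {
        let end = (start + FALLBACK_LINES_PER_CHUNK - 1).min(p.end);
        out.push(LineRange { start, end });
        if end == p.end {
            break;
        }
        start += stride;
    }
    out
}

/// All chunk line ranges for a file, in order.
pub fn chunk_ranges(text: &str) -> Vec<LineRange> {
    split_paragraphs(text)
        .into_iter()
        .flat_map(window_split)
        .collect()
}
```

In this sketch each window's `start` is what would be passed as the `split_key`, so two windows of the same paragraph never share a key. Because `str::lines()` strips a trailing `\r`, CRLF blank lines also count as boundaries.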
### Routing / fallback wrapper
- **`code_lang_for_path`**: unchanged (the shell mapping has existed since 1A-1).
- Add `"shell"` to the **`ingest_one_code_asset` allowlist** (`crates/kebab-app/src/lib.rs:953`).
- Add a `"shell"` arm to the **4-arm match (parser_version / chunker_version / extract / chunks)**:
  - parser_version = `"none-v1"` (reuses the Tier 2 sentinel).
  - chunker_version = `CodeTextParagraphV1Chunker.chunker_version()`.
  - extract = `synthesize_tier2_document(asset, &bytes, "shell", &parser_version)?` (reused).
  - chunks = `CodeTextParagraphV1Chunker.chunk(&canonical, chunk_policy)?`.
- **Fallback wrapper** (the key new logic): post-processing right after the chunks match:
  - If the Tier 1/2 lang result is `Err(_)` or `Ok(empty_vec)`, retry with Tier 3.
  - On retry:
    - swap `chunker_version` to `code-text-paragraph-v1` (for accurate downstream stamping).
    - swap `canonical.parser_version` to `"none-v1"` (Tier 1's `RUST_PARSER_VERSION` etc. would be misleading).
    - run `CodeTextParagraphV1Chunker.chunk(&canonical, chunk_policy)`.
  - The failure reason is logged via `tracing::warn!("tier1/2 emitted 0 chunks or errored for {workspace_path} ({code_lang}); falling back to tier 3")`.
  - If **Tier 3 itself yields 0 chunks or Err**, fail/skip as-is (no fallback of the fallback).
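
The fallback wrapper's control flow might look roughly like the sketch below. Everything here is an assumption made for illustration: `Canonical`, `Chunk`, `Outcome`, and `with_tier3_fallback` are hypothetical stand-ins for the real `kebab-app` types, a `String` error replaces whatever error type the crate uses, and the `tracing::warn!` call is reduced to a comment so the example has no external dependencies.

```rust
const TIER3_CHUNKER_VERSION: &str = "code-text-paragraph-v1";
const TIER3_PARSER_VERSION: &str = "none-v1";

/// Hypothetical, trimmed-down chunk: only the fields the sketch needs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Chunk {
    pub line_start: usize,
    pub line_end: usize,
    pub lang: String,
}

/// Hypothetical stand-in for the canonical document.
pub struct Canonical {
    pub parser_version: String,
    pub lang: String,
}

/// Chunks plus the chunker_version that should be stamped downstream.
pub struct Outcome {
    pub chunks: Vec<Chunk>,
    pub chunker_version: String,
}

/// Keep a non-empty Tier 1/2 result; otherwise retry once with Tier 3.
/// There is deliberately no second retry: a Tier 3 error propagates and
/// an empty Tier 3 result is returned as-is for the caller to skip.
pub fn with_tier3_fallback<F>(
    canonical: &mut Canonical,
    primary_chunker_version: &str,
    primary: Result<Vec<Chunk>, String>,
    tier3: F,
) -> Result<Outcome, String>
where
    F: FnOnce(&Canonical) -> Result<Vec<Chunk>, String>,
{
    match primary {
        Ok(chunks) if !chunks.is_empty() => Ok(Outcome {
            chunks,
            chunker_version: primary_chunker_version.to_string(),
        }),
        _ => {
            // The real code would emit the tracing::warn! message here.
            // Swap the parser version so stored metadata is not misleading.
            canonical.parser_version = TIER3_PARSER_VERSION.to_string();
            let chunks = tier3(&*canonical)?;
            Ok(Outcome {
                chunks,
                chunker_version: TIER3_CHUNKER_VERSION.to_string(),
            })
        }
    }
}
```

Passing the Tier 3 chunker as a closure keeps the sketch independent of the real chunker trait. Note that `canonical.lang` is never touched, which matches the lang preservation policy.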
### `tier2_shared::build_chunk` 노출
- Currently a module-private `fn build_chunk`. Tier 3 calls it to produce identical Chunks (consistent hash / token / policy_hash); only the visibility changes, to `pub(crate) fn build_chunk(...)`. The signature is unchanged.
### Lang preservation policy
- `Citation::Code.lang` of a Tier 3 chunk = the input Document's `Block::Code.lang`, as-is. Spelled out as a table:
| Source | input lang | Tier 3 output lang |
|--------|-----------|----------|
| shell direct | `"shell"` | `"shell"` |
| k8s 0-chunk fallback | `"yaml"` | `"yaml"` |
| Rust AST failure fallback | `"rust"` | `"rust"` |
| manifest 0-chunk (theoretical, almost never occurs) | `"toml"` etc. | preserved |
- At search time, `--code-lang shell` / `--code-lang yaml` etc. also match fallback chunks, so the search filter behaves naturally.
### Non-scope
- **Wiring for unsupported extensions**: `.txt` / `.log` / `.scala` / `.rb` etc. are outside this PR's scope. The mapping in `code_lang_for_path` is unchanged. The Tier 3 chunker itself is built now and will be picked up automatically when a new lang is added to `code_lang_for_path` in the future (1A-2 pattern).
- **Config exposure**: `FALLBACK_LINES_PER_CHUNK` / `FALLBACK_LINES_OVERLAP` are hardcoded. Not exposed in config.toml.
### Frozen design updates
- One line in the §10.1 activation log of `docs/superpowers/specs/2026-05-15-kebab-code-ingest-design.md`.
- One line in the §10 activation log of `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md`.
- The §3.5 wording "unsupported / Tier 3 fallback → null" stays as-is (it is the accurate meaning for this phase: a Tier 3 chunk's lang preserves the input lang, so "null" applies only once unsupported extensions are wired).
## Acceptance criteria
- `cargo test --workspace --no-fail-fast -j 1` PASS (memory-conscious `-j 1`).
- `cargo clippy --workspace --all-targets -- -D warnings` clean.
- 4 new unit tests in `crates/kebab-chunk/tests/code_text_paragraph_v1.rs`:
  - `shell_multi_paragraph_splits_on_blank_lines`: 3-paragraph fixture → 3 chunks, symbol=None, lang=shell, contiguous (exclusive of blank lines).
  - `single_long_paragraph_line_window_split`: 200+ line single paragraph → window split, distinct chunk_ids, expected line ranges (1-80, 61-140, 121-200, …).
  - `empty_file_emits_zero_chunks`: empty text → `Ok(vec![])`.
  - `lang_field_preserved_from_input_doc`: lang=yaml input → emitted chunk lang=yaml.
- 2 new integration tests in `crates/kebab-app/tests/code_ingest_smoke.rs`:
  - `tier3_shell_ingest_searchable`: ingest a `.sh` file → search with `--code-lang shell` → `Citation::Code { symbol: None, lang: "shell" }`, `chunker_version: "code-text-paragraph-v1"`.
  - `tier3_yaml_fallback_picks_up_non_k8s_yaml`: ingest a yaml without apiVersion+kind → fallback triggers → `Citation::Code { symbol: None, lang: "yaml" }`, chunker_version `code-text-paragraph-v1`.
- Existing 12 smoke tests + 2 new = 14 testing surface. (Tier 1 9 + Tier 2 3 + Tier 3 2.)
- `kebab schema --json | jq .stats.code_lang_breakdown` shows a `"shell"` count (after ingesting .sh files). Non-k8s YAML also accumulates into the `"yaml"` count (Tier 2 and Tier 3 share the same lang).
- README + HANDOFF + docs/ARCHITECTURE + docs/SMOKE + tasks/INDEX + tasks/p10/INDEX updated.
- One line each in the frozen design §10.1 + §10 activation logs.
- workspace `Cargo.toml` minor bump (0.14.0 → 0.15.0), gitea-release v0.15.0.
## Allowed dependencies
- New `kebab-chunk` module `code_text_paragraph_v1.rs`: kebab-core + anyhow + tracing. Calls tier2_shared's `build_chunk` (exposed with `pub(crate)` visibility). Does not use tree-sitter / serde_yaml.
- `kebab-app::ingest_one_code_asset`: extend the 4-arm match + allowlist + fallback wrapper. No new crate deps.
- `kebab-parse-code`: no changes (the shell mapping in lang.rs has existed since 1A-1).
- `kebab-source-fs`: no changes (media.rs already delegates to `code_lang_for_path`).
## Forbidden dependencies
- `kebab-chunk` must not import store / embed / llm / rag / tree-sitter directly (boundary §6.3 upheld).
- UI crates (`kebab-cli` / `kebab-mcp` / `kebab-tui` / `kebab-desktop`) must not import `kebab-parse-code` / `kebab-chunk` directly; they go through the `kebab-app` facade only.
## Risks / notes
- **Preventing a fallback infinite loop**: if Tier 3 itself yields 0 chunks or Err, fail/skip as-is; there is no fallback of the fallback. Stated explicitly in the spec.
- **`try_skip_unchanged` consistency under the chunker_version swap**: after a fallback, the stored chunker_version = `code-text-paragraph-v1`. On the next ingest of the same file, the lookup matches on the same chunker_version (skip works). If a Tier 1 chunker starts working in the future (e.g. a tree-sitter grammar fix), the cascade rule causes an incremental cache miss and automatic reprocessing works as expected.
- **Lang preservation vs. fallback semantics**: because a fallback chunk keeps the original lang, the search filter `--code-lang yaml` matches both Tier 2 and Tier 3 chunks. Intended behavior: a user searching for "yaml files" sees all yaml results.
- **Line-window overlap semantics**: 80/20 (stride 60) is the design §9.3 default. The same algorithm applies to a huge paragraph (e.g. single-line minified JSON), but since that is just one line, no split occurs (the threshold is 80 lines). For minified input, one chunk ends up holding very long text; this is an inherent limitation of the paragraph splitting policy. To be reviewed in a future HOTFIXES.
- **Blank-line handling**: lines matching `^\s*$` (whitespace-only) are paragraph boundaries. Edge cases such as tab-only lines and CR-only lines are verified with fixtures.
- **Shell line comments**: a `# comment` line in a shell script is an ordinary line. It does not affect paragraph splitting (it is not blank) and is preserved verbatim inside the chunk.
- **`canonical.parser_version` mutation in the fallback wrapper**: the Document's parser_version is swapped to `"none-v1"` on Tier 3 fallback, so CanonicalDocument must be bound as `mut`. It is already `let mut canonical = match ...`, so mutation is possible. To be verified at the plan stage.
- **Post-merge deviations** go in the dated `tasks/HOTFIXES.md` log + a cross-link in this spec's `Risks / notes`.